From 4561f1d25ba8b077b6adfab3495146500c91835c Mon Sep 17 00:00:00 2001 From: Henry de Valence Date: Fri, 28 Aug 2020 14:19:18 -0700 Subject: [PATCH] rfc: initial inventory tracking (#952) * rfc: initial inventory tracking This just describes the design, not the design alternatives. * rfc: finish inventory tracking rfc Also assign it #3. The async script verification RFC should have had a number assigned before merging but it didn't. I don't want to fix that in this PR because I don't want those changes to block on each other. The fix is to (1) document the RFC flow better and (2) add issue templates for RFCs. * rfc: touch up inventory tracking rfc * rfc: prune inventory entries generationally. Based on a suggestion by @yaahc. * Update book/src/dev/rfcs/0003-inventory-tracking.md Co-authored-by: Jane Lusby --- book/src/SUMMARY.md | 3 +- book/src/dev/rfcs/0003-inventory-tracking.md | 252 +++++++++++++++++++ 2 files changed, 254 insertions(+), 1 deletion(-) create mode 100644 book/src/dev/rfcs/0003-inventory-tracking.md diff --git a/book/src/SUMMARY.md b/book/src/SUMMARY.md index 5d211e35..67f779fc 100644 --- a/book/src/SUMMARY.md +++ b/book/src/SUMMARY.md @@ -10,8 +10,9 @@ - [Contribution Guide](CONTRIBUTING.md) - [Design Overview](dev/overview.md) - [Zebra RFCs](dev/rfcs.md) - - [RFC Template](dev/rfcs/0000-template.md) - [Pipelinable Block Lookup](dev/rfcs/0001-pipelinable-block-lookup.md) + - [Parallel Verification](dev/rfcs/0002-parallel-verification.md) + - [Inventory Tracking](dev/rfcs/0003-inventory-tracking.md) - [Asynchronous Script Verification](dev/rfcs/XXXX-asynchronous-script-verification.md) - [Diagrams](dev/diagrams.md) - [Network Architecture](dev/diagrams/zebra-network.md) diff --git a/book/src/dev/rfcs/0003-inventory-tracking.md b/book/src/dev/rfcs/0003-inventory-tracking.md new file mode 100644 index 00000000..ff7bac4d --- /dev/null +++ b/book/src/dev/rfcs/0003-inventory-tracking.md @@ -0,0 +1,252 @@ +- Feature Name: `inventory_tracking` +- Start Date: 2020-08-25 +- Design PR: [ZcashFoundation/zebra#952](https://github.com/ZcashFoundation/zebra/pull/952) +- Zebra Issue: [ZcashFoundation/zebra#960](https://github.com/ZcashFoundation/zebra/issues/960) + +# Summary +[summary]: #summary + +The Bitcoin network protocol used by Zcash allows nodes to advertise data +(inventory items) for download by other peers. This RFC describes how we track +and use this information. + +# Motivation +[motivation]: #motivation + +In order to participate in the network, we need to be able to fetch new data +that our peers notify us about. Because our network stack abstracts away +individual peer connections, and load-balances over available peers, we need a +way to direct requests for new inventory only to peers that advertised to us +that they have it. + +# Definitions +[definitions]: #definitions + +- Inventory item: either a block or transaction. +- Inventory hash: the hash of an inventory item, represented by the + [`InventoryHash`](https://doc-internal.zebra.zfnd.org/zebra_network/protocol/external/inv/enum.InventoryHash.html) + type. +- Inventory advertisement: a notification from another peer that they have some inventory item. +- Inventory request: a request to another peer for an inventory item. + +# Guide-level explanation +[guide-level-explanation]: #guide-level-explanation + +The Bitcoin network protocol used by Zcash provides a mechanism for nodes to +gossip blockchain data to each other. 
This mechanism is used to distribute (mined) blocks and (unmined)
transactions through the network. Nodes can advertise data available in their
inventory by sending an `inv` message containing the hashes and types of
those data items. After receiving an `inv` message advertising data, a node
can determine whether to download it.

This poses a challenge for our network stack, which goes to some effort to
abstract away details of individual peers and encapsulate all peer connections
behind a single request/response interface representing "the network".
Currently, the peer set tracks readiness of all live peers, reports readiness
if at least one peer is ready, and routes requests across ready peers randomly
using the ["power of two choices"][p2c] algorithm.

However, while this works well for data that is already distributed across the
network (e.g., existing blocks), it will not work well for fetching data
*during* distribution across the network. If a peer informs us of some new
data, and we attempt to download it from a random, unrelated peer, we will
likely fail. Instead, we track recent inventory advertisements, and make a
best-effort attempt to route requests to peers who advertised that inventory.

[p2c]: https://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.pdf

# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

The inventory tracking system has several components:

1. A registration hook that monitors incoming messages for inventory
   advertisements.
2. An inventory registry that tracks inventory presence by peer.
3. Routing logic that uses the inventory registry to appropriately route
   requests.

The first two components have fairly straightforward designs, but the third
involves considerably less obvious choices and tradeoffs.

## Inventory Monitoring

Zebra uses Tokio's codec mechanism to transform a byte-oriented I/O interface
into a `Stream` and `Sink` for incoming and outgoing messages. These are
passed to the peer connection state machine, which is written generically over
any `Stream` and `Sink`. This construction makes it easy to "tap" the sequence
of incoming messages using `.then` and `.with` stream and sink combinators.

We already do this to record Prometheus metrics on message rates as well as to
report message timestamps used for liveness checks and last-seen address book
metadata. The message timestamp mechanism is a good example to copy. The
handshake logic instruments the incoming message stream with a closure that
captures a sender handle for an [mpsc] channel with a large buffer (currently
100 timestamp entries). The receiver handle is owned by a separate task that
shares an `Arc<Mutex<AddressBook>>` with other parts of the application. This
task waits for new timestamp entries, acquires a lock on the address book, and
updates the address book. This ensures that timestamp updates are queued
asynchronously, without lock contention.

Unlike the address book, we don't need to share the inventory data with other
parts of the application, so it can be owned exclusively by the peer set. This
means that no lock is necessary, and the peer set can process advertisements in
its `poll_ready` implementation. This method may be called infrequently, which
could cause the channel to fill. However, because inventory advertisements are
time-limited, in the sense that they're only useful before some item is fully
distributed across the network, it's safe to handle excess entries by dropping
them. This behavior is provided by a [broadcast] (mpmc) channel, which can be
used in place of an mpsc channel.

[mpsc]: https://docs.rs/tokio/0.2.22/tokio/sync/mpsc/index.html
[broadcast]: https://docs.rs/tokio/0.2.22/tokio/sync/broadcast/index.html

An inventory advertisement is an `(InventoryHash, SocketAddr)` pair. The
stream hook should check whether an incoming message is an `inv` message with
only a small number of inventory entries (e.g., 1). If so, it should extract
the hash for each item and send it through the channel. Otherwise, it should
ignore the message contents. Why? Because `inv` messages are also sent in
response to queries, such as when we request subsequent block hashes, and in
that case we want to assume that the inventory is generally available rather
than restricting downloads to a single peer. However, items are usually
gossiped individually (or potentially in small chunks; `zcashd` has an internal
`inv` buffer subject to race conditions), so choosing a small bound such as 1
is likely to work as a heuristic for when we should assume that advertised
inventory is not yet generally available.
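As a concrete illustration, here is a minimal, self-contained sketch of such a
hook, using stand-in `Message` and `InventoryHash` types in place of the
zebra-network definitions and a tokio [broadcast] channel as the transport;
the real hook would be a closure installed on the incoming message stream
during the handshake, as described above:

```rust
use std::net::SocketAddr;
use tokio::sync::broadcast;

// Stand-ins for the zebra-network types used in this sketch; the real types
// live in zebra-network's `protocol::external` module.
#[derive(Clone, Copy, Debug)]
pub enum InventoryHash {
    Block([u8; 32]),
    Tx([u8; 32]),
}

pub enum Message {
    Inv(Vec<InventoryHash>),
    // ... other protocol messages ...
}

/// Inspect one incoming message from `peer_addr` and forward any inventory
/// advertisement to the peer set over the broadcast channel.
pub fn register_inventory(
    msg: &Message,
    peer_addr: SocketAddr,
    inv_tx: &broadcast::Sender<(InventoryHash, SocketAddr)>,
) {
    if let Message::Inv(items) = msg {
        // Only treat very small `inv` messages as advertisements; larger
        // ones are usually responses to our own queries, so their contents
        // are probably already generally available.
        if items.len() == 1 {
            // `send` only fails if there are no receivers, and a lagging
            // receiver simply loses the oldest entries, which is acceptable
            // because stale advertisements are not useful anyway.
            let _ = inv_tx.send((items[0], peer_addr));
        }
    }
}
```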
## Inventory Registry

The peer set's `poll_ready` implementation should extract all available
`(InventoryHash, SocketAddr)` pairs from the channel, and log a warning event
if the receiver is lagging. The channel should be configured with a generous
buffer size (such as 100) so that this is unlikely to happen in normal
circumstances. These pairs should be fed into an `InventoryRegistry` structure
along these lines:

```rust
use std::collections::{HashMap, HashSet};
use std::net::SocketAddr;

#[derive(Default)]
struct InventoryRegistry {
    current: HashMap<InventoryHash, HashSet<SocketAddr>>,
    prev: HashMap<InventoryHash, HashSet<SocketAddr>>,
}

impl InventoryRegistry {
    /// Record that `addr` advertised `item` in the current generation.
    pub fn register(&mut self, item: InventoryHash, addr: SocketAddr) {
        self.current.entry(item).or_default().insert(addr);
    }

    /// Drop the previous generation and age the current one into it.
    pub fn rotate(&mut self) {
        self.prev = std::mem::take(&mut self.current);
    }

    /// Iterate over all peers that advertised `item` in either generation.
    pub fn peers(&self, item: InventoryHash) -> impl Iterator<Item = &SocketAddr> {
        self.prev
            .get(&item)
            .into_iter()
            .chain(self.current.get(&item))
            .flatten()
    }
}
```

This API allows pruning the inventory registry using `rotate`, which
implements generational pruning of registry entries. The peer set should
maintain a `tokio::time::Interval` with some interval parameter, and check in
`poll_ready` whether the interval stream has any items, calling `rotate` for
each one:

```rust
while let Poll::Ready(Some(_)) = timer.poll_next(cx) {
    registry.rotate();
}
```

By rotating for each available item in the interval stream, rather than just
once, we ensure that if the peer set's `poll_ready` is not called for a long
time, `rotate` will be called enough times to correctly flush old entries.

Inventory advertisements live in the registry for up to twice the timer
interval, so the interval should be set to half of the desired lifetime for
inventory advertisements. Setting the timer to 75 seconds, the target block
interval, seems like a reasonable choice.
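To make the generational behavior concrete, here is a small usage sketch (a
hypothetical `pruning_example` function, assuming the registry type above is
in scope and that `InventoryHash` is `Copy`):

```rust
use std::net::SocketAddr;

/// Walk one advertisement through the two-generation lifecycle.
fn pruning_example(hash: InventoryHash, addr: SocketAddr) {
    let mut registry = InventoryRegistry::default();

    registry.register(hash, addr);
    assert_eq!(registry.peers(hash).count(), 1); // live in `current`

    registry.rotate(); // first timer tick: aged into `prev`, still routable
    assert_eq!(registry.peers(hash).count(), 1);

    registry.rotate(); // second timer tick: flushed from the registry
    assert_eq!(registry.peers(hash).count(), 0);
}
```

With a 75 second timer, an advertisement therefore stays routable for up to
150 seconds, i.e., two block intervals.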
## Routing Logic

At this point, the peer set has information on recent inventory advertisements.
However, the `Service` trait only allows `poll_ready` to report readiness based
on the service's data and the type of the request, not the content of the
request. This means that we must report readiness without knowing whether the
request should be routed to a specific peer, and we must handle the case where
`call` gets a request for an item only available at an unready peer.

This RFC suggests the following routing logic. First, check whether the
request fetches data by hash. If so, and the registry's `peers()` iterator
yields any addresses for that hash, iterate over them and route the request to
the first ready peer, if there is one. In all other cases, fall back to p2c
routing. Alternatives are suggested and discussed below.
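As an illustration of that selection step, here is a minimal sketch using
hypothetical stand-ins: a simplified `Request` type, a plain `HashMap` in
place of the registry's `peers()` view, and a `ready_peers` slice standing in
for the peer set's internal readiness state. The real logic would live in the
peer set's `call` implementation.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

// Stand-in types for this sketch only, loosely modeled on zebra-network's
// request and inventory types.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct InventoryHash([u8; 32]);

enum Request {
    BlocksByHash(Vec<InventoryHash>),
    Peers,
    // ... other request variants ...
}

/// Return the first ready peer that advertised one of the requested hashes.
/// `None` means the request is not a hash-based fetch, or no advertiser is
/// currently ready; in either case the caller falls back to p2c routing.
fn preferred_peer(
    req: &Request,
    advertisers: &HashMap<InventoryHash, Vec<SocketAddr>>,
    ready_peers: &[SocketAddr],
) -> Option<SocketAddr> {
    let hashes = match req {
        Request::BlocksByHash(hashes) => hashes,
        _ => return None,
    };
    hashes
        .iter()
        .flat_map(|hash| advertisers.get(hash).into_iter().flatten())
        .find(|&addr| ready_peers.contains(addr))
        .copied()
}
```

If `preferred_peer` returns `None`, nothing is lost relative to the current
behavior: the request is routed by p2c exactly as it is today.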
# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

The rationale is described above. The alternative choices are primarily around
the routing logic.

The `Service` trait does not allow applying backpressure based on the
*content* of a request, only based on the service's internal data (via the
`&mut self` parameter of `Service::poll_ready`) and on the type of the
request (which determines which `impl Service` is used). This means that it
is impossible for us to apply backpressure until a service that can process a
specific inventory request is ready, because until we get the request, we
can't determine which peers might be required to process it.

One way to try to ensure that the peer set would be ready to process a
specific inventory request would be to preemptively "reserve" a peer as soon
as it advertises an inventory item. But this doesn't actually work to ensure
readiness, because a peer could advertise two inventory items, and only be
able to service one request at a time. It also potentially locks up the peer
set, since if there are only a few peers and they all advertise inventory,
the service can't process any other requests. So this approach does not work.

Another alternative would be to do some kind of buffering of inventory
requests that cannot immediately be processed by a peer that advertised that
inventory. There are two basic sub-approaches here.

In the first case, we could maintain an unbounded queue of yet-to-be-processed
inventory requests in the peer set, and every time `poll_ready` is called,
check whether a service that could serve those inventory requests has become
ready, and start processing the request if we can. This would provide the
lowest latency, because we can dispatch the request to the first available
peer. For instance, if peer A advertises inventory I, the peer set gets an
inventory request for I, peer A is busy so the request is queued, and peer B
then advertises inventory I, we could dispatch the queued request to B rather
than waiting for A.

However, it's not clear exactly how we'd implement this, because the
mechanism is driven by calls to `poll_ready`, and those might not happen. So
we'd need some separate task to drive the buffered requests to completion,
but that task could not do so through `poll_ready`, since that method requires
owning the service, and the peer set will be owned by a `Buffer` worker.

In the second case, we could select an unready peer that advertised the
requested inventory, clone it, and move the cloned peer into a task that
would wait for that peer to become ready and then make the request. This is
conceptually much cleaner than the above mechanism, but it has the downside
that we don't dispatch the request to the first ready peer. In the example
above, if we cloned peer A and dispatched the request to it, we'd have to
wait for A to become ready, even if the second peer B advertised the same
inventory just after we dispatched the request to A. However, this is not
presently possible anyway, because the `peer::Client`s that handle requests
are not clonable. They could be made clonable (they send messages to the
connection state machine over an mpsc channel), but we cannot make this
change without also altering our liveness mechanism, which uses bounds on the
time since the last message to determine whether a peer connection is live
and to prevent immediate reconnections to recently disconnected peers.

A final alternative would be to fail inventory requests that we cannot route
to a peer which advertised that inventory. This moves the failure forward in
time, but preemptively fails some cases where the request might succeed --
for instance, if the peer has the inventory but just didn't tell us, or
receives the inventory between when we dispatch the request and when it
receives our message. It seems preferable to try and fail than not to try at
all.

In practice, we're likely to care about the gossip protocol and inventory
fetching once we've already synced close to the chain tip. In this setting,
we're likely to already have peer connections, and we're unlikely to be
saturating our peer set with requests (as we do during initial block sync).
This suggests that the common case is one where we have many idle peers, and
that therefore we are unlikely to have dispatched any recent requests to the
peer that advertised the inventory. So our common case should be one where
all of this analysis is irrelevant.