Update the peerset buffer size and sync timeout

Also add comments and documentation for network-constrained nodes and for testnet.
2020-09-08 20:04:01 +10:00 · 2a68ef5acb
parent b062a682b0
commit 2a68ef5acb
3 changed files with 42 additions and 9 deletions
--- a/zebra-network/src/config.rs
+++ b/zebra-network/src/config.rs
@@ -26,6 +26,10 @@ pub struct Config {
    pub initial_testnet_peers: HashSet<String>,

    /// The initial target size for the peer set.
+    ///
+    /// If you have a slow network connection, and Zebra is having trouble
+    /// syncing, try reducing the peer set size. You can also reduce the peer
+    /// set size to reduce Zebra's bandwidth usage.
    pub peerset_initial_target_size: usize,

    /// How frequently we attempt to connect to a new peer.
@@ -79,6 +83,20 @@ impl Default for Config {
            initial_mainnet_peers: mainnet_peers,
            initial_testnet_peers: testnet_peers,
            new_peer_interval: Duration::from_secs(60),
+
+            // The default peerset target size should be large enough to ensure
+            // nodes have a reliable set of peers. But it should also be limited
+            // to a reasonable size, to avoid queueing too many in-flight block
+            // downloads. A large queue of in-flight block downloads can choke a
+            // constrained local network connection.
+            //
+            // We assume that Zebra nodes have at least 10 Mbps bandwidth.
+            // Therefore, a maximum-sized block can take up to 2 seconds to
+            // download. So a full default peer set adds up to 100 seconds worth
+            // of blocks to the queue.
+            //
+            // But the peer set for slow nodes is typically much smaller, due to
+            // the handshake RTT timeout.
            peerset_initial_target_size: 50,
        }
    }
--- a/zebra-network/src/constants.rs
+++ b/zebra-network/src/constants.rs
@@ -9,15 +9,24 @@ use zebra_chain::parameters::NetworkUpgrade;

 /// The buffer size for the peer set.
 ///
+/// This should be greater than 1 to avoid sender contention, but also reasonably
+/// small, to avoid queueing too many in-flight block downloads. (A large queue
+/// of in-flight block downloads can choke a constrained local network
+/// connection, or a small peer set on testnet.)
+///
 /// We assume that Zebra nodes have at least 10 Mbps bandwidth. Therefore, a
-/// maximum-sized block will take 2 seconds to download. Based on the current
-/// `BLOCK_DOWNLOAD_TIMEOUT`, this is the largest buffer size we can support.
-pub const PEERSET_BUFFER_SIZE: usize = 10;
+/// maximum-sized block can take up to 2 seconds to download. So the peer set
+/// buffer adds up to 6 seconds worth of blocks to the queue.
+pub const PEERSET_BUFFER_SIZE: usize = 3;

 /// The timeout for requests made to a remote peer.
 pub const REQUEST_TIMEOUT: Duration = Duration::from_secs(20);

 /// The timeout for handshakes when connecting to new peers.
+///
+/// This timeout should remain small, because it helps stop slow peers getting
+/// into the peer set. This is particularly important for network-constrained
+/// nodes, and on testnet.
 pub const HANDSHAKE_TIMEOUT: Duration = Duration::from_secs(4);

 /// We expect to receive a message from a live peer at least once in this time duration.
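
The bandwidth arithmetic that the comments in this commit rely on can be checked with a short standalone sketch. It is not part of the Zebra codebase. The 10 Mbps figure comes from the comments. The 2 MB maximum block size and the names `worst_case_block_download` and `queue_time` are assumptions made for illustration.

```rust
use std::time::Duration;

/// Minimum bandwidth assumed by the comments in this commit: 10 Mbps.
const MIN_BANDWIDTH_BITS_PER_SEC: u64 = 10_000_000;

/// Assumed maximum block size in bytes (2 MB). This value is an assumption
/// for this sketch; the diff does not state it.
const MAX_BLOCK_BYTES: u64 = 2_000_000;

/// Worst-case time to download one maximum-sized block, rounded up to
/// whole seconds. (Hypothetical helper.)
fn worst_case_block_download() -> Duration {
    let bits = MAX_BLOCK_BYTES * 8;
    // Ceiling division, so that 1.6 seconds rounds up to 2 seconds.
    let secs = (bits + MIN_BANDWIDTH_BITS_PER_SEC - 1) / MIN_BANDWIDTH_BITS_PER_SEC;
    Duration::from_secs(secs)
}

/// Worst-case time to drain a queue of in-flight block downloads over a
/// single constrained link. (Hypothetical helper.)
fn queue_time(in_flight_blocks: u32) -> Duration {
    worst_case_block_download() * in_flight_blocks
}

fn main() {
    // Default peer set target size and peer set buffer size from this commit.
    let peerset_initial_target_size: u32 = 50;
    let peerset_buffer_size: u32 = 3;

    println!("one block: {:?}", worst_case_block_download());
    println!("full peer set: {:?}", queue_time(peerset_initial_target_size));
    println!("peer set buffer: {:?}", queue_time(peerset_buffer_size));
}
```

Under these assumptions, a full default peer set accounts for 100 seconds of queued downloads, which matches the new `SYNC_RESTART_TIMEOUT` in `sync.rs`, and the peer set buffer accounts for 6 seconds.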
--- a/zebrad/src/commands/start/sync.rs
+++ b/zebrad/src/commands/start/sync.rs
@@ -76,18 +76,24 @@ const TIPS_RETRY_TIMEOUT: Duration = Duration::from_secs(60);
 ///   - allow pending downloads and verifies to complete or time out.
 ///     Sync restarts don't cancel downloads, so quick restarts can overload
 ///     network-bound nodes with lots of peers, leading to further failures.
-///     (The total number of requests being processed by peers is only
-///     constrained by the number of peers.)
+///     (The total number of requests being processed by peers is the sum of
+///     the number of peers and the peer request buffer size.)
+///
+///     We assume that Zebra nodes have at least 10 Mbps bandwidth. So a
+///     maximum-sized block can take up to 2 seconds to download. Therefore, we
+///     set this timeout to 2 seconds per peer in the default peer set: 100
+///     seconds. (The peer request buffer size is small enough that any buffered
+///     requests will overlap with the post-restart ObtainTips.)
+///
 ///   - allow zcashd peers to process pending requests. If the node only has a
 ///     few peers, we want to clear as much peer state as possible. In
 ///     particular, zcashd sends "next block range" hints, based on zcashd's
 ///     internal model of our sync progress. But we want to discard these hints,
 ///     so they don't get confused with ObtainTips and ExtendTips responses.
 ///
-/// Make sure each sync run can download an entire checkpoint, even on instances
-/// with slow or unreliable networks. This is particularly important on testnet,
-/// which has a small number of slow peers.
-const SYNC_RESTART_TIMEOUT: Duration = Duration::from_secs(60);
+/// This timeout is particularly important on instances with slow or unreliable
+/// networks, and on testnet, which has a small number of slow peers.
+const SYNC_RESTART_TIMEOUT: Duration = Duration::from_secs(100);

 /// Helps work around defects in the bitcoin protocol by checking whether
 /// the returned hashes actually extend a chain tip.