* fix(security): Randomly drop connections when inbound service is overloaded
* Uses progressively higher drop probabilities
* Replaces Error::Overloaded with Fatal when internal services shut down
* Applies suggestions from code review.
* Quickens initial drop probability decay and updates comment
* Applies suggestions from code review.
* Fixes drop connection probability calc
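These commits tune a calculation roughly like the following; a minimal sketch using the `rand` crate, where the constant values, function names, and exact growth and decay factors are assumptions, not Zebra's actual formula:
```rust
/// Minimal sketch of progressively higher drop probabilities with decay.
/// All names and constants here are illustrative assumptions.
const MIN_DROP_PROBABILITY: f32 = 0.05;
const MAX_DROP_PROBABILITY: f32 = 0.95;

/// Raise the drop probability on each new overload,
/// so sustained overload drops progressively more connections.
fn increase_drop_probability(current: f32) -> f32 {
    (current * 2.0).clamp(MIN_DROP_PROBABILITY, MAX_DROP_PROBABILITY)
}

/// Decay the drop probability while the service stays healthy,
/// so a recovered service quickly stops dropping connections.
fn decay_drop_probability(current: f32) -> f32 {
    let decayed = current / 2.0;
    if decayed < MIN_DROP_PROBABILITY { 0.0 } else { decayed }
}

/// Randomly decide whether to drop this connection.
fn should_drop_connection(drop_probability: f32) -> bool {
    rand::random::<f32>() < drop_probability
}
```
The unit test mentioned below checks properties of this kind of function: the probability stays within its bounds, grows under repeated overloads, and decays back towards zero afterwards.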
* Update connection state metrics for different overload/error outcomes
* Split overload handler into separate methods
* Add unit test for drop probability function properties
* Add respond_error methods to zebra-test to help with type resolution
* Initial test that Overloaded errors cause some continues and some closes
* Tune the number of test runs and test timing
* Fix doctests and replace some confusing example requests
---------
Co-authored-by: arya2 <aryasolhi@gmail.com>
* Close the new connection if Zebra unexpectedly generates a duplicate random nonce
* Add a missing test module comment
* Avoid peer attacks that replay self-connection nonces to manipulate the nonce set (needs tests)
* Add a test that makes sure network self-connections fail
* Log at info level when self-connections fail (this should be rare)
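A minimal sketch of the nonce bookkeeping these commits describe, assuming a shared nonce set behind a mutex (the type and field names are illustrative):
```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

type Nonce = u64;

/// Nonces Zebra sent in its own `version` messages,
/// shared between concurrent handshakes.
#[derive(Clone, Default)]
struct LocalNonces(Arc<Mutex<HashSet<Nonce>>>);

impl LocalNonces {
    /// Register a new outbound nonce. Returns false on a duplicate,
    /// so the caller can close the new connection instead of risking
    /// two connections sharing one nonce.
    fn register(&self, nonce: Nonce) -> bool {
        self.0.lock().unwrap().insert(nonce)
    }

    /// Check a remote `version` nonce against our own nonces.
    /// Removing the nonce on a match stops a malicious peer from
    /// replaying it to make later, unrelated connections look like
    /// self-connections.
    fn is_self_connection(&self, remote_nonce: Nonce) -> bool {
        self.0.lock().unwrap().remove(&remote_nonce)
    }
}
```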
* Just use plain blocks for mutex critical sections
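The "plain blocks" pattern just scopes the lock guard so it drops at the closing brace; a small illustration with assumed types:
```rust
use std::sync::Mutex;

/// Read and report a shared connection error.
/// (`ErrorSlot`-style, with illustrative types.)
fn log_current_error(error_slot: &Mutex<Option<String>>) {
    // Plain block for the critical section: the guard drops at the
    // closing brace, so the lock is never held while logging.
    let current = {
        let guard = error_slot.lock().unwrap();
        guard.clone()
    };

    if let Some(error) = current {
        println!("peer connection error: {error}");
    }
}
```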
* Add a missing space
* Fix the syntax of links in comments
* Fix a mistake in the docs
Co-authored-by: Alfredo Garcia <oxarbitrage@gmail.com>
* Remove unnecessary angle brackets from a link
* Revert the changes for links that serve as references
* Revert "Revert the changes for links that serve as references"
This reverts commit 8b091aa9fab453e7d3559a5d474e0879183b9bfb.
* Remove `<` `>` from links that serve as references
This reverts commit 046ef25620ae1a2140760ae7ea379deecb4b583c.
* Don't use `<` `>` in normal comments
* Don't use `<` `>` for normal comments
* Revert changes for comments starting with `//`
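Taken together, these commits converge on a convention that can be illustrated like this (the item and URLs are placeholders): angle brackets make bare URLs clickable in rustdoc, reference-style links never need them, and plain `//` comments aren't rendered at all:
```rust
/// In doc comments, wrap bare URLs in angle brackets so rustdoc
/// renders them as links: <https://zebra.zfnd.org>
///
/// Reference-style links don't need angle brackets: see the [Zebra book].
///
/// [Zebra book]: https://zebra.zfnd.org
pub struct Example;

// Plain `//` comments aren't rendered by rustdoc,
// so a bare URL is fine here: https://zebra.zfnd.org
```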
* Fix some warnings produced by `cargo doc`
* Fix some rustdoc warnings
* Fix some warnings
* Refactor some changes
* Fix some rustdoc warnings
* Fix some rustdoc warnings
* Resolve various TODOs
Co-authored-by: teor <teor@riseup.net>
Co-authored-by: Alfredo Garcia <oxarbitrage@gmail.com>
Co-authored-by: teor <teor@riseup.net>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* fix(network): split synthetic NotFoundRegistry from message NotFoundResponse
* docs(network): Improve `notfound` message documentation
* refactor(network): Rename MustUseOneshotSender to MustUseClientResponseSender
```sh
fastmod MustUseOneshotSender MustUseClientResponseSender zebra*
```
* docs(network): fix a comment typo
* refactor(network): remove generics from MustUseClientResponseSender
* refactor(network): add an inventory collector to Client, but don't use it yet
* feat(network): register missing peer responses as missing inventory
We register this missing inventory based on peer responses, connection errors, and timeouts.
Inbound message inventory tracking requires peers to send `notfound` messages.
But `zcashd` skips `notfound` for blocks, so we can't rely on peer messages.
This missing inventory tracking works regardless of peer `notfound` messages.
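The registry idea can be sketched like this, with illustrative names and types (the real registry also handles rotation, hash types, and per-peer limits):
```rust
use std::collections::{HashMap, HashSet};
use std::net::SocketAddr;

/// Illustrative stand-in for zebra-network's inventory hash type.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum InventoryHash {
    Block([u8; 32]),
    Tx([u8; 32]),
}

/// Synthetic "not found" registry: inventory that specific peers are
/// known to be missing, learned from `notfound` messages, failed or
/// partial responses, connection errors, and timeouts.
#[derive(Default)]
struct NotFoundRegistry {
    missing: HashMap<InventoryHash, HashSet<SocketAddr>>,
}

impl NotFoundRegistry {
    /// Record that `peer` could not provide `inv`, whatever the reason.
    fn register_missing(&mut self, inv: InventoryHash, peer: SocketAddr) {
        self.missing.entry(inv).or_default().insert(peer);
    }

    /// Skip peers we already know are missing this inventory,
    /// even if they never sent a `notfound` message.
    fn peer_is_missing(&self, inv: &InventoryHash, peer: &SocketAddr) -> bool {
        self.missing
            .get(inv)
            .map_or(false, |peers| peers.contains(peer))
    }
}
```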
* refactor(network): rename ResponseStatus to InventoryResponse
```sh
fastmod ResponseStatus InventoryResponse zebra*
```
* refactor(network): rename InventoryStatus::inner() to to_inner()
* fix(network): remove a redundant runtime.enter() in a test
* doc(network): the exact time used to filter outbound peers doesn't matter
* fix(network): handle block requests slightly more efficiently
* doc(network): fix a typo
* fmt(network): `cargo fmt` after renaming ResponseStatus to InventoryResponse
* doc(test): clarify some test comments
* test(network): test synthetic notfound from connection errors and peer inventory routing
* test(network): improve inbound test diagnostics
* feat(network): add a proptest-impl feature to zebra-network
* feat(network): add a test-only connect_isolated_with_inbound function
* test(network): allow a response on the isolated peer test connection
* test(network): fix failures in test synthetic notfound
* test(network): Simplify SharedPeerError test assertions
* test(network): test synthetic notfound from partially successful requests
* test(network): MissingInventoryCollector ignores local NotFoundRegistry errors
* fix(network): decrease the inventory rotation interval
This stops us waiting 3-4 sync resets (4 minutes) before we retry a missing block.
Now we wait 1-2 sync resets (2 minutes), which is still a reasonable rate limit.
This should speed up syncing near the tip, and on testnet.
* fmt(network): cargo fmt --all
* cleanup(network): remove unnecessary allow(dead_code)
* cleanup(network): stop importing the whole sync module into tests
* doc(network): clarify syncer inventory retry constraint
* doc(network): add a TODO for a fix to ensure API behaviour remains consistent
* doc(network): fix a function doc typo
* doc(network): clarify how we handle peers that don't send `notfound`
* docs(network): clarify a test comment
Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* refactor(network): rename Advertised to Available
```sh
fastmod Advertised Available zebra*
fastmod advertised available zebra*
```
* refactor(network): allow different available and missing types inside an InventoryStatus
And rename it to ResponseStatus.
Split the methods between ResponseStatus and an InventoryStatus alias.
* refactor(network): add a block_hash convenience method to InventoryHash
* test(network): improve failure logs for connection tests
* fix(inbound): move address sanitization into the response future
* feat(network): send notfound when Zebra doesn't have a block or transaction
* doc(network): move module docs to the top of each module
This makes them more likely to get updated when the module changes.
* fix(network): stop sending unsupported missing inventory types to the registry
* test(network): inbound messages are forwarded to the registry
* test(inbound): test Peers requests to the inbound service, directly and via TCP
* test(network): notfound block responses are sent by the inbound service
* test(network): notfound tx responses are sent by the inbound service
* test(network): increase sync test mock service timeout
The code that these tests use hasn't actually changed much,
and they are only failing on some platforms (coverage, macOS).
So it seems like the extra concurrent inbound tests have pushed them
past their time limit.
(Perhaps due to TCP system calls, or extra serialization work.)
* doc(network): fix typo
Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
* test(network): remove unnecessary multi-threaded runtime from tests
This prevents `MockService<zebra_state>` timeouts
in the `sync_block_too_high_extend_tips` test,
at the cost of reducing coverage of different execution orders.
Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
* fix(network): add a send timeout to outbound peer messages
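The shape of this fix is a `tokio::time::timeout` around the bounded channel send; a hedged sketch where the constant and names are assumptions:
```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::timeout;

/// Illustrative timeout; Zebra's actual constant may differ.
const PEER_SEND_TIMEOUT: Duration = Duration::from_secs(10);

/// Send a message to the peer connection task, failing instead of
/// hanging forever if the bounded channel stays full.
async fn send_to_peer<M>(peer_tx: &mpsc::Sender<M>, msg: M) -> Result<(), &'static str> {
    match timeout(PEER_SEND_TIMEOUT, peer_tx.send(msg)).await {
        Ok(Ok(())) => Ok(()),
        Ok(Err(_closed)) => Err("peer connection task exited"),
        Err(_elapsed) => Err("peer send timed out"),
    }
}
```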
* test(network): test peer send and receive timeouts
And the equivalent success cases:
- spawn the run loop with no messages
- spawn the run loop and send and receive a message
* test(network): check for specific error types in the tests
And add an outbound error test that doesn't expect a response.
* test(network): use bounded fake peer connection channels
This lets us actually trigger send timeouts in the tests.
* refactor(network): rename some confusing types and variables
```sh
fastmod peer_inbound_tx peer_tx zebra*
fastmod peer_inbound_rx peer_rx zebra*
fastmod ClientSendTimeout ConnectionSendTimeout zebra*
fastmod ClientReceiveTimeout ConnectionReceiveTimeout zebra*
```
* doc(network test): explain the purpose of each peer connection test vector
* Refactor to create heartbeat sender function
Move the code that's part of the heartbeat task into a separate helper
function.
* Move `Client` initialization down
Keep it closer to where it's actually used, and make it easier to add
new fields to `Client` for the connection and heartbeat tasks.
* Add background task handles to `Client` type
Prepare it to be able to check for panics or errors from the background
tasks.
* Add dummy background tasks to `ClientTestHarness`
Spawn simple timeout tasks as mock connection and heartbeat tasks.
* Fix `PeerSet` tests that use `ClientTestHarness`
Building a `ClientTestHarness` requires a Tokio runtime to be set up, so
the calls were moved into the `async` block.
* Refactor to create `set_task_exited_error`
Make the code reusable for both background tasks.
* Check heartbeat task for errors
Periodically poll it to check if the task has unexpectedly stopped.
* Check if connection background task has stopped
The client service should stop if the connection background task has
exited, because then it's not able to receive any replies.
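One way to sketch the check, assuming the task handles are stored on the `Client` and polled without blocking (the real code folds this into `poll_ready` and the error slot):
```rust
use futures::FutureExt;
use tokio::task::JoinHandle;

/// Check whether a background task has unexpectedly stopped.
/// Returns a description if the task exited or panicked.
/// Callers must stop polling a handle once it reports an exit.
fn check_background_task(handle: &mut JoinHandle<()>) -> Option<&'static str> {
    // `now_or_never` polls exactly once without blocking:
    // `None` means the task is still running, the healthy case.
    match (&mut *handle).now_or_never() {
        None => None,
        Some(Ok(())) => Some("background task exited unexpectedly"),
        Some(Err(_join_error)) => Some("background task panicked or was aborted"),
    }
}
```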
* Allow aborting mocked `Client` background tasks
Wrap the background tasks in `Abortable`, so that they can be aborted
through the `ClientTestHarness`.
* Test if stopped connection task is detected
Check that stopping the background connection task is something that the
`Client` instance detects and handles correctly.
* Test if stopped heartbeat task is detected
Check that stopping the background heartbeat task is something that the
`Client` instance detects and handles correctly.
* Allow setting custom background tasks
Will be used later to create background tasks that panic.
* Test if `Client` handles panics in connection task
Use a mock background connection task that panics immediately, and check
that the `Client` handles it gracefully.
* Test if `Client` handles panics in heartbeat task
Use a mock background heartbeat task that panics immediately, and check
that the `Client` handles it gracefully.
* Change ticket referenced by `TODO`
The previously linked issue was a broad plan to improve Zebra's shutdown
behavior, while the new issue is more specific, and can be scheduled
sooner.
Co-authored-by: teor <teor@riseup.net>
Co-authored-by: teor <teor@riseup.net>
* Justify that the ErrorSlot Mutex is deadlock-safe
* Document cancellation safety in the async RFC
* Document task starvation in the async RFC
Co-authored-by: Marek <mail@marek.onl>
* Drop peer services if their cancel handles are dropped
* Exit the client task if the heartbeat task exits
* Allow multiple errors on a connection without panicking
* Explain why we don't need to send an error when the request is cancelled
* Document connection fields
* Make sure connections don't hang due to spurious timer or channel usage
* Actually shut down the client when the heartbeat task exits
* Add tests for unready services
* Close all senders to peer when `Client` is dropped
* Return a Client error if the error slot has an error
* Add tests for peer Client service errors
* Make Client drop and error cleanups consistent
* Use a ClientDropped error when the Client struct is dropped
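The drop cleanup can be sketched like this, with an illustrative subset of the `Client` fields:
```rust
use std::sync::{Arc, Mutex};
use tokio::sync::mpsc;

/// Illustrative subset of the peer `Client` handle.
struct Client {
    /// Requests to the connection task; dropping it closes the channel.
    server_tx: mpsc::Sender<()>,
    /// Shared slot the background tasks check for a terminal error.
    error_slot: Arc<Mutex<Option<&'static str>>>,
}

impl Drop for Client {
    fn drop(&mut self) {
        // Record why the connection is going away, but keep any
        // earlier, more specific error.
        let mut slot = self.error_slot.lock().unwrap();
        if slot.is_none() {
            *slot = Some("ClientDropped");
        }
        // `server_tx` is dropped along with `self`, closing the
        // channel so the connection task can shut down cleanly.
    }
}
```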
* Test channel and error state in peer Client tests
* Move all Connection cleanup into a single method
* Add tests for Connection
* fix typo in comment
Co-authored-by: Conrado Gouvea <conrado@zfnd.org>
Co-authored-by: Conrado Gouvea <conrado@zfnd.org>
Co-authored-by: Alfredo Garcia <oxarbitrage@gmail.com>
* Use a named CancelHeartbeatTask unit struct for the channel type
* Prefer cancel handles in selects, if both are ready
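With `tokio::select!`, the `biased;` mode makes poll order deterministic, so listing the cancel handle first means it wins whenever both branches are ready; a sketch with assumed names:
```rust
use tokio::sync::oneshot;

/// Run one unit of connection work, preferring cancellation
/// when both futures are ready at once.
async fn run_step(
    cancel_rx: &mut oneshot::Receiver<()>,
    work: impl std::future::Future<Output = ()>,
) -> bool {
    tokio::select! {
        // `biased` polls the branches in order, so the cancel
        // handle is always checked before the work future.
        biased;
        _ = cancel_rx => false, // cancelled: stop the run loop
        _ = work => true,       // work finished: keep running
    }
}
```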
* Fix message metrics to just show the command name
* Add metrics for internal requests and responses
* Add internal requests and responses to the messages dashboard
* Add a canceled metric, and peer addresses to request and response metrics
* Add a canceled messages graph
* Add connection state metrics for currently open connections
* Fix the connection state graph with new metrics
* Always send an error before dropping pending responses
* Move error detail logging into `fail_with`
* Delete an unused timer future
* Make error strings in metrics less verbose
* Downgrade some error logs to info
* Remove a redundant expect
* Avoid unnecessary allocations for connection state metrics
* Fix missed updates to mempool and block gossip metrics
* Limit the number of outbound connections in the crawler
* Make zebra-network channel bounds depend on config.peerset_initial_target_size
* Bias Zebra towards outbound connections
And turn connection limits into `Config` methods.
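The `Config` methods might look roughly like this; the ratios below are illustrative assumptions, not Zebra's actual multipliers:
```rust
/// Illustrative subset of zebra-network's `Config`.
struct Config {
    /// The configured target size for the peer set.
    peerset_initial_target_size: usize,
}

impl Config {
    /// Reserve most slots for outbound peers, which Zebra chose itself,
    /// biasing the peer set away from attacker-initiated inbound
    /// connections. The multipliers here are assumptions.
    fn peerset_outbound_connection_limit(&self) -> usize {
        self.peerset_initial_target_size * 3
    }

    fn peerset_inbound_connection_limit(&self) -> usize {
        self.peerset_initial_target_size
    }

    /// Also used to bound zebra-network's internal channels, so
    /// channel capacity scales with the configured peer set.
    fn peerset_total_connection_limit(&self) -> usize {
        self.peerset_outbound_connection_limit() + self.peerset_inbound_connection_limit()
    }
}
```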
* Downgrade some connection logs to debug
* Remove verbose or outdated fields in tracing logs
* Clarify connection limits
Includes:
- `fastmod OUTBOUND_PEER_BIAS_FRACTION OUTBOUND_PEER_BIAS_DENOMINATOR zebra*`
- clarify connection limit documentation
* Clarify inventory channel capacity
* Add zebra_network::initialize tests with limited numbers of peers
* Avoid cooperative async task starvation in the peer crawler and listener
If we don't yield in these loops, they can run for a long time before
tokio forces them to yield.
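The fix is a cooperative yield on every loop iteration; a minimal sketch for the listener side:
```rust
/// Accept inbound connections forever, yielding on each iteration so
/// this loop cannot starve other tasks on the same runtime.
async fn listen_loop(listener: tokio::net::TcpListener) {
    loop {
        if let Ok((_stream, addr)) = listener.accept().await {
            println!("accepted inbound connection from {addr}");
            // handshake handling elided
        }

        // Without this, a burst of instantly-ready accepts could keep
        // this loop running for a long time before tokio forces a yield.
        tokio::task::yield_now().await;
    }
}
```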
* Test the crawler with small connection limits
And use the multi-threaded runtime to avoid long hangs.
* Stop using the multi-threaded executor in tests where it's not needed
* Avoid starvation for every connection
Adds yields after inbound successes and initial peer connections.
* Add a crawler peer connection success test
* Add outbound connection limit tests
* Improve outbound tests
* Stop ignoring inbound message errors and handshake timeouts
To avoid hangs, Zebra needs to maintain the following invariants in the
handshake and heartbeat code:
- each handshake should run in a separate spawned task
(not yet implemented)
- every message, error, timeout, and shutdown must update the peer address state
- every await that depends on the network must have a timeout
Once the Connection is created, it should handle timeouts.
But we need to handle timeouts during handshake setup.
* Avoid hangs by adding a timeout to the candidate set update
Also increase the fanout from 1 to 2, to increase address diversity.
But only return permanent errors from `CandidateSet::update`, because
the crawler task exits if `update` returns an error.
Also log Peers response errors in the CandidateSet.
* Use the select macro in the crawler to reduce hangs
The `select` function is biased towards its first argument, risking
starvation.
As a side-benefit, this change also makes the code a lot easier to read
and maintain.
* Split CrawlerAction::Demand into separate actions
This refactor makes the code a bit easier to read, at the cost of
sometimes blocking the crawler on `candidates.next()`.
That's ok, because `next` only has a short (< 100 ms) delay. And we're
just about to spawn a separate task for each handshake.
* Spawn a separate task for each handshake
This change avoids deadlocks by letting each handshake make progress
independently.
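A hedged sketch of the pattern, with an illustrative timeout constant and a stand-in handshake future:
```rust
use std::time::Duration;
use tokio::net::TcpStream;
use tokio::task::JoinHandle;

/// Illustrative value; the real constant lives in zebra-network.
const HANDSHAKE_TIMEOUT: Duration = Duration::from_secs(4);

/// Spawn each handshake on its own task, with a timeout, so one slow
/// or stalled peer cannot block the crawler or other handshakes.
fn spawn_handshake(stream: TcpStream) -> JoinHandle<Result<(), &'static str>> {
    tokio::spawn(async move {
        match tokio::time::timeout(HANDSHAKE_TIMEOUT, perform_handshake(stream)).await {
            Ok(result) => result,
            Err(_elapsed) => Err("handshake timed out"),
        }
    })
}

/// Stand-in for the real handshake future.
async fn perform_handshake(_stream: TcpStream) -> Result<(), &'static str> {
    Ok(())
}
```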
* Move the dial task into a separate function
This refactor improves readability.
* Fix buggy future::select function usage
And document the correctness of the new code.
- Add an `ExitClient` transition, used when the internal client channel
is closed or dropped, and there are no more pending requests
- Ignore pending requests after an `ExitClient` transition
- Reject pending requests when the peer has caused an error
(the `Exit` and `ExitRequest` transitions)
- Remove `PeerError::ConnectionDropped`, because it is now handled by
`ExitClient`. (Which is an internal error, not a peer error.)
This cleans up the response processing logic a little bit along the way,
but the overall division of responsibility should be better documented
in a future commit.
This lets us distinguish between cases where the message was unsupported
(e.g., BIP11 messages), and cases where the message was uninterpretable
in context (e.g., unsolicited messages).
When the connection sees the client_rx channel close, it knows it will never get any more requests, and it should terminate. But instead of terminating, it errored itself, and the erroring method tried to pull all the outstanding client requests from the channel in order to fail them before shutting down. This read from a closed channel, causing a panic. Instead, we now return cleanly rather than failing, since the closed channel guarantees there are no outstanding requests.
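In run-loop form, the fix looks roughly like this (names and types are illustrative): a closed request channel is a clean shutdown, not an error:
```rust
use tokio::sync::mpsc;

/// Illustrative fragment of the connection run loop.
async fn run(mut client_rx: mpsc::Receiver<String>) {
    loop {
        match client_rx.recv().await {
            Some(request) => {
                // handle the request (elided)
                let _ = request;
            }
            // The Client was dropped: the closed channel guarantees
            // there are no outstanding requests, so return cleanly
            // instead of taking the failure path, which would try to
            // drain a closed channel and panic.
            None => return,
        }
    }
}
```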
Bitcoin does this either with `getblocks` (returns up to 500 following block
hashes) or `getheaders` (returns up to 2000 following block headers, not
just hashes). However, Bitcoin headers are much smaller than Zcash
headers, which contain a giant Equihash solution block, and many Zcash
blocks don't have many transactions in them, so the block header is
often similarly sized to the block itself. Because we're
aiming to have a highly parallel network layer, it seems better to use
`getblocks` to implement `FindBlocks` (which is necessarily sequential)
and parallelize the processing of the block downloads.
Attempting to implement requests for block data revealed a problem with
the previous connection logic. Block data is requested by sending a
`getdata` message with hashes of the requested blocks; the peer responds
with a sequence of `block` messages with the blocks themselves.
However, this wasn't possible to handle with the previous connection
logic, which could only convert a single Bitcoin message into a
Response. Instead, we factor out the message handling logic into a
Handler, which can statefully accumulate arbitrary data into a Response
and signal completion. This is still pretty ugly but it does work.
As a side effect, the HeartbeatNonceMismatch error is removed; because
the Handler now tries to process messages until it comes to a Response,
it just ignores mismatched nonces (and will eventually time out).
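The Handler idea can be sketched as a small state machine with illustrative types: it consumes peer messages one at a time, accumulates blocks, and switches to a finished state once a full Response is ready:
```rust
use std::collections::HashSet;

/// Illustrative stand-ins for the real message and response types.
type BlockHash = [u8; 32];
struct Block {
    hash: BlockHash,
}

enum Message {
    Block(Block),
    Other,
}

enum Response {
    Blocks(Vec<Block>),
}

/// Stateful handler for an in-flight `getdata` blocks request.
enum Handler {
    BlocksByHash {
        pending: HashSet<BlockHash>,
        blocks: Vec<Block>,
    },
    Finished(Response),
}

impl Handler {
    /// Feed one peer message into the handler. Messages it doesn't
    /// care about (including mismatched heartbeat nonces) are simply
    /// ignored, and the request eventually times out instead.
    fn process_message(self, msg: Message) -> Handler {
        match (self, msg) {
            (Handler::BlocksByHash { mut pending, mut blocks }, Message::Block(block)) => {
                if pending.remove(&block.hash) {
                    blocks.push(block);
                }
                if pending.is_empty() {
                    Handler::Finished(Response::Blocks(blocks))
                } else {
                    Handler::BlocksByHash { pending, blocks }
                }
            }
            (handler, _) => handler,
        }
    }
}
```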
The previous Mempool and Transaction requests were removed but could be
re-added in a different form later. Also, the `Get` prefixes are
removed from `Request` to tidy the name.