* Refactor to create heartbeat sender function
Move the code that's part of the heartbeat task into a separate helper
function.
* Move `Client` initialization down
Keep it closer to where it's actually used, and make it easier to add
new fields to `Client` for the connection and heartbeat tasks.
* Add background task handles to `Client` type
Prepare it to be able to check for panics or errors from the background
tasks.
* Add dummy background tasks to `ClientTestHarness`
Spawn simple timeout tasks as mock connection and heartbeat tasks.
* Fix `PeerSet` tests that use `ClientTestHarness`
Building a `ClientTestHarness` requires a Tokio runtime to be set up, so
the calls were moved into the `async` block.
* Refactor to create `set_task_exited_error`
Make the code reusable for both background tasks.
* Check heartbeat task for errors
Periodically poll it to check if the task has unexpectedly stopped.
* Check if connection background task has stopped
The client service should stop if the connection background task has
exited, because then it's not able to receive any replies.
* Allow aborting mocked `Client` background tasks
Wrap the background tasks in `Abortable`, so that they can be aborted
through the `ClientTestHarness`.
* Test if stopped connection task is detected
Check that stopping the background connection task is something that the
`Client` instance detects and handles correctly.
* Test if stopped heartbeat task is detected
Check that stopping the background heartbeat task is something that the
`Client` instance detects and handles correctly.
* Allow setting custom background tasks
Will be used later to create background tasks that panic.
* Test if `Client` handles panics in connection task
Use a mock background connection task that panics immediately, and check
that the `Client` handles it gracefully.
* Test if `Client` handles panics in heartbeat task
Use a mock background heartbeat task that panics immediately, and check
that the `Client` handles it gracefully.
* Change ticket referenced by `TODO`
The previously linked issue was a broad plan to improve Zebra's shutdown
behavior, while the new issue is more specific, and can be scheduled
sooner.
Co-authored-by: teor <teor@riseup.net>
Co-authored-by: teor <teor@riseup.net>
* Justify that the ErrorSlot Mutex is deadlock-safe
* Document cancellation safety in the async RFC
* Document task starvation in the async RFC
Co-authored-by: Marek <mail@marek.onl>
* Drop peer services if their cancel handles are dropped
* Exit the client task if the heartbeat task exits
* Allow multiple errors on a connection without panicking
* Explain why we don't need to send an error when the request is cancelled
* Document connection fields
* Make sure connections don't hang due to spurious timer or channel usage
* Actually shut down the client when the heartbeat task exits
* Add tests for unready services
* Close all senders to peer when `Client` is dropped
* Return a Client error if the error slot has an error
* Add tests for peer Client service errors
* Make Client drop and error cleanups consistent
* Use a ClientDropped error when the Client struct is dropped
* Test channel and error state in peer Client tests
* Move all Connection cleanup into a single method
* Add tests for Connection
* fix typo in comment
Co-authored-by: Conrado Gouvea <conrado@zfnd.org>
Co-authored-by: Conrado Gouvea <conrado@zfnd.org>
Co-authored-by: Alfredo Garcia <oxarbitrage@gmail.com>
* Use a named CancelHeartbeatTask unit struct for the channel type
* Prefer cancel handles in selects, if both are ready
* Fix message metrics to just show the command name
* Add metrics for internal requests and responses
* Add internal requests and responses to the messages dashboard
* Add a canceled metric, and peer addresses to request and response metrics
* Add a canceled messages graph
* Add connection state metrics for currently open connections
* Fix the connection state graph with new metrics
* Always send an error before dropping pending responses
* Move error detail logging into `fail_with`
* Delete an unused timer future
* Make error strings in metrics less verbose
* Downgrade some error logs to info
* Remove a redundant expect
* Avoid unnecessary allocations for connection state metrics
* Fix missed updates to mempool and block gossip metrics
* Limit the number of outbound connections in the crawler
* Make zebra-network channel bounds depend on config.peerset_initial_target_size
* Bias Zebra towards outbound connections
And turn connection limits into `Config` methods.
* Downgrade some connection logs to debug
* Remove verbose or outdated fields in tracing logs
* Clarify connection limits
Includes:
- `fastmod OUTBOUND_PEER_BIAS_FRACTION OUTBOUND_PEER_BIAS_DENOMINATOR zebra*`
- clarify connection limit documentation
* Clarify inventory channel capacity
* Add zebra_network::initialize tests with limited numbers of peers
* Avoid cooperative async task starvation in the peer crawler and listener
If we don't yield in these loops, they can run for a long time before
tokio forces them to yield.
* Test the crawler with small connection limits
And use the multi-threaded runtime to avoid long hangs.
* Stop using the multi-threaded executor in tests where it's not needed
* Avoid starvation for every connection
Adds yields after inbound successes and initial peer connections.
* Add a crawler peer connection success test
* Add outbound connection limit tests
* Improve outbound tests
* Stop ignoring inbound message errors and handshake timeouts
To avoid hangs, Zebra needs to maintain the following invariants in the
handshake and heartbeat code:
- each handshake should run in a separate spawned task
(not yet implemented)
- every message, error, timeout, and shutdown must update the peer address state
- every await that depends on the network must have a timeout
Once the Connection is created, it should handle timeouts.
But we need to handle timeouts during handshake setup.
* Avoid hangs by adding a timeout to the candidate set update
Also increase the fanout from 1 to 2, to increase address diversity.
But only return permanent errors from `CandidateSet::update`, because
the crawler task exits if `update` returns an error.
Also log Peers response errors in the CandidateSet.
* Use the select macro in the crawler to reduce hangs
The `select` function is biased towards its first argument, risking
starvation.
As a side-benefit, this change also makes the code a lot easier to read
and maintain.
* Split CrawlerAction::Demand into separate actions
This refactor makes the code a bit easier to read, at the cost of
sometimes blocking the crawler on `candidates.next()`.
That's ok, because `next` only has a short (< 100 ms) delay. And we're
just about to spawn a separate task for each handshake.
* Spawn a separate task for each handshake
This change avoids deadlocks by letting each handshake make progress
independently.
* Move the dial task into a separate function
This refactor improves readability.
* Fix buggy future::select function usage
And document the correctness of the new code.
- Add an `ExitClient` transition, used when the internal client channel
is closed or dropped, and there are no more pending requests
- Ignore pending requests after an `ExitClient` transition
- Reject pending requests when the peer has caused an error
(the `Exit` and `ExitRequest` transitions)
- Remove `PeerError::ConnectionDropped`, because it is now handled by
`ExitClient`. (Which is an internal error, not a peer error.)
This cleans up the response processing logic a little bit along the way,
but the overall division of responsibility should be better documented
in a future commit.
This lets us distinguish between cases where the message was unsupported
(e.g., BIP11 messages), and cases where the message was uninterpretable
in context (e.g., unsolicited messages).
When the connection sees the client_rx channel close it knows it will never get
any more requests, and it should terminate. But instead of terminating, it
errored itself, and the method to error itself tries to pull all the
outstanding client requests from the channel in order to fail them before it
shuts down. This results in reading from a closed channel, causing a panic.
Instead we return cleanly rather than failing (since we know there are no
outstanding requests, as the channel is closed).
Bitcoin does this either with `getblocks` (returns up to 500 following block
hashes) or `getheaders` (returns up to 2000 following block headers, not
just hashes). However, Bitcoin headers are much smaller than Zcash
headers, which contain a giant Equihash solution block, and many Zcash
blocks don't have many transactions in them, so the block header is
often similarly sized to the block itself. Because we're
aiming to have a highly parallel network layer, it seems better to use
`getblocks` to implement `FindBlocks` (which is necessarily sequential)
and parallelize the processing of the block downloads.
Attempting to implement requests for block data revealed a problem with
the previous connection logic. Block data is requested by sending a
`getdata` message with hashes of the requested blocks; the peer responds
with a sequence of `block` messages with the blocks themselves.
However, this wasn't possible to handle with the previous connection
logic, which could only convert a single Bitcoin message into a
Response. Instead, we factor out the message handling logic into a
Handler, which can statefully accumulate arbitrary data into a Response
and signal completion. This is still pretty ugly but it does work.
As a side effect, the HeartbeatNonceMismatch error is removed; because
the Handler now tries to process messages until it comes to a Response,
it just ignores mismatched nonces (and will eventually time out).
The previous Mempool and Transaction requests were removed but could be
re-added in a different form later. Also, the `Get` prefixes are
removed from `Request` to tidy the name.
Failure uses a distinct Fail trait rather than the standard library's
Error trait, which causes a lot of interoperability problems with tower
and other Error-using crates. Since failure was created, the standard
library's Error trait was improved, and its conveniences are now
available without the custom Fail trait using `thiserror` (for easy
error derives) and `anyhow` (for a better boxed Error).