* Simplify state service initialization in test
Use the test helper function to remove redundant code.
* Create `BestTipHeight` helper type
This type abstracts away the calculation of the best tip height based on
the finalized block height and the best non-finalized chain's tip.
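  A rough sketch of what such a helper could look like, assuming a `tokio::sync::watch` channel (tokio 1.x) and a hypothetical `Height` newtype standing in for `zebra_chain::block::Height`:

  ```rust
  use tokio::sync::watch;

  /// Hypothetical newtype standing in for `zebra_chain::block::Height`.
  #[derive(Copy, Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
  pub struct Height(pub u32);

  /// Tracks the best tip height as the maximum of the finalized tip height and
  /// the best non-finalized chain tip height, publishing every change through a
  /// watch channel.
  pub struct BestTipHeight {
      finalized: Option<Height>,
      non_finalized: Option<Height>,
      sender: watch::Sender<Option<Height>>,
      // Keep one receiver alive so updates never fail, even with no subscribers.
      _receiver: watch::Receiver<Option<Height>>,
  }

  impl BestTipHeight {
      /// Returns the helper and the receiver endpoint used to watch the value.
      pub fn new() -> (Self, watch::Receiver<Option<Height>>) {
          let (sender, receiver) = watch::channel(None);
          let this = BestTipHeight {
              finalized: None,
              non_finalized: None,
              sender,
              _receiver: receiver.clone(),
          };
          (this, receiver)
      }

      pub fn set_finalized_height(&mut self, height: Height) {
          self.finalized = Some(height);
          self.update();
      }

      pub fn set_best_non_finalized_height(&mut self, height: Option<Height>) {
          self.non_finalized = height;
          self.update();
      }

      fn update(&mut self) {
          // The best tip is the maximum of the two most recently set values.
          let best = self.finalized.max(self.non_finalized);
          let _ = self.sender.send(best);
      }
  }
  ```

  In this sketch, the receiver returned by `new` plays the role of the receiver endpoint mentioned in the entries below.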
* Add `best_tip_height` field to `StateService`
The receiver endpoint is currently ignored.
* Return receiver endpoint from service constructor
Make it available so that the best tip height can be watched.
* Update finalized height after finalizing blocks
After blocks from the queue are finalized and committed to disk, update
the finalized block height.
* Update best non-finalized height after validation
Update the value of the best non-finalized chain tip block height after
a new block is committed to the non-finalized state.
* Update finalized height after loading from disk
When `FinalizedState` is first created, it loads the state from
persistent storage, and the finalized tip height is updated. Therefore,
the `best_tip_height` must be notified of the initial value.
* Update the finalized height on checkpoint commit
When a checkpointed block is committed, it bypasses the non-finalized
state, so there's an extra place where the finalized height has to be
updated.
* Add `best_tip_height` to `Handshake` service
It can be configured using `Builder::with_best_tip_height`. It's
currently unused, but it will later be used to decide whether a
connection to a remote peer should be rejected based on that peer's
protocol version.
* Require best tip height to init. `zebra_network`
Without it, the handshake service can't properly enforce the minimum
network protocol version from peers. Zebrad obtains the best tip height
endpoint from `zebra_state`, and the test vectors simply use a dummy
endpoint that's fixed at the genesis height.
* Pass `best_tip_height` to proto. ver. negotiation
The protocol version negotiation code will reject connections to peers
if they are using an old protocol version. An old version is determined
based on the current known best chain tip height.
* Handle an optional height in `Version`
Fall back to the genesis height if `None` is specified.
* Reject connections to peers on old proto. versions
Avoid connecting to peers that are on protocol versions that don't
recognize a network update.
* Document why peers on old versions are rejected
Describe why it's a security issue above the check.
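  A rough sketch of the shape of this check, with hypothetical types and purely illustrative height and version numbers (the real thresholds come from the configured network's upgrade activation heights):

  ```rust
  /// Hypothetical stand-ins for the real protocol version and block height types.
  #[derive(Copy, Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
  pub struct Version(pub u32);
  #[derive(Copy, Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
  pub struct Height(pub u32);

  const GENESIS_HEIGHT: Height = Height(0);

  /// The minimum protocol version accepted from a remote peer, given the
  /// current best tip height.
  fn min_remote_version(best_tip_height: Option<Height>) -> Version {
      // Fall back to the genesis height if no best tip is known yet.
      let height = best_tip_height.unwrap_or(GENESIS_HEIGHT);

      // Illustrative placeholder: the real code derives this from the network
      // upgrade that is active (or about to activate) at `height`.
      if height >= Height(1_000_000) {
          Version(170_100)
      } else {
          Version(170_002)
      }
  }

  /// Reject peers whose advertised version can't follow the chain past the
  /// relevant network upgrade.
  fn remote_version_is_acceptable(remote: Version, best_tip_height: Option<Height>) -> bool {
      remote >= min_remote_version(best_tip_height)
  }
  ```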
* Test if `BestTipHeight` starts with `None`
Check if initially there is no best tip height.
* Test if best tip height is max. of latest values
After applying a list of random updates where each one either sets the
finalized height or the non-finalized height, check that the best tip
height is the maximum of the most recently set finalized height and the
most recently set non-finalized height.
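  A simplified, non-random sketch of what these two tests check, reusing the hypothetical `BestTipHeight` and `Height` from the earlier sketch (the real test applies a random sequence of updates instead):

  ```rust
  #[test]
  fn best_tip_is_maximum_of_latest_updates() {
      let (mut best_tip_height, receiver) = BestTipHeight::new();
      // Initially there is no best tip height.
      assert_eq!(*receiver.borrow(), None);

      best_tip_height.set_finalized_height(Height(10));
      assert_eq!(*receiver.borrow(), Some(Height(10)));

      best_tip_height.set_best_non_finalized_height(Some(Height(12)));
      assert_eq!(*receiver.borrow(), Some(Height(12)));

      // A lower non-finalized tip can't pull the best tip below the most
      // recently set finalized height.
      best_tip_height.set_best_non_finalized_height(Some(Height(5)));
      assert_eq!(*receiver.borrow(), Some(Height(10)));
  }
  ```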
* Add `queue_and_commit_finalized` method
A small refactor to make testing easier. The handling of requests for
committing non-finalized and finalized blocks is now more consistent.
* Add `assert_block_can_be_validated` helper
Refactor to move into a separate method some assertions that are done
before a block is validated. This is to allow moving these assertions
more easily to simplify testing.
* Remove redundant PoW block assertion
It's also checked in
`zebra_state::service::check::block_is_contextually_valid`, and it was
getting in the way of tests that received a gossiped block before
finalizing enough blocks.
* Create a test strategy for test vector chain
Splits a chain loaded from the test vectors in two parts, containing the
blocks to finalize and the blocks to keep in the non-finalized state.
* Test that committing blocks updates the best tip height
Create a mock blockchain state, with a chain of finalized blocks and a
chain of non-finalized blocks. Commit all the blocks appropriately, and
verify that the best tip height is updated.
Co-authored-by: teor <teor@riseup.net>
* Support a min protocol version during initial block download
But don't actually use the state height yet.
Also rename some functions and constants.
Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
* Security: stop gossiping failure and attempt times as last_seen times
Previously, Zebra had a single time field for peer addresses, which was
updated every time a peer was attempted, sent a message, or failed.
This is a security issue, because the `last_seen` time should be
"the last time [a peer] connected to that node", so that
"nodes can use the time field to avoid relaying old 'addr' messages".
So Zebra was sending incorrect peer information to other nodes.
As part of this change, we split the `last_seen` time into the
following fields:
- untrusted_last_seen: gossiped from other peers
- last_response: time we got a response from a directly connected peer
- last_attempt: time we attempted to connect to a peer
- last_failure: time a connection with a peer failed
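A sketch of the split, using `std::time::SystemTime` and a hypothetical struct that stands in for the corresponding `MetaAddr` fields:

```rust
use std::time::SystemTime;

/// A sketch of the per-peer timestamps after the split; in the real code
/// these live on `MetaAddr` alongside the address, services, and state.
pub struct PeerTimes {
    /// The last time another peer claimed this peer was seen.
    /// Gossiped and unverifiable, so it is never updated locally.
    pub untrusted_last_seen: Option<SystemTime>,
    /// The last time we received a message on a direct connection.
    pub last_response: Option<SystemTime>,
    /// The last time we attempted an outbound connection to this peer.
    pub last_attempt: Option<SystemTime>,
    /// The last time a connection to this peer failed.
    pub last_failure: Option<SystemTime>,
}

impl PeerTimes {
    /// The `last_seen` time that is safe to gossip to other peers: based only
    /// on the gossiped time and actual responses, never on our own attempt or
    /// failure times (this sketch simply takes the later of the two).
    pub fn last_seen_for_gossip(&self) -> Option<SystemTime> {
        self.untrusted_last_seen.max(self.last_response)
    }
}
```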
* Implement Arbitrary and strategies for MetaAddrChange
Also replace the MetaAddr Arbitrary impl with a derive.
* Write proptests for MetaAddr and MetaAddrChange
MetaAddr:
- the only times that get included in serialized MetaAddrs are
the untrusted last seen and responded times
MetaAddrChange:
- the untrusted last seen time is never updated
- the services are only updated if there has been a handshake
Add canonical addresses from inbound connections to the address book,
so that Zebra can use them for reconnection attempts.
Use the newly added `NeverAttemptedAlternate` state for these addresses,
so we try gossiped addresses first, then canonical addresses. This avoids
duplicate connections to inbound peers.
* Instrument the crawl task
When we created the crawl task, we forgot to instrument it with the
global span. This fix makes sure that the git and network spans appear in
crawl logs.
* Instrument the connector
* Improve handshake instrumentation
Make some spans debug, so there are not too many spans.
* Add the address to initial peer connection errors
- stop putting inbound addresses in the address book
- drop address book entries that can't be used for outbound connections
- distinguish between temporary inbound and permanent outbound peer
addresses
- also create variants to handle proxy connections
(but don't use them yet)
- avoid tracking connection state for isolated connections
- document security constraints for the address book and peer set
* Stop ignoring inbound message errors and handshake timeouts
To avoid hangs, Zebra needs to maintain the following invariants in the
handshake and heartbeat code:
- each handshake should run in a separate spawned task
(not yet implemented)
- every message, error, timeout, and shutdown must update the peer address state
- every await that depends on the network must have a timeout
Once the Connection is created, it should handle timeouts.
But we need to handle timeouts during handshake setup.
* Avoid hangs by adding a timeout to the candidate set update
Also increase the fanout from 1 to 2, to increase address diversity.
But only return permanent errors from `CandidateSet::update`, because
the crawler task exits if `update` returns an error.
Also log Peers response errors in the CandidateSet.
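A minimal sketch of the timeout wrapper, assuming tokio's `timeout` and a hypothetical `update_candidates` stand-in (the real timeout value and error handling differ):

```rust
use std::time::Duration;
use tokio::time::timeout;

/// Illustrative only: the real constant and error handling differ.
const CANDIDATE_SET_UPDATE_TIMEOUT: Duration = Duration::from_secs(30);

/// Placeholder for sending `Request::Peers` to a fanout of ready peers and
/// adding the gossiped addresses to the candidate set.
async fn update_candidates() -> Result<(), &'static str> {
    Ok(())
}

/// Bound the candidate set update, so an unresponsive peer can't hang the
/// crawler task forever.
async fn update_with_timeout() -> Result<(), &'static str> {
    match timeout(CANDIDATE_SET_UPDATE_TIMEOUT, update_candidates()).await {
        // The update finished in time: pass its result through.
        Ok(result) => result,
        // The update timed out: log it and keep the crawler running.
        Err(_elapsed) => Ok(()),
    }
}
```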
* Use the select macro in the crawler to reduce hangs
The `select` function is biased towards its first argument, risking
starvation.
As a side-benefit, this change also makes the code a lot easier to read
and maintain.
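A sketch of the resulting loop shape, assuming tokio's `select!` macro and hypothetical channels (the real crawler's branches and actions differ):

```rust
use std::time::Duration;
use tokio::{sync::mpsc, time::interval};

/// A sketch of the crawler loop shape; channels, actions, and intervals are
/// all illustrative.
async fn crawl_loop(mut demand_rx: mpsc::Receiver<()>, mut handshakes_rx: mpsc::Receiver<()>) {
    let mut crawl_timer = interval(Duration::from_secs(60));

    loop {
        tokio::select! {
            // The macro polls its branches fairly, so no single event source
            // can starve the others, unlike repeatedly selecting the same
            // future first.
            _ = crawl_timer.tick() => {
                // Ask peers for more candidate addresses.
            }
            Some(()) = demand_rx.recv() => {
                // Demand signal: try to dial a new candidate peer.
            }
            Some(()) = handshakes_rx.recv() => {
                // A spawned handshake finished: record its outcome.
            }
        }
    }
}
```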
* Split CrawlerAction::Demand into separate actions
This refactor makes the code a bit easier to read, at the cost of
sometimes blocking the crawler on `candidates.next()`.
That's ok, because `next` only has a short (< 100 ms) delay. And we're
just about to spawn a separate task for each handshake.
* Spawn a separate task for each handshake
This change avoids deadlocks by letting each handshake make progress
independently.
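A minimal sketch of the pattern, with a hypothetical `handshake` future and an illustrative timeout:

```rust
use std::{net::SocketAddr, time::Duration};
use tokio::time::{sleep, timeout};

/// Hypothetical handshake future used only for illustration.
async fn handshake(addr: SocketAddr) -> Result<(), &'static str> {
    let _ = addr;
    sleep(Duration::from_millis(10)).await; // placeholder for version negotiation
    Ok(())
}

/// Run each handshake on its own task, so a slow or stalled peer can't block
/// the crawler or any other in-progress handshake. The timeout keeps the
/// "every network await is bounded" invariant.
fn spawn_handshake(addr: SocketAddr) -> tokio::task::JoinHandle<Result<(), &'static str>> {
    tokio::spawn(async move {
        timeout(Duration::from_secs(4), handshake(addr))
            .await
            .map_err(|_elapsed| "handshake timed out")?
    })
}
```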
* Move the dial task into a separate function
This refactor improves readability.
* Fix buggy future::select function usage
And document the correctness of the new code.
- Add an `ExitClient` transition, used when the internal client channel
is closed or dropped, and there are no more pending requests
- Ignore pending requests after an `ExitClient` transition
- Reject pending requests when the peer has caused an error
(the `Exit` and `ExitRequest` transitions)
- Remove `PeerError::ConnectionDropped`, because it is now handled by
`ExitClient`. (Which is an internal error, not a peer error.)
Design:
- Add a `PeerAddrState` to each `MetaAddr`
- Use a single peer set for all peers, regardless of state
- Implement time-based liveness as an `AddressBook` method, rather than
a `PeerAddrState` variant
- Delete `AddressBook.by_state`
Implementation:
- Simplify `AddressBook` changes using `update` and `take` modifier
methods
- Simplify the `AddressBook` iterator implementation, replacing it with
methods that are more obviously correct
- Consistently collect peer set metrics
Documentation:
- Expand and update the peer set documentation
We can optimise later, but for now we want simple code that is more
obviously correct.
We can't rule out the connection state changing between the state checks
and any eventual failures, particularly in the presence of async code.
So we turn this panic into a warning.
zebra-network's Connection expects that `fail_with` is only called once
per connection, but the overload handling code continues to process the
current request after an overload error, potentially leading to further
failures.
Closes #1599
The `peer::Client` translates `Request`s into `ClientRequest`s, which
it sends to a background task. If the send is `Ok(())`, it will assume
that it is safe to unconditionally poll the `Receiver` tied to the
`Sender` used to create the `ClientRequest`.
We enforce this invariant via the type system, by converting
`ClientRequest`s to `InProgressClientRequest`s when they are received by
the background task. These conversions are implemented by
`ClientRequestReceiver`.
Changes:
* Revert `ClientRequest` so it uses a `oneshot::Sender`
* Add `InProgressClientRequest`, which is the same as `ClientRequest`,
but has a `MustUseOneshotSender`
* `impl From<ClientRequest> for InProgressClientRequest`
* Add a new `ClientRequestReceiver` type that wraps a
`mpsc::Receiver<ClientRequest>`
* `impl Stream<InProgressClientRequest> for ClientRequestReceiver`,
converting the successful result of `inner.poll_next_unpin` into an
`InProgressClientRequest`
* Replace `client_rx: mpsc::Receiver<ClientRequest>` in `Connection`
with the new `ClientRequestReceiver` type
* `impl From<mpsc::Receiver<ClientRequest>> for ClientRequestReceiver`
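A simplified sketch of the conversion and the `Stream` wrapper, using placeholder request types (the real types also carry the `MustUseOneshotSender` described below):

```rust
use futures::{
    channel::mpsc,
    stream::{Stream, StreamExt},
};
use std::{
    pin::Pin,
    task::{Context, Poll},
};

/// Placeholder for the request type sent by `peer::Client`.
pub struct ClientRequest;

/// The same request after it has crossed into the background task.
pub struct InProgressClientRequest;

impl From<ClientRequest> for InProgressClientRequest {
    fn from(_request: ClientRequest) -> Self {
        // In the real code this also wraps the response sender in a
        // `MustUseOneshotSender`; here it is just a marker conversion.
        InProgressClientRequest
    }
}

/// Wraps the client channel so everything it yields has already been
/// converted into an `InProgressClientRequest`.
pub struct ClientRequestReceiver {
    inner: mpsc::Receiver<ClientRequest>,
}

impl From<mpsc::Receiver<ClientRequest>> for ClientRequestReceiver {
    fn from(inner: mpsc::Receiver<ClientRequest>) -> Self {
        ClientRequestReceiver { inner }
    }
}

impl Stream for ClientRequestReceiver {
    type Item = InProgressClientRequest;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        // Convert each request as it crosses the task boundary.
        self.inner
            .poll_next_unpin(cx)
            .map(|maybe_request| maybe_request.map(Into::into))
    }
}
```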
This fix also changes heartbeat behaviour in the following ways:
* if the queue is full, the connection is closed. Previously, the sender
would wait until the queue had emptied
* if the queue flush fails, Zebra panics, because it can't send an error
on the ClientRequest sender, so the invariant is broken
Add a MustUseOneshotSender, which panics if its inner sender is unused.
Callers must call `send()` on the MustUseOneshotSender, or ensure that
the sender is canceled.
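A minimal sketch of such a sender, assuming the `futures` oneshot channel:

```rust
use futures::channel::oneshot;

/// A oneshot sender that panics if it is dropped unused while its receiver
/// is still waiting.
pub struct MustUseOneshotSender<T> {
    inner: Option<oneshot::Sender<T>>,
}

impl<T> MustUseOneshotSender<T> {
    pub fn new(sender: oneshot::Sender<T>) -> Self {
        MustUseOneshotSender { inner: Some(sender) }
    }

    /// Sends `value`, consuming the sender and satisfying the use requirement.
    pub fn send(mut self, value: T) -> Result<(), T> {
        self.inner
            .take()
            .expect("inner sender is only taken once")
            .send(value)
    }
}

impl<T> Drop for MustUseOneshotSender<T> {
    fn drop(&mut self) {
        if let Some(sender) = &self.inner {
            // An unsent sender is only acceptable if the receiver has already
            // hung up; otherwise crash loudly instead of hanging a caller.
            assert!(
                sender.is_canceled(),
                "MustUseOneshotSender dropped without sending a response"
            );
        }
    }
}
```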
Replaces an unreliable panic in `Client::call()` with a reliable panic
when a must-use sender is dropped.
Previously, tx would be dropped before send if:
- the success case would have used tx to wait for further messages,
- but the response was actually an error.
Instead, send the error on `tx` and call `fail_with()` using the same
error.
To support this change, allow `fail_with()` to take a `PeerError` or a
`SharedPeerError`.
* Rewrite GetData handling to match the zcashd implementation
`zcashd` silently ignores missing blocks, but sends found transactions
followed by a `NotFound` message:
e7b425298f/src/main.cpp (L5497)
This is significantly different to the behaviour expected by the old
Zebra connection state machine, which expected `NotFound` for blocks.
Also change Zebra's GetData responses to peer requests so they ignore
missing blocks.
* Stop hanging on incomplete transaction or block responses
Instead, if the peer sends an unexpected block, unexpected transaction,
or NotFound message:
1. end the request, and return a partial response containing any items
that were successfully received
2. if none of the expected blocks or transactions were received, return
an error, and close the connection
The cancellation implementation changes made to the connection state machine
mean that if a response oneshot is dropped, the connection will avoid
cancelling the request. So the heartbeat task does have to wait on the response.
This makes the span data more compact (e.g., `msg_as_req{msg=block}`) and
restores the Debug impl for Message to show all of the data contained in the
message. The full message is added as a single event at trace level in the
span to preserve the previous full-inspectability.
As we approach our alpha release we've decided we want to plan ahead for the user bug reports we will eventually receive. One of the bigger issues we foresee is determining exactly what version of the software users are running, and particularly how easy it may or may not be for users to accidentally discard this information when reporting bugs.
To defend against this, we've decided to include the exact git sha for any given build in the compiled artifact. This information will then be re-exported as a span early in the application startup process, so that all logs and error messages should include the sha as their very first span. We've also added this sha as issue metadata for `color-eyre`'s github issue url auto generation feature, which should make sure that the sha is easily available in bug reports we receive, even in the absence of logs.
Co-authored-by: teor <teor@riseup.net>
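One hand-rolled way to capture the sha at compile time (a sketch only; an off-the-shelf crate such as `vergen` can also provide this):

```rust
// build.rs
use std::process::Command;

fn main() {
    // Ask git for the current commit sha; fall back to a placeholder when
    // building outside a git checkout (e.g. from a source tarball).
    let sha = Command::new("git")
        .args(["rev-parse", "HEAD"])
        .output()
        .ok()
        .and_then(|output| String::from_utf8(output.stdout).ok())
        .map(|sha| sha.trim().to_string())
        .unwrap_or_else(|| "unknown".to_string());

    // Exposed to the crate as `env!("GIT_SHA")`, which can then be attached
    // to an early tracing span and to the bug-report issue URL metadata.
    println!("cargo:rustc-env=GIT_SHA={}", sha);
}
```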
This change is mostly mechanical, with the exception of the changes to the
`tower-batch` middleware. This middleware was adapted from `tower::buffer`,
and the `tower::buffer` code was changed to implement its own bounded queue,
because Tokio 0.3 removed the `mpsc::Sender::poll_send` method. See
ddc64e8d4d
for more context on the Tower changes. To match Tower as closely as possible
in order to be able to upstream `tower-batch`, those changes are copied from
`tower::Buffer` to `tower-batch`.
These messages might be unsolicited, or they might be a response to a
request we already canceled. So don't fail the whole connection, just
drop the message and move on.
We handle request cancellation in two places: before we transition into
the AwaitingResponse state, and while we are in AwaitingResponse. We
need both places, or else if we started processing a request, we
wouldn't process the cancellation until the timeout elapsed.
The first is a check that the oneshot is not already canceled.
For the second, we wait on a cancellation, either from a timeout or from
the tx channel closing.
This cleans up the response processing logic a little bit along the way,
but the overall division of responsibility should be better documented
in a future commit.
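A simplified sketch of the second case, using tokio channels and illustrative types (the real state machine also handles partial responses and other peer messages):

```rust
use std::time::Duration;
use tokio::{
    sync::{mpsc, oneshot},
    time::sleep,
};

/// Wait for the peer's response while also honouring cancellation and a
/// request timeout; all names and types here are illustrative only.
async fn await_response(
    mut tx: oneshot::Sender<Result<String, String>>,
    mut peer_rx: mpsc::Receiver<String>,
) {
    // First check: don't even enter AwaitingResponse for a canceled request.
    if tx.is_closed() {
        return;
    }

    tokio::select! {
        // The client dropped its receiver: the request was canceled.
        _ = tx.closed() => {
            // Drop the request without sending a response.
        }
        // The request timed out before the peer answered.
        _ = sleep(Duration::from_secs(20)) => {
            let _ = tx.send(Err("request timed out".to_string()));
        }
        // The peer answered: forward the response to the client.
        Some(message) = peer_rx.recv() => {
            let _ = tx.send(Ok(message));
        }
    }
}
```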
This lets us distinguish between cases where the message was unsupported
(e.g., BIP11 messages), and cases where the message was uninterpretable
in context (e.g., unsolicited messages).
This commit makes several related changes to the network code:
- adds a `TransactionsByHash(HashSet<transaction::Hash>)` request and
`Transactions(Vec<Arc<Transaction>>)` response pair that allows
fetching transactions from a remote peer;
- adds a `PushTransaction(Arc<Transaction>)` request that pushes an
unsolicited transaction to a remote peer;
- adds an `AdvertiseTransactions(HashSet<transaction::Hash>)` request
that advertises transactions by hash to a remote peer;
- adds an `AdvertiseBlock(block::Hash)` request that advertises a block
by hash to a remote peer;
Then, it modifies the connection state machine so that outbound
requests to remote peers are handled properly:
- `TransactionsByHash` generates a `getdata` message and collects the
results, like the existing `BlocksByHash` request.
- `PushTransaction` generates a `tx` message, and returns `Nil` immediately.
- `AdvertiseTransactions` and `AdvertiseBlock` generate an `inv`
message, and return `Nil` immediately.
Next, it modifies the connection state machine so that messages
from remote peers generate requests to the inbound service:
- `getdata` messages generate `BlocksByHash` or `TransactionsByHash`
requests, depending on the content of the message;
- `tx` messages generate `PushTransaction` requests;
- `inv` messages generate `AdvertiseBlock` or `AdvertiseTransactions`
requests.
Finally, it refactors the request routing logic for the peer set to
handle advertisement messages, providing three routing methods:
- `route_p2c`, which uses p2c as normal (default);
- `route_inv`, which uses the inventory registry and falls back to p2c
(used for `BlocksByHash` or `TransactionsByHash`);
- `route_all`, which broadcasts a request to all ready peers (used for
`AdvertiseBlock` and `AdvertiseTransactions`).
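A simplified sketch of the request, response, and routing shapes described above, with stand-in types for `block::Hash`, `transaction::Hash`, `Block`, and `Transaction`:

```rust
use std::{collections::HashSet, sync::Arc};

// Hypothetical stand-ins for the real chain types.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct BlockHash(pub [u8; 32]);
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct TransactionHash(pub [u8; 32]);
pub struct Block;
pub struct Transaction;

/// A simplified sketch of the request and response variants described above.
pub enum Request {
    BlocksByHash(HashSet<BlockHash>),
    TransactionsByHash(HashSet<TransactionHash>),
    PushTransaction(Arc<Transaction>),
    AdvertiseTransactions(HashSet<TransactionHash>),
    AdvertiseBlock(BlockHash),
}

pub enum Response {
    Nil,
    Blocks(Vec<Arc<Block>>),
    Transactions(Vec<Arc<Transaction>>),
}

/// How the peer set picks peers for each request, per the routing methods
/// described above.
pub enum Route {
    /// Power-of-two-choices load balancing (the default).
    P2c,
    /// Inventory registry first, falling back to p2c.
    Inv,
    /// Broadcast to all ready peers.
    All,
}

pub fn route(request: &Request) -> Route {
    match request {
        Request::AdvertiseBlock(_) | Request::AdvertiseTransactions(_) => Route::All,
        Request::BlocksByHash(_) | Request::TransactionsByHash(_) => Route::Inv,
        _ => Route::P2c,
    }
}
```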