Commit Graph

177 Commits

Author SHA1 Message Date
teor 4b8b65a627
Avoid spurious acceptance test failures by decreasing the peer crawler timeout (#2905)
* Improve logging for initial peer connections

* Decrease the initial peer crawl timeout to make tests more reliable

Co-authored-by: Conrado Gouvea <conrado@zfnd.org>
2021-10-19 15:29:03 +00:00
teor c8ad19080a
Improve logging for initial peer connections (#2896)
Co-authored-by: Alfredo Garcia <oxarbitrage@gmail.com>
2021-10-18 18:43:12 +00:00
Alfredo Garcia 4280ef5003
Give more information to the user in the wrong port init warning (#2853)
* Update initialize.rs

* grammar

Co-authored-by: teor <teor@riseup.net>

Co-authored-by: teor <teor@riseup.net>
2021-10-12 01:13:13 +00:00
Janito Vaqueiro Ferreira Filho b714b2b3b6
Create a helper `MockService` type to help with writing tests that use mock `tower::Service`s (#2748)
* Implement initial service mocking helpers

Adds a [`MockService`] type, which can be configured and built for usage
in unit tests or proptests. The mocked service can then be used to
intercept requests and respond indivdiually to them.

* Use `MockService in the `mempool::Crawler` test

Refactor it to remove the helper mock function, and use the new
`MockService` helper type.

* Use `MockService` in `CandidateSet` test vectors

Refactor to remove the manual mocking of the peer set service.

* Panic if a response is not sent by `MockService`

Change the current semantics to require all `MockService` usages to
respond to every intercepted request.

A `must_use` attribute was added to the `ResponseSender` so that the
compiler can warn when this doesn't happen.

* Allow generic error types in `MockService`

Replace the hard-coded `BoxError` as the `Service`'s error type with a
generic type parameter. This allows mocking services in locations that
require specific error types.

* Add a `ResponseSender::request` getter

Allow inspecting the request again before responding, and using
information from the request in the response.

Co-authored-by: Conrado Gouvea <conrado@zfnd.org>
2021-09-21 17:44:59 +00:00
teor b6fe816473
Add a `ChainTipChange` type to `await` chain tip changes (#2715)
* Rename ChainTipReceiver to CurrentChainTip

`fastmod ChainTipReceiver CurrentChainTip zebra*`

* Update chain tip documentation and variable names

* Basic chain tip change implementation, without resets

Also includes the following name changes:
```
fastmod CurrentChainTip LatestChainTip zebra*
fastmod chain_tip_receiver latest_chain_tip zebra*
```

* Clarify the difference between `LatestChainTip` and `ChainTipChange`
2021-09-01 22:31:16 +00:00
teor d2e14b22f9
Refactor BestTipHeight into a generic ChainTip sender and receiver (#2676)
* Rename BestTipHeight so it can be generalised to ChainTipSender

`fastmod BestTipHeight ChainTipSender zebra*`

For senders:
`fastmod best_tip_height chain_tip_sender zebra*`

For receivers:
`fastmod best_tip_height chain_tip_receiver zebra*`

* Rename best_tip_height module to chain_tip

* Wrap the chain tip watch channel in a ChainTipReceiver type

* Create a ChainTip trait to avoid tricky crate dependencies

And add convenience impls for optional and empty chain tips.

* Use the ChainTip trait in zebra-network

* Replace `Option<ChainTip>` with `NoChainTip`

Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>

Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
2021-08-27 11:34:33 +10:00
teor c608260256
Support witnessed transaction IDs in zebra-network requests and responses (#2638)
* Rename internal network requests for wide transaction IDs

fastmod TransactionsByHash TransactionsById zebra*
fastmod AdvertiseTransactions AdvertiseTransactionIds zebra*
fastmod MempoolTransactions MempoolTransactionIds zebra*
fastmod TransactionHashes TransactionIds zebra*

* Update network transaction request/response comments

* Rename a transaction hash method for wide transaction IDs

fastmod transaction_hashes transaction_ids zebra-network

* Add UnminedTxId methods and conversions for InventoryHash

* Map WtxIds to unmined transaction network messages

Also, use UnminedTxId and UnminedTx in:
* Zebra's internal request and response format, and
* external Zcash network protocol messages.

* Enable WtxId mempool inventory tracking for peers

* Further clarify transaction IDs

* Use Witnessed rather than Wide for transaction IDs

And rename narrow to legacy when it only applies to v1-v4 transactions.
Otherwise, rename it to mined ID.

* Rename a missed binding
* Remove an incorrectly named binding

Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
2021-08-18 22:55:24 +00:00
Janito Vaqueiro Ferreira Filho 4c4dbfe7cd
Reject connections from outdated peers (#2519)
* Simplify state service initialization in test

Use the test helper function to remove redundant code.

* Create `BestTipHeight` helper type

This type abstracts away the calculation of the best tip height based on
the finalized block height and the best non-finalized chain's tip.

* Add `best_tip_height` field to `StateService`

The receiver endpoint is currently ignored.

* Return receiver endpoint from service constructor

Make it available so that the best tip height can be watched.

* Update finalized height after finalizing blocks

After blocks from the queue are finalized and committed to disk, update
the finalized block height.

* Update best non-finalized height after validation

Update the value of the best non-finalized chain tip block height after
a new block is committed to the non-finalized state.

* Update finalized height after loading from disk

When `FinalizedState` is first created, it loads the state from
persistent storage, and the finalized tip height is updated. Therefore,
the `best_tip_height` must be notified of the initial value.

* Update the finalized height on checkpoint commit

When a checkpointed block is commited, it bypasses the non-finalized
state, so there's an extra place where the finalized height has to be
updated.

* Add `best_tip_height` to `Handshake` service

It can be configured using the `Builder::with_best_tip_height`. It's
currently not used, but it will be used to determine if a connection to
a remote peer should be rejected or not based on that peer's protocol
version.

* Require best tip height to init. `zebra_network`

Without it the handshake service can't properly enforce the minimum
network protocol version from peers. Zebrad obtains the best tip height
endpoint from `zebra_state`, and the test vectors simply use a dummy
endpoint that's fixed at the genesis height.

* Pass `best_tip_height` to proto. ver. negotiation

The protocol version negotiation code will reject connections to peers
if they are using an old protocol version. An old version is determined
based on the current known best chain tip height.

* Handle an optional height in `Version`

Fallback to the genesis height in `None` is specified.

* Reject connections to peers on old proto. versions

Avoid connecting to peers that are on protocol versions that don't
recognize a network update.

* Document why peers on old versions are rejected

Describe why it's a security issue above the check.

* Test if `BestTipHeight` starts with `None`

Check if initially there is no best tip height.

* Test if best tip height is max. of latest values

After applying a list of random updates where each one either sets the
finalized height or the non-finalized height, check that the best tip
height is the maximum of the most recently set finalized height and the
most recently set non-finalized height.

* Add `queue_and_commit_finalized` method

A small refactor to make testing easier. The handling of requests for
committing non-finalized and finalized blocks is now more consistent.

* Add `assert_block_can_be_validated` helper

Refactor to move into a separate method some assertions that are done
before a block is validated. This is to allow moving these assertions
more easily to simplify testing.

* Remove redundant PoW block assertion

It's also checked in
`zebra_state::service::check::block_is_contextually_valid`, and it was
getting in the way of tests that received a gossiped block before
finalizing enough blocks.

* Create a test strategy for test vector chain

Splits a chain loaded from the test vectors in two parts, containing the
blocks to finalize and the blocks to keep in the non-finalized state.

* Test committing blocks update best tip height

Create a mock blockchain state, with a chain of finalized blocks and a
chain of non-finalized blocks. Commit all the blocks appropriately, and
verify that the best tip height is updated.

Co-authored-by: teor <teor@riseup.net>
2021-08-08 23:52:52 +00:00
Janito Vaqueiro Ferreira Filho b68202c68a
Security: Zebra should stop gossiping unreachable addresses to other nodes, Action: re-deploy all nodes (#2392)
* Rename some methods and constants for clarity

Using the following commands:

```
fastmod '\bis_ready_for_attempt\b' is_ready_for_connection_attempt
  # One instance required a tweak, because of the ASCII diagram.
fastmod '\bwas_recently_live\b' has_connection_recently_responded
fastmod '\bwas_recently_attempted\b' was_connection_recently_attempted
fastmod '\bwas_recently_failed\b' has_connection_recently_failed
fastmod '\bLIVE_PEER_DURATION\b' MIN_PEER_RECONNECTION_DELAY
```

* Use `Instant::elapsed` for conciseness

Instead of `Instant::now().saturating_duration_since`. They're both
equivalent, and `elapsed` only panics if the `Instant` is somehow
synthetically generated.

* Allow `Duration32` to be created in other crates

Export the `Duration32` from the `zebra_chain::serialization` module.

* Add some new `Duration32` constructors

Create some helper `const` constructors to make it easy to create
constant durations. Add methods to create a `Duration32` from seconds,
minutes and hours.

* Avoid gossiping unreachable peers

When sanitizing the list of peers to gossip, remove those that we
haven't seen in more than three hours.

* Test if unreachable addresses aren't gossiped

Create a property test with random addreses inserted into an
`AddressBook`, and verify that the sanitized list of addresses does not
contain any addresses considered unreachable.

* Test if new alternate address isn't gossipable

Create a new alternate peer, because that type of `MetaAddr` does not
have `last_response` or `untrusted_last_seen` times. Verify that the
peer is not considered gossipable.

* Test if local listener is gossipable

The `MetaAddr` representing the local peer's listening address should
always be considered gossipable.

* Test if gossiped peer recently seen is gossipable

Create a `MetaAddr` representing a gossiped peer that was reported to be
seen recently. Check that the peer is considered gossipable.

* Test peer reportedly last seen in the future

Create a `MetaAddr` representing a peer gossiped and reported to have
been last seen in a time that's in the future. Check that the peer is
considered gossipable, to check that the fallback calculation is working
as intended.

* Test gossiped peer reportedly seen long ago

Create a `MetaAddr` representing a gossiped peer that was reported to
last have been seen a long time ago. Check that the peer is not
considered gossipable.

* Test if just responded peer is gossipable

Create a `MetaAddr` representing a peer that has just responded and
check that it is considered gossipable.

* Test if recently responded peer is gossipable

Create a `MetaAddr` representing a peer that last responded within the
duration a peer is considered reachable. Verify that the peer is
considered gossipable.

* Test peer that responded long ago isn't gossipable

Create a `MetaAddr` representing a peer that last responded outside the
duration a peer is considered reachable. Verify that the peer is not
considered gossipable.
2021-06-29 05:12:27 +00:00
teor 9cb7ee4d0e
Release Blocker? Disable IPv6 tests when $ZEBRA_SKIP_IPV6_TESTS is set (#2405)
* Disable IPv6 tests when $ZEBRA_SKIP_IPV6_TESTS is set

This allows users to disable IPv6 tests in environments where IPv6 is not
configured.

* Add network test env var constants

* Replace env strings with constants

fastmod '"ZEBRA_SKIP_NETWORK_TESTS"' zebra_test::net::ZEBRA_SKIP_NETWORK_TESTS
fastmod '"ZEBRA_SKIP_IPV6_TESTS"' zebra_test::net::ZEBRA_SKIP_IPV6_TESTS

* Add functions to skip network tests

* Replace test network env var checks with test function

fastmod --fixed-strings 'env::var_os(zebra_test::net::ZEBRA_SKIP_NETWORK_TESTS).is_some()' 'zebra_test::net::zebra_skip_network_tests()'
fastmod --fixed-strings 'env::var_os(zebra_test::net::ZEBRA_SKIP_IPV6_TESTS).is_some()' 'zebra_test::net::zebra_skip_ipv6_tests()'

* Remove redundant logging and use statements
2021-06-29 11:20:32 +10:00
teor bcd5f2c50d
Gossip dynamic local listener ports to peers (#2277)
* Gossip dynamically allocated listener ports to peers

Previously, Zebra would either gossip port `0`, which is invalid, or skip
gossiping its own dynamically allocated listener port.

* Improve "no configured peers" warning

And downgrade from error to warning, because inbound-only nodes are a
valid use case.

* Move random_known_port to zebra-test

* Add tests for dynamic local listener ports and the AddressBook

Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
2021-06-23 07:59:06 +10:00
teor 1a57023eac
Security: Use canonical SocketAddrs to avoid duplicate peer connections, Feature: Send local listener to peers (#2276)
* Always send our local listener with the latest time

Previously, whenever there was an inbound request for peers, we would
clone the address book and update it with the local listener.

This had two impacts:
- the listener could conflict with an existing entry,
  rather than unconditionally replacing it, and
- the listener was briefly included in the address book metrics.

As a side-effect, this change also makes sanitization slightly faster,
because it avoids some useless peer filtering and sorting.

* Skip listeners that are not valid for outbound connections

* Filter sanitized addresses Zebra based on address state

This fix correctly prevents Zebra gossiping client addresses to peers,
but still keeps the client in the address book to avoid reconnections.

* Add a full set of DateTime32 and Duration32 calculation methods

* Refactor sanitize to use the new DateTime32/Duration32 methods

* Security: Use canonical SocketAddrs to avoid duplicate connections

If we allow multiple variants for each peer address, we can make multiple
connections to that peer.

Also make sure sanitized MetaAddrs are valid for outbound connections.

* Test that address books contain the local listener address

Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
2021-06-22 02:16:59 +00:00
teor 4d22a0bae9
Security: Limit reconnection rate to individual peers (#2275)
* Security: Limit reconnection rate to individual peers

Reconnection Rate

Limit the reconnection rate to each individual peer by applying the
liveness cutoff to the attempt, responded, and failure time fields.
If any field is recent, the peer is skipped.

The new liveness cutoff skips any peers that have recently been attempted
or failed. (Previously, the liveness check was only applied if the peer
was in the `Responded` state, which could lead to repeated retries of
`Failed` peers, particularly in small address books.)

Reconnection Order

Zebra prefers more useful peer states, then the earliest attempted,
failed, and responded times, then the most recent gossiped last seen
times.

Before this change, Zebra took the most recent time in all the peer time
fields, and used that time for liveness and ordering. This led to
confusion between trusted and untrusted data, and success and failure
times.

Unlike the previous order, the new order:
- tries all peers in each state, before re-trying any peer in that state,
  and
- only checks the the gossiped untrusted last seen time
  if all other times are equal.

* Preserve the later time if changes arrive out of order

* Update CandidateSet::next documentation

* Update CandidateSet state diagram

* Fix variant names in comments

* Explain why timestamps can be left out of MetaAddrChanges

* Add a simple test for the individual peer retry limit

* Only generate valid Arbitrary PeerServices values

* Add an individual peer retry limit AddressBook and CandidateSet test

* Stop deleting recently live addresses from the address book

If we delete recently live addresses from the address book, we can get a
new entry for them, and reconnect too rapidly.

* Rename functions to match similar tokio API

* Fix docs for service sorting

* Clarify a comment

* Cleanup a variable and comments

* Remove blank lines in the CandidateSet state diagram

* Add a multi-peer proptest that checks outbound attempt fairness

* Fix a comment typo

Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>

* Simplify time maths in MetaAddr

* Create a Duration32 type to simplify calculations and comparisons

* Rename variables for clarity

* Split a string constant into multiple lines

* Make constants match rustdoc order

Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
2021-06-18 09:30:44 -03:00
teor 3932661a93
Qualify std::sync::Mutex in the unit tests (#2304)
Also add a missing zebra_test::init().
2021-06-15 10:01:56 -03:00
teor 3f7410d073
Security: stop gossiping failure and attempt times as last_seen times (#2273)
* Security: stop gossiping failure and attempt times as last_seen times

Previously, Zebra had a single time field for peer addresses, which was
updated every time a peer was attempted, sent a message, or failed.

This is a security issue, because the `last_seen` time should be
"the last time [a peer] connected to that node", so that
"nodes can use the time field to avoid relaying old 'addr' messages".
So Zebra was sending incorrect peer information to other nodes.

As part of this change, we split the `last_seen` time into the
following fields:
- untrusted_last_seen: gossiped from other peers
- last_response: time we got a response from a directly connected peer
- last_attempt: time we attempted to connect to a peer
- last_failure: time a connection with a peer failed

* Implement Arbitrary and strategies for MetaAddrChange

Also replace the MetaAddr Arbitrary impl with a derive.

* Write proptests for MetaAddr and MetaAddrChange

MetaAddr:
- the only times that get included in serialized MetaAddrs are
  the untrusted last seen and responded times

MetaAddrChange:
- the untrusted last seen time is never updated
- the services are only updated if there has been a handshake
2021-06-15 13:31:16 +10:00
teor 86f23f7960
Security: only apply the outbound connection rate-limit to actual connections (#2278)
* Only advance the outbound connection timer when it returns an address

Previously, we were advancing the timer even when we returned `None`.
This created large wait times when there were no eligible peers.

* Refactor to avoid overlapping sleep timers

* Add a maximum next peer delay test

Also refactor peer numbers into constants.

* Make the number of proptests overridable by the standard env var

Also cleanup the test constants.

* Test that skipping peer connections also skips their rate limits

* Allow an extra second after each sleep on loaded machines

macOS VMs seem to need this extra time to pass their tests.

* Restart test time bounds from the current time

This change avoids test failures due to cumulative errors.

Also use a single call to `Instant::now` for each test round.
And print the times when the tests fail.

* Stop generating invalid outbound peers in proptests

The candidate set proptests will fail if enough generated peers are
invalid for outbound connections.
2021-06-15 08:29:17 +10:00
Janito Vaqueiro Ferreira Filho a2d3078fcb
Replace usage of atomics with `tokio::sync::watch` (#2272)
Rust atomics have an API that's very easy to use incorrectly, leading to
hard to find bugs. For that reason, it's best to avoid it unless there's
a good reason not to.
2021-06-11 12:25:06 +10:00
Janito Vaqueiro Ferreira Filho e8d5f6978d
Rate limit `GetAddr` messages to any peer, Credit: Equilibrium (#2254)
* Rename field to `wait_next_handshake`

Make the name a bit more clear regarding to the field's purpose.

* Move `MIN_PEER_CONNECTION_INTERVAL` to `constants`

Move it to the `constants` module so that it is placed closer to other
constants for consistency and to make it easier to see any relationships
when changing them.

* Rate limit calls to `CandidateSet::update()`

This effectively rate limits requests asking for more peer addresses
sent to the same peer. A new `min_next_crawl` field was added to
`CandidateSet`, and `update` only sends requests for more peer addresses
if the call happens after the instant specified by that field. After
sending the requests, the field value is updated so that there is a
`MIN_PEER_GET_ADDR_INTERVAL` wait time until the next `update` call
sends requests again.

* Include `update_initial` in rate limiting

Move the rate limiting code from `update` to `update_timeout`, so that
both `update` and `update_initial` get rate limited.

* Test `CandidateSet::update` rate limiting

Create a `CandidateSet` that uses a mocked `PeerService`. The mocked
service always returns an empty list of peers, but it also checks that
the requests only happen after expected instants, determined by the
fanout amount and the rate limiting interval.

* Refactor to create a `mock_peer_service` helper

Move the code from the test to a utility function so that another test
will be able to use it as well.

* Check number of times service was called

Use an `AtomicUsize` shared between the service and the test body that
the service increments on every call. The test can then verify if the
service was called the number of times it expected.

* Test calling `update` after `update_initial`

The call to `update` should be skipped because the call to
`update_initial` should also be considered in the rate limiting.

* Mention that call to `update` may be skipped

Make it clearer that in this case the rate limiting causes calls to be
skipped, and not that there's an internal sleep that happens.

Also remove "to the same peers", because it's more general than that.

Co-authored-by: teor <teor@riseup.net>
2021-06-09 09:42:45 +10:00
Janito Vaqueiro Ferreira Filho aaef94c2bf
Prevent burst of reconnection attempts (#2251)
* Rate-limit new outbound peer connections

Set the rate-limiting sleep timer to use a delay added to the maximum
between the next peer connection instant and now. This ensures that the
timer always sleeps at least the time used for the delay.

This change fixes rate-limiting new outbound peer connections, since
before there could be a burst of attempts until the deadline progressed
to the current instant.

Fixes #2216

* Create `MetaAddr::alternate_node_strategy` helper

Creates arbitrary `MetaAddr`s as if they were network nodes that sent
their listening address.

* Test outbound peer connection rate limiting

Tests if connections are rate limited to 10 per second, and also tests
that sleeping before continuing with the attempts still respets the rate
limit and does not result in a burst of reconnection attempts.
2021-06-07 14:13:46 +10:00
teor ce45198c17
Fix comment typo: overflow -> underflow 2021-06-01 16:44:45 +10:00
teor 34702f22b6 clippy: remove needless clone and collect 2021-06-01 03:42:08 -03:00
Janito Vaqueiro Ferreira Filho 83ac1519e9 Add proptest for future `last_seen` correction
Given a generated list of gossiped peers, ensure that after running the
`validate_addrs` function none of the resulting peers have a `last_seen`
time that's after the specified limit.
2021-06-01 03:42:08 -03:00
Janito Vaqueiro Ferreira Filho 63672a2633 Test underflow handling
If the calculation to apply the compensation offset overflows or
underflows, the reported times are too distant apart, and could be sent
on purpose by a malicious peer, so all addresses from that peer should
be rejected.
2021-06-01 03:42:08 -03:00
Janito Vaqueiro Ferreira Filho f263d85aa4 Test `last_seen` time being equal to the limit
If the most recent `last_seen` time reported by a peer is exactly the
limit, the offset doesn't need to be applied because no times are in the
future.
2021-06-01 03:42:08 -03:00
Janito Vaqueiro Ferreira Filho f4a7026aa3 Test that offset is applied to all gossiped peers
Use some mock gossiped peers where some have `last_seen` times in the
past and some have times in the future. Check that all the peers have
an offset applied to them by the `validate_addrs` function.

This tests if the offset is applied to all peers that a malicious peer
gossiped to us.
2021-06-01 03:42:08 -03:00
Janito Vaqueiro Ferreira Filho 60f660e53f Test if validation doesn't offset past times
Use some mock gossiped peers that all have `last_seen` times in the
past and check that they don't have any changes to the `last_seen` times
applied by the `validate_addrs` function.
2021-06-01 03:42:08 -03:00
Janito Vaqueiro Ferreira Filho 3c9c920bbd Test if validation offsets times in the future
Use some mock gossiped peers that all have `last_seen` times in the
future and check that they all have a specific offset applied to them.
2021-06-01 03:42:08 -03:00
Janito Vaqueiro Ferreira Filho 82452621e0 Remove empty list of peers check
The `limit_last_seen_times` can now safely handle an empty list.
2021-06-01 03:42:08 -03:00
Janito Vaqueiro Ferreira Filho 966430d400 Update security note to be broader
Focus on what can go wrong, and not on the specific causes.

Co-authored-by: teor <teor@riseup.net>
2021-06-01 03:42:08 -03:00
Janito Vaqueiro Ferreira Filho f3419b7baf Handle overflow when applying offset
If an overflow occurs, the reported `last_seen` times are either very
wrong or malicious, so reject all addresses gossiped by that peer.
2021-06-01 03:42:08 -03:00
Janito Vaqueiro Ferreira Filho 5b8f33390c Add comment to describe purpose
Make it clear why all peers have the time offset applied to them.

Co-authored-by: teor <teor@riseup.net>
2021-06-01 03:42:08 -03:00
Janito Vaqueiro Ferreira Filho 9eac43a8bb Apply offset to all times received from a peer
If any of the times gossiped by a peer are in the future, apply the
necessary offset to all the times gossiped by that peer. This ensures
that all gossiped peers from a malicious peer are moved further back in
the queue.

Co-authored-by: teor <teor@riseup.net>
2021-06-01 03:42:08 -03:00
Janito Vaqueiro Ferreira Filho fa35c9b4f1 Only apply offset to times in the future
Times in the past don't have any security implications, so there's no
point in trying to apply the offset to them as well.
2021-06-01 03:42:08 -03:00
Janito Vaqueiro Ferreira Filho 876d515dd6 Improve documentation
- Make the security impact clearer and in a separate section.
- Instead of listing an assumption as almost a side-note, describe it
  clearly inside a `Panics` section.

Co-authored-by: teor <teor@riseup.net>
2021-06-01 03:42:08 -03:00
Janito Vaqueiro Ferreira Filho 54809a1b89 Don't trust reported peer `last_seen` times
Due to clock skew, the peers could end up at the front of the
reconnection queue or far at the back. The solution to this is to offset
the reported times by the difference between the most recent reported
sight (in the remote clock) and the current time (in the local clock).
2021-06-01 03:42:08 -03:00
Janito Vaqueiro Ferreira Filho 14ecc79f01 Use `DateTime32` in `validate_addrs` 2021-06-01 03:42:08 -03:00
Janito Vaqueiro Ferreira Filho b891a96a6d Improve ergonomics by returning `impl Iterator`
Returning `impl IntoIterator` means that the caller will always be
forced to call `.into_iter()`, and returning `impl Iterator` still
allows them to call `.into_iter()` because it becomes the identity
function.
2021-06-01 03:42:08 -03:00
teor 2685fc746e
Remove CandidateSet state and add last seen time limit to candidate_set::validate_addrs (#2177) 2021-05-21 02:21:13 +00:00
teor 752358d236
Fix some candidate set and meta addr doc links (#2174)
Suggested by jvff.
2021-05-21 11:40:14 +10:00
teor c7ea1395e7 Security: Fix CandidateSet timeout and fanout
* Refactor: Split CandidateSet::update into separate functions
* Security: Apply a timeout to the entire CandidateSet::update
* Security: Stop using very large fanout limits during initialization

Previously, Zebra used the number of resolved peer addresses.
So it was possible for all peers to fail, and for Zebra to hang on the
first update.

And Zebra could send a fanout for each initial peer, regardless
of whether their connection was successful.

Also:
- wait for at least one successful peer before trying an update
- warn if there are no successful initial peers
2021-05-21 06:51:34 +10:00
teor 92828bbb29 Reliability: send local listener address to peers
When peers ask for peer addresses, add our local listener address to the
set of addresses, sanitize, then truncate. Sanitize shuffles addresses,
so if there are lots of addresses in the address book, our address will
only be sent to some peers.
2021-05-18 14:02:19 +10:00
teor 458c26f1e3 Limit initial candidate set fanout to the number of initial peers
If there is a small number of initial peers, and they are slow, the
initial candidate set update can appear to hang. To avoid this issue,
limit the initial candidate set fanout to the number of initial peers.

Once the initial peers have sent us more peer addresses, there is no need
to limit the fanouts for future updates.

Reported by Niklas Long of Equilibrium.
2021-05-18 07:54:03 +10:00
teor b0b8b2f61a
Add extra instrumentation for initialize and handshakes (#2122)
* Instrument the crawl task

When we created the crawl task, we forgot to instrument it with the
global span. This fix makes sure that the git and network span appears on
crawl logs.

* Instrument the connector

* Improve handshake instrumentation

Make some spans debug, so there are not too many spans.

* Add the address to initial peer connection errors
2021-05-17 16:49:16 -04:00
teor a8a0d6450c Security: stop gossiping temporary inbound remote addresses to peers
- stop putting inbound addresses in the address book
- drop address book entries that can't be used for outbound connections
  - distinguish between temporary inbound and permanent outbound peer
    addresses
  - also create variants to handle proxy connections
    (but don't use them yet)
  - avoid tracking connection state for isolated connections
- document security constraints for the address book and peer set
2021-05-14 23:45:42 +10:00
Kirill Fomichev afac2c2846
Use the default port for configured listen addresses with no port (#2043)
* Allow use listen address in config without port

* update comments

* remove not used alias

* use Network::default_port

* Move tests and use toml instead json

* change error message

* Make match more readable

Co-authored-by: teor <teor@riseup.net>
2021-04-21 23:14:29 +00:00
teor 0203d1475a Refactor and document correctness for std::sync::Mutex<AddressBook> 2021-04-21 17:14:47 -04:00
teor 2ed8bb00cf Clarify CandidateSet state diagram
We get inbound connections on the listener port,
but the important part is the inbound connection
itself.
2021-04-21 01:37:43 -04:00
teor a417c7c8c7 Use meaningful names for select! variables 2021-04-13 23:56:16 -04:00
teor fb95de99a6 Refactor the dial result into a From impl 2021-04-13 18:52:49 -04:00
teor 375c8d8700
Fix a deadlock between the crawler and dialer, and other hangs (#1950)
* Stop ignoring inbound message errors and handshake timeouts

To avoid hangs, Zebra needs to maintain the following invariants in the
handshake and heartbeat code:
- each handshake should run in a separate spawned task
  (not yet implemented)
- every message, error, timeout, and shutdown must update the peer address state
- every await that depends on the network must have a timeout

Once the Connection is created, it should handle timeouts.
But we need to handle timeouts during handshake setup.

* Avoid hangs by adding a timeout to the candidate set update

Also increase the fanout from 1 to 2, to increase address diversity.

But only return permanent errors from `CandidateSet::update`, because
the crawler task exits if `update` returns an error.

Also log Peers response errors in the CandidateSet.

* Use the select macro in the crawler to reduce hangs

The `select` function is biased towards its first argument, risking
starvation.

As a side-benefit, this change also makes the code a lot easier to read
and maintain.

* Split CrawlerAction::Demand into separate actions

This refactor makes the code a bit easier to read, at the cost of
sometimes blocking the crawler on `candidates.next()`.

That's ok, because `next` only has a short (< 100 ms) delay. And we're
just about to spawn a separate task for each handshake.

* Spawn a separate task for each handshake

This change avoids deadlocks by letting each handshake make progress
independently.

* Move the dial task into a separate function

This refactor improves readability.

* Fix buggy future::select function usage

And document the correctness of the new code.
2021-04-07 10:25:10 -03:00
teor de6d1c93f3
Clarify a comment 2021-04-07 18:56:38 +10:00
teor 83b88f5b7a
Merge pull request #1972 from ZcashFoundation/peer-set-demand-deadlock-doc
Document peer set deadlock resistance
2021-04-01 22:50:17 -04:00
teor 306fa88214 Document the correctness of Poll::Pending wakeups 2021-03-27 08:55:49 -04:00
teor 1a159dfcb6 Add more methods for creating MetaAddrs
This refactor lets us remove `MetaAddr::update_last_seen()`.
2021-03-26 07:23:49 +10:00
teor 6fe81d8992 Make MetaAddr.last_seen into a private field 2021-03-26 07:23:49 +10:00
teor 5a30268d7a Log address metrics when the peer set has no ready peers 2021-03-17 10:47:04 +10:00
Jack Grigg e51f33a4b9 Use interoperable names for common metrics
These names match the equivalent metrics in zcashd, enabling common
metrics to be collected across both node types.
2021-03-17 09:38:07 +10:00
teor e50692bd51 CandidateSet: Add Listener Port Connections
Inbound connections on the Zcash protocol listener port
perform a handshake. If the handshake is successful, it
adds the peer to the AddressBook.
2021-03-09 23:05:18 -05:00
Jane Lusby 03aa6f671f
Implement outbound connection rate limiting - includes config rename with alias (#1855)
* Implement outbound connection rate limiting
* fix breaking change on config

Co-authored-by: teor <teor@riseup.net>
2021-03-10 01:36:05 +00:00
teor d4f2f27218
Add global span to spawned network tasks (#1761)
Closes #1575
2021-02-20 08:36:50 +10:00
teor e61b5e50a2
Diagnostics for CI port conflict failures (#1766)
Log a "Trying..." message before each listener opens, to see if the
delay is inside Zebra, or in the test harness or OS.

Also report the configured and actual ports where possible, for better
diagnostics.
2021-02-18 12:15:09 -03:00
teor 5424e1d8ba
Fix candidate set address state handling (#1709)
Design:
- Add a `PeerAddrState` to each `MetaAddr`
- Use a single peer set for all peers, regardless of state
- Implement time-based liveness as an `AddressBook` method, rather than
  a `PeerAddrState` variant
- Delete `AddressBook.by_state`

Implementation:
- Simplify `AddressBook` changes using `update` and `take` modifier
  methods
- Simplify the `AddressBook` iterator implementation, replacing it with
  methods that are more obviously correct
- Consistently collect peer set metrics

Documentation:
- Expand and update the peer set documentation

We can optimise later, but for now we want simple code that is more
obviously correct.
2021-02-18 11:18:32 +10:00
teor 86169f6412
Update PeerSet metrics after every change (#1727) 2021-02-18 07:06:59 +10:00
teor 8d1c498234 Log initial peer connection failures
And standardise another log message
2021-02-17 09:21:53 -05:00
teor e85441c914 Add a correctness comment to justify the revert 2021-02-16 05:52:54 +10:00
teor a02a00a3f5 Revert "Stop using CallAllUnordered in peer_set::add_initial_peers (#1705)"
This reverts commit 241c7ad849.
2021-02-16 05:52:54 +10:00
Alfredo Garcia 241c7ad849
Stop using CallAllUnordered in peer_set::add_initial_peers (#1705)
* use ServiceExt::oneshot and FuturesUnordered

Co-authored-by: teor <teor@riseup.net>
2021-02-09 08:16:02 +10:00
Alfredo Garcia 221512c733
Async DNS seeder lookups (#1662)
* replace to_socket_addrs
* refactor `resolve()` into `resolve_host()`
* use `resolve_host()` to resolve config peers
* add DNS_LOOKUP_TIMEOUT constant
* don't block the main thread in initialize
2021-02-03 12:20:26 +10:00
Alfredo Garcia 4b34482264
Add hints to port conflict and lock file panics (#1535)
* add hint for port error
* add issue filter for port panic
* add lock file hint
* add metrics endpoint port conflict hint
* add hint for tracing endpoint port conflict
* add acceptance test for resource conflics
* Split out common conflict test code into a function
* Add state, metrics, and tracing conflict tests

* Add a full set of stderr acceptance test functions

This change makes the stdout and stderr acceptance test interfaces
identical.

* move Zcash listener opening
* add todo about hint for disk full
* add constant for lock file
* match path in state cache
* don't match windows cache path

* Use Display for state path logs

Avoids weird escaping on Windows when using Debug

* Add Windows conflict error messages

* Turn PORT_IN_USE_ERROR into a regex

And add another alternative Windows-specific port error

Co-authored-by: teor <teor@riseup.net>
Co-authored-by: Jane Lusby <jane@zfnd.org>
2021-01-29 22:36:33 +10:00
Jane Lusby 15698245e1
Deduplicate metrics dependencies (#1561)
## Motivation

This PR is motivated by the regression identified in https://github.com/ZcashFoundation/zebra/issues/1349. That PR notes that the metrics stopped working for most of the crates other than `zebrad`.

## Solution

This PR resolves the regression by deduplicating the `metrics` crate dependency. During a recent change we upgraded the metrics version in `zebrad` and a couple other of our crates, but we never updated the dependencies in `zebra-state`, `zebra-consensus`, or `zebra-network`. This caused the metrics macros to attempt to retrieve the current metrics exporter through the wrong function. We would install the metrics exporter in `0.13`, but then attempt to look it up through the `0.12` crate, which contains a different instance of the metrics exporter static variable which is unset. Doing this causes the metrics macros to return `None` for the current exporter after which they just silently give up.

## Related Issues

closes https://github.com/ZcashFoundation/zebra/issues/1349

## Follow Up Work

I noticed we have quite a few duplicate dependencies in our tree. We might be able to save some compilation time by auditing those and deduplicating them as much as possible.

- https://github.com/ZcashFoundation/zebra/issues/1582
Co-authored-by: teor <teor@riseup.net>
2021-01-12 12:28:56 +10:00
teor d482900e7f Remove a redundant pattern match
Identified by clippy's redundant_pattern_match lint.
2020-12-13 22:10:05 -05:00
teor 8e2f08221f
Add peer set tracing and unreachable panics (#1468)
Add some extra tracing and panics to double-check our
assumptions about the peer set state machine.
2020-12-14 11:00:39 +10:00
teor 34518525a5 Improve peer set logging hints
Delete hints about configuring peers.
Delete hint for typical "no ready peers" behaviour.
2020-12-01 21:37:15 -08:00
Henry de Valence 00c4f4f0e6 network: record cause of handshake failure 2020-12-01 19:16:41 -08:00
teor 4d5ea4897c Log peer set ready and unready peers
* warn: if there are no peers at all
* info: if there are no ready peers
* trace: the number of ready and unready peers for every request

Log at most one warn or info log per minute, to avoid flooding the
terminal with log lines. Suppress warn and info logs for the first
minute, while the peer set is starting up.
2020-12-01 11:00:21 -05:00
teor 8d6ac8eece Placate clippy 2020-11-24 20:03:21 +10:00
Henry de Valence d90e709ce1 network: tidy peer set implementation
- rename functions more descriptively
- create a common `take_ready_service` function
- organize poll_ functions separately
2020-11-24 20:03:21 +10:00
Henry de Valence f36a4800b2 network: fix invariant violation in peer set
Closes #1183.

The peer set maintains a preselected ready service that it can use to
perform power-of-two-choices (p2c) routing of requests.  Ready services
are stored by key (socket address) in an `IndexMap`, and the preselected
service is represented by an `Option<usize>` indexing that map.  This
means that whenever the set of ready services changes (e.g., a service
is removed from the peer set, or a service is taken to be used to
process a request), the preselected index is invalidated.  The original
P2C-only implementation maintained this invariant but did not document
it.

The change to inventory-based routing introduced a bug by failing to
maintain this invariant and appropriately invalidate the preselected
index.  However, this was only noticeable approximately 1/N of the time
on the next request after an inventory-directed request, so the bug
occurred infrequently.  Luckily, the use of `.expect` caused the bug to
be an immediate panic, making it possible to identify by inspecting all
uses of the ready service map.
2020-11-24 20:03:21 +10:00
Henry de Valence add94c1c45 deps: move to tokio 0.3, tower 0.4
This change is mostly mechanical, with the exception of the changes to the
`tower-batch` middleware.  This middleware was adapted from `tower::buffer`,
and the `tower::buffer` code was changed to implement its own bounded queue,
because Tokio 0.3 removed the `mpsc::Sender::poll_send` method.  See

ddc64e8d4d

for more context on the Tower changes.  To match Tower as closely as possible
in order to be able to upstream `tower-batch`, those changes are copied from
`tower::Buffer` to `tower-batch`.
2020-11-20 10:08:16 -08:00
Henry de Valence 6dd7318d3b deps: use Tower 0.4 from git instead of 0.3.1.
This addresses at least three pain points:

- we were affected by bugs that were already fixed in git, but not in
  the released crate;
- we can use service combinators to transform requests and responses;
- we can use the hedge middleware.

The version in git is still marked as 0.3.1 but these changes will be
part of tower 0.4: https://github.com/tower-rs/tower/issues/431
2020-09-21 14:16:56 -07:00
Deirdre Connolly 33afeb37cb Add a comment about the short looo 2020-09-21 09:26:39 -07:00
Henry de Valence 6f3288814c network: avoid GetPeers timeout to accelerate init
The GetPeers requests sent while crawling the network are randomly
load-balanced over available peers.  But at the very beginning, they may
be both routed to the same peer, causing network initialization to be
delayed while the second one times out (since zcashd only ever responds
to the first addr message).

Only sending one GetPeers request per candidate set update means we
crawl the network a little more slowly, but avoids hanging on start.
2020-09-21 09:26:39 -07:00
Henry de Valence 170f588ffb network: document load-shedding behavior
This was part of the original design and is described in the Connection
internals, but we never documented it externally.
2020-09-18 18:34:25 -07:00
Henry de Valence 1d3892e1dc network: rename alias to BoxError
This is shorter and consistent with Tower (which is why we use it in the
first place).
2020-09-18 18:34:25 -07:00
Henry de Valence 3f150eb16e
network: implement transaction request handling. (#1016)
This commit makes several related changes to the network code:

- adds a `TransactionsByHash(HashSet<transaction::Hash>)` request and
  `Transactions(Vec<Arc<Transaction>>)` response pair that allows
  fetching transactions from a remote peer;

- adds a `PushTransaction(Arc<Transaction>)` request that pushes an
  unsolicited transaction to a remote peer;

- adds an `AdvertiseTransactions(HashSet<transaction::Hash>)` request
  that advertises transactions by hash to a remote peer;

- adds an `AdvertiseBlock(block::Hash)` request that advertises a block
  by hash to a remote peer;

Then, it modifies the connection state machine so that outbound
requests to remote peers are handled properly:

- `TransactionsByHash` generates a `getdata` message and collects the
  results, like the existing `BlocksByHash` request.

- `PushTransaction` generates a `tx` message, and returns `Nil` immediately.

- `AdvertiseTransactions` and `AdvertiseBlock` generate an `inv`
  message, and return `Nil` immediately.

Next, it modifies the connection state machine so that messages
from remote peers generate requests to the inbound service:

- `getdata` messages generate `BlocksByHash` or `TransactionsByHash`
  requests, depending on the content of the message;

- `tx` messages generate `PushTransaction` requests;

- `inv` messages generate `AdvertiseBlock` or `AdvertiseTransactions`
  requests.

Finally, it refactors the request routing logic for the peer set to
handle advertisement messages, providing three routing methods:

- `route_p2c`, which uses p2c as normal (default);
- `route_inv`, which uses the inventory registry and falls back to p2c
  (used for `BlocksByHash` or `TransactionsByHash`);
- `route_all`, which broadcasts a request to all ready peers (used for
  `AdvertiseBlock` and `AdvertiseTransactions`).
2020-09-08 10:16:29 -07:00
Henry de Valence cad38415b2
network: fix bug in inventory advertisement handling (#1022)
* network: fix bug in inventory advertisement handling

The RFC https://zebra.zfnd.org/dev/rfcs/0003-inventory-tracking.html described
the use of a `broadcast` channel in place of an `mpsc` channel to get
ring-buffer behavior, keeping a bound on the size of the channel but dropping
old entries when the channel is full.

However, it didn't explicitly describe how this works (the `broadcast` channel
returns a `RecvError::Lagged(u64)` to inform receivers that they lost
messages), so the lag-handling wasn't implemented and I didn't notice in
review.

Instead, the ? operator bubbled the lag error all the way up from
`InventoryRegistry::poll_inventory` through `<PeerSet as Service>::poll_ready`
through various Tower wrappers to users of the peer set.  The error propagation
is bad enough, because it caused client errors that shouldn't have happened,
but there's a worse interaction.

The `Service` contract distinguishes between request errors (from
`Service::call`, scoped to the request) and service errors (from
`Service::poll_ready`, scoped to the service).  The `Service` contract
specifies that once a service returns an error from `poll_ready`, the service
can be assumed to be failed permanently.

I believe (but haven't tested or carefully worked through the details) that
this caused various tower middleware to report the entire peer set service as
permanently failed due to a transient inventory "error" (more of an indicator),
and I suspect that this is the cause of #1003, where all of the sync
component's requests end up failing because the peer set reported that it
failed permanently.  I am able to reproduce #1003 locally before this change
and unable to reproduce it locally after this change, though I have not tested
exhaustively.

* network: add metric for dropped inventory advertisements

Co-authored-by: teor <teor@riseup.net>

Co-authored-by: teor <teor@riseup.net>
2020-09-07 21:24:31 -07:00
Jane Lusby 88557ddd0a address more comments 2020-09-01 21:01:38 -04:00
Jane Lusby 96c8809348
Implement Inventory Tracking RFC (#963)
* Add .cargo to the gitignore file

* Implement Inventory Tracking RFC

* checkpoint

* wire together the inventory registry

* add comment documenting condition

* make inventory registry optional
2020-09-01 14:28:54 -07:00
Henry de Valence fddba7a336 network: remove handshake::Builder::with_addr
Use the listen_addr field already specified in the config.

Also, derive Clone for Handshake<S>.

Co-authored-by: Jane Lusby <jane@zfnd.org>
2020-09-01 13:56:00 -07:00
Henry de Valence 1b5a824584 network: fix bug in BIP37 relay flag handling.
The relay flag in the version message is used in conjunction with BIP37 to
receive bloom-filtered transactions.  When it is set to false, transactions are
not relayed until a bloom filter is set.  Since we don't implement BIP37 (it's
not useful for shielded transactions), this means we'll never receive
transactions.
2020-09-01 13:56:00 -07:00
Henry de Valence 60a0b8c382 network: change Handshake::new to a Builder.
This allows more detailed control over the handshake parameters.
2020-09-01 13:56:00 -07:00
Henry de Valence 948b067808 chain: move Network, NetworkUpgrade to parameters
Also, avoid using star-imports of the enum variants, which pollutes the
namespace.
2020-08-17 11:46:34 -07:00
teor 109666cc48
fix: Tweak the the network listener log (#886) 2020-08-12 14:22:54 -07:00
Henry de Valence 299afe13df
zebra-network tweaks. (#877)
* network: move gossiped peer selection logic into address book.

* network: return BoxService from init.

* zebrad: add note on why we truncate thegossiped peer list

Co-authored-by: Jane Lusby <jlusby42@gmail.com>

* Remove unused .rustfmt.toml

Many of these options are never actually loaded by our CI because of a channel
mismatch, where they're not applied on stable but only on nightly (see the logs
from a rustfmt job).  This means that we can get different settings when
running `cargo fmt` on the nightly and stable channels, which was causing a CI
failure on this PR.  Reverting back to the default rustfmt settings avoids this
problem and keeps us in line with upstream rustfmt.  There's no loss to us
since we were using the defaults anyways.

Co-authored-by: Jane Lusby <jlusby42@gmail.com>
2020-08-11 13:07:44 -07:00
Alfredo Garcia 9c387521bd
Print endpoint addresses at startup (#867)
* print tracing and metrics endpoints in startup

* print network address in startup
2020-08-10 12:47:26 -07:00
Henry de Valence 3d46ab746a
Clean up options in network config section. (#839)
Closes #536.

This removes:

- the user-agent (we can add a mechanism to specify extra BIP14 components later, if any users ask us for that feature);
- the EWMA parameters (these were put in the config just to avoid making a choice);
- the peer connection timeout (we can change the default value if anyone ever has a problem with it);
- the peer set request buffer size (setting this too low can make the application deadlock);

The new peer interval is left in.
2020-08-06 11:29:00 -07:00
teor 6be0f8ed2f fix: Warn if the listener port is for the wrong network
We'll fix the underlying defaults in #660, with the rest of the
listeners.
2020-07-29 16:03:52 +10:00
Henry de Valence 217c25ef07 network: propagate tracing Spans through peer connection 2020-07-09 11:15:06 -07:00
Dimitris Apostolou ba81d7d4c0 Fix typos 2020-07-07 11:13:49 -07:00
Jane Lusby 685bdaf2df don't require absense of cancel handles
Prior to this change, we required that services that are canceled do not
have a cancel handle in the `cancel_handles` list, based on the
assumption that the handle must have been removed in the process of
canceling this service.

This doesn't holding up though, because it is currently possible for us
to have the same peer connect to us multiple times, the second connect
removes the cancel handle of the original connect and inserts it's own
cancel handle in its place. In this scenario, when the first service is
polled for readiness it will see that it has been canceled and go to
clean itself up, but when it asserts that it doesn't have a cancel
handle it will see the cancel handle of the second connect event, which
uses the same key as the first connect, and fail its debug assertion.

This change removes that debug assert on the assumption that it is okay
for a peer to connect multiple times consecutively, and that the correct
behavior in that case is to just cancel the first connection and
continue as normal.
2020-06-16 13:42:31 -07:00