* Rename some methods and constants for clarity
Using the following commands:
```
fastmod '\bis_ready_for_attempt\b' is_ready_for_connection_attempt
# One instance required a tweak, because of the ASCII diagram.
fastmod '\bwas_recently_live\b' has_connection_recently_responded
fastmod '\bwas_recently_attempted\b' was_connection_recently_attempted
fastmod '\bwas_recently_failed\b' has_connection_recently_failed
fastmod '\bLIVE_PEER_DURATION\b' MIN_PEER_RECONNECTION_DELAY
```
* Use `Instant::elapsed` for conciseness
Instead of `Instant::now().saturating_duration_since`. The two are
equivalent, and `elapsed` only panics if the `Instant` is somehow
synthetically generated.
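For reference, the two forms side by side (a minimal standalone sketch using only the standard library):
```rust
use std::time::Instant;

fn main() {
    let start = Instant::now();

    // The longer form this change replaces:
    let verbose = Instant::now().saturating_duration_since(start);

    // The concise equivalent:
    let concise = start.elapsed();

    println!("{:?} vs {:?}", verbose, concise);
}
```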
* Allow `Duration32` to be created in other crates
Export the `Duration32` from the `zebra_chain::serialization` module.
* Add some new `Duration32` constructors
Create some helper `const` constructors to make it easy to create
constant durations. Add methods to create a `Duration32` from seconds,
minutes and hours.
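A minimal sketch of what such constructors might look like; the constructor names and the `u32`-seconds representation are assumptions for illustration, not necessarily the crate's exact API:
```rust
/// A 32-bit duration in seconds (sketch).
#[derive(Copy, Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Duration32(u32);

impl Duration32 {
    pub const fn from_seconds(seconds: u32) -> Duration32 {
        Duration32(seconds)
    }

    pub const fn from_minutes(minutes: u32) -> Duration32 {
        Duration32(minutes.saturating_mul(60))
    }

    pub const fn from_hours(hours: u32) -> Duration32 {
        Duration32(hours.saturating_mul(3_600))
    }
}

// Constant durations can then be written directly:
pub const THREE_HOURS: Duration32 = Duration32::from_hours(3);
```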
* Avoid gossiping unreachable peers
When sanitizing the list of peers to gossip, remove those that we
haven't seen in more than three hours.
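A simplified sketch of that cutoff, using hypothetical types and standard-library time (the real code works on `MetaAddr` entries during sanitization):
```rust
use std::time::{Duration, SystemTime};

/// How long a peer can go unseen before we stop gossiping it (3 hours).
const MAX_LAST_SEEN_AGE_FOR_GOSSIP: Duration = Duration::from_secs(3 * 60 * 60);

/// A hypothetical, simplified peer entry.
struct Peer {
    last_seen: SystemTime,
}

/// Keep only the peers seen within the last three hours.
fn reachable_for_gossip(peers: Vec<Peer>, now: SystemTime) -> Vec<Peer> {
    peers
        .into_iter()
        .filter(|peer| match now.duration_since(peer.last_seen) {
            Ok(age) => age <= MAX_LAST_SEEN_AGE_FOR_GOSSIP,
            // A `last_seen` time in the future counts as recently seen.
            Err(_) => true,
        })
        .collect()
}
```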
* Test that unreachable addresses aren't gossiped
Create a property test with random addresses inserted into an
`AddressBook`, and verify that the sanitized list of addresses does not
contain any addresses considered unreachable.
* Test that a new alternate address isn't gossipable
Create a new alternate peer, because that type of `MetaAddr` does not
have `last_response` or `untrusted_last_seen` times. Verify that the
peer is not considered gossipable.
* Test that the local listener is gossipable
The `MetaAddr` representing the local peer's listening address should
always be considered gossipable.
* Test that a recently seen gossiped peer is gossipable
Create a `MetaAddr` representing a gossiped peer that was reported to be
seen recently. Check that the peer is considered gossipable.
* Test peer reportedly last seen in the future
Create a `MetaAddr` representing a gossiped peer reported to have been
last seen at a time in the future. Check that the peer is considered
gossipable, to verify that the fallback calculation works as intended.
* Test gossiped peer reportedly seen long ago
Create a `MetaAddr` representing a gossiped peer that was reported to
have last been seen a long time ago. Check that the peer is not
considered gossipable.
* Test that a just-responded peer is gossipable
Create a `MetaAddr` representing a peer that has just responded and
check that it is considered gossipable.
* Test that a recently responded peer is gossipable
Create a `MetaAddr` representing a peer that last responded within the
duration a peer is considered reachable. Verify that the peer is
considered gossipable.
* Test peer that responded long ago isn't gossipable
Create a `MetaAddr` representing a peer that last responded outside the
duration a peer is considered reachable. Verify that the peer is not
considered gossipable.
* Disable IPv6 tests when $ZEBRA_SKIP_IPV6_TESTS is set
This allows users to disable IPv6 tests in environments where IPv6 is not
configured.
* Add network test env var constants
* Replace env strings with constants
fastmod '"ZEBRA_SKIP_NETWORK_TESTS"' zebra_test::net::ZEBRA_SKIP_NETWORK_TESTS
fastmod '"ZEBRA_SKIP_IPV6_TESTS"' zebra_test::net::ZEBRA_SKIP_IPV6_TESTS
* Add functions to skip network tests
* Replace test network env var checks with test function
fastmod --fixed-strings 'env::var_os(zebra_test::net::ZEBRA_SKIP_NETWORK_TESTS).is_some()' 'zebra_test::net::zebra_skip_network_tests()'
fastmod --fixed-strings 'env::var_os(zebra_test::net::ZEBRA_SKIP_IPV6_TESTS).is_some()' 'zebra_test::net::zebra_skip_ipv6_tests()'
* Remove redundant logging and use statements
* Gossip dynamically allocated listener ports to peers
Previously, Zebra would either gossip port `0`, which is invalid, or skip
gossiping its own dynamically allocated listener port.
* Improve "no configured peers" warning
And downgrade from error to warning, because inbound-only nodes are a
valid use case.
* Move random_known_port to zebra-test
* Add tests for dynamic local listener ports and the AddressBook
Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
* Always send our local listener with the latest time
Previously, whenever there was an inbound request for peers, we would
clone the address book and update it with the local listener.
This had two impacts:
- the listener could conflict with an existing entry,
rather than unconditionally replacing it, and
- the listener was briefly included in the address book metrics.
As a side-effect, this change also makes sanitization slightly faster,
because it avoids some useless peer filtering and sorting.
* Skip listeners that are not valid for outbound connections
* Filter Zebra's sanitized addresses based on address state
This fix correctly prevents Zebra from gossiping client addresses to
peers, but still keeps the client in the address book to avoid
reconnections.
* Add a full set of DateTime32 and Duration32 calculation methods
* Refactor sanitize to use the new DateTime32/Duration32 methods
* Security: Use canonical SocketAddrs to avoid duplicate connections
If we allow multiple variants for each peer address, we can make multiple
connections to that peer.
Also make sure sanitized MetaAddrs are valid for outbound connections.
* Test that address books contain the local listener address
Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
* Security: Limit reconnection rate to individual peers
Reconnection Rate
Limit the reconnection rate to each individual peer by applying the
liveness cutoff to the attempt, responded, and failure time fields.
If any field is recent, the peer is skipped.
The new liveness cutoff skips any peers that have recently been attempted
or failed. (Previously, the liveness check was only applied if the peer
was in the `Responded` state, which could lead to repeated retries of
`Failed` peers, particularly in small address books.)
Reconnection Order
Zebra prefers more useful peer states, then the earliest attempted,
failed, and responded times, then the most recent gossiped last seen
times.
Before this change, Zebra took the most recent time in all the peer time
fields, and used that time for liveness and ordering. This led to
confusion between trusted and untrusted data, and success and failure
times.
Unlike the previous order, the new order:
- tries all peers in each state, before re-trying any peer in that state,
and
- only checks the gossiped untrusted last seen time
if all other times are equal.
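As a rough sketch of that ordering (the field names, types, and state ranking here are illustrative, not Zebra's actual `MetaAddr`):
```rust
use std::cmp::Reverse;

/// Simplified peer entry: smaller `state_rank` means a more useful state,
/// time fields are seconds since some epoch, `None` means "never".
struct Peer {
    state_rank: u8,
    last_attempt: Option<u32>,
    last_failure: Option<u32>,
    last_response: Option<u32>,
    untrusted_last_seen: Option<u32>,
}

/// Prefer better states, then the *earliest* attempt, failure, and response
/// times, and only then the *most recent* gossiped last seen time.
fn order_for_reconnection(peers: &mut [Peer]) {
    peers.sort_by_key(|peer| {
        (
            peer.state_rank,
            // `None < Some(_)`, so never-attempted peers sort before any
            // previously attempted peer in the same state.
            peer.last_attempt,
            peer.last_failure,
            peer.last_response,
            Reverse(peer.untrusted_last_seen),
        )
    });
}
```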
* Preserve the later time if changes arrive out of order
* Update CandidateSet::next documentation
* Update CandidateSet state diagram
* Fix variant names in comments
* Explain why timestamps can be left out of MetaAddrChanges
* Add a simple test for the individual peer retry limit
* Only generate valid Arbitrary PeerServices values
* Add an individual peer retry limit AddressBook and CandidateSet test
* Stop deleting recently live addresses from the address book
If we delete recently live addresses from the address book, we can get a
new entry for them, and reconnect too rapidly.
* Rename functions to match similar tokio API
* Fix docs for service sorting
* Clarify a comment
* Cleanup a variable and comments
* Remove blank lines in the CandidateSet state diagram
* Add a multi-peer proptest that checks outbound attempt fairness
* Fix a comment typo
Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
* Simplify time maths in MetaAddr
* Create a Duration32 type to simplify calculations and comparisons
* Rename variables for clarity
* Split a string constant into multiple lines
* Make constants match rustdoc order
Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
* Security: stop gossiping failure and attempt times as last_seen times
Previously, Zebra had a single time field for peer addresses, which was
updated every time a peer was attempted, sent a message, or failed.
This is a security issue, because the `last_seen` time should be
"the last time [a peer] connected to that node", so that
"nodes can use the time field to avoid relaying old 'addr' messages".
So Zebra was sending incorrect peer information to other nodes.
As part of this change, we split the `last_seen` time into the
following fields:
- untrusted_last_seen: gossiped from other peers
- last_response: time we got a response from a directly connected peer
- last_attempt: time we attempted to connect to a peer
- last_failure: time a connection with a peer failed
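A sketch of the split, with illustrative field types (Zebra uses its own `DateTime32` wrapper rather than `SystemTime`):
```rust
use std::time::SystemTime;

/// Simplified view of the per-peer time fields after the split.
struct MetaAddrTimes {
    /// Last seen time gossiped by other peers (untrusted).
    untrusted_last_seen: Option<SystemTime>,
    /// Last time we got a response from a direct connection (trusted).
    last_response: Option<SystemTime>,
    /// Last time we attempted an outbound connection.
    last_attempt: Option<SystemTime>,
    /// Last time a connection to this peer failed.
    last_failure: Option<SystemTime>,
}

impl MetaAddrTimes {
    /// Only response and gossiped times are ever sent to other peers as
    /// `last_seen`; attempt and failure times stay local.
    fn last_seen_for_gossip(&self) -> Option<SystemTime> {
        self.last_response.or(self.untrusted_last_seen)
    }
}
```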
* Implement Arbitrary and strategies for MetaAddrChange
Also replace the MetaAddr Arbitrary impl with a derive.
* Write proptests for MetaAddr and MetaAddrChange
MetaAddr:
- the only times that get included in serialized MetaAddrs are
the untrusted last seen and responded times
MetaAddrChange:
- the untrusted last seen time is never updated
- the services are only updated if there has been a handshake
* Only advance the outbound connection timer when it returns an address
Previously, we were advancing the timer even when we returned `None`.
This created large wait times when there were no eligible peers.
* Refactor to avoid overlapping sleep timers
* Add a maximum next peer delay test
Also refactor peer numbers into constants.
* Make the number of proptests overridable by the standard env var
Also cleanup the test constants.
* Test that skipping peer connections also skips their rate limits
* Allow an extra second after each sleep on loaded machines
macOS VMs seem to need this extra time to pass their tests.
* Restart test time bounds from the current time
This change avoids test failures due to cumulative errors.
Also use a single call to `Instant::now` for each test round.
And print the times when the tests fail.
* Stop generating invalid outbound peers in proptests
The candidate set proptests will fail if enough generated peers are
invalid for outbound connections.
Rust atomics have an API that's very easy to use incorrectly, leading to
hard-to-find bugs. For that reason, it's best to avoid them unless there's
a good reason not to.
* Rename field to `wait_next_handshake`
Make the name a bit clearer regarding the field's purpose.
* Move `MIN_PEER_CONNECTION_INTERVAL` to `constants`
Move it to the `constants` module so that it is placed closer to other
constants for consistency and to make it easier to see any relationships
when changing them.
* Rate limit calls to `CandidateSet::update()`
This effectively rate limits requests asking for more peer addresses
sent to the same peer. A new `min_next_crawl` field was added to
`CandidateSet`, and `update` only sends requests for more peer addresses
if the call happens after the instant specified by that field. After
sending the requests, the field value is updated so that there is a
`MIN_PEER_GET_ADDR_INTERVAL` wait time until the next `update` call
sends requests again.
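A simplified sketch of that rate limit (the interval value and struct layout are illustrative):
```rust
use std::time::{Duration, Instant};

/// Minimum wait between fanouts of peer-address requests (illustrative value).
const MIN_PEER_GET_ADDR_INTERVAL: Duration = Duration::from_secs(10);

/// Stand-in for the relevant part of `CandidateSet`.
struct Crawler {
    min_next_crawl: Instant,
}

impl Crawler {
    /// Only send a fanout of peer-address requests if the rate limit allows it.
    fn update(&mut self, now: Instant) {
        if now < self.min_next_crawl {
            // Rate limited: skip this update entirely.
            return;
        }

        // ... send the fanout of requests for more peer addresses here ...

        // Record the earliest time the next fanout may be sent.
        self.min_next_crawl = now + MIN_PEER_GET_ADDR_INTERVAL;
    }
}
```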
* Include `update_initial` in rate limiting
Move the rate limiting code from `update` to `update_timeout`, so that
both `update` and `update_initial` get rate limited.
* Test `CandidateSet::update` rate limiting
Create a `CandidateSet` that uses a mocked `PeerService`. The mocked
service always returns an empty list of peers, but it also checks that
the requests only happen after expected instants, determined by the
fanout amount and the rate limiting interval.
* Refactor to create a `mock_peer_service` helper
Move the code from the test to a utility function so that another test
will be able to use it as well.
* Check number of times service was called
Use an `AtomicUsize` shared between the service and the test body that
the service increments on every call. The test can then verify that the
service was called the expected number of times.
* Test calling `update` after `update_initial`
The call to `update` should be skipped because the call to
`update_initial` should also be considered in the rate limiting.
* Mention that call to `update` may be skipped
Make it clearer that in this case the rate limiting causes calls to be
skipped, and not that there's an internal sleep that happens.
Also remove "to the same peers", because it's more general than that.
Co-authored-by: teor <teor@riseup.net>
* Rate-limit new outbound peer connections
Set the rate-limiting sleep timer to use a delay added to the maximum
between the next peer connection instant and now. This ensures that the
timer always sleeps at least the time used for the delay.
This change fixes rate-limiting new outbound peer connections, since
before there could be a burst of attempts until the deadline progressed
to the current instant.
Fixes #2216
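The arithmetic behind that fix, as a blocking sketch (the real code uses an async timer, and the interval value here is illustrative):
```rust
use std::{
    thread,
    time::{Duration, Instant},
};

/// Minimum spacing between outbound connection attempts (illustrative value).
const MIN_PEER_CONNECTION_INTERVAL: Duration = Duration::from_millis(100);

/// Sleep until the next allowed connection attempt.
fn wait_for_next_attempt(next_attempt: &mut Instant) {
    let now = Instant::now();

    // Adding the delay to the *maximum* of the previous deadline and `now`
    // means the timer always sleeps at least the full interval, so attempts
    // can't burst to "catch up" after a long pause.
    *next_attempt = (*next_attempt).max(now) + MIN_PEER_CONNECTION_INTERVAL;

    thread::sleep(*next_attempt - now);
}
```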
* Create `MetaAddr::alternate_node_strategy` helper
Creates arbitrary `MetaAddr`s as if they were network nodes that sent
their listening address.
* Test outbound peer connection rate limiting
Tests that connections are rate limited to 10 per second, and also tests
that sleeping before continuing with the attempts still respects the rate
limit and does not result in a burst of reconnection attempts.
Given a generated list of gossiped peers, ensure that after running the
`validate_addrs` function none of the resulting peers have a `last_seen`
time that's after the specified limit.
If the calculation to apply the compensation offset overflows or
underflows, the reported times are too distant apart, and could be sent
on purpose by a malicious peer, so all addresses from that peer should
be rejected.
Use some mock gossiped peers where some have `last_seen` times in the
past and some have times in the future. Check that all the peers have
an offset applied to them by the `validate_addrs` function.
This tests if the offset is applied to all peers that a malicious peer
gossiped to us.
Use some mock gossiped peers that all have `last_seen` times in the
past and check that they don't have any changes to the `last_seen` times
applied by the `validate_addrs` function.
If any of the times gossiped by a peer are in the future, apply the
necessary offset to all the times gossiped by that peer. This ensures
that all gossiped peers from a malicious peer are moved further back in
the queue.
Co-authored-by: teor <teor@riseup.net>
- Make the security impact clearer and describe it in a separate section.
- Instead of listing an assumption as almost a side-note, describe it
clearly inside a `Panics` section.
Co-authored-by: teor <teor@riseup.net>
Due to clock skew, the peers could end up at the front of the
reconnection queue or far at the back. The solution to this is to offset
the reported times by the difference between the most recent reported
sighting (in the remote clock) and the current time (in the local clock).
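A simplified sketch of that compensation using `chrono` types (the real logic lives in `validate_addrs` and operates on whole `MetaAddr` entries):
```rust
use chrono::{DateTime, Utc};

/// Shift a peer's reported `last_seen` times so the most recent one lines up
/// with our local clock. Returns `None` (reject the whole batch) if the
/// arithmetic overflows or underflows.
fn compensate_clock_skew(
    reported_times: &[DateTime<Utc>],
    local_now: DateTime<Utc>,
) -> Option<Vec<DateTime<Utc>>> {
    // The most recent reported sighting, in the remote peer's clock.
    let newest_reported = reported_times.iter().copied().max()?;

    // Only compensate when the remote clock is ahead of ours.
    if newest_reported <= local_now {
        return Some(reported_times.to_vec());
    }

    // Offset between the remote clock and the local clock.
    let offset = local_now.signed_duration_since(newest_reported);

    reported_times
        .iter()
        // `checked_add_signed` returns `None` on overflow or underflow,
        // which rejects all addresses from this (possibly malicious) peer.
        .map(|time| time.checked_add_signed(offset))
        .collect()
}
```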
Returning `impl IntoIterator` means that the caller will always be
forced to call `.into_iter()`, and returning `impl Iterator` still
allows them to call `.into_iter()` because it becomes the identity
function.
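A tiny example of the difference (hypothetical function):
```rust
fn peers() -> impl Iterator<Item = u32> {
    vec![1, 2, 3].into_iter()
}

fn main() {
    // Callers can iterate directly...
    for peer in peers() {
        println!("{peer}");
    }

    // ...and can still call `.into_iter()`, which is the identity function
    // for anything that is already an iterator.
    let _same = peers().into_iter();
}
```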
* Refactor: Split CandidateSet::update into separate functions
* Security: Apply a timeout to the entire CandidateSet::update
* Security: Stop using very large fanout limits during initialization
Previously, Zebra used the number of resolved peer addresses as the fanout limit.
So it was possible for all peers to fail, and for Zebra to hang on the
first update.
And Zebra could send a fanout for each initial peer, regardless
of whether their connection was successful.
Also:
- wait for at least one successful peer before trying an update
- warn if there are no successful initial peers
When peers ask for peer addresses, add our local listener address to the
set of addresses, sanitize, then truncate. Sanitize shuffles addresses,
so if there are lots of addresses in the address book, our address will
only be sent to some peers.
If there is a small number of initial peers, and they are slow, the
initial candidate set update can appear to hang. To avoid this issue,
limit the initial candidate set fanout to the number of initial peers.
Once the initial peers have sent us more peer addresses, there is no need
to limit the fanouts for future updates.
Reported by Niklas Long of Equilibrium.
* Instrument the crawl task
When we created the crawl task, we forgot to instrument it with the
global span. This fix makes sure that the git and network span appears on
crawl logs.
* Instrument the connector
* Improve handshake instrumentation
Make some spans debug, so there are not too many spans.
* Add the address to initial peer connection errors
- stop putting inbound addresses in the address book
- drop address book entries that can't be used for outbound connections
- distinguish between temporary inbound and permanent outbound peer
addresses
- also create variants to handle proxy connections
(but don't use them yet)
- avoid tracking connection state for isolated connections
- document security constraints for the address book and peer set
* Allow a listen address without a port in the config
* update comments
* remove unused alias
* use Network::default_port
* Move tests and use toml instead of json
* change error message
* Make match more readable
Co-authored-by: teor <teor@riseup.net>
* Stop ignoring inbound message errors and handshake timeouts
To avoid hangs, Zebra needs to maintain the following invariants in the
handshake and heartbeat code:
- each handshake should run in a separate spawned task
(not yet implemented)
- every message, error, timeout, and shutdown must update the peer address state
- every await that depends on the network must have a timeout
Once the Connection is created, it should handle timeouts.
But we need to handle timeouts during handshake setup.
* Avoid hangs by adding a timeout to the candidate set update
Also increase the fanout from 1 to 2, to increase address diversity.
But only return permanent errors from `CandidateSet::update`, because
the crawler task exits if `update` returns an error.
Also log Peers response errors in the CandidateSet.
* Use the select macro in the crawler to reduce hangs
The `select` function is biased towards its first argument, risking
starvation.
As a side-benefit, this change also makes the code a lot easier to read
and maintain.
* Split CrawlerAction::Demand into separate actions
This refactor makes the code a bit easier to read, at the cost of
sometimes blocking the crawler on `candidates.next()`.
That's ok, because `next` only has a short (< 100 ms) delay. And we're
just about to spawn a separate task for each handshake.
* Spawn a separate task for each handshake
This change avoids deadlocks by letting each handshake make progress
independently.
* Move the dial task into a separate function
This refactor improves readability.
* Fix buggy future::select function usage
And document the correctness of the new code.
Log a "Trying..." message before each listener opens, to see if the
delay is inside Zebra, or in the test harness or OS.
Also report the configured and actual ports where possible, for better
diagnostics.
Design:
- Add a `PeerAddrState` to each `MetaAddr`
- Use a single peer set for all peers, regardless of state
- Implement time-based liveness as an `AddressBook` method, rather than
a `PeerAddrState` variant
- Delete `AddressBook.by_state`
Implementation:
- Simplify `AddressBook` changes using `update` and `take` modifier
methods
- Simplify the `AddressBook` iterator implementation, replacing it with
methods that are more obviously correct
- Consistently collect peer set metrics
Documentation:
- Expand and update the peer set documentation
We can optimise later, but for now we want simple code that is more
obviously correct.
* replace to_socket_addrs
* refactor `resolve()` into `resolve_host()`
* use `resolve_host()` to resolve config peers
* add DNS_LOOKUP_TIMEOUT constant
* don't block the main thread in initialize
* add hint for port error
* add issue filter for port panic
* add lock file hint
* add metrics endpoint port conflict hint
* add hint for tracing endpoint port conflict
* add acceptance test for resource conflicts
* Split out common conflict test code into a function
* Add state, metrics, and tracing conflict tests
* Add a full set of stderr acceptance test functions
This change makes the stdout and stderr acceptance test interfaces
identical.
* move Zcash listener opening
* add todo about hint for disk full
* add constant for lock file
* match path in state cache
* don't match windows cache path
* Use Display for state path logs
Avoids weird escaping on Windows when using Debug
* Add Windows conflict error messages
* Turn PORT_IN_USE_ERROR into a regex
And add another alternative Windows-specific port error
Co-authored-by: teor <teor@riseup.net>
Co-authored-by: Jane Lusby <jane@zfnd.org>
## Motivation
This PR is motivated by the regression identified in https://github.com/ZcashFoundation/zebra/issues/1349. That issue notes that the metrics stopped working for most of the crates other than `zebrad`.
## Solution
This PR resolves the regression by deduplicating the `metrics` crate dependency. During a recent change we upgraded the metrics version in `zebrad` and a couple of our other crates, but we never updated the dependencies in `zebra-state`, `zebra-consensus`, or `zebra-network`. This caused the metrics macros to attempt to retrieve the current metrics exporter through the wrong function. We would install the metrics exporter in `0.13`, but then attempt to look it up through the `0.12` crate, which contains a different, unset instance of the metrics exporter static variable. Doing this causes the metrics macros to return `None` for the current exporter, after which they just silently give up.
## Related Issues
closes https://github.com/ZcashFoundation/zebra/issues/1349
## Follow Up Work
I noticed we have quite a few duplicate dependencies in our tree. We might be able to save some compilation time by auditing those and deduplicating them as much as possible.
- https://github.com/ZcashFoundation/zebra/issues/1582
Co-authored-by: teor <teor@riseup.net>
* warn: if there are no peers at all
* info: if there are no ready peers
* trace: the number of ready and unready peers for every request
Log at most one warn or info log per minute, to avoid flooding the
terminal with log lines. Suppress warn and info logs for the first
minute, while the peer set is starting up.
Closes #1183.
The peer set maintains a preselected ready service that it can use to
perform power-of-two-choices (p2c) routing of requests. Ready services
are stored by key (socket address) in an `IndexMap`, and the preselected
service is represented by an `Option<usize>` indexing that map. This
means that whenever the set of ready services changes (e.g., a service
is removed from the peer set, or a service is taken to be used to
process a request), the preselected index is invalidated. The original
P2C-only implementation maintained this invariant but did not document
it.
The change to inventory-based routing introduced a bug by failing to
maintain this invariant and appropriately invalidate the preselected
index. However, this was only noticeable approximately 1/N of the time
on the next request after an inventory-directed request, so the bug
occurred infrequently. Luckily, the use of `.expect` caused the bug to
be an immediate panic, making it possible to identify by inspecting all
uses of the ready service map.
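A highly simplified sketch of the invariant (names and structure are illustrative, not the real `PeerSet`, which keys ready services in an `IndexMap` and preselects by index):
```rust
use std::{collections::HashMap, net::SocketAddr};

struct PeerSet<S> {
    ready_services: HashMap<SocketAddr, S>,
    /// The preselected p2c choice. Must be invalidated whenever
    /// `ready_services` changes.
    preselected: Option<SocketAddr>,
}

impl<S> PeerSet<S> {
    /// Take a ready service to process a request, keeping the invariant.
    fn take_ready_service(&mut self, addr: &SocketAddr) -> Option<S> {
        let service = self.ready_services.remove(addr);

        if service.is_some() {
            // The set of ready services changed, so the preselected entry
            // may no longer be valid: clear it.
            self.preselected = None;
        }

        service
    }
}
```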
This change is mostly mechanical, with the exception of the changes to the
`tower-batch` middleware. This middleware was adapted from `tower::buffer`,
and the `tower::buffer` code was changed to implement its own bounded queue,
because Tokio 0.3 removed the `mpsc::Sender::poll_send` method. See
ddc64e8d4d
for more context on the Tower changes. To match Tower as closely as possible
in order to be able to upstream `tower-batch`, those changes are copied from
`tower::Buffer` to `tower-batch`.
This addresses at least three pain points:
- we were affected by bugs that were already fixed in git, but not in
the released crate;
- we can use service combinators to transform requests and responses;
- we can use the hedge middleware.
The version in git is still marked as 0.3.1 but these changes will be
part of tower 0.4: https://github.com/tower-rs/tower/issues/431
The GetPeers requests sent while crawling the network are randomly
load-balanced over available peers. But at the very beginning, they may
be both routed to the same peer, causing network initialization to be
delayed while the second one times out (since zcashd only ever responds
to the first addr message).
Only sending one GetPeers request per candidate set update means we
crawl the network a little more slowly, but avoids hanging on start.
This commit makes several related changes to the network code:
- adds a `TransactionsByHash(HashSet<transaction::Hash>)` request and
`Transactions(Vec<Arc<Transaction>>)` response pair that allows
fetching transactions from a remote peer;
- adds a `PushTransaction(Arc<Transaction>)` request that pushes an
unsolicited transaction to a remote peer;
- adds an `AdvertiseTransactions(HashSet<transaction::Hash>)` request
that advertises transactions by hash to a remote peer;
- adds an `AdvertiseBlock(block::Hash)` request that advertises a block
by hash to a remote peer;
Then, it modifies the connection state machine so that outbound
requests to remote peers are handled properly:
- `TransactionsByHash` generates a `getdata` message and collects the
results, like the existing `BlocksByHash` request.
- `PushTransaction` generates a `tx` message, and returns `Nil` immediately.
- `AdvertiseTransactions` and `AdvertiseBlock` generate an `inv`
message, and return `Nil` immediately.
Next, it modifies the connection state machine so that messages
from remote peers generate requests to the inbound service:
- `getdata` messages generate `BlocksByHash` or `TransactionsByHash`
requests, depending on the content of the message;
- `tx` messages generate `PushTransaction` requests;
- `inv` messages generate `AdvertiseBlock` or `AdvertiseTransactions`
requests.
Finally, it refactors the request routing logic for the peer set to
handle advertisement messages, providing three routing methods:
- `route_p2c`, which uses p2c as normal (default);
- `route_inv`, which uses the inventory registry and falls back to p2c
(used for `BlocksByHash` or `TransactionsByHash`);
- `route_all`, which broadcasts a request to all ready peers (used for
`AdvertiseBlock` and `AdvertiseTransactions`).
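Putting the new request variants and routing together, a sketch with placeholder hash and transaction types standing in for the `zebra_chain` ones:
```rust
use std::{collections::HashSet, sync::Arc};

type BlockHash = [u8; 32];
type TransactionHash = [u8; 32];
struct Transaction;

/// The request variants described above.
enum Request {
    BlocksByHash(HashSet<BlockHash>),
    TransactionsByHash(HashSet<TransactionHash>),
    PushTransaction(Arc<Transaction>),
    AdvertiseTransactions(HashSet<TransactionHash>),
    AdvertiseBlock(BlockHash),
}

/// The three routing strategies.
enum Routing {
    P2c,
    Inv,
    All,
}

fn route(request: &Request) -> Routing {
    match request {
        // Inventory-directed fetches, falling back to p2c when the
        // inventory registry has no matching peer.
        Request::BlocksByHash(_) | Request::TransactionsByHash(_) => Routing::Inv,
        // Advertisements are broadcast to all ready peers.
        Request::AdvertiseBlock(_) | Request::AdvertiseTransactions(_) => Routing::All,
        // Everything else uses plain power-of-two-choices.
        Request::PushTransaction(_) => Routing::P2c,
    }
}
```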
* network: fix bug in inventory advertisement handling
The RFC https://zebra.zfnd.org/dev/rfcs/0003-inventory-tracking.html described
the use of a `broadcast` channel in place of an `mpsc` channel to get
ring-buffer behavior, keeping a bound on the size of the channel but dropping
old entries when the channel is full.
However, it didn't explicitly describe how this works (the `broadcast` channel
returns a `RecvError::Lagged(u64)` to inform receivers that they lost
messages), so the lag-handling wasn't implemented and I didn't notice in
review.
Instead, the ? operator bubbled the lag error all the way up from
`InventoryRegistry::poll_inventory` through `<PeerSet as Service>::poll_ready`
through various Tower wrappers to users of the peer set. The error propagation
is bad enough, because it caused client errors that shouldn't have happened,
but there's a worse interaction.
The `Service` contract distinguishes between request errors (from
`Service::call`, scoped to the request) and service errors (from
`Service::poll_ready`, scoped to the service). The `Service` contract
specifies that once a service returns an error from `poll_ready`, the service
can be assumed to be failed permanently.
I believe (but haven't tested or carefully worked through the details) that
this caused various tower middleware to report the entire peer set service as
permanently failed due to a transient inventory "error" (more of an indicator),
and I suspect that this is the cause of #1003, where all of the sync
component's requests end up failing because the peer set reported that it
failed permanently. I am able to reproduce #1003 locally before this change
and unable to reproduce it locally after this change, though I have not tested
exhaustively.
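The shape of the fix, sketched with the current `tokio` broadcast API (the item type and logging are placeholders; the real code records a metric and registers inventory hashes):
```rust
use tokio::sync::broadcast::{self, error::RecvError};

/// Drain inventory advertisements, treating channel lag as "some
/// advertisements were dropped" rather than as a service error.
async fn drain_inventory(rx: &mut broadcast::Receiver<u64>) {
    loop {
        match rx.recv().await {
            Ok(advertisement) => {
                // Register the advertisement in the inventory registry.
                let _ = advertisement;
            }
            Err(RecvError::Lagged(dropped)) => {
                // Ring-buffer behavior: older entries were overwritten.
                // Keep going instead of failing `poll_ready`.
                eprintln!("dropped {dropped} inventory advertisements");
            }
            Err(RecvError::Closed) => break,
        }
    }
}
```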
* network: add metric for dropped inventory advertisements
Co-authored-by: teor <teor@riseup.net>
Co-authored-by: teor <teor@riseup.net>
The relay flag in the version message is used in conjunction with BIP37 to
receive bloom-filtered transactions. When it is set to false, transactions are
not relayed until a bloom filter is set. Since we don't implement BIP37 (it's
not useful for shielded transactions), this means we'll never receive
transactions.
* network: move gossiped peer selection logic into address book.
* network: return BoxService from init.
* zebrad: add note on why we truncate the gossiped peer list
Co-authored-by: Jane Lusby <jlusby42@gmail.com>
* Remove unused .rustfmt.toml
Many of these options are never actually loaded by our CI because of a channel
mismatch, where they're not applied on stable but only on nightly (see the logs
from a rustfmt job). This means that we can get different settings when
running `cargo fmt` on the nightly and stable channels, which was causing a CI
failure on this PR. Reverting back to the default rustfmt settings avoids this
problem and keeps us in line with upstream rustfmt. There's no loss to us
since we were using the defaults anyways.
Co-authored-by: Jane Lusby <jlusby42@gmail.com>
Closes #536.
This removes:
- the user-agent (we can add a mechanism to specify extra BIP14 components later, if any users ask us for that feature);
- the EWMA parameters (these were put in the config just to avoid making a choice);
- the peer connection timeout (we can change the default value if anyone ever has a problem with it);
- the peer set request buffer size (setting this too low can make the application deadlock);
The new peer interval is left in.
Prior to this change, we required that services that are canceled do not
have a cancel handle in the `cancel_handles` list, based on the
assumption that the handle must have been removed in the process of
canceling this service.
This doesn't hold up, though, because it is currently possible for the
same peer to connect to us multiple times: the second connection removes
the cancel handle of the original connection and inserts its own cancel
handle in its place. In this scenario, when the first service is
polled for readiness it will see that it has been canceled and go to
clean itself up, but when it asserts that it doesn't have a cancel
handle it will see the cancel handle of the second connect event, which
uses the same key as the first connect, and fail its debug assertion.
This change removes that debug assert on the assumption that it is okay
for a peer to connect multiple times consecutively, and that the correct
behavior in that case is to just cancel the first connection and
continue as normal.
Prior to this change, the service returned by `zebra_network::init` would spawn background tasks that could silently fail, causing unexpected errors in the zebra_network service.
This change modifies the `PeerSet` that backs `zebra_network::init` to store all of the `JoinHandle`s for each background task it depends on. The `PeerSet` then checks this set of futures to see if any of them have exited with an error or a panic, and if they have it returns the error as part of `poll_ready`.
Co-authored-by: Jane Lusby <jane@zfnd.org>
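A simplified sketch of that idea: the service owns its background tasks' `JoinHandle`s and surfaces their failures from `poll_ready` (names, error type, and cleanup behavior are illustrative):
```rust
use std::task::{Context, Poll};

use futures::future::FutureExt;
use tokio::task::JoinHandle;

struct PeerSetSketch {
    background_tasks: Vec<JoinHandle<Result<(), String>>>,
}

impl PeerSetSketch {
    /// Called from `poll_ready`: report any background task error or panic
    /// as a service error instead of letting it fail silently.
    fn poll_background_errors(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), String>> {
        let mut still_running = Vec::new();

        for mut handle in self.background_tasks.drain(..) {
            match handle.poll_unpin(cx) {
                // Task exited with an error: fail the service.
                Poll::Ready(Ok(Err(error))) => return Poll::Ready(Err(error)),
                // Task panicked or was cancelled: also fail the service.
                Poll::Ready(Err(join_error)) => {
                    return Poll::Ready(Err(join_error.to_string()))
                }
                // Task finished cleanly: drop its handle.
                Poll::Ready(Ok(Ok(()))) => {}
                // Still running: keep checking it on later calls.
                Poll::Pending => still_running.push(handle),
            }
        }

        self.background_tasks = still_running;
        Poll::Ready(Ok(()))
    }
}
```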
Prior to this change, the seed subcommand would consistently encounter a panic in one of the background tasks, but would continue running after the panic. This is indicative of two bugs.
First, zebrad was not configured to treat panics as non-recoverable, and instead used the tokio defaults, which are to catch panics in tasks and return them via the join handle if available, or to print them if the join handle has been discarded. This is likely a poor fit for zebrad as an application: we do not need to maximize uptime or minimize the extent of an outage should one of our tasks or services start encountering panics. Ignoring a panic increases our risk of observing invalid state, causing all sorts of wild and bad bugs. To deal with this, we've switched the default panic behavior from `unwind` to `abort`. This makes panics fail immediately and take down the entire application, regardless of where they occur, which is consistent with our treatment of misbehaving connections.
The second bug is the panic itself. This was triggered by a duplicate entry in the initial_peers set. To fix this we've switched the storage for the peers from a `Vec` to a `HashSet`, which has similar properties but guarantees uniqueness of its keys.
- Add a total peers metric to prevent races between measurements of
ready/unready peers (which can cause the sum to be wrong).
- Add an outbound request counter.
The previous outbound peer connection logic got requests to connect to new
peers and processed them one at a time, making single connection attempts
and retrying if the connection attempt failed. This was quite slow, because
many connections fail, and we have to wait for timeouts. Instead, this logic
connects to new peers concurrently (up to 50 at a time).
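A sketch of that concurrency pattern using `futures` stream combinators (the connection body and the limit constant are placeholders; the real code drives a full peer handshake per attempt):
```rust
use std::net::SocketAddr;

use futures::stream::{self, StreamExt};

/// Maximum simultaneous outbound connection attempts.
const MAX_CONCURRENT_ATTEMPTS: usize = 50;

/// A placeholder connection attempt.
async fn connect(addr: SocketAddr) -> Result<SocketAddr, std::io::Error> {
    tokio::net::TcpStream::connect(addr).await?;
    Ok(addr)
}

/// Attempt many connections concurrently, so failures and timeouts overlap
/// instead of adding up one at a time.
async fn connect_many(addrs: Vec<SocketAddr>) -> Vec<SocketAddr> {
    stream::iter(addrs)
        .map(connect)
        .buffer_unordered(MAX_CONCURRENT_ATTEMPTS)
        // Keep only the successful attempts.
        .filter_map(|result| async move { result.ok() })
        .collect()
        .await
}
```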
Attempting to implement requests for block data revealed a problem with
the previous connection logic. Block data is requested by sending a
`getdata` message with hashes of the requested blocks; the peer responds
with a sequence of `block` messages with the blocks themselves.
However, this wasn't possible to handle with the previous connection
logic, which could only convert a single Bitcoin message into a
Response. Instead, we factor out the message handling logic into a
Handler, which can statefully accumulate arbitrary data into a Response
and signal completion. This is still pretty ugly but it does work.
As a side effect, the HeartbeatNonceMismatch error is removed; because
the Handler now tries to process messages until it comes to a Response,
it just ignores mismatched nonces (and will eventually time out).
The previous Mempool and Transaction requests were removed but could be
re-added in a different form later. Also, the `Get` prefixes are
removed from `Request` to tidy the name.
Failure uses a distinct Fail trait rather than the standard library's
Error trait, which causes a lot of interoperability problems with tower
and other Error-using crates. Since failure was created, the standard
library's Error trait was improved, and its conveniences are now
available without the custom Fail trait using `thiserror` (for easy
error derives) and `anyhow` (for a better boxed Error).
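For illustration, the replacement pattern with `thiserror` for error derives and `anyhow` as the boxed error (hypothetical error type and function):
```rust
use thiserror::Error;

/// An error type deriving `std::error::Error`, instead of `failure::Fail`.
#[derive(Error, Debug)]
enum PeerError {
    #[error("connection timed out")]
    Timeout,
    #[error("i/o error: {0}")]
    Io(#[from] std::io::Error),
}

/// `anyhow::Error` works as the "better boxed Error" for code that just
/// propagates failures.
fn load_config(path: &str) -> anyhow::Result<String> {
    let config = std::fs::read_to_string(path)?;
    Ok(config)
}
```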
* Don't expose submodules of zebra_network::peer.
* PeerSet, PeerDiscover stubs.
Co-authored-by: Deirdre Connolly <deirdre@zfnd.org>
* Initial work on PeerSet.
This is adapted from the MIT-licensed tower-balance implementation.
* Use PeerSet in the connect stub.