Commit Graph

91 Commits

Author SHA1 Message Date
Janito Vaqueiro Ferreira Filho ec207cfa95
Ignore unexpected block responses to fix error cascade when synchronizing blocks (#3374)
* Refactor setup of `Connection` test vectors

Add a `new_test_connection` helper function to create a `Connection`
instance that's ready for testing.

* Check that no inbound requests are sent

Return the mock inbound service from `new_test_connection` and assert
that no requests were sent to it in any test.

* Replace `&mut Vec<u8>` with an `mpsc` channel

Make it easier to run the connection task in the background, i.e.,
remove any lifetime constraints from the borrowed buffer so that
`Connection` is `'static`.

It's now also easier to assert on individual messages sent from the
`Connection` instance.

* Make `MockServiceBuilder::finish` public

Allow test functions to be generic when creating a `MockService`, so
that caller functions actually determine if the type of `MockService`
assertions.

* Move `new_test_connection` to parent module

Make it more generic so that it can be used later in property tests as
well.

* Derive `Eq` and `PartialEq` for network `Response`

Allow intercepted `Response` instances to be easily compared in tests.

* Test block request cancel causes an error cascade

This is the scenario that caused the block synchronizer to reset every
few minutes, which made it considerably slower.

* Ignore unexpected block responses

It's likely that it's just a response for a previously cancelled block
request.
2022-01-20 08:14:16 +00:00
teor 4c310c0671
3. Cleanup internal network request handler, fix unused request logging (#3295)
* Simplify connection internal request handling code

* Track unused inbound request messages correctly

* Use a custom enum for inbound message handler mappings

* Fix grammar

Co-authored-by: Marek <mail@marek.onl>

Co-authored-by: Marek <mail@marek.onl>
2022-01-07 08:05:24 +10:00
teor 144c532de4
Cache unsolicited address messages, and provide them to Zebra when requested (#3294) 2022-01-05 15:55:59 -05:00
teor 469fa6b917
1. Fix some address crawler timing issues (#3293)
* Stop holding completed messages until the next inbound message

* Add more info to network message block download debug logs

* Simplify address metrics logs

* Try handling inbound messages as responses, then try as a new request

* Improve address book logging

* Fix a race between the first heartbeat and getaddr requests

* Temporarily reduce the getaddr fanout to 1

* Update metrics when exiting the Connection run loop

* Downgrade some debug logs to trace
2022-01-04 18:43:30 -05:00
teor 8db0528165
Revert "Stop ignoring some completed Responses" (#3274)
* Revert "Stop ignoring some completed Responses"

This reverts commit 0383562e1098ee2b49a4b5dd1b37646e6512782f from PR #3120,
but keeps the metrics and logging changes since that commit.

* Document why the request handling needs to happen in this order
2021-12-21 11:55:20 -03:00
teor f176bb59a2
Stop ignoring some connection errors that could make the peer set hang (#3200)
* Drop peer services if their cancel handles are dropped

* Exit the client task if the heartbeat task exits

* Allow multiple errors on a connection without panicking

* Explain why we don't need to send an error when the request is cancelled

* Document connection fields

* Make sure connections don't hang due to spurious timer or channel usage

* Actually shut down the client when the heartbeat task exits

* Add tests for unready services

* Close all senders to peer when `Client` is dropped

* Return a Client error if the error slot has an error

* Add tests for peer Client service errors

* Make Client drop and error cleanups consistent

* Use a ClientDropped error when the Client struct is dropped

* Test channel and error state in peer Client tests

* Move all Connection cleanup into a single method

* Add tests for Connection

* fix typo in comment

Co-authored-by: Conrado Gouvea <conrado@zfnd.org>

Co-authored-by: Conrado Gouvea <conrado@zfnd.org>
Co-authored-by: Alfredo Garcia <oxarbitrage@gmail.com>
2021-12-15 14:52:44 +00:00
teor 1835ec2c8d
Add diagnostics for peer set hangs (#3203)
* Use a named CancelHeartbeatTask unit struct for the channel type

* Prefer cancel handles in selects, if both are ready

* Fix message metrics to just show the command name

* Add metrics for internal requests and responses

* Add internal requests and responses to the messages dashboard

* Add a canceled metric, and peer addresses to request and response metrics

* Add a canceled messages graph

* Add connection state metrics for currently open connections

* Fix the connection state graph with new metrics

* Always send an error before dropping pending responses

* Move error detail logging into `fail_with`

* Delete an unused timer future

* Make error strings in metrics less verbose

* Downgrade some error logs to info

* Remove a redundant expect

* Avoid unnecessary allocations for connection state metrics

* Fix missed updates to mempool and block gossip metrics
2021-12-14 21:11:03 +00:00
teor c55753d5bd
Add debug-level Zebra network message tracing (#3170)
* Add debug-level Zebra network message tracing

* Delete redundant spaces

Co-authored-by: Alfredo Garcia <oxarbitrage@gmail.com>

Co-authored-by: Alfredo Garcia <oxarbitrage@gmail.com>
2021-12-09 01:09:23 +00:00
teor ab471b0db0
Revert "Stop returning NotFound errors, use the response instead" (#3124)
* Revert "Stop returning NotFound errors, use the response instead"

This reverts commit 45871f6915c0b294502bf04917c42fdcd3b1075c.

* Fix clippy warnings

* Downgrade a frequent log to debug level
2021-12-01 05:09:54 +00:00
teor a358c410f5
Stop closing connections on unexpected messages, Credit: Equilibrium (#3120)
* Ignore unsupported messages from peers

* Ignore unknown message commands from peers

* Implement Display for Request, Response, Handler, connection::State

* Stop ignoring some completed `Response`s

* Stop returning NotFound errors, use the response instead

Co-authored-by: Alfredo Garcia <oxarbitrage@gmail.com>
2021-11-30 19:26:17 +00:00
teor 7457edcb86
Stop asking users to report peer errors, fix a common peer error (#3054)
* Stop treating inv with mixed item types as a connection error

* Remove unused connection errors

* Stop asking users to create bug reports for peer errors
2021-11-15 11:32:18 -03:00
Dimitris Apostolou afb8b3d477
Fix typos (#3055)
Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
2021-11-12 19:30:22 +00:00
Janito Vaqueiro Ferreira Filho 0960e4fb0b
Update to Tokio 1.13.0 (#2994)
* Update `tower` to version `0.4.9`

Update to latest version to add support for Tokio version 1.

* Replace usage of `ServiceExt::ready_and`

It was deprecated in favor of `ServiceExt::ready`.

* Update Tokio dependency to version `1.13.0`

This will break the build because the code isn't ready for the update,
but future commits will fix the issues.

* Replace import of `tokio::stream::StreamExt`

Use `futures::stream::StreamExt` instead, because newer versions of
Tokio don't have the `stream` feature.

* Use `IntervalStream` in `zebra-network`

In newer versions of Tokio `Interval` doesn't implement `Stream`, so the
wrapper types from `tokio-stream` have to be used instead.

* Use `IntervalStream` in `inventory_registry`

In newer versions of Tokio the `Interval` type doesn't implement
`Stream`, so `tokio_stream::wrappers::IntervalStream` has to be used
instead.

* Use `BroadcastStream` in `inventory_registry`

In newer versions of Tokio `broadcast::Receiver` doesn't implement
`Stream`, so `tokio_stream::wrappers::BroadcastStream` instead. This
also requires changing the error type that is used.

* Handle `Semaphore::acquire` error in `tower-batch`

Newer versions of Tokio can return an error if the semaphore is closed.
This shouldn't happen in `tower-batch` because the semaphore is never
closed.

* Handle `Semaphore::acquire` error in `zebrad` test

On newer versions of Tokio `Semaphore::acquire` can return an error if
the semaphore is closed. This shouldn't happen in the test because the
semaphore is never closed.

* Update some `zebra-network` dependencies

Use versions compatible with Tokio version 1.

* Upgrade Hyper to version 0.14

Use a version that supports Tokio version 1.

* Update `metrics` dependency to version 0.17

And also update the `metrics-exporter-prometheus` to version 0.6.1.
These updates are to make sure Tokio 1 is supported.

* Use `f64` as the histogram data type

`u64` isn't supported as the histogram data type in newer versions of
`metrics`.

* Update the initialization of the metrics component

Make it compatible with the new version of `metrics`.

* Simplify build version counter

Remove all constants and use the new `metrics::incement_counter!` macro.

* Change metrics output line to match on

The snapshot string isn't included in the newer version of
`metrics-exporter-prometheus`.

* Update `sentry` to version 0.23.0

Use a version compatible with Tokio version 1.

* Remove usage of `TracingIntegration`

This seems to not be available from `sentry-tracing` anymore, so it
needs to be replaced.

* Add sentry layer to tracing initialization

This seems like the replacement for `TracingIntegration`.

* Remove unnecessary conversion

Suggested by a Clippy lint.

* Update Cargo lock file

Apply all of the updates to dependencies.

* Ban duplicate tokio dependencies

Also ban git sources for tokio dependencies.

* Stop allowing sentry-tracing git repository in `deny.toml`

* Allow remaining duplicates after the tokio upgrade

* Use C: drive for CI build output on Windows

GitHub Actions uses a Windows image with two disk drives, and the
default D: drive is smaller than the C: drive. Zebra currently uses a
lot of space to build, so it has to use the C: drive to avoid CI build
failures because of insufficient space.

Co-authored-by: teor <teor@riseup.net>
2021-11-02 18:46:57 +00:00
teor 3e03d48799
Limit the number of outbound peer connections (#2944)
* Limit the number of outbound connections in the crawler

* Make zebra-network channel bounds depend on config.peerset_initial_target_size

* Bias Zebra towards outbound connections

And turn connection limits into `Config` methods.

* Downgrade some connection logs to debug

* Remove verbose or outdated fields in tracing logs

* Clarify connection limits

Includes:
- `fastmod OUTBOUND_PEER_BIAS_FRACTION OUTBOUND_PEER_BIAS_DENOMINATOR zebra*`
- clarify connection limit documentation

* Clarify inventory channel capacity

* Add zebra_network::initialize tests with limited numbers of peers

* Avoid cooperative async task starvation in the peer crawler and listener

If we don't yield in these loops, they can run for a long time before
tokio forces them to yield.

* Test the crawler with small connection limits

And use the multi-threaded runtime to avoid long hangs.

* Stop using the multi-threaded executor in tests where it's not needed

* Avoid starvation for every connection

Adds yields after inbound successes and initial peer connections.

* Add a crawler peer connection success test

* Add outbound connection limit tests

* Improve outbound tests
2021-10-27 21:28:51 +00:00
Janito Vaqueiro Ferreira Filho 2a1d4281c5
Manually pin `Sleep` futures (#2914)
* Wrap `Sleep` timer in a `Pin<Box<_>>`

The `Sleep` type doesn't implement `Unpin` in newer versions of Tokio.

* Wrap `Sleep` type in a `Pin<Box<_>>`

In newer Tokio versions the `Sleep` type doesn't implement `Unpin`, so
it needs to be manually pinned.
2021-10-22 16:06:03 -03:00
teor 4cdd12e2c4
Track the number of active inbound and outbound peer connections (#2912)
* Count the number of active inbound and outbound peer connections

And reduce the count when each connection fails.

* Fix a comment typo

Co-authored-by: Alfredo Garcia <oxarbitrage@gmail.com>

Co-authored-by: Alfredo Garcia <oxarbitrage@gmail.com>
2021-10-21 21:36:42 +00:00
teor e5f5ac9ce8
Fix or disable recent nightly clippy lints (#2817)
Co-authored-by: Conrado Gouvea <conrado@zfnd.org>
2021-10-01 15:26:06 +00:00
teor 047576273c
Stop converting `Message::Inv(TxId+)` into `Request::TransactionsById` (#2660)
`Message::Inv(TxId+)` is a transaction advertisement,
so it should be converted into `Request::AdvertiseTransactionIds`.

This is a copy-paste mistake from the original zebra-network
implementation.
2021-08-24 21:40:21 +00:00
teor c608260256
Support witnessed transaction IDs in zebra-network requests and responses (#2638)
* Rename internal network requests for wide transaction IDs

fastmod TransactionsByHash TransactionsById zebra*
fastmod AdvertiseTransactions AdvertiseTransactionIds zebra*
fastmod MempoolTransactions MempoolTransactionIds zebra*
fastmod TransactionHashes TransactionIds zebra*

* Update network transaction request/response comments

* Rename a transaction hash method for wide transaction IDs

fastmod transaction_hashes transaction_ids zebra-network

* Add UnminedTxId methods and conversions for InventoryHash

* Map WtxIds to unmined transaction network messages

Also, use UnminedTxId and UnminedTx in:
* Zebra's internal request and response format, and
* external Zcash network protocol messages.

* Enable WtxId mempool inventory tracking for peers

* Further clarify transaction IDs

* Use Witnessed rather than Wide for transaction IDs

And rename narrow to legacy when it only applies to v1-v4 transactions.
Otherwise, rename it to mined ID.

* Rename a missed binding
* Remove an incorrectly named binding

Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
2021-08-18 22:55:24 +00:00
teor a6e272bf1c
Fix a typo: BIP11 -> BIP111 (#2223) 2021-05-28 14:50:43 +02:00
teor 905b90d6a1 Refactor and document correctness for std::sync::Mutex in ErrorSlot 2021-04-21 16:39:06 -04:00
teor 375c8d8700
Fix a deadlock between the crawler and dialer, and other hangs (#1950)
* Stop ignoring inbound message errors and handshake timeouts

To avoid hangs, Zebra needs to maintain the following invariants in the
handshake and heartbeat code:
- each handshake should run in a separate spawned task
  (not yet implemented)
- every message, error, timeout, and shutdown must update the peer address state
- every await that depends on the network must have a timeout

Once the Connection is created, it should handle timeouts.
But we need to handle timeouts during handshake setup.

* Avoid hangs by adding a timeout to the candidate set update

Also increase the fanout from 1 to 2, to increase address diversity.

But only return permanent errors from `CandidateSet::update`, because
the crawler task exits if `update` returns an error.

Also log Peers response errors in the CandidateSet.

* Use the select macro in the crawler to reduce hangs

The `select` function is biased towards its first argument, risking
starvation.

As a side-benefit, this change also makes the code a lot easier to read
and maintain.

* Split CrawlerAction::Demand into separate actions

This refactor makes the code a bit easier to read, at the cost of
sometimes blocking the crawler on `candidates.next()`.

That's ok, because `next` only has a short (< 100 ms) delay. And we're
just about to spawn a separate task for each handshake.

* Spawn a separate task for each handshake

This change avoids deadlocks by letting each handshake make progress
independently.

* Move the dial task into a separate function

This refactor improves readability.

* Fix buggy future::select function usage

And document the correctness of the new code.
2021-04-07 10:25:10 -03:00
teor 9c3f236075 Stop sending blocks and transactions on error 2021-02-25 08:44:57 -08:00
teor 78f162733d Revert "leverage return value for propagating errors"
This reverts commit e6cb20e13f.
2021-02-24 13:07:31 -08:00
teor 72e2e83828 Revert "introduce Transition enum"
This reverts commit 6906f87ead.
2021-02-24 13:07:31 -08:00
teor a5e89f4f2b Revert "accidental drop on mustusesender"
This reverts commit 5ec8d09e0d.
2021-02-24 13:07:31 -08:00
teor d60226a3cf Revert "rustfmt"
This reverts commit 9d9734ea81.
2021-02-24 13:07:31 -08:00
teor 359015b2be Revert "Only reject pending client requests when the peer has errored"
This reverts commit e06705ed81.
2021-02-24 13:07:31 -08:00
teor 663ed6c842 Revert "Remove remaining references to fail_with"
This reverts commit 5e4bf804aa.
2021-02-24 13:07:31 -08:00
teor 3c225550ee Revert "rename transitions from Exit to Close"
This reverts commit cfc4717b98.
2021-02-24 13:07:31 -08:00
teor 86dc66dfa9 Revert "deduplicate match arms in handle_client_request"
This reverts commit 2adee7b31a.
2021-02-24 13:07:31 -08:00
teor 292a4391e2 Revert "update comments throughout connection.rs"
This reverts commit 651d352ce1.
2021-02-24 13:07:31 -08:00
teor fc44a97925 Revert "remove unnecessary Option around request timeout"
This reverts commit c3724031df.
2021-02-24 13:07:31 -08:00
teor 3b2077fcfd Revert "Apply suggestions from code review"
This reverts commit 736092abb8.
2021-02-24 13:07:31 -08:00
Jane Lusby 736092abb8 Apply suggestions from code review
Co-authored-by: teor <teor@riseup.net>
2021-02-19 14:11:35 -08:00
Jane Lusby c3724031df remove unnecessary Option around request timeout 2021-02-19 14:11:35 -08:00
Jane Lusby 651d352ce1 update comments throughout connection.rs 2021-02-19 14:11:35 -08:00
Jane Lusby 2adee7b31a deduplicate match arms in handle_client_request 2021-02-19 14:11:35 -08:00
Jane Lusby cfc4717b98 rename transitions from Exit to Close 2021-02-19 14:11:35 -08:00
teor 5e4bf804aa Remove remaining references to fail_with 2021-02-19 14:11:35 -08:00
teor e06705ed81 Only reject pending client requests when the peer has errored
- Add an `ExitClient` transition, used when the internal client channel
  is closed or dropped, and there are no more pending requests
- Ignore pending requests after an `ExitClient` transition
- Reject pending requests when the peer has caused an error
  (the `Exit` and `ExitRequest` transitions)
- Remove `PeerError::ConnectionDropped`, because it is now handled by
  `ExitClient`. (Which is an internal error, not a peer error.)
2021-02-19 14:11:35 -08:00
teor 9d9734ea81 rustfmt 2021-02-19 14:11:35 -08:00
Jane Lusby 5ec8d09e0d accidental drop on mustusesender 2021-02-19 14:11:35 -08:00
Jane Lusby 6906f87ead introduce Transition enum 2021-02-19 14:11:35 -08:00
Jane Lusby e6cb20e13f leverage return value for propagating errors 2021-02-19 14:11:35 -08:00
teor 983e94f9e4 Add a TODO for inbound error handling cleanup 2021-02-03 08:32:10 +10:00
teor b551d81f8d Explain why we stay connected on Inbound errors
We might be syncing using this peer, so it's ok to just ignore
any internal errors in their Inbound requests, and drop the
request.
2021-01-27 12:08:49 -08:00
teor 05fff8e6f7 Revert "Stop panicking when fail_with is called twice on a connection"
But keep the extra error information.
2021-01-18 00:23:36 -05:00
teor 4fe81da953 Improve logging for connection state errors 2021-01-18 00:23:36 -05:00
teor a6c1cd3c35 Stop panicking when fail_with is called twice on a connection
We can't rule out the connection state changing between the state checks
and any eventual failures, particularly in the presence of async code.

So we turn this panic into a warning.
2021-01-18 00:23:36 -05:00