Commit Graph

473 Commits

Author SHA1 Message Date
teor 2f0f379a9e
Standardise clippy lints and require docs (#2238)
* Standardise lints across Zebra crates, and add missing docs

The only remaining module with missing docs is `zebra_test::command`

* Todo -> TODO

* Clarify what a transcript ErrorChecker does

Also change `Error` -> `BoxError`

* TransError -> ExpectedTranscriptError

* Output Descriptions -> Output descriptions
2021-06-04 08:48:40 +10:00
teor 52dcaa2544 Stop ignoring lightweight git tags in panic metadata
Unfortunately, Zebra's first alpha release is an annotated tag, but
GitHub defaults to lightweight tags. (At least for pre-releases.)
2021-05-20 09:00:56 +10:00
teor bcc59d11c3 Refactor metadata so git vars must be optional
We don't test non-git builds, but we can use the type system to make sure
they are always optional.
2021-05-20 09:00:56 +10:00
teor b6c5ef8041 Add VERGEN_CARGO_PROFILE to the panic env vars
Some panics should only happen on debug profiles.
2021-05-20 09:00:56 +10:00
teor 62f053de9e Enable cargo env vars when there is no .git
But still disable git env vars.

This change requires vergen 5.1.4 or later.
2021-05-20 09:00:56 +10:00
teor 92828bbb29 Reliability: send local listener address to peers
When peers ask for peer addresses, add our local listener address to the
set of addresses, sanitize, then truncate. Sanitize shuffles addresses,
so if there are lots of addresses in the address book, our address will
only be sent to some peers.
2021-05-18 14:02:19 +10:00
teor 74e155ff9f
Spelling: gossipped -> gossiped (#2119) 2021-05-07 13:01:11 +02:00
teor 7e2c3a2fc7 Clarify a duplicate log message 2021-04-21 23:59:29 -04:00
Kirill Fomichev 5b2f1cdfd5
Add journald support through tracing-journald (#2034)
* Add journald support through tracing-journald

* change journald to use_journald

* more fixes
2021-04-22 09:31:06 +10:00
teor 96b3c94dbc
Add the new commit count and git hash to the version (#2038)
* Use the git version + new commit count + hash for the app version

This helps diagnose bugs in versions of Zebra built from git branches,
rather than git version tags.

* Fill in assert

* Also log semver string

* Fix syntax

* Handle vergen using the cargo package version or raw git tag

* s/Semver/SemVer/

Co-authored-by: Deirdre Connolly <durumcrustulum@gmail.com>
2021-04-21 22:14:36 +00:00
teor 0203d1475a Refactor and document correctness for std::sync::Mutex<AddressBook> 2021-04-21 17:14:47 -04:00
teor 79c0c4ec57 Stop assuming there will always be a git commit
Enable builds where:
* there is no google cloud git commit env var, and
* there is no `.git` directory.

By making all `vergen` env vars optional, and skipping any env vars that
don't exist.
2021-04-20 13:48:31 -04:00
Kirill Fomichev 43e792b9a4
Update to vergen 5, add branch, commit time, and build target to the panic metadata, automatically update app version from crate version (#2029)
* build(deps): bump vergen from 3.2.0 to 5.1.1

* fix hardcoded version for Tracing struct

* add additional metadata

* remove extra allocations for metadata

* Remove zebrad code version from release checklist

The zebrad code automatically uses the crate version now.

* Sort panic metadata into rough categories

Co-authored-by: teor <teor@riseup.net>
2021-04-20 06:48:14 +10:00
teor 1e4e5924ca clippy: factor common code out of an if-else block 2021-04-14 23:16:45 -04:00
teor a417c7c8c7 Use meaningful names for select! variables 2021-04-13 23:56:16 -04:00
Alfredo Garcia 5ec05e91e1 update version strings for v1.0.0-alpha.6 2021-04-08 18:48:34 -04:00
teor 306fa88214 Document the correctness of Poll::Pending wakeups 2021-03-27 08:55:49 -04:00
teor 829a6f11c5 Document the behaviour of the `select!` macro 2021-03-27 08:55:49 -04:00
Deirdre Connolly ca1d2de87d
Bump versions for v1.0.0-alpha.5 (#1932)
Zebra's latest alpha checkpoints on Canopy activation, continues our work on NU5, and fixes a security issue.

Some notable changes include:

## Added
- Log address book metrics when PeerSet or CandidateSet don't have many peers (#1906)
- Document test coverage workflow (#1919)
- Add a final job to CI, so we can easily require all the CI jobs to pass (#1927)

## Changed
- Zebra has moved its mandatory checkpoint from Sapling to Canopy (#1898, #1926)
  - This is a breaking change for users that depend on the exact height of the mandatory checkpoint.

## Fixed
- tower-batch: wake waiting workers on close to avoid hangs (#1908)
- Assert that pre-Canopy blocks use checkpointing (#1909)
- Fix CI disk space usage by disabling incremental compilation in coverage builds (#1923)

## Security
- Stop relying on unchecked length fields when preallocating vectors (#1925)
2021-03-22 22:05:01 -04:00
Alfredo Garcia d49eaab68e
Bump versions for zebrad 1.0.0-alpha.4 (#1913)
* Bump versions for zebrad 1.0.0-alpha.4

* add Cargo.lock
2021-03-16 21:12:37 -03:00
Jack Grigg bae9a7ecd5 Expose binary data in metrics
This enables slicing and aggregating metrics based on zebrad version:
https://www.robustperception.io/exposing-the-software-version-to-prometheus
2021-03-17 09:38:07 +10:00
teor d494af1e90 Document how the syncer resists memory DoS 2021-03-11 06:24:46 -05:00
teor c6358b157c Reduce inbound concurrency to limit memory usage
Inbound malicious blocks can use a large amount of RAM when
deserialized. Limit inbound concurrency, so that the total amount
of RAM remains small.
2021-03-11 06:24:46 -05:00
teor 7558f74c78 Bump versions for zebrad 1.0.0-alpha.3 2021-02-23 10:39:13 -05:00
teor e61b5e50a2
Diagnostics for CI port conflict failures (#1766)
Log a "Trying..." message before each listener opens, to see if the
delay is inside Zebra, or in the test harness or OS.

Also report the configured and actual ports where possible, for better
diagnostics.
2021-02-18 12:15:09 -03:00
teor 972103d797 Fix tracing macro syntax 2021-02-17 11:09:22 -05:00
teor 253d1c02b3 Make sync logging a bit less verbose
And tweak some log content
2021-02-17 11:09:22 -05:00
teor cc7d5bd2ad
Update comments for the inbound service (#1740) 2021-02-16 06:14:40 +10:00
teor 372a432179
Update the call_all comment in Inbound (#1737) 2021-02-16 06:14:16 +10:00
teor 0b76352468
Document a state_contains bug (#1715)
* Document a state_contains bug in the syncer and Inbound
2021-02-10 09:05:14 +10:00
Deirdre Connolly 0c5daa8410 Bump versions for zebrad 1.0.0-alpha.2
Including tower-batch bump to 0.2.0, tower-fallback to 0.2.0, zebra-script to 1.0.0-alpha.3
2021-02-09 16:14:29 -05:00
teor dce11358d7
Log when the syncer awaits peer readiness (#1714) 2021-02-10 07:09:27 +10:00
Alfredo Garcia d7c40af2a8
Fix shutdown panics (#1637)
* add a shutdown flag in zebra_chain::shutdown
* fix network panic on shutdown
* fix checkpoint panic on shutdown
2021-02-03 19:03:28 +10:00
teor 6679a124e3 Require Inbound setup handlers to provide a result
Rather than having them default to `Ok(())`, which is incorrect
for some error handlers.
2021-02-03 08:32:10 +10:00
teor 09c8c89462 Make sure FailedInit never escapes Inbound::poll_ready 2021-02-03 08:32:10 +10:00
teor 134a5e78bd Consistently use `network_setup` for the Inbound Setup 2021-02-03 08:32:10 +10:00
teor 1c8362fe01 Remove unused imports 2021-02-03 08:32:10 +10:00
Jane Lusby 4cf331562c combine network setup into an exhaustive match 2021-02-03 08:32:10 +10:00
Jane Lusby 4d6ef89248 avoid using async blocks to avoid lifetime bug with generators 2021-02-03 08:32:10 +10:00
Jane Lusby 685a592399 Add clonable wrapper around TryRecvError 2021-02-03 08:32:10 +10:00
teor 6ffeb670ed Log the failed response in an unreachable panic 2021-02-03 08:32:10 +10:00
teor eac4fd181a Add a Setup enum to manage Inbound network setup internal state
This change encodes a bunch of invariants in the type system,
and adds explicit failure states for:
* a closed oneshot,
* bugs in the initialization code.
2021-02-03 08:32:10 +10:00
teor 32b032204a Consistently return Response::Nil during setup
And log an info-level message as a diagnostic, in case setup takes a
long time.
2021-02-03 08:32:10 +10:00
teor 94eb91305b Stop using ServiceExt::call_all due to buffer bugs
ServiceExt::call_all leaks Tower::Buffer reservations, so we can't use
it in Zebra.

Instead, use a loop in the returned future.

See #1593 for details.
2021-02-03 08:32:10 +10:00
teor 64bc45cd2e Fix state readiness hangs for Inbound
Use `ServiceExt::oneshot` to perform state requests.

Explain that `ServiceExt::call_all` calls `poll_ready` internally.
Document a state service invariant imposed by `ServiceExt::call_all`.
2021-02-03 08:32:10 +10:00
teor 4d1a2fd02e Make the Inbound invariant clearer 2021-02-03 08:32:10 +10:00
teor 2a25b9ee72 Remove services that are never `call`ed from Inbound
Uses the `ServiceExt::oneshot` design pattern from #1593.
2021-02-03 08:32:10 +10:00
Alfredo Garcia 4b34482264
Add hints to port conflict and lock file panics (#1535)
* add hint for port error
* add issue filter for port panic
* add lock file hint
* add metrics endpoint port conflict hint
* add hint for tracing endpoint port conflict
* add acceptance test for resource conflics
* Split out common conflict test code into a function
* Add state, metrics, and tracing conflict tests

* Add a full set of stderr acceptance test functions

This change makes the stdout and stderr acceptance test interfaces
identical.

* move Zcash listener opening
* add todo about hint for disk full
* add constant for lock file
* match path in state cache
* don't match windows cache path

* Use Display for state path logs

Avoids weird escaping on Windows when using Debug

* Add Windows conflict error messages

* Turn PORT_IN_USE_ERROR into a regex

And add another alternative Windows-specific port error

Co-authored-by: teor <teor@riseup.net>
Co-authored-by: Jane Lusby <jane@zfnd.org>
2021-01-29 22:36:33 +10:00
teor 24f1b9bad1
Document the Inbound service in the start module (#1653) 2021-01-29 22:19:06 +10:00
teor 21b0360114 Limit concurrent inbound gossipped block requests
Uses the "load shed directly" design pattern from #1618.
2021-01-29 11:02:26 +10:00
teor 3d9888f736 Rewrite a sync comment 2021-01-29 11:02:26 +10:00
Deirdre Connolly 1b09538277
Bump versions for zebrad 1.0.0-alpha.1 (#1646)
* Bump versions where appropriate

Tested with cargo install --locked --path etc

* Remove fixed panics from 'Known Issues'

* Change to alpha release series in the README

Co-authored-by: teor <teor@riseup.net>
2021-01-27 20:31:39 -05:00
teor 391c53aa60 Move BoxError to zebrad's lib.rs
For consistency with other crates.
2021-01-27 12:14:27 -08:00
teor 9cdf41f5f4
Panic if the lookahead limit is misconfigured (#1589) 2021-01-14 14:06:30 +10:00
teor 92d95d4be5 Refactor inbound members into a consistent order
And add download comments
2021-01-13 20:46:25 -05:00
teor fb76eb2e6b Add download and verify timeouts to the inbound service 2021-01-13 20:46:25 -05:00
teor 973aec8ccc Refactor sync members into a consistent order
And add comments about correctness and usage.
2021-01-13 20:46:25 -05:00
teor c2893dce51 Warn when the user's configured lookahead limit is ignored 2021-01-13 20:46:25 -05:00
teor 3699bbdae6 Add some additional sync correctness constraints
And adjust the sync restart delay as a consequence.
2021-01-13 20:46:25 -05:00
teor cef0a492d8 Add a timeout to sync service block verification
This timeout stops the sync service hanging when it is missing required
blocks, but the lookahead queue is full of dependent verify tasks, so the
missing blocks never get downloaded.
2021-01-13 20:46:25 -05:00
teor c75cbdea79
Log configured network in every log message (#1568)
* Add the configured network to error reports
* Log the configured network at error level
* Create the global span immediately after activating tracing
And leak the span guard, so the span is always active.

* Include panic metadata in the report and URL
* Use `Main` and `Test` in the global span
`net=Mainnet` is a bit redundant
2021-01-12 07:46:56 +10:00
teor b1f14f47c6
Rewrite GetData handling to match the zcashd implementation (#1518)
* Rewrite GetData handling to match the zcashd implementation

`zcashd` silently ignores missing blocks, but sends found transactions
followed by a `NotFound` message:
e7b425298f/src/main.cpp (L5497)

This is significantly different to the behaviour expected by the old
Zebra connection state machine, which expected `NotFound` for blocks.

Also change Zebra's GetData responses to peer request so they ignore
missing blocks.

* Stop hanging on incomplete transaction or block responses

Instead, if the peer sends an unexpected block, unexpected transaction,
or NotFound message:
1. end the request, and return a partial response containing any items
   that were successfully received
2. if none of the expected blocks or transactions were received, return
   an error, and close the connection
2021-01-04 13:25:35 +10:00
teor 69fcf64d6c
Disable issue URLs for "duplicate hash" errors (#1517)
In our README, we tell users to ignore these errors, so we should also
disable the issue URL.

Also include the hash in the error. (We don't want the span active for
all messages, we just want the hash in the error.)
2020-12-16 08:14:42 +10:00
Alfredo Garcia 41833340c1
downgrade remaining version strings to 1.0.0-alpha.0 (#1488) 2020-12-15 11:21:00 +10:00
Deirdre Connolly 2d1698a120 Comment out Sentry stacktraces for now
While panic = abort, Sentry collects the same one-line stack trace for all panics,
making it incorrectly dedupe different errors into one.
2020-12-12 13:26:52 -05:00
Deirdre Connolly cff28f7ac8 Use the commit sha as the sentry release 2020-12-09 13:06:18 -05:00
Jane Lusby 400213e2b3 integrate sentry with our existing panic reporting logic 2020-12-09 13:06:18 -05:00
Deirdre Connolly f1ec1d626d Tidy for now 2020-12-09 13:06:18 -05:00
Deirdre Connolly 44e1051dee Debug 2020-12-09 13:06:18 -05:00
Deirdre Connolly 8b268e3f71 Don't keep guard around 2020-12-09 13:06:18 -05:00
Deirdre Connolly 25f6fd25b3 Test catching panic 2020-12-09 13:06:18 -05:00
Deirdre Connolly 6a17549945 Try sentry-tracing integration 2020-12-09 13:06:18 -05:00
Deirdre Connolly c03a3a2606 Pull DSN from runtime env, enable Sentry debug mode with RUST_LOG=debug 2020-12-09 13:06:18 -05:00
Deirdre Connolly 27e42f4ed5 Set up Sentry error collection via a feature flag 2020-12-09 13:06:18 -05:00
Deirdre Connolly 47d78d4cf4 Try sentry::init() 2020-12-09 13:06:18 -05:00
teor 16ffb1dbbf
Disable issue URLs on all timeouts (#1470)
This change helps prevent spurious bug reports.
2020-12-08 07:47:01 +10:00
Jane Lusby ef7e91c3c7
disable color-eyre colors if not connected to a tty (#1443)
* disable color-eyre colors if not connected to a tty
* check if color is disabled
2020-12-04 11:05:25 +10:00
Jane Lusby 90f944709b
fix git commit logic to work on gcloud (#1442) 2020-12-03 15:18:55 +10:00
teor 0e42d8b6c1 Always enable color_eyre, even when color is disabled
We want to automatically disable colors upstream in color_eyre,
and add a config that allows users to always turn off color.
2020-12-02 10:25:44 -08:00
teor bed34168c1 Automatically disable abscissa colors and color_eyre when writing to a file 2020-12-02 10:25:44 -08:00
teor 97d1a81b7c Automatically disable colors when tracing to a file 2020-12-02 10:25:44 -08:00
Henry de Valence f0db75e712 cargo fmt 2020-12-01 19:16:41 -08:00
Jane Lusby a91d0f0bb6
Include short sha in log messages and error urls (#1410)
As we approach our alpha release we've decided we want to plan ahead for the user bug reports we will eventually receive. One of the bigger issues we foresee is determining exactly what version of the software users are running, and particularly how easy it may or may not be for users to accidentally discard this information when reporting bugs.

To defend against this, we've decided to include the exact git sha for any given build in the compiled artifact. This information will then be re-exported as a span early in the application startup process, so that all logs and error messages should include the sha as their very first span. We've also added this sha as issue metadata for `color-eyre`'s github issue url auto generation feature, which should make sure that the sha is easily available in bug reports we receive, even in the absence of logs.

Co-authored-by: teor <teor@riseup.net>
2020-12-01 12:13:20 -08:00
Jane Lusby fceef849cf remove unused mutability to defuse deadlock 2020-12-01 11:03:13 -05:00
Henry de Valence 1df9284444 zebrad: add a use_color option to the tracing config.
This is useful for creating searchable logs without having to filter color codes after the fact.
2020-11-30 15:25:50 -08:00
Henry de Valence e8c16b172f zebrad: pass TracingSection to Tracing component 2020-11-30 15:25:50 -08:00
Alfredo Garcia 4544463059
Inbound `FindBlocks` and `FindHeaders` (#1347)
* implement inbound `FindBlocks`
* Handle inbound peer FindHeaders requests
* handle request before having any chain tip
* Split `find_chain_hashes` into smaller functions

Add a `max_len` argument to support `FindHeaders` requests.

Rewrite the hash collection code to use heights, so we can handle the
`stop` hash and "no intersection" cases correctly.

* Split state height functions into "any chain" and "best chain"
* Rename the best chain block method to `best_block`
* Move fmt utilities to zebra_chain::fmt
* Summarise Debug for some Message variants

Co-authored-by: teor <teor@riseup.net>
Co-authored-by: Jane Lusby <jlusby42@gmail.com>
2020-12-01 07:30:37 +10:00
Henry de Valence fa02b266ca clippy 2020-11-25 10:55:44 -08:00
Henry de Valence de8415dcb1 tidy spans 2020-11-25 10:55:44 -08:00
Henry de Valence 05837797b1 tidy imports 2020-11-25 10:55:44 -08:00
Henry de Valence 77bf327b07 fix errors (2) 2020-11-25 10:55:44 -08:00
Henry de Valence 527f4d39ed fix errors 2020-11-25 10:55:44 -08:00
Henry de Valence e645e3bf0c remove async 2020-11-25 10:55:44 -08:00
Henry de Valence 6569977549 test compile change 2020-11-25 10:55:44 -08:00
Alfredo Garcia 486e55104a create Downloads for Inbound 2020-11-25 10:55:44 -08:00
Henry de Valence 2a4a89c002 state,zebrad: tidy span levels for good INFO output
This provides useful and not too noisy output at INFO level.  We do an
info-level message on every block commit instead of trying to do one
message every N blocks, because this is useful both for initial block
sync as well as continuous state updates on new blocks.
2020-11-23 14:16:39 +10:00
Henry de Valence f0810b028d state,consensus,sync: shorten span lengths
These changes help reduce the size of the resulting spans, making the
output more compact.  Together they save about 30-40 characters.
2020-11-23 14:16:39 +10:00
teor d4da9609ee Update the max_concurrent_block_requests docs
In #1298, we decreased `max_concurrent_block_requests`,
but forgot to update the docs.
2020-11-20 10:08:57 -08:00
Henry de Valence ba3c19142c deps: update hyper, metrics to tokio 0.3
The metrics code becomes much simpler because the current version of the
metrics crate builds its own single-threaded runtime on a dedicated worker
thread, so no dependency on the main Zebra Tokio runtime is required.
2020-11-20 10:08:16 -08:00
Henry de Valence add94c1c45 deps: move to tokio 0.3, tower 0.4
This change is mostly mechanical, with the exception of the changes to the
`tower-batch` middleware.  This middleware was adapted from `tower::buffer`,
and the `tower::buffer` code was changed to implement its own bounded queue,
because Tokio 0.3 removed the `mpsc::Sender::poll_send` method.  See

ddc64e8d4d

for more context on the Tower changes.  To match Tower as closely as possible
in order to be able to upstream `tower-batch`, those changes are copied from
`tower::Buffer` to `tower-batch`.
2020-11-20 10:08:16 -08:00
Henry de Valence 4953f21670 fixup! zebrad: hack to skip alreadyverified errors 2020-11-18 03:09:06 -05:00
Henry de Valence d2fc01755b zebrad: more reasonable concurrent block limit
This helps prevent overloading the network with too many concurrent
block requests.  On a fast network, we're likely to still have enough
room to saturate our bandwidth.  In the worst case, with 2MB blocks,
downloading 50 blocks concurrently is 100MB of queued downloads.  If we
need to download this in 20 seconds to avoid peer connection timeouts,
the implied worst-case minimum speed is 5MB/s.  In practice, this
minimum speed will likely be much lower.
2020-11-17 14:56:27 -08:00
Henry de Valence aa7538ab15 zebrad: hack to skip alreadyverified errors 2020-11-17 14:56:27 -08:00
Henry de Valence e55392b61e zebrad: explicitly select the threaded scheduler. 2020-11-17 14:56:27 -08:00
Henry de Valence 6de824bd99 zebrad: remove block verification timeout
Because we set the lookahead limit to be at least twice the size of a checkpoint, we don't have a risk of timeouts.
2020-11-17 14:56:27 -08:00
Henry de Valence e9c847bbd7 zebrad: avoid a borrow in the ChainSync future 2020-11-17 14:56:27 -08:00
Henry de Valence b632a24436 zebrad: add diagnostics on cancelled download tasks 2020-11-17 14:56:27 -08:00
Henry de Valence ec411574ee zebrad: improve sync diagnostics 2020-11-17 14:56:27 -08:00
Henry de Valence e0c92167bc Revert "Hedge every syncer block download request"
This reverts commit 656bd24ba7.

The Hedge middleware keeps a pair of histograms, writing into one in the
current time interval and reading from the previous time interval's
data.  This means that the reverted change resulted in doubling all
block downloads until after at least the second measurement interval
(which means that the time measurements are also incorrect, as they're
operating under double the network load...)
2020-11-12 16:45:47 -05:00
Alfredo Garcia 128643d81e
Call `zebra_test::init` where needed. (#1227)
* Add missing `zebra_test::init()` to zebra-chain
* Add missing `zebra_test::init()` to zebra-consensus
* Add missing `zebra_test::init()` to zebra-network
* Add missing `zebra_test::init()` to zebra-state
* Add missing `zebra_test::init()` to zebra-test
* Add missing `zebra_test::init()` to zebrad
2020-11-10 10:29:25 +10:00
Henry de Valence 0ad648fb6a zebrad: make lookahead limit configurable.
Sets the default value to the previous lookahead limit.  My testing on
mainnet suggested that the newly lower value (changed when the
checkpoint frequency was decreased) is low enough to cause stalls, even
when using hedged requests.
2020-11-01 10:47:46 -08:00
teor 92c623eddf Log each genesis download
This change helps us diagnose sync hangs.
2020-10-28 11:31:04 -04:00
teor 656bd24ba7 Hedge every syncer block download request
Remove the minimum data points from the syncer hedge configuragtion.
When there are no data points, hedge sends the second request
immediately.

Where there are less than 1/(1-latency_percentile) data points (20),
hedge delays the second request by the highest recent download time.

This change should improve genesis and post-restart sync latency.
2020-10-28 11:31:04 -04:00
Henry de Valence 4c960c4e6d zebrad: treat duplicate downloads as an error
We should error if we notice that we're attempting to download the same
blocks multiple times, because that indicates that peers reported bad
information to us, or we got confused trying to interpret their
responses.
2020-10-26 12:05:35 -07:00
Henry de Valence 4127d086ea zebrad: clarify hedge layering motivation
Co-authored-by: teor <teor@riseup.net>
2020-10-26 12:05:35 -07:00
Henry de Valence 253bab042e sync: add a concurrency limit for block downloads 2020-10-26 12:05:35 -07:00
Henry de Valence 0a405c737d zebrad: check state in obtaintips, not extendtips.
The original sync algorithm split the sync process into two phases, one
that obtained prospective chain tips, and another that attempted to
extend those chain tips as far as possible until encountering an error
(at which point the prospective state is discarded and the process
restarts).

Because a previous implementation of this algorithm didn't properly
enforce linkage between segments of the chain while extending tips,
sometimes it would get confused and fail to discard responses that did
not extend a tip.  To mitigate this, a check against the state was
added.  However, this check can cause stalls while checkpointing,
because when a checkpoint is reached we may suddenly need to commit
thousands of blocks to the state.  Because the sync algorithm now has a
a `CheckedTip` structure that ensures that a new segment of hashes
actually extends an existing one, we don't need to check against the
state while extending a tip, because we don't get confused while
interpreting responses.

This change results in significantly smoother progress on mainnet.
2020-10-26 12:05:35 -07:00
Henry de Valence 65e0c22fbe state: don't pre-buffer the service
There's no reason to return a pre-Buffer'd service (there's no need for
internal access to the state service, as in zebra-network), but wrapping
it internally removes control of the buffer size from the caller.
2020-10-26 12:05:35 -07:00
Henry de Valence ce2ac3336f zebrad: add debug message before state check
This reveals that there may be contention in access to the state, as
this takes a long time.
2020-10-26 12:05:35 -07:00
Henry de Valence 91469faf3c zebrad: eliminate duplicate span in sync 2020-10-26 12:05:35 -07:00
Henry de Valence b5a43f4516 zebrad: remove implementation details from docs
The timeout behavior in zebra-network is an implementation detail, not a
feature of the public API.  So it shouldn't be mentioned in the doc
comments -- if we want timeout behavior, we have to layer it ourselves.
2020-10-26 12:05:35 -07:00
Henry de Valence 1d7309afe2 zebrad: correctly handle duplicates in DownloadSet
Using the cancel_handles, we can deduplicate requests.  This is
important to do, because otherwise when we insert the second cancel
handle, we'd drop the first one, cancelling an existing task for no
reason.
2020-10-26 12:05:35 -07:00
Henry de Valence 56fe4f4379 zebrad: unify sync restart logic
This lets us keep the main loop simple and just write `continue 'sync;`
to keep going.
2020-10-26 12:05:35 -07:00
Henry de Valence 12d25159c6 zebrad: use hedged requests in sync
The hedge middleware implements hedged requests, as described in _The
Tail At Scale_. The idea is that we auto-tune our retry logic according
to the actual network conditions, pre-emptively retrying requests that
exceed some latency percentile. This would hopefully solve the problem
where our timeouts are too long on mainnet and too slow on testnet.
2020-10-26 12:05:35 -07:00
Henry de Valence 5f229d1475 zebrad: use Downloads in sync
Try to use the better cancellation logic to revert to previous sync
algorithm.  As designed, the sync algorithm is supposed to proceed by
downloading state prospectively and handle errors by flushing the
pipeline and starting over.  This hasn't worked well, because we didn't
previously cancel tasks properly.  Now that we can, try to use something
in the spirit of the original sync algorithm.
2020-10-26 12:05:35 -07:00
Henry de Valence b90581a3d7 zebrad: create a Downloads Stream for syncing.
This makes two changes relative to the existing download code:

1.  It uses a oneshot to attempt to cancel the download task after it
    has started;

2.  It encapsulates the download creation and cancellation logic into a
    Downloads struct.
2020-10-26 12:05:35 -07:00
Henry de Valence b636660d6a zebrad: rename sync::Error alias to BoxError. 2020-10-26 12:05:35 -07:00
Henry de Valence cab96aa1a8
zebrad: clarify config help text (#1194) 2020-10-22 15:03:01 +10:00
Alfredo Garcia 21ad6ffc47
Reverse displayed endianness of transaction and block hashes (#1171)
* Reverse displayed endianness of transaction and block hashes
* fix zebra-checkpoints utility for new hash order
* Stop using "zebrad revhex" in zebrad-hash-lookup
* Rebuild checkpoint lists in new hash order
This change also adds additional checkpoints to the end of each list.

* Replace TransactionHash with transaction::Hash
This change should have been made in #905, but we missed Debug impls
and some docs.

Co-authored-by: Ramana Venkata <vramana@users.noreply.github.com>
Co-authored-by: teor <teor@riseup.net>
2020-10-22 07:54:02 +10:00
Henry de Valence eb43893de0 consensus: minimize API, clean docs
This reduces the API surface to the minimum required for functionality,
and cleans up module documentation.  The stub mempool module is deleted
entirely, since it will need to be redone later anyways.
2020-10-20 11:16:22 -04:00
Alfredo Garcia c0a14ecc8c
move genesis parameters to zebra-chain (#1151) 2020-10-12 14:08:23 -07:00
Jane Lusby 855f9b5bcb
Implement MVP of NonFinalizedState and integrate it with the state service (#1101)
* implement most of the chain functions
* implement fork
* fix outpoint handling in Chain struct
* update expect for work
* split utxo into two sets
* update the Chain definition
* remove allow attribute in zebra-state/lib.rs
* merge ChainSet type into MemoryState
* Add error messages to asserts
* export proptest impls for use in downstream crates
* add testjob for disabled feature in zebra-chain
* try to fix github actions syntax
* add module doc comment
* update RFC for utxos
* add missing header
* working proptest for Chain
* propagate back results over channel
* Start updating RFC to match changes
* implement queued block pruning
* and now it syncs wooo!
* remove empty modules
* setup config for proptests
* re-enable missing_docs lint
* update RFC to match changes in impl
* add documentation
* use more explicit variable names
2020-10-08 13:07:32 +10:00
Jane Lusby 40e22808c7
disable reporting url for timeout errors (#1087)
* disable reporting url for timeout errors

* revert newline removal

* switch to released color-eyre version
2020-09-21 16:15:09 -07:00
Henry de Valence fe61090a64 zebrad: make Inbound Poll::Ready before setup.
The Inbound service only needs the network setup for some requests, but
it can service other requests without it.  Making it return
Poll::Pending until the network setup finishes means that initial
network connections may view the Inbound service as overloaded and
attempt to load-shed.
2020-09-21 09:26:39 -07:00
Henry de Valence 9c021025a7 network: fill in remaining request/response pairs 2020-09-20 10:21:18 -07:00
Henry de Valence 4b35fea492 zebrad: document Inbound, ChainSync responsibilities 2020-09-18 18:34:25 -07:00
Henry de Valence 65877cb4b1 zebrad: make Inbound propagate backpressure 2020-09-18 18:34:25 -07:00
Henry de Valence 55f46967b2 zebrad: serve blocks from Inbound service
The original version of this commit ran into

https://github.com/rust-lang/rust/issues/64552

again.  Thanks to @yaahc for suggesting a workaround (using futures combinators
to avoid writing an async block).
2020-09-18 18:34:25 -07:00
Henry de Valence 170f588ffb network: document load-shedding behavior
This was part of the original design and is described in the Connection
internals, but we never documented it externally.
2020-09-18 18:34:25 -07:00
Henry de Valence 1d0ebf89c6 zebrad: move seed command into inbound component
Remove the seed command entirely, and make the behavior it provided
(responding to `Request::Peers`) part of the ordinary functioning of the
start command.

The new `Inbound` service should be expanded to handle all request
types.
2020-09-18 18:34:25 -07:00
Henry de Valence 1d3892e1dc network: rename alias to BoxError
This is shorter and consistent with Tower (which is why we use it in the
first place).
2020-09-18 18:34:25 -07:00
Jane Lusby ca648ff27c
Enable issue-url feature in color-eyre (#1072)
* Enable issue-url feature in color-eyre

* get version automatically

* and the url!
2020-09-17 15:09:18 -07:00
Henry de Valence 3133214e4f zebrad: use new state API 2020-09-11 13:37:49 -07:00
teor b1e1291f45 Log inbound peer requests at debug
Logging at info was a bit too verbose.

Also add a short log message.
2020-09-10 09:46:53 -07:00
Henry de Valence 24de90c900 zebrad: tidy sync imports 2020-09-10 09:45:52 -07:00
Henry de Valence 9b6e66c1b9 zebrad: rename Syncer to ChainSync
This name clarifies what is being synced and avoids an agent-noun
construction.
2020-09-10 09:45:52 -07:00
Henry de Valence 0bc79686b8 zebrad: move sync into components module.
Part of #1030.
2020-09-10 09:45:52 -07:00
teor adafe1d189 Restart sync after the first failed ObtainTips
The ObtainTips retry was redundant. The timeout wasn't much shorter, but
it made the code and sync logic more complicated.
2020-09-09 15:35:09 -07:00
teor 2a68ef5acb Update the peerset buffer size and sync timeout
Also add a bunch of comments and documentation for network-constrained
nodes, and for testnet.
2020-09-08 12:44:33 -07:00
teor b062a682b0 Refactor "waiting for pending blocks" log 2020-09-08 12:44:33 -07:00
teor e6e859dce2 Tweak sync timeouts
* increase the EWMA default and decay
* increase the block download retries
* increase the request and block download timeouts
* increase the sync timeout
2020-09-08 12:44:33 -07:00
teor ce12d4dadc Add timeouts for tip responses and block verify tasks 2020-09-08 12:44:33 -07:00
teor 379ce5c1b8 Retry obtain and extend tips on failure 2020-09-08 12:44:33 -07:00
Alfredo Garcia 454e75e7c0
Rename old references to BlockHeaderHash and BlockHeight (#1002)
* rename some references

* Apply suggestions from code review

Co-authored-by: Deirdre Connolly <durumcrustulum@gmail.com>
Co-authored-by: teor <teor@riseup.net>

Co-authored-by: Deirdre Connolly <durumcrustulum@gmail.com>
Co-authored-by: teor <teor@riseup.net>
2020-09-04 15:40:48 -07:00
teor 48497d4857
Ignore sync errors when the block is already verified (#980)
* Ignore sync errors when the block is already verified

If we get an error for a block that is already in our state, we don't
need to restart the sync. It was probably a duplicate download.

Also:

Process any ready tasks before reset, so the logs and metrics are
up to date. (But ignore the errors, because we're about to reset.)

Improve sync logging and metrics during the download and verify task.

* Remove duplicate hashes in logs
Co-authored-by: Jane Lusby <jlusby42@gmail.com>

* Log the sync hash span at warn level
Co-authored-by: Jane Lusby <jlusby42@gmail.com>
2020-09-04 08:13:00 +10:00
teor 437549d8e9
Always drop the final hash in peer responses (#991)
To workaround a zcashd bug that squashes responses together.
2020-09-04 08:09:34 +10:00
teor c770daa51f
If the first ExtendTips hash is bad, discard it and re-check (#992) 2020-09-04 08:08:19 +10:00
Alfredo Garcia 5485f4429a
Add config path to acceptance tests (#946)
* add and apply config mode to get_child

* remove option to read config from current directory

* remove argument from get_child
2020-09-03 13:13:23 -07:00
Jane Lusby ffdec0cb23
Remove in-memory state service (#974)
* Remove in-memory state service

* make the config compatible with toml again

* checkpoint commit to see how much I still have to revert

* back to the starting point...

* remove unused dependency

* reorganize error handling a bit

* need to make a new color-eyre release now

* reorder again because I have problems

* remove unnecessary helpers

* revert changes to config loading

* add back missing space

* Switch to released color-eyre version

* add back missing newline again...

* improve error message on unix when terminated by signal

* add context to last few asserts in acceptance tests

* instrument some of the helpers

* remove accidental extra space

* try to make this compile on windows

* reorg platform specific code

* hide on_disk module and fix broken link
2020-09-01 12:39:04 -07:00
teor 3fdfcb3179 fix: remove old tips that are behind new tips
This change makes sync less reliant on the exact order of ObtainTips and
ExtendTips responses.
2020-09-01 11:42:48 -04:00
teor a6d6e65940 fix: fix the flamegraph module comment 2020-09-01 11:40:18 -04:00
teor 78201b456d feature: Implement checkpoint_sync for checkpoint verification
* add CheckpointList::new_up_to(limit: NetworkUpgrade)
* if checkpoint_sync is false, limit checkpoints to Sapling
* update tests for CheckpointList and chain::init
2020-08-24 15:34:46 +10:00
teor 06f4a59664 feature: Add a checkpoint_sync config option
(The option doesn't do anything yet.)
2020-08-24 15:34:46 +10:00
teor b8e8d4f548 fix: Remove some deeply-nested instrument spans
Closes #923.
2020-08-20 14:52:39 -04:00
Henry de Valence 103b663c40 chain: rename BlockHeight to block::Height 2020-08-17 11:46:34 -07:00
Henry de Valence 61dea90e2f chain: rename BlockHeaderHash to block::Hash
This is the first in a sequence of changes that change the block:: items
to not include Block as a prefix in their name, in accordance with the
Rust API guidelines.
2020-08-17 11:46:34 -07:00
Henry de Valence 948b067808 chain: move Network, NetworkUpgrade to parameters
Also, avoid using star-imports of the enum variants, which pollutes the
namespace.
2020-08-17 11:46:34 -07:00
Henry de Valence 0d1f56ad2f chain: remove utils module
A catch-all utils module can really easily slip into being a place to stash
miscellaneous functions that don't really belong anywhere in particular.
2020-08-17 11:46:34 -07:00
Henry de Valence a79ce97957
Fix sync algorithm. (#887)
* checkpoint: reject older of duplicate verification requests.

If we get a duplicate block verification request, we should drop the older one
in favor of the newer one, because the older request is likely to have been
canceled.  Previously, this code would accept up to four duplicate verification
requests, then fail all subsequent ones.

* sync: add a timeout layer to block requests.

Note that if this timeout is too short, we'll bring down the peer set in a
retry storm.

* sync: restart syncing on error

Restart the syncing process when an error occurs, rather than ignoring it.
Restarting means we discard all tips and start over with a new block locator,
so we can have another chance to "unstuck" ourselves.

* sync: additional debug info

* sync: handle lookahead limit correctly.

Instead of extracting all the completed task results, the previous code pulled
results out until there were fewer tasks than the lookahead limit, then
stopped.  This meant that completed tasks could be left until the limit was
exceeded again.  Instead, extract all completed results, and use the number of
pending tasks to decide whether to extend the tip or wait for blocks to finish.

* network: add debug instrumentation to retry policy

* sync: instrument the spawned task

* sync: streamline ObtainTips/ExtendTips logic & tracing

This change does three things:

1.  It aligns the implementation of ObtainTips and ExtendTips so that they use
the same deduplication method.  This means that when debugging we only have one
deduplication algorithm to focus on.

2.  It streamlines the tracing output to not include information already
included in spans. Both obtain_tips and extend_tips have their own spans
attached to the events, so it's not necessary to add Scope: prefixes in
messages.

3.  It changes the messages to be focused on reporting the actual
events rather than the interpretation of the events (e.g., "got genesis hash in
response" rather than "peer could not extend tip").  The motivation for this
change is that when debugging, the interpretation of events is already known to
be incorrect, in the sense that the mental model of the code (no bug) does not
match its behavior (has bug), so presenting minimally-interpreted events forces
interpretation relative to the actual code.

* sync: hack to work around zcashd behavior

* sync: localize debug statement in extend_tips

* sync: change algorithm to define tips as pairs of hashes.

This is different enough from the existing description that its comments no
longer apply, so I removed them.  A further chunk of work is to change the sync
RFC to document this algorithm.

* sync: reduce block timeout

* state: add resource limits for sled

Closes #888

* sync: add a restart timeout constant

* sync: de-pub constants
2020-08-12 16:48:01 -07:00
Henry de Valence 299afe13df
zebra-network tweaks. (#877)
* network: move gossiped peer selection logic into address book.

* network: return BoxService from init.

* zebrad: add note on why we truncate thegossiped peer list

Co-authored-by: Jane Lusby <jlusby42@gmail.com>

* Remove unused .rustfmt.toml

Many of these options are never actually loaded by our CI because of a channel
mismatch, where they're not applied on stable but only on nightly (see the logs
from a rustfmt job).  This means that we can get different settings when
running `cargo fmt` on the nightly and stable channels, which was causing a CI
failure on this PR.  Reverting back to the default rustfmt settings avoids this
problem and keeps us in line with upstream rustfmt.  There's no loss to us
since we were using the defaults anyways.

Co-authored-by: Jane Lusby <jlusby42@gmail.com>
2020-08-11 13:07:44 -07:00
teor 2550c44d48
Make sync ignore known hashes (#853)
* fix: Handle known ObtainTips correctly

enumerate never returns a value beyond the end of the vector.

* fix: Ignore known tips in ExtendTips

Some peers send us known tips when we try to extend.

* fix: Ignore known hashes when downloading

Despite all our other checks, we still end up downloading some hashes
multiple times.

* fix: Increase the number of retries

The old sync code relied on duplicate block fetches to make progress,
but the last few commits have removed some of those duplicates.

Instead, just retry the fetches that fail.

* fix: Tweak comments

Co-authored-by: Jane Lusby <jlusby42@gmail.com>

* fix: Cleanup the state_contains interface in Sync

* Fix brackets

Oops

Co-authored-by: Jane Lusby <jlusby42@gmail.com>
2020-08-10 16:17:50 -07:00
Alfredo Garcia 9c387521bd
Print endpoint addresses at startup (#867)
* print tracing and metrics endpoints in startup

* print network address in startup
2020-08-10 12:47:26 -07:00
teor e95358dbe3 fix: Increase the number of retries
The old sync code relied on duplicate block fetches to make progress,
but the last few commits have removed some of those duplicates.

Instead, just retry the fetches that fail.
2020-08-10 18:58:21 +10:00
teor faac50697c feature: Add a verified blocks metrics counter
We have a counter for pending "download and verify" futures. But these
futures are spawned, so they can complete in any order. They can also
complete before we receive their results.
2020-08-10 15:12:08 +10:00
teor 6aeefcee8b fix: Improve sync diagnostics 2020-08-10 15:12:08 +10:00
Henry de Valence 6d1a4b2218
Load config after initializing the Terminal (#848) 2020-08-06 17:22:40 -07:00
Alfredo Garcia c52481c041 fix logs 2020-08-07 09:21:57 +10:00
Jane Lusby 3e9c6f054b
fix log level default for server commands (#840)
* fix log level default for server commands

* remove dbg
2020-08-06 11:23:00 -07:00
Henry de Valence a77328ad7c
Refactor tracing components (#834)
* Split tracing component code into modules.

* Repatriate Tracing and simplify config handling.

We upstreamed our Tracing component, expecting not to have to exert fine
control over the tracing settings.  But this turned out not to be the case, and
now that we want to do other things (flamegraphs, journalctl, opentelemetry,
etc), we end up with really awkward code (as in the current flamegraph
handling).

This also makes use of the changes to `init()` to load the config early to pass
configuration data into the components, which avoids the need for the
refactoring in #775.

Finally, we restore support for the `-v` flag when the filter is unset.  Closes #831.

* Disable tracing and metrics endpoints by default.

Closes #660.

* Switch back to upstream Abscissa.

* Integrate flamegraph support into the new Tracing component.

* Pass -v in acceptance tests to get info-level output.

* Clean up acceptance test code.
2020-08-06 10:29:31 -07:00
Jane Lusby 867dd0b475
Setup tracing-flame for use profiling zebrad (#436)
* Setup tracing-flame for use profiling zebrad

* start work on conditional flamegraph generation

* review time!

* update comments

* Update Cargo.toml

* disable default features for inferno

* reorganize

* missing one trait

* Apply suggestions from code review

* graceful shutdown!

* remove special case handling on ctrlc for cleanup

* rename signal fn to better represent its responsibility

* remove unused global hook for flushing flamegraph

* move tracing logic to the right file

* just copy linkerd's signal handling logic

* update book

* make zebrad app drop on shutdown normally

* Update zebrad/src/components/tokio.rs

Co-authored-by: teor <teor@riseup.net>

* Update zebrad/src/application.rs

Co-authored-by: teor <teor@riseup.net>

* Apply suggestions from code review

Co-authored-by: teor <teor@riseup.net>

* cleanup a little

* ooh yea there's an API for that

* setup env-filter for backup subscriber

* document env filter

* document return codes

* forgot to save

* Update book/src/applications/zebrad.md

Co-authored-by: teor <teor@riseup.net>

Co-authored-by: teor <teor@riseup.net>
2020-08-05 16:35:56 -07:00
Henry de Valence 4a03d76a41
Remove environment variables in favor of documented config options. (#827)
* Load tracing filter only from config and simplify logic.

* Configure the state storage in the config, not an environment variable.

This also changes the config so that the path is always set rather than being
optional, because Zebra always needs a place to store its config.
2020-08-05 11:48:08 -07:00
Henry de Valence 82da4a5326 Remove connect command. 2020-08-04 23:34:45 -07:00
Alfredo Garcia f2d7bb3177
Command execution tests (#690)
* add zebrad acceptance tests
* add custom command test helpers that work with kill
* add and use info event for start and seed commands
* combine conflicting tests into one test case

Co-authored-by: Jane Lusby <jane@zfnd.org>
2020-08-01 16:15:26 +10:00
Alfredo Garcia 617f1d80ef move docs to zebra book 2020-07-29 19:44:21 -07:00
Alfredo Garcia 6297a7cd19 document zebrad enviroment variables 2020-07-29 19:44:21 -07:00
teor 050c46388f fix: Open the endpoints after the config is loaded
We get the injected TokioComponent dependency before the config is
loaded, so we can't use it to open the endpoints.

And we can't define after_config, because we use derive(Component).

So we work around these issues by opening the endpoints manually,
from the application's after_config.
2020-07-29 16:03:52 +10:00
teor e7437cc551 feature: Get endpoint addresses from config 2020-07-29 16:03:52 +10:00
teor 11090dbf91 feature: Separate Mainnet and Testnet state 2020-07-29 01:45:19 -04:00
Alfredo Garcia 5b3c6e4c6c
Port bash checkpoint scripts to zebra-checkpoints single rust binary (#740)
* make zebra-checkpoints
* fix LOOKAHEAD_LIMIT scope
* add a default cli path
* change doc usage text
* add tracing
* move MAX_CHECKPOINT_HEIGHT_GAP to zebra-consensus
* do byte_reverse_hex in a map
2020-07-25 17:53:00 +10:00
Henry de Valence b59cfc49b7 sync: create requests sequentially to respect backpressure.
This seems like a better design on principle but also appears to give a much
nicer sawtooth pattern of queued blocks in the checkpointer and a much smoother
pattern of block requests.
2020-07-24 18:36:00 -04:00
teor 2acfcf3a90
Make the CheckpointVerifier handle partial restarts (#736)
Also put generic bounds on the BlockVerifier struct,
so we get better compilation errors.
2020-07-24 11:47:48 +10:00
teor 77a1fefa1e
Download genesis (#731)
* feature: Add more CheckpointVerifier tracing

* fix: Download the genesis block
2020-07-23 10:56:52 -07:00
Jane Lusby c1a1493159
use dirs crate for default location of state and config (#714)
* use dirs crate for default location of state and config
* panic if a path isn't specified for zebra-state
2020-07-23 21:12:20 +10:00
teor c95c825707 fix: Lookup the genesis hash based on the network 2020-07-23 03:46:24 -04:00
Henry de Valence 4a98b8fa0d Add basic metrics to the syncer. 2020-07-22 21:59:00 -07:00
Henry de Valence c2c2a28e8b Improve tracing output in chain verifier 2020-07-22 21:59:00 -07:00
Jane Lusby 7d4e717182
Add block locator request to state layer (#712)
* Add block locator request to state layer

* pass genesis in request

* Update zebrad/src/commands/start/sync.rs

* fix errors
2020-07-22 18:01:31 -07:00
Henry de Valence 49aa41544d sync: try to ignore spurious inv messages.
Closes #697.

per  https://github.com/ZcashFoundation/zebra/issues/697#issuecomment-662742971

The response to a getblocks message is an inv message with the hashes of the
following blocks. However, inv messages are also sent unsolicited to gossip new
blocks across the network. Normally, this wouldn't be a problem, because for
every other request we filter only for the messages that are relevant to us.
But because the response to a getblocks message is an inv, the network layer
doesn't (and can't) distinguish between the response inv and the unsolicited
inv.

But there is a mitigation we can do. In our sync algorithm we have two phases:
(1) "ObtainTips" to get a set of tips to chase down, (2) repeatedly call
"ExtendTips" to extend those as far as possible. The unsolicited inv messages
have length 1, but when extending tips we expect to get more than one hash. So
we could reject responses in ExtendTips that have length 1 in order to ignore
these messages. This way we automatically ignore gossip messages during initial
block sync (while we're extending a tip) but we don't ignore length-1 responses
while trying to obtain tips (while querying the network for new tips).
2020-07-22 17:55:52 -07:00
teor 9b97ebbd61 feature: Choose checkpoints based on the config 2020-07-23 10:26:25 +10:00
teor 3d721a96a5 feature: Add the state config to the config file 2020-07-23 10:26:25 +10:00