The old sync code relied on duplicate block fetches to make progress,
but the last few commits have removed some of those duplicates.
Instead, just retry the fetches that fail.
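A rough sketch of that retry approach (the `fetch_block` function and `FetchError` type are hypothetical stand-ins, not the actual syncer API): a failed fetch is simply reattempted a bounded number of times.

```rust
// Minimal sketch of bounded retry for a failed block fetch.
// `fetch_block` and `FetchError` are hypothetical placeholders; the real
// syncer drives its requests through tower services.
async fn fetch_with_retry(hash: [u8; 32], max_attempts: usize) -> Result<Vec<u8>, FetchError> {
    let mut last_err = None;
    for _attempt in 0..max_attempts {
        match fetch_block(hash).await {
            Ok(block) => return Ok(block),
            Err(e) => last_err = Some(e),
        }
    }
    Err(last_err.expect("max_attempts must be at least 1"))
}

#[derive(Debug)]
struct FetchError;

async fn fetch_block(_hash: [u8; 32]) -> Result<Vec<u8>, FetchError> {
    // Placeholder: the real implementation requests the block from the peer
    // set and waits for the response.
    Ok(Vec::new())
}
```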
We have a counter for pending "download and verify" futures. But these
futures are spawned, so they can complete in any order. They can also
complete before we receive their results.
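One way to structure that bookkeeping (a sketch only, assuming the `tokio` and `futures` crates rather than the real syncer internals) is to hold the spawned handles in a `FuturesUnordered` and treat its length as the pending counter:

```rust
use futures::stream::{FuturesUnordered, StreamExt};

// Sketch: track spawned "download and verify" tasks without assuming they
// finish in spawn order. The task body is a placeholder.
async fn drive_downloads(hashes: Vec<u32>) {
    let mut pending = FuturesUnordered::new();
    for hash in hashes {
        // Spawned tasks start running immediately, so some may already be
        // finished before we ever poll their handles.
        pending.push(tokio::spawn(async move { hash }));
    }

    // `pending.len()` is the counter; results arrive in completion order,
    // not submission order.
    while let Some(result) = pending.next().await {
        match result {
            Ok(hash) => println!("verified {}, {} still pending", hash, pending.len()),
            Err(join_error) => eprintln!("task failed: {}", join_error),
        }
    }
}
```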
* Split tracing component code into modules.
* Repatriate Tracing and simplify config handling.
We upstreamed our Tracing component, expecting not to have to exert fine
control over the tracing settings. But this turned out not to be the case, and
now that we want to do other things (flamegraphs, journalctl, opentelemetry,
etc), we end up with really awkward code (as in the current flamegraph
handling).
This also makes use of the changes to `init()` to load the config early to pass
configuration data into the components, which avoids the need for the
refactoring in #775.
Finally, we restore support for the `-v` flag when the filter is unset. Closes #831.
* Disable tracing and metrics endpoints by default.
Closes #660.
* Switch back to upstream Abscissa.
* Integrate flamegraph support into the new Tracing component.
* Pass -v in acceptance tests to get info-level output.
* Clean up acceptance test code.
* Set up tracing-flame for use in profiling zebrad
* start work on conditional flamegraph generation
* review time!
* update comments
* Update Cargo.toml
* disable default features for inferno
* reorganize
* missing one trait
* Apply suggestions from code review
* graceful shutdown!
* remove special case handling on ctrlc for cleanup
* rename signal fn to better represent its responsibility
* remove unused global hook for flushing flamegraph
* move tracing logic to the right file
* just copy linkerd's signal handling logic
* update book
* make zebrad app drop on shutdown normally
* Update zebrad/src/components/tokio.rs
Co-authored-by: teor <teor@riseup.net>
* Update zebrad/src/application.rs
Co-authored-by: teor <teor@riseup.net>
* Apply suggestions from code review
Co-authored-by: teor <teor@riseup.net>
* cleanup a little
* ooh yea there's an API for that
* setup env-filter for backup subscriber
* document env filter
* document return codes
* forgot to save
* Update book/src/applications/zebrad.md
Co-authored-by: teor <teor@riseup.net>
Co-authored-by: teor <teor@riseup.net>
* Load tracing filter only from config and simplify logic.
* Configure the state storage in the config, not an environment variable.
This also changes the config so that the path is always set rather than being
optional, because Zebra always needs a place to store its state.
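A sketch of the resulting config shape (struct and field names here are illustrative, not the exact zebrad config), assuming `serde` with the derive feature:

```rust
use std::path::PathBuf;
use serde::Deserialize;

// Illustrative sketch: the state section keeps a concrete `PathBuf` with a
// default, instead of an `Option<PathBuf>` read from an environment variable.
#[derive(Debug, Deserialize)]
#[serde(default)]
struct StateSection {
    cache_dir: PathBuf,
}

impl Default for StateSection {
    fn default() -> Self {
        Self {
            // Placeholder default; the real default is platform-specific.
            cache_dir: PathBuf::from(".zebra-state"),
        }
    }
}
```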
* add zebrad acceptance tests
* add custom command test helpers that work with kill
* add and use info event for start and seed commands
* combine conflicting tests into one test case
Co-authored-by: Jane Lusby <jane@zfnd.org>
We get the injected TokioComponent dependency before the config is
loaded, so we can't use it to open the endpoints.
And we can't define after_config, because we use derive(Component).
So we work around these issues by opening the endpoints manually,
from the application's after_config.
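The shape of that workaround, with hypothetical types standing in for the real component and config (the actual Abscissa hook signatures differ):

```rust
use std::net::SocketAddr;

// Hypothetical stand-in for the real endpoint component.
struct MetricsEndpoint {
    // Stays unopened until the config is available.
    listen_addr: Option<SocketAddr>,
}

impl MetricsEndpoint {
    fn new() -> Self {
        Self { listen_addr: None }
    }

    // Called manually from the application's after_config hook, because the
    // derived component initializer runs before the config is loaded.
    fn open_endpoint(&mut self, addr: SocketAddr) {
        self.listen_addr = Some(addr);
        // placeholder: bind the listener / install the exporter here
    }
}

// Hypothetical stand-in for the loaded config.
struct ZebradConfig {
    metrics_addr: SocketAddr,
}

// Rough equivalent of the application's after_config: by this point the
// config exists, so the endpoints can finally be opened.
fn after_config(endpoint: &mut MetricsEndpoint, config: &ZebradConfig) {
    endpoint.open_endpoint(config.metrics_addr);
}
```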
* make zebra-checkpoints
* fix LOOKAHEAD_LIMIT scope
* add a default cli path
* change doc usage text
* add tracing
* move MAX_CHECKPOINT_HEIGHT_GAP to zebra-consensus
* do byte_reverse_hex in a map
This seems like a better design in principle but also appears to give a much
nicer sawtooth pattern of queued blocks in the checkpointer and a much smoother
pattern of block requests.
We had a brief discussion on Discord and it seemed like we had consensus on the
following versioning policy:
* zebrad: match major version to NU version, so we will start by releasing
zebrad 3.0.0;
* zebra-* libraries: start by matching zebrad's version, then increment major
versions of each library as we need to make breaking changes (potentially
faster than the zebrad version, always respecting semver but making no
guarantees about the longevity of major releases).
This commit sets all of the crate versions to 3.0.0-alpha.0 -- the -alpha.0
marks it as a prerelease not subject to perfect adherence to compatibility
guarantees.
Closes #697.
per https://github.com/ZcashFoundation/zebra/issues/697#issuecomment-662742971
The response to a getblocks message is an inv message with the hashes of the
following blocks. However, inv messages are also sent unsolicited to gossip new
blocks across the network. Normally, this wouldn't be a problem, because for
every other request we filter only for the messages that are relevant to us.
But because the response to a getblocks message is an inv, the network layer
doesn't (and can't) distinguish between the response inv and the unsolicited
inv.
But there is a mitigation we can apply. In our sync algorithm we have two phases:
(1) "ObtainTips" to get a set of tips to chase down, (2) repeatedly call
"ExtendTips" to extend those as far as possible. The unsolicited inv messages
have length 1, but when extending tips we expect to get more than one hash. So
we could reject responses in ExtendTips that have length 1 in order to ignore
these messages. This way we automatically ignore gossip messages during initial
block sync (while we're extending a tip) but we don't ignore length-1 responses
while trying to obtain tips (while querying the network for new tips).
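A sketch of that filter (the types and function names are illustrative, not the actual sync code):

```rust
// Sketch of the mitigation: during ExtendTips, treat a length-1 inv response
// as likely gossip and ignore it.
type BlockHash = [u8; 32];

fn filter_extend_tips_response(hashes: Vec<BlockHash>) -> Option<Vec<BlockHash>> {
    if hashes.len() == 1 {
        // A single hash is indistinguishable from an unsolicited gossip inv,
        // so drop it rather than extending a tip with it.
        None
    } else {
        Some(hashes)
    }
}
```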
Closes #617.
Closes #698.
The remaining work on the syncer is alluded to in a new comment:
1. Correctly constructing a block locator object
2. Detecting when we've stopped making progress syncing and restarting obtain_tips.
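For the first item, the usual Bitcoin-style block locator steps back from the tip by 1 for the most recent entries and then doubles the step until genesis; the following is only a sketch of that convention, not the eventual zebrad implementation:

```rust
// Sketch of conventional block locator heights: dense near the tip, then
// exponentially sparser back to genesis (height 0).
fn block_locator_heights(tip_height: u64) -> Vec<u64> {
    let mut heights = Vec::new();
    let mut height = tip_height;
    let mut step = 1u64;
    loop {
        heights.push(height);
        if height == 0 {
            break;
        }
        // After the first ten entries, double the step each time.
        if heights.len() >= 10 {
            step = step.saturating_mul(2);
        }
        height = height.saturating_sub(step);
    }
    heights
}
```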
* fix: Resist CheckpointVerifier memory DoS attacks
Allow a maximum of 2 queued blocks at each height, as a tradeoff between
efficient bad block rejection and memory usage (see the sketch after these bullets).
Closes #628.
* fix: Make max queued blocks at height equal to fanout
* fix: Just allocate all the capacity upfront
* fix: Use with_capacity(1) and reserve_exact(1)
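A sketch of the per-height limit (with placeholder types; the real CheckpointVerifier stores queued blocks alongside their result channels):

```rust
use std::collections::BTreeMap;

// Limit taken from the commit message above; placeholder types throughout.
const MAX_QUEUED_BLOCKS_PER_HEIGHT: usize = 2;

type Height = u32;
struct QueuedBlock; // placeholder for the real queued block + result channel

fn queue_block(
    queued: &mut BTreeMap<Height, Vec<QueuedBlock>>,
    height: Height,
    block: QueuedBlock,
) -> bool {
    // Start with room for exactly one block; most heights only ever see one.
    let list = queued
        .entry(height)
        .or_insert_with(|| Vec::with_capacity(1));

    if list.len() >= MAX_QUEUED_BLOCKS_PER_HEIGHT {
        // Reject further duplicates at this height to bound memory usage.
        return false;
    }

    // Ask for one more slot instead of letting `push` double the capacity.
    list.reserve_exact(1);
    list.push(block);
    true
}
```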
* Log at warn level for commands that use stdout
* Let zebrad revhex read from stdin
Most unix tools support reading from stdin, so they can be used in
pipelines.
Part of #564.
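A sketch of the stdin-driven flow, assuming the `hex` crate (the real `zebrad revhex` implementation may parse its input differently):

```rust
use std::io::{self, BufRead};

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let line = line?;
        let trimmed = line.trim();
        if trimmed.is_empty() {
            continue;
        }
        match hex::decode(trimmed) {
            Ok(mut bytes) => {
                // Hashes are conventionally displayed in reversed byte order,
                // so flip the bytes and re-encode.
                bytes.reverse();
                println!("{}", hex::encode(bytes));
            }
            Err(e) => eprintln!("skipping invalid hex {:?}: {}", trimmed, e),
        }
    }
    Ok(())
}
```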
* Flatten consensus::verify::* to consensus::*
* Move consensus::*::tests into their own files
* Move CheckpointList into its own file
* Move Progress and Target into a types module
QueuedBlock and QueuedBlockList can stay in checkpoint.rs, because
they are tightly coupled to CheckpointVerifier.
* also download tips and filter tips
* dispatch all block downloads together
* tweak to match Henry's changes
* switch to more intuitive match
Co-authored-by: Jane Lusby <jane@zfnd.org>
Each subsection has to have `serde(default)` to get the behaviour we want
(users can delete every field from their config file except the ones they have
changed); otherwise, only entire sections can be deleted.
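A sketch of the pattern (field names are illustrative, assuming `serde` derive):

```rust
use serde::Deserialize;

// `#[serde(default)]` on each subsection lets a config file omit any subset
// of fields, not just whole sections.
#[derive(Debug, Default, Deserialize)]
#[serde(default)]
struct Config {
    network: NetworkSection,
    tracing: TracingSection,
}

#[derive(Debug, Default, Deserialize)]
#[serde(default)]
struct NetworkSection {
    listen_addr: Option<String>,
    peerset_initial_target_size: usize,
}

#[derive(Debug, Default, Deserialize)]
#[serde(default)]
struct TracingSection {
    filter: Option<String>,
}

// With these attributes, a file containing only
//   [tracing]
//   filter = "info"
// still parses, with every other field taking its default.
```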
Prior to this change, the service returned by `zebra_network::init` would spawn background tasks that could silently fail, causing unexpected errors in the zebra_network service.
This change modifies the `PeerSet` that backs `zebra_network::init` to store the `JoinHandle`s for all of the background tasks it depends on. The `PeerSet` then checks this set of futures to see if any of them have exited with an error or a panic, and if so, returns that error as part of `poll_ready`.
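A simplified sketch of the idea, assuming `tokio` and `futures` (the real `PeerSet` folds this check into its `tower::Service::poll_ready`):

```rust
use std::task::{Context, Poll};
use futures::stream::{FuturesUnordered, StreamExt};
use tokio::task::JoinHandle;

type BoxError = Box<dyn std::error::Error + Send + Sync>;

// Holds the handles of the background tasks a service depends on, and
// surfaces their failures from a poll_ready-style method.
struct BackgroundTasks {
    handles: FuturesUnordered<JoinHandle<Result<(), BoxError>>>,
}

impl BackgroundTasks {
    fn check(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), BoxError>> {
        match self.handles.poll_next_unpin(cx) {
            // A background task returned an error: report it instead of
            // letting the service fail silently later.
            Poll::Ready(Some(Ok(Err(e)))) => Poll::Ready(Err(e)),
            // A background task panicked: the JoinError carries the panic.
            Poll::Ready(Some(Err(join_error))) => Poll::Ready(Err(join_error.into())),
            // Background tasks are expected to run forever, so a clean exit
            // is also treated as a failure.
            Poll::Ready(Some(Ok(Ok(())))) => {
                Poll::Ready(Err("background task exited unexpectedly".into()))
            }
            Poll::Ready(None) => Poll::Ready(Err("all background tasks exited".into())),
            // Everything is still running: the service itself is ready.
            Poll::Pending => Poll::Ready(Ok(())),
        }
    }
}
```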
* rename zebra-storage to zebra-state
* Setup initial skeleton for zebra-state
* add test
* Apply suggestions from code review
Co-authored-by: Henry de Valence <hdevalence@hdevalence.ca>
* move shared test vectors to a common crate
Co-authored-by: Jane Lusby <jane@zfnd.org>
Co-authored-by: Henry de Valence <hdevalence@hdevalence.ca>
Co-authored-by: Jane Lusby <jane@zfnd.org>
Prior to this change, the seed subcommand would consistently encounter a panic in one of the background tasks, but would continue running after the panic. This is indicative of two bugs.
First, zebrad was not configured to treat panics as non-recoverable, and instead fell back to the tokio defaults, which are to catch panics in tasks and return them via the join handle if available, or to print them if the join handle has been discarded. This is likely a poor fit for zebrad as an application: we do not need to maximize uptime or minimize the extent of an outage if one of our tasks or services starts encountering panics. Ignoring a panic increases our risk of observing invalid state, causing all sorts of wild and bad bugs. To deal with this we've switched the default panic behavior from `unwind` to `abort`. This makes panics fail immediately and take down the entire application, regardless of where they occur, which is consistent with our treatment of misbehaving connections.
The second bug is the panic itself. This was triggered by a duplicate entry in the initial_peers set. To fix this we've switched the storage for the peers from a `Vec` to a `HashSet`, which has similar properties but guarantees uniqueness of its elements.
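For reference, the abort-on-panic behavior is the kind of thing normally set via the Cargo profile option `panic = "abort"`, and the duplicate-proof peer storage looks roughly like the sketch below (the field name and peer address are illustrative, assuming `serde` and the `toml` crate):

```rust
use std::collections::HashSet;
use serde::Deserialize;

// Illustrative field name; the real config section differs.
#[derive(Debug, Deserialize)]
struct NetworkConfig {
    initial_mainnet_peers: HashSet<String>,
}

fn main() {
    // A duplicate entry like this previously reached the connection logic as
    // two identical peers; with a HashSet it is collapsed while parsing.
    let config: NetworkConfig = toml::from_str(
        r#"
        initial_mainnet_peers = [
            "mainnet.seeder.example:8233",
            "mainnet.seeder.example:8233",
        ]
        "#,
    )
    .expect("example config parses");

    assert_eq!(config.initial_mainnet_peers.len(), 1);
}
```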