* Decide if Zebra is at the chain tip
* Avoid division by zero
* Try increasing EVENT_TIMEOUT
* Increase MAX_TEST_EXECUTION
* Implement basic tests
* Resolve Clippy's errors
* Change doc comments to normal comments
Co-authored-by: Alfredo Garcia <oxarbitrage@gmail.com>
* Create a `SyncStatus` helper type
Keeps track of whether the synchronizer is close to the chain tip or not.
* Refactor the `ChainSync` constructor to return `SyncStatus`
Change the constructor API so that it returns a higher level construct.
* Test if `SyncStatus` waits for the chain tip
Test that waiting for the chain tip correctly finishes once the tip is
reached. This is done by sending recent sync lengths to the
`SyncStatus` instance, and checking that every time a separate
`SyncStatus` instance determines it has reached the tip, the original
instance wakes up.
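A rough sketch of the `SyncStatus` idea, assuming a tokio watch channel
carries the recent sync lengths (the threshold, fields, and methods are
illustrative, not Zebra's exact API):
```rust
use tokio::sync::watch;

/// If every recent sync fetched at most this many hashes, assume we
/// are near the tip. Threshold is illustrative.
const TIP_LENGTH_THRESHOLD: usize = 1;

#[derive(Clone)]
pub struct SyncStatus {
    latest_sync_lengths: watch::Receiver<Vec<usize>>,
}

impl SyncStatus {
    /// Returns the status handle, plus the sender the syncer uses to
    /// publish its recent sync lengths.
    pub fn new() -> (Self, watch::Sender<Vec<usize>>) {
        let (sender, receiver) = watch::channel(Vec::new());
        let status = SyncStatus {
            latest_sync_lengths: receiver,
        };
        (status, sender)
    }

    /// Heuristic: close to the tip when we have some history and every
    /// recent sync returned very few new hashes.
    pub fn is_close_to_tip(&self) -> bool {
        let lengths = self.latest_sync_lengths.borrow();
        !lengths.is_empty() && lengths.iter().all(|&len| len <= TIP_LENGTH_THRESHOLD)
    }

    /// Wait until the syncer reports that it is close to the tip.
    pub async fn wait_until_close_to_tip(&mut self) {
        while !self.is_close_to_tip() {
            // Wakes whenever the syncer publishes new lengths; gives
            // up if the sender side was dropped.
            if self.latest_sync_lengths.changed().await.is_err() {
                return;
            }
        }
    }
}
```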
* Add a temporary attribute to allow dead code
The added code isn't used yet, so we'll add a temporary waiver until
the follow-up PR that uses it is merged.
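For reference, the waiver is just Rust's `allow` attribute on the
unused items, for example:
```rust
// TODO: remove once the follow-up PR starts using this type.
#[allow(dead_code)]
struct RecentSyncLengths {
    lengths: Vec<usize>,
}
```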
* Minimal recent sync lengths implementation
Also includes metrics and logging, to make diagnosing bugs easier.
* Add logging to check what happens when Zebra reaches the chain tip
* Add tests for recent sync lengths
- initially empty
- pruned to correct length
- newest entries go first
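A sketch of what the minimal implementation and these tests might look
like (the struct name matches the changelog; the constant, method, and
test body are illustrative):
```rust
/// Maximum history to keep. Illustrative value; a later change in
/// this log increases the real length to 4.
const MAX_RECENT_LENGTHS: usize = 3;

#[derive(Debug, Default)]
struct RecentSyncLengths {
    lengths: Vec<usize>,
}

impl RecentSyncLengths {
    /// Record the latest sync length, newest first, pruning the
    /// history to `MAX_RECENT_LENGTHS` entries.
    fn push(&mut self, length: usize) {
        self.lengths.insert(0, length);
        self.lengths.truncate(MAX_RECENT_LENGTHS);
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn starts_empty_prunes_and_orders_newest_first() {
        let mut recent = RecentSyncLengths::default();
        assert!(recent.lengths.is_empty());

        for len in 1..=5 {
            recent.push(len);
        }

        // Pruned to the maximum length, with the newest entry first.
        assert_eq!(recent.lengths, vec![5, 4, 3]);
    }
}
```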
* Drop a redundant `/` from a Cargo.lock URL
This seems to be a nightly or beta Rust change,
but hopefully stable just accepts it.
* Use metrics histograms to avoid overwriting values
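For illustration, using the current `metrics` crate API (0.22+; older
versions pass the value directly to the macro), with an assumed metric
name:
```rust
use metrics::histogram;

// A gauge would overwrite the previous observation on every sync,
// hiding short-lived spikes; a histogram records each value, so a
// dashboard can show the full distribution.
fn record_sync_length(length: usize) {
    histogram!("sync.extend_tips.lengths").record(length as f64);
}
```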
* Add detailed syncer monitoring dashboard
* Increase the recent sync length to 4
This length makes it easier to distinguish between temporary and
sustained errors/syncs.
Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
* implement and test a rate limit in `request_genesis()`
* add `request_genesis_is_rate_limited` test to sync
* add ensure_timeouts constraint for GENESIS_TIMEOUT_RETRY
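A sketch of the rate limit's shape, reusing the `GENESIS_TIMEOUT_RETRY`
name from above; the value, helper signature, and constraint are
illustrative:
```rust
use std::time::Duration;
use tokio::time::sleep;

// Constant name is from this changelog; the value is illustrative.
const GENESIS_TIMEOUT_RETRY: Duration = Duration::from_secs(5);

// An `ensure_timeouts`-style constraint (illustrative): fail the build
// if the retry delay can't fit inside a one-minute test timeout.
const _: () = assert!(GENESIS_TIMEOUT_RETRY.as_secs() < 60);

/// Retry fetching the genesis block, sleeping between attempts so
/// repeated failures can't turn into a tight request loop.
async fn request_genesis<F, Fut>(mut try_fetch: F)
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = bool>,
{
    while !try_fetch().await {
        sleep(GENESIS_TIMEOUT_RETRY).await;
    }
}
```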
* Suppress expected warning logs in zebrad tests
Co-authored-by: teor <teor@riseup.net>
In our README, we tell users to ignore these errors, so we should also
disable the issue URL.
Also include the hash in the error. (We don't want the span active for
all messages; we just want the hash in the error.)
This provides useful but not too noisy output at INFO level. We log an
info-level message on every block commit, instead of one message every
N blocks, because this is useful both for the initial block sync and
for continuous state updates on new blocks.
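An illustrative `tracing` call matching that description (the field
names and message are assumptions):
```rust
use tracing::info;

// Attach the hash to this one event instead of entering a span, so
// other messages don't carry it, and log once per committed block.
fn log_commit(height: u32, hash: &str) {
    info!(height, %hash, "committed block to state");
}
```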
We should error if we notice that we're attempting to download the same
blocks multiple times, because that indicates that peers reported bad
information to us, or we got confused trying to interpret their
responses.
Using the cancel_handles, we can deduplicate requests. This is
important to do, because otherwise when we insert the second cancel
handle, we'd drop the first one, cancelling an existing task for no
reason.
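A sketch of the deduplication idea, assuming `cancel_handles` maps
block hashes to oneshot cancel senders (names and types illustrative):
```rust
use std::collections::HashMap;
use tokio::sync::oneshot;

type BlockHash = [u8; 32];

#[derive(Default)]
struct Downloads {
    cancel_handles: HashMap<BlockHash, oneshot::Sender<()>>,
}

impl Downloads {
    /// Refuse duplicate downloads instead of silently replacing the
    /// existing cancel handle, which would cancel a healthy task.
    fn start_download(
        &mut self,
        hash: BlockHash,
    ) -> Result<oneshot::Receiver<()>, &'static str> {
        if self.cancel_handles.contains_key(&hash) {
            return Err("duplicate block download: bad or confusing peer responses");
        }
        let (cancel_tx, cancel_rx) = oneshot::channel();
        self.cancel_handles.insert(hash, cancel_tx);
        Ok(cancel_rx)
    }
}
```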
The hedge middleware implements hedged requests, as described in _The
Tail At Scale_. The idea is that we auto-tune our retry logic according
to the actual network conditions, pre-emptively retrying requests that
exceed some latency percentile. This would hopefully solve the problem
where our timeouts are too long on mainnet and too short on testnet.
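As a standalone toy (not the actual middleware), the mechanism looks
roughly like this, with `hedge_after` standing in for the auto-tuned
latency percentile:
```rust
use std::time::Duration;
use tokio::time::sleep;

/// Start a backup copy of the request once the first attempt exceeds
/// `hedge_after`, and take whichever attempt finishes first.
async fn hedged_request<T, F, Fut>(make_request: F, hedge_after: Duration) -> T
where
    F: Fn() -> Fut,
    Fut: std::future::Future<Output = T>,
{
    let first = make_request();
    tokio::pin!(first);

    tokio::select! {
        // First attempt finished within the hedge threshold.
        result = &mut first => result,
        // Threshold exceeded: launch a backup request and return
        // whichever of the two completes first.
        _ = sleep(hedge_after) => {
            let backup = make_request();
            tokio::pin!(backup);
            tokio::select! {
                result = &mut first => result,
                result = &mut backup => result,
            }
        }
    }
}
```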
Try to use the better cancellation logic to revert to the previous sync
algorithm. As designed, the sync algorithm is supposed to proceed by
downloading state prospectively and handling errors by flushing the
pipeline and starting over. This hasn't worked well, because we didn't
previously cancel tasks properly. Now that we can, try to use something
in the spirit of the original sync algorithm.
This makes two changes relative to the existing download code:
1. It uses a oneshot to attempt to cancel the download task after it
has started;
2. It encapsulates the download creation and cancellation logic into a
Downloads struct.
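A rough sketch of change 1, assuming the spawned task races the
download future against a oneshot cancel signal (all names
illustrative); the `Downloads` struct from change 2 would store the
returned sender in its `cancel_handles` map:
```rust
use tokio::sync::oneshot;

/// Spawn a download task that can be cancelled after it has started.
fn spawn_download<F>(download: F) -> oneshot::Sender<()>
where
    F: std::future::Future<Output = ()> + Send + 'static,
{
    let (cancel_tx, cancel_rx) = oneshot::channel::<()>();
    tokio::spawn(async move {
        tokio::select! {
            // Fires when the handle is signalled (or dropped),
            // abandoning the in-progress download.
            _ = cancel_rx => tracing::debug!("download cancelled"),
            _ = download => tracing::debug!("download finished"),
        }
    });
    cancel_tx
}
```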