# Zebra Cached State Database Implementation
## Upgrading the State Database

For most state upgrades, we want to modify the database format of the existing database. If we
change the major database version, every user needs to re-download and re-verify all the blocks,
which can take days.

### In-Place Upgrade Goals

- avoid a full download and rebuild of the state
- the previous state format must be able to be loaded by the new state
  - this is checked the first time CI runs on a PR with a new state version.
    After the first CI run, the cached state is marked as upgraded, so the upgrade doesn't run
    again. If CI fails on the first run, any cached states with that version should be deleted.
- previous Zebra versions should be able to load the new format
  - this is checked by other PRs running using the upgraded cached state, but only if a Rust PR
    runs after the new PR's CI finishes, but before it merges
- best-effort loading of older supported states by newer Zebra versions
- best-effort compatibility between newer states and older supported Zebra versions

### Design Constraints
[design]: #design

Upgrades run concurrently with state verification and RPC requests.

This means that:
- the state must be able to read the old and new formats
  - it can't panic if the data is missing
  - it can't give incorrect results, because that can affect verification or wallets
  - it can return an error
  - it can only return an `Option` if the caller handles it correctly
- multiple upgrades must produce a valid state format
  - if Zebra is restarted, the format upgrade will run multiple times
  - if an older Zebra version opens the state, data can be written in an older format
- the format must be valid before and after each database transaction or API call, because an
  upgrade can be cancelled at any time
  - multi-column family changes should be made in database transactions
  - if you are building a new column family, disable state queries, then enable them once it's done
  - if each database API call produces a valid format, transactions aren't needed

If there is an upgrade failure, it can panic and tell the user to delete their cached state and
re-launch Zebra.

### Implementation Steps

- [ ] update the [database format](https://github.com/ZcashFoundation/zebra/blob/main/book/src/dev/state-db-upgrades.md#current) in the Zebra docs
- [ ] increment the state minor version
- [ ] write the new format in the block write task
- [ ] update older formats in the format upgrade task
- [ ] test that the new format works when creating a new state, and updating an older state

See the [upgrade design docs](https://github.com/ZcashFoundation/zebra/blob/main/book/src/dev/state-db-upgrades.md#design) for more details.

These steps can be copied into tickets.

## Current State Database Format
[current]: #current

rocksdb provides a persistent, thread-safe `BTreeMap<&[u8], &[u8]>`. Each map is
a distinct "tree". Keys are sorted using lexicographic order (`[u8].sorted()`) on byte strings, so
integer values should be stored using big-endian encoding (so that the lexicographic
order on byte strings matches the numeric ordering).

Note that the lexicographic key order allows creating 1-to-many maps using keys only.
For example, the `tx_loc_by_transparent_addr_loc` column family maps each address
to all transactions related to it, by simply storing each transaction location prefixed
with the address location as the key, leaving the value empty. Since rocksdb allows
listing all keys with a given prefix, this supports listing all transactions
related to a given address.
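
For illustration, here is a minimal sketch of such a keys-only prefix scan using the Rust
`rocksdb` crate. It is not Zebra's actual code: the key layout follows the
`tx_loc_by_transparent_addr_loc` format described below (an 8-byte `AddressLocation` followed by
a 5-byte `TransactionLocation`), but the helper name and error handling are assumptions.

```rust
use rocksdb::{Direction, IteratorMode, DB};

/// Lists every `TransactionLocation` for an address by scanning all keys that
/// start with its 8-byte `AddressLocation` prefix. The values are empty: the
/// keys themselves form the 1-to-many map.
fn transactions_for_address(db: &DB, address_location: [u8; 8]) -> Vec<[u8; 5]> {
    let cf = db
        .cf_handle("tx_loc_by_transparent_addr_loc")
        .expect("column family exists");

    db.iterator_cf(cf, IteratorMode::From(&address_location[..], Direction::Forward))
        .map(|entry| entry.expect("iteration succeeds"))
        // Stop at the first key that no longer starts with this address
        .take_while(|(key, _value)| key.starts_with(&address_location[..]))
        // The trailing 5 bytes of each 13-byte key are the TransactionLocation
        .map(|(key, _value)| key[8..13].try_into().expect("13-byte key"))
        .collect()
}
```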

We use the following rocksdb column families:

| Column Family                      | Keys                   | Values                        | Changes |
| ---------------------------------- | ---------------------- | ----------------------------- | ------- |
| *Blocks*                           |                        |                               |         |
| `hash_by_height`                   | `block::Height`        | `block::Hash`                 | Create  |
| `height_by_hash`                   | `block::Hash`          | `block::Height`               | Create  |
| `block_header_by_height`           | `block::Height`        | `block::Header`               | Create  |
| *Transactions*                     |                        |                               |         |
| `tx_by_loc`                        | `TransactionLocation`  | `Transaction`                 | Create  |
| `hash_by_tx_loc`                   | `TransactionLocation`  | `transaction::Hash`           | Create  |
| `tx_loc_by_hash`                   | `transaction::Hash`    | `TransactionLocation`         | Create  |
| *Transparent*                      |                        |                               |         |
| `balance_by_transparent_addr`      | `transparent::Address` | `Amount \|\| AddressLocation` | Update  |
| `tx_loc_by_transparent_addr_loc`   | `AddressTransaction`   | `()`                          | Create  |
| `utxo_by_out_loc`                  | `OutputLocation`       | `transparent::Output`         | Delete  |
| `utxo_loc_by_transparent_addr_loc` | `AddressUnspentOutput` | `()`                          | Delete  |
| *Sprout*                           |                        |                               |         |
| `sprout_nullifiers`                | `sprout::Nullifier`    | `()`                          | Create  |
| `sprout_anchors`                   | `sprout::tree::Root`   | `sprout::NoteCommitmentTree`  | Create  |
| `sprout_note_commitment_tree`      | `block::Height`        | `sprout::NoteCommitmentTree`  | Delete  |
| *Sapling*                          |                        |                               |         |
| `sapling_nullifiers`               | `sapling::Nullifier`   | `()`                          | Create  |
| `sapling_anchors`                  | `sapling::tree::Root`  | `()`                          | Create  |
| `sapling_note_commitment_tree`     | `block::Height`        | `sapling::NoteCommitmentTree` | Create  |
| `sapling_note_commitment_subtree`  | `block::Height`        | `NoteCommitmentSubtreeData`   | Create  |
| *Orchard*                          |                        |                               |         |
| `orchard_nullifiers`               | `orchard::Nullifier`   | `()`                          | Create  |
| `orchard_anchors`                  | `orchard::tree::Root`  | `()`                          | Create  |
| `orchard_note_commitment_tree`     | `block::Height`        | `orchard::NoteCommitmentTree` | Create  |
| `orchard_note_commitment_subtree`  | `block::Height`        | `NoteCommitmentSubtreeData`   | Create  |
| *Chain*                            |                        |                               |         |
| `history_tree`                     | `block::Height`        | `NonEmptyHistoryTree`         | Delete  |
| `tip_chain_value_pool`             | `()`                   | `ValueBalance`                | Update  |

Zcash structures are encoded using `ZcashSerialize`/`ZcashDeserialize`.
Other structures are encoded using `IntoDisk`/`FromDisk`.

Block and Transaction Data:
- `Height`: 24 bits, big-endian, unsigned (allows for ~30 years worth of blocks)
- `TransactionIndex`: 16 bits, big-endian, unsigned (max ~23,000 transactions in the 2 MB block limit)
- `TransactionCount`: same as `TransactionIndex`
- `TransactionLocation`: `Height || TransactionIndex`
- `OutputIndex`: 24 bits, big-endian, unsigned (max ~223,000 transfers in the 2 MB block limit)
- transparent and shielded input indexes, and shielded output indexes: 16 bits, big-endian,
  unsigned (max ~49,000 transfers in the 2 MB block limit)
- `OutputLocation`: `TransactionLocation || OutputIndex`
- `AddressLocation`: the first `OutputLocation` used by a `transparent::Address`.
  Always has the same value for each address, even if the first output is spent.
- `Utxo`: `Output`, derives extra fields from the `OutputLocation` key
- `AddressUnspentOutput`: `AddressLocation || OutputLocation`,
  used instead of a `BTreeSet<OutputLocation>` value, to improve database performance
- `AddressTransaction`: `AddressLocation || TransactionLocation`,
  used instead of a `BTreeSet<TransactionLocation>` value, to improve database performance
- `NoteCommitmentSubtreeIndex`: 16 bits, big-endian, unsigned
- `NoteCommitmentSubtreeData<{sapling, orchard}::tree::Node>`: `Height || {sapling, orchard}::tree::Node`

We use big-endian encoding for keys, to allow database index prefix searches.
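
To make the encoding concrete, here is a minimal sketch of the 3-byte big-endian `Height` and
5-byte `TransactionLocation` keys described above. It is an illustration, not Zebra's actual
`IntoDisk` implementation:

```rust
/// Encodes a block height as 3 big-endian bytes (24 bits).
fn height_to_bytes(height: u32) -> [u8; 3] {
    assert!(height < (1 << 24), "Height must fit in 24 bits");
    let be = height.to_be_bytes();
    [be[1], be[2], be[3]]
}

/// Encodes a `TransactionLocation` as `Height || TransactionIndex` (5 bytes).
fn transaction_location_to_bytes(height: u32, index: u16) -> [u8; 5] {
    let mut key = [0u8; 5];
    key[..3].copy_from_slice(&height_to_bytes(height));
    key[3..].copy_from_slice(&index.to_be_bytes());
    key
}

fn main() {
    // Big-endian keys sort lexicographically in numeric order:
    // height 256 sorts after height 2, even though "256" < "2" as text.
    assert!(transaction_location_to_bytes(2, 0) < transaction_location_to_bytes(256, 0));
    // Within a block, transactions sort by index, enabling prefix searches by height.
    assert!(transaction_location_to_bytes(2, 1) < transaction_location_to_bytes(2, 2));
}
```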

Amounts:
- `Amount`: 64 bits, little-endian, signed
- `ValueBalance`: `[Amount; 4]`

Derived Formats:
- `*::NoteCommitmentTree`: `bincode` using `serde`
- `NonEmptyHistoryTree`: `bincode` using `serde`, using `zcash_history`'s `serde` implementation

The following figure helps visualize the address index, which is the most complicated part.
Numbers in brackets are array sizes; bold arrows are compositions (i.e. `TransactionLocation` is the
concatenation of `Height` and `TransactionIndex`); dashed arrows are compositions that are also 1-to-many
maps (i.e. `AddressTransaction` is the concatenation of `AddressLocation` and `TransactionLocation`,
but is also used to map each `AddressLocation` to multiple `TransactionLocation`s).

```mermaid
graph TD;
    Address -->|"balance_by_transparent_addr<br/>"| AddressBalance;
    AddressBalance ==> Amount;
    AddressBalance ==> AddressLocation;
    AddressLocation ==> FirstOutputLocation;
    AddressLocation -.->|"tx_loc_by_transparent_addr_loc<br/>(AddressTransaction[13])"| TransactionLocation;
    TransactionLocation ==> Height;
    TransactionLocation ==> TransactionIndex;
    OutputLocation -->|utxo_by_out_loc| Output;
    OutputLocation ==> TransactionLocation;
    OutputLocation ==> OutputIndex;
    AddressLocation -.->|"utxo_loc_by_transparent_addr_loc<br/>(AddressUnspentOutput[16])"| OutputLocation;

    AddressBalance["AddressBalance[16]"];
    Amount["Amount[8]"];
    Height["Height[3]"];
    Address["Address[21]"];
    TransactionIndex["TransactionIndex[2]"];
    TransactionLocation["TransactionLocation[5]"];
    OutputIndex["OutputIndex[3]"];
    OutputLocation["OutputLocation[8]"];
    FirstOutputLocation["First OutputLocation[8]"];
    AddressLocation["AddressLocation[8]"];
```

### Implementing consensus rules using rocksdb
[rocksdb-consensus-rules]: #rocksdb-consensus-rules

Each column family handles updates differently, based on its specific consensus rules:
- Create:
  - Each key-value entry is created once.
  - Keys are never deleted, values are never updated.
- Delete:
  - Each key-value entry is created once.
  - Keys can be deleted, but values are never updated.
  - Code called by ReadStateService must ignore deleted keys, or use a read lock.
  - TODO: should we prevent re-inserts of keys that have been deleted?
- Update:
  - Each key-value entry is created once.
  - Keys are never deleted, but values can be updated.
  - Code called by ReadStateService must handle old or new values, or use a read lock.

We can't do some kinds of value updates, because they cause RocksDB performance issues:
- Append:
  - Keys are never deleted.
  - Existing values are never updated.
  - Sets of values have additional items appended to the end of the set.
  - Code called by ReadStateService must handle shorter or longer sets, or use a read lock.
- Up/Del:
  - Keys can be deleted.
  - Sets of values have items added or deleted (in any position).
  - Code called by ReadStateService must ignore deleted keys and values,
    accept shorter or longer sets, and accept old or new values.
    Or it should use a read lock.

Avoid using large sets of values as RocksDB keys or values.

### RocksDB read locks
[rocksdb-read-locks]: #rocksdb-read-locks

The read-only ReadStateService needs to handle concurrent writes and deletes of the finalized
column families it reads. It must also handle overlaps between the cached non-finalized `Chain`,
and the current finalized state database.

The StateService uses RocksDB transactions for each block write.
So ReadStateService queries that only access a single key or value will always see
a consistent view of the database.

If a ReadStateService query only uses column families that have keys and values appended
(`Create` in the table above), it should ignore extra appended values.
Most queries do this by default.

For more complex queries, there are several options:

Reading across multiple column families:
1. Ignore deleted values using custom Rust code
2. Take a [database snapshot](https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.snapshot)

Reading a single column family:
3. [`multi_get`](https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.multi_get_cf)
4. [`iterator`](https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.iterator_cf)

RocksDB also has read transactions, but they don't seem to be exposed in the Rust crate.
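
As a concrete illustration of option 2, here is a minimal sketch of a snapshot read across two
column families, using names from the table above. This is not Zebra's actual code; the database
path and error handling are simplified assumptions:

```rust
use rocksdb::{ColumnFamilyDescriptor, Options, DB};

fn main() -> Result<(), rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.create_missing_column_families(true);

    let cfs = vec![
        ColumnFamilyDescriptor::new("hash_by_height", Options::default()),
        ColumnFamilyDescriptor::new("block_header_by_height", Options::default()),
    ];
    let db = DB::open_cf_descriptors(&opts, "/tmp/snapshot-example", cfs)?;

    // All reads through `snapshot` see the database as it was at this point,
    // so the hash and header for a height can't come from different block writes.
    let snapshot = db.snapshot();

    let hash_cf = db.cf_handle("hash_by_height").expect("column family exists");
    let header_cf = db.cf_handle("block_header_by_height").expect("column family exists");

    let height_key = [0u8, 0, 0]; // Height 0 as 3 big-endian bytes
    let hash = snapshot.get_cf(hash_cf, height_key)?;
    let header = snapshot.get_cf(header_cf, height_key)?;

    println!("hash: {hash:?}, header: {header:?}");
    Ok(())
}
```
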
### Low-Level Implementation Details
[rocksdb-low-level]: #rocksdb-low-level

RocksDB ignores duplicate puts and deletes, preserving the latest values.
If rejecting duplicate puts or deletes is consensus-critical,
check [`db.get_cf(cf, key)?`](https://docs.rs/rocksdb/0.16.0/rocksdb/struct.DBWithThreadMode.html#method.get_cf)
before putting or deleting any values in a batch.
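
A hedged sketch of that check, assuming a `WriteBatch`-based write path like the one described in
this document (this is not Zebra's actual code):

```rust
use rocksdb::{ColumnFamily, WriteBatch, DB};

/// Adds a put to `batch` only if the key is not already in the database.
/// RocksDB itself silently keeps the latest value for duplicate puts, so
/// consensus-critical duplicate rejection must happen before the write.
/// Note: this check only sees committed data, not earlier puts in the same batch.
fn put_once(
    db: &DB,
    cf: &ColumnFamily,
    batch: &mut WriteBatch,
    key: &[u8],
    value: &[u8],
) -> Result<(), Box<dyn std::error::Error>> {
    if db.get_cf(cf, key)?.is_some() {
        return Err(format!("rejecting duplicate put for key {key:?}").into());
    }
    batch.put_cf(cf, key, value);
    Ok(())
}
```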

Currently, these restrictions should be enforced by code review:
- multiple `zs_insert`s are only allowed on Update column families, and
- [`delete_cf`](https://docs.rs/rocksdb/0.16.0/rocksdb/struct.WriteBatch.html#method.delete_cf)
  is only allowed on Delete column families.

In future, we could enforce these restrictions by:
- creating traits for Create, Delete, and Update column families
- doing different checks in `zs_insert` depending on the trait
- wrapping `delete_cf` in a trait, and only implementing that trait for types that use Delete column families.

As of June 2021, the Rust `rocksdb` crate [ignores the delete callback](https://docs.rs/rocksdb/0.16.0/src/rocksdb/merge_operator.rs.html#83-94),
and merge operators are unreliable (or have undocumented behaviour).
So they should not be used for consensus-critical checks.

### Notes on rocksdb column families
[rocksdb-column-families]: #rocksdb-column-families

- The `hash_by_height` and `height_by_hash` column families provide a bijection between
  block heights and block hashes. (Since the rocksdb state only stores finalized
  state, they are actually a bijection).

- Similarly, the `tx_loc_by_hash` and `hash_by_tx_loc` column families provide a bijection between
  transaction locations and transaction hashes.

- The `block_header_by_height` column family provides a bijection between block
  heights and block header data. There is no corresponding `height_by_block` column
  family: instead, hash the block header, and look up the height for that hash in
  `height_by_hash`.
  (Since the rocksdb state only stores finalized state, they are actually a bijection).
  Similarly, there are no column families that go from transaction data
  to transaction locations: hash the transaction and use `tx_loc_by_hash`.

- Block headers and transactions are stored separately in the database,
  so that individual transactions can be accessed efficiently.
  Blocks can be re-created on request using the following process
  (sketched in code below):
  - Look up `height` in `height_by_hash`
  - Get the block header for `height` from `block_header_by_height`
  - Iterate from `TransactionIndex` 0,
    to get each transaction with `height` from `tx_by_loc`,
    stopping when there are no more transactions in the block
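
A minimal sketch of that process, using hypothetical helper functions (they are not Zebra
functions) that stand in for the column family reads, with types simplified to byte vectors:

```rust
type Hash = [u8; 32];

// height_by_hash: Hash -> Height (hypothetical helper)
fn get_height(_hash: &Hash) -> Option<u32> {
    unimplemented!()
}

// block_header_by_height: Height -> serialized header (hypothetical helper)
fn get_header(_height: u32) -> Option<Vec<u8>> {
    unimplemented!()
}

// tx_by_loc: (Height, TransactionIndex) -> serialized transaction (hypothetical helper)
fn get_tx(_height: u32, _index: u16) -> Option<Vec<u8>> {
    unimplemented!()
}

/// Re-creates a block as (header, transactions) from its hash.
fn block_by_hash(hash: &Hash) -> Option<(Vec<u8>, Vec<Vec<u8>>)> {
    // 1. Look up the height for this hash
    let height = get_height(hash)?;
    // 2. Get the block header for that height
    let header = get_header(height)?;
    // 3. Iterate from TransactionIndex 0 until the transactions run out
    let mut transactions = Vec::new();
    let mut index: u16 = 0;
    while let Some(tx) = get_tx(height, index) {
        transactions.push(tx);
        index += 1;
    }
    Some((header, transactions))
}
```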

- Block headers are stored by height, not by hash. This has the downside that looking
  up a block by hash requires an extra level of indirection. The upside is
  that blocks with adjacent heights are adjacent in the database, and many
  common access patterns, such as helping a client sync the chain or doing
  analysis, access blocks in (potentially sparse) height order. In addition,
  the fact that we commit blocks in order means we're writing only to the end
  of the rocksdb column family, which may help save space.

- Similarly, transaction data is stored in chain order in `tx_by_loc` and `utxo_by_out_loc`,
  and in chain order within each vector in `utxo_loc_by_transparent_addr_loc` and
  `tx_loc_by_transparent_addr_loc`.

- `TransactionLocation`s are stored as a `(height, index)` pair referencing the
  height of the transaction's parent block and the transaction's index in that
  block. This would more traditionally be a `(hash, index)` pair, but because
  we store blocks by height, storing the height saves one level of indirection.
  Transaction hashes can be looked up using `hash_by_tx_loc`.

- Similarly, UTXOs are stored in `utxo_by_out_loc` by `OutputLocation`,
  rather than `OutPoint`. `OutPoint`s can be looked up using `tx_loc_by_hash`,
  and reconstructed using `hash_by_tx_loc`.

- The `Utxo` type can be constructed from the `OutputLocation` and `Output` data,
  `height: OutputLocation.height`, and
  `is_coinbase: OutputLocation.transaction_index == 0`
  (coinbase transactions are always the first transaction in a block),
  as sketched below.
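
A sketch of that derivation with simplified stand-in types (not Zebra's actual `Utxo` or
`OutputLocation` definitions):

```rust
struct OutputLocation {
    height: u32,
    transaction_index: u16,
    output_index: u32,
}

struct Utxo<Output> {
    output: Output,
    height: u32,
    is_coinbase: bool,
}

fn utxo_from_location<Output>(location: &OutputLocation, output: Output) -> Utxo<Output> {
    Utxo {
        output,
        // The height comes from the block height part of the location key
        height: location.height,
        // Coinbase transactions are always the first transaction in a block
        is_coinbase: location.transaction_index == 0,
    }
}
```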

- `balance_by_transparent_addr` is the sum of all `utxo_loc_by_transparent_addr_loc`s
  that are still in `utxo_by_out_loc`. It is cached to improve performance for
  addresses with large UTXO sets. It also stores the `AddressLocation` for each
  address, which allows for efficient lookups.

- `utxo_loc_by_transparent_addr_loc` stores unspent transparent output locations
  by address. The address location and UTXO location are stored as a RocksDB key,
  so they are in chain order, and get good database performance.
  This column family also includes the original address location UTXO,
  if it has not been spent.

- When a block write deletes a UTXO from `utxo_by_out_loc`,
  that UTXO location should be deleted from `utxo_loc_by_transparent_addr_loc`.
  The deleted UTXO can be removed efficiently, because the UTXO location is part of the key.
  This is an index optimisation, which does not affect query results.

- `tx_loc_by_transparent_addr_loc` stores transaction locations by address.
  This list includes transactions containing spent UTXOs.
  The address location and transaction location are stored as a RocksDB key,
  so they are in chain order, and get good database performance.
  This column family also includes the `TransactionLocation`
  of the transaction for the `AddressLocation`.

- The `sprout_note_commitment_tree` stores the note commitment tree state
  at the tip of the finalized state, for the specific pool. There is always
  a single entry. Each tree is stored
  as a "Merkle tree frontier" which is basically a (logarithmic) subset of
  the Merkle tree nodes as required to insert new items.
  For each block committed, the old tree is deleted and a new one is inserted
  by its new height.
  **TODO:** store the sprout note commitment tree by `()`,
  to avoid ReadStateService concurrent write issues.

- The `{sapling, orchard}_note_commitment_tree` stores the note commitment tree
  state for every height, for the specific pool. Each tree is stored
  as a "Merkle tree frontier" which is basically a (logarithmic) subset of
  the Merkle tree nodes as required to insert new items.

- The `{sapling, orchard}_note_commitment_subtree` stores the completion height and
  root for every completed level 16 note commitment subtree, for the specific pool.

- `history_tree` stores the ZIP-221 history tree state at the tip of the finalized
  state. There is always a single entry for it. The tree is stored as the set of "peaks"
  of the "Merkle mountain range" tree structure, which is what is required to
  insert new items.
  **TODO:** store the history tree by `()`, to avoid ReadStateService concurrent write issues.

- Each `*_anchors` column family stores the anchor (the root of a Merkle tree) of the note
  commitment tree of a certain block. We only use the keys since we just need the set of anchors,
  regardless of where they come from. The exception is `sprout_anchors`, which also maps
  the anchor to the matching note commitment tree. This is required to support interstitial
  treestates, which are unique to Sprout.
  **TODO:** store the `Root` hash in `sprout_note_commitment_tree`, and use it to look up the
  note commitment tree. This de-duplicates tree state data. But we currently only store one
  sprout tree by height.

- The value pools are only stored for the finalized tip.

- We do not store the cumulative work for the finalized chain,
  because the finalized work is equal for all non-finalized chains.
  So the additional non-finalized work can be used to calculate the relative chain order,
  and choose the best chain.