Skip to content

Conversation

@blp
Copy link
Member

@blp blp commented Dec 18, 2025

This PR implements two optimizations to indexed Z-set storage.

The first, which is implemented in the final commit, is mostly above the level of the storage file implementation. It is a change to the indexed Z set implementation. In this optimization, if a key K has fewer than a threshold number of (V,R) pairs (or perhaps less than a threshold number of bytes of (V,R) pairs), then we write those (V,R) pairs as auxdata in column 0. Then we avoid any reads to column 1 at all for visiting those keys and their values.

The only change needed to the storage file implementation for this is to allow a key in column 0 to have zero values in column 1 (currently this will panic), which is implemented in the second-to-last commit.

The second, implemented by the third commit, is at the level of the storage file implementation. There is some confusion here about what we really have in column 1. It is essentially a btree over 2-tuples (R,K), where R is the row number in column 1. Each row in column 0 includes an accompanying range of column-1 row numbers for its values, so R allows that range to be looked up. K is the key for column 1, which for an indexed Z-set is the V from each (K,V,R) tuple. Storing it in the b-tree allows efficient searching by value, which is something that the DBSP cursor supports.

However, when there's only one column-1 value for each column-0 key, putting the column-1 values in the b-tree don't help. That means that we can essentially reduce that b-tree's keys from 2-tuples (R,K) into just R.

(Note that R isn't stored on a per-row basis even now, we use a much more efficient representation, since it's just a count.)

Storage format compatibility

With this PR, Feldera can read storage written by earlier versions, but storage written after this PR cannot be read by earlier versions of Feldera.

@blp blp requested review from gz and ryzhyk December 18, 2025 20:34
@blp blp self-assigned this Dec 18, 2025
Copilot AI review requested due to automatic review settings December 18, 2025 20:34
@blp blp added DBSP core Related to the core DBSP library performance storage Persistence for internal state in DBSP operators rust Pull requests that update Rust code labels Dec 18, 2025

This comment was marked as resolved.

@blp
Copy link
Member Author

blp commented Dec 19, 2025

In fact I see that clippy agrees, it appended a commit that changes some pieces of code to use if-let chains.

@blp blp force-pushed the pk-optimizations branch from b75273c to cb780ff Compare December 23, 2025 17:42
@blp
Copy link
Member Author

blp commented Dec 23, 2025

I pushed a revision that adds metrics for the optimizations that this PR introduces, so that one can tell whether it should be a performance benefit.

blp added 6 commits December 24, 2025 11:32
Until now, in dbsp::storage::file::writer, the ColumnWriter,
DataBlockBuilder, and IndexBlockBuilder haven't needed to be parameterized
by their key or auxdata types.  An upcoming commit will need that, so this
commit prepares for it by parameterizing them.

This shouldn't affect the storage format.

Signed-off-by: Ben Pfaff <blp@feldera.com>
…ded.

Columns 1 (and greater) in a layer file are essentially B-trees over tuples
(R,K), where R is the row number and K is the key.  Each row in column 0
includes as auxiliary data a range R0..R1 of rows in column 1 (the "row
group"), so R in the B-tree allows that range to be looked up.  Storing
the minimum and maximum keys of child blocks in the index allows efficient
lookup of keys in row groups that span multiple data blocks.

However, when a row group starts at the beginning of a data block, there's
little value in including that key in its index block, because it won't
usually help us to avoid reading the data block.  It only helps if we're
searching for a key less than the minimum, which is rarely the case; more
often, we're iterating through the keys, and usually we're doing nested
iteration through the column-0 keys too, so that we'd typically have to
read the data block for another one of those anyhow.

Similarly, when a row group ends at the end of a data block, there's little
value in including its last key in the index block.

This can be a significant optimization when keys in column 1 are large.

In layer files where each row in column 0 has exactly one row in column 1,
this optimization means that the column-0 index will have no bounds at all.
Other layer files will benefit less.

Signed-off-by: Ben Pfaff <blp@feldera.com>
Until now, any row in column 0 had to have at least one row in column 1.
This commit allows there to instead be none.

Signed-off-by: Ben Pfaff <blp@feldera.com>
Indexed Z-sets are commonly used for updating indexes on primary keys.
For this purpose, each key usually has one value (for insertions and
deletions) or two values (for updates).  Writing this small number of
values to a second column maintained separately in the layer file causes
I/O to additional data blocks.

This commit introduces a new way to store these indexed Z-sets.  With
this commit, when a key has just one or two values, those values are
written inline with the key, which reduces the number of data blocks
that need to be fetched to read a key along with its values.  This
should improve performance.

This commit does drop two performance-oriented features from
FileIndexedZSet: push_cursor() and fetch().  I don't think that either
of these have really been used without explicit enabling via DevTweaks.
If they prove important, then we can figure out how to re-implement
them with the new format.

Signed-off-by: Ben Pfaff <blp@feldera.com>
Add circuit profile metadata for inlined values and omitted bounds.

Signed-off-by: Ben Pfaff <blp@feldera.com>
@blp blp force-pushed the pk-optimizations branch from 968824b to 3ed076d Compare December 24, 2025 19:32
@blp
Copy link
Member Author

blp commented Dec 24, 2025

I pushed a revision that adds the metrics to the circuit profile instead. Probably a better fit there (thanks @ryzhyk for the suggestion).

Signed-off-by: feldera-bot <feldera-bot@feldera.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

DBSP core Related to the core DBSP library performance rust Pull requests that update Rust code storage Persistence for internal state in DBSP operators

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants