PK optimizations #5314

blp · 2025-12-18T20:34:08Z

This PR implements two optimizations to indexed Z-set storage.

The first, which is implemented in the final commit, is mostly above the level of the storage file implementation. It is a change to the indexed Z set implementation. In this optimization, if a key K has fewer than a threshold number of (V,R) pairs (or perhaps less than a threshold number of bytes of (V,R) pairs), then we write those (V,R) pairs as auxdata in column 0. Then we avoid any reads to column 1 at all for visiting those keys and their values.

The only change needed to the storage file implementation for this is to allow a key in column 0 to have zero values in column 1 (currently this will panic), which is implemented in the second-to-last commit.

The second, implemented by the third commit, is at the level of the storage file implementation. There is some confusion here about what we really have in column 1. It is essentially a btree over 2-tuples (R,K), where R is the row number in column 1. Each row in column 0 includes an accompanying range of column-1 row numbers for its values, so R allows that range to be looked up. K is the key for column 1, which for an indexed Z-set is the V from each (K,V,R) tuple. Storing it in the b-tree allows efficient searching by value, which is something that the DBSP cursor supports.

However, when there's only one column-1 value for each column-0 key, putting the column-1 values in the b-tree don't help. That means that we can essentially reduce that b-tree's keys from 2-tuples (R,K) into just R.

(Note that R isn't stored on a per-row basis even now, we use a much more efficient representation, since it's just a count.)

Storage format compatibility

With this PR, Feldera can read storage written by earlier versions, but storage written after this PR cannot be read by earlier versions of Feldera.

blp · 2025-12-19T16:48:05Z

In fact I see that clippy agrees, it appended a commit that changes some pieces of code to use if-let chains.

blp · 2025-12-23T17:43:20Z

I pushed a revision that adds metrics for the optimizations that this PR introduces, so that one can tell whether it should be a performance benefit.

Signed-off-by: Ben Pfaff <blp@feldera.com>

Until now, in dbsp::storage::file::writer, the ColumnWriter, DataBlockBuilder, and IndexBlockBuilder haven't needed to be parameterized by their key or auxdata types. An upcoming commit will need that, so this commit prepares for it by parameterizing them. This shouldn't affect the storage format. Signed-off-by: Ben Pfaff <blp@feldera.com>

…ded. Columns 1 (and greater) in a layer file are essentially B-trees over tuples (R,K), where R is the row number and K is the key. Each row in column 0 includes as auxiliary data a range R0..R1 of rows in column 1 (the "row group"), so R in the B-tree allows that range to be looked up. Storing the minimum and maximum keys of child blocks in the index allows efficient lookup of keys in row groups that span multiple data blocks. However, when a row group starts at the beginning of a data block, there's little value in including that key in its index block, because it won't usually help us to avoid reading the data block. It only helps if we're searching for a key less than the minimum, which is rarely the case; more often, we're iterating through the keys, and usually we're doing nested iteration through the column-0 keys too, so that we'd typically have to read the data block for another one of those anyhow. Similarly, when a row group ends at the end of a data block, there's little value in including its last key in the index block. This can be a significant optimization when keys in column 1 are large. In layer files where each row in column 0 has exactly one row in column 1, this optimization means that the column-0 index will have no bounds at all. Other layer files will benefit less. Signed-off-by: Ben Pfaff <blp@feldera.com>

Until now, any row in column 0 had to have at least one row in column 1. This commit allows there to instead be none. Signed-off-by: Ben Pfaff <blp@feldera.com>

Indexed Z-sets are commonly used for updating indexes on primary keys. For this purpose, each key usually has one value (for insertions and deletions) or two values (for updates). Writing this small number of values to a second column maintained separately in the layer file causes I/O to additional data blocks. This commit introduces a new way to store these indexed Z-sets. With this commit, when a key has just one or two values, those values are written inline with the key, which reduces the number of data blocks that need to be fetched to read a key along with its values. This should improve performance. This commit does drop two performance-oriented features from FileIndexedZSet: push_cursor() and fetch(). I don't think that either of these have really been used without explicit enabling via DevTweaks. If they prove important, then we can figure out how to re-implement them with the new format. Signed-off-by: Ben Pfaff <blp@feldera.com>

Add circuit profile metadata for inlined values and omitted bounds. Signed-off-by: Ben Pfaff <blp@feldera.com>

blp · 2025-12-24T19:33:39Z

I pushed a revision that adds the metrics to the circuit profile instead. Probably a better fit there (thanks @ryzhyk for the suggestion).

Signed-off-by: feldera-bot <feldera-bot@feldera.com>

blp requested review from gz and ryzhyk December 18, 2025 20:34

blp self-assigned this Dec 18, 2025

Copilot AI review requested due to automatic review settings December 18, 2025 20:34

blp added DBSP core Related to the core DBSP library performance storage Persistence for internal state in DBSP operators rust Pull requests that update Rust code labels Dec 18, 2025

This comment was marked as resolved.

Sign in to view

blp force-pushed the pk-optimizations branch from d63901f to b75273c Compare December 18, 2025 21:07

blp temporarily deployed to ci December 18, 2025 21:38 — with GitHub Actions Inactive

blp temporarily deployed to ci December 18, 2025 23:26 — with GitHub Actions Inactive

blp force-pushed the pk-optimizations branch from b75273c to cb780ff Compare December 23, 2025 17:42

blp temporarily deployed to ci December 23, 2025 20:07 — with GitHub Actions Inactive

blp temporarily deployed to ci December 23, 2025 23:14 — with GitHub Actions Inactive

blp force-pushed the pk-optimizations branch from cb780ff to 968824b Compare December 24, 2025 19:32

blp added 6 commits December 24, 2025 11:32

[dbsp] Drop unused opt_key_factory from IndexedWSetBatchFactories.

665d40a

Signed-off-by: Ben Pfaff <blp@feldera.com>

[dbsp] Allow empty row groups in layer files.

f00cdaf

Until now, any row in column 0 had to have at least one row in column 1. This commit allows there to instead be none. Signed-off-by: Ben Pfaff <blp@feldera.com>

[dbsp] Add circuit profile metadata.

3ed076d

Add circuit profile metadata for inlined values and omitted bounds. Signed-off-by: Ben Pfaff <blp@feldera.com>

blp force-pushed the pk-optimizations branch from 968824b to 3ed076d Compare December 24, 2025 19:32

[ci] apply automatic fixes

91d53cb

Signed-off-by: feldera-bot <feldera-bot@feldera.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

PK optimizations #5314

PK optimizations #5314

Uh oh!

blp commented Dec 18, 2025

Uh oh!

This comment was marked as resolved.

Uh oh!

blp commented Dec 19, 2025

Uh oh!

blp commented Dec 23, 2025

Uh oh!

blp commented Dec 24, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

PK optimizations #5314

Are you sure you want to change the base?

PK optimizations #5314

Uh oh!

Conversation

blp commented Dec 18, 2025

Storage format compatibility

Uh oh!

This comment was marked as resolved.

Uh oh!

blp commented Dec 19, 2025

Uh oh!

blp commented Dec 23, 2025

Uh oh!

blp commented Dec 24, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants