Memory experiment #2314
-
One should not feed more data in a step than the LATENESS bound. Clearly, this changes the semantics of the program.
-
I think the basic conclusion from my findings is that we should keep steps small.
-
I also think that there's a difference between backfill and real-time operation. Perhaps the connectors should understand this difference. In backfill, full buffers will mostly be what triggers processing.
-
Moreover, I think you also made the point at some other time that input batches need to be sorted. Since sorting is O(n log n), the bigger the batches, the more time is spent sorting. This sorting is often useless if the data will be indexed again later in the pipeline. Perhaps we can make DBSP specify that some batches do not need to be sorted?
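A rough back-of-the-envelope on that sorting cost (my own illustration, not a measurement from the experiment): feeding $N$ records in $k$ equal steps means each step sorts $N/k$ records, so

$$
\underbrace{\tfrac{N}{k}\log\tfrac{N}{k}}_{\text{per-step sort cost}}
\quad\text{and}\quad
\underbrace{N\log\tfrac{N}{k}}_{\text{total over }k\text{ steps}}
\quad\text{vs.}\quad
\underbrace{N\log N}_{\text{one giant batch}}
$$

The total comparison count shrinks only by about $N\log k$, but the sort time inside any single step drops by roughly a factor of $k$, which is what matters for per-step latency; skipping the sort entirely when the data is re-indexed downstream would save all of it.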
-
This will make a great blog post
-
And yes, dynamic control + maybe a bit of compiler support will be the way to go
-
The pipeline here is in a state of overload, with the arrival rate higher than the service rate, which means something somewhere has to either a) queue the requests (so e2e latencies grow unbounded) or b) drop requests (to keep e2e latencies bounded). You can indeed keep step sizes small to bound per-step latency; that helps with memory usage around GC/merges etc. and sounds like the correct way to degrade gracefully. But you cannot avoid the buffering in front of it (either growing sizes for "buffered" or growing lag for data sources like Kafka).
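To put rough numbers on the overload point (illustrative figures, not measurements from this pipeline): with arrival rate $\lambda$ and service rate $\mu$, the backlog grows linearly no matter how the work is chopped into steps:

$$
\text{backlog}(t) \approx (\lambda - \mu)\,t,
\qquad\text{e.g. } \lambda = 200\text{k rec/s},\; \mu = 150\text{k rec/s}
\;\Rightarrow\; 50\text{k buffered records per second}.
$$

Small steps bound per-step latency and the memory spikes around GC/merges, but that backlog has to live somewhere (in-process buffers or Kafka lag) until the service rate catches up or data is dropped.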
-
(Outside observer here.) It sounds like there needs to be a generic backpressure mechanism rather than hardcoded limits/buffers. See https://lucumr.pocoo.org/2020/1/1/async-pressure/ and various posts/papers on the subject.
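A minimal sketch of what a generic backpressure mechanism looks like in Rust (illustrative only, not Feldera's API): a bounded channel makes the producer block once the consumer falls behind, instead of letting buffers or end-to-end latency grow without bound.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::Duration;

fn main() {
    // Bounded to 1,024 in-flight records: `send` blocks when the bound is hit,
    // which is the backpressure signal propagating to the producer.
    let (tx, rx) = sync_channel::<u64>(1024);

    let producer = thread::spawn(move || {
        for record in 0..100_000u64 {
            tx.send(record).unwrap(); // blocks while the consumer is behind
        }
    });

    for _record in rx {
        // A deliberately slow consumer; the producer's rate is forced down to match.
        thread::sleep(Duration::from_micros(10));
    }

    producer.join().unwrap();
}
```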
-
There are two kinds of streaming sources: hot and cold. The hot ones do not support backpressure. Internally we need backpressure, indeed. There's also the question of jitter: the buffers are useful to average out the throughput of a jittery source.
-
In this case we are dealing with a cold source, so we should be able to do better.
-
As proof, I can reliably crash the Nexmark Q9 pipeline on my laptop with 16 workers and an input batch size of 100K.
-
High memory use
Large input batches cause high memory use because:
- GC does not work as well because it only takes effect in the next step. That is, if step 1 raises the waterline to a given level, data below that level can only be deleted when we get to step 2. If we divide the same amount of input into 10x as many steps, then there is much more opportunity for the merger to GC away data in the process.
- We only merge batches of similar size, so if there's one very large batch we have to wait for another one before doing GC on it.
- Merges of very large batches take longer, and merging batches of size A and B takes temporary extra space of up to A+B (see the merge sketch after this list).
- Large input batches take more memory than small ones, both in terms of the raw transport and the parsed data fed to the circuit.
- Large input batches use more memory for intermediate processing through the pipeline.
- Processing a large input batch takes longer than processing a small one, which gives more time for a large number of records to build up in the buffers, which in turn causes the next input batch to be a big one. This also makes it more likely that we'll need to pause the transport (which Kafka handles badly, so we prefer to just wait and let even more data build up in the buffers in the hope that we can clear the backlog).
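As a concrete illustration of the A+B point (a simplified sketch, not DBSP's actual merge implementation): a two-way merge of sorted batches has to build the merged output before the inputs can be dropped, so at its peak it holds both inputs plus an output of capacity A+B.

```rust
/// Sketch of a two-way merge of sorted batches. While it runs, memory holds
/// both inputs *and* the output being built, i.e. up to A + B extra entries.
fn merge<T: Ord + Clone>(a: &[T], b: &[T]) -> Vec<T> {
    let mut out = Vec::with_capacity(a.len() + b.len()); // the A + B of temporary space
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        if a[i] <= b[j] {
            out.push(a[i].clone());
            i += 1;
        } else {
            out.push(b[j].clone());
            j += 1;
        }
    }
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}
```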
Demonstration
Consider our customer demo. On my system:
- With 16 workers, it peaks at about 6 GiB RAM.
- With 8 workers, it peaks at about 35 GiB RAM.
The main difference is that 16 workers can process events as fast as
they arrive, which keeps input batches small. With 8 workers, they
pile up as described above.
We can visualize the behavior of each one. The following graph shows the behavior with 16 workers. The buffered records stay near zero the whole time, the input records increase linearly, and the processed records follow closely behind. Memory initially increases superlinearly, but merging catches up and reduces it (one can see two distinct inflection points):
The graph for 8 workers is very different. Input records arrive at the same linear rate, but since processing cannot keep up, buffered records increase during each step to a new, higher peak (it never reaches the default `max_buffered_records` of 1,000,000 only because this data set has about 175,000 records). Thus, steps get bigger and bigger and slower and slower. Memory also increases steadily, partly because the merger can only reduce memory use during a step, and the few steps mean that there are few opportunities to do so. The final step is smaller because input is exhausted, and by then some memory can be released by the merger as well.
Analysis
We can separate the memory use into different effects:
- Eliminate effects from Kafka buffering, by using the file connector instead.
- Separate memory used for intermediate processing from memory for buffering input, by running them at mutually exclusive times:
  - Pausing the connectors when a step is running. This requires a patch.
  - Start a step only when the buffers have accumulated a specific number of records. For example, with `min_batch_size_records` of 50,000 and `max_buffering_delay_usecs` of 10,000,000, the pipeline will wait up to 10 seconds to buffer 50,000 records, and since it takes less than 10 seconds to do that, each step will have (approximately) 50,000 records. (A sketch of this trigger logic follows this list.)
- Eliminate effects from background merging, by merging eagerly and completely into a single batch whenever a new batch is inserted into a spine. Similarly, whenever the spine's filters change (which are how GC is implemented), the merger immediately frees all the records that are no longer needed. This requires a patch to add an "eager merger".
- Show the GC cost of large batches by making the final batch read from the file very small, so that memory use drops when GC frees everything up to the near-final waterline. This requires a patch, too.
- Run without storage, so that memory allocated for the storage cache does not muddy the problem.
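A sketch of that step-trigger logic (hypothetical code, not the actual implementation; the field names simply mirror the `min_batch_size_records` and `max_buffering_delay_usecs` settings): a step starts once enough records have buffered or the buffering delay expires, whichever comes first.

```rust
use std::time::{Duration, Instant};

/// Hypothetical sketch of the buffering policy: trigger a step when either
/// the record threshold or the buffering-delay deadline is reached.
struct StepTrigger {
    min_batch_size_records: u64,
    max_buffering_delay: Duration,
    buffered_records: u64,
    buffering_since: Instant,
}

impl StepTrigger {
    fn should_start_step(&self) -> bool {
        self.buffered_records >= self.min_batch_size_records
            || self.buffering_since.elapsed() >= self.max_buffering_delay
    }
}

fn main() {
    let trigger = StepTrigger {
        min_batch_size_records: 50_000,
        max_buffering_delay: Duration::from_micros(10_000_000),
        buffered_records: 0,
        buffering_since: Instant::now(),
    };
    // With the demo's settings, a step starts after 50,000 records or 10 seconds.
    assert!(!trigger.should_start_step());
}
```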
If we run the same demo with these changes, we get the following
behavior (graphed below):
- Buffered records build up in each step until 50,000 accumulate, except for the second-to-last step where there are only about 25,000, and the last step (not visible due to resolution) where there is only 1.
- Input records increase at the same rate as buffered records (but without the drops).
- While records build up in the buffers, parsing takes place but no other processing. Memory increases steadily but relatively slowly, in exact lockstep.
- When 50,000 records accumulate or the 10-second timer expires (for the last two steps), buffering pauses and a step begins. There are only 5 steps:
  - During the first step (6-13 seconds), memory use shoots up very quickly, dropping slightly at the end of the step.
  - The second step (21-29 seconds) also increases memory consumption from start to finish, peaking higher than it finishes.
  - The third step (37-44 seconds) peaks still higher but finishes just above the second step's final memory consumption, which suggests that the steady-state memory consumption for 50,000-record steps in this demo is about 10 GB.
  - The fourth step (55-60 seconds) is about half the size of the earlier ones. Memory consumption drops sharply because GC eliminated just as much data as in steps 2 and 3 and only half as much new data was added.
  - The fifth step (69-70 seconds) is a single record. Memory consumption drops sharply again, about as much as in the fourth step, because GC eliminated the data introduced in step 4 without adding a significant new amount.
- The memory usage at the end of each step is the minimum possible given the amount of GC allowed, since the merger eagerly merges and discards all GCable data. At the end of the first step, no GC at all is allowed.