GH-3466 Improve RunLengthBitPackingHybridDecoder.readNext to avoid per-call buffer allocation and DataInputStream wrapping #3467

Open
arouel wants to merge 2 commits into apache:master from arouel:rle-buffer-reuse

@arouel commented Apr 6, 2026

Rationale for this change

RunLengthBitPackingHybridDecoder.readNext() allocates a new int[] and byte[] on every PACKED-mode call. The existing code even acknowledges this with a // TODO: reuse a buffer comment at line 94. In workloads that decode many bit-packed runs (definition levels, repetition levels, RLE-encoded integers), these allocations become a significant source of GC pressure.

There are two issues:

  1. Per-call buffer allocation. Every PACKED-mode readNext() allocates both currentBuffer = new int[currentCount] and byte[] bytes = new byte[numGroups * bitWidth]. Each allocation is modest (8–128 ints per run), but they occur thousands of times per column chunk. In a custom JFR-profiled benchmark merging 60 Parquet files (180M rows, 10 columns), these two sites were the number 2 and number 6 allocation hotspots respectively. Together they accounted for 2,402 of 12,711 total allocation samples (18.9%), or ~7.2 GB of allocation per operation.

  2. Per-call DataInputStream wrapping. new DataInputStream(in).readFully(bytes, 0, bytesToRead) creates a wrapper object on every PACKED-mode call just to access readFully().

What changes are included in this PR?

Three changes to RunLengthBitPackingHybridDecoder:

  • currentBuffer promoted from local to field and reused across readNext() calls with a grow-only strategy: it is reallocated only when the next run requires a larger buffer than the one currently held.
  • byte[] bytes promoted to a field packedBytes, same grow-only reuse strategy.
  • new DataInputStream(in).readFully(...) replaced with a private readFully() method that reads directly from the underlying InputStream, eliminating the per-call wrapper allocation and virtual dispatch.
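
The grow-only reuse and the direct read loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the class PackedRunReader, the method readRun, and the allocations counter are hypothetical names introduced for the example, and only the byte[] side is shown (the int[] buffer would follow the same pattern).

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for the decoder's packed-run path; not the parquet-java source.
class PackedRunReader {
  private final InputStream in;
  private byte[] packedBytes = new byte[0]; // reused across calls, grow-only
  int allocations = 0; // instrumentation for this sketch only

  PackedRunReader(InputStream in) {
    this.in = in;
  }

  // Reads exactly len bytes into the shared buffer. Only indices [0, len) are
  // meaningful afterwards, because the buffer may be longer than len.
  byte[] readRun(int len) throws IOException {
    if (packedBytes.length < len) {
      packedBytes = new byte[len]; // grow only when the run does not fit
      allocations++;
    }
    readFully(in, packedBytes, 0, len);
    return packedBytes;
  }

  // Loops over InputStream.read, which may return fewer bytes than requested,
  // so no DataInputStream wrapper is needed.
  private static void readFully(InputStream in, byte[] buf, int off, int len) throws IOException {
    int done = 0;
    while (done < len) {
      int n = in.read(buf, off + done, len - done);
      if (n < 0) {
        throw new EOFException("stream ended after " + done + " of " + len + " bytes");
      }
      done += n;
    }
  }

  public static void main(String[] args) throws IOException {
    PackedRunReader reader =
        new PackedRunReader(new ByteArrayInputStream(new byte[] {1, 2, 3, 4, 5, 6}));
    reader.readRun(4); // allocates a 4-byte buffer
    reader.readRun(2); // fits in the existing buffer, no new allocation
    System.out.println("allocations = " + reader.allocations);
  }
}
```

One consequence of this design is that callers can no longer rely on the array length to know how much data is valid, so the decoder has to track the run length separately.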

Custom JFR-profiled benchmark results (merge of 60 Parquet files, 180M rows, 10 columns, JDK 26, macOS aarch64):

| Metric | Before | After | Change |
|---|---|---|---|
| Time per op | 44.2 s/op | 42.5 s/op | -3.6% |
| Allocation rate | 908.9 MB/s | 793.9 MB/s | -12.7% |
| Allocation per op | 42.1 GB | 35.4 GB | -15.8% |
| RLE decoder alloc samples | 2,402 (18.9%) | 146 (1.2%) | -93.9% |
| RLE decoder alloc bytes | ~7,245 MB | ~417 MB | -94.2% |

Are these changes tested?

Yes. We added a regression test, testTruncatedPackedRunAfterFullPackedRunDoesNotReuseStaleBytes, to TestRunLengthBitPackingHybridEncoder. It covers the edge case of a truncated packed run that follows a full packed run once the buffer is reused. We also ran the full TestRunLengthBitPackingHybridEncoder class and RunLengthBitPackingHybridIntegrationTest, and all tests passed.
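
To illustrate the hazard the regression test's name points at: a freshly allocated array is zero-filled, but a reused one still holds the previous run's bytes beyond whatever a shorter read overwrites. The sketch below shows one way to guard against that by clearing the unread tail. It is an assumption about the general technique, not a description of how this PR handles the case, and StaleTailDemo and load are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical illustration of stale bytes in a reused buffer; not the parquet-java source.
class StaleTailDemo {
  private byte[] packedBytes = new byte[0]; // reused across calls, grow-only

  // Loads a run that nominally needs `needed` bytes from `src`, which may be
  // shorter (a truncated run). The unread tail is zeroed so it matches what a
  // freshly allocated array would have contained.
  byte[] load(byte[] src, int needed) {
    if (packedBytes.length < needed) {
      packedBytes = new byte[needed];
    }
    int available = Math.min(needed, src.length);
    System.arraycopy(src, 0, packedBytes, 0, available);
    // Without this fill, bytes from the previous run would leak into this one.
    Arrays.fill(packedBytes, available, needed, (byte) 0);
    return packedBytes;
  }
}
```

For example, loading a full 4-byte run of 0xFF bytes and then a truncated 2-byte run with needed = 4 leaves indices 2 and 3 at zero rather than 0xFF.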

Are there any user-facing changes?

No. This is a transparent performance improvement internal to the RLE decoder. Decoded values are identical. No API changes, no configuration changes, no behavioral changes.

Closes #3466

@arouel force-pushed the rle-buffer-reuse branch from 00bc776 to 75979e3 on April 6, 2026 20:01