fix(merge): preserve target rows when MERGE batch contains only target #129
Closed
rampage644 wants to merge 1 commit into `main`
Conversation
The `MergeCOWFilterStream` "no matches in this batch" fast path short-circuited on `matching_data_and_manifest_files.is_empty()` without checking the cumulative `all_matching_data_files` set. If a target file had been seen as matching in an earlier batch and a later batch contained only target rows for that file, the rows in the later batch were silently dropped. The downstream COW commit then overwrote the original file with the partial result, permanently losing the unmatched target rows whose batch hit the dead path.

The fix tightens the guard to also require `all_matching_data_files` to be empty before taking the fast path. When a batch belongs to a file already in the overwrite set, it falls through to the main filter path, which correctly emits target rows via `file_predicate OR source_exists`.

Adds three unit tests against `MergeCOWFilterStream` covering the matching-then-target patterns, plus a SQL snapshot test that exercises the same shape end-to-end.

Fixes #128

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Combined into #126.
Summary
Fixes #128.
`MERGE INTO ... WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT` was silently dropping unmatched target rows when the target's physical row order wasn't aligned with the join key. The MERGE row counters (number of rows updated / number of rows inserted) reported the correct values, so the loss was invisible to clients without a separate row-count check. The bug is non-deterministic: repeated runs of the same SQL produced different loss counts.

Root cause: a stale fast-path guard in `MergeCOWFilterStream::poll_next`. The filter maintains two related collections: `matching_data_and_manifest_files` (matches in the current batch only) and `all_matching_data_files` (matches in this batch or any prior batch, intersected with the current one). The "no matches in this batch" fast path short-circuited on `matching_data_and_manifest_files.is_empty()` alone, without checking `all_matching_data_files`. So if a target file had been seen as matching in an earlier batch and a later batch contained only target rows for that file, the rows in the later batch were silently dropped. The downstream COW commit then overwrote the original file with the partial result, permanently losing those rows.

Sorting either input by the join key masked the bug: matched and unmatched rows for the same file were co-located, so every batch hitting the file also contained a match, and the dead path never fired.
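The failure mode can be sketched with a toy model. Everything below is illustrative, not the actual stream internals: the batch-local set is collapsed to a boolean, `emitted_rows` stands in for the per-batch filter loop, and only the guard conditions mirror the description above.

```rust
use std::collections::HashSet;

// Toy stand-in for the state MergeCOWFilterStream tracks per poll.
struct FilterState {
    matching_in_current_batch: bool,          // any match in this batch?
    all_matching_data_files: HashSet<String>, // cumulative overwrite set
}

// Buggy guard: consults only the current batch, so a target-only batch for an
// already-matched file takes the fast path and its rows are dropped.
fn fast_path_buggy(s: &FilterState) -> bool {
    !s.matching_in_current_batch
}

// Fixed guard: the fast path is only safe when no file has ever matched,
// i.e. nothing in the batch can belong to the COW overwrite set.
fn fast_path_fixed(s: &FilterState) -> bool {
    !s.matching_in_current_batch && s.all_matching_data_files.is_empty()
}

// Replay batches of (file, row_count, any_match) and count emitted rows.
fn emitted_rows(batches: &[(&str, usize, bool)], fixed: bool) -> usize {
    let mut all_matching = HashSet::new();
    let mut out = 0;
    for &(file, rows, matched) in batches {
        if matched {
            all_matching.insert(file.to_string());
        }
        let state = FilterState {
            matching_in_current_batch: matched,
            all_matching_data_files: all_matching.clone(),
        };
        let skip = if fixed { fast_path_fixed(&state) } else { fast_path_buggy(&state) };
        if !skip {
            out += rows; // main filter path re-emits the target rows
        }
    }
    out
}

fn main() {
    // Batch 0: 4 rows of file1 with a match; batch 1: 6 target-only rows of file1.
    let batches = [("file1", 4, true), ("file1", 6, false)];
    assert_eq!(emitted_rows(&batches, false), 4);  // buggy: batch 1 silently dropped
    assert_eq!(emitted_rows(&batches, true), 10);  // fixed: all target rows preserved
}
```

The replay makes the data-loss shape concrete: under the old guard the target-only batch vanishes, while both guards agree on the all-target case (no file ever matched), which is why the fast path was safe for plain non-matching scans.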
Fix
Six-line guard tightening: the fast path now also requires `all_matching_data_files` to be empty before short-circuiting. When a batch belongs to a file already in the overwrite set, it falls through to the main filter path, which builds `file_predicate = (path == file1) OR ... OR (path == fileN)`, ORs it with `source_exists`, and correctly re-emits the target rows.

Tests added
Unit tests in `crates/executor/src/datafusion/physical_plan/merge.rs`:

- `matching_then_target`: file matched in batch 0, target-only batch in batch 1
- `matching_then_target_then_matching`: match → target-only → match again
- `matching_then_multiple_target_batches`: match followed by 3 consecutive target-only batches

Each fails on `main` with assertion mismatches and passes on this branch.

End-to-end SQL snapshot in `crates/executor/src/tests/sql/ddl/merge_into.rs`: `merge_into_mixed_unsorted_multi_row_no_data_loss`, a 10-row target × 4-row source (2 updates, 2 inserts), asserting `total=12 / updated=2 / preserved=8 / inserted=2`. Snapshot at `crates/executor/src/tests/sql/ddl/snapshots/merge_into/query_merge_into_mixed_unsorted_multi_row_no_data_loss.snap`.

`cargo test -p executor`: 362 passed, 0 failed, 1 ignored.

Validation against the deployed Lambda
Isolated MERGE repro (the one in #128): 5 consecutive runs against a freshly-deployed lambda built from this branch, all `loss=0`, deterministic.

Larger MERGE: 12.6 M target × 6.3 M source (full snapshot of `snowplow_web_base_sessions_lifecycle_manifest`).
snowplow_web_base_sessions_lifecycle_manifest):End-to-end stretch validation (dbt-snowplow-web)
The original bug surfaced as `snowplow_web_user_mapping` shrinking across rounds in a real dbt-snowplow-web pipeline. Running R1–R5 of that pipeline against the post-fix lambda from a fresh seed (rebuilding `atomic.events` from CTAS, dropping derived/manifest tables, then `loop_dbt.py` with `NUM_ROUNDS=5`):

[Per-round table: `user_mapping` Δ (post-fix) vs `user_mapping` Δ (pre-fix, for reference) over the run windows 04:30–05:00, 05:00–05:30, 05:30–06:00, 06:00–06:30, and 06:30–07:00; the Δ values did not survive this export.]

R3 is where the bug first manifested in the original buggy run: the pre-fix lambda dropped `user_mapping` growth from the expected ~1.7 M to 690 K. With the fix in place, R3 delivers the full +1,700,804 rows. R4–R5 are still running; will update this PR once they land. R1 and R2 match the buggy run because the bug only surfaces once the manifest has accumulated enough rows to trigger the dead fast path.

Out of scope (separate follow-ups)
Two additional issues uncovered while diagnosing this one. Not fixed here, intentionally:
`MergeCOWFilterStream::not_matched_buffer` LRU eviction. Capacity is `max(available_parallelism / THREAD_FILE_RATIO, 2)`, which is ≈ 2 on Lambda. It doesn't fire in the #128 ("MERGE INTO silently drops unmatched target rows when target is unsorted by the ON-key") repros, which use a single target file, but it will silently evict any third-and-later concurrently-buffered file's rows in real workloads with many data files. Should be filed separately with a 3+ data file repro.
dbt-snowplow-web'ssnowplow_web_base_sessions_lifecycle_manifestmodel. With the data-loss fix in place,lifecycle_manifestwill grow correctly to ~14.7 M+ rows by R6, and the model'sprevious_sessionsCTE LEFT JOIN materializes the whole thing — which OOMs the 9 GB Embucket pool. Independent of this PR.🤖 Generated with Claude Code