fix(merge): preserve target rows when MERGE batch contains only target #129

Closed
rampage644 wants to merge 1 commit into main from rampage644/fix-merge-into-data-loss

Conversation

@rampage644
Contributor

Summary

Fixes #128. MERGE INTO ... WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT was silently dropping unmatched target rows when the target's physical row order wasn't aligned with the join key. The MERGE row counters (number of rows updated / number of rows inserted) reported the correct values, so the loss was invisible to clients without a separate row-count check. The bug is non-deterministic: repeated runs of the same SQL produced different loss counts.

Root cause: a stale fast-path guard in MergeCOWFilterStream::poll_next. The filter maintains two related collections — matching_data_and_manifest_files (matches in the current batch only) and all_matching_data_files (matches in this batch or any prior batch, intersected with current). The "no matches in this batch" fast path short-circuited on matching_data_and_manifest_files.is_empty() alone, without checking all_matching_data_files. So if a target file had been seen as matching in an earlier batch and a later batch contained only target rows for that file, the rows in the later batch were silently dropped. The downstream COW commit then overwrote the original file with the partial result, permanently losing those rows.

Sorting either input by the join key masked the bug: matched and unmatched rows for the same file ended up co-located, so every batch that hit the file also contained a match and the dead path never fired.
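The guard's behavior on that batch shape can be modeled with plain sets (a minimal sketch, not the real stream code; the function names and the HashSet<&str> model are illustrative stand-ins for the two collections named above):

```rust
use std::collections::HashSet;

// `matching_in_batch`: files with a join match in the current batch only.
// `all_matching`: files matched in this batch or any prior batch.
fn buggy_short_circuit(matching_in_batch: &HashSet<&str>, _all_matching: &HashSet<&str>) -> bool {
    // Stale guard: ignores the cumulative set entirely.
    matching_in_batch.is_empty()
}

fn fixed_short_circuit(matching_in_batch: &HashSet<&str>, all_matching: &HashSet<&str>) -> bool {
    // Tightened guard: both sets must be empty before taking the fast path.
    matching_in_batch.is_empty() && all_matching.is_empty()
}

fn main() {
    // Batch 0: file1 matched, so it entered the cumulative set.
    let all_matching: HashSet<&str> = ["file1"].into_iter().collect();
    // Batch 1: target-only rows for file1, no match in this batch.
    let matching_in_batch: HashSet<&str> = HashSet::new();

    // The stale guard short-circuits, and batch 1's target rows are dropped.
    assert!(buggy_short_circuit(&matching_in_batch, &all_matching));
    // The fixed guard falls through to the main filter path instead.
    assert!(!fixed_short_circuit(&matching_in_batch, &all_matching));
}
```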

Fix

A six-line guard tightening: the fast path now also requires all_matching_data_files to be empty before short-circuiting. When a batch belongs to a file already in the overwrite set, it falls through to the main filter path, which builds file_predicate = (path == file1) OR ... OR (path == fileN), ORs it with source_exists, and correctly re-emits the target rows.

-                    if matching_data_and_manifest_files.is_empty() {
+                    if matching_data_and_manifest_files.is_empty()
+                        && all_matching_data_files.is_empty()
+                    {
                         // Return early if all rows only come from source
                         if matching_data_file_array.len() == source_exists_array.len() {
                             return Poll::Ready(Some(Ok(batch)));
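On the fall-through path, a row survives when its file path is in the overwrite set or the row comes from source. A minimal sketch of that predicate over plain boolean vectors (the real code builds Arrow predicates; keep_mask and its signature are illustrative):

```rust
// Keep a row if its file is in the overwrite set:
// (path == file1) OR ... OR (path == fileN), OR the row comes from source.
fn keep_mask(paths: &[&str], source_exists: &[bool], overwrite_set: &[&str]) -> Vec<bool> {
    paths
        .iter()
        .zip(source_exists)
        .map(|(path, &from_source)| overwrite_set.contains(path) || from_source)
        .collect()
}

fn main() {
    // A target-only batch: two rows for file1 (matched in an earlier batch)
    // and one row for file2 (never matched).
    let paths = ["file1", "file1", "file2"];
    let source_exists = [false, false, false];
    let mask = keep_mask(&paths, &source_exists, &["file1"]);
    // file1's rows are re-emitted; the file2 row is filtered out.
    assert_eq!(mask, vec![true, true, false]);
}
```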

Tests added

Unit tests in crates/executor/src/datafusion/physical_plan/merge.rs:

  • matching_then_target — file matched in batch 0, target-only batch in batch 1
  • matching_then_target_then_matching — match → target-only → match again
  • matching_then_multiple_target_batches — match followed by 3 consecutive target-only batches

Each fails on main with assertion mismatches and passes on this branch.
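The batch sequences those tests exercise can be simulated with a small stateful loop (a sketch of the pattern only, not the real test harness; surviving_target_batches and its tuple encoding are invented for illustration):

```rust
use std::collections::HashSet;

// Each step is (files matched in this batch, target files touched by this batch).
// Returns how many target-only batches survive under the fixed guard.
fn surviving_target_batches(batches: &[(Vec<&str>, Vec<&str>)]) -> usize {
    let mut all_matching: HashSet<&str> = HashSet::new();
    let mut survived = 0;
    for (matched, touched) in batches {
        all_matching.extend(matched.iter().copied());
        let match_in_batch = !matched.is_empty();
        if !match_in_batch && all_matching.is_empty() {
            continue; // fixed guard: nothing matched so far, safe to short-circuit
        }
        if !match_in_batch && !touched.is_empty() {
            survived += 1; // target-only batch re-emitted via the main filter path
        }
    }
    survived
}

fn main() {
    // matching_then_target: match in batch 0, target-only batch 1.
    let seq = [(vec!["f1"], vec!["f1"]), (vec![], vec!["f1"])];
    assert_eq!(surviving_target_batches(&seq), 1);
    // matching_then_multiple_target_batches: match, then 3 target-only batches.
    let seq = [
        (vec!["f1"], vec!["f1"]),
        (vec![], vec!["f1"]),
        (vec![], vec!["f1"]),
        (vec![], vec!["f1"]),
    ];
    assert_eq!(surviving_target_batches(&seq), 3);
}
```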

End-to-end SQL snapshot in crates/executor/src/tests/sql/ddl/merge_into.rs:

  • merge_into_mixed_unsorted_multi_row_no_data_loss — 10-row target × 4-row source (2 updates, 2 inserts), asserts total=12 / updated=2 / preserved=8 / inserted=2. Snapshot at crates/executor/src/tests/sql/ddl/snapshots/merge_into/query_merge_into_mixed_unsorted_multi_row_no_data_loss.snap.

cargo test -p executor: 362 passed, 0 failed, 1 ignored.

Validation against the deployed Lambda

Isolated MERGE repro (the one in #128): 5 consecutive runs against a freshly deployed Lambda built from this branch, all deterministic with loss=0:

target=2101659 source=2103638 merge=[1052088, 1051550] post=3153747 expected=3153747 loss=0 dt=3.11s
target=2101659 source=2103638 merge=[1052088, 1051550] post=3153747 expected=3153747 loss=0 dt=3.03s
target=2101659 source=2103638 merge=[1052088, 1051550] post=3153747 expected=3153747 loss=0 dt=2.95s
target=2101659 source=2103638 merge=[1052088, 1051550] post=3153747 expected=3153747 loss=0 dt=3.05s
target=2101659 source=2103638 merge=[1052088, 1051550] post=3153747 expected=3153747 loss=0 dt=3.05s
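The loss column in these lines is consistent with expected = target + rows-inserted (updates rewrite rows in place, so only inserts grow the table). A hedged sketch of that bookkeeping, using illustrative numbers rather than the run above (the actual harness is not part of this PR):

```rust
// Expected post-MERGE row count: updates rewrite existing rows in place,
// inserts add new rows, so only inserts change the total.
fn expected_rows(target: u64, inserted: u64) -> u64 {
    target + inserted
}

fn main() {
    // Illustrative values, not the run above.
    let (target, inserted, post) = (10_u64, 3_u64, 13_u64);
    let expected = expected_rows(target, inserted);
    let loss = expected - post;
    assert_eq!(loss, 0);
    println!("expected={expected} post={post} loss={loss}");
}
```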

Larger MERGE — 12.6 M target × 6.3 M source (full snapshot of snowplow_web_base_sessions_lifecycle_manifest):

target=12598285 source=6295880 merge=[2099470, 4196410] post=14697755 expected=14697755 loss=0 dt=9.42s

End-to-end stretch validation (dbt-snowplow-web)

The original bug surfaced as snowplow_web_user_mapping shrinking across rounds in a real dbt-snowplow-web pipeline. Running R1–R5 of that pipeline against the post-fix lambda from a fresh seed (rebuilding atomic.events from CTAS, dropping derived/manifest tables, then loop_dbt.py with NUM_ROUNDS=5):

Round  Chunk        dbt elapsed  user_mapping Δ (post-fix)  user_mapping Δ (pre-fix, for reference)
1      04:30–05:00  280.9 s      +1,702,223                 +1,702,223
2      05:00–05:30  381.1 s      +1,699,998                 +1,699,998
3      05:30–06:00  386.0 s      +1,700,804                 +690,664 ❌
4      06:00–06:30  running      pending                    +186,950 ❌
5      06:30–07:00  pending      pending                    +611,373 ❌

R3 is where the bug first manifested in the original buggy run: the pre-fix Lambda dropped user_mapping growth from the expected ~1.7 M to 690 K. With the fix in place, R3 delivers the full +1,700,804 rows. R4–R5 are still running; I'll update this PR once they land. R1 and R2 match the buggy run because the bug only surfaces once the manifest has accumulated enough rows to trigger the dead fast path.

Out of scope (separate follow-ups)

Two additional issues were uncovered while diagnosing this one; both are intentionally left out of this PR:

  1. MergeCOWFilterStream::not_matched_buffer LRU eviction. Capacity is max(available_parallelism / THREAD_FILE_RATIO, 2) ≈ 2 on Lambda. It doesn't fire in the #128 repros ("MERGE INTO silently drops unmatched target rows when target is unsorted by the ON-key"; single target file) but will silently evict any third-and-later concurrently-buffered file's rows in real workloads with many data files. Should be filed separately with a 3+ data-file repro.

  2. Round-6 SELECT-phase OOM in dbt-snowplow-web's snowplow_web_base_sessions_lifecycle_manifest model. With the data-loss fix in place, lifecycle_manifest will grow correctly to ~14.7 M+ rows by R6, and the model's previous_sessions CTE LEFT JOIN materializes the whole thing — which OOMs the 9 GB Embucket pool. Independent of this PR.
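The capacity formula in the first follow-up can be sketched as follows; the THREAD_FILE_RATIO value below is a placeholder assumption, since the real constant isn't shown in this PR:

```rust
use std::thread;

// Placeholder assumption: the real THREAD_FILE_RATIO constant is not shown here.
const THREAD_FILE_RATIO: usize = 4;

// Capacity = max(available_parallelism / THREAD_FILE_RATIO, 2).
fn lru_capacity() -> usize {
    let parallelism = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    (parallelism / THREAD_FILE_RATIO).max(2)
}

fn main() {
    let cap = lru_capacity();
    // On a small Lambda this bottoms out at 2, so a third concurrently
    // buffered file's rows would be silently evicted.
    assert!(cap >= 2);
    println!("not_matched_buffer capacity = {cap}");
}
```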

🤖 Generated with Claude Code

The MergeCOWFilterStream "no matches in this batch" fast path
short-circuited on `matching_data_and_manifest_files.is_empty()`
without checking the cumulative `all_matching_data_files` set. If a
target file had been seen as matching in an earlier batch and a later
batch contained only target rows for that file, the rows in the later
batch were silently dropped. The downstream COW commit then overwrote
the original file with the partial result, permanently losing the
unmatched target rows whose batch hit the dead path.

The fix tightens the guard to also require `all_matching_data_files`
to be empty before taking the fast path. When a batch belongs to a
file already in the overwrite set, it falls through to the main
filter path which correctly emits target rows via
`file_predicate OR source_exists`.

Adds three unit tests against MergeCOWFilterStream covering the
matching-then-target patterns, plus a SQL snapshot test that exercises
the same shape end-to-end.

Fixes #128

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rampage644
Contributor Author

Combined into #126 — branch rampage644/fix-merge-partition-transform now carries the data-loss fix as commit ce562f5a on top of the partition-pruning + EXPLAIN routing fixes. Closing this PR to avoid duplication.

@rampage644 rampage644 closed this Apr 15, 2026
@rampage644 rampage644 deleted the rampage644/fix-merge-into-data-loss branch April 15, 2026 21:44
