[iceberg-rust] identity-self-named partition columns are excluded from target-side filter pruning #127

@rampage644

Filed here because issues are disabled on Embucket/iceberg-rust. The fix lives in that fork.

Summary

When datafusion_iceberg::table_scan receives a filter over a column that Iceberg partitions by identity(col) where pf.name() == pf.source_name(), it never uses that filter to prune data files. The filter reaches the TableScan.filters slot and is visible in the logical plan (partial_filters=[event_name = Utf8("ad_start_event")]), but the second-stage PruneDataFiles pruner is constructed with partition_schema — a subset that only contains the Hive-style partition columns (e.g. collector_tstamp_day) — so the column lookup for event_name in PruneDataFiles::min_values / max_values fails and no statistics are returned. Every partition file of the target is read in full.

Cause

At datafusion_iceberg/src/table.rs:544-558, identity-self-named partition fields are intentionally dropped from partition_column_names (as they are dropped from file_partition_fields upstream at :443-454 — the parquet reader would otherwise trip over them since the column is materialized in the file body). The intent, per the comment at :435-442, is that identity-partition pruning happens via PruneDataFiles instead of PruneManifests.
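As a rough illustration of the exclusion described above, the following is a toy model in plain Rust. The `PartitionField` and `Transform` types, the `is_identity_self_named` helper, and the assumption that `collector_tstamp_day` is derived from a source column named `collector_tstamp` are all hypothetical and are not the iceberg-rust API; only the condition (identity transform with name equal to source name) comes from this issue.

```rust
// Toy model only: these types are hypothetical stand-ins, not iceberg-rust types.
#[derive(Debug, Clone, PartialEq)]
enum Transform {
    Identity,
    Day,
}

#[derive(Debug, Clone)]
struct PartitionField {
    name: String,
    source_name: String,
    transform: Transform,
}

impl PartitionField {
    // Identity transform whose partition field name equals the source column name.
    fn is_identity_self_named(&self) -> bool {
        self.transform == Transform::Identity && self.name == self.source_name
    }
}

// Keeps only the partition fields that are NOT identity-self-named,
// mirroring the filtering this issue describes for partition_column_names.
fn partition_column_names(fields: &[PartitionField]) -> Vec<String> {
    fields
        .iter()
        .filter(|pf| !pf.is_identity_self_named())
        .map(|pf| pf.name.clone())
        .collect()
}

// Assumed example spec: identity(event_name) plus a day transform.
fn example_fields() -> Vec<PartitionField> {
    vec![
        PartitionField {
            name: "event_name".to_string(),
            source_name: "event_name".to_string(),
            transform: Transform::Identity,
        },
        PartitionField {
            name: "collector_tstamp_day".to_string(),
            source_name: "collector_tstamp".to_string(),
            transform: Transform::Day,
        },
    ]
}

fn main() {
    println!("{:?}", partition_column_names(&example_fields()));
}
```

In this sketch `event_name` never makes it into the partition column list, so any schema built from that list cannot resolve it.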

But at :601-605, PruneDataFiles::new is passed partition_schema (built from the reduced table_partition_cols) as its arrow_schema:

let pruning_predicate =
    PruningPredicate::try_new(physical_predicate, arrow_schema.clone())?;
let files_to_prune = pruning_predicate.prune(&PruneDataFiles::new(
    &schema,
    &partition_schema,  // <-- only has partition cols, missing identity-self-named
    &data_files,
))?;

PruneDataFiles::min_values (in pruning_statistics.rs:166) then fails the arrow-schema lookup for any non-partition column, including identity-self-named ones:

fn min_values(&self, column: &Column) -> Option<ArrayRef> {
    let column_id = self.schema.fields().get_name(&column.name)?.id;
    let datatype = self
        .arrow_schema
        .field_with_name(&column.name)  // <-- returns Err for `event_name`
        .ok()?
        .data_type();
    ...
}

So the pruning predicate, which was built against the full arrow_schema, asks for stats on event_name, gets None back, and cannot prune anything. File-level lower/upper bounds for the identity-self-named column that Iceberg does write into the manifest entries are never consulted.
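The failure mode can be reproduced in miniature without any Iceberg or DataFusion dependency. The sketch below is a toy model: `ToySchema`, `ToyDataFile`, and `toy_file` are hypothetical names, and the free function `min_values` only imitates the shape of the lookup quoted above (schema lookup first, statistics second). It is not the real `PruneDataFiles` code.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for an Arrow schema: only column names matter here.
struct ToySchema {
    columns: Vec<String>,
}

impl ToySchema {
    fn new(columns: &[&str]) -> Self {
        ToySchema {
            columns: columns.iter().map(|c| c.to_string()).collect(),
        }
    }

    fn contains(&self, name: &str) -> bool {
        self.columns.iter().any(|c| c.as_str() == name)
    }
}

// Hypothetical data file carrying per-column lower bounds,
// loosely modeled on what a manifest entry records.
struct ToyDataFile {
    lower_bounds: HashMap<String, String>,
}

fn toy_file(event_name: &str) -> ToyDataFile {
    let mut lower_bounds = HashMap::new();
    lower_bounds.insert("event_name".to_string(), event_name.to_string());
    ToyDataFile { lower_bounds }
}

// If the column is missing from the schema handed to the pruner, no
// statistics are returned at all, even though the files do carry bounds.
fn min_values(
    lookup_schema: &ToySchema,
    files: &[ToyDataFile],
    column: &str,
) -> Option<Vec<Option<String>>> {
    if !lookup_schema.contains(column) {
        return None;
    }
    Some(
        files
            .iter()
            .map(|f| f.lower_bounds.get(column).cloned())
            .collect(),
    )
}

fn main() {
    let files = vec![toy_file("ad_start_event"), toy_file("page_view")];
    // Reduced schema: only the Hive-style partition column.
    let partition_schema = ToySchema::new(&["collector_tstamp_day"]);
    // Full table schema (column set assumed for illustration).
    let full_schema = ToySchema::new(&["event_name", "collector_tstamp"]);

    println!("{:?}", min_values(&partition_schema, &files, "event_name"));
    println!("{:?}", min_values(&full_schema, &files, "event_name"));
}
```

The bounds are identical in both calls; only the schema used for the lookup decides whether they are returned.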

Reproducer

  1. Create an Iceberg target partitioned by identity(event_name) (with the partition field name equal to the source column name) containing at least two distinct event_name partition values, each with its own data file.
  2. Issue a SQL query through DataFusion with a filter referencing event_name, e.g. SELECT * FROM t WHERE event_name = 'ad_start_event'.
  3. Inspect the returned ExecutionPlan (or EXPLAIN ANALYZE the query) — every partition file is present in file_groups; nothing is pruned.

Expected: files_ranges_pruned_statistics > 0 on the physical DataSourceExec (equal to partition_count - 1 when the filter matches a single partition).

Fix direction

Pass &arrow_schema (the full table arrow schema) to PruneDataFiles::new instead of &partition_schema:

let files_to_prune = pruning_predicate.prune(&PruneDataFiles::new(
    &schema,
    &arrow_schema,
    &data_files,
))?;

PruneDataFiles now has access to every column in the table, not just the partition columns. Column lookups succeed for identity-self-named partitions (and, as a bonus, for any other column with per-file statistics in the manifest — useful for pruning by non-partition columns too). The first-stage PruneManifests path continues to handle transformed partition columns like collector_tstamp_day via manifest-list partition bounds, so correctness is preserved there.
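To make the expected effect concrete, here is a minimal sketch of min/max pruning for an equality filter, assuming each data file holds exactly one event_name value so that its lower and upper bounds coincide. `FileBounds`, `may_contain`, `files_to_keep`, and the file paths are hypothetical. The real pruning is done by DataFusion's `PruningPredicate`, which this toy does not attempt to reproduce.

```rust
// Toy model: per-file lower/upper bounds for a single string column.
struct FileBounds {
    path: &'static str,
    min: &'static str,
    max: &'static str,
}

// A file may contain rows with `column == value` only if min <= value <= max.
fn may_contain(file: &FileBounds, value: &str) -> bool {
    file.min <= value && value <= file.max
}

// Returns the files that survive pruning for an equality predicate.
fn files_to_keep<'a>(files: &'a [FileBounds], value: &str) -> Vec<&'a FileBounds> {
    files.iter().filter(|f| may_contain(f, value)).collect()
}

// Assumed layout: one data file per event_name partition value.
fn example_files() -> Vec<FileBounds> {
    vec![
        FileBounds { path: "f0.parquet", min: "ad_start_event", max: "ad_start_event" },
        FileBounds { path: "f1.parquet", min: "page_view", max: "page_view" },
        FileBounds { path: "f2.parquet", min: "video_play", max: "video_play" },
    ]
}

fn main() {
    let files = example_files();
    let kept = files_to_keep(&files, "ad_start_event");
    let pruned = files.len() - kept.len();
    for f in &kept {
        println!("keep {}", f.path);
    }
    println!("pruned {} of {} files", pruned, files.len());
}
```

With three single-value files and a filter matching one of them, two files are pruned, which matches the partition_count - 1 expectation stated in the reproducer.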

Related

  • Embucket/iceberg-rust#57: enables MERGE on partitioned targets in the first place.
  • Embucket/embucket#126: custom_type_coercion leaf fallback (unblocked the filter reaching TableScan) + EXPLAIN ANALYZE MERGE routing + MergeIntoSinkExec metrics (revealed this bug).
