Filed here because issues are disabled on Embucket/iceberg-rust. The fix lives in that fork.
## Summary
When `datafusion_iceberg::table_scan` receives a filter over a column that Iceberg partitions by `identity(col)` where `pf.name() == pf.source_name()`, it never uses that filter to prune data files. The filter reaches the `TableScan.filters` slot and is visible in the logical plan (`partial_filters=[event_name = Utf8("ad_start_event")]`), but the second-stage `PruneDataFiles` pruner is constructed with `partition_schema` — a subset that only contains the Hive-style partition columns (e.g. `collector_tstamp_day`) — so the column lookup for `event_name` in `PruneDataFiles::min_values` / `max_values` fails and no statistics are returned. Every partition file of the target is read in full.
## Cause
At `datafusion_iceberg/src/table.rs:544-558`, identity-self-named partition fields are intentionally dropped from `partition_column_names` (as they are dropped from `file_partition_fields` upstream at `:443-454` — the parquet reader would otherwise trip over them, since the column is materialized in the file body). The intent, per the comment at `:435-442`, is that identity-partition pruning happens via `PruneDataFiles` instead of `PruneManifests`.
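The drop described above can be modeled with a minimal sketch. The `PartitionField` struct and the sample spec here are illustrative stand-ins, not the crate's actual types; the point is only the `pf.name() == pf.source_name()` filter.

```rust
/// Toy stand-in for an Iceberg partition field (illustrative, not the crate's type).
struct PartitionField {
    name: &'static str,        // partition field name in the spec
    source_name: &'static str, // name of the source column it is derived from
}

/// A hypothetical spec with one transformed field and one identity-self-named field.
fn sample_spec() -> Vec<PartitionField> {
    vec![
        // day(collector_tstamp) -> "collector_tstamp_day": name differs from source
        PartitionField { name: "collector_tstamp_day", source_name: "collector_tstamp" },
        // identity(event_name): partition field name equals the source column name
        PartitionField { name: "event_name", source_name: "event_name" },
    ]
}

/// Mirrors the drop at table.rs:544-558: identity-self-named fields are excluded
/// because the column is materialized in the parquet file body.
fn reduced_partition_columns(spec: &[PartitionField]) -> Vec<&'static str> {
    spec.iter()
        .filter(|pf| pf.name != pf.source_name)
        .map(|pf| pf.name)
        .collect()
}

fn main() {
    let cols = reduced_partition_columns(&sample_spec());
    println!("{cols:?}"); // only the transformed column survives
}
```

So `event_name` is absent from the reduced column list, and therefore from the `partition_schema` built from it.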
But at `:601-605`, `PruneDataFiles::new` is passed `partition_schema` (built from the reduced `table_partition_cols`) as its `arrow_schema`:
```rust
let pruning_predicate =
    PruningPredicate::try_new(physical_predicate, arrow_schema.clone())?;
let files_to_prune = pruning_predicate.prune(&PruneDataFiles::new(
    &schema,
    &partition_schema, // <-- only has partition cols, missing identity-self-named
    &data_files,
))?;
```
`PruneDataFiles::min_values` (in `pruning_statistics.rs:166`) then fails the arrow-schema lookup for any non-partition column, including identity-self-named ones:
```rust
fn min_values(&self, column: &Column) -> Option<ArrayRef> {
    let column_id = self.schema.fields().get_name(&column.name)?.id;
    let datatype = self
        .arrow_schema
        .field_with_name(&column.name) // <-- returns Err for `event_name`
        .ok()?
        .data_type();
    // ...
}
```
So the pruning predicate, which was built against the full `arrow_schema`, asks for stats on `event_name`, gets `None` back, and cannot prune anything. File-level lower/upper bounds for the identity-self-named column, which Iceberg does write into the manifest entries, are never consulted.
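The failure mode reduces to a schema-lookup question. This self-contained sketch models `min_values` with plain string slices instead of arrow types (the schema representation and return value are stand-ins; the `?`-propagation of the failed lookup mirrors the `.ok()?` in `pruning_statistics.rs:166`):

```rust
/// Toy model of PruneDataFiles::min_values: if the column is not present in the
/// arrow schema the pruner was given, the lookup fails and no stats are returned.
fn min_values(arrow_schema: &[&str], column: &str) -> Option<String> {
    // Mirrors pruning_statistics.rs:166 — a failed schema lookup becomes None via `?`.
    arrow_schema.iter().find(|f| **f == column)?;
    Some(format!("min-stats for {column}")) // stand-in for the real ArrayRef
}

fn main() {
    // The reduced schema the pruner receives today: only Hive-style partition cols.
    let partition_schema = ["collector_tstamp_day"];
    // The full table schema the pruning predicate was actually built against.
    let full_schema = ["collector_tstamp_day", "event_name", "app_id"];

    println!("{:?}", min_values(&partition_schema, "event_name")); // None -> no pruning
    println!("{:?}", min_values(&full_schema, "event_name"));      // stats available
}
```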
## Reproducer
- Create an Iceberg target partitioned by `identity(event_name)` (with the partition field name equal to the source column name) containing at least two distinct `event_name` partition values, each with its own data file.
- Issue a SQL query through DataFusion with a filter referencing `event_name`, e.g. `SELECT * FROM t WHERE event_name = 'ad_start_event'`.
- Inspect the returned `ExecutionPlan` (or `EXPLAIN ANALYZE` the query) — every partition file is present in `file_groups`; nothing is pruned.
Expected: `files_ranges_pruned_statistics` > 0 on the physical `DataSourceExec` (equal to `partition_count - 1` when the filter matches a single partition).
## Fix direction
Pass `&arrow_schema` (the full table arrow schema) to `PruneDataFiles::new` instead of `&partition_schema`:
```rust
let files_to_prune = pruning_predicate.prune(&PruneDataFiles::new(
    &schema,
    &arrow_schema, // <-- full table arrow schema instead of partition_schema
    &data_files,
))?;
```
`PruneDataFiles` now has access to every column in the table, not just the partition columns. Column lookups succeed for identity-self-named partitions (and, as a bonus, for any other column with per-file statistics in the manifest — useful for pruning by non-partition columns too). The first-stage `PruneManifests` path continues to handle transformed partition columns like `collector_tstamp_day` via manifest-list partition bounds, so correctness is preserved there.
## Related
- Embucket/iceberg-rust#57 — enables MERGE on partitioned targets in the first place.
- Embucket/embucket#126 — `custom_type_coercion` leaf fallback (unblocked the filter reaching `TableScan`) + `EXPLAIN ANALYZE MERGE` routing + `MergeIntoSinkExec` metrics (revealed this bug).