Filed here because issues are disabled on Embucket/iceberg-rust. The fix lives in that fork.
## Summary
When `datafusion_iceberg::table_scan` receives a filter over a column that Iceberg partitions by `identity(col)` where `pf.name() == pf.source_name()`, it never uses that filter to prune data files. The filter reaches the `TableScan.filters` slot and is visible in the logical plan (`partial_filters=[event_name = Utf8("ad_start_event")]`), but the second-stage `PruneDataFiles` pruner is constructed with `partition_schema` — a subset that only contains the Hive-style partition columns (e.g. `collector_tstamp_day`) — so the column lookup for `event_name` in `PruneDataFiles::min_values` / `max_values` fails and no statistics are returned. Every partition file of the target is read in full.
## Cause
At `datafusion_iceberg/src/table.rs:544-558`, identity-self-named partition fields are intentionally dropped from `partition_column_names` (as they are dropped from `file_partition_fields` upstream at `:443-454` — the parquet reader would otherwise trip over them, since the column is materialized in the file body). The intent, per the comment at `:435-442`, is that identity-partition pruning happens via `PruneDataFiles` instead of `PruneManifests`.
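The drop described above can be modeled with a minimal sketch. The `PartitionField` struct and the sample spec here are illustrative stand-ins, not the crate's actual types; the point is only the `pf.name() == pf.source_name()` filter.

```rust
/// Toy stand-in for an Iceberg partition field (illustrative, not the crate's type).
struct PartitionField {
    name: &'static str,        // partition field name in the spec
    source_name: &'static str, // name of the source column it is derived from
}

/// A hypothetical spec with one transformed field and one identity-self-named field.
fn sample_spec() -> Vec<PartitionField> {
    vec![
        // day(collector_tstamp) -> "collector_tstamp_day": name differs from source
        PartitionField { name: "collector_tstamp_day", source_name: "collector_tstamp" },
        // identity(event_name): partition field name equals the source column name
        PartitionField { name: "event_name", source_name: "event_name" },
    ]
}

/// Mirrors the drop at table.rs:544-558: identity-self-named fields are excluded
/// because the column is materialized in the parquet file body.
fn reduced_partition_columns(spec: &[PartitionField]) -> Vec<&'static str> {
    spec.iter()
        .filter(|pf| pf.name != pf.source_name)
        .map(|pf| pf.name)
        .collect()
}

fn main() {
    let cols = reduced_partition_columns(&sample_spec());
    println!("{cols:?}"); // only the transformed column survives
}
```

So `event_name` is absent from the reduced column list, and therefore from the `partition_schema` built from it.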
But at `:601-605`, `PruneDataFiles::new` is passed `partition_schema` (built from the reduced `table_partition_cols`) as its `arrow_schema`:
```rust
let pruning_predicate =
    PruningPredicate::try_new(physical_predicate, arrow_schema.clone())?;
let files_to_prune = pruning_predicate.prune(&PruneDataFiles::new(
    &schema,
    &partition_schema, // <-- only has partition cols, missing identity-self-named
    &data_files,
))?;
```
`PruneDataFiles::min_values` (in `pruning_statistics.rs:166`) then fails the arrow-schema lookup for any non-partition column, including identity-self-named ones:
```rust
fn min_values(&self, column: &Column) -> Option<ArrayRef> {
    let column_id = self.schema.fields().get_name(&column.name)?.id;
    let datatype = self
        .arrow_schema
        .field_with_name(&column.name) // <-- returns Err for `event_name`
        .ok()?
        .data_type();
    // ...
}
```
So the pruning predicate, which was built against the full `arrow_schema`, asks for stats on `event_name`, gets `None` back, and cannot prune anything. File-level lower/upper bounds for the identity-self-named column, which Iceberg does write into the manifest entries, are never consulted.
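The failure mode reduces to a schema-lookup question. This self-contained sketch models `min_values` with plain string slices instead of arrow types (the schema representation and return value are stand-ins; the `?`-propagation of the failed lookup mirrors the `.ok()?` in `pruning_statistics.rs:166`):

```rust
/// Toy model of PruneDataFiles::min_values: if the column is not present in the
/// arrow schema the pruner was given, the lookup fails and no stats are returned.
fn min_values(arrow_schema: &[&str], column: &str) -> Option<String> {
    // Mirrors pruning_statistics.rs:166 — a failed schema lookup becomes None via `?`.
    arrow_schema.iter().find(|f| **f == column)?;
    Some(format!("min-stats for {column}")) // stand-in for the real ArrayRef
}

fn main() {
    // The reduced schema the pruner receives today: only Hive-style partition cols.
    let partition_schema = ["collector_tstamp_day"];
    // The full table schema the pruning predicate was actually built against.
    let full_schema = ["collector_tstamp_day", "event_name", "app_id"];

    println!("{:?}", min_values(&partition_schema, "event_name")); // None -> no pruning
    println!("{:?}", min_values(&full_schema, "event_name"));      // stats available
}
```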
## Reproducer
- Create an Iceberg target partitioned by `identity(event_name)` (with the partition field name equal to the source column name) containing at least two distinct `event_name` partition values, each with its own data file.
- Issue a SQL query through DataFusion with a filter referencing `event_name`, e.g. `SELECT * FROM t WHERE event_name = 'ad_start_event'`.
- Inspect the returned `ExecutionPlan` (or `EXPLAIN ANALYZE` the query) — every partition file is present in `file_groups`; nothing is pruned.
Expected: `files_ranges_pruned_statistics` > 0 on the physical `DataSourceExec` (equal to `partition_count - 1` when the filter matches a single partition).
## Fix direction
Pass `&arrow_schema` (the full table arrow schema) to `PruneDataFiles::new` instead of `&partition_schema`:
```rust
let files_to_prune = pruning_predicate.prune(&PruneDataFiles::new(
    &schema,
    &arrow_schema, // <-- full table arrow schema instead of partition_schema
    &data_files,
))?;
```
`PruneDataFiles` now has access to every column in the table, not just the partition columns. Column lookups succeed for identity-self-named partitions (and, as a bonus, for any other column with per-file statistics in the manifest — useful for pruning by non-partition columns too). The first-stage `PruneManifests` path continues to handle transformed partition columns like `collector_tstamp_day` via manifest-list partition bounds, so correctness is preserved there.
## Related
- Embucket/iceberg-rust#57 — enables MERGE on partitioned targets in the first place.
- Embucket/embucket#126 — `custom_type_coercion` leaf fallback (unblocked the filter reaching `TableScan`) + `EXPLAIN ANALYZE MERGE` routing + `MergeIntoSinkExec` metrics (revealed this bug).