[iceberg-rust] Manifest writer declares bucket() partition field as String instead of Int32 #125

@rampage644

Filed here because issues are disabled on `Embucket/iceberg-rust`. The fix lives in that fork.

Summary

When writing to an Iceberg table whose partition spec uses `bucket(N, col)`, the iceberg-rust manifest writer builds the Avro schema for the `data_file.partition` struct with the bucket field typed as a nullable string. But `transform_arrow()` produces an `Int32` hash for `Bucket`, so serialization fails at commit time:

```
Iceberg error: Failed to serialize field 'data_file' for record
Record(RecordSchema { name: Name { name: "manifest_entry", ... },
fields: [... RecordField { name: "data_file", ... schema: Record(RecordSchema {
name: Name { name: "r2", ... },
fields: [... RecordField { name: "partition", ... schema: Record(RecordSchema {
name: Name { name: "r102", ... },
fields: [RecordField {
name: "id_bucket", ...,
schema: Union(UnionSchema { schemas: [Null, String], ... }), ← wrong type
...
}],
...
```

Expected partition field type: `Union(Null, Int)` (matching the `Int32Type` returned by `transform_arrow` for `Bucket`).
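To make the expected shape concrete, here is a minimal sketch of what the Avro field for an optional partition value looks like as JSON. The helper `optional_field_json` is hypothetical and is not part of iceberg-rust. It only illustrates the `["null", "int"]` union the writer should declare for `id_bucket`, and it omits the field-id attributes that real Iceberg manifest schemas also carry.

```rust
/// Hypothetical helper (not an iceberg-rust API): renders the Avro JSON
/// for an optional record field as a union of `null` and one primitive type.
fn optional_field_json(name: &str, avro_type: &str) -> String {
    // `{{` and `}}` are escaped braces; the two `{}` are the placeholders.
    format!(
        r#"{{"name":"{}","type":["null","{}"],"default":null}}"#,
        name, avro_type
    )
}
```

Under this sketch, `optional_field_json("id_bucket", "int")` gives the expected declaration, while `optional_field_json("id_bucket", "string")` corresponds to what the error above shows.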

Repro

Create any Iceberg table partitioned by `bucket(N, col)` via Athena, for example with `WITH (table_type = 'ICEBERG', partitioning = ARRAY['bucket(4, id)'])`. Seed one row. Then, from Embucket, run any MERGE or UPDATE against that table. Observed against the probe table `atomic.merge_test_bucket` on S3 Tables while verifying Embucket/iceberg-rust#57.

Likely location

Wherever `iceberg-rust` derives the partition-struct Avro schema from the Iceberg partition spec. The result type of each partition field should match what `transform_arrow` produces: `Bucket() → Int`, `Truncate() → same as source`, `Day/Month/Year → Int`, `Hour → Int`, `Identity → source type`. This is almost certainly a single `Transform → Iceberg result type` table whose `Bucket` entry is missing or wrong.
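The mapping listed above can be sketched as one `match`. The `Transform` and `PrimitiveType` enums here are simplified, hypothetical stand-ins and not the real iceberg-rust types, and `partition_result_type` is an illustrative name. The point is only that `Bucket` should resolve to `Int` regardless of the source column type.

```rust
// Hypothetical, simplified stand-ins for illustration only;
// these are NOT the actual iceberg-rust definitions.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum PrimitiveType {
    Int,
    Long,
    String,
    Date,
    Timestamp,
}

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Transform {
    Identity,
    Bucket(u32),
    Truncate(u32),
    Year,
    Month,
    Day,
    Hour,
}

/// Result type of a partition field, given its transform and the
/// type of the source column, following the mapping in this issue.
fn partition_result_type(transform: &Transform, source: &PrimitiveType) -> PrimitiveType {
    match transform {
        // Identity and Truncate keep the source column's type.
        Transform::Identity | Transform::Truncate(_) => source.clone(),
        // Bucket always yields a 32-bit hash bucket number.
        Transform::Bucket(_) => PrimitiveType::Int,
        // Temporal transforms yield an integer ordinal.
        Transform::Year | Transform::Month | Transform::Day | Transform::Hour => {
            PrimitiveType::Int
        }
    }
}
```

With this sketch, `partition_result_type(&Transform::Bucket(4), &PrimitiveType::String)` returns `PrimitiveType::Int`, which is the case the manifest writer currently gets wrong.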

Related

Unmasked once Embucket/iceberg-rust#57 landed. Before that, the projection schema mismatch on partitioned targets short-circuited every MERGE before the manifest writer ran.
