Filed here because issues are disabled on `Embucket/iceberg-rust`. The fix lives in that fork.
## Summary
When writing to an Iceberg table whose partition spec uses `bucket(N, col)`, the iceberg-rust manifest writer builds the Avro schema for the `data_file.partition` struct with the bucket field typed as a nullable string. But `transform_arrow()` produces an `Int32` hash for `Bucket`, so serialization fails at commit time:
```
Iceberg error: Failed to serialize field 'data_file' for record
Record(RecordSchema { name: Name { name: "manifest_entry", ... },
fields: [... RecordField { name: "data_file", ... schema: Record(RecordSchema {
name: Name { name: "r2", ... },
fields: [... RecordField { name: "partition", ... schema: Record(RecordSchema {
name: Name { name: "r102", ... },
fields: [RecordField {
name: "id_bucket", ...,
schema: Union(UnionSchema { schemas: [Null, String], ... }), ← wrong type
...
}],
...
```
Expected partition field type: `Union(Null, Int)` (matching the `Int32Type` returned by `transform_arrow` for `Bucket`).
## Repro
1. Create an Iceberg table partitioned by `bucket(N, col)` via Athena: `WITH (table_type = 'ICEBERG', partitioning = ARRAY['bucket(4, id)'])`.
2. Seed one row.
3. From Embucket, run any MERGE or UPDATE against that target.

Seen against the probe table `atomic.merge_test_bucket` on S3 Tables while verifying Embucket/iceberg-rust#57.
## Likely location
Wherever `iceberg-rust` derives the partition-struct Avro schema from the Iceberg partition spec. The per-field result type for each transform should match what `transform_arrow` produces: `Bucket(N) → Int`, `Truncate(W) → same as source`, `Day/Month/Year → Int`, `Hour → Int`, `Identity → source type`. The bug is almost certainly a single `Transform → Iceberg result type` mapping that is missing or wrong for `Bucket`.
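The fix likely amounts to making that mapping explicit. Below is a minimal sketch of the expected `Transform → result type` table; all names are illustrative stand-ins, not the actual iceberg-rust types or API:

```rust
// Hypothetical sketch of the per-transform result-type mapping that the
// partition-struct Avro schema should be derived from. Types and names
// are illustrative only, not the real iceberg-rust definitions.

#[derive(Debug, Clone, PartialEq)]
enum PrimitiveType {
    Int,
    Long,
    String,
    Date,
}

#[derive(Debug, Clone)]
enum Transform {
    Identity,
    Bucket(u32),
    Truncate(u32),
    Year,
    Month,
    Day,
    Hour,
}

/// Result type of applying a transform to a source column type.
/// `Bucket` must always yield Int (the 32-bit hash), regardless of the
/// source type -- the reported bug is consistent with it falling through
/// to a String default instead.
fn result_type(transform: &Transform, source: &PrimitiveType) -> PrimitiveType {
    match transform {
        Transform::Identity | Transform::Truncate(_) => source.clone(),
        Transform::Bucket(_) => PrimitiveType::Int,
        Transform::Year | Transform::Month | Transform::Day | Transform::Hour => {
            PrimitiveType::Int
        }
    }
}

fn main() {
    // bucket(4, id) over a Long column must partition as Int, not String.
    assert_eq!(
        result_type(&Transform::Bucket(4), &PrimitiveType::Long),
        PrimitiveType::Int
    );
    // truncate keeps the source type; identity is a pass-through.
    assert_eq!(
        result_type(&Transform::Truncate(10), &PrimitiveType::String),
        PrimitiveType::String
    );
    assert_eq!(
        result_type(&Transform::Identity, &PrimitiveType::Date),
        PrimitiveType::Date
    );
    println!("ok");
}
```

With a mapping like this in place, the Avro schema for the `id_bucket` partition field would come out as `Union(Null, Int)`, matching the `Int32` values the writer actually produces.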
## Related
Unmasked once Embucket/iceberg-rust#57 landed; before that, the projection schema mismatch on partitioned targets short-circuited every MERGE before the manifest writer ran.