Skip to content

Docs: add note for day transform#11749

Closed
xxchan wants to merge 1 commit intoapache:mainfrom
xxchan:xxchan/weary-skink
Closed

Docs: add note for day transform#11749
xxchan wants to merge 1 commit intoapache:mainfrom
xxchan:xxchan/weary-skink

Conversation

@xxchan
Copy link
Copy Markdown
Member

@xxchan xxchan commented Dec 11, 2024

This was very confusing

related: apache/iceberg-rust#478, #10616

cc @Fokko @sdd

Signed-off-by: xxchan <xxchan22f@gmail.com>
@github-actions github-actions bot added the Specification Issues that may introduce spec changes. label Dec 11, 2024
Comment thread format/spec.md
| **`year`** | Extract a date or timestamp year, as years from 1970 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
| **`month`** | Extract a date or timestamp month, as months from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
| **`day`** | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
| **`day`** | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` (the physical type should be an `int`, but the the logical type should be a `date`) |
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it's unclear what is physical type versus logical type.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have better suggestions?

Copy link
Copy Markdown

@ZENOTME ZENOTME Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using using date directly here. In the context of iceberg, date is primitive type. ref: https://iceberg.apache.org/spec/#nested-types:~:text=38%20or%20less-,date,-Calendar%20date%20without cc @Fokko Seems there are discussions about this #9345

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I also don't fully understand why we can't use date directly here, since it's a primitive type.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed some time to think about this. I would also lean towards stating that it returns a date. The confusion (and also the terminology) comes from that it is being encoded in Avro where the physical (bytes written to disk) is an int, that's annotated with a logical type that it represents a date.

The relevant piece of code:

@Override
public Type getResultType(Type sourceType) {
if (granularity == ChronoUnit.DAYS) {
return Types.DateType.get();
}
return Types.IntegerType.get();
}

Curious to learn what others think @RussellSpitzer @rdblue @danielcweeks @nastra

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think many previous questions and discussions ended prematurely with "date is also just an integer". Or "physically as int", but displayed as "date", this is also very confusing.

  • Then why not just use date? What may break?
  • What on earth is date?
    • In Avro/Parquet, date is exactly a "logical type"
    • In Iceberg spec, it seems not clear what date is physically, and doesn't require it to be an int.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the spec is clear here for day transform. days from 1970-01-01 is a int. What might be confusing is with date and implementation details, which can be enhanced like apache/iceberg-python#1211

Copy link
Copy Markdown
Member Author

@xxchan xxchan Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the spec is clear here for day transform. days from 1970-01-01 is a int.

I believe this is not clear enough, and has lead to problems repeately in the wild like apache/iceberg-rust#478.

As also mentioned by Fokko, what is now persisted is really an "Avro Date". Parse it by assuming it's an Avro Int will lead to error.

When it inserts data, the reference Java Iceberg implementation writes the Avro manifest files, using an Avro type of Date for the partition struct value.


Actually this looks a case of abstraction leak to me: We didn't specify date is int (days from 1970-01-01).

But the day transform here requires:

  1. The value is int (days from 1970-01-01)
  2. The value should be serialized/displayed as Date (This is not mentioned in the spec here, but is in the reference implementation.)

This implicitly forces date to be int. (And then day transform's return should also be date)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value should be serialized/displayed as Date (This is not mentioned in the spec here, but is in the reference implementation.)

Yes, but since it is serialized with Avro, it will always be an int:

{
  "type": "int",
  "logicalType": "date"
}

Specifying this twice would lead to duplication of the Avro spec into the Iceberg spec. The error in Iceberg-Rust did raise my eyebrow a bit since I would expect it to read the int without the logicalType as well because there is no ambiguity.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also encountered this before in apache/iceberg-python#1208 and apache/iceberg-python#1211. There's also this accompanying devlist thread https://lists.apache.org/thread/2gq7b54nvc9q6f1j08l9lnzgm5onkmx5

@xxchan xxchan changed the title doc: add note for day transform Docs: add note for day transform Dec 12, 2024
@github-actions
Copy link
Copy Markdown

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jan 17, 2025
@github-actions
Copy link
Copy Markdown

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Specification Issues that may introduce spec changes. stale

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants