feat: auto-register ObjectStore and accept it in read/register methods #1476
Draft
mesejo wants to merge 2 commits into apache:main
Closes apache#899. Integrates pyo3-object-store 0.9 to replace hand-rolled store classes and adds two quality-of-life improvements:

1. **Auto-registration**: Every `read_*` / `register_*` call now invokes `try_register_url_store()`, which inspects the URL scheme (s3 / gs / az / http / https) and silently registers an appropriate `ObjectStore`. A guard (`object_store_registry.get_store()`) prevents overwriting a store the user already registered.
2. **Explicit store parameter**: All eight read/register methods on `SessionContext` now accept an optional `object_store` keyword argument. When supplied, the store is registered directly (keyed on the path URL) before the operation runs; the auto-registration path is skipped.

Changes:

- `Cargo.toml` / `crates/core/Cargo.toml`: add `pyo3-object_store 0.9` workspace dependency.
- `crates/core/src/lib.rs`: replace hand-rolled store sub-module registration with `pyo3_object_store::register_store_module()`.
- `crates/core/src/context.rs`:
  - `register_object_store(url, store)` rewritten to accept `PyObjectStore` directly (no more `StorageContexts` enum).
  - New `prepare_store_for_path(path, store)` helper centralises the explicit-vs-auto dispatch.
  - `try_register_url_store` gains an early-return guard.
  - All eight read/register methods gain `object_store: Option<PyObjectStore>`.
- `python/datafusion/object_store.py`: rewritten to re-export `S3Store`, `GCSStore`, `AzureStore`, `HTTPStore`, `LocalStore`, `MemoryStore`, and `from_url` from `pyo3-object-store`, plus backward-compat aliases (`AmazonS3`, `GoogleCloud`, `MicrosoftAzure`, `Http`, `LocalFileSystem`).
- `python/datafusion/context.py`: all eight Python-side methods updated with `object_store: Any | None = None` and docstring entries.
- `python/tests/test_sql.py`: new integration tests covering explicit S3 store, auto-registered S3 URL, HTTP CSV, HTTPS CSV, and HTTPS Parquet (using public `coiled-datasets` bucket and GitHub raw URLs).
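The explicit-vs-auto dispatch described in this commit message can be modelled in a few lines of plain Python. This is only a sketch of the control flow, not the Rust code in `crates/core/src/context.rs`: the `Registry` class, the `scheme://host` key format, and the `"auto:<scheme>"` placeholder stores are assumptions made for illustration. The function names mirror `prepare_store_for_path` and `try_register_url_store` from the change list.

```python
from urllib.parse import urlparse

# Schemes the commit message lists for auto-registration.
AUTO_SCHEMES = {"s3", "gs", "az", "http", "https"}


class Registry:
    """Hypothetical stand-in for the session's object store registry."""

    def __init__(self):
        self._stores = {}

    @staticmethod
    def key_for(path):
        # Assumption: stores are keyed on the "scheme://host" prefix of the URL.
        parsed = urlparse(path)
        return f"{parsed.scheme}://{parsed.netloc}"

    def get_store(self, path):
        return self._stores.get(self.key_for(path))

    def register(self, path, store):
        self._stores[self.key_for(path)] = store


def try_register_url_store(registry, path):
    """Auto-register a placeholder store unless one already exists."""
    scheme = urlparse(path).scheme
    if scheme not in AUTO_SCHEMES:
        return  # local path or unsupported scheme: nothing to do
    if registry.get_store(path) is not None:
        return  # early-return guard: never overwrite an existing store
    registry.register(path, f"auto:{scheme}")


def prepare_store_for_path(registry, path, store=None):
    """Use the explicit store when given, otherwise fall back to auto-registration."""
    if store is not None:
        registry.register(path, store)
    else:
        try_register_url_store(registry, path)
```

In this model, a store passed explicitly replaces whatever was registered for the same prefix, while a later call without a store leaves the existing registration untouched.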
Member
In the "Are there any user-facing changes?" section, can you please describe what changes users need to make or be aware of?
Which issue does this PR close?
Closes #899.
Rationale for this change
Implement the proposed solution in #899.
What changes are included in this PR?
Users no longer need to call `register_object_store()` before reading remote files. Two complementary mechanisms are provided:

Auto-registration from URL scheme
`try_register_url_store()` is called inside every `read_*`/`register_*` method. It parses the path, detects the scheme (s3, gs, az/abfss, http/https) and builds an appropriate `ObjectStore` from environment variables. An existing registration is never overwritten, so an explicit `register_object_store()` call still takes precedence. Anonymous S3 access is enabled via `AWS_SKIP_SIGNATURE=true`/`1` (avoids EC2 IMDS timeouts when not running on AWS).

`object_store` parameter on read/register methods
All eight `read_*`/`register_*` methods (`read_parquet`, `register_parquet`, `read_csv`, `register_csv`, `read_json`, `register_json`, `read_avro`, `register_avro`) now accept an optional `object_store` keyword argument. Passing a store instance registers it for the URL immediately, with no separate call required.

pyo3-object-store integration
Replaced the hand-rolled `store.rs` Python classes with `pyo3-object_store 0.9` (compatible with object_store 0.13), which provides richer, actively maintained Python builders for every backend. The `datafusion.object_store` module now exposes `S3Store`, `GCSStore`, `AzureStore`, `HTTPStore`, `LocalStore`, `MemoryStore`, and `from_url`. Legacy names (`AmazonS3`, `GoogleCloud`, `MicrosoftAzure`, `Http`, `LocalFileSystem`) are kept as backward-compatible aliases. `register_object_store(url, store)` now takes a full URL prefix and a `PyObjectStore` instead of the old `(scheme, StorageContexts, host)` triple, matching the pattern suggested in #899.

Tests
Added integration tests:
- `test_read_http_csv` - reads CSV from GitHub raw HTTPS
- `test_read_https_parquet` - reads Parquet from Apache parquet-testing
- `test_read_s3_parquet_explicit` - passes `S3Store` via `object_store=`
- `test_read_s3_parquet_auto` - uses `AWS_SKIP_SIGNATURE=true` env var

Are there any user-facing changes?
Yes
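To make the auto-registration rules from "What changes are included in this PR?" concrete, here is a small Python sketch of the scheme detection and the anonymous-access switch. It is a model of the described behaviour, not the code in this PR: the helper names `backend_for_url` and `skip_signature_enabled` and the backend labels are hypothetical, and ignoring case and surrounding whitespace in the environment value is an assumption.

```python
from urllib.parse import urlparse

# Scheme-to-backend table taken from the PR description. The labels on the
# right are illustrative names only, not DataFusion identifiers.
SCHEME_TO_BACKEND = {
    "s3": "s3",
    "gs": "gcs",
    "az": "azure",
    "abfss": "azure",
    "http": "http",
    "https": "http",
}


def backend_for_url(path):
    """Return the backend label for a URL, or None for local/unknown paths."""
    scheme = urlparse(path).scheme
    return SCHEME_TO_BACKEND.get(scheme)


def skip_signature_enabled(env):
    """Report whether anonymous S3 access is requested via AWS_SKIP_SIGNATURE.

    Assumption: the value is compared case-insensitively after stripping
    whitespace; the PR description only names the values `true` and `1`.
    """
    value = env.get("AWS_SKIP_SIGNATURE", "")
    return value.strip().lower() in {"true", "1"}
```

A caller would typically pass `os.environ` as `env`; taking the mapping as a parameter keeps the helper easy to exercise without touching the real process environment.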