feat: auto-register ObjectStore and accept it in read/register methods #1476

Draft
mesejo wants to merge 2 commits into apache:main from mesejo:feat/improve-object-storage-register

@mesejo
Contributor

@mesejo mesejo commented Apr 4, 2026

Which issue does this PR close?

Closes #899.

Rationale for this change

Implements the solution proposed in #899.

What changes are included in this PR?

Users no longer need to call register_object_store() before reading remote files. Two complementary mechanisms are provided:

Auto-registration from URL scheme
try_register_url_store() is called inside every read_* / register_* method. It parses the path, detects the scheme (s3, gs, az/abfss, http/https) and builds an appropriate ObjectStore from environment variables. An existing registration is never overwritten, so an explicit register_object_store() call still takes precedence. Anonymous S3 access is enabled via AWS_SKIP_SIGNATURE=true/1 (avoids EC2 IMDS timeouts when not running on AWS).
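
The behaviour described above can be approximated in a few lines of Python. This is only an illustrative sketch, not the PR's Rust implementation: the `registry` dict, the `SUPPORTED_SCHEMES` set, the `build_store` callback, and the choice to key registrations on scheme plus host are all assumptions made for the example.

```python
from urllib.parse import urlsplit

# Schemes named in the PR description; treating them as a flat set is an assumption.
SUPPORTED_SCHEMES = {"s3", "gs", "az", "abfss", "http", "https"}


def try_register_url_store(registry, path, build_store):
    """Register a store for `path` unless one already exists.

    `registry` is a plain dict standing in for the session's object store
    registry, and `build_store` is a hypothetical factory that would read
    credentials from environment variables.
    Returns True only when a new store was registered.
    """
    parts = urlsplit(path)
    if parts.scheme not in SUPPORTED_SCHEMES:
        return False  # local path or unsupported scheme: nothing to do

    key = (parts.scheme, parts.netloc)
    if key in registry:
        return False  # an existing (explicit) registration takes precedence

    registry[key] = build_store(parts.scheme, parts.netloc)
    return True
```

The early return on an existing key is what lets an explicit `register_object_store()` call win over auto-registration.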

object_store parameter on read/register methods
All eight read_* / register_* methods (read_parquet, register_parquet, read_csv, register_csv, read_json, register_json, read_avro, register_avro) now accept an optional object_store keyword argument. Passing a store instance registers it for the URL immediately, with no separate call required:

from datafusion import SessionContext
from datafusion.object_store import S3Store

ctx = SessionContext()
store = S3Store("my-bucket", region="us-east-1", skip_signature=True)
df = ctx.read_parquet("s3://my-bucket/data.parquet", object_store=store)

pyo3-object_store integration
Replaced the hand-rolled store.rs Python classes with pyo3-object_store 0.9 (compatible with object_store 0.13), which provides richer, actively maintained Python builders for every backend. The datafusion.object_store module now exposes S3Store, GCSStore, AzureStore, HTTPStore, LocalStore, MemoryStore, and from_url. Legacy names (AmazonS3, GoogleCloud, MicrosoftAzure, Http, LocalFileSystem) are kept as backward-compatible aliases.
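
Keeping the legacy names importable can be as simple as binding each old name to the corresponding new class. The sketch below uses empty stand-in classes rather than the real pyo3-object_store types, and the pairing of each legacy name with a new class is inferred from the order of the two lists above, so treat it as an assumption.

```python
# Stand-ins for the classes re-exported from pyo3-object_store (hypothetical bodies).
class S3Store: ...
class GCSStore: ...
class AzureStore: ...
class HTTPStore: ...
class LocalStore: ...


# Legacy names bound to the very same class objects.
AmazonS3 = S3Store
GoogleCloud = GCSStore
MicrosoftAzure = AzureStore
Http = HTTPStore
LocalFileSystem = LocalStore
```

Because an alias is the same class object, `isinstance` checks written against the old names keep working. An alias preserves the import path only; it does not by itself preserve an old constructor signature.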

register_object_store(url, store) now takes a full URL prefix and a PyObjectStore instead of the old (scheme, StorageContexts, host) triple, matching the pattern suggested in #899.

Tests
Added integration tests:

  • test_read_http_csv - reads CSV from GitHub raw HTTPS
  • test_read_https_parquet - reads Parquet from Apache parquet-testing
  • test_read_s3_parquet_explicit - passes S3Store via object_store=
  • test_read_s3_parquet_auto - uses AWS_SKIP_SIGNATURE=true env var
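
The last test depends on the AWS_SKIP_SIGNATURE switch. A rough Python sketch of that check follows, assuming that only the two values named in the description (true and 1) enable it and that the comparison ignores case and surrounding whitespace (the PR does not say either way). The helper name `skip_signature_enabled` is hypothetical.

```python
import os


def skip_signature_enabled(environ=None):
    """Return True when AWS_SKIP_SIGNATURE is set to "true" or "1".

    `environ` defaults to the process environment; passing a dict makes the
    check easy to exercise in tests. Case-insensitivity is an assumption.
    """
    env = os.environ if environ is None else environ
    value = env.get("AWS_SKIP_SIGNATURE", "")
    return value.strip().lower() in {"true", "1"}
```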

Are there any user-facing changes?

Yes

@mesejo mesejo force-pushed the feat/improve-object-storage-register branch 2 times, most recently from 8cc23f3 to 2e5eb98 Compare April 4, 2026 13:51
Closes apache#899. Integrates pyo3-object-store 0.9 to replace hand-rolled
store classes and adds two quality-of-life improvements:

1. **Auto-registration**: Every `read_*` / `register_*` call now
   invokes `try_register_url_store()`, which inspects the URL scheme
   (s3 / gs / az / http / https) and silently registers an appropriate
   `ObjectStore`. A guard (`object_store_registry.get_store()`) prevents
   overwriting a store the user already registered.

2. **Explicit store parameter**: All eight read/register methods on
   `SessionContext` now accept an optional `object_store` keyword
   argument. When supplied the store is registered directly (keyed on
   the path URL) before the operation runs; the auto-registration path
   is skipped.
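
The explicit-versus-auto dispatch in the two points above can be sketched in Python as follows. This is an illustration only, not the Rust helper from this commit: `registry` is a plain dict, `auto_register` is a hypothetical callback standing in for the URL-scheme path, and keying on scheme plus host is an assumption.

```python
from urllib.parse import urlsplit


def prepare_store_for_path(registry, path, store=None, auto_register=None):
    """Pick between the explicit and automatic registration paths.

    Returns "explicit", "auto", or "none" to show which branch ran.
    """
    if store is not None:
        parts = urlsplit(path)
        # Register the supplied store for the path URL (key shape is assumed).
        registry[(parts.scheme, parts.netloc)] = store
        return "explicit"  # the auto-registration path is skipped
    if auto_register is not None:
        auto_register(registry, path)
        return "auto"
    return "none"
```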

Changes:
- `Cargo.toml` / `crates/core/Cargo.toml`: add `pyo3-object_store 0.9`
  workspace dependency.
- `crates/core/src/lib.rs`: replace hand-rolled store sub-module
  registration with `pyo3_object_store::register_store_module()`.
- `crates/core/src/context.rs`:
  - `register_object_store(url, store)` rewritten to accept
    `PyObjectStore` directly (no more `StorageContexts` enum).
  - New `prepare_store_for_path(path, store)` helper centralises the
    explicit-vs-auto dispatch.
  - `try_register_url_store` gains an early-return guard.
  - All eight read/register methods gain `object_store: Option<PyObjectStore>`.
- `python/datafusion/object_store.py`: rewritten to re-export
  `S3Store`, `GCSStore`, `AzureStore`, `HTTPStore`, `LocalStore`,
  `MemoryStore`, and `from_url` from `pyo3-object-store`, plus
  backward-compat aliases (`AmazonS3`, `GoogleCloud`, `MicrosoftAzure`,
  `Http`, `LocalFileSystem`).
- `python/datafusion/context.py`: all eight Python-side methods updated
  with `object_store: Any | None = None` and docstring entries.
- `python/tests/test_sql.py`: new integration tests covering explicit
  S3 store, auto-registered S3 URL, HTTP CSV, HTTPS CSV, and HTTPS
  Parquet (using public `coiled-datasets` bucket and GitHub raw URLs).
@mesejo mesejo force-pushed the feat/improve-object-storage-register branch from 2e5eb98 to 8e21c26 Compare April 4, 2026 13:53
@timsaucer
Member

In the "Are there any user-facing changes?" section, can you please describe what changes users need to make or be aware of?

Development

Successfully merging this pull request may close these issues.

Remove the need for registering an ObjectStore for remote files
