Similar to #1041
Apache Iceberg version
None
Please describe the bug 🐞
Problem
I want to read files from multiple S3 regions. For example, my metadata files are in `us-west-2` but my data files are in `us-east-1`. This is currently not possible.
Context
Reading a file in `pyarrow` requires a `location` and a file system implementation, `fs`. For example, `location="s3://blah/foo.parquet"` and `fs=S3FileSystem`.
`iceberg-python/pyiceberg/io/pyarrow.py` (lines 404 to 419 at `0cebec4`):

```python
def new_input(self, location: str) -> PyArrowFile:
    """Get a PyArrowFile instance to read bytes from the file at the given location.

    Args:
        location (str): A URI or a path to a local file.

    Returns:
        PyArrowFile: A PyArrowFile instance for the given location.
    """
    scheme, netloc, path = self.parse_location(location)
    return PyArrowFile(
        fs=self.fs_by_scheme(scheme, netloc),
        location=location,
        path=path,
        buffer_size=int(self.properties.get(BUFFER_SIZE, ONE_MEGABYTE)),
    )
```
The `fs` is used to access the files in S3, and is initialized with the given `S3_REGION` according to the S3 configuration.
`iceberg-python/pyiceberg/io/pyarrow.py` (lines 347 to 365 at `0cebec4`):

```python
def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem:
    if scheme in {"s3", "s3a", "s3n"}:
        from pyarrow.fs import S3FileSystem

        client_kwargs: Dict[str, Any] = {
            "endpoint_override": self.properties.get(S3_ENDPOINT),
            "access_key": get_first_property_value(self.properties, S3_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID),
            "secret_key": get_first_property_value(self.properties, S3_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY),
            "session_token": get_first_property_value(self.properties, S3_SESSION_TOKEN, AWS_SESSION_TOKEN),
            "region": get_first_property_value(self.properties, S3_REGION, AWS_REGION),
        }

        if proxy_uri := self.properties.get(S3_PROXY_URI):
            client_kwargs["proxy_options"] = proxy_uri

        if connect_timeout := self.properties.get(S3_CONNECT_TIMEOUT):
            client_kwargs["connect_timeout"] = float(connect_timeout)

        return S3FileSystem(**client_kwargs)
```
This means only one S3 region is allowed at a time.
Possible Solution
Create multiple instances of `S3FileSystem`, one for each region, and fetch the corresponding instance based on `location`. `pyarrow.fs.resolve_s3_region(bucket)` can determine the correct region for a bucket.