Skip to content

[bug] read from multiple s3 regions #1279

@kevinjqliu

Description

@kevinjqliu

Similar to #1041

Apache Iceberg version

None

Please describe the bug 🐞

Problem

I want to read files from multiple s3 regions. For example, my metadata files are in us-west-2 but my data files are in us-east-1. This is not possible currently.

Context

Reading a file in pyarrow requires a location and a file system implementation, fs. For example, location="s3://blah/foo.parquet" and fs=S3FileSystem.

def new_input(self, location: str) -> PyArrowFile:
"""Get a PyArrowFile instance to read bytes from the file at the given location.
Args:
location (str): A URI or a path to a local file.
Returns:
PyArrowFile: A PyArrowFile instance for the given location.
"""
scheme, netloc, path = self.parse_location(location)
return PyArrowFile(
fs=self.fs_by_scheme(scheme, netloc),
location=location,
path=path,
buffer_size=int(self.properties.get(BUFFER_SIZE, ONE_MEGABYTE)),
)

The fs is used to access the files in s3. And is initialized with the given S3_REGION according to the S3 configuration.

def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem:
if scheme in {"s3", "s3a", "s3n"}:
from pyarrow.fs import S3FileSystem
client_kwargs: Dict[str, Any] = {
"endpoint_override": self.properties.get(S3_ENDPOINT),
"access_key": get_first_property_value(self.properties, S3_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID),
"secret_key": get_first_property_value(self.properties, S3_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY),
"session_token": get_first_property_value(self.properties, S3_SESSION_TOKEN, AWS_SESSION_TOKEN),
"region": get_first_property_value(self.properties, S3_REGION, AWS_REGION),
}
if proxy_uri := self.properties.get(S3_PROXY_URI):
client_kwargs["proxy_options"] = proxy_uri
if connect_timeout := self.properties.get(S3_CONNECT_TIMEOUT):
client_kwargs["connect_timeout"] = float(connect_timeout)
return S3FileSystem(**client_kwargs)

This means only 1 S3 region is allowed.

Possible Solution

Create multiple instances of S3FileSystem, one for each region. And fetch the corresponding instance based on location. pyarrow.fs.resolve_s3_region(bucket) can determine the correct region

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions