File System Integration

This section covers S3 filesystem integration and object management.

S3 FileSystem

class pyathena.filesystem.s3.S3FileSystem(*args, **kwargs)[source]

A filesystem interface for Amazon S3 that implements the fsspec protocol.

This class provides a file-system like interface to Amazon S3, allowing you to use familiar file operations (ls, open, cp, rm, etc.) with S3 objects. It’s designed to be compatible with s3fs while offering PyAthena-specific optimizations.

The filesystem supports standard S3 operations, including:
  • Listing objects and directories

  • Reading and writing files

  • Copying and moving objects

  • Creating and removing directories

  • Multipart uploads for large files

  • Various S3 storage classes and encryption options

session

The boto3 session used for S3 operations.

client

The S3 client for direct API calls.

config

Boto3 configuration for the client.

retry_config

Configuration for retry behavior on failed operations.

Example

>>> from pyathena.filesystem.s3 import S3FileSystem
>>> fs = S3FileSystem()
>>>
>>> # List objects in a bucket
>>> files = fs.ls('s3://my-bucket/data/')
>>>
>>> # Read a file
>>> with fs.open('s3://my-bucket/data/file.csv', 'r') as f:
...     content = f.read()
>>>
>>> # Write a file
>>> with fs.open('s3://my-bucket/output/result.txt', 'w') as f:
...     f.write('Hello, S3!')
>>>
>>> # Copy files
>>> fs.cp('s3://source-bucket/file.txt', 's3://dest-bucket/file.txt')

Note

This filesystem is used internally by PyAthena for handling query results stored in S3, but can also be used independently for S3 file operations.

MULTIPART_UPLOAD_MIN_PART_SIZE: int = 5242880
MULTIPART_UPLOAD_MAX_PART_SIZE: int = 5368709120
DELETE_OBJECTS_MAX_KEYS: int = 1000
DEFAULT_BLOCK_SIZE: int = 5242880
PATTERN_PATH: Pattern[str] = re.compile('(^s3://|^s3a://|^)(?P<bucket>[a-zA-Z0-9.\\-_]+)(/(?P<key>[^?]+)|/)?($|\\?version(Id|ID|id|_id)=(?P<version_id>.+)$)')
protocol: ClassVar[str | tuple[str, ...]] = ('s3', 's3a')
__init__(connection: 'Connection[Any]' | None = None, default_block_size: int | None = None, default_cache_type: str | None = None, max_workers: int = 20, s3_additional_kwargs=None, *args, **kwargs) None[source]

Create and configure file-system instance

Instances may be cachable, so if similar enough arguments are seen a new instance is not required. The token attribute exists to allow implementations to cache instances if they wish.

A reasonable default should be provided if there are no arguments.

Subclasses should call this method.

Parameters

use_listings_cache, listings_expiry_time, max_paths:

passed to DirCache, if the implementation supports directory listing caching. Pass use_listings_cache=False to disable such caching.

skip_instance_cache: bool

If this is a cachable implementation, pass True here to force creating a new instance even if a matching instance exists, and prevent storing this instance.

asynchronous: bool

loop: asyncio-compatible IOLoop or None
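
A minimal configuration sketch; the values are illustrative, and s3_additional_kwargs is assumed to take a dict of extra parameters forwarded to S3 API calls.

>>> from pyathena.filesystem.s3 import S3FileSystem
>>> fs = S3FileSystem(
...     default_block_size=16 * 1024 * 1024,  # 16 MiB read block size
...     max_workers=8,                        # worker threads (assumed to bound concurrent transfers)
...     s3_additional_kwargs={"ServerSideEncryption": "AES256"},  # assumed pass-through to S3 calls
... )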

static parse_path(path: str) Tuple[str, str | None, str | None][source]
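
Parse an S3 path into its bucket, key, and version ID components. A small illustration, assuming the tuple ordering (bucket, key, version_id) implied by the return annotation and the PATTERN_PATH groups above:

>>> S3FileSystem.parse_path("s3://my-bucket/path/to/file.csv")
('my-bucket', 'path/to/file.csv', None)
>>> S3FileSystem.parse_path("s3://my-bucket/file.csv?versionId=abc123")
('my-bucket', 'file.csv', 'abc123')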
ls(path: str, detail: bool = False, refresh: bool = False, **kwargs) List[S3Object] | List[str][source]

List contents of an S3 path.

Lists buckets (when path is root) or objects within a bucket/prefix. Compatible with fsspec interface for filesystem operations.

Parameters:
  • path – S3 path to list (e.g., “s3://bucket” or “s3://bucket/prefix”).

  • detail – If True, return S3Object instances; if False, return paths as strings.

  • refresh – If True, bypass cache and fetch fresh results from S3.

  • **kwargs – Additional arguments (ignored for S3).

Returns:

List of S3Object instances (if detail=True) or paths as strings (if detail=False).

Example

>>> fs = S3FileSystem()
>>> fs.ls("s3://my-bucket")  # List objects in bucket
>>> fs.ls("s3://my-bucket/", detail=True)  # Get detailed object info
info(path: str, **kwargs) S3Object[source]

Give details of entry at path

Returns a single dictionary, with exactly the same information as ls would with detail=True.

The default implementation calls ls and could be overridden by a shortcut. kwargs are passed on to ls().

Some file systems might not be able to measure the file’s size, in which case, the returned dict will include 'size': None.

Returns

dict with keys: name (full path in the FS), size (in bytes), type (file, directory, or something else) and other FS-specific keys.
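
A brief usage sketch with a placeholder path; the keys shown are the ones documented above:

>>> fs = S3FileSystem()
>>> obj = fs.info("s3://my-bucket/data/file.csv")
>>> obj["name"], obj["size"], obj["type"]  # values depend on the object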

find(path: str, maxdepth: int | None = None, withdirs: bool | None = None, detail: bool = False, **kwargs) Dict[str, S3Object] | List[str][source]

Find all files below a given S3 path.

Recursively searches for files under the specified path, with optional depth limiting and directory inclusion. Uses efficient S3 list operations with delimiter handling for performance.

Parameters:
  • path – S3 path to search under (e.g., “s3://bucket/prefix”).

  • maxdepth – Maximum depth to recurse (None for unlimited).

  • withdirs – Whether to include directories in results (None = default behavior).

  • detail – If True, return dict of {path: S3Object}; if False, return list of paths.

  • **kwargs – Additional arguments.

Returns:

Dictionary mapping paths to S3Objects (if detail=True) or list of paths (if detail=False).

Example

>>> fs = S3FileSystem()
>>> fs.find("s3://bucket/data/", maxdepth=2)  # Limit depth
>>> fs.find("s3://bucket/", withdirs=True)    # Include directories
exists(path: str, **kwargs) bool[source]

Check if an S3 path exists.

Determines whether a bucket, object, or prefix exists in S3. Uses caching and efficient head operations to minimize API calls.

Parameters:
  • path – S3 path to check (e.g., “s3://bucket” or “s3://bucket/key”).

  • **kwargs – Additional arguments (unused).

Returns:

True if the path exists, False otherwise.

Example

>>> fs = S3FileSystem()
>>> fs.exists("s3://my-bucket/file.txt")
>>> fs.exists("s3://my-bucket/")
rm_file(path: str, **kwargs) None[source]

Delete a file

rm(path, recursive=False, maxdepth=None, **kwargs) None[source]

Delete files.

Parameters

path: str or list of str

File(s) to delete.

recursive: bool

If file(s) are directories, recursively delete contents and then also remove the directory

maxdepth: int or None

Depth to pass to walk for finding files to delete, if recursive. If None, there will be no limit and infinite recursion may be possible.
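
A short usage sketch with placeholder paths:

>>> fs = S3FileSystem()
>>> fs.rm("s3://my-bucket/tmp/old.csv")            # delete a single object
>>> fs.rm("s3://my-bucket/tmp/", recursive=True)   # delete a prefix and everything under it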

touch(path: str, truncate: bool = True, **kwargs) Dict[str, Any][source]

Create empty file, or update timestamp

Parameters

path: str

file location

truncate: bool

If True, always set file size to 0; if False, update timestamp and leave file unchanged, if backend allows this
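
For example, creating an empty marker object (placeholder path):

>>> fs = S3FileSystem()
>>> fs.touch("s3://my-bucket/output/_SUCCESS")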

cp_file(path1: str, path2: str, recursive=False, maxdepth=None, on_error=None, **kwargs)[source]

Copy an S3 object to another S3 location.

Performs server-side copy of S3 objects, which is more efficient than downloading and re-uploading. Automatically chooses between simple copy and multipart copy based on object size.

Parameters:
  • path1 – Source S3 path (s3://bucket/key).

  • path2 – Destination S3 path (s3://bucket/key).

  • recursive – Unused parameter for fsspec compatibility.

  • maxdepth – Unused parameter for fsspec compatibility.

  • on_error – Unused parameter for fsspec compatibility.

  • **kwargs – Additional S3 copy parameters (e.g., metadata, storage class).

Raises:

ValueError – If trying to copy to a versioned file or copy buckets.

Note

Uses multipart copy for objects larger than the maximum part size to optimize performance for large files. The copy operation is performed entirely on the S3 service without data transfer.
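
A hedged sketch of a copy with extra S3 parameters; it assumes keyword arguments such as StorageClass are forwarded unchanged to the underlying S3 copy request:

>>> fs = S3FileSystem()
>>> fs.cp_file(
...     "s3://source-bucket/data/file.parquet",
...     "s3://dest-bucket/data/file.parquet",
...     StorageClass="STANDARD_IA",  # assumed pass-through to the S3 copy API
... )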

cat_file(path: str, start: int | None = None, end: int | None = None, **kwargs) bytes[source]

Get the content of a file

Parameters

path: URL of file on this filesystem

start, end: int

Bytes limits of the read. If negative, backwards from end, like usual python slices. Either can be None for start or end of file, respectively.

kwargs: passed to open().
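
For example, reading byte ranges from a placeholder object:

>>> fs = S3FileSystem()
>>> head = fs.cat_file("s3://my-bucket/data/file.csv", start=0, end=1024)  # first 1 KiB
>>> tail = fs.cat_file("s3://my-bucket/data/file.csv", start=-1024)        # last 1 KiB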

put_file(lpath: str, rpath: str, callback=<fsspec.callbacks.NoOpCallback object>, **kwargs)[source]

Upload a local file to S3.

Uploads a file from the local filesystem to an S3 location. Supports automatic content type detection based on file extension and provides progress callback functionality.

Parameters:
  • lpath – Local file path to upload.

  • rpath – S3 destination path (s3://bucket/key).

  • callback – Progress callback for tracking upload progress.

  • **kwargs – Additional S3 parameters (e.g., ContentType, StorageClass).

Note

Directories are not supported for upload. If lpath is a directory, the method returns without performing any operation. Bucket-only destinations (without key) are also not supported.
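
A short upload sketch with placeholder paths; ContentType is one of the optional S3 parameters mentioned above:

>>> fs = S3FileSystem()
>>> fs.put_file(
...     "results/output.csv",
...     "s3://my-bucket/results/output.csv",
...     ContentType="text/csv",
... )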

get_file(rpath: str, lpath: str, callback=<fsspec.callbacks.NoOpCallback object>, outfile=None, **kwargs)[source]

Download an S3 file to local filesystem.

Downloads a file from S3 to the local filesystem with progress tracking. Reads the file in chunks to handle large files efficiently.

Parameters:
  • rpath – S3 source path (s3://bucket/key).

  • lpath – Local destination file path.

  • callback – Progress callback for tracking download progress.

  • outfile – Unused parameter for fsspec compatibility.

  • **kwargs – Additional S3 parameters passed to open().

Note

If lpath is a directory, the method returns without performing any operation.
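
A short download sketch with placeholder paths:

>>> fs = S3FileSystem()
>>> fs.get_file("s3://my-bucket/results/output.csv", "/tmp/output.csv")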

checksum(path: str, **kwargs)[source]

Get checksum for S3 object or directory.

Computes a checksum for the specified S3 path. For individual objects, returns the ETag converted to an integer. For directories, returns a checksum based on the directory’s tokenized representation.

Parameters:
  • path – S3 path (s3://bucket/key) to get checksum for.

  • **kwargs – Additional arguments including: refresh: If True, refresh cached info before computing checksum.

Returns:

Integer checksum value derived from S3 ETag or directory token.

Note

For objects uploaded via multipart upload, the ETag has a different format; only the portion before the dash is used for the checksum calculation.
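
A minimal usage sketch with a placeholder path; refresh is the kwarg documented above:

>>> fs = S3FileSystem()
>>> fs.checksum("s3://my-bucket/data/file.csv")
>>> fs.checksum("s3://my-bucket/data/file.csv", refresh=True)  # bypass cached object info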

sign(path: str, expiration: int = 3600, **kwargs)[source]

Generate a presigned URL for S3 object access.

Creates a presigned URL that allows temporary access to an S3 object without requiring AWS credentials. Useful for sharing files or providing time-limited access to resources.

Parameters:
  • path – S3 path (s3://bucket/key) to generate URL for.

  • expiration – URL expiration time in seconds. Defaults to 3600 (1 hour).

  • **kwargs – Additional parameters, including client_method: the S3 operation to presign (‘get_object’, ‘put_object’, etc.); defaults to ‘get_object’. Any remaining parameters are passed to the S3 operation.

Returns:

Presigned URL string that provides temporary access to the S3 object.

Example

>>> fs = S3FileSystem()
>>> url = fs.sign("s3://my-bucket/file.txt", expiration=7200)
>>> # URL valid for 2 hours
>>>
>>> # Generate upload URL
>>> upload_url = fs.sign(
...     "s3://my-bucket/upload.txt",
...     client_method="put_object"
... )
created(path: str) datetime[source]

Return the created timestamp of a file as a datetime.datetime

modified(path: str) datetime[source]

Return the modified timestamp of a file as a datetime.datetime
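
For example (placeholder path):

>>> fs = S3FileSystem()
>>> fs.modified("s3://my-bucket/data/file.csv")   # datetime of the last modification
>>> fs.created("s3://my-bucket/data/file.csv")    # creation timestamp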

invalidate_cache(path: str | None = None) None[source]

Discard any cached directory information

Parameters

path: string or None

If None, clear all cached listings; otherwise, clear listings at or under the given path.

S3 Objects

class pyathena.filesystem.s3_object.S3Object(init: Dict[str, Any], **kwargs)[source]

Represents an S3 object with metadata and filesystem-like properties.

This class provides a dictionary-like interface to S3 object metadata, making it easier to work with S3 objects in filesystem operations. It handles the mapping between S3 API field names and more pythonic property names.

The object supports both dictionary-style access and property-style access to metadata fields like content type, storage class, encryption settings, and object lock configurations.

Example

>>> s3_obj = S3Object({"ContentType": "text/csv", "ContentLength": 1024})
>>> print(s3_obj.content_type)  # "text/csv"
>>> print(s3_obj["content_length"])  # 1024
>>> s3_obj.storage_class = "STANDARD_IA"

Note

This class is primarily used internally by S3FileSystem for representing S3 objects in filesystem operations.

__init__(init: Dict[str, Any], **kwargs) None[source]
get(k[, d]) D[k] if k in D, else d.  d defaults to None.[source]
to_dict() Dict[str, Any][source]

Convert S3Object to dictionary representation.

Returns:

Deep copy of the object’s attributes as a dictionary.

to_api_repr() Dict[str, Any][source]
class pyathena.filesystem.s3_object.S3ObjectType[source]

Constants for S3 object types in filesystem operations.

These constants are used to distinguish between directories and files when working with S3 paths through the S3FileSystem interface.

S3_OBJECT_TYPE_DIRECTORY: str = 'directory'
S3_OBJECT_TYPE_FILE: str = 'file'
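
A sketch of filtering a listing by object type; it assumes the "type" field of entries returned by ls(detail=True) uses these constants:

>>> from pyathena.filesystem.s3 import S3FileSystem
>>> from pyathena.filesystem.s3_object import S3ObjectType
>>> fs = S3FileSystem()
>>> for obj in fs.ls("s3://my-bucket/data/", detail=True):
...     if obj["type"] == S3ObjectType.S3_OBJECT_TYPE_FILE:
...         print(obj["name"], obj["size"])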
class pyathena.filesystem.s3_object.S3StorageClass[source]

Constants for Amazon S3 storage classes.

S3 storage classes determine the availability, durability, and cost characteristics of stored objects. Each class is optimized for different access patterns and use cases.

Storage classes:
  • STANDARD: Default storage for frequently accessed data

  • REDUCED_REDUNDANCY: Lower cost, reduced durability (deprecated)

  • STANDARD_IA: Infrequently accessed data with rapid retrieval

  • ONEZONE_IA: Lower cost IA storage in single availability zone

  • INTELLIGENT_TIERING: Automatic tiering between frequent/infrequent

  • GLACIER: Archive storage for long-term backup

  • DEEP_ARCHIVE: Lowest cost archive storage

  • GLACIER_IR: Archive with faster retrieval than standard Glacier

  • OUTPOSTS: Storage on AWS Outposts

See also

AWS S3 storage classes documentation: https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-class-intro.html

S3_STORAGE_CLASS_STANDARD: str = 'STANDARD'
S3_STORAGE_CLASS_REDUCED_REDUNDANCY: str = 'REDUCED_REDUNDANCY'
S3_STORAGE_CLASS_STANDARD_IA: str = 'STANDARD_IA'
S3_STORAGE_CLASS_ONEZONE_IA: str = 'ONEZONE_IA'
S3_STORAGE_CLASS_INTELLIGENT_TIERING: str = 'INTELLIGENT_TIERING'
S3_STORAGE_CLASS_GLACIER: str = 'GLACIER'
S3_STORAGE_CLASS_DEEP_ARCHIVE: str = 'DEEP_ARCHIVE'
S3_STORAGE_CLASS_OUTPOSTS: str = 'OUTPOSTS'
S3_STORAGE_CLASS_GLACIER_IR: str = 'GLACIER_IR'
S3_STORAGE_CLASS_BUCKET: str = 'BUCKET'
S3_STORAGE_CLASS_DIRECTORY: str = 'DIRECTORY'
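
A hedged sketch that uploads a file with an explicit storage class; it assumes the StorageClass keyword mentioned in put_file() is forwarded to the S3 request:

>>> from pyathena.filesystem.s3 import S3FileSystem
>>> from pyathena.filesystem.s3_object import S3StorageClass
>>> fs = S3FileSystem()
>>> fs.put_file(
...     "archive/report.csv",
...     "s3://my-bucket/archive/report.csv",
...     StorageClass=S3StorageClass.S3_STORAGE_CLASS_STANDARD_IA,
... )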

S3 Upload Operations

class pyathena.filesystem.s3_object.S3PutObject(response: Dict[str, Any])[source]

Represents the response from an S3 PUT object operation.

This class encapsulates the metadata returned when uploading an object to S3, including encryption details, versioning information, and integrity checksums.

expiration

Object expiration time if lifecycle policy applies.

version_id

Version ID if bucket versioning is enabled.

etag

Entity tag for the uploaded object.

server_side_encryption

Server-side encryption method used.

Various checksum properties

For data integrity verification.

Note

This class is used internally by S3FileSystem operations and typically not instantiated directly by users.

__init__(response: Dict[str, Any]) None[source]
property expiration: str | None
property version_id: str | None
property etag: str | None
property checksum_crc32: str | None
property checksum_crc32c: str | None
property checksum_sha1: str | None
property checksum_sha256: str | None
property server_side_encryption: str | None
property sse_customer_algorithm: str | None
property sse_customer_key_md5: str | None
property sse_kms_key_id: str | None
property sse_kms_encryption_context: str | None
property bucket_key_enabled: bool | None
property request_charged: str | None
to_dict() Dict[str, Any][source]
class pyathena.filesystem.s3_object.S3MultipartUpload(response: Dict[str, Any])[source]

Represents an S3 multipart upload operation.

This class manages the metadata for multipart uploads, which allow uploading large files in chunks for better reliability and performance. It tracks upload identifiers, encryption settings, and lifecycle rules.

bucket

S3 bucket name for the upload.

key

Object key being uploaded.

upload_id

Unique identifier for the multipart upload.

server_side_encryption

Encryption method applied to the upload.

abort_date/abort_rule_id

Lifecycle rule information for upload cleanup.

Note

Used internally by S3FileSystem for large file upload operations.
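
These classes are not typically constructed by hand; a sufficiently large write through the filesystem is what drives the multipart machinery. A hedged sketch, assuming the standard fsspec block_size argument to open() controls the part size (which must be at least MULTIPART_UPLOAD_MIN_PART_SIZE, 5 MiB):

>>> from pyathena.filesystem.s3 import S3FileSystem
>>> fs = S3FileSystem()
>>> with fs.open("s3://my-bucket/big/blob.bin", "wb", block_size=8 * 1024 * 1024) as f:
...     for _ in range(4):
...         _ = f.write(b"\x00" * (8 * 1024 * 1024))  # ~32 MiB total, expected to upload in parts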

__init__(response: Dict[str, Any]) None[source]
property abort_date: datetime | None
property abort_rule_id: str | None
property bucket: str | None
property key: str | None
property upload_id: str | None
property server_side_encryption: str | None
property sse_customer_algorithm: str | None
property sse_customer_key_md5: str | None
property sse_kms_key_id: str | None
property sse_kms_encryption_context: str | None
property bucket_key_enabled: bool | None
property request_charged: str | None
property checksum_algorithm: str | None
class pyathena.filesystem.s3_object.S3MultipartUploadPart(part_number: int, response: Dict[str, Any])[source]

Represents a single part in an S3 multipart upload operation.

Each part in a multipart upload has its own metadata including checksums, encryption details, and part identification. This class manages that metadata and provides methods to convert it to API-compatible formats.

part_number

The sequential part number (1-based).

etag

Entity tag for this specific part.

checksum_*

Various integrity checksums for the part data.

server_side_encryption

Encryption settings for this part.

Note

Parts must be at least 5MB except for the last part. Used internally by S3FileSystem for chunked upload operations.

__init__(part_number: int, response: Dict[str, Any]) None[source]
property part_number: int
property copy_source_version_id: str | None
property last_modified: datetime | None
property etag: str | None
property checksum_crc32: str | None
property checksum_crc32c: str | None
property checksum_sha1: str | None
property checksum_sha256: str | None
property server_side_encryption: str | None
property sse_customer_algorithm: str | None
property sse_customer_key_md5: str | None
property sse_kms_key_id: str | None
property bucket_key_enabled: bool | None
property request_charged: str | None
to_api_repr() Dict[str, Any][source]
class pyathena.filesystem.s3_object.S3CompleteMultipartUpload(response: Dict[str, Any])[source]

Represents the completion of an S3 multipart upload operation.

This class encapsulates the final response when a multipart upload is completed, including the final object location, versioning information, and consolidated metadata from all parts.

location

Final S3 URL of the completed object.

bucket

S3 bucket containing the object.

key

Final object key.

version_id

Version ID if bucket versioning is enabled.

etag

Final entity tag of the complete object.

server_side_encryption

Encryption applied to the final object.

Note

This represents the successful completion of a multipart upload. Used internally by S3FileSystem operations.

__init__(response: Dict[str, Any]) None[source]
property location: str | None
property bucket: str | None
property key: str | None
property expiration: str | None
property version_id: str | None
property etag: str | None
property checksum_crc32: str | None
property checksum_crc32c: str | None
property checksum_sha1: str | None
property checksum_sha256: str | None
property server_side_encryption: str | None
property sse_kms_key_id: str | None
property bucket_key_enabled: bool | None
property request_charged: str | None
to_dict()[source]