File System Integration¶
This section covers S3 filesystem integration and object management.
S3 FileSystem¶
- class pyathena.filesystem.s3.S3FileSystem(*args, **kwargs)[source]¶
A filesystem interface for Amazon S3 that implements the fsspec protocol.
This class provides a file-system like interface to Amazon S3, allowing you to use familiar file operations (ls, open, cp, rm, etc.) with S3 objects. It’s designed to be compatible with s3fs while offering PyAthena-specific optimizations.
The filesystem supports standard S3 operations including:
- Listing objects and directories
- Reading and writing files
- Copying and moving objects
- Creating and removing directories
- Multipart uploads for large files
- Various S3 storage classes and encryption options
- session¶
The boto3 session used for S3 operations.
- client¶
The S3 client for direct API calls.
- config¶
Boto3 configuration for the client.
- retry_config¶
Configuration for retry behavior on failed operations.
Example
>>> from pyathena.filesystem.s3 import S3FileSystem
>>> fs = S3FileSystem()
>>>
>>> # List objects in a bucket
>>> files = fs.ls('s3://my-bucket/data/')
>>>
>>> # Read a file
>>> with fs.open('s3://my-bucket/data/file.csv', 'r') as f:
...     content = f.read()
>>>
>>> # Write a file
>>> with fs.open('s3://my-bucket/output/result.txt', 'w') as f:
...     f.write('Hello, S3!')
>>>
>>> # Copy files
>>> fs.cp('s3://source-bucket/file.txt', 's3://dest-bucket/file.txt')
Note
This filesystem is used internally by PyAthena for handling query results stored in S3, but can also be used independently for S3 file operations.
- PATTERN_PATH: Pattern[str] = re.compile('(^s3://|^s3a://|^)(?P<bucket>[a-zA-Z0-9.\\-_]+)(/(?P<key>[^?]+)|/)?($|\\?version(Id|ID|id|_id)=(?P<version_id>.+)$)')¶
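A minimal sketch of how this pattern decomposes an S3 URI into its named groups (the URI value is a placeholder):
>>> m = S3FileSystem.PATTERN_PATH.match("s3://my-bucket/path/to/file.csv")
>>> m.group("bucket"), m.group("key"), m.group("version_id")
('my-bucket', 'path/to/file.csv', None)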
- __init__(connection: 'Connection[Any]' | None = None, default_block_size: int | None = None, default_cache_type: str | None = None, max_workers: int = 20, s3_additional_kwargs=None, *args, **kwargs) None [source]¶
Create and configure file-system instance
Instances may be cachable, so if similar enough arguments are seen a new instance is not required. The token attribute exists to allow implementations to cache instances if they wish.
A reasonable default should be provided if there are no arguments.
Subclasses should call this method.
Parameters¶
- use_listings_cache, listings_expiry_time, max_paths:
Passed to DirCache, if the implementation supports directory listing caching. Pass use_listings_cache=False to disable such caching.
- skip_instance_cache: bool
If this is a cachable implementation, pass True here to force creating a new instance even if a matching instance exists, and prevent storing this instance.
- asynchronous: bool
- loop: asyncio-compatible IOLoop or None
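Example (a minimal construction sketch; credentials are assumed to come from the standard boto3 environment, and the keyword values are illustrative):
>>> from pyathena.filesystem.s3 import S3FileSystem
>>> # Tune parallelism and the default read block size
>>> fs = S3FileSystem(max_workers=10, default_block_size=8 * 1024 * 1024)
>>> # Supply additional S3 request parameters (values are illustrative)
>>> fs_sse = S3FileSystem(s3_additional_kwargs={"ServerSideEncryption": "AES256"})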
- ls(path: str, detail: bool = False, refresh: bool = False, **kwargs) List[S3Object] | List[str] [source]¶
List contents of an S3 path.
Lists buckets (when path is root) or objects within a bucket/prefix. Compatible with fsspec interface for filesystem operations.
- Parameters:
path – S3 path to list (e.g., “s3://bucket” or “s3://bucket/prefix”).
detail – If True, return S3Object instances; if False, return paths as strings.
refresh – If True, bypass cache and fetch fresh results from S3.
**kwargs – Additional arguments (ignored for S3).
- Returns:
List of S3Object instances (if detail=True) or paths as strings (if detail=False).
Example
>>> fs = S3FileSystem()
>>> fs.ls("s3://my-bucket")  # List objects in bucket
>>> fs.ls("s3://my-bucket/", detail=True)  # Get detailed object info
- info(path: str, **kwargs) S3Object [source]¶
Give details of entry at path.
Returns a single dictionary, with exactly the same information as ls would with detail=True.
The default implementation calls ls and could be overridden by a shortcut. kwargs are passed on to ls().
Some file systems might not be able to measure the file's size, in which case, the returned dict will include 'size': None.
Returns¶
dict with keys: name (full path in the FS), size (in bytes), type (file, directory, or something else) and other FS-specific keys.
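Example (a usage sketch; the bucket and key are placeholders):
>>> fs = S3FileSystem()
>>> obj = fs.info("s3://my-bucket/data/file.csv")
>>> obj["name"], obj["size"], obj["type"]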
- find(path: str, maxdepth: int | None = None, withdirs: bool | None = None, detail: bool = False, **kwargs) Dict[str, S3Object] | List[str] [source]¶
Find all files below a given S3 path.
Recursively searches for files under the specified path, with optional depth limiting and directory inclusion. Uses efficient S3 list operations with delimiter handling for performance.
- Parameters:
path – S3 path to search under (e.g., “s3://bucket/prefix”).
maxdepth – Maximum depth to recurse (None for unlimited).
withdirs – Whether to include directories in results (None = default behavior).
detail – If True, return dict of {path: S3Object}; if False, return list of paths.
**kwargs – Additional arguments.
- Returns:
Dictionary mapping paths to S3Objects (if detail=True) or list of paths (if detail=False).
Example
>>> fs = S3FileSystem()
>>> fs.find("s3://bucket/data/", maxdepth=2)  # Limit depth
>>> fs.find("s3://bucket/", withdirs=True)  # Include directories
- exists(path: str, **kwargs) bool [source]¶
Check if an S3 path exists.
Determines whether a bucket, object, or prefix exists in S3. Uses caching and efficient head operations to minimize API calls.
- Parameters:
path – S3 path to check (e.g., “s3://bucket” or “s3://bucket/key”).
**kwargs – Additional arguments (unused).
- Returns:
True if the path exists, False otherwise.
Example
>>> fs = S3FileSystem()
>>> fs.exists("s3://my-bucket/file.txt")
>>> fs.exists("s3://my-bucket/")
- rm(path, recursive=False, maxdepth=None, **kwargs) None [source]¶
Delete files.
Parameters¶
- path: str or list of str
File(s) to delete.
- recursive: bool
If file(s) are directories, recursively delete contents and then also remove the directory.
- maxdepth: int or None
Depth to pass to walk for finding files to delete, if recursive. If None, there will be no limit and infinite recursion may be possible.
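Example (a sketch of both forms described above; paths are placeholders):
>>> fs = S3FileSystem()
>>> fs.rm("s3://my-bucket/tmp/file.txt")  # Delete a single object
>>> fs.rm("s3://my-bucket/tmp/", recursive=True)  # Delete a prefix and its contents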
- touch(path: str, truncate: bool = True, **kwargs) Dict[str, Any] [source]¶
Create empty file, or update timestamp
Parameters¶
- path: str
file location
- truncate: bool
If True, always set file size to 0; if False, update timestamp and leave file unchanged, if the backend allows this.
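Example (a sketch; the key is a placeholder):
>>> fs = S3FileSystem()
>>> fs.touch("s3://my-bucket/markers/_SUCCESS")  # Create an empty object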
- cp_file(path1: str, path2: str, recursive=False, maxdepth=None, on_error=None, **kwargs)[source]¶
Copy an S3 object to another S3 location.
Performs server-side copy of S3 objects, which is more efficient than downloading and re-uploading. Automatically chooses between simple copy and multipart copy based on object size.
- Parameters:
path1 – Source S3 path (s3://bucket/key).
path2 – Destination S3 path (s3://bucket/key).
recursive – Unused parameter for fsspec compatibility.
maxdepth – Unused parameter for fsspec compatibility.
on_error – Unused parameter for fsspec compatibility.
**kwargs – Additional S3 copy parameters (e.g., metadata, storage class).
- Raises:
ValueError – If trying to copy to a versioned file or copy buckets.
Note
Uses multipart copy for objects larger than the maximum part size to optimize performance for large files. The copy operation is performed entirely on the S3 service without data transfer.
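Example (a server-side copy sketch; bucket names and keys are placeholders):
>>> fs = S3FileSystem()
>>> fs.cp_file("s3://source-bucket/data.csv", "s3://dest-bucket/archive/data.csv")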
- cat_file(path: str, start: int | None = None, end: int | None = None, **kwargs) bytes [source]¶
Get the content of a file
Parameters¶
- path: URL of file on this filesystem
- start, end: int
Byte limits of the read. If negative, backwards from end, like usual python slices. Either can be None for start or end of file, respectively.
- kwargs: passed to open().
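Example (a sketch of ranged reads; the path is a placeholder):
>>> fs = S3FileSystem()
>>> header = fs.cat_file("s3://my-bucket/data/file.csv", start=0, end=1024)  # First 1 KiB
>>> tail = fs.cat_file("s3://my-bucket/data/file.csv", start=-100)  # Last 100 bytes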
- put_file(lpath: str, rpath: str, callback=<fsspec.callbacks.NoOpCallback object>, **kwargs)[source]¶
Upload a local file to S3.
Uploads a file from the local filesystem to an S3 location. Supports automatic content type detection based on file extension and provides progress callback functionality.
- Parameters:
lpath – Local file path to upload.
rpath – S3 destination path (s3://bucket/key).
callback – Progress callback for tracking upload progress.
**kwargs – Additional S3 parameters (e.g., ContentType, StorageClass).
Note
Directories are not supported for upload. If lpath is a directory, the method returns without performing any operation. Bucket-only destinations (without key) are also not supported.
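Example (an upload sketch; local and remote paths are placeholders, and the ContentType keyword is only an illustration of additional S3 parameters):
>>> fs = S3FileSystem()
>>> fs.put_file("/tmp/result.csv", "s3://my-bucket/output/result.csv")
>>> fs.put_file("/tmp/report.json", "s3://my-bucket/output/report.json", ContentType="application/json")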
- get_file(rpath: str, lpath: str, callback=<fsspec.callbacks.NoOpCallback object>, outfile=None, **kwargs)[source]¶
Download an S3 file to local filesystem.
Downloads a file from S3 to the local filesystem with progress tracking. Reads the file in chunks to handle large files efficiently.
- Parameters:
rpath – S3 source path (s3://bucket/key).
lpath – Local destination file path.
callback – Progress callback for tracking download progress.
outfile – Unused parameter for fsspec compatibility.
**kwargs – Additional S3 parameters passed to open().
Note
If lpath is a directory, the method returns without performing any operation.
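Example (a download sketch; paths are placeholders):
>>> fs = S3FileSystem()
>>> fs.get_file("s3://my-bucket/data/file.csv", "/tmp/file.csv")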
- checksum(path: str, **kwargs)[source]¶
Get checksum for S3 object or directory.
Computes a checksum for the specified S3 path. For individual objects, returns the ETag converted to an integer. For directories, returns a checksum based on the directory’s tokenized representation.
- Parameters:
path – S3 path (s3://bucket/key) to get checksum for.
**kwargs – Additional arguments, including refresh: if True, refresh cached info before computing the checksum.
- Returns:
Integer checksum value derived from S3 ETag or directory token.
Note
For multipart uploads, ETag format is different and only the first part before the dash is used for checksum calculation.
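Example (a sketch; the path is a placeholder):
>>> fs = S3FileSystem()
>>> fs.checksum("s3://my-bucket/data/file.csv")
>>> fs.checksum("s3://my-bucket/data/file.csv", refresh=True)  # Bypass cached info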
- sign(path: str, expiration: int = 3600, **kwargs)[source]¶
Generate a presigned URL for S3 object access.
Creates a presigned URL that allows temporary access to an S3 object without requiring AWS credentials. Useful for sharing files or providing time-limited access to resources.
- Parameters:
path – S3 path (s3://bucket/key) to generate URL for.
expiration – URL expiration time in seconds. Defaults to 3600 (1 hour).
**kwargs – Additional parameters, including client_method: the S3 operation to sign ('get_object', 'put_object', etc.); defaults to 'get_object'. Remaining parameters are passed to the S3 operation.
- Returns:
Presigned URL string that provides temporary access to the S3 object.
Example
>>> fs = S3FileSystem()
>>> url = fs.sign("s3://my-bucket/file.txt", expiration=7200)
>>> # URL valid for 2 hours
>>>
>>> # Generate upload URL
>>> upload_url = fs.sign(
...     "s3://my-bucket/upload.txt",
...     client_method="put_object"
... )
S3 Objects¶
- class pyathena.filesystem.s3_object.S3Object(init: Dict[str, Any], **kwargs)[source]¶
Represents an S3 object with metadata and filesystem-like properties.
This class provides a dictionary-like interface to S3 object metadata, making it easier to work with S3 objects in filesystem operations. It handles the mapping between S3 API field names and more pythonic property names.
The object supports both dictionary-style access and property-style access to metadata fields like content type, storage class, encryption settings, and object lock configurations.
Example
>>> s3_obj = S3Object({"ContentType": "text/csv", "ContentLength": 1024})
>>> print(s3_obj.content_type)  # "text/csv"
>>> print(s3_obj["content_length"])  # 1024
>>> s3_obj.storage_class = "STANDARD_IA"
Note
This class is primarily used internally by S3FileSystem for representing S3 objects in filesystem operations.
- class pyathena.filesystem.s3_object.S3ObjectType[source]¶
Constants for S3 object types in filesystem operations.
These constants are used to distinguish between directories and files when working with S3 paths through the S3FileSystem interface.
- class pyathena.filesystem.s3_object.S3StorageClass[source]¶
Constants for Amazon S3 storage classes.
S3 storage classes determine the availability, durability, and cost characteristics of stored objects. Each class is optimized for different access patterns and use cases.
- Storage classes:
STANDARD: Default storage for frequently accessed data
REDUCED_REDUNDANCY: Lower cost, reduced durability (deprecated)
STANDARD_IA: Infrequently accessed data with rapid retrieval
ONEZONE_IA: Lower cost IA storage in single availability zone
INTELLIGENT_TIERING: Automatic tiering between frequent/infrequent
GLACIER: Archive storage for long-term backup
DEEP_ARCHIVE: Lowest cost archive storage
GLACIER_IR: Archive with faster retrieval than standard Glacier
OUTPOSTS: Storage on AWS Outposts
See also
AWS S3 storage classes documentation: https://docs.aws.amazon.com/s3/latest/userguide/storage-class-intro.html
S3 Upload Operations¶
- class pyathena.filesystem.s3_object.S3PutObject(response: Dict[str, Any])[source]¶
Represents the response from an S3 PUT object operation.
This class encapsulates the metadata returned when uploading an object to S3, including encryption details, versioning information, and integrity checksums.
- expiration¶
Object expiration time if lifecycle policy applies.
- version_id¶
Version ID if bucket versioning is enabled.
- etag¶
Entity tag for the uploaded object.
- server_side_encryption¶
Server-side encryption method used.
- Various checksum properties
For data integrity verification.
Note
This class is used internally by S3FileSystem operations and typically not instantiated directly by users.
- class pyathena.filesystem.s3_object.S3MultipartUpload(response: Dict[str, Any])[source]¶
Represents an S3 multipart upload operation.
This class manages the metadata for multipart uploads, which allow uploading large files in chunks for better reliability and performance. It tracks upload identifiers, encryption settings, and lifecycle rules.
- bucket¶
S3 bucket name for the upload.
- key¶
Object key being uploaded.
- upload_id¶
Unique identifier for the multipart upload.
- server_side_encryption¶
Encryption method applied to the upload.
- abort_date/abort_rule_id
Lifecycle rule information for upload cleanup.
Note
Used internally by S3FileSystem for large file upload operations.
- class pyathena.filesystem.s3_object.S3MultipartUploadPart(part_number: int, response: Dict[str, Any])[source]¶
Represents a single part in an S3 multipart upload operation.
Each part in a multipart upload has its own metadata including checksums, encryption details, and part identification. This class manages that metadata and provides methods to convert it to API-compatible formats.
- part_number¶
The sequential part number (1-based).
- etag¶
Entity tag for this specific part.
- checksum_*
Various integrity checksums for the part data.
- server_side_encryption¶
Encryption settings for this part.
Note
Parts must be at least 5MB except for the last part. Used internally by S3FileSystem for chunked upload operations.
- class pyathena.filesystem.s3_object.S3CompleteMultipartUpload(response: Dict[str, Any])[source]¶
Represents the completion of an S3 multipart upload operation.
This class encapsulates the final response when a multipart upload is completed, including the final object location, versioning information, and consolidated metadata from all parts.
- location¶
Final S3 URL of the completed object.
- bucket¶
S3 bucket containing the object.
- key¶
Final object key.
- version_id¶
Version ID if bucket versioning is enabled.
- etag¶
Final entity tag of the complete object.
- server_side_encryption¶
Encryption applied to the final object.
Note
This represents the successful completion of a multipart upload. Used internally by S3FileSystem operations.