Apache Arrow Integration¶
This section covers Apache Arrow-specific cursors and data converters.
Arrow Cursors¶
- class pyathena.arrow.cursor.ArrowCursor(s3_staging_dir: str | None = None, schema_name: str | None = None, catalog_name: str | None = None, work_group: str | None = None, poll_interval: float = 1, encryption_option: str | None = None, kms_key: str | None = None, kill_on_interrupt: bool = True, unload: bool = False, result_reuse_enable: bool = False, result_reuse_minutes: int = 60, on_start_query_execution: Callable[[str], None] | None = None, **kwargs)[source]¶
Cursor for handling Apache Arrow Table results from Athena queries.
This cursor returns query results as Apache Arrow Tables, which provide efficient columnar data processing and memory usage. Arrow Tables are especially useful for analytical workloads and data science applications.
The cursor supports both regular CSV-based results and high-performance UNLOAD operations that return results in Parquet format for improved performance with large datasets.
- description¶
Sequence of column descriptions for the last query.
- rowcount¶
Number of rows affected by the last query (-1 for SELECT queries).
- arraysize¶
Default number of rows to fetch with fetchmany().
Example
>>> from pyathena.arrow.cursor import ArrowCursor
>>> cursor = connection.cursor(ArrowCursor)
>>> cursor.execute("SELECT * FROM large_table")
>>> table = cursor.fetchall()  # Returns pyarrow.Table
>>> df = table.to_pandas()  # Convert to pandas if needed
>>>
>>> # High-performance UNLOAD for large datasets
>>> cursor = connection.cursor(ArrowCursor, unload=True)
>>> cursor.execute("SELECT * FROM huge_table")
>>> table = cursor.fetchall()  # Faster Parquet-based result
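The arraysize attribute pairs with fetchmany() for incremental retrieval. A minimal sketch, assuming an existing connection; process() is a hypothetical placeholder for your own handler:
>>> cursor = connection.cursor(ArrowCursor)
>>> cursor.arraysize = 5000  # rows fetched per fetchmany() call
>>> cursor.execute("SELECT * FROM large_table")
>>> batch = cursor.fetchmany()
>>> while batch:
...     process(batch)  # placeholder for your own row handling
...     batch = cursor.fetchmany()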
- __init__(s3_staging_dir: str | None = None, schema_name: str | None = None, catalog_name: str | None = None, work_group: str | None = None, poll_interval: float = 1, encryption_option: str | None = None, kms_key: str | None = None, kill_on_interrupt: bool = True, unload: bool = False, result_reuse_enable: bool = False, result_reuse_minutes: int = 60, on_start_query_execution: Callable[[str], None] | None = None, **kwargs) None [source]¶
- static get_default_converter(unload: bool = False) DefaultArrowTypeConverter | DefaultArrowUnloadTypeConverter | Any [source]¶
Get the default type converter for this cursor class.
- Parameters:
unload – Whether the converter is for UNLOAD operations. Some cursor types may return different converters for UNLOAD operations.
- Returns:
The default type converter instance for this cursor type.
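For illustration, a short sketch of selecting the converter for each mode; the return types follow the signature above:
>>> from pyathena.arrow.cursor import ArrowCursor
>>> converter = ArrowCursor.get_default_converter()  # DefaultArrowTypeConverter
>>> unload_converter = ArrowCursor.get_default_converter(unload=True)  # DefaultArrowUnloadTypeConverter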
- execute(operation: str, parameters: Dict[str, Any] | List[str] | None = None, work_group: str | None = None, s3_staging_dir: str | None = None, cache_size: int | None = 0, cache_expiration_time: int | None = 0, result_reuse_enable: bool | None = None, result_reuse_minutes: int | None = None, paramstyle: str | None = None, on_start_query_execution: Callable[[str], None] | None = None, **kwargs) ArrowCursor [source]¶
Execute a SQL query and return results as Apache Arrow Tables.
Executes the SQL query on Amazon Athena and configures the result set for Apache Arrow Table output. Arrow format provides high-performance columnar data processing with efficient memory usage.
- Parameters:
operation – SQL query string to execute.
parameters – Query parameters for parameterized queries.
work_group – Athena workgroup to use for this query.
s3_staging_dir – S3 location for query results.
cache_size – Number of queries to check for result caching.
cache_expiration_time – Cache expiration time in seconds.
result_reuse_enable – Enable Athena result reuse for this query.
result_reuse_minutes – Minutes to reuse cached results.
paramstyle – Parameter style (‘qmark’ or ‘pyformat’).
on_start_query_execution – Callback called when query starts.
**kwargs – Additional execution parameters.
- Returns:
Self reference for method chaining.
Example
>>> cursor.execute("SELECT * FROM sales WHERE year = 2023")
>>> table = cursor.as_arrow()  # Returns Apache Arrow Table
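Parameterized execution follows the same pattern; a sketch using pyformat style (the table and filter value are hypothetical):
>>> cursor.execute(
...     "SELECT * FROM sales WHERE region = %(region)s",
...     {"region": "EMEA"},
... )
>>> table = cursor.as_arrow()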
- executemany(operation: str, seq_of_parameters: List[Dict[str, Any] | List[str] | None], **kwargs) None [source]¶
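executemany() runs the statement once per parameter set and is intended for DML statements; per the signature it returns None rather than a result set. A minimal sketch (table and values hypothetical):
>>> cursor.executemany(
...     "INSERT INTO sales (region, amount) VALUES (%(region)s, %(amount)d)",
...     [
...         {"region": "EMEA", "amount": 100},
...         {"region": "APAC", "amount": 200},
...     ],
... )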
- DEFAULT_RESULT_REUSE_MINUTES = 60¶
- LIST_DATABASES_MAX_RESULTS = 50¶
- LIST_QUERY_EXECUTIONS_MAX_RESULTS = 50¶
- LIST_TABLE_METADATA_MAX_RESULTS = 50¶
- property connection: Connection[Any]¶
- get_table_metadata(table_name: str, catalog_name: str | None = None, schema_name: str | None = None, logging_: bool = True) AthenaTableMetadata ¶
- list_table_metadata(catalog_name: str | None = None, schema_name: str | None = None, expression: str | None = None, max_results: int | None = None) List[AthenaTableMetadata] ¶
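Both metadata helpers are inherited from the base cursor; a hedged sketch of listing tables in a schema (schema and table names are hypothetical, and the name attribute is assumed from AthenaTableMetadata):
>>> for table in cursor.list_table_metadata(schema_name="default"):
...     print(table.name)
>>> metadata = cursor.get_table_metadata("sales", schema_name="default")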
- setinputsizes(sizes)¶
Does nothing by default
- setoutputsize(size, column=None)¶
Does nothing by default
- as_arrow() Table [source]¶
Return query results as an Apache Arrow Table.
Converts the entire result set into an Apache Arrow Table for efficient columnar data processing. Arrow Tables provide excellent performance for analytical workloads and interoperability with other data processing frameworks.
- Returns:
Apache Arrow Table containing all query results.
- Raises:
ProgrammingError – If no query has been executed or no results are available.
Example
>>> cursor = connection.cursor(ArrowCursor)
>>> cursor.execute("SELECT * FROM my_table")
>>> table = cursor.as_arrow()
>>> print(f"Table has {table.num_rows} rows and {table.num_columns} columns")
- class pyathena.arrow.async_cursor.AsyncArrowCursor(s3_staging_dir: str | None = None, schema_name: str | None = None, catalog_name: str | None = None, work_group: str | None = None, poll_interval: float = 1, encryption_option: str | None = None, kms_key: str | None = None, kill_on_interrupt: bool = True, max_workers: int = 20, arraysize: int = 1000, unload: bool = False, result_reuse_enable: bool = False, result_reuse_minutes: int = 60, **kwargs)[source]¶
Asynchronous cursor that returns results in Apache Arrow format.
This cursor extends AsyncCursor to provide asynchronous query execution with results returned as Apache Arrow Tables or RecordBatches. It’s optimized for high-performance analytics workloads and interoperability with the Apache Arrow ecosystem.
- Features:
Asynchronous query execution with concurrent futures
Apache Arrow columnar data format for high performance
Memory-efficient processing of large datasets
Support for UNLOAD operations with Parquet output
Integration with pandas, Polars, and other Arrow-compatible libraries
- arraysize¶
Number of rows to fetch per batch (configurable).
Example
>>> import asyncio
>>> from pyathena.arrow.async_cursor import AsyncArrowCursor
>>>
>>> cursor = connection.cursor(AsyncArrowCursor, unload=True)
>>> query_id, future = cursor.execute("SELECT * FROM large_table")
>>>
>>> # Get the result when ready (wrap the concurrent future so it can be awaited)
>>> result_set = await asyncio.wrap_future(future)
>>> arrow_table = result_set.as_arrow()
>>>
>>> # Convert to pandas if needed
>>> df = arrow_table.to_pandas()
Note
Requires pyarrow to be installed. UNLOAD operations generate Parquet files in S3 for optimal Arrow compatibility.
- __init__(s3_staging_dir: str | None = None, schema_name: str | None = None, catalog_name: str | None = None, work_group: str | None = None, poll_interval: float = 1, encryption_option: str | None = None, kms_key: str | None = None, kill_on_interrupt: bool = True, max_workers: int = 20, arraysize: int = 1000, unload: bool = False, result_reuse_enable: bool = False, result_reuse_minutes: int = 60, **kwargs) None [source]¶
- static get_default_converter(unload: bool = False) DefaultArrowTypeConverter | DefaultArrowUnloadTypeConverter | Any [source]¶
Get the default type converter for this cursor class.
- Parameters:
unload – Whether the converter is for UNLOAD operations. Some cursor types may return different converters for UNLOAD operations.
- Returns:
The default type converter instance for this cursor type.
- LIST_DATABASES_MAX_RESULTS = 50¶
- LIST_QUERY_EXECUTIONS_MAX_RESULTS = 50¶
- LIST_TABLE_METADATA_MAX_RESULTS = 50¶
- cancel(query_id: str) Future[None] ¶
Cancel a running query asynchronously.
Submits a cancellation request for the specified query. The cancellation itself runs asynchronously in the background.
- Parameters:
query_id – The Athena query execution ID to cancel.
- Returns:
Future object that completes when the cancellation request finishes.
Example
>>> query_id, future = cursor.execute("SELECT * FROM huge_table")
>>> # Later, cancel the query
>>> cancel_future = cursor.cancel(query_id)
>>> cancel_future.result()  # Wait for cancellation to complete
- property connection: Connection[Any]¶
- execute(operation: str, parameters: Dict[str, Any] | List[str] | None = None, work_group: str | None = None, s3_staging_dir: str | None = None, cache_size: int | None = 0, cache_expiration_time: int | None = 0, result_reuse_enable: bool | None = None, result_reuse_minutes: int | None = None, paramstyle: str | None = None, **kwargs) Tuple[str, Future[AthenaArrowResultSet | Any]] [source]¶
Execute a SQL query asynchronously.
Starts query execution on Amazon Athena and returns immediately without waiting for completion. The query runs in the background while your application can continue with other work.
- Parameters:
operation – SQL query string to execute.
parameters – Query parameters (optional).
work_group – Athena workgroup to use (optional).
s3_staging_dir – S3 location for query results (optional).
cache_size – Number of queries to check for result caching (optional).
cache_expiration_time – Cache expiration time in seconds (optional).
result_reuse_enable – Enable result reuse for identical queries (optional).
result_reuse_minutes – Result reuse duration in minutes (optional).
paramstyle – Parameter style to use (optional).
**kwargs – Additional execution parameters.
- Returns:
Tuple of (query_id, future), where query_id is the Athena query execution ID for tracking and future is the Future object for result retrieval.
Example
>>> query_id, future = cursor.execute("SELECT * FROM large_table")
>>> print(f"Query started: {query_id}")
>>> # Do other work while query runs...
>>> result_set = future.result()  # Wait for completion
- executemany(operation: str, seq_of_parameters: List[Dict[str, Any] | List[str] | None], **kwargs) None ¶
Execute multiple queries asynchronously (not supported).
This method is not supported for asynchronous cursors because managing multiple concurrent queries would be complex and resource-intensive.
- Parameters:
operation – SQL query string.
seq_of_parameters – Sequence of parameter sets.
**kwargs – Additional arguments.
- Raises:
NotSupportedError – Always raised as this operation is not supported.
Note
For bulk operations, consider using execute() with parameterized queries or batch processing patterns instead.
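One such pattern is submitting each statement through execute() and collecting the futures; a sketch assuming a list of independent queries:
>>> queries = ["SELECT 1", "SELECT 2"]
>>> submissions = [cursor.execute(q) for q in queries]
>>> for query_id, future in submissions:
...     result_set = future.result()  # waits for each query in turn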
- get_table_metadata(table_name: str, catalog_name: str | None = None, schema_name: str | None = None, logging_: bool = True) AthenaTableMetadata ¶
- list_table_metadata(catalog_name: str | None = None, schema_name: str | None = None, expression: str | None = None, max_results: int | None = None) List[AthenaTableMetadata] ¶
- poll(query_id: str) Future[AthenaQueryExecution] ¶
Poll for query completion asynchronously.
Submits a background task that waits for the query to complete (succeed, fail, or be cancelled) and yields the final execution status. The call itself returns a Future immediately; resolving that future blocks until the query finishes, with the polling running in a background thread.
- Parameters:
query_id – The Athena query execution ID to poll.
- Returns:
Future object containing the final AthenaQueryExecution status.
Note
Resolving the returned future takes time proportional to your query's execution duration, since the background task polls until the query reaches a terminal state.
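A sketch of waiting on poll(), with the state attribute assumed from AthenaQueryExecution:
>>> query_id, _ = cursor.execute("SELECT * FROM large_table")
>>> execution = cursor.poll(query_id).result()  # blocks until a terminal state
>>> print(execution.state)  # e.g. SUCCEEDED, FAILED, or CANCELLED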
- query_execution(query_id: str) Future[AthenaQueryExecution] ¶
Get query execution details asynchronously.
Retrieves the current execution status and metadata for a query. This is useful for monitoring query progress without blocking.
- Parameters:
query_id – The Athena query execution ID.
- Returns:
Future object containing AthenaQueryExecution with query details.
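For non-blocking progress checks, a short sketch (state values follow Athena's API):
>>> status_future = cursor.query_execution(query_id)
>>> execution = status_future.result()
>>> if execution.state == "RUNNING":
...     print("Query still in progress")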
- setinputsizes(sizes)¶
Does nothing by default
- setoutputsize(size, column=None)¶
Does nothing by default
Arrow Data Converters¶
- class pyathena.arrow.converter.DefaultArrowTypeConverter[source]¶
Optimized type converter for Apache Arrow Table results.
This converter is specifically designed for the ArrowCursor and provides optimized type conversion for Apache Arrow’s columnar data format. It converts Athena data types to Python types that are efficiently handled by Apache Arrow.
- The converter focuses on:
Converting date/time types to appropriate Python objects
Handling decimal and binary types for Arrow compatibility
Preserving JSON and complex types
Maintaining high performance for columnar operations
Example
>>> from pyathena.arrow.converter import DefaultArrowTypeConverter
>>> converter = DefaultArrowTypeConverter()
>>>
>>> # Used automatically by ArrowCursor
>>> cursor = connection.cursor(ArrowCursor)
>>> # converter is applied automatically to results
Note
This converter is used by default in ArrowCursor. Most users don’t need to instantiate it directly.
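If you do need custom behavior, a hedged sketch of overriding a single type conversion, assuming register_converter() and the converter keyword behave as in pyathena's base classes (the override itself is purely illustrative):
>>> from pyathena.arrow.converter import DefaultArrowTypeConverter
>>> converter = DefaultArrowTypeConverter()
>>> converter.register_converter("date", lambda value: value)  # illustrative: keep DATE columns as raw strings
>>> cursor = connection.cursor(ArrowCursor, converter=converter)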
- class pyathena.arrow.converter.DefaultArrowUnloadTypeConverter[source]¶
Type converter for Arrow UNLOAD operations.
This converter is designed for use with UNLOAD queries that write results directly to Parquet files in S3. Since UNLOAD operations bypass the normal conversion process and write data in native Parquet format, this converter has minimal functionality.
Note
Used automatically when ArrowCursor is configured with unload=True. UNLOAD results are read directly as Arrow tables from Parquet files.