Pandas Integration

This section covers pandas-specific cursors and data converters.

Pandas Cursors

class pyathena.pandas.cursor.PandasCursor(s3_staging_dir: str | None = None, schema_name: str | None = None, catalog_name: str | None = None, work_group: str | None = None, poll_interval: float = 1, encryption_option: str | None = None, kms_key: str | None = None, kill_on_interrupt: bool = True, unload: bool = False, engine: str = 'auto', chunksize: int | None = None, block_size: int | None = None, cache_type: str | None = None, max_workers: int = 20, result_reuse_enable: bool = False, result_reuse_minutes: int = 60, auto_optimize_chunksize: bool = False, on_start_query_execution: Callable[[str], None] | None = None, **kwargs)[source]

Cursor for handling pandas DataFrame results from Athena queries.

This cursor returns query results as pandas DataFrames with memory-efficient processing through chunking support and automatic chunksize optimization for large result sets. It’s ideal for data analysis and data science workflows.

The cursor supports both regular CSV-based results and high-performance UNLOAD operations that return results in Parquet format, which is significantly faster for large datasets and preserves data types more accurately.

description

Sequence of column descriptions for the last query.

rowcount

Number of rows affected by the last query (-1 for SELECT queries).

arraysize

Default number of rows to fetch with fetchmany().

chunksize

Number of rows per chunk when iterating through results.

Example

>>> from pyathena.pandas.cursor import PandasCursor
>>> cursor = connection.cursor(PandasCursor)
>>> cursor.execute("SELECT * FROM sales_data WHERE year = 2023")
>>> df = cursor.fetchall()  # Returns pandas DataFrame
>>> print(df.describe())

>>> # Memory-efficient iteration for large datasets
>>> cursor.execute("SELECT * FROM huge_table")
>>> for chunk_df in cursor:
...     process_chunk(chunk_df)  # Process data in chunks

>>> # High-performance UNLOAD for large datasets
>>> cursor = connection.cursor(PandasCursor, unload=True)
>>> cursor.execute("SELECT * FROM big_table")
>>> df = cursor.fetchall()  # Faster Parquet-based result

__init__(s3_staging_dir: str | None = None, schema_name: str | None = None, catalog_name: str | None = None, work_group: str | None = None, poll_interval: float = 1, encryption_option: str | None = None, kms_key: str | None = None, kill_on_interrupt: bool = True, unload: bool = False, engine: str = 'auto', chunksize: int | None = None, block_size: int | None = None, cache_type: str | None = None, max_workers: int = 20, result_reuse_enable: bool = False, result_reuse_minutes: int = 60, auto_optimize_chunksize: bool = False, on_start_query_execution: Callable[[str], None] | None = None, **kwargs) None[source]

Initialize PandasCursor with configuration options.

Parameters:
  • s3_staging_dir – S3 directory for query result staging.

  • schema_name – Default schema name for queries.

  • catalog_name – Default catalog name for queries.

  • work_group – Athena workgroup name.

  • poll_interval – Query polling interval in seconds.

  • encryption_option – S3 encryption option.

  • kms_key – KMS key for encryption.

  • kill_on_interrupt – Cancel query on interrupt signal.

  • unload – Use UNLOAD statement for faster result retrieval.

  • engine – CSV parsing engine ('auto', 'c', 'python', 'pyarrow').

  • chunksize – Number of rows per chunk for memory-efficient processing. If specified, takes precedence over auto_optimize_chunksize.

  • block_size – S3 read block size.

  • cache_type – S3 caching strategy.

  • max_workers – Maximum worker threads for parallel processing.

  • result_reuse_enable – Enable query result reuse.

  • result_reuse_minutes – Result reuse duration in minutes.

  • auto_optimize_chunksize – Enable automatic chunksize determination for large files. Only effective when chunksize is None. Default: False (no automatic chunking).

  • on_start_query_execution – Callback for query start events.

  • **kwargs – Additional arguments passed to pandas.read_csv.
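
The options above are normally supplied through connection.cursor(). The following is a minimal configuration sketch, assuming the standard pyathena.connect entry point; the bucket, region, and option values are placeholders:

>>> from pyathena import connect
>>> from pyathena.pandas.cursor import PandasCursor
>>>
>>> conn = connect(s3_staging_dir="s3://example-bucket/athena-results/",  # placeholder bucket
...                region_name="us-east-1")
>>> # Cursor-level options such as unload and chunksize are forwarded to __init__
>>> cursor = conn.cursor(PandasCursor, unload=True, chunksize=100000)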

static get_default_converter(unload: bool = False) DefaultPandasTypeConverter | Any[source]

Get the default type converter for this cursor class.

Parameters:

unload – Whether the converter is for UNLOAD operations. Some cursor types may return different converters for UNLOAD operations.

Returns:

The default type converter instance for this cursor type.
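
As a brief sketch of how the unload flag selects a converter (the mapping to the converter classes documented below is inferred from their descriptions):

>>> from pyathena.pandas.cursor import PandasCursor
>>> csv_converter = PandasCursor.get_default_converter()                 # CSV result path
>>> parquet_converter = PandasCursor.get_default_converter(unload=True)  # UNLOAD/Parquet path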

property arraysize: int
property result_set: AthenaPandasResultSet | None
property query_id: str | None
property rownumber: int | None
close() None[source]
execute(operation: str, parameters: Dict[str, Any] | List[str] | None = None, work_group: str | None = None, s3_staging_dir: str | None = None, cache_size: int | None = 0, cache_expiration_time: int | None = 0, result_reuse_enable: bool | None = None, result_reuse_minutes: int | None = None, paramstyle: str | None = None, keep_default_na: bool = False, na_values: Iterable[str] | None = ('',), quoting: int = 1, on_start_query_execution: Callable[[str], None] | None = None, **kwargs) PandasCursor[source]

Execute a SQL query and return results as pandas DataFrames.

Executes the SQL query on Amazon Athena and configures the result set for pandas DataFrame output. Supports both regular CSV-based results and high-performance UNLOAD operations with Parquet format.

Parameters:
  • operation – SQL query string to execute.

  • parameters – Query parameters for parameterized queries.

  • work_group – Athena workgroup to use for this query.

  • s3_staging_dir – S3 location for query results.

  • cache_size – Number of queries to check for result caching.

  • cache_expiration_time – Cache expiration time in seconds.

  • result_reuse_enable – Enable Athena result reuse for this query.

  • result_reuse_minutes – Minutes to reuse cached results.

  • paramstyle – Parameter style ('qmark' or 'pyformat').

  • keep_default_na – Whether to keep default pandas NA values.

  • na_values – Additional values to treat as NA.

  • quoting – CSV quoting behavior (Python csv.QUOTE_* constants).

  • on_start_query_execution – Callback called when query starts.

  • **kwargs – Additional pandas read_csv/read_parquet parameters.

Returns:

Self reference for method chaining.

Example

>>> cursor.execute("SELECT * FROM sales WHERE year = %(year)s",
...                {"year": 2023})
>>> df = cursor.fetchall()  # Returns pandas DataFrame
executemany(operation: str, seq_of_parameters: List[Dict[str, Any] | List[str] | None], **kwargs) None[source]
cancel() None[source]
fetchone() Tuple[Any | None, ...] | Dict[Any, Any | None] | None[source]
fetchmany(size: int | None = None) List[Tuple[Any | None, ...] | Dict[Any, Any | None]][source]
DEFAULT_FETCH_SIZE: int = 1000
DEFAULT_RESULT_REUSE_MINUTES = 60
LIST_DATABASES_MAX_RESULTS = 50
LIST_QUERY_EXECUTIONS_MAX_RESULTS = 50
LIST_TABLE_METADATA_MAX_RESULTS = 50
property catalog: str | None
property completion_date_time: datetime | None
property connection: Connection[Any]
property data_manifest_location: str | None
property data_scanned_in_bytes: int | None
property database: str | None
property description: List[Tuple[str, str, None, None, int, int, str]] | None
property effective_engine_version: str | None
property encryption_option: str | None
property engine_execution_time_in_millis: int | None
property error_category: int | None
property error_message: str | None
property error_type: int | None
property execution_parameters: List[str]
property expected_bucket_owner: str | None
fetchall() List[Tuple[Any | None, ...] | Dict[Any, Any | None]][source]
get_table_metadata(table_name: str, catalog_name: str | None = None, schema_name: str | None = None, logging_: bool = True) AthenaTableMetadata
property has_result_set: bool
property kms_key: str | None
list_databases(catalog_name: str | None, max_results: int | None = None) List[AthenaDatabase]
list_table_metadata(catalog_name: str | None = None, schema_name: str | None = None, expression: str | None = None, max_results: int | None = None) List[AthenaTableMetadata]
property output_location: str | None
property query: str | None
property query_planning_time_in_millis: int | None
property query_queue_time_in_millis: int | None
property result_reuse_enabled: bool | None
property result_reuse_minutes: int | None
property retryable: bool | None
property reused_previous_result: bool | None
property rowcount: int
property s3_acl_option: str | None
property selected_engine_version: str | None
property service_processing_time_in_millis: int | None
setinputsizes(sizes)

Does nothing by default

setoutputsize(size, column=None)

Does nothing by default

property state: str | None
property state_change_reason: str | None
property statement_type: str | None
property submission_date_time: datetime | None
property substatement_type: str | None
property total_execution_time_in_millis: int | None
property work_group: str | None
as_pandas() 'DataFrame' | DataFrameIterator[source]

Return DataFrame or DataFrameIterator based on chunksize setting.

Returns:

DataFrame when chunksize is None, DataFrameIterator when chunksize is set.
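
A short sketch of the two return modes; the table name is illustrative, and the chaining relies on execute() returning self as documented above:

>>> # chunksize=None: a single DataFrame
>>> cursor = connection.cursor(PandasCursor)
>>> df = cursor.execute("SELECT * FROM sales_data").as_pandas()
>>>
>>> # chunksize set: a DataFrameIterator that yields DataFrames
>>> cursor = connection.cursor(PandasCursor, chunksize=50000)
>>> for chunk in cursor.execute("SELECT * FROM sales_data").as_pandas():
...     process_chunk(chunk)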

iter_chunks() Generator['DataFrame', None, None][source]

Iterate over DataFrame chunks for memory-efficient processing.

This method provides an iterator interface for processing large result sets in chunks, preventing memory exhaustion when working with datasets that are too large to fit in memory as a single DataFrame.

Chunking behavior:
  • If chunksize is explicitly set, that value is used.

  • If auto_optimize_chunksize=True and chunksize=None, an optimal chunksize is determined automatically based on file size.

  • If auto_optimize_chunksize=False and chunksize=None, the entire DataFrame is yielded as a single chunk.

Yields:

DataFrame – Individual chunks of the result set when chunking is enabled, or the entire DataFrame as a single chunk when chunking is disabled.

Examples

# Explicit chunksize
cursor = connection.cursor(PandasCursor, chunksize=50000)
cursor.execute("SELECT * FROM large_table")
for chunk in cursor.iter_chunks():
    process_chunk(chunk)

# Auto-optimization enabled
cursor = connection.cursor(PandasCursor, auto_optimize_chunksize=True)
cursor.execute("SELECT * FROM large_table")
for chunk in cursor.iter_chunks():
    process_chunk(chunk)  # Chunks determined automatically for large files

# No chunking (default behavior)
cursor = connection.cursor(PandasCursor)
cursor.execute("SELECT * FROM large_table")
for chunk in cursor.iter_chunks():
    process_chunk(chunk)  # Single DataFrame regardless of size

class pyathena.pandas.async_cursor.AsyncPandasCursor(s3_staging_dir: str | None = None, schema_name: str | None = None, catalog_name: str | None = None, work_group: str | None = None, poll_interval: float = 1, encryption_option: str | None = None, kms_key: str | None = None, kill_on_interrupt: bool = True, max_workers: int = 20, arraysize: int = 1000, unload: bool = False, engine: str = 'auto', chunksize: int | None = None, result_reuse_enable: bool = False, result_reuse_minutes: int = 60, **kwargs)[source]

Asynchronous cursor that returns results as pandas DataFrames.

This cursor extends AsyncCursor to provide asynchronous query execution with results returned as pandas DataFrames. It’s designed for data analysis workflows where pandas integration is required and non-blocking query execution is beneficial.

Features:
  • Asynchronous query execution with concurrent futures

  • Direct pandas DataFrame results for data analysis

  • Configurable CSV and Parquet engines for optimal performance

  • Support for chunked processing of large datasets

  • UNLOAD operations for improved performance with large results

  • Memory optimization through configurable chunking

arraysize

Number of rows to fetch per batch.

engine

Parsing engine ('auto', 'c', 'python', 'pyarrow').

chunksize

Number of rows per chunk for large datasets.

Example

>>> import asyncio
>>> from pyathena.pandas.async_cursor import AsyncPandasCursor
>>>
>>> cursor = connection.cursor(AsyncPandasCursor, chunksize=10000)
>>> query_id, future = cursor.execute("SELECT * FROM large_table")
>>>
>>> # Get result when ready
>>> result_set = await future
>>> df = result_set.as_pandas()
>>>
>>> # Or iterate through chunks for large datasets
>>> for chunk_df in result_set:
...     process_chunk(chunk_df)

Note

Requires pandas to be installed. For large datasets, consider using chunksize or UNLOAD operations for better memory efficiency.

__init__(s3_staging_dir: str | None = None, schema_name: str | None = None, catalog_name: str | None = None, work_group: str | None = None, poll_interval: float = 1, encryption_option: str | None = None, kms_key: str | None = None, kill_on_interrupt: bool = True, max_workers: int = 20, arraysize: int = 1000, unload: bool = False, engine: str = 'auto', chunksize: int | None = None, result_reuse_enable: bool = False, result_reuse_minutes: int = 60, **kwargs) None[source]
static get_default_converter(unload: bool = False) DefaultPandasTypeConverter | Any[source]

Get the default type converter for this cursor class.

Parameters:

unload – Whether the converter is for UNLOAD operations. Some cursor types may return different converters for UNLOAD operations.

Returns:

The default type converter instance for this cursor type.

property arraysize: int
LIST_DATABASES_MAX_RESULTS = 50
LIST_QUERY_EXECUTIONS_MAX_RESULTS = 50
LIST_TABLE_METADATA_MAX_RESULTS = 50
cancel(query_id: str) Future[None]

Cancel a running query asynchronously.

Submits a cancellation request for the specified query. The cancellation itself runs asynchronously in the background.

Parameters:

query_id – The Athena query execution ID to cancel.

Returns:

Future object that completes when the cancellation request finishes.

Example

>>> query_id, future = cursor.execute("SELECT * FROM huge_table")
>>> # Later, cancel the query
>>> cancel_future = cursor.cancel(query_id)
>>> cancel_future.result()  # Wait for cancellation to complete
close(wait: bool = False) None
property connection: Connection[Any]
description(query_id: str) Future[List[Tuple[str, str, None, None, int, int, str]] | None]
execute(operation: str, parameters: Dict[str, Any] | List[str] | None = None, work_group: str | None = None, s3_staging_dir: str | None = None, cache_size: int | None = 0, cache_expiration_time: int | None = 0, result_reuse_enable: bool | None = None, result_reuse_minutes: int | None = None, paramstyle: str | None = None, keep_default_na: bool = False, na_values: Iterable[str] | None = ('',), quoting: int = 1, **kwargs) Tuple[str, Future[AthenaPandasResultSet | Any]][source]

Execute a SQL query asynchronously.

Starts query execution on Amazon Athena and returns immediately without waiting for completion. The query runs in the background while your application can continue with other work.

Parameters:
  • operation – SQL query string to execute.

  • parameters – Query parameters (optional).

  • work_group – Athena workgroup to use (optional).

  • s3_staging_dir – S3 location for query results (optional).

  • cache_size – Number of recent queries to check for result caching (optional).

  • cache_expiration_time – Cache expiration time in seconds (optional).

  • result_reuse_enable – Enable result reuse for identical queries (optional).

  • result_reuse_minutes – Result reuse duration in minutes (optional).

  • paramstyle – Parameter style to use (optional).

  • **kwargs – Additional execution parameters.

Returns:

  • query_id – Athena query execution ID for tracking.

  • future – Future object for result retrieval.

Return type:

Tuple of (query_id, future)

Example

>>> query_id, future = cursor.execute("SELECT * FROM large_table")
>>> print(f"Query started: {query_id}")
>>> # Do other work while query runs...
>>> result_set = future.result()  # Wait for completion
executemany(operation: str, seq_of_parameters: List[Dict[str, Any] | List[str] | None], **kwargs) None

Execute multiple queries asynchronously (not supported).

This method is not supported for asynchronous cursors because managing multiple concurrent queries would be complex and resource-intensive.

Parameters:
  • operation – SQL query string.

  • seq_of_parameters – Sequence of parameter sets.

  • **kwargs – Additional arguments.

Raises:

NotSupportedError – Always raised as this operation is not supported.

Note

For bulk operations, consider using execute() with parameterized queries or batch processing patterns instead.
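
A hedged sketch of the suggested alternative, submitting one asynchronous execute() per parameter set and collecting the futures; the table and parameter values are placeholders:

>>> param_sets = [{"year": 2021}, {"year": 2022}, {"year": 2023}]
>>> submissions = [
...     cursor.execute("SELECT * FROM sales WHERE year = %(year)s", params)
...     for params in param_sets
... ]
>>> result_sets = [future.result() for _, future in submissions]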

get_table_metadata(table_name: str, catalog_name: str | None = None, schema_name: str | None = None, logging_: bool = True) AthenaTableMetadata
list_databases(catalog_name: str | None, max_results: int | None = None) List[AthenaDatabase]
list_table_metadata(catalog_name: str | None = None, schema_name: str | None = None, expression: str | None = None, max_results: int | None = None) List[AthenaTableMetadata]
poll(query_id: str) Future[AthenaQueryExecution]

Poll for query completion asynchronously.

Waits for the query to complete (succeed, fail, or be cancelled) and returns the final execution status. The polling runs in a background thread; calling result() on the returned future blocks until the query completes.

Parameters:

query_id – The Athena query execution ID to poll.

Returns:

Future object containing the final AthenaQueryExecution status.

Note

This method performs polling internally, so it will take time proportional to your query execution duration.
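
A minimal usage sketch, assuming a query already started with execute(); the printed states are examples of Athena's terminal states:

>>> query_id, _ = cursor.execute("SELECT * FROM sales_data")
>>> status_future = cursor.poll(query_id)
>>> execution = status_future.result()  # Blocks until the query reaches a final state
>>> print(execution.state)  # e.g. SUCCEEDED, FAILED, or CANCELLED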

query_execution(query_id: str) Future[AthenaQueryExecution]

Get query execution details asynchronously.

Retrieves the current execution status and metadata for a query. This is useful for monitoring query progress without blocking.

Parameters:

query_id – The Athena query execution ID.

Returns:

Future object containing AthenaQueryExecution with query details.
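
A brief sketch of non-blocking progress monitoring built on this method; the polling interval is arbitrary:

>>> import time
>>> query_id, future = cursor.execute("SELECT * FROM large_table")
>>> while not future.done():
...     execution = cursor.query_execution(query_id).result()
...     print(f"Current state: {execution.state}")
...     time.sleep(5)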

setinputsizes(sizes)

Does nothing by default

setoutputsize(size, column=None)

Does nothing by default

Pandas Data Converters

class pyathena.pandas.converter.DefaultPandasTypeConverter[source]

Optimized type converter for pandas DataFrame results.

This converter is specifically designed for the PandasCursor and provides optimized type conversion that works well with pandas data types. It minimizes conversions for types that pandas handles efficiently and only converts complex types that need special handling.

The converter focuses on:
  • Preserving numeric types for pandas optimization

  • Converting only complex types (json, binary, etc.)

  • Maintaining compatibility with pandas data type inference

Example

>>> from pyathena.pandas.converter import DefaultPandasTypeConverter
>>> converter = DefaultPandasTypeConverter()
>>>
>>> # Used automatically by PandasCursor
>>> cursor = connection.cursor(PandasCursor)
>>> # converter is applied automatically to results

Note

This converter is used by default in PandasCursor. Most users don’t need to instantiate it directly.

__init__() None[source]
convert(type_: str, value: str | None) Any | None[source]
class pyathena.pandas.converter.DefaultPandasUnloadTypeConverter[source]

Type converter for pandas UNLOAD operations.

This converter is designed for use with UNLOAD queries that write results directly to Parquet files in S3. Since UNLOAD operations bypass the normal conversion process and write data in native Parquet format, this converter has minimal functionality.

Note

Used automatically when PandasCursor is configured with unload=True. UNLOAD results are read directly as DataFrames from Parquet files.
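
A small illustration of the context in which this converter is selected; per the note above no manual setup is required, and the table name is a placeholder:

>>> from pyathena.pandas.cursor import PandasCursor
>>> cursor = connection.cursor(PandasCursor, unload=True)
>>> cursor.execute("SELECT * FROM big_table")
>>> df = cursor.as_pandas()  # Results are read directly from Parquet files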

__init__() None[source]
convert(type_: str, value: str | None) Any | None[source]