Ray Data can read from local disk, cloud storage (S3, GCS, Azure Blob Storage), and a variety of file formats. All read functions return a lazy Dataset: no data is loaded until you consume the results.
import ray

ds = ray.data.read_parquet("s3://bucket/path/")
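Laziness means creating the Dataset only records a plan; the actual reads run when results are consumed. A pure-Python sketch of the same idea (no Ray required, names are illustrative):

```python
# A lazy "dataset": building it records the work; nothing runs yet.
def lazy_read(paths):
    def plan():
        for p in paths:
            # The expensive per-file read would happen here, one block at a time.
            yield f"block-from-{p}"
    return plan

plan = lazy_read(["a.parquet", "b.parquet"])  # no reads performed yet
blocks = list(plan())                         # consumption triggers the reads
```

This is why chaining reads and transforms is cheap: work is deferred until a consuming call.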
Images
ds = ray.data.read_images(
    "s3://bucket/images/",
    size=(224, 224),
    mode="RGB",
)
Each row contains an image column (an HxWxC NumPy array); pass include_paths=True to also get a path column with the source file path.
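Illustrative only: with size=(224, 224) and mode="RGB" as above, each row's image value would have this shape (the array contents and path here are made up, not read by Ray):

```python
import numpy as np

# The layout of one row after read_images(..., size=(224, 224), mode="RGB").
row = {
    "image": np.zeros((224, 224, 3), dtype=np.uint8),  # H x W x C, one byte per channel
    "path": "s3://bucket/images/example.png",          # only with include_paths=True
}
```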
Databases
import psycopg2

ds = ray.data.read_sql(
    "SELECT * FROM events",
    lambda: psycopg2.connect(...),
)
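The second argument only needs to be a zero-argument callable returning a DB-API 2.0 connection. A runnable sketch of that factory pattern using stdlib sqlite3 (the table and rows are made up for illustration):

```python
import sqlite3

def create_connection():
    # Any DB-API 2.0 connection works; an in-memory sqlite3 database
    # keeps the sketch self-contained.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, kind TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?)",
                     [(1, "click"), (2, "view")])
    return conn

# Ray would invoke the factory itself, once per read task:
#   ds = ray.data.read_sql("SELECT * FROM events", create_connection)
rows = create_connection().execute("SELECT * FROM events").fetchall()
```

Passing a factory rather than an open connection lets each read task open its own connection on the worker that runs it.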
For Databricks:
ds = ray.data.read_databricks_tables(
    warehouse_id="...",
    catalog="main",
    schema="default",
    table="events",
)
In-memory objects
import pandas as pd
df = pd.DataFrame({"x": range(100)})
ds = ray.data.from_pandas(df)
Hugging Face
import datasets
ds = ray.data.from_huggingface(datasets.load_dataset("imdb", split="train"))
Custom data sources
Subclass Datasource to implement reads from custom systems.
from ray.data.datasource import Datasource, ReadTask
class MyDatasource(Datasource):
    def estimate_inmemory_data_size(self) -> int | None:
        return None  # Size unknown.

    def get_read_tasks(self, parallelism: int) -> list[ReadTask]:
        ...
ds = ray.data.read_datasource(MyDatasource())
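Conceptually, each read task is a zero-argument callable that returns an iterable of blocks, and Ray schedules the tasks in parallel across the cluster. A stdlib sketch of that contract, using threads in place of Ray workers (all names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def make_read_tasks(parallelism):
    # Each task is a no-arg callable returning an iterable of blocks,
    # mirroring the shape of the ReadTask contract.
    def task(i):
        return lambda: [f"block-{i}"]
    return [task(i) for i in range(parallelism)]

tasks = make_read_tasks(4)
with ThreadPoolExecutor() as pool:
    # pool.map preserves task order, so blocks come back in order.
    blocks = [b for result in pool.map(lambda t: t(), tasks) for b in result]
```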
Authentication
Ray Data delegates authentication to the underlying storage client (s3fs, gcsfs, adlfs). By default, credentials are inferred from the environment; to pass them explicitly, supply a pyarrow filesystem:
import pyarrow.fs as fs
filesystem = fs.S3FileSystem(access_key="...", secret_key="...")
ds = ray.data.read_parquet("bucket/path/", filesystem=filesystem)
Read parallelism
Ray Data picks the number of read tasks based on data size and cluster resources. Override with override_num_blocks:
ds = ray.data.read_parquet("s3://bucket/", override_num_blocks=64)
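With override_num_blocks=64, the rows end up split into 64 roughly equal blocks. A sketch of that even split for a hypothetical 1,000,000-row dataset (Ray's actual block boundaries also depend on file layout and sizes):

```python
def block_sizes(num_rows, num_blocks):
    # Split num_rows into num_blocks nearly equal parts:
    # the first `extra` blocks get one extra row.
    base, extra = divmod(num_rows, num_blocks)
    return [base + (1 if i < extra else 0) for i in range(num_blocks)]

sizes = block_sizes(1_000_000, 64)  # 64 blocks of 15,625 rows each
```

More blocks means more parallelism but more scheduling overhead; fewer blocks means larger, cheaper-to-schedule chunks.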
Next steps
Transforming data
Apply user-defined functions to your dataset.
Saving data
Write datasets back to storage.