Ray Data can read from local disk and cloud storage (S3, GCS, Azure Blob) in many formats, including Parquet, CSV, JSON, and images. All read functions return a lazy Dataset: no data is read until results are consumed.

File formats

import ray

ds = ray.data.read_parquet("s3://bucket/path/")

Images

ds = ray.data.read_images(
    "s3://bucket/images/",
    size=(224, 224),
    mode="RGB",
)
Each row contains an image column (HxWxC numpy array) and a path column.

Databases

ds = ray.data.read_sql(
    "SELECT * FROM events",
    connection_factory=lambda: psycopg2.connect(...),
)
For Databricks:
ds = ray.data.read_databricks_tables(
    catalog="main",
    schema="default",
    table="events",
)

In-memory objects

import pandas as pd
df = pd.DataFrame({"x": range(100)})
ds = ray.data.from_pandas(df)

Hugging Face

from_huggingface takes an already-loaded Hugging Face Dataset object rather than a dataset name:
import datasets

hf_ds = datasets.load_dataset("imdb", split="train")
ds = ray.data.from_huggingface(hf_ds)

Custom data sources

Subclass Datasource to implement reads from custom systems. Each ReadTask wraps a read function together with metadata about the block it produces.
from ray.data.datasource import Datasource, ReadTask

class MyDatasource(Datasource):
    def estimate_inmemory_data_size(self) -> int | None:
        ...

    def get_read_tasks(self, parallelism: int) -> list[ReadTask]:
        ...

ds = ray.data.read_datasource(MyDatasource())

Authentication

Ray Data delegates authentication to the underlying filesystem layer (PyArrow filesystems by default; fsspec implementations such as s3fs, gcsfs, and adlfs also work). To supply credentials explicitly, pass a configured filesystem:
import pyarrow.fs as fs

filesystem = fs.S3FileSystem(access_key="...", secret_key="...")
# When a filesystem is passed, the path omits the scheme prefix.
ds = ray.data.read_parquet("bucket/path/", filesystem=filesystem)

Read parallelism

Ray Data automatically chooses the number of read tasks based on the estimated data size and available cluster resources. To override it, set override_num_blocks:
ds = ray.data.read_parquet("s3://bucket/", override_num_blocks=64)

Next steps

Transforming data

Apply user-defined functions to your dataset.

Saving data

Write datasets back to storage.