

Ray Data quickstart

This quickstart loads a sample dataset, applies a transformation, and consumes the results in batches.

Install

pip install -U "ray[data]"

Load a dataset

import ray

ds = ray.data.read_parquet("s3://anonymous@air-example-data/iris.parquet")
ds.show(5)

read_parquet returns a lazy Dataset: the metadata is read immediately, while the data itself is read on demand when the dataset is consumed.

Inspect

print(ds.schema())
print(ds.count())

Transform

map_batches applies a function to batches of rows (by default, dicts of NumPy arrays) and returns the transformed batches.

def normalize(batch):
    batch["sepal length (cm)"] = batch["sepal length (cm)"] / 10.0
    return batch

ds = ds.map_batches(normalize)

For row-level transformations, use map; to drop rows, use filter.

ds = ds.filter(lambda row: row["target"] != 2)

Consume

The simplest way to consume a dataset is to iterate over batches.
for batch in ds.iter_batches(batch_size=32, batch_format="numpy"):
    print(batch)
    break

Alternative formats:
  • batch_format="pandas" for pandas DataFrames
  • batch_format="pyarrow" for Arrow tables
  • iter_torch_batches() for PyTorch tensors

Save

ds.write_parquet("/tmp/output/")

Distribute across a cluster

Ray Data automatically uses every node in your Ray cluster. To run on a multi-node cluster, connect to it before loading data:

ray.init(address="auto")
ds = ray.data.read_parquet("s3://...")

Pipe into Ray Train

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    datasets={"train": ds},
)
trainer.fit()

Each training worker automatically receives a sharded stream of ds; inside train_func, retrieve it with ray.train.get_dataset_shard("train").

Next steps

Loading data

Sources and formats Ray Data supports.

Batch inference

Run a model over a dataset.