Use write_* operators to persist a dataset to disk or cloud storage. Writes execute in parallel — each block becomes one or more output files.
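
As a rough sketch of how blocks map to output files (the local path is a placeholder, and exact file counts can vary with settings such as num_rows_per_file), repartitioning before the write changes how many files are produced:
import ray

ds = ray.data.range(10_000)

# Four blocks produce roughly four output files under the target directory.
ds.repartition(4).write_parquet("/tmp/output/")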

File formats

Each format has a dedicated writer; for example, to write Parquet files:
ds.write_parquet("s3://bucket/output/")
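
Writers for other formats follow the same pattern (a non-exhaustive sketch; the full list of write_* methods is in the Ray Data API reference):
ds.write_csv("s3://bucket/output/")
ds.write_json("s3://bucket/output/")
ds.write_tfrecords("s3://bucket/output/")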

Partitioning

Write a partitioned dataset by passing one or more partition columns:
ds.write_parquet(
    "s3://bucket/output/",
    partition_cols=["region", "day"],
)
This produces a Hive-style layout (region=us/day=2025-01-01/...).
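
A minimal sketch with inline sample data (the local path and the values are placeholders), showing how the partition columns become directory levels:
import ray

ds = ray.data.from_items(
    [
        {"region": "us", "day": "2025-01-01", "value": 1},
        {"region": "eu", "day": "2025-01-01", "value": 2},
    ]
)

# Rows are grouped by "region" and "day"; each group lands in its own
# region=<...>/day=<...>/ subdirectory.
ds.write_parquet("/tmp/output/", partition_cols=["region", "day"])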

Compression

Most writers accept a compression argument:
ds.write_parquet("s3://bucket/output/", compression="snappy")
ds.write_csv("s3://bucket/output/", compression="gzip")

Custom writers

Subclass Datasink to write to storage systems that Ray Data doesn't support out of the box:
from ray.data.datasource import Datasink

class MyDatasink(Datasink):
    def write(self, blocks, ctx):
        ...

ds.write_datasink(MyDatasink())
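
A rough sketch of what the write implementation might look like (send_rows is a hypothetical stand-in for your system's client, and the exact Datasink hooks can vary between Ray versions):
from ray.data.block import BlockAccessor
from ray.data.datasource import Datasink


def send_rows(rows):
    # Hypothetical client call; replace with your system's write API.
    print(f"writing {len(rows)} rows")


class MySystemDatasink(Datasink):
    def write(self, blocks, ctx):
        # Each write task receives an iterable of blocks covering its share of the data.
        for block in blocks:
            df = BlockAccessor.for_block(block).to_pandas()
            send_rows(df.to_dict("records"))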

Concurrency and resources

Pass num_rows_per_file, concurrency, num_cpus, and similar arguments to control output file size, write parallelism, and per-task resources:
ds.write_parquet("s3://bucket/", num_rows_per_file=100_000, concurrency=8)

Next steps

Loading data: Read the data back in.

Working with tensors: Save and reload tensor columns.