

Ray Data supports tensor columns natively: an Arrow column whose elements are NumPy ndarrays of the same shape.

Read tensor data

read_images returns tensor columns by default:
import ray

ds = ray.data.read_images("s3://bucket/images/", size=(224, 224))
ds.schema()
# Column           Type
# ------           ----
# image            ArrowTensorType(shape=(224, 224, 3), dtype=uint8)
# path             string
read_numpy does the same for raw .npy files.
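The `.npy` format itself can be exercised with plain NumPy, no Ray required. The sketch below (with illustrative dummy data and a temporary path) writes and re-reads the kind of array file that read_numpy consumes:

```python
import os
import tempfile

import numpy as np

# Write a stack of 8 RGB images as a single .npy file (dummy data,
# not a real dataset), then read it back. This is the on-disk format
# that read_numpy parses into a tensor column.
arr = np.zeros((8, 224, 224, 3), dtype=np.uint8)
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "images.npy")
    np.save(path, arr)
    loaded = np.load(path)

assert loaded.shape == (8, 224, 224, 3)
assert loaded.dtype == np.uint8
```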

Build tensor columns from arrays

import numpy as np

ds = ray.data.from_items([
    {"id": i, "embedding": np.random.rand(768)} for i in range(1000)
])
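The key property is that every row's array has the same shape. A quick Ray-free check of that invariant: rows whose arrays all share one shape can be viewed as a single block, which is exactly how a tensor column stores them.

```python
import numpy as np

# Same row structure as above; every "embedding" has shape (768,).
rows = [{"id": i, "embedding": np.random.rand(768)} for i in range(1000)]

# Columnar equivalent: one contiguous (1000, 768) block.
stacked = np.stack([r["embedding"] for r in rows])
assert stacked.shape == (1000, 768)
```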

Transform tensor columns

def normalize(batch):
    # Scale uint8 pixel values into [0, 1] as float32.
    arr = batch["image"]
    batch["image"] = arr.astype("float32") / 255.0
    return batch

ds = ds.map_batches(normalize, batch_format="numpy")
The numpy batch format gives you batches as native ndarrays, the most ergonomic representation for tensor work.
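Because a numpy-format batch is just a dict of ndarrays, the transform can be sanity-checked outside Ray on a hand-built batch (the shapes here are illustrative):

```python
import numpy as np

def normalize(batch):
    # Scale uint8 pixel values into [0, 1] as float32.
    arr = batch["image"]
    batch["image"] = arr.astype("float32") / 255.0
    return batch

# A fake numpy-format batch, shaped like what map_batches passes
# with batch_format="numpy".
batch = {"image": np.full((4, 224, 224, 3), 255, dtype=np.uint8)}
out = normalize(batch)

assert out["image"].dtype == np.float32
assert out["image"].max() == 1.0
```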

Use in training

import torch

for batch in ds.iter_torch_batches(batch_size=32, dtypes={"image": torch.float32}):
    out = model(batch["image"].cuda())
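The batching behavior can be sketched without torch or Ray: slicing a stacked array into fixed-size chunks, with a smaller final chunk, approximates what each iteration yields (the array size here is illustrative).

```python
import numpy as np

# Sketch of batch iteration: chunk a stacked image array into
# batch_size slices, as iter_torch_batches does per step (NumPy
# arrays stand in for torch tensors).
images = np.zeros((100, 224, 224, 3), dtype=np.float32)
batch_size = 32
batches = [images[i:i + batch_size] for i in range(0, len(images), batch_size)]

assert len(batches) == 4  # 32 + 32 + 32 + 4
assert batches[0].shape == (32, 224, 224, 3)
assert batches[-1].shape == (4, 224, 224, 3)  # final partial batch
```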

Variable-shape tensors

For tensors with varying shapes (e.g., variable-length sequences), store one ndarray per row and let the shapes differ across rows:
ds = ray.data.from_items([
    {"id": i, "tokens": np.random.randint(0, 50000, size=(np.random.randint(10, 100),))}
    for i in range(100)
])
Note that some operators (like sort or groupby) don’t support variable-shape tensor columns.
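The limitation follows from the shapes: ragged rows cannot be packed into a single fixed-shape block, which is what shape-dependent operators rely on. A small deterministic check (illustrative lengths):

```python
import numpy as np

# One ndarray per row, with lengths that differ across rows.
rows = [{"id": i, "tokens": np.arange(10 + i % 5)} for i in range(100)]

# No single (N, ...) block exists for ragged rows: stacking fails.
try:
    np.stack([r["tokens"] for r in rows])
    stack_ok = True
except ValueError:
    stack_ok = False

assert not stack_ok
```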

Save tensor columns

ds.write_parquet("s3://bucket/embeddings/")
Tensor columns are serialized as nested Arrow arrays.

Next steps

Batch inference

Run image and text models over tensor columns.

Working with LLMs

Score prompts at scale.