A Preprocessor encapsulates a stateful fit/transform pipeline: you fit it once on training data, then apply the same fitted transformation to training, validation, test, and inference data.

Built-in preprocessors

Preprocessor        What it does
StandardScaler      Subtract mean, divide by std.
MinMaxScaler        Scale to a specified range.
Categorizer         Convert string columns to integer codes.
OneHotEncoder       One-hot encode a categorical column.
LabelEncoder        Map labels to integers.
Tokenizer           Tokenize text columns.
HashingVectorizer   Hash tokens to feature buckets.
Concatenator        Concatenate columns into a tensor column.
Chain               Compose multiple preprocessors.

Fit and transform

from ray.data.preprocessors import StandardScaler

prep = StandardScaler(columns=["x1", "x2"])
prep.fit(train_ds)
train_ds = prep.transform(train_ds)
val_ds = prep.transform(val_ds)
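fit computes the mean and standard deviation of each listed column from train_ds; transform applies those stored statistics. Because val_ds is transformed with the statistics learned from train_ds, both datasets end up on the same scale.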
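To make the split between fit and transform concrete, here is a minimal plain-Python sketch of the arithmetic. It is not Ray's implementation: the fit and transform helpers are hypothetical, rows are plain dicts, and the use of the population standard deviation (pstdev) is an assumption.

```python
from statistics import fmean, pstdev


def fit(rows, columns):
    """Learn (mean, std) for each column from the training rows only."""
    stats = {}
    for col in columns:
        values = [row[col] for row in rows]
        # Assumption: population standard deviation. Constant columns
        # (std == 0) are not handled in this sketch.
        stats[col] = (fmean(values), pstdev(values))
    return stats


def transform(rows, stats):
    """Apply previously learned statistics without recomputing them."""
    return [
        {**row, **{c: (row[c] - mean) / std for c, (mean, std) in stats.items()}}
        for row in rows
    ]


train_rows = [{"x1": 1.0}, {"x1": 2.0}, {"x1": 3.0}]
val_rows = [{"x1": 4.0}]

stats = fit(train_rows, ["x1"])           # statistics come from training data
scaled_train = transform(train_rows, stats)
scaled_val = transform(val_rows, stats)   # validation reuses the same statistics
```

The validation value 4.0 lies outside the training range, so it maps to a z-score above every training row. It is scaled with the training statistics rather than its own.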

Compose with Chain

from ray.data.preprocessors import Chain, StandardScaler, OneHotEncoder

prep = Chain(
    OneHotEncoder(columns=["category"]),
    StandardScaler(columns=["x1", "x2"]),
)
prep.fit(train_ds)
ds = prep.transform(train_ds)
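The order of stages in a chain matters. The sketch below illustrates one plausible reading of chained fitting, assuming each stage is fit on the output of the stages before it. SimpleChain, Center, and ScaleByMaxAbs are hypothetical stand-ins written for this example, not Ray classes.

```python
class Center:
    """Subtract the mean learned at fit time."""

    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self

    def transform(self, values):
        return [v - self.mean_ for v in values]


class ScaleByMaxAbs:
    """Divide by the largest absolute value seen at fit time."""

    def fit(self, values):
        self.max_abs_ = max(abs(v) for v in values)
        return self

    def transform(self, values):
        return [v / self.max_abs_ for v in values]


class SimpleChain:
    """Run stages in order; each stage is fit on the previous stage's output."""

    def __init__(self, *stages):
        self.stages = stages

    def fit(self, values):
        for stage in self.stages:
            values = stage.fit(values).transform(values)
        return self

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values


chain = SimpleChain(Center(), ScaleByMaxAbs()).fit([2.0, 4.0, 6.0])
result = chain.transform([2.0, 4.0, 6.0])
```

Here ScaleByMaxAbs learns 2.0 (the largest centered value), not 6.0 (the largest raw value), because it only ever sees data that Center has already transformed.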

Custom preprocessors

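Subclass Preprocessor, compute the fitted state in _fit, and apply it to each pandas batch in _transform_pandas:
from ray.data.aggregate import Quantile
from ray.data.preprocessor import Preprocessor

class ClipOutliers(Preprocessor):
    """Clip each column to the [lower, upper] quantile range of the training data."""

    def __init__(self, columns, lower=0.01, upper=0.99):
        self._columns = columns
        self._lower = lower
        self._upper = upper

    def _fit(self, ds):
        # Learn per-column clipping bounds from the training data.
        self.bounds_ = {}
        for c in self._columns:
            q = ds.aggregate(
                Quantile(on=c, q=self._lower, alias_name="lo"),
                Quantile(on=c, q=self._upper, alias_name="hi"),
            )
            self.bounds_[c] = (q["lo"], q["hi"])
        return self

    def _transform_pandas(self, df):
        # Clamp every fitted column into its learned [lo, hi] range.
        for c, (lo, hi) in self.bounds_.items():
            df[c] = df[c].clip(lo, hi)
        return df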
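The clipping step can be reasoned about without a cluster. The sketch below mirrors what the transform does to each value, using plain dicts in place of a pandas batch. The clip_rows helper and the sample bounds are hypothetical and exist only for illustration.

```python
def clip_rows(rows, bounds):
    """Clamp each bounded column into [lo, hi]; other columns pass through."""
    clipped = []
    for row in rows:
        new_row = dict(row)
        for col, (lo, hi) in bounds.items():
            # Same effect as clipping a column: values below lo become lo,
            # values above hi become hi, everything else is unchanged.
            new_row[col] = min(max(row[col], lo), hi)
        clipped.append(new_row)
    return clipped


bounds = {"x1": (0.0, 10.0)}  # hypothetical fitted bounds
rows = [
    {"x1": -5.0, "id": 1},
    {"x1": 3.0, "id": 2},
    {"x1": 100.0, "id": 3},
]
clipped = clip_rows(rows, bounds)
```

Only values outside the bounds change. In-range values and columns without fitted bounds (id here) are left as they are.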

Save and load

prep.save("s3://bucket/prep.pkl")
prep = Preprocessor.load("s3://bucket/prep.pkl")
A persisted preprocessor includes all of its fitted state, so you can load it at inference time and call transform without refitting.
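The idea that fitted state travels with the preprocessor can be illustrated with Python's standard pickle module. This is only a sketch of the concept, not how save and load are implemented. MeanCenterer is a hypothetical class, and the state is round-tripped in memory rather than through a storage path.

```python
import pickle


class MeanCenterer:
    """Hypothetical preprocessor whose only fitted state is a mean."""

    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self

    def transform(self, values):
        return [v - self.mean_ for v in values]


fitted = MeanCenterer().fit([1.0, 2.0, 3.0])

# Persist the fitted attributes (here: {"mean_": 2.0}) as bytes.
payload = pickle.dumps(vars(fitted))

# Later, for example in an inference process: rebuild without refitting.
restored = MeanCenterer()
restored.__dict__.update(pickle.loads(payload))
outputs = restored.transform([2.0, 5.0])
```

The restored object never sees the training data. It produces the same outputs as the original because the learned mean was stored with it.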

Next steps

Train integration

Use preprocessors in Ray Train.

Transforming data

Lower-level UDF transformations.