
RunConfig captures everything about a run that isn’t model code or scaling: where checkpoints go, what to call the run, when to stop, and how to handle failures.

Basic usage

from ray.train import RunConfig, CheckpointConfig, FailureConfig

RunConfig(
    name="resnet50-imagenet",
    storage_path="s3://bucket/runs/",
    checkpoint_config=CheckpointConfig(num_to_keep=3),
    failure_config=FailureConfig(max_failures=2),
)
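
A RunConfig does nothing on its own; you pass it to a trainer. A minimal sketch, assuming a TorchTrainer and a placeholder train_func:

from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    ...  # per-worker training loop (placeholder)

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=4),
    run_config=RunConfig(name="resnet50-imagenet", storage_path="s3://bucket/runs/"),
)
result = trainer.fit()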

Storage

storage_path is where Ray Train writes checkpoints, metrics, and trial state. Local paths, S3, GCS, and any pyarrow-supported filesystem are valid.
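If you need custom credentials or a non-default endpoint, you can also hand Ray an explicit pyarrow filesystem. A sketch, assuming the storage_filesystem parameter is available in your Ray version and using a placeholder MinIO endpoint:

import pyarrow.fs
from ray.train import RunConfig

# Plain URIs or local paths cover the common cases.
RunConfig(storage_path="s3://bucket/runs/")
RunConfig(storage_path="/mnt/shared/ray-runs")

# Custom endpoint or credentials: pass an explicit pyarrow filesystem
# (assumed parameter: storage_filesystem; note storage_path drops the scheme).
fs = pyarrow.fs.S3FileSystem(endpoint_override="https://minio.internal:9000")
RunConfig(storage_path="bucket/runs", storage_filesystem=fs)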

Run name

name controls the directory under storage_path. Defaults to a generated name like TorchTrainer_2025-04-30_12-34-56.
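Concretely, the two settings combine into the run directory; a small sketch:

# storage_path="s3://bucket/runs/" and name="resnet50-imagenet"
# put all run output under s3://bucket/runs/resnet50-imagenet/
RunConfig(name="resnet50-imagenet", storage_path="s3://bucket/runs/")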

Stop conditions

RunConfig(stop={"training_iteration": 50, "loss": 0.01})

The run stops as soon as any condition is met. Pass a callable for more complex stopping logic:

def stopper(trial_id, result):
    return result["loss"] < 0.01 or result["epoch"] >= 50

RunConfig(stop=stopper)
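
For stateful stopping logic (for example, stopping on a loss plateau), Ray Tune also provides a Stopper base class. A sketch, assuming ray.tune.stopper.Stopper is available in your version; PlateauStopper is a hypothetical example, not a built-in:

from ray.tune.stopper import Stopper

class PlateauStopper(Stopper):
    """Stop a trial once its loss has not improved for `patience` results."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = {}   # trial_id -> best loss seen so far
        self.stale = {}  # trial_id -> results since the last improvement

    def __call__(self, trial_id, result):
        loss = result["loss"]
        if loss < self.best.get(trial_id, float("inf")):
            self.best[trial_id] = loss
            self.stale[trial_id] = 0
        else:
            self.stale[trial_id] = self.stale.get(trial_id, 0) + 1
        return self.stale[trial_id] >= self.patience

    def stop_all(self):
        # Never stop the whole experiment early.
        return False

RunConfig(stop=PlateauStopper(patience=5))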

Failure handling

FailureConfig(max_failures=3, fail_fast=False)

max_failures sets how many times a trial is retried after worker errors; fail_fast=True aborts the entire run on the first failure instead.
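
For example, a run that tolerates two worker crashes before giving up, versus one that aborts immediately (a minimal sketch):

from ray.train import RunConfig, FailureConfig

# Retry the trial up to 2 times after worker errors, then give up.
RunConfig(failure_config=FailureConfig(max_failures=2))

# Abort the whole run as soon as the first failure occurs.
RunConfig(failure_config=FailureConfig(fail_fast=True))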

Sync config

Ray Train syncs checkpoints from each worker’s local disk to storage_path. Adjust the behavior with SyncConfig:

from ray.train import SyncConfig

SyncConfig(sync_period=300)  # sync at most once every 5 minutes
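
SyncConfig is passed through RunConfig; a minimal sketch:

from ray.train import RunConfig, SyncConfig

RunConfig(
    storage_path="s3://bucket/runs/",
    sync_config=SyncConfig(sync_period=300),
)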

Callbacks

Pass callbacks to integrate with experiment trackers:

from ray.train.callbacks.mlflow import MLflowLoggerCallback

RunConfig(callbacks=[MLflowLoggerCallback(experiment_name="my-exp")])

Built-in callbacks include TBXLoggerCallback, WandbLoggerCallback, MLflowLoggerCallback, and CSVLoggerCallback.
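
You can also write your own callback. A sketch, assuming the ray.tune.Callback base class and its on_trial_result hook; PrintLossCallback is a hypothetical example:

from ray.tune import Callback

class PrintLossCallback(Callback):
    """Print the reported loss every time a trial logs a result."""

    def on_trial_result(self, iteration, trials, trial, result, **info):
        print(f"{trial}: iteration={iteration} loss={result.get('loss')}")

RunConfig(callbacks=[PrintLossCallback()])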

Next steps

Checkpointing

Configure checkpoint behavior in detail.

Fault tolerance

Failure semantics for distributed training.