
RunConfig captures everything about a run that isn’t model code or scaling: where checkpoints go, what to call the run, when to stop, and how to handle failures.

Basic usage

from ray.train import RunConfig, CheckpointConfig, FailureConfig

RunConfig(
    name="resnet50-imagenet",
    storage_path="s3://bucket/runs/",
    checkpoint_config=CheckpointConfig(num_to_keep=3),
    failure_config=FailureConfig(max_failures=2),
)
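
A RunConfig does nothing on its own; you pass it to a trainer. A minimal sketch, assuming a TorchTrainer and a placeholder train_func:

from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    ...  # per-worker training loop (placeholder)

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=4),
    run_config=RunConfig(name="resnet50-imagenet", storage_path="s3://bucket/runs/"),
)
result = trainer.fit()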

Storage

storage_path is where Ray Train writes checkpoints, metrics, and trial state. Local paths, S3, GCS, and any pyarrow-supported filesystem are valid.
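If you need custom credentials or a non-default endpoint, you can also hand Ray an explicit pyarrow filesystem. A sketch, assuming the storage_filesystem parameter is available in your Ray version and using a placeholder MinIO endpoint:

import pyarrow.fs
from ray.train import RunConfig

# Plain URIs or local paths cover the common cases.
RunConfig(storage_path="s3://bucket/runs/")
RunConfig(storage_path="/mnt/shared/ray-runs")

# Custom endpoint or credentials: pass an explicit pyarrow filesystem
# (assumed parameter: storage_filesystem; note storage_path drops the scheme).
fs = pyarrow.fs.S3FileSystem(endpoint_override="https://minio.internal:9000")
RunConfig(storage_path="bucket/runs", storage_filesystem=fs)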

Run name

name controls the directory under storage_path. Defaults to a generated name like TorchTrainer_2025-04-30_12-34-56.
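Concretely, the two settings combine into the run directory; a small sketch:

# storage_path="s3://bucket/runs/" and name="resnet50-imagenet"
# put all run output under s3://bucket/runs/resnet50-imagenet/
RunConfig(name="resnet50-imagenet", storage_path="s3://bucket/runs/")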

Stop conditions

RunConfig(stop={"training_iteration": 50, "loss": 0.01})

The run stops as soon as any condition is met. Pass a callable for more complex stopping logic:

def stopper(trial_id, result):
    return result["loss"] < 0.01 or result["epoch"] >= 50

RunConfig(stop=stopper)
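
For stateful stopping logic (for example, stopping on a loss plateau), Ray Tune also provides a Stopper base class. A sketch, assuming ray.tune.stopper.Stopper is available in your version; PlateauStopper is a hypothetical example, not a built-in:

from ray.tune.stopper import Stopper

class PlateauStopper(Stopper):
    """Stop a trial once its loss has not improved for `patience` results."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = {}   # trial_id -> best loss seen so far
        self.stale = {}  # trial_id -> results since the last improvement

    def __call__(self, trial_id, result):
        loss = result["loss"]
        if loss < self.best.get(trial_id, float("inf")):
            self.best[trial_id] = loss
            self.stale[trial_id] = 0
        else:
            self.stale[trial_id] = self.stale.get(trial_id, 0) + 1
        return self.stale[trial_id] >= self.patience

    def stop_all(self):
        # Never stop the whole experiment early.
        return False

RunConfig(stop=PlateauStopper(patience=5))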

Failure handling

FailureConfig(max_failures=3, fail_fast=False)

max_failures sets how many times a trial is retried after worker errors; fail_fast=True aborts the entire run on the first failure instead.
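
For example, a run that tolerates two worker crashes before giving up, versus one that aborts immediately (a minimal sketch):

from ray.train import RunConfig, FailureConfig

# Retry the trial up to 2 times after worker errors, then give up.
RunConfig(failure_config=FailureConfig(max_failures=2))

# Abort the whole run as soon as the first failure occurs.
RunConfig(failure_config=FailureConfig(fail_fast=True))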

Sync config

Ray Train syncs checkpoints from each worker’s local disk to storage_path. Adjust the behavior with SyncConfig:

from ray.train import SyncConfig

SyncConfig(sync_period=300)  # sync at most once every 5 minutes
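
SyncConfig is passed through RunConfig; a minimal sketch:

from ray.train import RunConfig, SyncConfig

RunConfig(
    storage_path="s3://bucket/runs/",
    sync_config=SyncConfig(sync_period=300),
)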

Callbacks

Pass callbacks to integrate with experiment trackers:

from ray.train.callbacks.mlflow import MLflowLoggerCallback

RunConfig(callbacks=[MLflowLoggerCallback(experiment_name="my-exp")])

Built-in callbacks include TBXLoggerCallback, WandbLoggerCallback, MLflowLoggerCallback, and CSVLoggerCallback.
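
You can also write your own callback. A sketch, assuming the ray.tune.Callback base class and its on_trial_result hook; PrintLossCallback is a hypothetical example:

from ray.tune import Callback

class PrintLossCallback(Callback):
    """Print the reported loss every time a trial logs a result."""

    def on_trial_result(self, iteration, trials, trial, result, **info):
        print(f"{trial}: iteration={iteration} loss={result.get('loss')}")

RunConfig(callbacks=[PrintLossCallback()])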

Next steps

Checkpointing

Configure checkpoint behavior in detail.

Fault tolerance

Failure semantics for distributed training.