
Tune trials can save checkpoints to durable storage. Use checkpoints for fault tolerance, for Population Based Training (PBT, which copies checkpoints between trials), and to recover the best trial after a run finishes.

Save a checkpoint

import os
import tempfile

import torch
from ray import train, tune

def train_fn(config):
    model = build_model(config)  # user-defined model constructor
    for epoch in range(10):
        loss = train_one_epoch(model)  # user-defined training step
        # Save the model state to a temporary directory, then report it
        # as a checkpoint together with the metrics for this epoch.
        with tempfile.TemporaryDirectory() as tmpdir:
            torch.save(model.state_dict(), os.path.join(tmpdir, "model.pt"))
            train.report(
                {"loss": loss},
                checkpoint=train.Checkpoint.from_directory(tmpdir),
            )
Each call to train.report with a checkpoint persists it to the trial’s storage directory.
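
By default, results and checkpoints land under a local ~/ray_results directory. To make the storage durable, point the run at a shared or cloud location via RunConfig. A minimal sketch, where the bucket URI is a hypothetical placeholder:

from ray import train, tune

# Persist all trial results and checkpoints to cloud storage.
tuner = tune.Tuner(
    train_fn,
    run_config=train.RunConfig(
        storage_path="s3://my-bucket/tune-results",  # hypothetical bucket URI
        name="checkpoint_demo",
    ),
)
tuner.fit()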

Resume a trial

Inside the training function:
checkpoint = train.get_checkpoint()
if checkpoint:
    # as_directory yields a local path to the checkpoint contents.
    with checkpoint.as_directory() as path:
        model.load_state_dict(torch.load(os.path.join(path, "model.pt")))
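
To continue from the right epoch instead of restarting at zero, the epoch counter can be checkpointed alongside the weights. A minimal sketch, reusing the hypothetical build_model and train_one_epoch helpers from above:

import os
import tempfile

import torch
from ray import train

def train_fn(config):
    model = build_model(config)
    start_epoch = 0

    # Restore both the weights and the epoch counter, if a checkpoint exists.
    checkpoint = train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as path:
            state = torch.load(os.path.join(path, "state.pt"))
            model.load_state_dict(state["model"])
            start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, 10):
        loss = train_one_epoch(model)
        with tempfile.TemporaryDirectory() as tmpdir:
            torch.save(
                {"model": model.state_dict(), "epoch": epoch},
                os.path.join(tmpdir, "state.pt"),
            )
            train.report(
                {"loss": loss},
                checkpoint=train.Checkpoint.from_directory(tmpdir),
            )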

Best checkpoint of a result

# ResultGrid.get_best_result selects the trial with the best reported metric.
result = tuner.fit().get_best_result(metric="loss", mode="min")
best_ckpt = result.checkpoint  # latest checkpoint reported by the best trial
with best_ckpt.as_directory() as path:
    model.load_state_dict(torch.load(os.path.join(path, "model.pt")))
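
Checkpoint.as_directory works for both local and remote storage: if the checkpoint lives on cloud storage, it is downloaded to a temporary local directory for the duration of the with block.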

Configure retention

from ray.train import CheckpointConfig, RunConfig

run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        # Keep only the 3 checkpoints with the lowest reported loss;
        # older or worse checkpoints are deleted as new ones arrive.
        num_to_keep=3,
        checkpoint_score_attribute="loss",
        checkpoint_score_order="min",
    )
)
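
The config only takes effect once it is passed to the Tuner. A minimal sketch, reusing the train_fn from above; best_checkpoints on the best result then lists the retained (checkpoint, metrics) pairs:

from ray import tune

tuner = tune.Tuner(train_fn, run_config=run_config)
results = tuner.fit()

# At most num_to_keep=3 checkpoints per trial survive; best_checkpoints
# returns the retained (Checkpoint, metrics) pairs for a result.
best_result = results.get_best_result(metric="loss", mode="min")
for ckpt, metrics in best_result.best_checkpoints:
    print(ckpt, metrics["loss"])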

PBT and checkpoints

PBT copies checkpoints between trials when promoting good performers. For this to work, your training function must (see the scheduler sketch below):
  1. Save a checkpoint at every reporting step.
  2. Restore from a checkpoint at the start of each step (or, for the function API shown here, at the start of the function).
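
A minimal sketch of the scheduler side, assuming the checkpoint-aware train_fn above reads a hypothetical lr hyperparameter from its config:

from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

# PBT periodically clones the checkpoint of a well-performing trial into
# a poorly-performing one and perturbs its hyperparameters.
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="loss",
    mode="min",
    perturbation_interval=2,
    hyperparam_mutations={"lr": tune.loguniform(1e-4, 1e-1)},  # hypothetical
)

tuner = tune.Tuner(
    train_fn,
    tune_config=tune.TuneConfig(scheduler=pbt, num_samples=4),
    param_space={"lr": 1e-3},
)
tuner.fit()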

Next steps

Analyzing results

Inspect, compare, and export trial results.

Distributed tuning

Sync checkpoints between nodes.