
Tune trials can save checkpoints to durable storage. Use checkpoints for fault tolerance, for Population Based Training (PBT, which copies checkpoints between trials), and to recover the best trial after a run finishes.

Save a checkpoint

import os
import tempfile

import torch
from ray import train, tune

def train_fn(config):
    model = build_model(config)  # user-defined model constructor
    for epoch in range(10):
        loss = train_one_epoch(model)  # user-defined training step
        # Save the model state to a temporary directory, then report it
        # as a checkpoint together with the metrics for this epoch.
        with tempfile.TemporaryDirectory() as tmpdir:
            torch.save(model.state_dict(), os.path.join(tmpdir, "model.pt"))
            train.report(
                {"loss": loss},
                checkpoint=train.Checkpoint.from_directory(tmpdir),
            )
Each call to train.report with a checkpoint persists it to the trial’s storage directory.
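
By default, results and checkpoints land under a local ~/ray_results directory. To make the storage durable, point the run at a shared or cloud location via RunConfig. A minimal sketch, where the bucket URI is a hypothetical placeholder:

from ray import train, tune

# Persist all trial results and checkpoints to cloud storage.
tuner = tune.Tuner(
    train_fn,
    run_config=train.RunConfig(
        storage_path="s3://my-bucket/tune-results",  # hypothetical bucket URI
        name="checkpoint_demo",
    ),
)
tuner.fit()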

Resume a trial

Inside the training function:
checkpoint = train.get_checkpoint()
if checkpoint:
    # as_directory yields a local path to the checkpoint contents.
    with checkpoint.as_directory() as path:
        model.load_state_dict(torch.load(os.path.join(path, "model.pt")))
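
To continue from the right epoch instead of restarting at zero, the epoch counter can be checkpointed alongside the weights. A minimal sketch, reusing the hypothetical build_model and train_one_epoch helpers from above:

import os
import tempfile

import torch
from ray import train

def train_fn(config):
    model = build_model(config)
    start_epoch = 0

    # Restore both the weights and the epoch counter, if a checkpoint exists.
    checkpoint = train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as path:
            state = torch.load(os.path.join(path, "state.pt"))
            model.load_state_dict(state["model"])
            start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, 10):
        loss = train_one_epoch(model)
        with tempfile.TemporaryDirectory() as tmpdir:
            torch.save(
                {"model": model.state_dict(), "epoch": epoch},
                os.path.join(tmpdir, "state.pt"),
            )
            train.report(
                {"loss": loss},
                checkpoint=train.Checkpoint.from_directory(tmpdir),
            )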

Best checkpoint of a result

# ResultGrid.get_best_result selects the trial with the best reported metric.
result = tuner.fit().get_best_result(metric="loss", mode="min")
best_ckpt = result.checkpoint  # latest checkpoint reported by the best trial
with best_ckpt.as_directory() as path:
    model.load_state_dict(torch.load(os.path.join(path, "model.pt")))
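
Checkpoint.as_directory works for both local and remote storage: if the checkpoint lives on cloud storage, it is downloaded to a temporary local directory for the duration of the with block.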

Configure retention

from ray.train import CheckpointConfig, RunConfig

run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        # Keep only the 3 checkpoints with the lowest reported loss;
        # older or worse checkpoints are deleted as new ones arrive.
        num_to_keep=3,
        checkpoint_score_attribute="loss",
        checkpoint_score_order="min",
    )
)
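
The config only takes effect once it is passed to the Tuner. A minimal sketch, reusing the train_fn from above; best_checkpoints on the best result then lists the retained (checkpoint, metrics) pairs:

from ray import tune

tuner = tune.Tuner(train_fn, run_config=run_config)
results = tuner.fit()

# At most num_to_keep=3 checkpoints per trial survive; best_checkpoints
# returns the retained (Checkpoint, metrics) pairs for a result.
best_result = results.get_best_result(metric="loss", mode="min")
for ckpt, metrics in best_result.best_checkpoints:
    print(ckpt, metrics["loss"])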

PBT and checkpoints

PBT copies checkpoints between trials when promoting good performers. For this to work, your training function must (see the scheduler sketch below):
  1. Save a checkpoint at every reporting step.
  2. Restore from a checkpoint at the start of each step (or, for the function API shown here, at the start of the function).
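
A minimal sketch of the scheduler side, assuming the checkpoint-aware train_fn above reads a hypothetical lr hyperparameter from its config:

from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

# PBT periodically clones the checkpoint of a well-performing trial into
# a poorly-performing one and perturbs its hyperparameters.
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="loss",
    mode="min",
    perturbation_interval=2,
    hyperparam_mutations={"lr": tune.loguniform(1e-4, 1e-1)},  # hypothetical
)

tuner = tune.Tuner(
    train_fn,
    tune_config=tune.TuneConfig(scheduler=pbt, num_samples=4),
    param_space={"lr": 1e-3},
)
tuner.fit()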

Next steps

Analyzing results

Inspect, compare, and export trial results.

Distributed tuning

Sync checkpoints between nodes.