A Checkpoint is a directory of files that Ray Train manages on your behalf: it copies the directory to durable storage, associates it with the training run, and exposes it through the Result object returned by trainer.fit().
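For instance, the latest checkpoint surfaces on the result once training finishes. A minimal sketch, assuming a train_loop_per_worker function is already defined and default storage settings apply:

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker,  # your training function, assumed defined elsewhere
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()

# The most recently reported checkpoint, already synced to durable storage.
latest = result.checkpoint
```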
Inside train_loop_per_worker, recover state from the latest checkpoint:
```python
import torch
import ray.train

checkpoint = ray.train.get_checkpoint()
if checkpoint:
    # Materialize the checkpoint files as a local directory.
    with checkpoint.as_directory() as ckpt_dir:
        # Load the serialized model weights and restore them.
        state = torch.load(f"{ckpt_dir}/model.pt")
        model.load_state_dict(state)
```
Save model and optimizer state plus epoch/step counters in a single dict. On restore, you’ll need all three to resume mid-epoch.
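A sketch of that saving pattern, assuming a standard PyTorch model and optimizer; the file name checkpoint.pt and the save_checkpoint helper are illustrative, not part of the Ray Train API:

```python
import os
import tempfile

import torch
import ray.train
from ray.train import Checkpoint

def save_checkpoint(model, optimizer, epoch, metrics):
    # Bundle everything needed to resume into a single dict.
    state = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "epoch": epoch,
    }
    with tempfile.TemporaryDirectory() as tmpdir:
        torch.save(state, os.path.join(tmpdir, "checkpoint.pt"))
        # Hand the directory to Ray Train, which copies it to durable storage.
        # On restore, load this file and index "model", "optimizer", and "epoch".
        ray.train.report(metrics, checkpoint=Checkpoint.from_directory(tmpdir))
```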
Avoid saving raw Python objects in checkpoints. Use framework-native serializers (torch.save, model.save_pretrained for Hugging Face models, model.save for Keras) so checkpoints stay portable across Ray versions.
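For example, a Hugging Face model can be written with its native serializer before the directory is reported. A sketch assuming the transformers library and the reporting pattern shown above:

```python
import tempfile

import ray.train
from ray.train import Checkpoint

def save_hf_checkpoint(model, tokenizer, metrics):
    with tempfile.TemporaryDirectory() as tmpdir:
        # Framework-native formats stay loadable even if Ray is upgraded.
        model.save_pretrained(tmpdir)
        tokenizer.save_pretrained(tmpdir)
        ray.train.report(metrics, checkpoint=Checkpoint.from_directory(tmpdir))
```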