Trials are pending forever

Tune is waiting on cluster resources.
  • Check resource availability with ray status.
  • Verify your per-trial requests fit on the available nodes.
  • Lower max_concurrent_trials so fewer trials wait on the same resources at once (see the sketch after this list).
  • If autoscaling, check the autoscaler log for nodes failing to launch.
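
A minimal sketch of the resource-related knobs, assuming the Tuner API; train_fn and the specific numbers are placeholders for your own trainable and requirements.

from ray import tune

def train_fn(config):
    ...  # placeholder trainable

# Request the resources each trial actually needs so the scheduler can place it.
trainable = tune.with_resources(train_fn, {"cpu": 2, "gpu": 0})

tuner = tune.Tuner(
    trainable,
    tune_config=tune.TuneConfig(
        num_samples=20,
        max_concurrent_trials=4,  # cap how many trials compete for resources at once
    ),
)
results = tuner.fit()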

OOM during a trial

A worker hit the memory cap.
  • Reduce batch size or model size.
  • Increase the per-trial memory request, e.g. resources_per_trial={"memory": ...} (see the sketch below).
  • Profile with the dashboard’s memory tab.
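
A hedged sketch of reserving memory per trial with the Tuner API (the legacy tune.run path takes a similar dict via resources_per_trial); the 4 GiB figure is an arbitrary example.

from ray import tune

def train_fn(config):
    ...  # placeholder trainable

# Reserving memory per trial makes Ray schedule fewer trials on each node.
trainable = tune.with_resources(
    train_fn,
    {"cpu": 1, "memory": 4 * 1024**3},
)
results = tune.Tuner(trainable).fit()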

Trials produce identical results

Every trial gets the same config — likely a bug in the search space.
  • Make sure param_space uses tune.uniform, tune.choice, etc., not constants (see the sketch after this list).
  • If using tune.with_parameters, confirm the wrapped function accepts the config.
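
A sketch of a search space that actually samples, plus tune.with_parameters to pass a shared object; the parameter names and large_dataset are hypothetical.

from ray import tune

def train_fn(config, data=None):
    # config differs per trial; data is the same object for every trial
    lr = config["lr"]
    layers = config["layers"]
    ...

large_dataset = ...  # hypothetical large object shared across trials

tuner = tune.Tuner(
    tune.with_parameters(train_fn, data=large_dataset),
    param_space={
        "lr": tune.loguniform(1e-4, 1e-1),  # sampled, not a constant
        "layers": tune.choice([2, 4, 8]),
    },
)
results = tuner.fit()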

ASHA is too aggressive

Trials get killed before they show their potential.
  • Increase grace_period.
  • Lower reduction_factor from 4 to 2 or 3.
  • Use a longer max_t (the sketch after this list shows all three settings).
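
A minimal sketch of loosening ASHA; the metric name and numbers are illustrative, not recommendations.

from ray import tune
from ray.tune.schedulers import ASHAScheduler

def train_fn(config):
    ...  # placeholder trainable that reports "val_loss"

scheduler = ASHAScheduler(
    metric="val_loss",
    mode="min",
    grace_period=10,      # minimum iterations before a trial can be stopped
    reduction_factor=2,   # keep 1 in 2 at each rung instead of 1 in 4
    max_t=200,            # upper bound on iterations per trial
)

tuner = tune.Tuner(
    train_fn,
    tune_config=tune.TuneConfig(scheduler=scheduler, num_samples=50),
)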

Search algorithm doesn’t seem to learn

Bayesian methods need a few warm-up trials before they’re useful.
  • Increase num_samples.
  • Some algorithms (e.g., BayesOpt, BOHB) begin with an initial random phase; make sure num_samples extends well past it.
  • Verify the metric and mode are passed consistently to both the search algorithm and the scheduler (see the sketch after this list).
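
A sketch, assuming BayesOptSearch (which needs the bayesian-optimization package installed), of passing the same metric/mode to both the searcher and the scheduler and leaving room for the random warm-up phase; names and numbers are illustrative.

from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.bayesopt import BayesOptSearch

def train_fn(config):
    ...  # placeholder trainable that reports "val_loss"

searcher = BayesOptSearch(
    metric="val_loss",
    mode="min",
    random_search_steps=10,  # warm-up trials before the surrogate model is used
)
scheduler = ASHAScheduler(metric="val_loss", mode="min")

tuner = tune.Tuner(
    train_fn,
    tune_config=tune.TuneConfig(
        search_alg=searcher,
        scheduler=scheduler,
        num_samples=50,  # enough samples to get past the random phase
    ),
    param_space={"lr": tune.uniform(1e-4, 1e-1)},
)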

Storage path errors

Trials fail to write checkpoints.
  • Confirm credentials for cloud storage are present on every node (see the sketch after this list).
  • Verify the bucket/path exists.
  • For NFS, check that it’s mounted at the same path on every node.
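
A sketch of pointing an experiment at shared cloud storage, assuming the Ray 2.x RunConfig API; the bucket URI and experiment name are placeholders.

from ray import train, tune

def train_fn(config):
    ...  # placeholder trainable

# Every node must have credentials for this bucket.
tuner = tune.Tuner(
    train_fn,
    run_config=train.RunConfig(
        storage_path="s3://my-bucket/tune-results",
        name="my-experiment",
    ),
)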

Resuming an experiment fails

from ray import tune
tuner = tune.Tuner.restore("s3://bucket/runs/my-experiment", trainable=train_fn)
  • The trainable argument must point to the same function used originally.
  • The search algorithm is pickled alongside the results; the same Python version is required to unpickle it.

Diagnosing slowness

  • Capture a trace with ray timeline (or ray.timeline(filename=...) from Python) and open the resulting JSON file in chrome://tracing.
  • Check whether one slow trial is blocking others (max_concurrent_trials).
  • Profile the training function on a single trial first to confirm it’s not the bottleneck (see the sketch below).
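
A quick way to do that last check is to profile one configuration outside of Tune entirely; train_fn and the sample config are placeholders.

import cProfile

def train_fn(config):
    ...  # your training function, unchanged

sample_config = {"lr": 1e-3, "batch_size": 32}  # hypothetical values

# Run a single trial's worth of work directly and see where the time goes.
cProfile.run("train_fn(sample_config)", sort="cumtime")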

Next steps

Observability

Cluster-wide metrics and logs.

Distributed tuning

Scale Tune correctly.