## Trials are pending forever

Tune is waiting on cluster resources.

- Check resource availability with `ray status`.
- Verify your per-trial requests fit on the available nodes.
- Lower `max_concurrent_trials`, as in the sketch below.
- If autoscaling, check the autoscaler log for nodes failing to launch.
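A minimal sketch of capping in-flight trials, so Tune never queues more work than the cluster can place at once. `train_fn`, the search space, and the per-trial CPU request are placeholders, and reporting assumes a Ray version (>= 2.7) where metrics go through `ray.train.report`:

```python
from ray import train, tune

def train_fn(config):
    # Placeholder trainable: report a dummy metric once.
    train.report({"loss": config["lr"]})

tuner = tune.Tuner(
    tune.with_resources(train_fn, {"cpu": 1}),  # explicit per-trial request
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(num_samples=20, max_concurrent_trials=4),
)
results = tuner.fit()
```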
## OOM during a trial

A worker hit the memory cap.

- Reduce batch size or model size.
- Increase the per-trial memory request, e.g. `resources_per_trial={"memory": ...}` (see the sketch below).
- Profile with the dashboard’s memory tab.
## Trials produce identical results

Every trial gets the same config, which is likely a bug in the search space.

- Make sure `param_space` uses `tune.uniform`, `tune.choice`, etc., not constants (see the sketch below).
- If using `tune.with_parameters`, confirm the wrapped function accepts the config.
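A minimal sketch contrasting sampled values with a constant, plus a `tune.with_parameters` wrapper. `train_fn` and the dummy `data` list are placeholders:

```python
from ray import train, tune

def train_fn(config, data=None):
    # A with_parameters-wrapped function must still accept `config` first.
    train.report({"loss": config["lr"] * len(data)})

param_space = {
    "lr": tune.loguniform(1e-4, 1e-1),  # sampled per trial
    "hidden": tune.choice([64, 128]),   # sampled per trial
    "epochs": 10,                       # constant: identical in every trial
}
trainable = tune.with_parameters(train_fn, data=[1, 2, 3])
tuner = tune.Tuner(trainable, param_space=param_space)
```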
## ASHA is too aggressive

Trials get killed before they show their potential.

- Increase `grace_period`.
- Lower `reduction_factor` from 4 to 2 or 3.
- Use a longer `max_t`.
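A minimal sketch of a gentler ASHA configuration combining all three knobs. `train_fn` is a placeholder that reports one iteration at a time:

```python
from ray import train, tune
from ray.tune.schedulers import ASHAScheduler

def train_fn(config):
    for step in range(100):
        train.report({"loss": 1.0 / (step + 1)})

scheduler = ASHAScheduler(
    metric="loss",
    mode="min",
    grace_period=10,      # no trial is stopped before 10 iterations
    reduction_factor=2,   # halve (rather than quarter) trials at each rung
    max_t=100,            # iteration budget for surviving trials
)
tuner = tune.Tuner(
    train_fn,
    tune_config=tune.TuneConfig(scheduler=scheduler, num_samples=20),
)
```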
## Search algorithm doesn’t seem to learn

Bayesian methods need a few warm-up trials before they’re useful.

- Increase `num_samples`.
- Some algorithms (BayesOpt, BOHB) require setting an initial random budget.
- Verify the metric and mode are passed correctly to both the algorithm and the scheduler, as in the sketch below.
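A minimal sketch using `BayesOptSearch`, assuming the `bayesian-optimization` package is installed; `random_search_steps` sets the initial random budget the model warms up on, and the metric/mode pair is passed to both the searcher and the scheduler. `train_fn` is a placeholder:

```python
from ray import train, tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.bayesopt import BayesOptSearch

def train_fn(config):
    train.report({"loss": (config["lr"] - 0.01) ** 2})

searcher = BayesOptSearch(metric="loss", mode="min", random_search_steps=10)
scheduler = ASHAScheduler(metric="loss", mode="min")
tuner = tune.Tuner(
    train_fn,
    param_space={"lr": tune.uniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(
        search_alg=searcher,
        scheduler=scheduler,
        num_samples=50,  # enough trials for the model to warm up and learn
    ),
)
```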
## Storage path errors

Trials fail to write checkpoints.

- Confirm credentials for cloud storage are present on every node.
- Verify the bucket/path exists.
- For NFS, check that it’s mounted at the same path on every node.
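A minimal sketch pointing results at shared storage, assuming Ray >= 2.7 where `RunConfig` lives in `ray.train`. The bucket name and `train_fn` are placeholders; every node needs credentials (or the NFS mount) for the path:

```python
from ray import train, tune

def train_fn(config):
    train.report({"loss": 0.0})

tuner = tune.Tuner(
    train_fn,
    run_config=train.RunConfig(
        name="my_experiment",
        storage_path="s3://my-bucket/tune-results",  # or a shared NFS path
    ),
)
```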
## Resuming an experiment fails

- The `trainable` argument must point to the same function used originally (see the sketch below).
- The pickle of the search algorithm is stored alongside the results; the same Python version is required to unpickle it.
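A minimal sketch of restoring a run with `Tuner.restore`. The storage path is a placeholder, `train_fn` must be the same function the original run used, and the Python version must match the one that pickled the search algorithm:

```python
from ray import train, tune

def train_fn(config):
    train.report({"loss": 0.0})

tuner = tune.Tuner.restore(
    path="s3://my-bucket/tune-results/my_experiment",
    trainable=train_fn,  # same function as the original run
)
results = tuner.fit()
```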
## Diagnosing slowness

- Use `ray timeline -o /tmp/timeline.json` and open the file in `chrome://tracing`.
- Check whether one slow trial is blocking others (`max_concurrent_trials`).
- Profile the training function on a single trial first to confirm it’s not the bottleneck, as in the sketch below.
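A minimal sketch of that last step: time one hand-picked config directly, with no Ray involved, to check whether the training function itself is the slow part. `train_fn` here is a plain placeholder:

```python
import time

def train_fn(config):
    time.sleep(0.1 * config["epochs"])  # stands in for real training work

start = time.perf_counter()
train_fn({"epochs": 5})  # one trial, locally, with no Tune overhead
print(f"one trial took {time.perf_counter() - start:.1f}s")
```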
## Next steps

- **Observability**: Cluster-wide metrics and logs.
- **Distributed tuning**: Scale Tune correctly.