Trials are pending forever

Tune is waiting on cluster resources.
  • Check resource availability with ray status.
  • Verify your per-trial requests fit on the available nodes.
  • Lower max_concurrent_trials so fewer trials wait on the same resources at once (see the sketch after this list).
  • If autoscaling, check the autoscaler log for nodes failing to launch.
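
A minimal sketch of the resource-related knobs, assuming the Tuner API; train_fn and the specific numbers are placeholders for your own trainable and requirements.

from ray import tune

def train_fn(config):
    ...  # placeholder trainable

# Request the resources each trial actually needs so the scheduler can place it.
trainable = tune.with_resources(train_fn, {"cpu": 2, "gpu": 0})

tuner = tune.Tuner(
    trainable,
    tune_config=tune.TuneConfig(
        num_samples=20,
        max_concurrent_trials=4,  # cap how many trials compete for resources at once
    ),
)
results = tuner.fit()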

OOM during a trial

A worker hit the memory cap.
  • Reduce batch size or model size.
  • Increase the per-trial memory request, e.g. resources_per_trial={"memory": ...} (see the sketch below).
  • Profile with the dashboard’s memory tab.
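
A hedged sketch of reserving memory per trial with the Tuner API (the legacy tune.run path takes a similar dict via resources_per_trial); the 4 GiB figure is an arbitrary example.

from ray import tune

def train_fn(config):
    ...  # placeholder trainable

# Reserving memory per trial makes Ray schedule fewer trials on each node.
trainable = tune.with_resources(
    train_fn,
    {"cpu": 1, "memory": 4 * 1024**3},
)
results = tune.Tuner(trainable).fit()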

Trials produce identical results

Every trial gets the same config — likely a bug in the search space.
  • Make sure param_space uses tune.uniform, tune.choice, etc., not constants (see the sketch after this list).
  • If using tune.with_parameters, confirm the wrapped function accepts the config.
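
A sketch of a search space that actually samples, plus tune.with_parameters to pass a shared object; the parameter names and large_dataset are hypothetical.

from ray import tune

def train_fn(config, data=None):
    # config differs per trial; data is the same object for every trial
    lr = config["lr"]
    layers = config["layers"]
    ...

large_dataset = ...  # hypothetical large object shared across trials

tuner = tune.Tuner(
    tune.with_parameters(train_fn, data=large_dataset),
    param_space={
        "lr": tune.loguniform(1e-4, 1e-1),  # sampled, not a constant
        "layers": tune.choice([2, 4, 8]),
    },
)
results = tuner.fit()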

ASHA is too aggressive

Trials get killed before they show their potential.
  • Increase grace_period.
  • Lower reduction_factor from 4 to 2 or 3.
  • Use a longer max_t (the sketch after this list shows all three settings).
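
A minimal sketch of loosening ASHA; the metric name and numbers are illustrative, not recommendations.

from ray import tune
from ray.tune.schedulers import ASHAScheduler

def train_fn(config):
    ...  # placeholder trainable that reports "val_loss"

scheduler = ASHAScheduler(
    metric="val_loss",
    mode="min",
    grace_period=10,      # minimum iterations before a trial can be stopped
    reduction_factor=2,   # keep 1 in 2 at each rung instead of 1 in 4
    max_t=200,            # upper bound on iterations per trial
)

tuner = tune.Tuner(
    train_fn,
    tune_config=tune.TuneConfig(scheduler=scheduler, num_samples=50),
)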

Search algorithm doesn’t seem to learn

Bayesian methods need a few warm-up trials before they’re useful.
  • Increase num_samples.
  • Some algorithms (e.g., BayesOpt, BOHB) begin with an initial random phase; make sure num_samples extends well past it.
  • Verify the metric and mode are passed consistently to both the search algorithm and the scheduler (see the sketch after this list).
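
A sketch, assuming BayesOptSearch (which needs the bayesian-optimization package installed), of passing the same metric/mode to both the searcher and the scheduler and leaving room for the random warm-up phase; names and numbers are illustrative.

from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.bayesopt import BayesOptSearch

def train_fn(config):
    ...  # placeholder trainable that reports "val_loss"

searcher = BayesOptSearch(
    metric="val_loss",
    mode="min",
    random_search_steps=10,  # warm-up trials before the surrogate model is used
)
scheduler = ASHAScheduler(metric="val_loss", mode="min")

tuner = tune.Tuner(
    train_fn,
    tune_config=tune.TuneConfig(
        search_alg=searcher,
        scheduler=scheduler,
        num_samples=50,  # enough samples to get past the random phase
    ),
    param_space={"lr": tune.uniform(1e-4, 1e-1)},
)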

Storage path errors

Trials fail to write checkpoints.
  • Confirm credentials for cloud storage are present on every node (see the sketch after this list).
  • Verify the bucket/path exists.
  • For NFS, check that it’s mounted at the same path on every node.
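
A sketch of pointing an experiment at shared cloud storage, assuming the Ray 2.x RunConfig API; the bucket URI and experiment name are placeholders.

from ray import train, tune

def train_fn(config):
    ...  # placeholder trainable

# Every node must have credentials for this bucket.
tuner = tune.Tuner(
    train_fn,
    run_config=train.RunConfig(
        storage_path="s3://my-bucket/tune-results",
        name="my-experiment",
    ),
)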

Resuming an experiment fails

from ray import tune
tuner = tune.Tuner.restore("s3://bucket/runs/my-experiment", trainable=train_fn)
  • The trainable argument must point to the same function used originally.
  • The search algorithm is pickled alongside the results; the same Python version is required to unpickle it.

Diagnosing slowness

  • Capture a trace with ray timeline (or ray.timeline(filename=...) from Python) and open the resulting JSON file in chrome://tracing.
  • Check whether one slow trial is blocking others (max_concurrent_trials).
  • Profile the training function on a single trial first to confirm it’s not the bottleneck (see the sketch below).
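
A quick way to do that last check is to profile one configuration outside of Tune entirely; train_fn and the sample config are placeholders.

import cProfile

def train_fn(config):
    ...  # your training function, unchanged

sample_config = {"lr": 1e-3, "batch_size": 32}  # hypothetical values

# Run a single trial's worth of work directly and see where the time goes.
cProfile.run("train_fn(sample_config)", sort="cumtime")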

Next steps

Observability

Cluster-wide metrics and logs.

Distributed tuning

Scale Tune correctly.