
Ray Tune is distributed by default — connect to a Ray cluster and Tune places trials across the available nodes.

Connect to a cluster

import ray
ray.init(address="auto")
Or set RAY_ADDRESS and let ray.init pick it up.
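
For example, a minimal sketch (illustration only; the variable can also be exported in your shell before launching the script):
import os
import ray

# Illustrative: "auto" attaches to a running local cluster; a host:port address also works.
os.environ["RAY_ADDRESS"] = "auto"
ray.init()  # ray.init() reads RAY_ADDRESS when no address is passed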

Per-trial resources

tuner = tune.Tuner(
    tune.with_resources(train_fn, {"cpu": 4, "gpu": 1}),
    ...
)
Each trial reserves the requested resources from the cluster. The cluster autoscaler (if enabled) adds nodes when more capacity is needed.
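
A self-contained sketch of the pattern (the train_fn body and the "loss" metric are placeholders, and the {"cpu": 4, "gpu": 1} request assumes GPU nodes are available):
from ray import tune

def train_fn(config):
    # Placeholder objective; a real trainable would train a model here.
    return {"loss": (config["lr"] - 0.01) ** 2}

tuner = tune.Tuner(
    tune.with_resources(train_fn, {"cpu": 4, "gpu": 1}),
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(num_samples=8, metric="loss", mode="min"),
)
results = tuner.fit()
print(results.get_best_result().config)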

Distributed-training trials

Launch a Ray Train trainer from inside the trainable function:
from ray import tune
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def trainable(config):
    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config=config,
        # Each trial launches 4 GPU training workers.
        scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    )
    trainer.fit()

tuner = tune.Tuner(
    trainable,
    param_space={"lr": tune.loguniform(1e-5, 1e-1)},
    tune_config=tune.TuneConfig(num_samples=8),
)
Each trial runs an entire 4-GPU training job, and Tune schedules the trial's placement group across the cluster.

Concurrency

tune.TuneConfig(max_concurrent_trials=4)
Without a cap, Tune runs as many trials as the cluster can hold. Set max_concurrent_trials to throttle.
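
In context, a sketch reusing the trainable defined above:
tuner = tune.Tuner(
    trainable,
    param_space={"lr": tune.loguniform(1e-5, 1e-1)},
    tune_config=tune.TuneConfig(
        num_samples=32,
        max_concurrent_trials=4,  # at most 4 trials in flight at once
    ),
)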

Shared storage

For multi-node clusters, set storage_path to shared storage that every node can reach (S3, GCS, NFS, EFS):
ray.train.RunConfig(storage_path="s3://bucket/runs/")
Workers on different nodes read and write checkpoints to this location.
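
Pass the run config to the Tuner, for example (sketch; the bucket path is a placeholder):
from ray import train, tune

tuner = tune.Tuner(
    trainable,
    run_config=train.RunConfig(storage_path="s3://bucket/runs/"),
)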

Autoscaling

When running on Kubernetes (KubeRay) or VMs with the cluster launcher, Tune’s resource requests trigger the autoscaler. Configure node types in your cluster config to ensure GPUs are available.

Next steps

Trial checkpoints

Persist trial state to shared storage.

Cluster setup

Run Ray on Kubernetes or VMs.