
Ray Tune is distributed by default — connect to a Ray cluster and Tune places trials across the available nodes.

Connect to a cluster

import ray
ray.init(address="auto")
Or set RAY_ADDRESS and let ray.init pick it up.
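
For example, a minimal sketch (illustration only; the variable can also be exported in your shell before launching the script):
import os
import ray

# Illustrative: "auto" attaches to a running local cluster; a host:port address also works.
os.environ["RAY_ADDRESS"] = "auto"
ray.init()  # ray.init() reads RAY_ADDRESS when no address is passed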

Per-trial resources

tuner = tune.Tuner(
    tune.with_resources(train_fn, {"cpu": 4, "gpu": 1}),
    ...
)
Each trial reserves the requested resources from the cluster. The cluster autoscaler (if enabled) adds nodes when more capacity is needed.
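
A self-contained sketch of the pattern (the train_fn body and the "loss" metric are placeholders, and the {"cpu": 4, "gpu": 1} request assumes GPU nodes are available):
from ray import tune

def train_fn(config):
    # Placeholder objective; a real trainable would train a model here.
    return {"loss": (config["lr"] - 0.01) ** 2}

tuner = tune.Tuner(
    tune.with_resources(train_fn, {"cpu": 4, "gpu": 1}),
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(num_samples=8, metric="loss", mode="min"),
)
results = tuner.fit()
print(results.get_best_result().config)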

Distributed-training trials

Launch a Ray Train trainer from inside the trainable function:
from ray import tune
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def trainable(config):
    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config=config,
        # Each trial launches 4 GPU training workers.
        scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    )
    trainer.fit()

tuner = tune.Tuner(
    trainable,
    param_space={"lr": tune.loguniform(1e-5, 1e-1)},
    tune_config=tune.TuneConfig(num_samples=8),
)
Each trial runs an entire 4-GPU training job, and Tune schedules the trial's placement group across the cluster.

Concurrency

tune.TuneConfig(max_concurrent_trials=4)
Without a cap, Tune runs as many trials as the cluster can hold. Set max_concurrent_trials to throttle.
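
In context, a sketch reusing the trainable defined above:
tuner = tune.Tuner(
    trainable,
    param_space={"lr": tune.loguniform(1e-5, 1e-1)},
    tune_config=tune.TuneConfig(
        num_samples=32,
        max_concurrent_trials=4,  # at most 4 trials in flight at once
    ),
)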

Shared storage

For multi-node clusters, set storage_path to shared storage that every node can reach (S3, GCS, NFS, EFS):
ray.train.RunConfig(storage_path="s3://bucket/runs/")
Workers on different nodes read and write checkpoints to this location.
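
Pass the run config to the Tuner, for example (sketch; the bucket path is a placeholder):
from ray import train, tune

tuner = tune.Tuner(
    trainable,
    run_config=train.RunConfig(storage_path="s3://bucket/runs/"),
)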

Autoscaling

When running on Kubernetes (KubeRay) or VMs with the cluster launcher, Tune’s resource requests trigger the autoscaler. Configure node types in your cluster config to ensure GPUs are available.

Next steps

Trial checkpoints

Persist trial state to shared storage.

Cluster setup

Run Ray on Kubernetes or VMs.