
Head node

The head runs cluster-wide services that don’t have to scale with workers:
  • GCS (Global Control Service): cluster metadata, actor registry, placement-group state.
  • Cluster autoscaler: requests new worker nodes when the workload needs more resources.
  • Dashboard: web UI and API for inspecting the cluster.
  • Driver process: optional — drivers can run on the head or anywhere else with network access.
For high availability, configure the GCS to persist its state to an external Redis (or another key-value store). Without this, losing the head node takes the whole cluster down.
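A minimal sketch of enabling GCS fault tolerance when starting the head node. The Redis address is an assumption; substitute your own instance:

```shell
# Point the GCS at an external Redis so cluster metadata
# survives a head-node restart (the address below is a placeholder).
RAY_REDIS_ADDRESS=redis.example.com:6379 ray start --head
```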

Worker node

Workers run user code. Each worker hosts:
  • Raylet: schedules tasks and actors locally; coordinates with raylets on other nodes.
  • Object store (Plasma): a node-local shared-memory segment for object data.
  • Python (or Java/C++) worker processes: run tasks and actors.

Driver

A driver is the process that owns the top-level ray.init() call. The driver submits tasks and actors and consumes their results. There can be many drivers connected to one cluster simultaneously.

Resources

Each node advertises a resource bundle (CPUs, GPUs, memory, custom labels). Tasks and actors request resources; the scheduler matches requests to nodes. See Scheduling.

Cluster autoscaler

The autoscaler watches pending resource requests and node utilization. When demand outstrips supply, it requests new nodes from the underlying provider (Kubernetes, AWS Auto Scaling Groups, GCP Managed Instance Groups, and so on). When a node has been idle past a configurable timeout, it terminates that node.
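The relevant knobs live in the cluster launcher config. A sketch of the autoscaler-related fields, where the node-type name is an assumption and provider-specific fields are omitted:

```yaml
max_workers: 10            # cluster-wide cap on worker nodes
idle_timeout_minutes: 5    # terminate workers idle past this timeout
available_node_types:
  cpu_worker:
    min_workers: 1         # keep at least one worker warm
    max_workers: 8
    resources: {"CPU": 8}
```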

Dashboard

The dashboard is available at http://<head-ip>:8265 by default. It shows:
  • Node-level resource utilization
  • Live task and actor lists
  • Logs and stack traces per worker
  • Ray Train, Ray Tune, Ray Serve, and Ray Data sub-tabs

Object spilling

When the object store on a node fills, Ray spills cold objects to local disk (or external storage like S3). Configure spilling targets to handle workloads larger than aggregate memory.

Networking

Ports a cluster typically uses:
  • 6379: GCS port (ray start --port).
  • 10001: Ray Client server.
  • 8265: Dashboard / Jobs API.
  • Ephemeral ports: the raylet (node manager), object manager, and worker processes; these can be pinned with ray start port flags.
For Kubernetes, the operator opens these via headless services. For VMs, the cluster launcher uses security groups.
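When firewall rules must be static, the normally random ports can be pinned at startup. A sketch, where the specific port numbers are assumptions:

```shell
ray start --head \
  --port=6379 \
  --dashboard-port=8265 \
  --ray-client-server-port=10001 \
  --node-manager-port=6380 \
  --object-manager-port=6381 \
  --min-worker-port=10002 --max-worker-port=19999
```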

Next steps

CLI

Cluster lifecycle commands.

Dashboard

Configure and secure the dashboard.