Every Ray node exposes metrics on port 8080 by default. Configure a Prometheus scrape job pointing at the Ray-generated service-discovery file.
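A minimal scrape configuration might look like the sketch below. The discovery-file path `/tmp/ray/prom_metrics_service_discovery.json` assumes Ray's default temp directory; adjust it if you start Ray with a custom `--temp-dir`.

```yaml
# prometheus.yml (sketch) -- assumes Ray's default temp dir on the head node.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "ray"
    # Ray regenerates this file as nodes join and leave the cluster,
    # so Prometheus picks up topology changes automatically.
    file_sd_configs:
      - files:
          - "/tmp/ray/prom_metrics_service_discovery.json"
```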
## Core metrics
| Metric | Description |
|---|---|
| `ray_node_cpu_utilization` | Per-node CPU usage. |
| `ray_node_mem_used` / `ray_node_mem_total` | Memory usage. |
| `ray_object_store_memory_used` | Object store bytes in use. |
| `ray_tasks{state}` | Counts of pending/running/completed tasks. |
| `ray_actors{state}` | Counts of actors by state. |
| `ray_health_check_rpc_latency_ms` | Health-check RPC latency. |
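These metrics compose into ordinary Prometheus alerting rules. A sketch of one such rule on node memory; the rule name, threshold, and duration are illustrative choices, not Ray defaults:

```yaml
# ray-alerts.yml (sketch): fire when a node stays above 90% memory for 5 minutes.
groups:
  - name: ray-node-health
    rules:
      - alert: RayNodeMemoryHigh
        expr: ray_node_mem_used / ray_node_mem_total > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ray node {{ $labels.instance }} memory above 90%"
```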
## Library metrics
- Ray Data: per-stage throughput, block sizes, spilling.
- Ray Train: per-worker progress, iterations per second.
- Ray Serve: per-deployment QPS, p50/p95/p99 latency, replica count, queue depth.
- Ray Tune: per-trial progress, scheduler decisions.
## Custom metrics
## Grafana
Ray writes default Grafana dashboards to `/tmp/ray/session_latest/metrics/grafana/dashboards/`. Import them into an existing Grafana instance, or use the bundled kuberay-monitoring chart on Kubernetes.
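One way to pick those dashboards up automatically is Grafana's file-based dashboard provisioning. A sketch, assuming Grafana runs on the same host as the Ray head node and can read Ray's temp directory:

```yaml
# /etc/grafana/provisioning/dashboards/ray.yml (sketch)
apiVersion: 1
providers:
  - name: "ray-dashboards"
    type: file
    options:
      # Point Grafana at the dashboards Ray generates.
      path: /tmp/ray/session_latest/metrics/grafana/dashboards
```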
## Next steps
- Logging: pair metrics with logs.
- Tracing: distributed tracing.