
Prometheus

Every Ray node exposes Prometheus metrics on port 8080 by default. Configure a Prometheus scrape job that points at the service-discovery file Ray generates:
scrape_configs:
  - job_name: ray
    file_sd_configs:
      - files: [/tmp/ray/prom_metrics_service_discovery.json]
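
Prometheus file-based service discovery expects a JSON list of target groups; the Ray-generated file follows this shape (the addresses below are illustrative, not what Ray will actually write on your cluster):

```json
[
  {
    "labels": {"job": "ray"},
    "targets": ["10.0.0.5:8080", "10.0.0.6:8080"]
  }
]
```

Prometheus re-reads the file on change, so nodes joining or leaving the cluster are picked up without restarting Prometheus.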

Core metrics

Metric                                    Description
ray_node_cpu_utilization                  Per-node CPU usage.
ray_node_mem_used / ray_node_mem_total    Memory usage.
ray_object_store_memory_used              Object store bytes in use.
ray_tasks{state}                          Counts of pending/running/completed tasks.
ray_actors{state}                         Counts of actors by state.
ray_health_check_rpc_latency_ms           Health-check RPC latency.
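
These compose in PromQL like any other metrics; for example, a per-node memory-utilization ratio (assuming the metric names above):

```promql
ray_node_mem_used / ray_node_mem_total
```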

Library metrics

  • Ray Data: per-stage throughput, block sizes, spilling.
  • Ray Train: per-worker progress, iterations per second.
  • Ray Serve: per-deployment QPS, p50/p95/p99 latency, replica count, queue depth.
  • Ray Tune: per-trial progress, scheduler decisions.

Custom metrics

from ray.util.metrics import Counter, Gauge, Histogram

# Counter: a monotonically increasing value.
predictions = Counter("my_app_predictions_total", "Predictions made")
predictions.inc()

These metrics show up in Prometheus alongside Ray's built-in metrics.

Grafana

Ray writes default Grafana dashboards to /tmp/ray/session_latest/metrics/grafana/dashboards/. Import them or use the bundled kuberay-monitoring chart on Kubernetes.

Next steps

  • Logging: pair metrics with logs.
  • Tracing: distributed tracing.