Every Ray node exposes metrics on port 8080 by default. Configure a Prometheus scrape job pointing at the Ray-generated service-discovery file.
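A minimal scrape configuration might look like the sketch below. The discovery-file path `/tmp/ray/prom_metrics_service_discovery.json` assumes Ray's default temp directory; adjust it if you start Ray with a custom `--temp-dir`.

```yaml
# prometheus.yml (sketch) -- assumes Ray's default temp dir on the head node.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "ray"
    # Ray regenerates this file as nodes join and leave the cluster,
    # so Prometheus picks up topology changes automatically.
    file_sd_configs:
      - files:
          - "/tmp/ray/prom_metrics_service_discovery.json"
```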
## Core metrics
| Metric | Description |
|---|---|
| `ray_node_cpu_utilization` | Per-node CPU usage. |
| `ray_node_mem_used` / `ray_node_mem_total` | Memory usage. |
| `ray_object_store_memory_used` | Object store bytes in use. |
| `ray_tasks{state}` | Counts of pending/running/completed tasks. |
| `ray_actors{state}` | Counts of actors by state. |
| `ray_health_check_rpc_latency_ms` | Health-check RPC latency. |
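These metrics compose into ordinary Prometheus alerting rules. A sketch of one such rule on node memory; the rule name, threshold, and duration are illustrative choices, not Ray defaults:

```yaml
# ray-alerts.yml (sketch): fire when a node stays above 90% memory for 5 minutes.
groups:
  - name: ray-node-health
    rules:
      - alert: RayNodeMemoryHigh
        expr: ray_node_mem_used / ray_node_mem_total > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ray node {{ $labels.instance }} memory above 90%"
```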
## Library metrics
- Ray Data: per-stage throughput, block sizes, spilling.
- Ray Train: per-worker progress, iterations per second.
- Ray Serve: per-deployment QPS, p50/p95/p99 latency, replica count, queue depth.
- Ray Tune: per-trial progress, scheduler decisions.
## Custom metrics
## Grafana
Ray writes default Grafana dashboards to `/tmp/ray/session_latest/metrics/grafana/dashboards/`. Import them into an existing Grafana instance, or use the bundled kuberay-monitoring chart on Kubernetes.
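One way to pick those dashboards up automatically is Grafana's file-based dashboard provisioning. A sketch, assuming Grafana runs on the same host as the Ray head node and can read Ray's temp directory:

```yaml
# /etc/grafana/provisioning/dashboards/ray.yml (sketch)
apiVersion: 1
providers:
  - name: "ray-dashboards"
    type: file
    options:
      # Point Grafana at the dashboards Ray generates.
      path: /tmp/ray/session_latest/metrics/grafana/dashboards
```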
## Next steps
- Logging: pair metrics with logs.
- Tracing: distributed tracing.