
Report metrics

Use ray.train.report to push metrics from train_loop_per_worker to the run.
ray.train.report({"loss": loss.item(), "accuracy": acc, "epoch": epoch})
Reported values appear in:
  • The Ray dashboard’s Train tab
  • The Result object returned by trainer.fit()
  • Any logger callbacks attached to the run
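Putting this together, a per-epoch reporting loop might look like the following sketch. Everything except the call to ray.train.report is a placeholder: compute_epoch_metrics and run_one_epoch are hypothetical helpers standing in for your real training and evaluation code, and the Ray import is deferred into the function so the sketch stays self-contained.

```python
def compute_epoch_metrics(losses, correct, total, epoch):
    """Aggregate raw per-batch stats into the dict passed to ray.train.report."""
    return {
        "loss": sum(losses) / len(losses),
        "accuracy": correct / total,
        "epoch": epoch,
    }

def run_one_epoch():
    """Placeholder for a real training epoch: returns batch losses and counts."""
    return [0.9, 0.7, 0.5], 80, 100

def train_loop_per_worker(config):
    # Deferred import: report() only works inside a Ray Train run context.
    from ray import train

    for epoch in range(config["num_epochs"]):
        losses, correct, total = run_one_epoch()
        train.report(compute_epoch_metrics(losses, correct, total, epoch))
```

Keeping the metric aggregation in a plain function makes it easy to unit-test outside of a Ray cluster.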

Built-in loggers

from ray.train import RunConfig
from ray.train.callbacks.mlflow import MLflowLoggerCallback
from ray.train.callbacks.wandb import WandbLoggerCallback
from ray.train.callbacks.tbx import TBXLoggerCallback

run_config = RunConfig(callbacks=[
    MLflowLoggerCallback(experiment_name="finetune"),
    WandbLoggerCallback(project="ray-train-demo"),
    TBXLoggerCallback(),
])

Ray dashboard

The Ray dashboard (opened with ray dashboard, or at the URL printed by ray.init) shows:
  • Per-run metrics over time
  • Worker utilization (CPU, GPU, memory)
  • Reported checkpoints

TensorBoard

Logs land under <storage_path>/<run_name>/. Point TensorBoard at that directory:
tensorboard --logdir s3://bucket/runs/
Note that reading s3:// paths requires TensorBoard's S3 filesystem support; for a local or NFS storage_path, pass the filesystem path directly.

Custom callbacks

Subclass TrainCallback for custom integrations:
from ray.train.callbacks import TrainCallback

class SlackNotifier(TrainCallback):
    def on_trial_result(self, iteration, trials, trial, result, **info):
        # send_slack is a user-defined helper, e.g. a POST to a Slack webhook.
        if result["loss"] < 0.1:
            send_slack(f"Trial {trial.trial_id} hit loss < 0.1")

run_config = RunConfig(callbacks=[SlackNotifier()])
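The notification check in a callback like the one above can be factored into plain functions and unit-tested without a Ray cluster. A minimal sketch, where should_notify, format_alert, and their defaults are illustrative rather than part of Ray's API:

```python
def should_notify(result, metric="loss", threshold=0.1):
    """Return True when the reported metric exists and drops below the threshold."""
    value = result.get(metric)
    return value is not None and value < threshold

def format_alert(trial_id, metric="loss", threshold=0.1):
    """Build the notification message sent to the external service."""
    return f"Trial {trial_id} hit {metric} < {threshold}"
```

Guarding on the metric's presence also keeps the callback from raising a KeyError on intermediate results that do not report that metric.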

Profiling workers

Use the dashboard’s “Stack Trace” and “py-spy” actions on a worker to capture a flame graph or stack snapshot of a running training job.

Next steps

Observability

Cluster-wide metrics, logs, and tracing.

Run config

All callback options.