
Engine arguments

engine_kwargs is forwarded to the underlying engine (vLLM by default). Common knobs:
Field                   | Effect
tensor_parallel_size    | Number of GPUs to shard the model across.
max_model_len           | Maximum context length.
dtype                   | Weight dtype (auto, bfloat16, float16, float32).
quantization            | awq, gptq, fp8, etc.
gpu_memory_utilization  | Fraction of GPU memory the engine may use. Default 0.9.
enforce_eager           | Disable CUDA graph capture. Use during debugging.
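
A minimal sketch of how these engine_kwargs plug into a deployment, assuming the ray.serve.llm LLMConfig / build_openai_app API; the model id and source below are illustrative:

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "llama-3.1-8b",                           # name clients request
        "model_source": "meta-llama/Llama-3.1-8B-Instruct",   # HF repo or local path
    },
    engine_kwargs={
        "max_model_len": 8192,
        "dtype": "auto",
        "gpu_memory_utilization": 0.9,
    },
)

# Build an OpenAI-compatible app and run it on the local Serve instance.
serve.run(build_openai_app({"llm_configs": [llm_config]}))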

Resource requests

deployment_config={
    "ray_actor_options": {
        "num_cpus": 2,
        "num_gpus": 4,            # match tensor_parallel_size
        "accelerator_type": "H100",
    }
}
accelerator_type pins replicas to specific GPU SKUs.
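
Tying the resource request to the parallelism it implies, a sketch (same LLMConfig assumption as above; the 70B model id is illustrative):

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "llama-3.1-70b",
        "model_source": "meta-llama/Llama-3.1-70B-Instruct",  # illustrative
    },
    engine_kwargs={"tensor_parallel_size": 4},                 # 4-way sharding
    deployment_config={
        "ray_actor_options": {
            "num_cpus": 2,
            "num_gpus": 4,                # one GPU per tensor-parallel shard
            "accelerator_type": "H100",   # only schedule on H100 nodes
        }
    },
)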

Autoscaling

deployment_config={
    "num_replicas": "auto",
    "autoscaling_config": {
        "min_replicas": 1,
        "max_replicas": 8,
        "target_ongoing_requests": 8,
        "downscale_delay_s": 600,
    },
}
LLM replicas can take several minutes to start (model download plus engine initialization), so a long downscale_delay_s avoids scale-down/scale-up thrashing.
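
Autoscaling and resource options live side by side in the same deployment_config; a combined sketch:

deployment_config={
    "ray_actor_options": {"num_cpus": 2, "num_gpus": 1},
    "num_replicas": "auto",
    "autoscaling_config": {
        "min_replicas": 1,
        "max_replicas": 8,
        "target_ongoing_requests": 8,   # scale up once replicas exceed this load
        "downscale_delay_s": 600,       # wait 10 minutes before removing replicas
    },
}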

Quantization

engine_kwargs={"quantization": "awq", "dtype": "auto"}
AWQ and GPTQ cut the memory footprint substantially with minor quality loss, but they load a checkpoint that was quantized ahead of time in the matching format. FP8 (on H100/H200) is a good middle ground.
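
A sketch serving a pre-quantized checkpoint (same LLMConfig assumption as above; the AWQ repo name is hypothetical):

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "llama-3.1-8b-awq",
        "model_source": "your-org/Llama-3.1-8B-Instruct-AWQ",  # hypothetical AWQ checkpoint
    },
    engine_kwargs={"quantization": "awq", "dtype": "auto"},
)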

Speculative decoding

engine_kwargs={
    "speculative_model": "meta-llama/Llama-3.2-1B-Instruct",
    "num_speculative_tokens": 5,
}
Use a small “draft” model to propose tokens that the larger model verifies in parallel; output quality is unchanged, and per-token latency drops when most proposals are accepted. Helpful for long-tail latency.
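
A sketch pairing a larger target model with the 1B draft (same LLMConfig assumption as above; the target model id is illustrative, and draft and target should share a tokenizer):

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "llama-3.1-70b",
        "model_source": "meta-llama/Llama-3.1-70B-Instruct",     # illustrative target
    },
    engine_kwargs={
        "tensor_parallel_size": 4,
        "speculative_model": "meta-llama/Llama-3.2-1B-Instruct",  # draft model
        "num_speculative_tokens": 5,    # tokens proposed per verification step
    },
)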

Next steps

Multi-LoRA: Adapter-level configuration.
Serving: End-to-end serving guide.