## Engine arguments

`engine_kwargs` is forwarded to the underlying engine (vLLM by default). Common knobs, with a usage sketch after the table:
| Field | Effect |
|---|---|
| `tensor_parallel_size` | Number of GPUs to shard the model across. |
| `max_model_len` | Maximum context length, in tokens. |
| `dtype` | Weight dtype (`auto`, `bfloat16`, `float16`, `float32`). |
| `quantization` | Quantization scheme (`awq`, `gptq`, `fp8`, etc.). |
| `gpu_memory_utilization` | Fraction of GPU memory the engine may use. Default `0.9`. |
| `enforce_eager` | Disable CUDA graph capture; useful during debugging. |
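Putting a few of these together: a minimal sketch of passing `engine_kwargs` through a Ray Serve LLM config, assuming the `ray.serve.llm.LLMConfig` / `build_openai_app` API; the model ID and checkpoint are placeholders.

```python
# Sketch only: model names are placeholders.
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "my-llm",  # name clients will request (placeholder)
        "model_source": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    },
    engine_kwargs={
        "tensor_parallel_size": 2,      # shard weights across 2 GPUs
        "max_model_len": 8192,          # cap the context length
        "dtype": "bfloat16",
        "gpu_memory_utilization": 0.9,  # engine may use 90% of GPU memory
        "enforce_eager": False,         # set True to skip CUDA graphs while debugging
    },
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```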
## Resource requests

`accelerator_type` pins replicas to a specific GPU SKU, so Ray schedules them only on nodes with that accelerator.
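For example, a hedged sketch that pins replicas to A10G GPUs; the SKU string and model are illustrative.

```python
from ray.serve.llm import LLMConfig

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "my-llm",  # placeholder
        "model_source": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder
    },
    accelerator_type="A10G",  # replicas only schedule on nodes with A10G GPUs
)
```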
## Autoscaling

`downscale_delay_s` delays scale-in after load drops, so brief dips in traffic don't cause replica churn.
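A hedged sketch of an autoscaling setup, assuming Ray Serve's `autoscaling_config` keys (`min_replicas`, `max_replicas`, `target_ongoing_requests`, `downscale_delay_s`); the numbers are illustrative.

```python
from ray.serve.llm import LLMConfig

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "my-llm",  # placeholder
        "model_source": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder
    },
    deployment_config={
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 4,
            "target_ongoing_requests": 16,  # scale on in-flight requests per replica
            "downscale_delay_s": 300,       # wait 5 minutes before scaling in
        },
    },
)
```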
## Quantization
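The `quantization` engine argument from the table above must match how the checkpoint was produced. A hedged sketch for an AWQ model; the checkpoint name is a placeholder.

```python
from ray.serve.llm import LLMConfig

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "my-llm-awq",  # placeholder
        "model_source": "TheBloke/Llama-2-7B-AWQ",  # placeholder AWQ checkpoint
    },
    engine_kwargs={
        "quantization": "awq",  # must match the checkpoint's quantization scheme
    },
)
```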
## Speculative decoding
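A hedged sketch, assuming a recent vLLM that accepts a `speculative_config` engine argument (older releases used top-level `speculative_model` / `num_speculative_tokens` instead); the draft model and token count are illustrative.

```python
from ray.serve.llm import LLMConfig

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "my-llm",  # placeholder
        "model_source": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder
    },
    engine_kwargs={
        # Assumption: recent vLLM accepts a speculative_config dict.
        "speculative_config": {
            "model": "meta-llama/Llama-3.2-1B-Instruct",  # small draft model (illustrative)
            "num_speculative_tokens": 4,                  # draft tokens proposed per step
        },
    },
)
```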
## Next steps

- **Multi-LoRA**: adapter-level configuration.
- **Serving**: end-to-end serving guide.