
Engine arguments

engine_kwargs is forwarded to the underlying engine (vLLM by default). Common knobs:
Field                   | Effect
tensor_parallel_size    | Number of GPUs to shard the model across.
max_model_len           | Maximum context length.
dtype                   | Weight dtype (auto, bfloat16, float16, float32).
quantization            | awq, gptq, fp8, etc.
gpu_memory_utilization  | Fraction of GPU memory the engine may use. Default 0.9.
enforce_eager           | Disable CUDA graph capture. Use during debugging.
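
A minimal sketch of how these engine_kwargs plug into a deployment, assuming the ray.serve.llm LLMConfig / build_openai_app API; the model id and source below are illustrative:

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "llama-3.1-8b",                           # name clients request
        "model_source": "meta-llama/Llama-3.1-8B-Instruct",   # HF repo or local path
    },
    engine_kwargs={
        "max_model_len": 8192,
        "dtype": "auto",
        "gpu_memory_utilization": 0.9,
    },
)

# Build an OpenAI-compatible app and run it on the local Serve instance.
serve.run(build_openai_app({"llm_configs": [llm_config]}))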

Resource requests

deployment_config={
    "ray_actor_options": {
        "num_cpus": 2,
        "num_gpus": 4,            # match tensor_parallel_size
        "accelerator_type": "H100",
    }
}
accelerator_type pins replicas to specific GPU SKUs.
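
Tying the resource request to the parallelism it implies, a sketch (same LLMConfig assumption as above; the 70B model id is illustrative):

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "llama-3.1-70b",
        "model_source": "meta-llama/Llama-3.1-70B-Instruct",  # illustrative
    },
    engine_kwargs={"tensor_parallel_size": 4},                 # 4-way sharding
    deployment_config={
        "ray_actor_options": {
            "num_cpus": 2,
            "num_gpus": 4,                # one GPU per tensor-parallel shard
            "accelerator_type": "H100",   # only schedule on H100 nodes
        }
    },
)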

Autoscaling

deployment_config={
    "num_replicas": "auto",
    "autoscaling_config": {
        "min_replicas": 1,
        "max_replicas": 8,
        "target_ongoing_requests": 8,
        "downscale_delay_s": 600,
    },
}
LLM replicas can take several minutes to start (model download plus engine initialization), so a long downscale_delay_s avoids scale-down/scale-up thrashing.
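
Autoscaling and resource options live side by side in the same deployment_config; a combined sketch:

deployment_config={
    "ray_actor_options": {"num_cpus": 2, "num_gpus": 1},
    "num_replicas": "auto",
    "autoscaling_config": {
        "min_replicas": 1,
        "max_replicas": 8,
        "target_ongoing_requests": 8,   # scale up once replicas exceed this load
        "downscale_delay_s": 600,       # wait 10 minutes before removing replicas
    },
}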

Quantization

engine_kwargs={"quantization": "awq", "dtype": "auto"}
AWQ and GPTQ cut the memory footprint substantially with minor quality loss, but they load a checkpoint that was quantized ahead of time in the matching format. FP8 (on H100/H200) is a good middle ground.
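
A sketch serving a pre-quantized checkpoint (same LLMConfig assumption as above; the AWQ repo name is hypothetical):

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "llama-3.1-8b-awq",
        "model_source": "your-org/Llama-3.1-8B-Instruct-AWQ",  # hypothetical AWQ checkpoint
    },
    engine_kwargs={"quantization": "awq", "dtype": "auto"},
)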

Speculative decoding

engine_kwargs={
    "speculative_model": "meta-llama/Llama-3.2-1B-Instruct",
    "num_speculative_tokens": 5,
}
Use a small “draft” model to propose tokens that the larger model verifies in parallel; output quality is unchanged, and per-token latency drops when most proposals are accepted. Helpful for long-tail latency.
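
A sketch pairing a larger target model with the 1B draft (same LLMConfig assumption as above; the target model id is illustrative, and draft and target should share a tokenizer):

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "llama-3.1-70b",
        "model_source": "meta-llama/Llama-3.1-70B-Instruct",     # illustrative target
    },
    engine_kwargs={
        "tensor_parallel_size": 4,
        "speculative_model": "meta-llama/Llama-3.2-1B-Instruct",  # draft model
        "num_speculative_tokens": 5,    # tokens proposed per verification step
    },
)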

Next steps

Multi-LoRA: Adapter-level configuration.
Serving: End-to-end serving guide.