
vLLM vs TensorRT-LLM vs HF TGI vs LMDeploy: A Deep Technical Comparison for Production LLM Inference

Serving LLMs in production is now a systems problem, not a single generate() call. For real workloads, the choice of inference stack determines your tokens per second, your tail latency, and ultimately your cost per million tokens on a given GPU fleet.

This comparison focuses on four widely used stacks:

  • vLLM
  • NVIDIA TensorRT-LLM
  • Hugging Face Text Generation Inference (TGI v3)
  • LMDeploy

1. vLLM: PagedAttention as the open baseline

Basic idea

vLLM is built around PagedAttention, an attention implementation that treats the KV cache like paged virtual memory instead of a single contiguous buffer per sequence.

Instead of allocating one large KV area per request, vLLM:

  • Splits the KV cache into fixed-size blocks
  • Maintains a block table that maps logical blocks to physical blocks
  • Shares blocks across sequences wherever prefixes overlap

This reduces external fragmentation and allows the scheduler to pack many concurrent sequences into the same VRAM.
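
To make the block-table idea concrete, here is a toy Python sketch (illustrative only, not vLLM's actual implementation) of fixed-size KV block allocation with reference-counted prefix sharing:

```python
# Toy illustration of paged KV allocation; vLLM's real allocator also handles
# copy-on-write for partially filled shared blocks, eviction, and GPU memory.
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block

@dataclass
class BlockAllocator:
    num_blocks: int
    free: list = field(default_factory=list)
    refcount: dict = field(default_factory=dict)

    def __post_init__(self):
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        # Prefix sharing: bump the refcount instead of copying the block.
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)

def block_table_for(num_tokens: int, allocator: BlockAllocator,
                    shared_prefix: list | None = None) -> list:
    """Build a block table mapping a sequence's logical blocks to physical blocks."""
    table = list(shared_prefix or [])
    covered = len(table) * BLOCK_SIZE
    remaining = max(0, num_tokens - covered)
    for _ in range((remaining + BLOCK_SIZE - 1) // BLOCK_SIZE):
        table.append(allocator.allocate())
    return table

allocator = BlockAllocator(num_blocks=1024)
prompt_table = block_table_for(96, allocator)  # 96-token prompt -> 6 physical blocks
# A second sequence that extends the same prompt shares those 6 blocks.
fork_table = block_table_for(120, allocator,
                             shared_prefix=[allocator.share(b) for b in prompt_table])
print(prompt_table, fork_table)
```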

Throughput and latency

vLLM improves throughput by 2-4× over systems like FasterTransformer and Orca at similar latency, with larger gains for longer sequences.

Key operational characteristics:

  • Continuous batching (also called in-flight batching) merges incoming requests into already running GPU batches instead of waiting for fixed batch windows.
  • In typical chat workloads, throughput scales close to linearly with concurrency until KV memory or compute saturates.
  • P50 latency stays low at moderate concurrency, but P99 can degrade once queues grow or KV memory gets tight, especially for prefill-heavy requests.

vLLM exposes an OpenAI-compatible HTTP API and integrates well with Ray Serve and other orchestrators, which is why it is widely used as the open baseline.
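
As a quick illustration, assuming a server started locally with `vllm serve meta-llama/Llama-3.1-8B-Instruct` on the default port 8000 (the model name is just an example), the standard OpenAI Python client can talk to it directly:

```python
# Query a locally running vLLM server through its OpenAI-compatible API.
# Assumes `vllm serve meta-llama/Llama-3.1-8B-Instruct` is listening on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model name
    messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
    max_tokens=128,
    temperature=0.2,
)
print(response.choices[0].message.content)
```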

KV cache and multi-tenancy

  • PagedAttention yields near-zero KV waste and flexible prefix sharing within and across requests.
  • Each vLLM process serves one model; multi-tenant and multi-model setups are typically built with an external router or API gateway distributed over multiple vLLM instances.

2. TensorRT-LLM: pushing the hardware limit on NVIDIA GPUs

Basic idea

TensorRT-LLM is NVIDIA’s optimized inference library for its GPUs. It provides custom attention kernels, in-flight batching, paged KV caching, quantization down to FP4 and INT4, and speculative decoding.

It is tightly coupled to NVIDIA hardware, including the FP8 Tensor Cores in Hopper and Blackwell.

Measured performance

NVIDIA’s published H100 vs A100 comparison is the most concrete general reference:

  • On H100 with FP8, TensorRT-LLM reaches more than 10,000 output tokens/sec at peak throughput for 64 concurrent requests, with around 100 ms time to first token.
  • H100 with FP8 achieves up to 4.6× higher maximum throughput and 4.4× faster time to first token than A100 on the same models.

For latency-sensitive modes:

  • TensorRT-LLM on H100 can push TTFT below 10 ms in batch-1 configurations, at the expense of lower overall throughput.

These numbers are model- and shape-specific, but they give a realistic order of magnitude.

Prefill vs. decode

TensorRT-LLM optimizes both stages:

  • Prefill benefits from high-throughput FP8 attention kernels and tensor parallelism
  • Decode benefits from CUDA graphs, speculative decoding, quantized weights and KV, and kernel fusion

The result is very high tokens/sec over a wide range of input and output lengths, especially when the engine is tuned for the specific model and batch profile.
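
As a rough sketch of what using it looks like, assuming the high-level LLM API shipped in recent TensorRT-LLM releases (class and argument names may differ between versions, and real deployments usually tune the engine build and serve it behind Triton or an OpenAI-compatible frontend):

```python
# Sketch only: assumes the high-level LLM API in recent TensorRT-LLM releases;
# exact names and defaults vary by version, and engines are normally pre-built and tuned.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # builds or loads an engine for the model
params = SamplingParams(temperature=0.2, max_tokens=128)

outputs = llm.generate(
    ["Explain the difference between prefill and decode in one paragraph."],
    params,
)
print(outputs[0].outputs[0].text)
```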

KV cache and multi-tenancy

TensorRT-LLM provides:

  • A paged KV cache with a structured layout
  • Support for long sequences, KV reuse, and offloading
  • In-flight batching and priority-aware scheduling

NVIDIA pairs this with Ray- or Triton-based orchestration patterns for multi-tenant clusters. Multiple models are supported at the orchestrator level, not within a single TensorRT-LLM engine instance.

3. Hugging Face TGI v3: a long-prompt, multi-backend specialist

Basic idea

Text Generation Inference (TGI) is a service stack based on Rust and Python that adds:

  • HTTP and gRPC APIs
  • Continuous batching and scheduling
  • Observability and autoscaling hooks
  • Pluggable backends, including vLLM style engines, TensorRT-LLM, and other runtimes

Version 3 focuses on fast long-prompt processing through chunking and prefix caching.

Long-prompt benchmark vs vLLM

The TGI v3 documentation reports a concrete benchmark:

  • On long prompts with more than 200,000 tokens, a conversation reply that takes 27.5 seconds in vLLM can be served in about 2 seconds in TGI v3.
  • This is reported as a 13× speedup on that workload.
  • TGI v3 can also process roughly 3× more tokens in the same GPU memory by reducing its memory footprint and exploiting chunking and caching.

The mechanism is:

  • TGI keeps the original conversation context in a prefix cache, so subsequent turns only pay prefill for the additional tokens
  • Cache lookup takes on the order of microseconds, which is negligible compared to the prefill cost

This is a targeted optimization for workloads where prompts are very long and reused across turns, for example RAG pipelines and analytical summarization.
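
A toy sketch of the idea (illustrative only; TGI's real cache operates on KV blocks inside the engine): key cached KV state by a hash of the token prefix, so a follow-up turn only pays prefill for the new suffix.

```python
# Toy prefix cache, illustrative only; TGI's real implementation works on KV blocks
# inside the engine and matches prefixes block by block.
import hashlib

class PrefixCache:
    def __init__(self):
        self._store = {}  # hash of token prefix -> opaque KV handle

    @staticmethod
    def _key(tokens):
        return hashlib.sha256(str(tokens).encode("utf-8")).hexdigest()

    def insert(self, tokens, kv_handle):
        self._store[self._key(tokens)] = kv_handle

    def lookup(self, tokens):
        """Return (length of longest cached prefix, its KV handle)."""
        for end in range(len(tokens), 0, -1):  # real systems match block-aligned prefixes
            hit = self._store.get(self._key(tokens[:end]))
            if hit is not None:
                return end, hit
        return 0, None

cache = PrefixCache()
turn1 = list(range(200_000))          # stand-in for a very long first prompt
cache.insert(turn1, kv_handle="kv-of-turn-1")

turn2 = turn1 + [7, 8, 9]             # follow-up turn reuses the same context
cached_len, kv = cache.lookup(turn2)
print(f"reuse KV for {cached_len} tokens, prefill only {len(turn2) - cached_len} new ones")
```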

Architecture and latency behavior

Main components:

  • Chunking: very long prompts are broken into manageable segments for KV allocation and scheduling
  • Prefix caching: a data structure for sharing long context across turns
  • Continuous batching: incoming requests join batches of already running sequences
  • PagedAttention and fused kernels on the GPU backends

For short chat-style workloads, throughput and response time are in the same ballpark as vLLM. For long, cacheable contexts, both P50 and P99 latency improve by an order of magnitude because the engine avoids repeated prefill.
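
In API terms, the reuse pattern is simply resending the growing conversation; the sketch below assumes a TGI v3 server on localhost:8080 exposing its OpenAI-compatible Messages API, with a placeholder standing in for a real long document context:

```python
# Multi-turn call pattern that benefits from TGI v3 prefix caching.
# Assumes a TGI server on localhost:8080 with the OpenAI-compatible Messages API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

history = [
    {"role": "user", "content": "<very long RAG context: many documents, ~200k tokens>"},
]
first = client.chat.completions.create(model="tgi", messages=history, max_tokens=256)

history += [
    {"role": "assistant", "content": first.choices[0].message.content},
    {"role": "user", "content": "Now list the three key risks mentioned above."},
]
# The resent conversation shares its prefix with turn 1, so the server can reuse
# the cached KV and prefill only the new suffix.
second = client.chat.completions.create(model="tgi", messages=history, max_tokens=256)
print(second.choices[0].message.content)
```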

Multi-backend and multi-model

TGI is designed to act as a router in addition to a model server. It can:

  • Route requests across multiple models and replicas
  • Target different backends, e.g. TensorRT-LLM on H100s as well as smaller GPUs or CPUs for low-priority traffic

This makes it suitable as a centralized service layer in multi-tenant environments.

4. LMDeploy: TurboMind with blocked KV and aggressive quantization

Basic idea

LMDeploy, from the InternLM ecosystem, is a toolkit for compressing and serving LLMs, centered on the TurboMind engine. It focuses on:

  • High-throughput request serving
  • Blocked KV cache
  • Continuous batching (persistent batching)
  • Quantization of weights and KV cache

Relative throughput vs vLLM

The project’s documentation states:

  • “LMDeploy delivers up to 1.8x higher request throughput than vLLM”, attributed to persistent (continuous) batching, blocked KV cache, dynamic split-and-fuse, tensor parallelism, and optimized CUDA kernels.

KV, quantization and latency

LMDeploy includes:

  • Blocked KV cache: similar to paged KV, which helps pack many sequences into VRAM
  • Support for KV cache quantization, typically int8 or int4, to cut KV memory and bandwidth
  • Weight-only quantization paths such as 4-bit AWQ
  • Benchmark tooling that reports token throughput, request throughput, and time to first token

This makes LMDeploy attractive when you want to run larger open models like InternLM or Qwen on mid-range GPUs with strong compression while preserving good tokens/sec.
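
A minimal sketch of this setup, assuming LMDeploy's Python pipeline API with the TurboMind backend (the model name is an example, and configuration fields should be checked against the installed version):

```python
# Sketch: serve an open model with LMDeploy's TurboMind backend and int8 KV quantization.
# Assumes lmdeploy is installed; field names follow the documented TurbomindEngineConfig.
from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline

engine_cfg = TurbomindEngineConfig(
    quant_policy=8,             # int8 KV cache quantization (4 = int4, 0 = disabled)
    cache_max_entry_count=0.8,  # fraction of free VRAM reserved for the blocked KV cache
    tp=1,                       # tensor parallelism degree
)

pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_cfg)
out = pipe(
    ["Explain in one sentence why blocked KV caches help request batching."],
    gen_config=GenerationConfig(max_new_tokens=64),
)
print(out[0].text)
```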

Multi-model deployments

LMDeploy’s proxy server can handle:

  • Multiple model deployments
  • Multi-node, multi-GPU setups
  • Routing logic to select models based on request metadata

Architecturally, this puts it closer to TGI’s router-plus-server design than to a single bare engine.

What to use when?

  • If you want maximum throughput and very low TTFT on NVIDIA GPUs
    • TensorRT-LLM is the default option
    • It uses FP8 and lower precision, custom kernels, and speculative decoding to push tokens/sec while keeping TTFT around 100 ms at high concurrency and under 10 ms at low concurrency.
  • If long prompts with heavy reuse dominate your traffic, such as RAG over large contexts
    • TGI v3 is a strong default
    • Prefix caching and chunking deliver roughly 3× the token capacity and 13× lower latency than vLLM in the published long-prompt benchmark, without extra configuration
  • If you want an open, simple engine with strong core performance and an OpenAI-style API
    • vLLM remains the standard baseline
    • PagedAttention and continuous batching make it 2-4× faster than older stacks at similar latency, and it integrates cleanly with Ray and Kubernetes
  • If you target open models like InternLM or Qwen and value strong quantization plus multi-model serving
    • LMDeploy is a good fit
    • Blocked KV cache, continuous batching, and int8 or int4 KV quantization give up to 1.8× higher request throughput than vLLM on supported models, with a router layer included

In practice, many teams mix these systems, for example using TensorRT-LLM for high-volume chat traffic, TGI v3 for long-context analytics, and vLLM or LMDeploy for experimental and open-model workloads. The key is to match throughput, latency, and KV behavior to the actual token distributions in your traffic, and then compute cost per million tokens as measured on your own hardware.
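
One practical way to ground that decision is to measure TTFT and decode throughput against whichever OpenAI-compatible endpoint you deploy (vLLM, TGI, LMDeploy’s api_server, or a TensorRT-LLM frontend). A minimal streaming probe, with the endpoint and model name as placeholders, could look like this:

```python
# Minimal latency/throughput probe against any OpenAI-compatible streaming endpoint.
# The base_url and model name are placeholders; point them at your own deployment.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def probe(prompt: str, model: str, max_tokens: int = 256):
    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content if chunk.choices else None
        if delta:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            n_chunks += 1  # streamed chunks roughly track tokens; fine for a first pass
    end = time.perf_counter()
    ttft = (first_token_at or end) - start
    decode_time = end - (first_token_at or end)
    tokens_per_sec = n_chunks / decode_time if decode_time > 0 else float("nan")
    return ttft, tokens_per_sec

ttft, tps = probe("Explain KV cache paging in three sentences.", model="my-model")
print(f"TTFT: {ttft * 1000:.1f} ms, decode throughput: {tps:.1f} tokens/s")
```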


References

  1. vLLM / PagedAttention
  2. NVIDIA TensorRT-LLM overview and performance
  3. Hugging Face Text Generation Inference (TGI v3) long-prompt benchmark
  4. LMDeploy / TurboMind


Michel Sutter is a data science specialist and holds a Master’s degree in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michel excels at transforming complex data sets into actionable insights.
