vLLM Metrics in Production

January 28, 2026 · #vllm #llm #inference #metrics #performance #monitoring

Why this cheat sheet exists

If you operate vLLM in production, you already know the feeling: the model is "up", GPUs look busy, but users complain that the chat is sluggish. Someone suggests "add more GPUs", someone else says "the prompts got longer", and the discussion goes in circles.

The fastest way to stop guessing is to watch a small set of vLLM metrics and interpret them consistently.

This post is a practical playbook for software and AI engineers: which metrics to watch, how to query them in PromQL, what to put on a dashboard, which alerts to set, and how to act when they fire.

It's intentionally opinionated and optimized for on-call reality.

How vLLM metrics are named

vLLM exports Prometheus metrics with the vllm: prefix (for example, vllm:num_requests_waiting). Depending on your scrape pipeline, the colon is often sanitized to an underscore, which is why the queries in this post use names like vllm_num_requests_waiting.

Also pay attention to metric types, because they decide how you query:

Counters (like vllm_prompt_tokens_total) only go up: wrap them in rate() to get per-second values.

Gauges (like vllm_num_requests_waiting) are current values: read them directly, no rate() needed.

Histograms (like vllm_time_to_first_token_seconds) expose _bucket series: combine histogram_quantile() with rate() to get percentiles.

One more thing: vLLM shows you what happens inside the inference engine. For end-to-end production visibility, you still need gateway metrics (HTTP status codes, upstream latency, timeouts) and infrastructure metrics (GPU clocks/thermals/memory).
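If you're not sure which names your setup ended up with, you can list everything vLLM exposes straight from Prometheus. This is a generic PromQL sketch, nothing vLLM-specific; adjust the regex to vllm:.* if your pipeline keeps the colon.

count by (__name__) ({__name__=~"vllm_.*"})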

The two latency numbers that define user experience

In practice, you can reduce "LLM feels slow" to two metrics:

TTFT (Time To First Token) — how quickly the user sees anything

TPOT (Time Per Output Token) — how fast text streams once it starts

There are also two supporting latency metrics that help you pinpoint the cause: queue time (vllm_request_queue_time_seconds), which tells you how long requests wait before the engine starts working on them, and end-to-end request latency (vllm_e2e_request_latency_seconds), which puts TTFT and TPOT into whole-request perspective.

PromQL: the percentile queries you'll actually use

P95 TTFT

histogram_quantile(
  0.95,
  sum by (le) (rate(vllm_time_to_first_token_seconds_bucket[5m]))
)

P99 TPOT

histogram_quantile(
  0.99,
  sum by (le) (rate(vllm_time_per_output_token_seconds_bucket[5m]))
)

P95 Queue Time

histogram_quantile(
  0.95,
  sum by (le) (rate(vllm_request_queue_time_seconds_bucket[5m]))
)

vLLM reports latencies in seconds. If your brain works in milliseconds, set the Grafana panel unit to Time → seconds (s) and Grafana will render sub-second values as milliseconds automatically; if you really want a ms unit, convert in the query instead (see below).
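A minimal sketch of that conversion, reusing the P95 TTFT query from above; the only change is the multiplication by 1000 so the result matches a panel whose unit is set to milliseconds (ms):

1000 * histogram_quantile(
  0.95,
  sum by (le) (rate(vllm_time_to_first_token_seconds_bucket[5m]))
)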

The interpretation shortcut

This is the fastest diagnostic rule I know:

High TTFT, normal TPOT → requests are queueing (or prompts got much longer): a capacity and admission problem.

Normal TTFT, high TPOT → decoding is struggling: too much concurrency or KV cache pressure on the GPU.

Both high → you are saturated and probably close to (or already) swapping.

Keep that in your head. It prevents most "random tuning".
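A quick way to see where TTFT is going is to compare queue time with TTFT at the same percentile. Dividing two quantiles is only a rough heuristic, not an exact decomposition, but a ratio near 1 says TTFT is dominated by queueing, while a ratio near 0 points at prefill itself:

histogram_quantile(0.95, sum by (le) (rate(vllm_request_queue_time_seconds_bucket[5m])))
/
histogram_quantile(0.95, sum by (le) (rate(vllm_time_to_first_token_seconds_bucket[5m])))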

Throughput: are we doing useful work?

Latency tells you what users feel. Throughput tells you whether your system is doing real work.

vLLM exposes token counters:

vllm_prompt_tokens_total - input (prompt) tokens processed

vllm_generation_tokens_total - output tokens generated

PromQL

Input tokens per second

sum(rate(vllm_prompt_tokens_total[5m]))

Output tokens per second

sum(rate(vllm_generation_tokens_total[5m]))

If output tokens/s drops while traffic stays similar, you're usually looking at cache pressure, swapping, or too much concurrency.
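One derived number I find useful is output tokens per second per running request, a rough proxy for per-stream generation speed (roughly the inverse of TPOT). Treat it as a heuristic sketch: the counter and the gauge are sampled independently, so read it as a trend, not an exact value:

sum(rate(vllm_generation_tokens_total[5m]))
/
sum(vllm_num_requests_running)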

Saturation & concurrency: how close are we to the cliff?

When a system degrades, it rarely does it politely. For vLLM, the "cliff" is typically KV cache pressure and swapping.

Request state (scheduler pressure)

The scheduler gauges tell you how requests are distributed right now:

vllm_num_requests_running - requests actively decoding

vllm_num_requests_waiting - requests queued, waiting to be admitted

vllm_num_requests_swapped - requests whose KV blocks were pushed out to CPU

If you only alert on one gauge, make it the last one: any swapped requests (num_requests_swapped > 0) should be treated as a production incident.
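If you run several replicas behind a gateway, keep fleet-wide sums next to the per-instance series so a single overloaded replica doesn't hide in an average. These are straightforward aggregations of the gauges above:

sum(vllm_num_requests_running)

sum(vllm_num_requests_waiting)

sum(vllm_num_requests_swapped)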

KV cache pressure (the silent killer)

Watch vllm_gpu_cache_usage_perc, a gauge that reports GPU KV cache usage as a fraction between 0 and 1. The reason this matters: once swapping starts, you'll see brutal latency spikes because you're moving blocks between GPU and CPU.
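The gauge is spiky under bursty traffic, so a smoothed, per-instance view is often better for spotting sustained pressure before the 95% alert fires. A sketch, assuming your scrape adds the usual instance label:

max by (instance) (avg_over_time(vllm_gpu_cache_usage_perc[10m]))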

Errors: why requests finish (and what "abort" usually means)

vLLM tracks why requests stop via the vllm_finished_request_total counter, labelled with finish_reason.

Common reasons:

stop - the model emitted an EOS token or hit a stop sequence (the normal case)

length - the request hit its max_tokens limit

abort - the request was cancelled before finishing (client disconnect, timeout, or an engine error)

PromQL: breakdown by finish reason

sum by (finish_reason) (rate(vllm_finished_request_total[5m]))

What I've seen in production: abort spikes are rarely an engine problem in isolation; they almost always track client or gateway timeouts firing while requests sit in the queue or stream too slowly.
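As a dashboard companion to the breakdown above, the abort share of all finished requests is the number worth watching over time (the alert later in this post uses the same expression per instance):

sum(rate(vllm_finished_request_total{finish_reason="abort"}[5m]))
/
sum(rate(vllm_finished_request_total[5m]))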

The "golden signals" dashboard for vLLM

If you want one dashboard that's worth opening during an incident, build it around these signals:

  1. P95 TTFT — responsiveness
  2. P99 TPOT — generation speed stability
  3. Queue size (num_requests_waiting) — capacity pressure
  4. Swapped requests — must be zero
  5. GPU KV cache usage — predictive saturation
  6. Token throughput — real work (input/output tokens per second)

A practical Grafana dashboard JSON

You can import this as a starting point:

{
  "dashboard": {
    "id": null,
    "title": "vLLM Production Golden Signals",
    "tags": ["vllm", "llm", "inference"],
    "timezone": "browser",
    "schemaVersion": 38,
    "panels": [
      {
        "title": "System Saturation (GPU Cache & Swapping)",
        "type": "gauge",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "targets": [
          { "expr": "vllm_gpu_cache_usage_perc * 100", "legendFormat": "GPU Cache Fill %" }
        ],
        "fieldConfig": {
          "defaults": {
            "min": 0, "max": 100, "unit": "percent",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                { "color": "green", "value": null },
                { "color": "orange", "value": 80 },
                { "color": "red", "value": 95 }
              ]
            }
          }
        }
      },
      {
        "title": "Active & Waiting Requests",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
        "targets": [
          { "expr": "vllm_num_requests_running", "legendFormat": "Running (Concurrency)" },
          { "expr": "vllm_num_requests_waiting", "legendFormat": "Waiting in Queue" },
          { "expr": "vllm_num_requests_swapped", "legendFormat": "Swapped (Preemption!)" }
        ]
      },
      {
        "title": "Latency P95 (TTFT vs TPOT)",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum by (le) (rate(vllm_time_to_first_token_seconds_bucket[5m])))",
            "legendFormat": "P95 TTFT"
          },
          {
            "expr": "histogram_quantile(0.95, sum by (le) (rate(vllm_time_per_output_token_seconds_bucket[5m])))",
            "legendFormat": "P95 TPOT"
          }
        ],
        "fieldConfig": { "defaults": { "unit": "s" } }
      },
      {
        "title": "Token Throughput",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
        "targets": [
          { "expr": "sum(rate(vllm_generation_tokens_total[5m]))", "legendFormat": "Output Tokens/s" },
          { "expr": "sum(rate(vllm_prompt_tokens_total[5m]))", "legendFormat": "Input Tokens/s" }
        ]
      }
    ],
    "templating": {
      "list": [
        {
          "name": "datasource",
          "type": "datasource",
          "query": "prometheus",
          "refresh": 1
        }
      ]
    }
  }
}

Alerts that save you before users start complaining

You don't need 30 alerts. You need 5 that are hard to ignore.

PrometheusRule example

groups:
- name: vLLM.Alerts
  rules:

    - alert: vLLMHighTTFT
      expr: histogram_quantile(0.95, sum by (le, instance) (rate(vllm_time_to_first_token_seconds_bucket[5m]))) > 3
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High TTFT on instance {{ $labels.instance }}"
        description: "P95 TTFT is {{ $value | printf \"%.2f\" }}s. Users see long delays before the first token."

    - alert: vLLMQueueBacklog
      expr: histogram_quantile(0.95, sum by (le, instance) (rate(vllm_request_queue_time_seconds_bucket[5m]))) > 5
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "Requests are queuing on {{ $labels.instance }}"
        description: "P95 Queue Time is {{ $value | printf \"%.2f\" }}s. Engine cannot keep up."

    - alert: vLLMRequestSwapping
      expr: vllm_num_requests_swapped > 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "vLLM is swapping requests to CPU"
        description: "Instance {{ $labels.instance }} has {{ $value }} swapped requests. This indicates severe KV cache pressure and causes latency spikes."

    - alert: vLLMGPUCacheFull
      expr: vllm_gpu_cache_usage_perc > 0.95
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "GPU KV Cache is nearly full"
        description: "Cache usage on {{ $labels.instance }} is {{ $value | printf \"%.2f\" }} (fraction). Expect queueing or swapping soon."

    - alert: vLLMHighAbortRate
      expr: |
        sum by (instance) (rate(vllm_finished_request_total{finish_reason="abort"}[5m]))
        /
        sum by (instance) (rate(vllm_finished_request_total[5m])) > 0.1
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "High abort rate on {{ $labels.instance }}"
        description: "Abort ratio > 10% (current: {{ $value | printf \"%.2f\" }}). Investigate client timeouts and engine errors."

Runbook: metrics → diagnosis → action

When an alert fires or users complain, here's a quick guide to interpret the signals and take action.

Scenario 1: TTFT is high

What you see

P95 TTFT climbs past its normal baseline, P95 queue time usually climbs with it, and num_requests_waiting grows.

What it usually means

Requests are sitting in the scheduler queue before prefill even starts: the engine is being offered more work than it can admit. If queue time stays flat while TTFT rises, prompts probably got longer and prefill itself became more expensive.

What to do

Check queue time first. If it explains most of TTFT, reduce offered load or concurrency and scale out. If it doesn't, confirm whether prompts grew using input tokens per second and the average-prompt-length check below.
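To tell "queueing" from "prompts got longer", a rough check is the average prompt length per finished request. If this climbed together with TTFT while queue time stayed flat, the prefill work itself grew. A heuristic sketch using the counters from earlier:

sum(rate(vllm_prompt_tokens_total[5m]))
/
sum(rate(vllm_finished_request_total[5m]))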

Scenario 2: TPOT is high (generation slowed)

What you see

P99 TPOT rises while TTFT may still look acceptable; output tokens per second per running request drops, so streaming feels like a slow drip.

What it usually means

Too many sequences are decoding at once, or KV cache pressure is slowing every decode step; in the worst case you are already on the edge of swapping.

What to do

Check GPU KV cache usage and num_requests_swapped, cap concurrency or shed load, and scale out before you hit the swapping cliff. The per-replica query below helps tell a fleet-wide overload from a single bad instance.
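To separate a fleet-wide overload from one struggling replica, look at per-stream decode speed per instance (the same heuristic as in the throughput section, grouped by the standard instance label, which is an assumption about your scrape config):

sum by (instance) (rate(vllm_generation_tokens_total[5m]))
/
sum by (instance) (vllm_num_requests_running)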

Scenario 3: Swapping detected

What you see

num_requests_swapped is above zero, GPU KV cache usage is pinned near 100%, and both TTFT and TPOT spike hard at the same time.

What it means

The GPU KV cache is full and vLLM is moving KV blocks between GPU and CPU to keep requests alive. This is the cliff: users feel it immediately, and it will not recover on its own while load stays high.

What to do (in order)

  1. Reduce load or concurrency fast (stabilize the system)
  2. Reduce memory pressure (often: reduce max_num_batched_tokens)
  3. Scale out (more replicas / more GPU capacity)
  4. Separate model pools and route workloads more carefully
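Swapping is often bursty, so by the time you open the dashboard the instant gauge may already be back at zero. A query like this shows which instances swapped at any point in the recent past:

max by (instance) (max_over_time(vllm_num_requests_swapped[30m]))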

Scenario 4: Abort rate spikes

What you see

The abort share of vllm_finished_request_total jumps, often alongside rising queue time or TTFT.

What it usually means

Clients or the gateway are timing out and cancelling requests while they are still queued or streaming too slowly; less often, the engine itself is erroring out mid-request.

What to do

Check queue time and TTFT first: most abort spikes are a symptom of the latency problems above rather than a separate failure. Then review gateway and client timeout settings and the engine logs. The per-instance queries below help localize the problem.
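Two queries that help localize it: per-instance abort rate, and per-instance queue time at the same percentile as the alert. If they rise together on the same instance, clients are giving up while waiting:

sum by (instance) (rate(vllm_finished_request_total{finish_reason="abort"}[5m]))

histogram_quantile(0.95, sum by (le, instance) (rate(vllm_request_queue_time_seconds_bucket[5m])))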

Final checklist

  1. Watch P95 TTFT and P99 TPOT: they define what users actually feel.
  2. Alert on queue time, swapped requests, KV cache usage, and abort ratio before users complain.
  3. Treat num_requests_swapped > 0 as an incident, not a curiosity.
  4. Track input and output token throughput so you know the system is doing real work, not just staying up.

If you want one mental model: queueing means you’re out of capacity; swapping means you’ve crossed the line.