Qwen3-8B speculative decoding

This example sets up n-gram (prompt-lookup) speculative decoding for Qwen3-8B on a single NVIDIA L4. On copy-heavy output, such as editing a pasted code block where most output tokens are copied from the prompt, it raises decode throughput about 2.4x and cuts the time per output token by about 60%:

Metric	Without speculation	With n-gram speculation
Output token throughput (tok/s)	16.10	39.01
Mean TPOT (ms/token)	60.20	24.21

Measured on a single L4 (vllm/vllm-openai:v0.23.0, Qwen3-8B, 30 copy-heavy prompts at concurrency 1) against the same model served without --speculative-config. The speculative run accepted 65% of drafted tokens, for a mean acceptance length of 4.27 with 5 tokens drafted per step. Speculation proposes several tokens per decode step and verifies them in one forward pass, so when the output repeats the prompt, most proposed tokens are accepted at once, without changing what the model would have generated.
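
The headline ratios can be rechecked from the reported figures. The sketch below is only arithmetic on the numbers above. The last line rests on two assumptions that the source does not state: that every step drafts the full 5 tokens, and that the reported mean acceptance length counts the one token the target model emits on every step.

```python
# Figures copied from the table and the measurement note above.
baseline_tps, spec_tps = 16.10, 39.01      # output tokens per second
baseline_tpot, spec_tpot = 60.20, 24.21    # mean ms per output token
acceptance_rate = 0.65                     # accepted / drafted tokens
num_speculative_tokens = 5                 # tokens drafted per step

throughput_gain = spec_tps / baseline_tps
tpot_reduction = 1 - spec_tpot / baseline_tpot

# Assumption: full 5-token drafts on every step, plus the one token the
# target model always contributes (a correction or a bonus token).
tokens_per_step = 1 + acceptance_rate * num_speculative_tokens

print(f"throughput gain: {throughput_gain:.2f}x")
print(f"TPOT reduction:  {tpot_reduction:.0%}")
print(f"tokens per step: {tokens_per_step:.2f}")
```

Under those assumptions the estimate of 4.25 tokens per step sits close to the reported 4.27, which suggests the acceptance rate and the acceptance length describe the same behavior from two angles.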

This recipe was run end to end on GKE; the ModelDeployment below is the exact manifest from that run, which served a valid completion, and the numbers above are from the same run. Apply the platform side first, then the ML side.

Setup

Qwen3-8B is an 8.2B-parameter dense chat model, served here as one Standalone vLLM engine on a single NVIDIA L4, with no ModelCache and weights pulled straight from Hugging Face. The deployment shape is incidental; the speculative config is what matters. Modelplane supports n-gram (prompt-lookup) speculative decoding, which proposes tokens by matching the prompt and so needs no draft model or second set of weights.
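
To make the mechanism concrete, here is a minimal sketch of prompt-lookup drafting with greedy verification. It is a simplified illustration, not vLLM's implementation: propose_ngram, verify_greedy, toy_model, and the single-character "tokens" are all hypothetical names and data chosen for the example. The defaults mirror the prompt_lookup_min=2, prompt_lookup_max=4, and num_speculative_tokens=5 settings used in this recipe.

```python
from typing import Callable, List, Sequence

Token = str


def propose_ngram(context: Sequence[Token], min_n: int = 2, max_n: int = 4,
                  k: int = 5) -> List[Token]:
    """Draft up to k tokens that followed an earlier copy of the context's suffix."""
    for n in range(max_n, min_n - 1, -1):      # try the longest suffix first
        if len(context) <= n:
            continue
        suffix = list(context[-n:])
        # Scan earlier positions, most recent first, skipping the suffix itself.
        for start in range(len(context) - n - 1, -1, -1):
            if list(context[start:start + n]) == suffix:
                return list(context[start + n:start + n + k])
    return []                                   # no match: nothing to draft


def verify_greedy(context: Sequence[Token], draft: Sequence[Token],
                  next_token: Callable[[List[Token]], Token]) -> List[Token]:
    """Keep the drafted prefix the model agrees with, plus one model token.

    A real engine scores every drafted position in one forward pass; this
    sketch calls next_token once per position only to keep the logic visible.
    """
    accepted: List[Token] = []
    for tok in draft:
        if next_token(list(context) + accepted) != tok:
            break
        accepted.append(tok)
    # The model always contributes one token: a correction or a bonus token.
    accepted.append(next_token(list(context) + accepted))
    return accepted


# Toy data: the "model" copies the prompt but changes one token (m -> M).
prompt = list("abcdefghijklmnopqrstuvwxyz")
target = list("abcdefghijklMnopqrstuvwxyz")


def toy_model(context: List[Token]) -> Token:
    i = len(context) - len(prompt)
    return target[i] if i < len(target) else "<eos>"


out: List[Token] = []
steps = 0
while len(out) < len(target):
    draft = propose_ngram(prompt + out)
    out += verify_greedy(prompt + out, draft, toy_model)
    steps += 1

print(f"{len(target)} tokens in {steps} steps")
```

In this toy run the 26 output tokens arrive in 8 steps instead of 26, and the output is exactly what the toy model would have produced on its own. The steps that draft nothing are the ones right after novel text (the start of the output and the changed token), which is the same reason the recipe turns thinking off.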

Platform

The platform side is the single-L4 shape shared with the Qwen3-8B example: one InferenceClass and a single-node InferenceCluster for one NVIDIA L4. Follow its Platform section to create them, then apply the ML side below.

Deployment

# Qwen3-8B served on a single NVIDIA L4 by vLLM with n-gram (prompt-lookup)
# speculative decoding, validated end to end (the model layer is cloud-agnostic;
# the same manifest serves on EKS and GKE).
#
# n-gram speculation proposes the next tokens by matching a short suffix of what
# has been generated so far against earlier text in the prompt, then verifies the
# guess in one forward pass. It needs no draft model and no second set of weights,
# so it stays a single Standalone engine with no ModelCache. That is deliberate:
# Modelplane cannot yet stage a separate draft model on cache (modelplaneai/
# modelplane#281), so this is the speculative flavor that works today.
#
#   --speculative-config            method=ngram with num_speculative_tokens=5
#                                   proposes up to 5 tokens per step;
#                                   prompt_lookup_min/max=2..4 set the n-gram
#                                   suffix lengths matched against the prompt.
#                                   It pays off only when output repeats the
#                                   input - e.g. editing a pasted code block,
#                                   where most output tokens are copied verbatim.
#   --default-chat-template-kwargs  turns thinking off. Qwen3 thinks by default,
#                                   and a <think> block is novel text absent from
#                                   the prompt, so prompt-lookup cannot accelerate
#                                   it. Off, the output is mostly the copied code,
#                                   which is exactly what n-gram speeds up.
#   --max-model-len / --gpu-memory-utilization  L4 fit, not correctness. n-gram
#                                   adds only a small proposal buffer, no weights,
#                                   so the budget matches the plain Qwen3-8B recipe.
#
# No --port or --host: Modelplane's routing expects the engine on its default
# :8000 with a /health probe, and passes args through verbatim.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-8b-spec
  namespace: ml-team
spec:
  # One replica, matched to any compatible InferenceCluster by device capacity.
  replicas: 1
  engines:
  - name: qwen3-8b-spec
    members:
    # A single self-contained vLLM pod. The container named "engine" is the
    # inference server; its image and args pass through verbatim.
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 1
          selectors:
          # An 8B model needs most of an L4. >=20Gi selects the L4 (which reports
          # ~23Gi) without over-constraining. DRA evaluates this CEL against the
          # InferenceClass device, then against the GPU's ResourceSlice on bind.
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            args:
            - "--model=Qwen/Qwen3-8B"
            # The id clients pass as "model" in OpenAI requests.
            - "--served-model-name=qwen3-8b-spec"
            # Cap the context so the KV cache fits beside the weights on the L4.
            - "--max-model-len=16384"
            - "--gpu-memory-utilization=0.92"
            # Enable n-gram speculative decoding (no draft model, no cache).
            - "--speculative-config={\"method\": \"ngram\", \"num_speculative_tokens\": 5, \"prompt_lookup_max\": 4, \"prompt_lookup_min\": 2}"
            # Thinking off, so output copies the prompt and prompt-lookup pays off.
            - "--default-chat-template-kwargs={\"enable_thinking\": false}"

---
# Exposes the qwen3-8b-spec deployment's endpoints as a single OpenAI-compatible
# URL. Modelplane labels each composed ModelEndpoint with the deployment name, so
# this selector reaches every replica. Read the public address from
# status.address:
#   kubectl get ms qwen3-8b-spec -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: qwen3-8b-spec
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen3-8b-spec
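
The --max-model-len=16384 and --gpu-memory-utilization=0.92 values in the ModelDeployment above can be sanity-checked with rough arithmetic. This is a back-of-the-envelope sketch, not a measurement: the layer count, KV-head count, and head dimension are assumed from Qwen3-8B's published config and should be verified against its config.json, and the 23 GiB figure is the approximate capacity the manifest comments cite for the L4.

```python
GIB = 1024 ** 3

# Assumed Qwen3-8B architecture values; verify against the model's config.json.
params = 8.2e9
layers, kv_heads, head_dim = 36, 8, 128
bytes_per_value = 2                  # bf16 weights and KV cache (assumption)

gpu_total_gib = 23                   # "~23Gi" reported for the L4
gpu_memory_utilization = 0.92
max_model_len = 16384

budget_gib = gpu_total_gib * gpu_memory_utilization
weights_gib = params * bytes_per_value / GIB
# Keys and values are both cached, per layer and per KV head.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
kv_full_context_gib = kv_bytes_per_token * max_model_len / GIB
headroom_gib = budget_gib - weights_gib - kv_full_context_gib

print(f"budget:           {budget_gib:.2f} GiB")
print(f"weights:          {weights_gib:.2f} GiB")
print(f"KV, full context: {kv_full_context_gib:.2f} GiB")
print(f"headroom:         {headroom_gib:.2f} GiB")
```

Under these assumptions one full 16384-token context needs about 2.25 GiB of KV cache beside roughly 15 GiB of weights, which leaves a few GiB for activations and runtime overhead. n-gram speculation adds no weights, so the same estimate applies to the plain Qwen3-8B recipe.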

Speculation is active when the engine logs its SpeculativeConfig at startup (method='ngram'). The call below pastes a code block and asks for a small edit, which is the copy-heavy case that n-gram accelerates, because most output tokens can be matched straight from the prompt:

ADDR=$(kubectl get ms qwen3-8b-spec -n ml-team -o jsonpath='{.status.address}')
curl -s "$ADDR/v1/chat/completions" -H 'Content-Type: application/json' -d '{
  "model": "qwen3-8b-spec",
  "messages": [{"role":"user","content":"Return this Python function unchanged except rename the variable `total` to `subtotal`. Output only the code.\n\ndef cart(items):\n    total = 0\n    for item in items:\n        total += item.price\n    return total"}],
  "max_tokens": 200, "temperature": 0 }'

With the engine running, its logs report how many proposed tokens it accepts:

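# With a label selector, kubectl logs returns only the last 10 lines per pod
# unless --tail is set; --tail=-1 returns the full log.
kubectl logs -n ml-team -l modelplane.ai/deployment=qwen3-8b-spec --tail=-1 \
  | grep "SpecDecoding metrics"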