Replace the autoregressive drafter with a lightweight block-diffusion model that produces entire blocks of draft tokens in a single parallel pass.
Classical speculative decoding uses a small autoregressive drafter. While faster than the target model, the drafter itself must generate tokens sequentially. This creates a memory-bound serial bottleneck that limits acceleration.
DFlash replaces the drafter with a lightweight block-diffusion model. It produces an entire block of draft tokens in a single parallel forward pass, conditioned on hidden-state features from the target model.
Autoregressive drafter: generates token by token; each step depends on the previous one.
Block-diffusion drafter (DFlash): generates the whole block at once; the target verifies it in a single pass.
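To make the contrast concrete, here is a toy, self-contained sketch of block drafting with greedy verification. Everything in it (the stand-in models, the function names, the deliberate draft mistake) is illustrative only, not DFlash's actual drafter or the serving engines' acceptance rule:

def target_next(ctx):
    # Toy stand-in for the target model: deterministic next token.
    return (sum(ctx) * 31 + len(ctx)) % 100

def draft_block(ctx, k):
    # Stand-in for the block-diffusion drafter: proposes k tokens at once
    # (conceptually one parallel pass). It imitates the target but injects
    # one deliberate mistake so the verification path below is exercised.
    block = []
    for _ in range(k):
        tok = target_next(ctx + block)
        if len(block) == 2:
            tok = (tok + 1) % 100  # deliberate draft error at position 3
        block.append(tok)
    return block

def verify(ctx, block):
    # Conceptually one target forward pass over the whole block: commit the
    # longest prefix the target agrees with, then the target's own next token.
    accepted = []
    for drafted in block:
        predicted = target_next(ctx + accepted)
        if drafted != predicted:
            accepted.append(predicted)  # target's correction; stop here
            return accepted
        accepted.append(drafted)
    accepted.append(target_next(ctx + accepted))  # all k accepted: bonus token
    return accepted

ctx = [1, 2, 3]
committed = verify(ctx, draft_block(ctx, k=8))
print(f"committed {len(committed)} tokens in one verification pass")

One drafter pass plus one target pass commits between 1 and k+1 tokens; an autoregressive drafter would need k sequential steps just to propose the same block.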
Benchmarked across multiple model families and tasks.
Installation and usage examples for vLLM and SGLang.
# Install DFlash from source with the vLLM extra
git clone https://github.com/z-lab/dflash.git
cd dflash
uv pip install -e ".[vllm]"

# Upgrade to a nightly vLLM build; --torch-backend=auto lets uv pick the
# torch wheel matching the local accelerator
uv pip install -U vllm --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly
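A quick smoke test, assuming the install exposes the top-level dflash package that the benchmark entry point below relies on:

python -c "import dflash"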
# Serve the target model with vLLM, drafting 15 tokens per step with DFlash
vllm serve Qwen/Qwen3.5-27B \
    --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}' \
    --attention-backend flash_attn \
    --max-num-batched-tokens 32768
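Speculative decoding is transparent to clients: once the server is up, any OpenAI-compatible request exercises it. A minimal sketch with requests (the port matches the benchmark command below; the prompt is arbitrary):

import requests

# Query vLLM's OpenAI-compatible chat endpoint; DFlash drafting is invisible
# to the client, which simply sees faster generation.
resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3.5-27B",
        "messages": [{"role": "user", "content": "What is 17 * 24?"}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])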
# Launch the same setup through SGLang with the matching DFlash draft model
python -m sglang.launch_server \
    --model-path Qwen/Qwen3.5-35B-A3B \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/Qwen3.5-35B-A3B-DFlash \
    --speculative-num-draft-tokens 16 \
    --tp-size 1 \
    --attention-backend trtllm_mha \
    --speculative-draft-attention-backend fa4 \
    --mem-fraction-static 0.75 \
    --mamba-scheduler-strategy extra_buffer \
    --trust-remote-code
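The same kind of smoke test works against SGLang; this sketch assumes SGLang's default port (30000) and its native /generate endpoint:

import requests

# SGLang's native generation API (default port 30000); the OpenAI-compatible
# /v1 routes work as well.
resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "What is 17 * 24?",
        "sampling_params": {"max_new_tokens": 64, "temperature": 0},
    },
)
print(resp.json()["text"])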
# Benchmark the served model on GSM8K (single-stream, 128 prompts)
python -m dflash.benchmark --backend vllm \
    --base-url http://127.0.0.1:8000 \
    --model Qwen/Qwen3.5-27B \
    --dataset gsm8k --num-prompts 128 --concurrency 1
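For a back-of-the-envelope read on benchmark numbers, the standard speculative-decoding estimate applies: if each drafted token is accepted independently with probability alpha and k tokens are drafted per step, a verification pass commits (1 - alpha^(k+1)) / (1 - alpha) tokens in expectation. This is a generic approximation, not DFlash's measured acceptance statistics:

def expected_tokens_per_pass(alpha: float, k: int) -> float:
    # Standard speculative-decoding estimate: each of the k drafted tokens is
    # accepted i.i.d. with probability alpha; one bonus token follows a fully
    # accepted block. An approximation, not a DFlash-specific guarantee.
    if alpha >= 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

# e.g. 80% per-token acceptance with 16 drafts -> ~4.9 tokens per target pass
print(expected_tokens_per_pass(0.8, 16))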
Available drafters as of 2026-04-22. Select the right drafter for your target model.