Replace the autoregressive drafter with a lightweight block-diffusion model that produces entire blocks of draft tokens in a single parallel pass.
Classical speculative decoding uses a small autoregressive drafter. While faster than the target model, the drafter itself must generate tokens sequentially. This creates a memory-bound serial bottleneck that limits acceleration.
DFlash replaces the drafter with a lightweight block-diffusion model. It produces an entire block of draft tokens in a single parallel forward pass, conditioned on hidden-state features from the target model.
Autoregressive drafter: generates token by token; each step depends on the previous one.
Block-diffusion drafter (DFlash): generates the whole block at once; the target verifies it in a single pass.
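To make the contrast concrete, here is a toy, self-contained sketch of block drafting with greedy verification. Everything in it (the stand-in models, the function names, the deliberate draft mistake) is illustrative only, not DFlash's actual drafter or the serving engines' acceptance rule:

def target_next(ctx):
    # Toy stand-in for the target model: deterministic next token.
    return (sum(ctx) * 31 + len(ctx)) % 100

def draft_block(ctx, k):
    # Stand-in for the block-diffusion drafter: proposes k tokens at once
    # (conceptually one parallel pass). It imitates the target but injects
    # one deliberate mistake so the verification path below is exercised.
    block = []
    for _ in range(k):
        tok = target_next(ctx + block)
        if len(block) == 2:
            tok = (tok + 1) % 100  # deliberate draft error at position 3
        block.append(tok)
    return block

def verify(ctx, block):
    # Conceptually one target forward pass over the whole block: commit the
    # longest prefix the target agrees with, then the target's own next token.
    accepted = []
    for drafted in block:
        predicted = target_next(ctx + accepted)
        if drafted != predicted:
            accepted.append(predicted)  # target's correction; stop here
            return accepted
        accepted.append(drafted)
    accepted.append(target_next(ctx + accepted))  # all k accepted: bonus token
    return accepted

ctx = [1, 2, 3]
committed = verify(ctx, draft_block(ctx, k=8))
print(f"committed {len(committed)} tokens in one verification pass")

One drafter pass plus one target pass commits between 1 and k+1 tokens; an autoregressive drafter would need k sequential steps just to propose the same block.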
Benchmarked across multiple model families and tasks.
Installation and usage examples for vLLM and SGLang.
# Install DFlash from source with the vLLM extra
git clone https://github.com/z-lab/dflash.git
cd dflash
uv pip install -e ".[vllm]"

# Upgrade to a nightly vLLM build; --torch-backend=auto lets uv pick the
# torch wheel matching the local accelerator
uv pip install -U vllm --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly
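A quick smoke test, assuming the install exposes the top-level dflash package that the benchmark entry point below relies on:

python -c "import dflash"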
# Serve the target model with vLLM, drafting 15 tokens per step with DFlash
vllm serve Qwen/Qwen3.5-27B \
    --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}' \
    --attention-backend flash_attn \
    --max-num-batched-tokens 32768
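Speculative decoding is transparent to clients: once the server is up, any OpenAI-compatible request exercises it. A minimal sketch with requests (the port matches the benchmark command below; the prompt is arbitrary):

import requests

# Query vLLM's OpenAI-compatible chat endpoint; DFlash drafting is invisible
# to the client, which simply sees faster generation.
resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3.5-27B",
        "messages": [{"role": "user", "content": "What is 17 * 24?"}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])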
# Launch the same setup through SGLang with the matching DFlash draft model
python -m sglang.launch_server \
    --model-path Qwen/Qwen3.5-35B-A3B \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/Qwen3.5-35B-A3B-DFlash \
    --speculative-num-draft-tokens 16 \
    --tp-size 1 \
    --attention-backend trtllm_mha \
    --speculative-draft-attention-backend fa4 \
    --mem-fraction-static 0.75 \
    --mamba-scheduler-strategy extra_buffer \
    --trust-remote-code
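The same kind of smoke test works against SGLang; this sketch assumes SGLang's default port (30000) and its native /generate endpoint:

import requests

# SGLang's native generation API (default port 30000); the OpenAI-compatible
# /v1 routes work as well.
resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "What is 17 * 24?",
        "sampling_params": {"max_new_tokens": 64, "temperature": 0},
    },
)
print(resp.json()["text"])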
# Benchmark the served model on GSM8K (single-stream, 128 prompts)
python -m dflash.benchmark --backend vllm \
    --base-url http://127.0.0.1:8000 \
    --model Qwen/Qwen3.5-27B \
    --dataset gsm8k --num-prompts 128 --concurrency 1
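For a back-of-the-envelope read on benchmark numbers, the standard speculative-decoding estimate applies: if each drafted token is accepted independently with probability alpha and k tokens are drafted per step, a verification pass commits (1 - alpha^(k+1)) / (1 - alpha) tokens in expectation. This is a generic approximation, not DFlash's measured acceptance statistics:

def expected_tokens_per_pass(alpha: float, k: int) -> float:
    # Standard speculative-decoding estimate: each of the k drafted tokens is
    # accepted i.i.d. with probability alpha; one bonus token follows a fully
    # accepted block. An approximation, not a DFlash-specific guarantee.
    if alpha >= 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

# e.g. 80% per-token acceptance with 16 drafts -> ~4.9 tokens per target pass
print(expected_tokens_per_pass(0.8, 16))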
Available drafters as of 2026-04-22. Select the right drafter for your target model.