WWW.YOUINFO.SITE
标签聚合 207tps

/tag/207tps

www.v2ex.com · 2026-05-03 06:06:17+08:00 · tech

https://github.com/Luce-Org/lucebox-hub DFlash DDtree Qwen3.5 & Qwen3.6 27B GGUF on RTX 3090 First GGUF port of DFlash speculative decoding. Qwen3.5-27B on a single RTX 3090, Q4_K_M target + BF16 draft, DDTree budget=22. Up to 207 tok/s in the demo (207.6 tok/s DFlash vs 38.0 tok/s AR, 5.46×) 129.5 tok/s mean on the HumanEval 10-prompt bench 3.43× faster than autoregressive (+15% over chain speculative decoding) 2.8× faster than SGLang AWQ on the same hardware Up to 256K context in 24 GB via TurboQuant TQ3_0 KV cache (128K Q4_0 bench: 134.78 tok/s at ctx=131072) PFlash speculative prefill on RTX 3090 In-process speculative prefill, C++/CUDA only. A drafter (Qwen3-0.6B BF16) loaded directly into the dflash daemon scores per-token importance over a long prompt; the heavy target (Qwen3.6-27B Q4_K_M) only prefills the spans that matter. Both models share the same ggml allocator on a single RTX 3090. No Python, no Triton, no PyTorch at runtime — just the dflash binary and four custom CUDA kernels (mean_K → score → select → sparse_fwd) plus BSA (mit-han-lab/Block-Sparse-Attention, FA-2 derived, sm_80+) for the long-context drafter forward. ~10.4× TTFT on 128K context: 24.8 s dflash daemon vs ~257 s llama.cpp (FA on, Q4_0 KV). 10.0× TTFT on 64K context: 13.5 s dflash vs 134.95 s llama.cpp. NIAH single-needle retrieved at every measured context (32K → 128K), keep_ratio=0.05, DFLASH_FP_ALPHA=0.85.

www.v2ex.com · 2026-05-03 06:06:17+08:00 · tech

https://github.com/Luce-Org/lucebox-hub DFlash DDtree Qwen3.5 & Qwen3.6 27B GGUF on RTX 3090 First GGUF port of DFlash speculative decoding. Qwen3.5-27B on a single RTX 3090, Q4_K_M target + BF16 draft, DDTree budget=22. Up to 207 tok/s in the demo (207.6 tok/s DFlash vs 38.0 tok/s AR, 5.46×) 129.5 tok/s mean on the HumanEval 10-prompt bench 3.43× faster than autoregressive (+15% over chain speculative decoding) 2.8× faster than SGLang AWQ on the same hardware Up to 256K context in 24 GB via TurboQuant TQ3_0 KV cache (128K Q4_0 bench: 134.78 tok/s at ctx=131072) PFlash speculative prefill on RTX 3090 In-process speculative prefill, C++/CUDA only. A drafter (Qwen3-0.6B BF16) loaded directly into the dflash daemon scores per-token importance over a long prompt; the heavy target (Qwen3.6-27B Q4_K_M) only prefills the spans that matter. Both models share the same ggml allocator on a single RTX 3090. No Python, no Triton, no PyTorch at runtime — just the dflash binary and four custom CUDA kernels (mean_K → score → select → sparse_fwd) plus BSA (mit-han-lab/Block-Sparse-Attention, FA-2 derived, sm_80+) for the long-context drafter forward. ~10.4× TTFT on 128K context: 24.8 s dflash daemon vs ~257 s llama.cpp (FA on, Q4_0 KV). 10.0× TTFT on 64K context: 13.5 s dflash vs 134.95 s llama.cpp. NIAH single-needle retrieved at every measured context (32K → 128K), keep_ratio=0.05, DFLASH_FP_ALPHA=0.85.

www.v2ex.com · 2026-05-03 05:06:17+08:00 · tech

https://github.com/Luce-Org/lucebox-hub DFlash DDtree Qwen3.5 & Qwen3.6 27B GGUF on RTX 3090 First GGUF port of DFlash speculative decoding. Qwen3.5-27B on a single RTX 3090, Q4_K_M target + BF16 draft, DDTree budget=22. Up to 207 tok/s in the demo (207.6 tok/s DFlash vs 38.0 tok/s AR, 5.46×) 129.5 tok/s mean on the HumanEval 10-prompt bench 3.43× faster than autoregressive (+15% over chain speculative decoding) 2.8× faster than SGLang AWQ on the same hardware Up to 256K context in 24 GB via TurboQuant TQ3_0 KV cache (128K Q4_0 bench: 134.78 tok/s at ctx=131072) PFlash speculative prefill on RTX 3090 In-process speculative prefill, C++/CUDA only. A drafter (Qwen3-0.6B BF16) loaded directly into the dflash daemon scores per-token importance over a long prompt; the heavy target (Qwen3.6-27B Q4_K_M) only prefills the spans that matter. Both models share the same ggml allocator on a single RTX 3090. No Python, no Triton, no PyTorch at runtime — just the dflash binary and four custom CUDA kernels (mean_K → score → select → sparse_fwd) plus BSA (mit-han-lab/Block-Sparse-Attention, FA-2 derived, sm_80+) for the long-context drafter forward. ~10.4× TTFT on 128K context: 24.8 s dflash daemon vs ~257 s llama.cpp (FA on, Q4_0 KV). 10.0× TTFT on 64K context: 13.5 s dflash vs 134.95 s llama.cpp. NIAH single-needle retrieved at every measured context (32K → 128K), keep_ratio=0.05, DFLASH_FP_ALPHA=0.85.

www.v2ex.com · 2026-05-03 04:06:17+08:00 · tech

https://github.com/Luce-Org/lucebox-hub DFlash DDtree Qwen3.5 & Qwen3.6 27B GGUF on RTX 3090 First GGUF port of DFlash speculative decoding. Qwen3.5-27B on a single RTX 3090, Q4_K_M target + BF16 draft, DDTree budget=22. Up to 207 tok/s in the demo (207.6 tok/s DFlash vs 38.0 tok/s AR, 5.46×) 129.5 tok/s mean on the HumanEval 10-prompt bench 3.43× faster than autoregressive (+15% over chain speculative decoding) 2.8× faster than SGLang AWQ on the same hardware Up to 256K context in 24 GB via TurboQuant TQ3_0 KV cache (128K Q4_0 bench: 134.78 tok/s at ctx=131072) PFlash speculative prefill on RTX 3090 In-process speculative prefill, C++/CUDA only. A drafter (Qwen3-0.6B BF16) loaded directly into the dflash daemon scores per-token importance over a long prompt; the heavy target (Qwen3.6-27B Q4_K_M) only prefills the spans that matter. Both models share the same ggml allocator on a single RTX 3090. No Python, no Triton, no PyTorch at runtime — just the dflash binary and four custom CUDA kernels (mean_K → score → select → sparse_fwd) plus BSA (mit-han-lab/Block-Sparse-Attention, FA-2 derived, sm_80+) for the long-context drafter forward. ~10.4× TTFT on 128K context: 24.8 s dflash daemon vs ~257 s llama.cpp (FA on, Q4_0 KV). 10.0× TTFT on 64K context: 13.5 s dflash vs 134.95 s llama.cpp. NIAH single-needle retrieved at every measured context (32K → 128K), keep_ratio=0.05, DFLASH_FP_ALPHA=0.85.

v2ex.com · 2026-05-03 03:06:17+08:00 · tech

https://github.com/Luce-Org/lucebox-hub DFlash DDtree Qwen3.5 & Qwen3.6 27B GGUF on RTX 3090 First GGUF port of DFlash speculative decoding. Qwen3.5-27B on a single RTX 3090, Q4_K_M target + BF16 draft, DDTree budget=22. Up to 207 tok/s in the demo (207.6 tok/s DFlash vs 38.0 tok/s AR, 5.46×) 129.5 tok/s mean on the HumanEval 10-prompt bench 3.43× faster than autoregressive (+15% over chain speculative decoding) 2.8× faster than SGLang AWQ on the same hardware Up to 256K context in 24 GB via TurboQuant TQ3_0 KV cache (128K Q4_0 bench: 134.78 tok/s at ctx=131072) PFlash speculative prefill on RTX 3090 In-process speculative prefill, C++/CUDA only. A drafter (Qwen3-0.6B BF16) loaded directly into the dflash daemon scores per-token importance over a long prompt; the heavy target (Qwen3.6-27B Q4_K_M) only prefills the spans that matter. Both models share the same ggml allocator on a single RTX 3090. No Python, no Triton, no PyTorch at runtime — just the dflash binary and four custom CUDA kernels (mean_K → score → select → sparse_fwd) plus BSA (mit-han-lab/Block-Sparse-Attention, FA-2 derived, sm_80+) for the long-context drafter forward. ~10.4× TTFT on 128K context: 24.8 s dflash daemon vs ~257 s llama.cpp (FA on, Q4_0 KV). 10.0× TTFT on 64K context: 13.5 s dflash vs 134.95 s llama.cpp. NIAH single-needle retrieved at every measured context (32K → 128K), keep_ratio=0.05, DFLASH_FP_ALPHA=0.85.

www.v2ex.com · 2026-05-03 02:34:07+08:00 · tech

https://github.com/Luce-Org/lucebox-hub DFlash DDtree Qwen3.5 & Qwen3.6 27B GGUF on RTX 3090 First GGUF port of DFlash speculative decoding. Qwen3.5-27B on a single RTX 3090, Q4_K_M target + BF16 draft, DDTree budget=22. Up to 207 tok/s in the demo (207.6 tok/s DFlash vs 38.0 tok/s AR, 5.46×) 129.5 tok/s mean on the HumanEval 10-prompt bench 3.43× faster than autoregressive (+15% over chain speculative decoding) 2.8× faster than SGLang AWQ on the same hardware Up to 256K context in 24 GB via TurboQuant TQ3_0 KV cache (128K Q4_0 bench: 134.78 tok/s at ctx=131072) PFlash speculative prefill on RTX 3090 In-process speculative prefill, C++/CUDA only. A drafter (Qwen3-0.6B BF16) loaded directly into the dflash daemon scores per-token importance over a long prompt; the heavy target (Qwen3.6-27B Q4_K_M) only prefills the spans that matter. Both models share the same ggml allocator on a single RTX 3090. No Python, no Triton, no PyTorch at runtime — just the dflash binary and four custom CUDA kernels (mean_K → score → select → sparse_fwd) plus BSA (mit-han-lab/Block-Sparse-Attention, FA-2 derived, sm_80+) for the long-context drafter forward. ~10.4× TTFT on 128K context: 24.8 s dflash daemon vs ~257 s llama.cpp (FA on, Q4_0 KV). 10.0× TTFT on 64K context: 13.5 s dflash vs 134.95 s llama.cpp. NIAH single-needle retrieved at every measured context (32K → 128K), keep_ratio=0.05, DFLASH_FP_ALPHA=0.85.

v2ex.com · 2026-05-03 00:47:58+08:00 · tech

https://github.com/Luce-Org/lucebox-hub DFlash DDtree Qwen3.5 & Qwen3.6 27B GGUF on RTX 3090 First GGUF port of DFlash speculative decoding. Qwen3.5-27B on a single RTX 3090, Q4_K_M target + BF16 draft, DDTree budget=22. Up to 207 tok/s in the demo (207.6 tok/s DFlash vs 38.0 tok/s AR, 5.46×) 129.5 tok/s mean on the HumanEval 10-prompt bench 3.43× faster than autoregressive (+15% over chain speculative decoding) 2.8× faster than SGLang AWQ on the same hardware Up to 256K context in 24 GB via TurboQuant TQ3_0 KV cache (128K Q4_0 bench: 134.78 tok/s at ctx=131072) PFlash speculative prefill on RTX 3090 In-process speculative prefill, C++/CUDA only. A drafter (Qwen3-0.6B BF16) loaded directly into the dflash daemon scores per-token importance over a long prompt; the heavy target (Qwen3.6-27B Q4_K_M) only prefills the spans that matter. Both models share the same ggml allocator on a single RTX 3090. No Python, no Triton, no PyTorch at runtime — just the dflash binary and four custom CUDA kernels (mean_K → score → select → sparse_fwd) plus BSA (mit-han-lab/Block-Sparse-Attention, FA-2 derived, sm_80+) for the long-context drafter forward. ~10.4× TTFT on 128K context: 24.8 s dflash daemon vs ~257 s llama.cpp (FA on, Q4_0 KV). 10.0× TTFT on 64K context: 13.5 s dflash vs 134.95 s llama.cpp. NIAH single-needle retrieved at every measured context (32K → 128K), keep_ratio=0.05, DFLASH_FP_ALPHA=0.85.

v2ex.com · 2026-05-02 22:14:54+08:00 · tech

https://github.com/Luce-Org/lucebox-hub DFlash DDtree Qwen3.5 & Qwen3.6 27B GGUF on RTX 3090 First GGUF port of DFlash speculative decoding. Qwen3.5-27B on a single RTX 3090, Q4_K_M target + BF16 draft, DDTree budget=22. Up to 207 tok/s in the demo (207.6 tok/s DFlash vs 38.0 tok/s AR, 5.46×) 129.5 tok/s mean on the HumanEval 10-prompt bench 3.43× faster than autoregressive (+15% over chain speculative decoding) 2.8× faster than SGLang AWQ on the same hardware Up to 256K context in 24 GB via TurboQuant TQ3_0 KV cache (128K Q4_0 bench: 134.78 tok/s at ctx=131072) PFlash speculative prefill on RTX 3090 In-process speculative prefill, C++/CUDA only. A drafter (Qwen3-0.6B BF16) loaded directly into the dflash daemon scores per-token importance over a long prompt; the heavy target (Qwen3.6-27B Q4_K_M) only prefills the spans that matter. Both models share the same ggml allocator on a single RTX 3090. No Python, no Triton, no PyTorch at runtime — just the dflash binary and four custom CUDA kernels (mean_K → score → select → sparse_fwd) plus BSA (mit-han-lab/Block-Sparse-Attention, FA-2 derived, sm_80+) for the long-context drafter forward. ~10.4× TTFT on 128K context: 24.8 s dflash daemon vs ~257 s llama.cpp (FA on, Q4_0 KV). 10.0× TTFT on 64K context: 13.5 s dflash vs 134.95 s llama.cpp. NIAH single-needle retrieved at every measured context (32K → 128K), keep_ratio=0.05, DFLASH_FP_ALPHA=0.85.

www.v2ex.com · 2026-05-02 21:05:12+08:00 · tech

https://github.com/Luce-Org/lucebox-hub DFlash DDtree Qwen3.5 & Qwen3.6 27B GGUF on RTX 3090 First GGUF port of DFlash speculative decoding. Qwen3.5-27B on a single RTX 3090, Q4_K_M target + BF16 draft, DDTree budget=22. Up to 207 tok/s in the demo (207.6 tok/s DFlash vs 38.0 tok/s AR, 5.46×) 129.5 tok/s mean on the HumanEval 10-prompt bench 3.43× faster than autoregressive (+15% over chain speculative decoding) 2.8× faster than SGLang AWQ on the same hardware Up to 256K context in 24 GB via TurboQuant TQ3_0 KV cache (128K Q4_0 bench: 134.78 tok/s at ctx=131072) PFlash speculative prefill on RTX 3090 In-process speculative prefill, C++/CUDA only. A drafter (Qwen3-0.6B BF16) loaded directly into the dflash daemon scores per-token importance over a long prompt; the heavy target (Qwen3.6-27B Q4_K_M) only prefills the spans that matter. Both models share the same ggml allocator on a single RTX 3090. No Python, no Triton, no PyTorch at runtime — just the dflash binary and four custom CUDA kernels (mean_K → score → select → sparse_fwd) plus BSA (mit-han-lab/Block-Sparse-Attention, FA-2 derived, sm_80+) for the long-context drafter forward. ~10.4× TTFT on 128K context: 24.8 s dflash daemon vs ~257 s llama.cpp (FA on, Q4_0 KV). 10.0× TTFT on 64K context: 13.5 s dflash vs 134.95 s llama.cpp. NIAH single-needle retrieved at every measured context (32K → 128K), keep_ratio=0.05, DFLASH_FP_ALPHA=0.85.

v2ex.com · 2026-05-02 21:05:12+08:00 · tech

https://github.com/Luce-Org/lucebox-hub DFlash DDtree Qwen3.5 & Qwen3.6 27B GGUF on RTX 3090 First GGUF port of DFlash speculative decoding. Qwen3.5-27B on a single RTX 3090, Q4_K_M target + BF16 draft, DDTree budget=22. Up to 207 tok/s in the demo (207.6 tok/s DFlash vs 38.0 tok/s AR, 5.46×) 129.5 tok/s mean on the HumanEval 10-prompt bench 3.43× faster than autoregressive (+15% over chain speculative decoding) 2.8× faster than SGLang AWQ on the same hardware Up to 256K context in 24 GB via TurboQuant TQ3_0 KV cache (128K Q4_0 bench: 134.78 tok/s at ctx=131072) PFlash speculative prefill on RTX 3090 In-process speculative prefill, C++/CUDA only. A drafter (Qwen3-0.6B BF16) loaded directly into the dflash daemon scores per-token importance over a long prompt; the heavy target (Qwen3.6-27B Q4_K_M) only prefills the spans that matter. Both models share the same ggml allocator on a single RTX 3090. No Python, no Triton, no PyTorch at runtime — just the dflash binary and four custom CUDA kernels (mean_K → score → select → sparse_fwd) plus BSA (mit-han-lab/Block-Sparse-Attention, FA-2 derived, sm_80+) for the long-context drafter forward. ~10.4× TTFT on 128K context: 24.8 s dflash daemon vs ~257 s llama.cpp (FA on, Q4_0 KV). 10.0× TTFT on 64K context: 13.5 s dflash vs 134.95 s llama.cpp. NIAH single-needle retrieved at every measured context (32K → 128K), keep_ratio=0.05, DFLASH_FP_ALPHA=0.85.