On the share of prefill in total inference time on low-compute GPUs

V2EX - Tech · 2026-06-09 16:06:23+08:00 · tech

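I see many people describe LLM inference speed as decode-dominated and bandwidth-bound, with prefill negligible. I want to point out that this only holds for high-compute GPUs such as the Pro 6000 or 5090, or for dense-reasoning workloads such as coding. Local agent scenarios are different. A few questions need to be settled first:

1. What is the ratio of uncached (cache-miss) input tokens to output tokens? In long-output dense reasoning, output often exceeds the uncached input, and the output-to-input ratio can even reach 2:1. Tool-heavy agent scenarios are the opposite: according to my Hermes agent's data for the last three days, new input / output = 4,882,795 / 377,561 ≈ 12.9:1. The main tasks were information retrieval, summarization, file processing, and smart-home control.

2. Which scenario do local agents mostly work in? I think the mainstream case is the 12.9:1 kind. Expecting local AI to run dense reasoning plus coding tasks is not very realistic.

3. What are the prefill and decode speeds of different hardware? Taking the recently most popular Qwen3.6 27B as an example (8-bit with MTP enabled, reference values): the 5090 does prefill at 3000 tps and decode at 70 tps; the M3 Ultra does prefill at 300 tps and decode at 30 tps.

4. Applying these speeds to the token counts in item 1, the 5090 spends 1628 s on prefill and 5394 s on decode (prefill share about 23%), so it is indeed decode/bandwidth-dominated. The M3 Ultra spends 16276 s on prefill and 12585 s on decode, a prefill share of 56%.

5. For 4-bit quantization, which is common in local deployments, prefill's share of the time is even higher, presumably because lower-bit weights mainly relieve memory bandwidth, which speeds up decode more than compute-bound prefill.

In summary, on low-compute, large-memory devices the time spent on prefill is quite significant, and in tool-call-heavy agents it can even dominate.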
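The arithmetic in the post can be reproduced with a short Python sketch. It is only a rough model, not a benchmark: it assumes throughput stays constant regardless of context length and that prefill and decode do not overlap. The names `Hardware`, `phase_times`, `prefill_share`, and `break_even_ratio` are hypothetical helpers introduced here for illustration; the token counts and tps figures are the reference values quoted in the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    """Hypothetical container for the reference throughputs quoted in the post."""
    name: str
    prefill_tps: float  # prefill throughput, tokens per second
    decode_tps: float   # decode throughput, tokens per second


def phase_times(hw: Hardware, input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (prefill_seconds, decode_seconds), assuming constant throughput."""
    return input_tokens / hw.prefill_tps, output_tokens / hw.decode_tps


def prefill_share(hw: Hardware, input_tokens: int, output_tokens: int) -> float:
    """Fraction of total time spent in prefill."""
    prefill_s, decode_s = phase_times(hw, input_tokens, output_tokens)
    return prefill_s / (prefill_s + decode_s)


def break_even_ratio(hw: Hardware) -> float:
    """Input:output token ratio at which prefill time equals decode time.

    Setting input/prefill_tps == output/decode_tps gives
    input/output == prefill_tps/decode_tps.
    """
    return hw.prefill_tps / hw.decode_tps


# Three-day totals reported in the post (uncached input vs. output tokens).
NEW_INPUT_TOKENS = 4_882_795
OUTPUT_TOKENS = 377_561

# Reference speeds from the post: Qwen3.6 27B, 8-bit, MTP enabled.
RTX_5090 = Hardware("5090", prefill_tps=3000, decode_tps=70)
M3_ULTRA = Hardware("M3 Ultra", prefill_tps=300, decode_tps=30)

for hw in (RTX_5090, M3_ULTRA):
    prefill_s, decode_s = phase_times(hw, NEW_INPUT_TOKENS, OUTPUT_TOKENS)
    share = prefill_share(hw, NEW_INPUT_TOKENS, OUTPUT_TOKENS)
    print(
        f"{hw.name}: prefill {prefill_s:.0f} s, decode {decode_s:.0f} s, "
        f"prefill share {share:.0%}, break-even {break_even_ratio(hw):.1f}:1"
    )
```

Under these assumptions the break-even input:output ratio is about 42.9:1 for the 5090 and 10:1 for the M3 Ultra, so the observed 12.9:1 workload sits above break-even only on the M3 Ultra, which matches the 56% prefill share in the post.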

[Q&A] Has alist been poisoned?

v2ex · 2026-06-09 11:33:36+08:00 · tech