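M5 Max + MLX Local Stack

Core thesis

M5 Max + MLX models = a local LLM setup that is actually usable. Dan runs three benchmark suites across four model builds (two models, each in GGUF and MLX format) on an M4 Max and an M5 Max, and arrives at four core takeaways: MLX beats GGUF (about 2x decode speed), the M5 is 15-50% faster than the M4, 8K-16K of context is the dividing line, and agentic tasks run locally but only within about 8K of context.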

Test matrix: 4 models × 2 devices

Model	Format	Size	MoE	Publisher
Qwen 3.5	GGUF (NVFP4)	35B params	3-4B active	Alibaba
Qwen 3.5	MLX	35B (A4B)	3-4B active	Apple silicon
Gemma 4	GGUF	~26B	dense	Google
Gemma 4	MLX community	26B	dense	Apple silicon

Devices

🖥️ M4 Max (fully specced)

128GB RAM · baseline

🚀 M5 Max (fully specced)

128GB RAM · new super core · 35W vs 40W (M4)

3 benchmark suites

Bench 1: simple prompts → Bench 2: context scaling (200→32K) → Bench 3: Pi Coding Agent

Benchmark 1: Simple prompts (prefill / decode / wall / RAM)

5 simple questions (hash table / two-sentence answer / rate limiter, etc.)

Metric	Model / format	Value (M5)	vs M4
Decode	Qwen GGUF	60 tok/s	baseline
Decode	Qwen MLX	118 tok/s (~2×)	proportionally faster
Prefill	Gemma GGUF	550 tok/s	beats MLX
RAM peak	Gemma MLX	16 GB	very small
Overall speedup	M5 vs M4	15-50%	n/a

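🔑 Key findings

MLX dominates decode: Qwen GGUF 60 → Qwen MLX 118 tok/s (~2x)
Gemma GGUF wins on prefill: with small prompts, GGUF is actually faster
Wall time is the metric that matters: tokens/s does not equal total time
M5 averages about 20% faster than M4, with prefill close to 2x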
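
The point that tokens/s does not equal total time can be made concrete with a rough model: wall time is approximately prefill time plus decode time. The sketch below is a simplification, not Dan's measurement method. The decode rates (60 and 118 tok/s) come from the table above; the prompt and output sizes are hypothetical, and reusing the 550 tok/s prefill figure for both runs is an illustrative assumption, since that number was reported for a different model.

```python
def estimate_wall_time(prompt_tokens: int, output_tokens: int,
                       prefill_tps: float, decode_tps: float) -> float:
    """Approximate wall time in seconds as prefill time plus decode time."""
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("throughput must be positive")
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Hypothetical workload: a 200-token prompt and a 300-token answer.
# The shared 550 tok/s prefill rate is an assumption for illustration.
gguf_wall = estimate_wall_time(200, 300, prefill_tps=550, decode_tps=60)
mlx_wall = estimate_wall_time(200, 300, prefill_tps=550, decode_tps=118)

# Same decode rate, but a 16K-token prompt: prefill now dominates the total.
long_prompt_wall = estimate_wall_time(16_000, 300, prefill_tps=550, decode_tps=118)

print(f"GGUF short prompt: {gguf_wall:.1f}s")
print(f"MLX short prompt:  {mlx_wall:.1f}s")
print(f"MLX 16K prompt:    {long_prompt_wall:.1f}s")
```

With short prompts the decode rate decides the total, while with long prompts prefill takes over, which is why a single tokens/s number can mislead.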

Benchmark 2: Context scaling (Graph Walks)

6 prompt lengths: 200 / 500 / 1K / 8K / 16K / 32K tokens · Task: BFS traversal of a graph to find nodes

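Context length	Local model behavior	M4 vs M5	Verdict
< 8K	fast + correct	M5 slightly faster	local models are unbeatable
8K - 16K	~30s wait	clear gap	barely acceptable
32K	M4=400s / M5=280s	M5 takes 30% less time (~1.4× faster)	Gemma answers incorrectly
64K+	skipped (too slow)	n/a	unusable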
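
For readers unfamiliar with the task, the benchmark asks the model to do in its head what the following breadth-first search does explicitly. This is a minimal sketch of BFS on a toy graph, not the benchmark's actual prompt or grader; the graph, the function name `bfs_depths`, and the depth-2 query are all hypothetical.

```python
from collections import deque


def bfs_depths(graph: dict[str, list[str]], start: str) -> dict[str, int]:
    """Return the BFS depth (edge count from start) of every reachable node."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in depths:  # first visit is the shortest path
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)
    return depths


# Hypothetical toy graph given as adjacency lists.
toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}

depths = bfs_depths(toy_graph, "a")
# Nodes exactly two edges away from the start node.
frontier = sorted(node for node, depth in depths.items() if depth == 2)
print(depths, frontier)
```

As the edge list grows with prompt length, the model has to track more of this bookkeeping inside its context, which is plausibly why correctness degrades at 32K.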

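✅ Under 8K: real work gets done

Short summaries, parsing, classification, translation
Small code generation
Single-step agent tasks
Code completion

❌ 16K+: back to the cloud

Long-conversation agents
Multi-turn reasoning
Large code reviews
Long-document analysis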

"Performance is great. The bottleneck is context window length. At 16K, you wait 30s. That's unusable. Just like LLMs who say they have 1M context — it's really 500-800K. Claude 4.6 is the only true 1M so far."

Benchmark 3: Pi Coding Agent (hands-on agentic work)

6 tasks: hello world → fib → large package generation with 14-26 tool calls

Task #	M4 (s)	M5 (s)	Notes
1	9	7	hello world
2	10	10	fibonacci
3	20	14	n/a
4	40	25	n/a
5	60	50	n/a
6	160	100-180	large package generation
⚠️ M4 abandoned task 6: it stalled after 1 tool call and returned an "invalid result"
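
The per-task gap is easier to read as a ratio. The sketch below computes M4-to-M5 speedups from the table above, a small calculation added for illustration rather than part of Dan's tooling. Task 6 is left out because the M5 time is a range and the M4 run was abandoned; the helper names are hypothetical.

```python
# (task number, M4 seconds, M5 seconds) for tasks 1-5, copied from the table.
TIMINGS = [(1, 9, 7), (2, 10, 10), (3, 20, 14), (4, 40, 25), (5, 60, 50)]


def speedup(m4_seconds: float, m5_seconds: float) -> float:
    """How many times faster the M5 finished (1.0 means no difference)."""
    return m4_seconds / m5_seconds


def time_saved_pct(m4_seconds: float, m5_seconds: float) -> float:
    """Percentage of M4 wall time that the M5 saved."""
    return (m4_seconds - m5_seconds) / m4_seconds * 100


per_task = {task: round(speedup(m4, m5), 2) for task, m4, m5 in TIMINGS}
total_m4 = sum(m4 for _, m4, _ in TIMINGS)
total_m5 = sum(m5 for _, _, m5 in TIMINGS)

print(per_task)
print(f"total: {total_m4}s vs {total_m5}s, "
      f"{speedup(total_m4, total_m5):.2f}x, "
      f"{time_saved_pct(total_m4, total_m5):.0f}% less time")
```

Keeping "times faster" and "percent less time" as separate helpers avoids mixing the two, which are easy to confuse when quoting a single improvement figure.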

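Measured results

Accuracy: identical on both devices (tasks 1-5)
Token consumption: roughly the same on both devices
M4 failed the final task: resource exhaustion plus a broken reasoning chain
Gemma MLX package gen: scored 0.7 (imperfect, but it completed 5 tool calls)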

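✅ Works locally

Simple micro-agent tasks
Parsing / summarization / small coding jobs
Write file + execute + verify
Agents with ≤ 10 tool calls

❌ Still needs the cloud

Multi-turn, long-conversation agents
Complex reasoning chains
Tasks with > 26 tool calls
Agents with > 16K context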

4 core takeaways (must read)

① MLX always wins, by about 2×

"If you're on Mac, always find an MLX model. There's really no debate."

Exception: Gemma 4 GGUF has faster prefill than MLX (with small prompts).

② M5 vs M4 = 15-50% faster (+20% avg)

About 20% higher tokens/s on average; prefill nearly doubles (an advantage on large prompts)

The M5's fans are quieter: 35W vs 40W (M4)

"The M5 doesn't need the performance core; the super core handles it alone"

③ Context window = the real bottleneck (8K-16K limit)

Under 8K unbeatable · 8K-16K marginal · 16K+ back to the cloud

Cloud models overstate their context: advertised as 1M, really 500-800K; Claude 4.6 is the only true 1M

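④ Agentic workflows are feasible locally, but limited (within 8K)

Six tasks tested; within 8K of context, local models do real work

Good fit: micro-agents / parsing / summarization / small coding jobs

Poor fit: long conversations / long reasoning / complex multi-step work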

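Where it fits / where it doesn't

✅ Local models excel	❌ Still needs the cloud
Short summaries (< 8K)	Long-document analysis (> 16K)
Code completion / small snippets	Large code reviews
Single-step micro-agents	Multi-turn reasoning chains
File write + execute + verify	Agents with > 26 tool calls
Privacy-sensitive data (never leaves the device)	SOTA reasoning quality (Opus 4 / Sonnet 4)
Offline / in-flight development	Multimodal (high-quality image + audio)
High-QPS, low-cost serving (cheap models)	Product-grade accuracy requirements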

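Task tiering strategy (Dan's recommendation)

🐭 Small / Cheap

Local SLM, 8K context

Parsing, summarization, classification, file operations

🐴 Workhorse

Cloud Sonnet 4 / equivalent

Medium-complexity agents / coding

🦁 SOTA

Cloud Opus 4 / equivalent

Complex reasoning / critical decisions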
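
The three tiers above can be expressed as a small routing function. This is a sketch under stated assumptions, not Dan's harness: the `Task` fields, the `pick_tier` name, and the exact cutoffs are hypothetical, with the 8K context limit and the 10-tool-call limit taken from the benchmark conclusions in these notes.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    SMALL = "local SLM"
    WORKHORSE = "cloud workhorse"
    SOTA = "cloud SOTA"


@dataclass
class Task:
    context_tokens: int
    expected_tool_calls: int
    needs_deep_reasoning: bool = False


# Assumed cutoffs, based on the notes: local models hold up to about 8K
# tokens of context and agents with at most 10 tool calls.
LOCAL_CONTEXT_LIMIT = 8_000
LOCAL_TOOL_CALL_LIMIT = 10


def pick_tier(task: Task) -> Tier:
    """Route a task to the cheapest tier that should handle it."""
    if task.needs_deep_reasoning:
        return Tier.SOTA
    fits_locally = (task.context_tokens <= LOCAL_CONTEXT_LIMIT
                    and task.expected_tool_calls <= LOCAL_TOOL_CALL_LIMIT)
    return Tier.SMALL if fits_locally else Tier.WORKHORSE


print(pick_tier(Task(context_tokens=2_000, expected_tool_calls=3)))
print(pick_tier(Task(context_tokens=20_000, expected_tool_calls=3)))
print(pick_tier(Task(context_tokens=1_000, expected_tool_calls=1,
                     needs_deep_reasoning=True)))
```

Checking the expensive condition first keeps the default path cheap: anything that fits the local limits never reaches a cloud model.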

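Practical advice (Dan's field-tested conclusions)

Buy the M5 Max fully specced (128GB RAM): "no purpose in lower tier unless base"
Prefer MLX over GGUF (2x speedup)
Plug in the device: running models drains the battery very fast
35-50B parameters is the sweet spot: a balance of accuracy and speed
Control your agent harness: a major theme for 2026 (customizing the Pi coding agent)
Tier your tasks: bucket work across small / workhorse / SOTA models
Think in micro-agents: break complex tasks into steps for small local models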

"If you don't need a large model, don't use one. This especially matters for product engineering when you have hundreds, thousands, and hopefully hundreds of thousands of users hitting your service."

Dan's 2026 predictions

🔮 Sonnet / Opus 4.0-class models running locally by year-end

"By the end of the year we should be able to run a Sonnet or Opus 4.0 level model on your device."

Waiting on the M5 Ultra / M6 Mac Mini (the one with 500GB RAM)
Gemma 4 already supports image + audio (not yet benchmarked)
Qwen 3.5 supports text + image (the multimodal SLM trend)

"Model providers want you and I super, super hooked on their Kool-Aid. The future is agentic — control your harness to control your results."

Full timeline (83 subtitle segments)

Key moments from 0:33 to 38:58