Got DeepSeek v4 Flash running locally with antirez's llama.cpp fork

https://github.com/antirez/llama.cpp-deepseek-v4-flash

llama.cpp

DeepSeek

Local

15 replies • 2026-05-03 22:08:30 +08:00

Livid

MOD

PRO

Apr 28

It passed the car wash test too, though judging from its reasoning trace, it recognized this as a typical test question.

940i3s34v4F1HW41

PRO

Apr 28

Tink

PRO

Apr 28

What hardware are you running it on?

Livid

MOD

PRO

Apr 28

@Tink M4 128G

ares001

PRO

Apr 28

How much VRAM does it actually use when running?

Hermitist

Apr 28

https://github.com/TheTom/turboquant_plus is worth a try too.

sentinelK

Apr 28

By comparison, in my experience Qwen3.6 35B A3B still runs a bit more smoothly as a local LLM, and on benchmark scores it trades wins with v4 flash.

Tathagatagarbha

Apr 28

Learning from the expert here.

unnyxi

Apr 29

@sentinelK If Qwen3.6 35B A3B and v4 flash trade wins, wouldn't Qwen 3.6 27B simply crush v4 flash...

elepant

Apr 29

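Getting it to run and having it be usable really are two different things. Running an LLM locally on an M4, the responses really are slow...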

sentinelK

Apr 29

@unnyxi At the default thinking length, yes, but 27B currently still can't beat flash at its max thinking length.

PeterTanJJ

Apr 29

Qwen3.6 35B A3B is very fast. Have you tried it?

This flash feels worse than minimax.

xuhengjs

Apr 29

Looking forward to the ultimate optimization for qwen3.6-36B-A3B.

PeterTanJJ

Apr 29

@unnyxi 27B's output speed isn't good enough.

jinsongzhaocn

4 days ago

@PeterTanJJ For 27b, the parameters and the post structure have a big impact on speed. I've seen it go from 11 seconds down to 1 second.