Well, hm, I played around with it. I tried to improve the prompts, but somehow it doesn't work as I expected. With the parameters -b 256 -ub 256 I was somewhere around ~25 tok/s. Then I found out that the llama-cpp utility llama-bench.exe can be used to find the best setting for this. In my case I used the following command (ub defaults to 512):
> llama-bench.exe --fit-ctx 200000 -ub 4,8,16,32,64,128,256,512 --model "Qwen3.6-35B-A3B-MTP-GGUF\Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf"
| model | size | params | backend | ngl | n_ubatch | fitc | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | ----------: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0 | 36.40 GiB | 35.51 B | CUDA | 99 | 4 | 200000 | pp512 | 69.31 ± 1.25 |
| qwen35moe 35B.A3B Q8_0 | 36.40 GiB | 35.51 B | CUDA | 99 | 4 | 200000 | tg128 | 35.08 ± 1.73 |
| qwen35moe 35B.A3B Q8_0 | 36.40 GiB | 35.51 B | CUDA | 99 | 8 | 200000 | pp512 | 92.67 ± 0.61 |
| qwen35moe 35B.A3B Q8_0 | 36.40 GiB | 35.51 B | CUDA | 99 | 8 | 200000 | tg128 | 34.96 ± 1.72 |
| qwen35moe 35B.A3B Q8_0 | 36.40 GiB | 35.51 B | CUDA | 99 | 16 | 200000 | pp512 | 123.96 ± 2.38 |
| qwen35moe 35B.A3B Q8_0 | 36.40 GiB | 35.51 B | CUDA | 99 | 16 | 200000 | tg128 | 35.16 ± 1.47 |
| qwen35moe 35B.A3B Q8_0 | 36.40 GiB | 35.51 B | CUDA | 99 | 32 | 200000 | pp512 | 76.55 ± 1.99 |
| qwen35moe 35B.A3B Q8_0 | 36.40 GiB | 35.51 B | CUDA | 99 | 32 | 200000 | tg128 | 34.45 ± 1.40 |
| qwen35moe 35B.A3B Q8_0 | 36.40 GiB | 35.51 B | CUDA | 99 | 64 | 200000 | pp512 | 114.68 ± 5.95 |
| qwen35moe 35B.A3B Q8_0 | 36.40 GiB | 35.51 B | CUDA | 99 | 64 | 200000 | tg128 | 34.52 ± 1.63 |
| qwen35moe 35B.A3B Q8_0 | 36.40 GiB | 35.51 B | CUDA | 99 | 128 | 200000 | pp512 | 173.19 ± 7.62 |
| qwen35moe 35B.A3B Q8_0 | 36.40 GiB | 35.51 B | CUDA | 99 | 128 | 200000 | tg128 | 34.61 ± 1.50 |
| qwen35moe 35B.A3B Q8_0 | 36.40 GiB | 35.51 B | CUDA | 99 | 256 | 200000 | pp512 | 276.33 ± 28.79 |
| qwen35moe 35B.A3B Q8_0 | 36.40 GiB | 35.51 B | CUDA | 99 | 256 | 200000 | tg128 | 34.82 ± 1.82 |
| qwen35moe 35B.A3B Q8_0 | 36.40 GiB | 35.51 B | CUDA | 99 | 512 | 200000 | pp512 | 436.13 ± 41.87 |
| qwen35moe 35B.A3B Q8_0 | 36.40 GiB | 35.51 B | CUDA | 99 | 512 | 200000 | tg128 | 34.25 ± 1.68 |
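For anyone who wants to read a table like this without eyeballing it, here is a minimal Python sketch that picks the fastest n_ubatch for each test. It assumes the llama-bench output was pasted as text in the markdown layout shown above; the helper names `parse_bench_table` and `best_per_test` are made up for this example and are not part of llama.cpp. The sample contains only three of the n_ubatch values from the run.

```python
def parse_bench_table(text):
    """Parse llama-bench markdown table rows into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the command line and any other non-table text
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # first table line holds the column names
            continue
        if all(set(c) <= set("-:") for c in cells):
            continue  # separator row such as | ---: | ---: |
        row = dict(zip(header, cells))
        # The t/s cell looks like "69.31 ± 1.25"; keep only the mean value.
        row["t/s"] = float(row["t/s"].split()[0])
        row["n_ubatch"] = int(row["n_ubatch"])
        rows.append(row)
    return rows


def best_per_test(rows):
    """Return {test: (n_ubatch, mean t/s)} for the fastest row of each test."""
    best = {}
    for r in rows:
        cur = best.get(r["test"])
        if cur is None or r["t/s"] > cur["t/s"]:
            best[r["test"]] = r
    return {t: (r["n_ubatch"], r["t/s"]) for t, r in best.items()}


# Three n_ubatch values copied from the table above.
sample = """
| model | size | params | backend | ngl | n_ubatch | fitc | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: | ---: | ---: |
| qwen35moe 35B.A3B Q8_0 | 36.40 GiB | 35.51 B | CUDA | 99 | 16 | 200000 | pp512 | 123.96 ± 2.38 |
| qwen35moe 35B.A3B Q8_0 | 36.40 GiB | 35.51 B | CUDA | 99 | 16 | 200000 | tg128 | 35.16 ± 1.47 |
| qwen35moe 35B.A3B Q8_0 | 36.40 GiB | 35.51 B | CUDA | 99 | 256 | 200000 | pp512 | 276.33 ± 28.79 |
| qwen35moe 35B.A3B Q8_0 | 36.40 GiB | 35.51 B | CUDA | 99 | 256 | 200000 | tg128 | 34.82 ± 1.82 |
| qwen35moe 35B.A3B Q8_0 | 36.40 GiB | 35.51 B | CUDA | 99 | 512 | 200000 | pp512 | 436.13 ± 41.87 |
| qwen35moe 35B.A3B Q8_0 | 36.40 GiB | 35.51 B | CUDA | 99 | 512 | 200000 | tg128 | 34.25 ± 1.68 |
"""

if __name__ == "__main__":
    for test, (ub, tps) in best_per_test(parse_bench_table(sample)).items():
        print(f"{test}: best n_ubatch={ub} at {tps} t/s")
```

The script compares mean values only and ignores the ± spread, so for tg128, where all rows sit within each other's error margins, the "best" n_ubatch it reports is not a meaningful winner.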
What did I learn? Basically nothing. Even if it did run at those 28 tok/s with size 256, I'm right back where I was before the whole MTP thing. And everything else is just worse and worse.
Since it is supposed to work better on dense models, I tried it with Qwen3.6-27B-MTP-GGUF (Q8). There, with practically no effort, generation went up from ~1.5 tok/s to ~3.0 tok/s, which is a twofold increase, but in practice it is unusable. So maybe it really isn't suitable for MoE. I'll still try playing with smaller quantizations, but so far nothing comes close to the usability of Qwen3.6 without MTP for agentic work.