Llama.cpp MTP Support merged - up to 2.5x speed increase (github.com)

submitted 1 week ago* (last edited 1 week ago) by TheCornCollector@piefed.zip to c/localllama@sh.itjust.works


Qwen3.6-27B-MTP-UD-Q5_K_XL on my 7900XTX goes from 32 t/s to 50-72 t/s, depending on how predictable the task is. That is roughly a 1.5x increase on creative tasks and up to a 2.2x increase on math.
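
For anyone who wants to check the ratios against their own numbers, here is a tiny sketch of the arithmetic. The throughput values are the ones reported above; the function name and labels are just illustrative.

```python
def speedup(baseline_tps: float, new_tps: float) -> float:
    """Ratio of new throughput to baseline throughput (both in tokens/s)."""
    return new_tps / baseline_tps


BASELINE_TPS = 32.0  # reported t/s without MTP

# Low and high ends of the reported 50-72 t/s range with MTP enabled.
for label, tps in [("creative (low end)", 50.0), ("math (high end)", 72.0)]:
    print(f"{label}: {speedup(BASELINE_TPS, tps):.2f}x")
```

Swap in your own baseline and MTP measurements to compare setups.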

MTP does not change output quality; the only cost is a few hundred MB of extra VRAM. You will need to download a GGUF model with MTP support to use it.
My parameters:

; Context memory usage  
ctx-size = 65536  
ctk = q8_0  
ctv = q8_0  

; Prompt processing speed  
batch-size = 1024  
ubatch-size = 1024  

; Speculative decoding  
np = 1  
spec-type = draft-mtp  
spec-draft-n-max = 3

Edit: I did some more testing using Unsloth's parameters, and with spec-draft-n-max = 6 I can get up to 82 t/s, a 2.56x increase, on the same math prompt. But this comes at the cost of the creative writing task, which now falls below 40 t/s.
It seems like this should be tuned per prompt, similar to the sampling parameters.
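
A rough way to see why a longer draft helps predictable prompts and hurts unpredictable ones is the textbook speculative decoding estimate. The sketch below is a simplified model, not how llama.cpp actually schedules or measures anything: it assumes each drafted token is accepted independently with probability `p`, and that drafting one token costs a fixed fraction `draft_cost` of a full forward pass. Both `p` and `draft_cost` values are made up for illustration, not measured.

```python
def expected_tokens_per_step(p: float, n: int) -> float:
    """Expected tokens produced per verification pass.

    Assumes n drafted tokens, each accepted independently with probability p,
    plus the one token the main model always contributes.
    """
    if p >= 1.0:
        return float(n + 1)
    return (1.0 - p ** (n + 1)) / (1.0 - p)


def modeled_speedup(p: float, n: int, draft_cost: float) -> float:
    """Expected tokens per step divided by the relative cost of that step.

    draft_cost is the assumed cost of drafting one token, as a fraction of
    one full forward pass of the main model.
    """
    return expected_tokens_per_step(p, n) / (1.0 + n * draft_cost)


DRAFT_COST = 0.1  # assumption for illustration only

# Hypothetical acceptance rates: high for predictable text, low for creative text.
for label, p in [("predictable (p=0.9)", 0.9), ("creative (p=0.5)", 0.5)]:
    for n in (3, 6):
        print(f"{label}, n-max={n}: {modeled_speedup(p, n, DRAFT_COST):.2f}x")
```

Under these assumptions, going from 3 to 6 drafted tokens raises the modeled speedup when acceptance is high and lowers it when acceptance is low, which matches the direction of the results above. The absolute numbers should not be read as predictions.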


TheCornCollector@piefed.zip 11 points 1 week ago

https://unsloth.ai/docs/models/qwen3.6#mtp-guide
Unsloth made a guide and has graphs with comparisons