this post was submitted on 25 Oct 2025
82 points (94.6% liked)

LocalLLaMA

3855 readers

Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open-source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, and get hyped about the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive, constructive way.

Rules:

Rule 1 - No harassment or personal character attacks on community members, i.e. no name-calling, no generalizing entire groups of people that make up our community, no baseless personal insults.

Rule 2 - No comparing artificial intelligence/machine learning models to cryptocurrency, i.e. no comparing the usefulness of models to that of NFTs, no claiming the resource usage required to train a model is anything close to that of maintaining a blockchain or mining crypto, no implying it's just a fad/bubble that will leave people with nothing of value when it bursts.

Rule 3 - No comparing artificial intelligence/machine learning to simple text prediction algorithms, i.e. statements such as "LLMs are basically just simple text predictors like your phone keyboard's autocorrect, and they're still using the same algorithms as <over 10 years ago>."

Rule 4 - No implying that models are devoid of purpose or potential for enriching people's lives.

founded 2 years ago
[–] afk_strats@lemmy.world 5 points 3 weeks ago (3 children)

I'm not sure if it's an issue on my end, but that's a static image. I figure you meant to post the part where they throw a brick into it.

Also, if this post was serious: how does a heavily quantized model compare to a less quantized one with fewer parameters? I haven't seen benchmarks other than perplexity, which isn't a good measure of capability.
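For reference, perplexity is just the exponential of the average negative log-likelihood a model assigns to held-out text. A minimal sketch of the computation (the log-probabilities here are made up for illustration):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp of the average negative log-likelihood over the tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Made-up natural-log probabilities for four tokens of held-out text.
logprobs = [-1.2, -0.3, -2.0, -0.7]
print(f"{perplexity(logprobs):.2f}")  # 2.86 -- lower = better statistical fit
```

Because it only scores next-token prediction on a fixed corpus, two models can have similar perplexity and still differ a lot in downstream capability.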

[–] Xylight@lemdro.id 5 points 3 weeks ago (2 children)

It's a WebP animation. Maybe your client doesn't display it correctly; I'll replace it with a GIF.

Regarding your other question, I tend to see better results with higher parameter counts at lower precision than with fewer parameters at higher precision. That's just based on "vibes", though; I haven't done any real testing. From what I've seen, Q4 is the lowest safe quantization; beyond that, performance really starts to drop off. Unfortunately, even at 1-bit quantization I can't run GLM 4.6 on my system.
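To put numbers on that last point, here's a back-of-the-envelope footprint estimate. The GLM 4.6 parameter count (~355B total) and the ~10% overhead factor are assumptions for illustration; 1.58, 4.5, and 8.5 bits per weight roughly correspond to ternary, Q4_0, and Q8_0 layouts:

```python
def gguf_size_gb(n_params: float, bits_per_weight: float,
                 overhead: float = 1.1) -> float:
    """Rough footprint: parameters times bits per weight, plus ~10%
    for scales, embeddings, and buffers. Ballpark arithmetic only."""
    return n_params * bits_per_weight / 8 / 1e9 * overhead

# ~355B total parameters is an assumption -- check the model card.
for bpw in (1.58, 4.5, 8.5):
    print(f"{bpw:>4} bpw -> {gguf_size_gb(355e9, bpw):6.1f} GB")
# Even near 1 bit per weight the weights alone run ~70+ GB,
# far beyond an 8 GB card.
```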

[–] hendrik@palaver.p3x.de 3 points 3 weeks ago* (last edited 3 weeks ago) (1 children)

What's higher precision for you? What I remember from the old measurements for ggml is that below Q3 rarely makes sense, and roughly at Q3 you'd think about switching to a smaller variant. On the other hand, everything above Q6 shows only marginal differences in perplexity, so Q6, Q8, and full precision are basically the same thing.
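For context, the GGUF quant names map to roughly these bits per weight (ballpark figures recalled from the block layouts; exact values vary by sub-variant), which makes the Q6-vs-Q8 tradeoff concrete for, say, a 70B model:

```python
# Approximate bits per weight for common GGUF quant types, derived from
# their block layouts (packed weights + per-block scales). Treat these
# as ballpark figures, not exact values.
BPW = {
    "Q2_K": 2.56, "Q3_K": 3.44, "Q4_K": 4.5,
    "Q5_K": 5.5,  "Q6_K": 6.56, "Q8_0": 8.5,
}

def size_gb(n_params: float, bpw: float) -> float:
    return n_params * bpw / 8 / 1e9

# For a 70B model: per the old ggml perplexity curves, the jump from
# Q6 to Q8 buys little quality but costs ~17 GB more memory.
for name, bpw in BPW.items():
    print(f"{name}: {size_gb(70e9, bpw):5.1f} GB")
```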

[–] Xylight@lemdro.id 3 points 3 weeks ago* (last edited 3 weeks ago) (1 children)

As a memory-poor user (hence the 8 GB VRAM card), I consider Q8 and above to be higher precision, Q4-Q5 to be mid-to-low precision (what I typically use), and anything below that to be low precision.

[–] hendrik@palaver.p3x.de 4 points 3 weeks ago* (last edited 3 weeks ago)

Thanks, that sounds reasonable. Btw, you're not the only poor person around; I don't even own a graphics card... I'm not a gamer, so I never saw any reason to buy one before I took an interest in AI. I do inference on my CPU, which is connected to more than 8 GB of memory. It's just slow 😉 But I guess I'm fine with that. I don't rely on AI; it's just tinkering, and I'm patient. And a few times a year I'll rent a cloud GPU by the hour. Maybe one day I'll buy one myself.

[–] afk_strats@lemmy.world 2 points 3 weeks ago

That fixed it.

I am a fan of this quant cook. He often posts perplexity charts.

https://huggingface.co/ubergarm

All of his quants require ik_llama, which works best with Nvidia CUDA, but it can do a lot with RAM + VRAM or even hard drive + RAM. I don't know whether 8 GB is enough for everything.
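For anyone trying the RAM + VRAM split: the usual approach (in ik_llama as in mainline llama.cpp) is to offload only as many layers as fit on the GPU. A rough sketch of the arithmetic, with all numbers illustrative:

```python
def layers_that_fit(vram_gb: float, model_gb: float, n_layers: int,
                    reserve_gb: float = 1.5) -> int:
    """How many transformer layers fit on the GPU, assuming roughly
    equal-sized layers and reserving VRAM for KV cache and buffers.
    Illustrative arithmetic only -- not an ik_llama API."""
    per_layer_gb = model_gb / n_layers
    usable = max(vram_gb - reserve_gb, 0)
    return min(n_layers, int(usable / per_layer_gb))

# e.g. a 24 GB Q4 model with 48 layers on an 8 GB card:
n = layers_that_fit(vram_gb=8, model_gb=24, n_layers=48)
print(n)  # ~13 layers on GPU; the rest stay in system RAM
# which you'd then pass as something like `-ngl 13` on the command line.
```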

[–] hendrik@palaver.p3x.de 4 points 3 weeks ago* (last edited 3 weeks ago)

I think perplexity is still central to evaluating models. It's notoriously difficult to come up with other ways to measure these things.

[–] afansfw@lemmynsfw.com 2 points 3 weeks ago (1 children)

Unsloth did a test, and their dynamic quants were competitive even at 1-bit in the Aider Polyglot benchmark: https://docs.unsloth.ai/new/unsloth-dynamic-ggufs-on-aider-polyglot
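As I understand it, the trick behind those dynamic quants is that the headline "1-bit" is an average: quality-sensitive tensors stay at higher precision while the bulk goes very low. A toy illustration of that averaging (the layer split below is invented for the example, not Unsloth's actual recipe):

```python
# Illustrative only: a "dynamic" quant mixes precisions per tensor,
# so the headline bit-width is a weighted average.
layers = [
    ("embeddings / output head",   2e9, 8.0),   # kept at high precision
    ("attention tensors",         40e9, 4.0),   # mid precision
    ("MoE expert FFNs",          313e9, 1.58),  # the bulk, very low bits
]

total_bits = sum(n * bpw for _, n, bpw in layers)
total_params = sum(n for _, n, _ in layers)
print(f"average: {total_bits / total_params:.2f} bits/weight")  # ~1.89
print(f"size: {total_bits / 8 / 1e9:.0f} GB")                   # ~84 GB
```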

[–] afk_strats@lemmy.world 1 points 3 weeks ago