this post was submitted on 20 Jan 2026
451 points (98.5% liked)
Fuck AI
"We did it, Patrick! We made a technological breakthrough!"
A place for all those who loathe AI to discuss things, post articles, and ridicule the AI hype. Proud supporter of working people. And proud booer of SXSW 2024.
AI, in this case, refers to LLMs, GPT technology, and anything listed as "AI" meant to increase market valuations.
Yeah, accessibility is the big problem.
What I use depends on the task.
For “chat” and creativity, I use my own quantization of GLM 4.6 (a 350B model), squeezed to just barely fit in 128GB RAM/24GB VRAM, with a fork of llama.cpp called ik_llama.cpp:
https://huggingface.co/Downtown-Case/GLM-4.6-128GB-RAM-IK-GGUF
It’s complicated, but in a nutshell: the quality degradation vs. the full model is reasonable even though it’s quantized to roughly 3 bits instead of 16, and it runs at 6-7 tokens/sec even with most of the weights sitting in system RAM.
For the UI, it varies, but I tend to use mikupad so I can manipulate the chat syntax. LMStudio works pretty well though.
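If you'd rather skip a GUI entirely, llama.cpp's llama-server (and, as far as I know, ik_llama.cpp's build of it) exposes an OpenAI-style endpoint you can script against. Rough sketch below; the port, model name, and prompt are placeholders rather than my exact setup:

```python
# Minimal sketch: query a locally running llama-server through its
# OpenAI-compatible chat endpoint. Port, model name, and prompt are placeholders.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "glm-4.6",  # many builds ignore this and serve whatever was loaded
        "messages": [{"role": "user", "content": "Give me a one-paragraph story hook."}],
        "max_tokens": 512,
        "temperature": 0.6,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])
```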
Now, for STEM stuff or papers? I tend to use Nemotron 49B quantized with exllamav3, or sometimes Seed-OSS 36B, as both are good at that kind of work and handle long context well.
For coding and automation? It… depends. Sometimes I use Qwen VL 32B or 30B, in various runtimes, but it seems that GLM 4.7 Flash and GLM 4.6V will be better once I set them up.
Minimax is pretty good at making quick scripts, while being faster than GLM on my desktop.
For a front end, I’ve been switching around.
I also use custom sampling. I basically always use n-gram sampling in ik_llama.cpp where I can, with DRY at a modest temperature (around 0.6), or low or even zero temperature for more “objective” tasks. This is massively important, as default sampling settings are where so many LLM errors come from.
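To make that concrete, here's roughly what those sampler settings look like as a payload for llama-server's native /completion endpoint. Treat it as a sketch: the parameter names are mainline llama.cpp's, the values are just illustrative, and ik_llama.cpp layers its own samplers on top.

```python
# Sketch of a sampling config for llama-server's native /completion endpoint.
# DRY penalizes verbatim repetition; low temperature keeps "objective" tasks tame.
# Parameter names follow mainline llama.cpp; values here are only illustrative.
import requests

payload = {
    "prompt": "List three failure modes of greedy decoding.",
    "n_predict": 256,
    "temperature": 0.6,     # drop toward 0.0 for more deterministic/"objective" output
    "min_p": 0.05,          # cut the long tail of unlikely tokens
    "dry_multiplier": 0.8,  # enable DRY repetition penalty
    "dry_base": 1.75,
    "dry_allowed_length": 2,
}
resp = requests.post("http://localhost:8080/completion", json=payload, timeout=600)
print(resp.json()["content"])
```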
And TBH, I also use GLM 4.7 over API a lot, in situations where privacy does not matter. It’s so cheap it’s basically free.
So… Yeah. That’s the problem. If you just load up LMStudio with its default Llama 8B Q4KM, it’s really dumb and awful and slow. You almost have to be an enthusiast following the space to get usable results.
Thank you, very insightful.
Really, the big distinguishing feature is VRAM. We consumers just don’t have enough. If I could have a 192GB VRAM system, I could probably run a local model comparable to what OpenAI and others offer, but here I am with a lowly 12GB.
You mean an Nvidia 3060? You can run GLM 4.6, a 350B model, on 12GB VRAM if you have 128GB of CPU RAM. It's not ideal though.
More practically, you can run GLM Air or Flash quite comfortably. And that'll be considerably better than "cheap" or old models like Nano, on top of being private, uncensored, and hackable/customizable.
The big distinguishing feature is "it's not for the faint of heart," heh. It takes time and tinkering to set up, as all the "easy" preconfigurations are suboptimal.
That aside, even if you have a toaster, you can invest a little in API credits and run open-weight models with relative privacy on a self-hosted front end. Pick the jurisdiction of your choosing.
For example: https://openrouter.ai/z-ai/glm-4.6v
It's like a dollar or two per million words. You can even give a middle finger to Nvidia by using Cerebras or Groq, which don't use GPUs at all.
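If you go that route, it's just the standard OpenAI client pointed at a different base URL. Sketch below; the model slug is the one from the link above, and the key is obviously a placeholder:

```python
# Sketch: calling an open-weights model through OpenRouter with the standard OpenAI client.
# The model slug is the one from the link above; swap in any provider/model you prefer.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

resp = client.chat.completions.create(
    model="z-ai/glm-4.6v",
    messages=[{"role": "user", "content": "Write a bash one-liner to find large files."}],
    temperature=0.3,
)
print(resp.choices[0].message.content)
```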