this post was submitted on 24 Apr 2026
58 points (98.3% liked)

LocalLLaMA

4738 readers
5 users here now

Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Lets explore cutting edge open source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, get hyped at the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive constructive way.

Rules:

Rule 1 - No harassment or personal character attacks of community members. I.E no namecalling, no generalizing entire groups of people that make up our community, no baseless personal insults.

Rule 2 - No comparing artificial intelligence/machine learning models to cryptocurrency. I.E no comparing the usefulness of models to that of NFTs, no comparing the resource usage required to train a model is anything close to maintaining a blockchain/ mining for crypto, no implying its just a fad/bubble that will leave people with nothing of value when it burst.

Rule 3 - No comparing artificial intelligence/machine learning to simple text prediction algorithms. I.E statements such as "llms are basically just simple text predictions like what your phone keyboard autocorrect uses, and they're still using the same algorithms since <over 10 years ago>.

Rule 4 - No implying that models are devoid of purpose or potential for enriching peoples lives.

founded 3 years ago
MODERATORS
 

Hi, i haven't seen anybody do what title above says. Idk, maybe everyone nowadays do this already :) But if not, I want to show off a little. There are my specs

12th Gen Intel(R) Core(TM) i5-12450H (12)z

GPU 1: NVIDIA GeForce RTX 3050 Mobile [Discret]

GPU 2: Intel UHD Graphics @ 1.20 GHz [Integrat]

16GB RAM DDR4

Running on cachyos (arch linux), because on Windows, proven by my tests, speed is lower (Gemma 4 E4B 40t/s on linux, and and 30t/s on windows). I used UD-IQ4_NL quant version (13.4GB), as it seems like the best compromise between quality and size. Using ik_llama.cpp fork due to optimizations with MoE and CPU + GPU hybrid work. These are the flags i use "$LLAMA_SERVER"
-m "$MODEL_PATH"
-ngl 99
-c 8000
-fa on
-ctk iq4_nl
-ctv iq4_nl
--parallel 1
-nkvo
-t 8
-tb 8
-b 256
-ub 256
-rtr
-amb 512
--no-mmap
--jinja
-mla 2
--cpu-moe
--mlock
--reasoning off

so there is very little batch size. Even 512 causes OOM. Prefill can take time when context becomes bigger. Not all flags are actually doing something, i just tried everything i found that can help.
Doing the most - cpu-moe (offloading experts to ram), little batch size, and nkvo (offloading kv cache to ram).

Result(u can see token speed) on screenshot.

15t/s - MoE architecture saves the day!

As the result:

  1. The chat quality is great. Facts are solid, instruction following great too
  2. Model is bad on agentic tasks sadly

Great model on just medium class device with limited VRAM, and prove (at least to myself) that26B models don't need 16GB VRAM to run PROPERLY.

The main problem now - is usable context window and prefill speed. On 8k the speed is 10t/s. Waiting for author of the ik_llama.cpp to implement turboquant to help solve the problem. Luckly he already works on that.

PS. tried running qwen3.6 35B. Again - the size is the main problem. Used Apex-i-mini version (14gb). It runs succesfully, speed is 20t/s, but quality is really bad. Will try to max out what i can on UD_IQ4_NL quantisation

UPD: UD_IQ4_NL too big, trying APEX-COMPACT

UPD 2: With a bit of tweaking here and there i balanced memory consumption on VRAM and RAM and APEX-COMPAT version of Qwen3.6 35B... attention... BLASTED with 30 tokens per second! That's just wow. Now problem is that there is only 100mb left on RAM and i can't even open the browser...

So for now, i connected to local server from my phone. And yeah - 30t/s. That's crazy. But no room for context really... Need to figure something out...

Last update, and closing the theme: with qwen 3.6 35B i turned off the prompt cache. Haven't noticed any difference in speed, but ram is kinda free now (at least 500-700mb). Maybe with turned on the speed would maintain better values, but who cares, cause i don't have ram to run this big contexts. Final results: great quality answers, speed is 30t/s. Drops to 20 on 4k context. That's kinda nuts. Now my laptop can be used as server to inference. No work on itself, tho. Waiting for more new quantisation technics (less models size, less kv cache size) and it will be even better.

I hope it was useful to anybody. Can't wait to have Claude code in the pocket :)

all 21 comments
sorted by: hot top controversial new old
[–] MalReynolds@slrpnk.net 11 points 4 weeks ago (1 children)

See also 85 TPS (106 TPS peak) on a single 3090 with Qwen3.6–27B. Shit's getting viable locally, sorry (not sorry) techbros,..

[–] NAwT@lemmy.world 3 points 4 weeks ago (1 children)

Thats actually awesome. Judging on some reviews its very strong model. Sadly, i dont have money for 3090 right now. So will max out my 3050m

[–] MalReynolds@slrpnk.net 3 points 4 weeks ago

Do what you can with what you have (I'm gonna have to wait coz mine are AMD but it'll get there eventually, thanks llama.cpp, that story was worth propagating anyway), future's bright.

[–] NAwT@lemmy.world 7 points 1 month ago (2 children)

If u have any advice to run it better, i'll appritiate that!

[–] panda_abyss@lemmy.ca 6 points 1 month ago

Your experience matches mine, it’s great to chat with, it was able to identify some paintings for me, but not great at agentic tasks.

I was hoping for an MOE 120b a3/4b model, but for 26b it’s great.

[–] corsicanguppy@lemmy.ca 5 points 4 weeks ago (1 children)

A write-up on this would be very valuable: infra, OS, installation, current config, test queries, etc.

[–] NAwT@lemmy.world 2 points 4 weeks ago

Maybe i will make some of these later. Killed lot of time trying to make this to work, but my family and main job are still calling :)

[–] hendrik@palaver.p3x.de 4 points 1 month ago (1 children)

These MoE models are great regarding speed. Half your 15T/s and you can run it entirely without a graphics card on an old computer. At least mine, which is several generations older manages to do 6-7 tokens a second, entirely on CPU. I guess that's a bit slow for some agent to waste 1M tokens on some very basic programming project... But it's enough to chat and ask questions, I guess?

[–] NAwT@lemmy.world 3 points 1 month ago* (last edited 1 month ago) (1 children)

yeah, 6-7 is slow (for me personally even for chat), but 15 feels great. Strange, but It can run even faster in generating progress. KV cache is hittin i guess.
I tried to create my own optimised version of coding agent and it even performes relatively good, but for programming it is surely slow. It would be ok, if it done all the code right from the first try, but it's not. It is not the model problem - even cloud agents do mistakes, but due to high speed they can fix it fast.

but for chat its great

[–] hendrik@palaver.p3x.de 2 points 4 weeks ago* (last edited 4 weeks ago) (1 children)

It took me until now to finally dabble in these coding agents. And I didn't realize at all how many tokens they burn through. I let it write some basic HTML & JavaScript browser game with some free OpenRouter model. I've done this before, just told a model to one-shot it in a single file. And now I tried OpenCode, let it ask me a few questions, come up with a plan and do an entire project structure... And it's at one million tokens way faster than I thought. If my math is correct, that'd take my computer 2 days and nights straight at 6T/s 👀

Guess it's really a bit (too) slow.

[–] NAwT@lemmy.world 4 points 4 weeks ago

the problem with coding agents is simple - THERE A LOT of System promts. Promts that correct the behavior of the model in process of creating project. That is needed becase even largest models are a dumb to some degree. They forget what tools they need to use and how to use them properly. So there hidden from you system promt (i tried Cline for example - it is 11k tokens only on system prompt!) that eats context like crazy. I tried to create similar agent with tools and system promts, that save on context (my custom tool "get_overview", instead of read_file; in mix with "search_content" tool that returns lines on search query, it can save a lot - model don't need to read full file) and mix just a tiny beat cheetsheet to every user msg, so model don't forget. Results were very good. Don't know why they need spam sysprmt like that.

So i think this problem is kinda solvable on local machine

[–] venusaur@lemmy.world 4 points 4 weeks ago (1 children)

That’s awesome! Congrats! Can’t wait til VRAM is more affordable.

[–] NAwT@lemmy.world 1 points 4 weeks ago

Absolutely 💯

[–] NAwT@lemmy.world 3 points 4 weeks ago (2 children)

UPD 2: With a bit of tweaking here and there i balanced memory consumption on VRAM and RAM and APEX-COMPAT version of Qwen3.6 35B... attention... BLASTED with 30 tokens per second! That's just wow. Now problem is that there is only 100mb left on RAM and i can't even open the browser...

So for now, i connected to local server from my phone. And yeah - 30t/s. That's crazy. But no room for context really... Need to figure something out...

[–] XTL@sopuli.xyz 1 points 3 weeks ago

Are you running ram compression or swap? That might help with oom, but naturally becomes another threshold top optimise.

[–] variety4me@lemmy.zip 2 points 4 weeks ago (1 children)

I have a cpu only build of llama.cpp on a 32GB LPDDR5, 6400 MT/s. The laptop has an AMD Radeon 680M, but that is used for wayland, browser and GPU accelerated terminals.

Running llama-server, this is the performance: gemma-4 26B - 10.53 t/s

here is my llama-server command:

llama-server --port 8080 --host 0.0.0.0 --models-preset /mnt/data/ai/models.ini --ctx-size 8192 -ngl 0 --mlock --no-mmap

and here is the model.ini file

version = 1

[*]
; Global defaults - CPU-optimized
seed = -1
top-p = 0.95
top-k = 20
min-p = 0.05
presence-penalty = 0.0
repeat-penalty = 1.1
models-max = 2
jinja = true
batch-size = 256
ubatch-size = 128
threads = 8
threads-batch = 4
cpu-range = 0-7
cpu-strict = 1
kv-offload = false
poll = 25
poll-batch = 50
cpu-moe = true

[gemma-4-26b]
model = /mnt/data/models/daily/google_gemma-4-26B-A4B-it-Q4_K_M.gguf
temperature = 0.65
reasoning-budget = 384
repeat-penalty = 1.05
[–] NAwT@lemmy.world 1 points 4 weeks ago

Yeah, I tried usual llama.cpp and got 12 t/s. Try ik_llama.cpp