LocalLLaMA

1

23

Your best local LLM for low-VRAM (6GB)? (feddit.org)

submitted 1 day ago by sp3ctre@feddit.org to c/localllama@sh.itjust.works

13 comments fedilink

Hey guys,

What's currently the best LLM for low-VRAM machines with only 6 GB VRAM? I've got 32GB RAM as well.

I'm experimenting a little with SillyTavern and I'm curious which model gets the most out of my setup. Should be multilingual and suitable for "casual chatting".

I know I will probably not get very far with this, but I'm still interested in how far we've already come.

(Using KoboldCPP if that matters).

~sp3ctre

2

6

DystopiaBench - AI Ethics Stress Test (dystopiabench.com)

submitted 4 days ago by Eyekaytee@aussie.zone to c/localllama@sh.itjust.works

15 comments fedilink

3

13

Claude? No. Cucumbers? Yes! (aussie.zone)

submitted 5 days ago* (last edited 5 days ago) by SuspiciousCarrot78@aussie.zone to c/localllama@sh.itjust.works

3 comments fedilink

More often than not, AI and LLM gets conflated in the public consciousness...and then gets mixed with "Agentic", "SaaS" and other well...slop. So, here is a farmer in Japan, using a raspberry pi, to sort cucumbers.

https://www.newsweek.com/artificial-intelligence-cucumber-farm-raspberry-pi-495289

PS: 2016 article. I expect by now the tractor is self driving and named Betty.

If you have any other "dude does cool AI shit with a box of scraps in a cave", I'm all EARS.md

4

35

Llama.cpp MTP Support merged - up to 2.5x speed increase (github.com)

submitted 1 week ago* (last edited 1 week ago) by TheCornCollector@piefed.zip to c/localllama@sh.itjust.works

2 comments fedilink

Qwen3.6-27B-MTP-UD-Q5_K_XL on my 7900XTX goes from 32 t/s to 50-72 t/s depending on the predictability of the task. So, a 1.5x increase on creative tasks up to a 2.2x increase on math.

MTP does not change the quality with the only cost being a few hundred MB extra VRAM usage. You will need to download a gguf model with MTP support to use it.
My parameters:

; Context memory usage  
ctx-size = 65536  
ctk = q8_0  
ctv = q8_0  

; Prompt processing speed  
batch-size = 1024  
ubatch-size = 1024  

; Speculative decoding  
np = 1  
spec-type = draft-mtp  
spec-draft-n-max = 3

Edit: did some more testing using Unsloth's parameters and with spec-draft-n-max = 6 I can get up to 82 tk/s, a 2.56x increase, on the same math prompt. But this comes at the cost of the creative writing task that now falls below 40 tk/s.
It seems like this should be tweaked depending on the prompt similar to the sampling parameters.

5

9

Orthrus-Qwen3: up to 7.8×tokens/forward on Qwen3, identical output distribution (github.com)

submitted 1 week ago* (last edited 1 week ago) by BB84@mander.xyz to c/localllama@sh.itjust.works

4 comments fedilink

Crossposted from https://lemmy.ml/post/47429470

Paper: https://arxiv.org/abs/2605.12825

6

31

"The cost of running LLMs is just too damn high" (aussie.zone)

submitted 1 week ago* (last edited 1 week ago) by SuspiciousCarrot78@aussie.zone to c/localllama@sh.itjust.works

10 comments fedilink

I was browsing Reddit (yetch) while waiting for some stuff to finish when I came across this post

https://old.reddit.com/r/LocalLLM/comments/1tek00h/why_is_llm_is_so_expensive/

The author make a (very) interesting claim: if table stakes are $6K (they're not...but go with it for now), then most folks are cooked from the get go.

Personally, I have been figuring out how to get more from less. For example, people have found ways to run Qwen3.6 35B on a 6GB VRAM GTX 1060 at ~20tok/s (--ctx 64K IIRC, but go check the vids yourself)

https://youtu.be/8F_5pdcD3HY

I think there's a lot of juice to squeeze by turning LLMs from "all seeing sages" into basically mouth pieces for shit that actually runs fast on regular silicon - but that's just me and my crazy brain. YMMV.

7

5

Token Speed visualiser (mikeveerman.github.io)

submitted 1 week ago* (last edited 1 week ago) by SuspiciousCarrot78@aussie.zone to c/localllama@sh.itjust.works

0 comments fedilink

https://mikeveerman.github.io/tokenspeed/?rate=20&mode=agent&think=15

Exactly what it says on the tin :)

Pretty good simulator this. May it cause you to reconsider your expensive GPU upgrade :)

8

<8B multilingual models for language learning chatbots (piefed.social)

submitted 1 week ago* (last edited 1 week ago) by XiELEd@piefed.social to c/localllama@sh.itjust.works

4 comments fedilink

I am currently looking for a model that can run on my phone, it could be <8b or even <4b. It should have a reduced positivity/yes-man bias. I am at a point in my language learning journey where it's more effective to learn a language through trying to actually construct a sentence (which is often through interaction) instead of just reading. Since there are times I am offline, a local LLM that is competent at multiple languages and decent at simulating characters texting would be a great help.

9

13

llama.cpp Multi-Model Server Architecture: ASUS Zenbook UM3504DA (lemmy.zip)

submitted 1 week ago by variety4me@lemmy.zip to c/localllama@sh.itjust.works

9 comments fedilink

System & Software Stack

Hardware: ASUS Zenbook 15 UM3504DA | AMD Ryzen 7 7735U (8C/16T) | Radeon 680M iGPU (512 MB BIOS-limited VRAM) | 32 GB LPDDR5 RAM
OS: CachyOS (Arch Linux) | Wayland + Niri compositor
Runtime: llama.cpp custom Vulkan build | llama-server with preset routing
Deployment Scope: Single-user local inference | 2–3 year static configuration window

Build Configuration

The binary is compiled with hardware-aware optimizations and server/tooling support. Each flag addresses a specific constraint or capability of the target platform.

cmake .. \
  -DGGML_NATIVE=ON \
  -DGGML_OPENMP=ON \
  -DGGML_VULKAN=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON \
  -DLLAMA_BUILD_SERVER=ON \
  -DLLAMA_BUILD_TOOLS=ON

Flag	Purpose	Measured Impact
`GGML_NATIVE=ON`	Enables CPU-specific ISA extensions (AVX2/AVX512)	+10–15% prompt throughput on Zen 3+ cores
`GGML_OPENMP=ON`	Parallelizes prompt processing across available cores	Required for batched CPU inference
`GGML_VULKAN=ON`	GPU acceleration backend	Mandatory for Rembrandt iGPU. ROCm unsupported. CUDA inapplicable.
`CMAKE_INTERPROCEDURAL_OPTIMIZATION=ON`	Link-time optimization	Reduces binary size, improves instruction cache locality
`DLLAMA_BUILD_SERVER=ON`	Compiles HTTP server with OpenAI-compatible API	Enables remote UI and agent routing
`DLLAMA_BUILD_TOOLS=ON`	Enables structured function calling	Required for agentic task execution

Server Launch & Routing Architecture

The server is invoked with strict resource controls to prevent memory thrashing on constrained hardware:

llama-server --port 8080 --host 0.0.0.0 \
  --models-preset /mnt/data/ai/models.ini \
  --models-max 1 \
  --tools all

--models-max 1: Enforces single-model residency. Prevents concurrent RAM/GTT allocation spikes.
--models-preset: Loads declarative INI configuration for deterministic parameter application.
--tools all: Activates full OpenAI-compatible tool/schema support for agent workflows.
Port 8080 bound to all interfaces for integration with local UIs (OpenWebUI, Helium) and routing scripts.

Configuration Architecture (`models.ini`)

The preset system uses a global-defaults + per-model-override structure. This eliminates runtime flag management, ensures baseline stability across all workloads, and allows precise parameter alignment per model architecture.

version = 1

[*]
; Global defaults - CPU-optimized baseline
seed = -1
top-p = 0.95
top-k = 20
min-p = 0.05
presence-penalty = 0.0
repeat-penalty = 1.1
jinja = true
batch-size = 256
ubatch-size = 256
threads = 8
threads-batch = 8
cpu-range = 0-7
cpu-strict = 1
kv-offload = false
defrag-thold = 0.1
poll = 25
poll-batch = 50
cpu-moe = true
gpu-layers = 0
ctx-size = 16384

Global defaults prioritize CPU affinity, strict thread binding, MoE routing on CPU, and conservative KV cache management. Per-model sections override only the parameters required for their specific workload profile.

Per-Model Profiles & Parameter Rationale

Quick Reasoning: `gemma-4-e4b`

[gemma-4-e4b]
model = /mnt/data/models/daily/google_gemma-4-E4B-it-Q4_K_M.gguf
temperature = 0.7
reasoning-budget = 256
gpu-layers = 32
ctx-size = 32768

Purpose: Low-latency code completion, rapid drafting, lightweight Q&A.
Rationale: 4B MoE fits entirely within GPU offload limits. Extended context (32K) enables long-file navigation. reasoning-budget = 256 constrains chain-of-thought to prevent token waste. temperature = 0.7 maintains creative variance for ideation tasks.

General Purpose: `gemma-4-26b` (Daily Driver)

[gemma-4-26b]
model = /mnt/data/models/daily/google_gemma-4-26B-A4B-it-IQ4_NL.gguf
temperature = 0.65
repeat-penalty = 1.05
reasoning-budget = 512   
batch-size = 512          
ubatch-size = 512         
defrag-thold = 0.05
gpu-layers = 18

Purpose: Primary conversational, analytical, and long-form generation workload.
Rationale: Heavily optimized for sustained throughput and thermal stability. Detailed parameters documented in the following section.

Agentic Router: `qwen3.5-9b`

[qwen3.5-9b]
model = /mnt/data/models/daily/Qwen_Qwen3.5-9B-Q4_K_M.gguf
temperature = 0.65
top-k = 25
repeat-penalty = 1.05

Purpose: Function calling, tool selection, structured API routing.
Rationale: top-k = 25 narrows sampling distribution to improve tool-call determinism. Reduced repeat-penalty prevents schema repetition loops. Global CPU defaults apply to minimize latency during routing.

Complex Reasoning: `qwen3.6-35b`

[qwen3.6-35b]
model = /mnt/data/models/daily/Qwen_Qwen3.6-35B-A3B-Q3_K_M.gguf
temperature = 0.6
presence-penalty = 0.8
reasoning-budget = 256
repeat-penalty = 1.05
ctx-size = 8192

Purpose: Deep analysis, multi-step reasoning, constrained exploration.
Rationale: 35B MoE requires memory safety limits. ctx-size = 8192 prevents GTT saturation. presence-penalty = 0.8 forces lexical diversity during long-form generation. reasoning-budget = 256 maintains structured output without unbounded context accumulation.

Experimental: `lfm2-24b`

[lfm2-24b]
model = /mnt/data/models/experimental/LFM2-24B-A2B-Q4_K_M.gguf
temperature = 0.6
presence-penalty = 0.8
reasoning-budget = 256
repeat-penalty = 1.05

Purpose: Architecture evaluation, quantization testing, parameter isolation.
Rationale: Mirrors 35B safety guardrails. Kept separate from daily workflows to prevent context contamination or parameter bleed during testing.

Primary Model Optimization: Gemma-4-26B

The 26B MoE profile represents the core optimization target. Parameter selection resulted from systematic empirical testing across offload depth, batch sizing, cache management, and thermal behavior.

Parameter	Value	Rationale
`gpu-layers`	18	Measured efficiency sweet spot. Beyond 18 layers, GTT usage exceeds 9.8 GB with diminishing returns (+0.15 t/s per layer).
`batch-size` / `ubatch-size`	512	Increased from global 256. Matches prompt throughput requirements without exceeding KV cache limits.
`defrag-thold`	0.05	Aggressive KV cache defragmentation prevents memory fragmentation during long sessions.
`threads`	6 (override)	Reduced from global 8. Maintains baseline CPU activity to trigger firmware fan curves during GPU-heavy inference.
`reasoning-budget`	512	Enforces structured chain-of-thought. Improves cache locality and prevents context bloat.
`temperature` / `repeat-penalty`	0.65 / 1.05	Balances coherence with lexical variation. Lower repeat penalty prevents over-penalization in technical prose.

Measured Performance:

CPU-only (0 layers): 9.9 t/s generation | 0 GB GTT
18-layer offload: 16.9 t/s generation | 9.8 GB GTT | Stable >50 hrs
24-layer offload: 18.6 t/s generation | 12.6 GB GTT | Marginal stability
Real-world 2,090-token response: 116s → 68s (40% reduction)

Hardware Constraints & Empirical Findings

VRAM Limitation: BIOS locks dedicated VRAM to 512 MB. GPU offloading immediately utilizes GTT (system RAM mapped as VRAM). amdgpu_top confirms usable VRAM caps at ~450 MB.
Offloading Diminishing Returns: Layers 0–6 yield +0.73 t/s per layer. Layers 6–18 yield +0.15–0.33 t/s per layer. Layers 18–24 yield +0.15–0.28 t/s per layer with >0.5 GB GTT cost per layer. Stability degrades past 20 layers.
Thermal Firmware Constraint: ASUS fan curves respond exclusively to CPU load. GPU-only inference bypasses thermal regulation. threads = 6 ensures consistent CPU activity to maintain airflow.
Context Scaling: Generation throughput drops ~27% at 40% context fill due to O(n) attention scanning. 16K context is the practical ceiling for 32 GB RAM with 20B+ models.
Reasoning Budget as Cache Optimizer: Enforcing explicit reasoning tokens structures KV cache layout, reduces attention fragmentation, and prevents unbounded context accumulation during long sessions.

Deployment Parameters

This configuration is locked for sustained single-user deployment. No dynamic context routing, no concurrent model loading, no over-engineered orchestration.

Locked Baseline:

Vulkan + OpenMP + Native ISA compilation
Global CPU defaults with per-model parameter overrides
gemma-4-26b at 18 layers (16.9 t/s, 9.8 GB GTT)
gemma-4-e4b at 32 layers (full GPU offload)
ctx-size = 16384 default, model-specific reductions where required
threads = 6 on 26B profile for thermal regulation
Single-model residency enforced via --models-max 1

The stack delivers deterministic throughput, stable memory residency, and predictable thermal behavior within hardware constraints. Configuration changes are restricted to model quantization updates or hardware replacement.

10

4

Gemma4 with MTP was released (jlai.lu)

submitted 2 weeks ago by Mubelotix@jlai.lu to c/localllama@sh.itjust.works

1 comments fedilink

11

21

Good translation models which fit on a smartphone? (piefed.jeena.net)

submitted 2 weeks ago by jeena@piefed.jeena.net to c/localllama@sh.itjust.works

10 comments fedilink

I live in Korea but still don't speak the language. I get a lot of SMS in Korean, 95% are spam but the last 5% are important ones. I already missed to pay my phone bull twice for months because I didn't realize that the credit card I put there was not valid anymore and they kept sending me SMS about it which I ignored because most of the SMS is spam and copy and pasting everyone into Google Translate is quite a lot of work which is tidious and I just don't do it.

So my idea was to take a open source SMS app like Fossify Messages and add automatic translation to it. And especially because SMS is used for security relevant stuff like 2 factor authentication, I really need the translation to work locally and not on a 3rd party server.

On my PC I have a really really good model the Aya:8b which fits well into the 12 GB VRAM on my RTX3060 and the results Korean -> English are outstanding!

But when I put it on the phone -I have a Samsung S24 Ultra - it fills up the RAM and get's killed quite quickly. I tried to configure it so it's allowed to use more ram for a longer period of time, etc. but even then it's extremely slow and translates like 3 SMS in an hour and I have about 5000 in the database (I only translate the Korean ones).

I tried some other models like Gemma 3 and NLLB-200 which just output garbage, especially the later dropped numbers , URL, codes which are important in SMS translations.

Anyway, does someone have any tips what I could do?

12

15

AI-Editor in LibreOffice Writer? (mander.xyz)

submitted 3 weeks ago* (last edited 2 weeks ago) by tristynalxander@mander.xyz to c/localllama@sh.itjust.works

3 comments fedilink

Recently I used ChatGPT for editing an email and it opened this in place editor where I could highlight a small section, a little box would open, I could tell it what i thought was wrong, and then it would just edit just that section. But I could also just edit the text myself directly. This is way better than having it re-write my whole text, having to figure out where that section went, and copy-pasting it back into my actual text. It felt a lot more like editing with a co-author, not in the "it's like a person way" but in the it's a focused edit way. Idk, it's a better writing experience.

Having played with LibreOffice Extensions a bit before I'm fairly certain at least a primitive version of this could be made, but I was hoping someone might have experience with the existing Extensions. Most of them look like "write a paragraph for me" to my eye, but none have great descriptions either.

Thoughts?

Edit: Alternatively, does anyone have thoughts on the requirements on the model side of things to make this? It's fairly trivial to feed the current text into the LLM and define the highlighted text. I suspect I could figure out how to open a window of some sort to tell it more - actually using comments would make this pretty easy in Libre Office, but I'm not sure if I know how to get the LLM to give me reliably parsable output... I could probably make track changes thing or at the worst a comment by the LLM I just don't know if telling it to only respond with the edit would work... It's been a while since I've played with all this.

Edit 2: Frustratingly the OpenAI interface has changed since I made this post and it's currently trash. that re-writes for you rather than making suggestions. Annoying.

13

9

a little locallama game theory ...game (aussie.zone)

submitted 3 weeks ago by Eyekaytee@aussie.zone to c/localllama@sh.itjust.works

5 comments fedilink

Everyone in the world has to take a private vote by pressing a red or blue button. If more than 50% of people press the blue button, everyone survives. If less than 50% of people press the blue button, only people who pressed the red button survive. Which button would you press?

Paste this straight into a local LLM of your choice (no modifying or influencing the outcome!) and show us the outcome

I am using the fairly obscure EuroLLM 22b and after a lot of discussion with itself it finally said:

Final Answer: Press the red button.

Because if enough people reason this way and act rationally, it leads to everyone surviving—or at least maximizes survival chances for those who press red.

--

So which LLM are you using and what answer do you get?

14

19

Mistral Medium 3.5 released (mistral.ai)

submitted 3 weeks ago by Eyekaytee@aussie.zone to c/localllama@sh.itjust.works

1 comments fedilink

15

18

Is there any good general AI-Agent /workflow platform which isn't vibe-coded? (palaver.p3x.de)

submitted 3 weeks ago* (last edited 3 weeks ago) by hendrik@palaver.p3x.de to c/localllama@sh.itjust.works

7 comments fedilink

I'm looking for something more like an traditional Free and Open Source project with an active community, different use-cases...

I tried googling it. But there's just way to many results these days. And they're mostly(?) cooked up by some AI agent and tend to get abandoned randomly after a few weeks. Or they have broad claims in a shiny README.md and then I install something and in reality it sucks and doesn't even do half of it. Or they're made by lunatics like Peter Steinberger who default to giving their agents root permissions on everything. That's why I try to avoid that category of projects.

I know I can code everything myself with Python, but it'd be great to have some workflows and integrations laid out for me, memory, RAG, a sandboxed Linux shell, cron, webhooks... So I can just go ahead and connect it to my local LLM and use it for various things. React to my my messages, look up information, read new pull-requests from a repository or RSS feed, write something to a homepage, pipe something into TTS or Ace-Step do a radio show or whatever. Make a small group of agents or my own tools...

Idk, something roughly alike n8n just proper open-source? Is there anything out there you other people use?

I'm asking in the LocalLlama community since I try to run it locally. And I need some amount of customizability so I can create some clever workflows. Something like OpenCode also doesn't really help if it wastes a million tokens on some mundane task and it's not really designed to fit with my limited amount of compute resources. Or if it's super hard to customize it to do so.

16

28

would you laugh at me if I ran gemma-4-26b on a 4 core Xeon, with 32GB RAM, no GPU? (codeberg.org)

submitted 3 weeks ago by variety4me@lemmy.zip to c/localllama@sh.itjust.works

2 comments fedilink

I have spent a few days tweaking this setup to attain these results:

Model	Prompt (tok/s)	Generation (tok/s)
`gemma-26b-moe`	8.9	6.4
`qwen3.5-4b-no-think`	21.5	8.4

Although modest, It is great for local parsing and analysis of my self-hosted homelab data where sending logs to external APIs is not desirable.

Typical workflows:

Log analysis: Piping journalctl output to the API for error triage and root cause hypothesis generation.
Configuration synthesis: Generating AdGuard Home rewrite rules, nginx location blocks, or fstab entries based on defined parameters.
Troubleshooting constraints: Querying for failure modes specific to the local topology (e.g., NFS mount failures over a 1 Gbps unmanaged switch, Tailscale DERUP routing behind CGNAT).
Alert context: Correlating Beszel/Uptime Kuma notifications with service-specific knowledge (e.g., "mediabox CPU spike while SabNZBd is extracting").

17

12

Noob here: Why is Google making Gemma open-source? (sh.itjust.works)

submitted 3 weeks ago by Yerbouti@sh.itjust.works to c/localllama@sh.itjust.works

21 comments fedilink

I'm kind of new to local AI and wondering what's the move here? Are they trying to pull off a chrome/android situation? Obviously I don't trust any of these gafam giants but I would be really interested in running a local LLM on my M1 max (briefly used deepseek last year). My use case would be mostly chat functions to help with academic and text analysis tasks (don't worry I don't just blindly trust LLMs, I know what I'm doing), so recommendations are welcome.

18

17

Which open models are actually good at agentic coding? (lemmy.dbzer0.com)

submitted 3 weeks ago by hok@lemmy.dbzer0.com to c/localllama@sh.itjust.works

9 comments fedilink

Are there any open models that can actually compete with proprietary ones like GPT 5.5 Extended Thinking or Claude Opus 4.7? I am getting really good results with those in their chat interfaces for coding tasks. They sometimes spend 30-45 minutes working on my task and have an internal container they are doing tool calls on, like cloning a repository and compiling their code, and can find online documentation. Their answers are very good and usually correct for very complex tasks requiring specific protocols.

So I would like to know how well we can replicate this using open models since I want more control over how it runs, and privacy. Do any of you hook in agentic capabilities into your local models? How do you do it, and which models give you good results?

Pretend I have unlimited resources (local llama.cpp, sufficient fast storage/memory, and unlimited time to wait for a good response).

19

12

BullshitBench Viewer - BullshitBench measures whether AI models challenge nonsensical prompts instead of confidently answering them, created by Peter Gostev. (petergpt.github.io)

submitted 3 weeks ago by Eyekaytee@aussie.zone to c/localllama@sh.itjust.works

2 comments fedilink

https://github.com/petergpt/bullshit-benchmark

A very necessary benchmark

20

9

Intel B70: LLama.cpp SYCL vs LLama.cpp OpenVino vs LLM-Scaler (lemmy.world)

submitted 3 weeks ago by Fmstrat@lemmy.world to c/localllama@sh.itjust.works

0 comments fedilink

In case anyone is interested, I decided to test out LLama.cpp's new OpenVino backend to see how it compares on Intel GPUs. At first glance, it stomps all over the previous best-case, SYCL, but lags behind LLM-Scaler (Intel's VLLM fork), likely just due to the hardware optimizations against GPTQ/Int4. Interestingly tg512 was fastest on SYCL, but in real world, the prompt processing always seems the be the indicator on this card.

As usual with Intel, model selection is... poor. It took a while to even find a model that was in the validated OpenVino list that would not only run properly, but also have a counterpart that was "close enough" for LLM Scaler.

## Llama.cpp OpenVino
llama-benchy http://localhost:8000/v1 bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M

| model                                              |   test |              t/s |     peak t/s |      ttfr (ms) |   est_ppt (ms) |   e2e_ttft (ms) |
|:---------------------------------------------------|-------:|-----------------:|-------------:|---------------:|---------------:|----------------:|
| bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M | pp2048 | 3845.61 ± 524.73 |              | 659.99 ± 56.95 | 489.07 ± 56.95 |  739.42 ± 56.84 |
| bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M |  tg512 |     40.89 ± 0.55 | 44.33 ± 1.25 |                |                |                 |

## Llama.cpp SYCL
llama-benchy http://localhost:8000/v1 bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M

| model                                              |   test |            t/s |     peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:---------------------------------------------------|-------:|---------------:|-------------:|----------------:|----------------:|----------------:|
| bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M | pp2048 | 844.64 ± 19.25 |              | 2199.90 ± 23.63 | 2178.96 ± 23.63 | 2229.67 ± 24.84 |
| bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M |  tg512 |   73.87 ± 1.17 | 78.00 ± 2.16 |                 |                 |                 |

## LLM-Scaler
llama-benchy http://localhost:8000/v1 jakiAJK/DeepSeek-R1-Distill-Llama-8B_GPTQ-int4

| model                                             |   test |              t/s |     peak t/s |      ttfr (ms) |   est_ppt (ms) |   e2e_ttft (ms) |
|:--------------------------------------------------|-------:|-----------------:|-------------:|---------------:|---------------:|----------------:|
| jakiAJK/DeepSeek-R1-Distill-Llama-8B_GPTQ-int4    | pp2048 | 7875.52 ± 642.20 |              | 268.09 ± 20.50 | 240.11 ± 20.50 |  268.34 ± 20.45 |
| jakiAJK/DeepSeek-R1-Distill-Llama-8B_GPTQ-int4    |  tg512 |     52.75 ± 0.10 | 54.00 ± 0.00 |                |                |                 |

21

58

I ran Gemma 26B on 4GB VRAM + 16RAM. 15 t/s on avarage (lemmy.world)

submitted 1 month ago* (last edited 4 weeks ago) by NAwT@lemmy.world to c/localllama@sh.itjust.works

18 comments fedilink

Hi, i haven't seen anybody do what title above says. Idk, maybe everyone nowadays do this already :) But if not, I want to show off a little. There are my specs

12th Gen Intel(R) Core(TM) i5-12450H (12)z

GPU 1: NVIDIA GeForce RTX 3050 Mobile [Discret]

GPU 2: Intel UHD Graphics @ 1.20 GHz [Integrat]

16GB RAM DDR4

Running on cachyos (arch linux), because on Windows, proven by my tests, speed is lower (Gemma 4 E4B 40t/s on linux, and and 30t/s on windows). I used UD-IQ4_NL quant version (13.4GB), as it seems like the best compromise between quality and size. Using ik_llama.cpp fork due to optimizations with MoE and CPU + GPU hybrid work. These are the flags i use "$LLAMA_SERVER"
-m "$MODEL_PATH"
-ngl 99
-c 8000
-fa on
-ctk iq4_nl
-ctv iq4_nl
--parallel 1
-nkvo
-t 8
-tb 8
-b 256
-ub 256
-rtr
-amb 512
--no-mmap
--jinja
-mla 2
--cpu-moe
--mlock
--reasoning off

so there is very little batch size. Even 512 causes OOM. Prefill can take time when context becomes bigger. Not all flags are actually doing something, i just tried everything i found that can help.
Doing the most - cpu-moe (offloading experts to ram), little batch size, and nkvo (offloading kv cache to ram).

Result(u can see token speed) on screenshot.

15t/s - MoE architecture saves the day!

As the result:

The chat quality is great. Facts are solid, instruction following great too
Model is bad on agentic tasks sadly

Great model on just medium class device with limited VRAM, and prove (at least to myself) that26B models don't need 16GB VRAM to run PROPERLY.

The main problem now - is usable context window and prefill speed. On 8k the speed is 10t/s. Waiting for author of the ik_llama.cpp to implement turboquant to help solve the problem. Luckly he already works on that.

PS. tried running qwen3.6 35B. Again - the size is the main problem. Used Apex-i-mini version (14gb). It runs succesfully, speed is 20t/s, but quality is really bad. Will try to max out what i can on UD_IQ4_NL quantisation

UPD: UD_IQ4_NL too big, trying APEX-COMPACT

UPD 2: With a bit of tweaking here and there i balanced memory consumption on VRAM and RAM and APEX-COMPAT version of Qwen3.6 35B... attention... BLASTED with 30 tokens per second! That's just wow. Now problem is that there is only 100mb left on RAM and i can't even open the browser...

So for now, i connected to local server from my phone. And yeah - 30t/s. That's crazy. But no room for context really... Need to figure something out...

Last update, and closing the theme: with qwen 3.6 35B i turned off the prompt cache. Haven't noticed any difference in speed, but ram is kinda free now (at least 500-700mb). Maybe with turned on the speed would maintain better values, but who cares, cause i don't have ram to run this big contexts. Final results: great quality answers, speed is 30t/s. Drops to 20 on 4k context. That's kinda nuts. Now my laptop can be used as server to inference. No work on itself, tho. Waiting for more new quantisation technics (less models size, less kv cache size) and it will be even better.

I hope it was useful to anybody. Can't wait to have Claude code in the pocket :)

22

24

DeepSeek-V4 Pro (1.6T-A49) and Flash (284B-A13) (huggingface.co)

submitted 1 month ago* (last edited 1 month ago) by TheCornCollector@piefed.zip to c/localllama@sh.itjust.works

0 comments fedilink

DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, significantly advances the knowledge capabilities of open-source models, firmly establishing itself as the best open-source model available today. It achieves top-tier performance in coding benchmarks and significantly bridges the gap with leading closed-source models on reasoning and agentic tasks. Meanwhile, DeepSeek-V4-Flash-Max achieves comparable reasoning performance to the Pro version when given a larger thinking budget, though its smaller parameter scale naturally places it slightly behind on pure knowledge tasks and the most complex agentic workflows.

Both models support one million tokens context length.

23

56

Qwen3.6 27B released (huggingface.co)

submitted 1 month ago* (last edited 1 month ago) by TheCornCollector@piefed.zip to c/localllama@sh.itjust.works

23 comments fedilink

Recently made a post about the 35b MOE. Now the dense 27b variant has been released.

24

62

Qwen3.6 finally makes my Local LlaMa useful (discuss.tchncs.de)

submitted 1 month ago by Bob_Robertson_IX@discuss.tchncs.de to c/localllama@sh.itjust.works

25 comments fedilink

Last year when Framework announced the Framework Desktop I immediately ordered one. I'd been wanting a new gaming PC, but I'd also been kicking around the idea of running a local LLM. When it finally arrived it worked great for gaming… but there wasn't much that would run on the AMD hardware from an LLM standpoint. Over the next few months more tools became available, but it was very slow going. I had many long nights where I'd work and work and work and end up right back where I started.

So I got a Claude Code subscription and used it to help me build out my LLM setup. I made a lot of progress, but now I was comparing my local LLM to Claude, and there was no comparison.

Then I started messing with OpenClaw. First with Claude (expensive, fast), then with my local llama.cpp (cheap, frustrating). I didn't know enough about it, so I used Claude to help me build a custom app around my llama.cpp. That was fun and I learned a lot, but I was spending most of my time chasing bugs instead of actually optimizing anything.

Around that time I heard about Qwen3-Coder-Next, dropped it into llama.cpp, and wow that was a huge step forward. Better direction-following, better tool calls, just better. I felt like my homegrown app was now holding the model back, so I converted over to OpenClaw. Some growing pains, but once things settled I was impressed again.

We built a lot of tooling along the way: a vector database memory system that cleans itself up each night, a filesystem-based context system, speech-to-text and text-to-speech, and a vision model. At this point my local LLM could see me, hear me, speak to me, and remember things about me, and all of it was built to be LLM-agnostic so Claude and my local system could share the same tools.

I was still leaning on Claude heavily for coding, because honestly it's amazing at it. I decided to give Qwen a small test project: build a web-based kanban board: desktop and mobile friendly. It built it… but it sucked. Drag between columns? Broken. Fixed that, now you can't add items. Fixed that, dragging broke on mobile. I kept asking Claude to help troubleshoot and it kept just wanting to rewrite the app. Finally I gave in and said "just fix it" and Claude rewrote the whole thing and it was great. I was disheartened. On top of that, Qwen kept getting into these loops, sometimes running for hours doing nothing productive.

So about a week and a half ago I decided to rethink what I even wanted my local LLM to do. Coding was obviously out. I decided to start fresh and use it to help me journal. A few times a day it reaches out, asks what I'm doing, and if it's relevant, adds an entry to my journal.

I went through a couple more model swaps trying to get it stable, Qwen3.5 was better than Coder-Next for this use case but I was still hitting loop issues. It was consistently prompting me and doing a decent job with the journal, which was at least a step in the right direction.

Then Qwen3.6 dropped. I put the Q6 quant on the same day it released and immediately I could tell it was faster and the output quality was much higher. And I realized earlier today that since I switched to Qwen3.6 I haven't had to ask Claude to check in on Qwen even once. The looping is gone. It's actually following the anti-loop protocols I've been trying to get models to follow for months.

I haven't tried coding with it yet (I don't have high hopes there) but I've given it the ability to create and modify its own skills and it's been doing that beautifully. Scheduled tasks, multiple agents (voice assistant, primary, Home Assistant), all running smoothly.

My reliance on Claude has dropped off sharply since moving to Qwen3.6, and my system resource usage has gone down significantly too. If you've tried to get a local LLM setup running and gave up out of frustration… now might be a good time to jump back in, especially if you know your hardware should be able to handle it.

25

57

Anyone's using Intel Arc B70 Pro? (lemmy.dbzer0.com)

submitted 1 month ago by pound_heap@lemmy.dbzer0.com to c/localllama@sh.itjust.works

20 comments fedilink

32 GB VRAM for less $1k sounds like a steal these days, and I'm sure it's not getting cheaper any time soon.

Does anyone here use this GPU? Or any recent Arc Pros? I basically want someone to talk me out of driving to the nearest place that has it in stock and getting $1k poorer.

LocalLLaMA

System & Software Stack

Build Configuration

Server Launch & Routing Architecture

Configuration Architecture (models.ini)

Per-Model Profiles & Parameter Rationale

Quick Reasoning: gemma-4-e4b

General Purpose: gemma-4-26b (Daily Driver)

Agentic Router: qwen3.5-9b

Complex Reasoning: qwen3.6-35b

Experimental: lfm2-24b

Primary Model Optimization: Gemma-4-26B

Hardware Constraints & Empirical Findings

Deployment Parameters

Configuration Architecture (`models.ini`)

Quick Reasoning: `gemma-4-e4b`

General Purpose: `gemma-4-26b` (Daily Driver)

Agentic Router: `qwen3.5-9b`

Complex Reasoning: `qwen3.6-35b`

Experimental: `lfm2-24b`