I use a 128GB Framework Desktop. Back when I got it, it was $2,500 with 8TB of SSD storage, but the RAM shortage has since pushed prices substantially higher. That system's interesting in that you can tell Linux to use essentially all of the memory as video memory; it has an APU with unified memory, so the GPU can address the whole pool.
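If you want to see how much of that memory the GPU can actually reach, here's a minimal sketch that reads the amdgpu driver's memory counters from sysfs. It assumes the APU shows up as card0 and exposes the standard mem_info_* attributes; adjust the card index for your system.

```python
# Rough check of how much memory the amdgpu driver lets the GPU address.
# Assumes the APU is card0 under /sys/class/drm; adjust for your machine.
from pathlib import Path

DEV = Path("/sys/class/drm/card0/device")

def read_bytes(name: str) -> int:
    """Read one of amdgpu's mem_info_* counters (values are in bytes)."""
    return int((DEV / name).read_text().strip())

for counter in ("mem_info_vram_total", "mem_info_gtt_total",
                "mem_info_vram_used", "mem_info_gtt_used"):
    gib = read_bytes(counter) / 2**30
    print(f"{counter}: {gib:.1f} GiB")
```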
That'll get you 70B models (Llama 3-based stuff and the like) at Q6_K with a 128K context window, which is the model's maximum. Speeds at that model size are okay for chatbot-style use, but you won't want to run code generation on it.
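For reference, this is roughly how I load a model like that with the llama-cpp-python bindings (built with the ROCm/HIP backend so the APU's unified memory is usable); the model filename is a placeholder for whatever 70B Q6_K GGUF you end up downloading.

```python
# Minimal sketch: full GPU offload into unified memory, 128K context.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3.3-70b-instruct-Q6_K.gguf",  # hypothetical filename
    n_gpu_layers=-1,   # offload every layer to the GPU / unified memory
    n_ctx=131072,      # 128K context; the KV cache needs a lot of extra memory
)

out = llm("Q: What is unified memory good for? A:", max_tokens=128)
print(out["choices"][0]["text"])
```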
For some tasks, you may be better off with a higher-bandwidth-but-lower-memory video card and an MoE model; those only activate a few expert sub-networks per token, so the whole model doesn't have to sit in video memory at once. I can't say much there, as I've spent less time with that.
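A rough back-of-the-envelope for why that helps, using Mixtral-8x7B-ish numbers (approximate, not measured): the full model is large, but the weights actually used per token are a much smaller working set.

```python
# Rough arithmetic: total weights vs. weights active per token for an MoE model.
def gguf_size_gib(params_billions: float, bits_per_weight: float = 6.5) -> float:
    """Approximate GGUF size for a given parameter count at ~Q6_K."""
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30

total_params = 47.0    # all experts combined, roughly
active_params = 13.0   # experts actually used per token, roughly

print(f"full model on disk/RAM: ~{gguf_size_gib(total_params):.0f} GiB")
print(f"active weights per token: ~{gguf_size_gib(active_params):.0f} GiB")
```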
If you don't care about speed (you probably do), you can run just about anything with llama.cpp using the CPU and main memory, as long as you have enough of it. That can be useful for evaluating a given model's output quality and getting a feel for what it can do before buying hardware.
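With the same llama-cpp-python bindings as above, that's just a matter of not offloading any layers; again, the filename is a placeholder.

```python
# CPU-only run: slow, but fine for spot-checking a model's output quality.
from llama_cpp import Llama

llm = Llama(
    model_path="some-big-model-Q4_K_M.gguf",  # hypothetical filename
    n_gpu_layers=0,   # keep everything on the CPU and in system RAM
    n_ctx=8192,       # a modest context is enough for quality checks
)
print(llm("Write a limerick about self-hosting.", max_tokens=128)["choices"][0]["text"])
```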
You might want to ask on !localllama@sh.itjust.works, as there'll be more people familiar with this over there (though I'm not on there myself).
EDIT: I also have a 24GB Radeon 7900 XTX, but for LLM stuff like llama.cpp, I found the lack of memory too constraining. It does have higher memory bandwidth, so for models that fit, it's faster than the Framework Desktop. In my experience, discrete GPUs were more interesting for image diffusion models like Stable Diffusion (most open-weight image diffusion models are less memory-hungry) than for LLM stuff. Though if you want to do Flux v2, I wasn't able to fit it on that card. I could run it on the Framework Desktop, but at the resolutions I wanted, the poor ol' Framework took about 6 or 7 minutes to generate an image.
EDIT2: I use all-AMD hardware, though I agree with @anamethatisnt@sopuli.xyz that Nvidia hardware is going to be easier to get working; a lot of the AMD software is much more bleeding-edge, as Nvidia got into all this earlier. That being said, Nvidia also charges a premium because of it. I understand the DGX Spark is something of an Nvidia analog to the Framework Desktop and similar AI Max-based systems, with unified memory as well, but you'll pay for it: something like $4k.
Thanks, I'll also ask in the other group you mentioned. I still have a gaming rig here with an RX 6900 XT, but it's way too big to get wife-approved for the living room, and I have no man cave to run it 24/7. ;) It might still be good for testing what model size I actually need, though; I think it's just one generation before all the AI hype took off, but I'm going to try it right away.
It's pretty trivial to make use of an LLM compute box remotely; in fact, most of the software out there is designed around doing this, since lots of people use cloud-based LLM compute machines. I use the Framework Desktop this way: I leave it headless, purely as an LLM compute node for whatever machine is running the software that needs number-crunching done. So if your gaming machine's compute capability is fine for you, you might just leave it wherever it fits and use it remotely from a smaller machine in the living room.
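As a sketch of what "using it remotely" looks like: run llama.cpp's built-in server (or any OpenAI-compatible server) on the compute box, then point the standard openai Python client at it from the living-room machine. The hostname, port, and model name here are placeholders for your own setup.

```python
# Talking to a headless LLM box over the LAN via an OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="http://llm-box.lan:8080/v1",  # the compute box, not this machine
    api_key="not-needed-for-local",         # llama-server ignores the key unless configured
)

resp = client.chat.completions.create(
    model="local-model",  # the server serves whatever GGUF it was started with
    messages=[{"role": "user", "content": "Summarize why unified memory helps LLMs."}],
)
print(resp.choices[0].message.content)
```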
Another benefit of sticking the compute box elsewhere is noise: my Framework Desktop is very quiet (single large fan, about 120 W TDP, and notably quieter than other AI Max-based systems), but keeping my 7900 XTX loaded will spin up the fans. You may not want a heavy-duty number-crunching machine in the living room from that standpoint.