You must run Nvidia's proprietary kernel module to use CUDA. That proprietary code is not integrated into the mainline Linux kernel.
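If you want a quick sanity check that the user-space CUDA stack can actually talk to the driver, something like this works. A minimal sketch, assuming you have PyTorch installed (it is just one convenient way to probe CUDA from user space):

```python
# Minimal sketch, assuming PyTorch is installed: confirm the CUDA user-space
# stack can reach the kernel driver and report what the card offers.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    free, total = torch.cuda.mem_get_info()      # values are in bytes
    print(f"VRAM: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")
```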
You may find it helpful to know there are two separate sets of software involved here: kernel-space code and user-space code. In gaming, the user-space side is what most people are talking about. In AI, we are talking about the kernel side, the driver code and firmware that run inside the kernel itself. In Linux, these drivers are loaded as "modules".
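To see which GPU module your kernel has actually loaded, you can read /proc/modules (the same data `lsmod` shows). A rough sketch; the module names checked here are just the common ones:

```python
# Rough sketch: list which common GPU kernel modules are currently loaded.
# /proc/modules is the kernel's own list of loaded modules (what `lsmod` reads).
gpu_modules = ("nvidia", "nouveau", "amdgpu", "i915")

with open("/proc/modules") as f:
    loaded = {line.split()[0] for line in f}

for name in gpu_modules:
    print(f"{name}: {'loaded' if name in loaded else 'not loaded'}")
```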
A lot of the ambiguity in this space comes from machines with a UEFI bootloader and Secure Boot enabled, because Secure Boot refuses to load an out-of-tree module unless it has been signed and the key enrolled. You may not have this issue yourself, but I am trying to help you contextualize information you may find.
Linux is a monolithic kernel, which means all of the drivers/modules live in the kernel source tree. If you compile the kernel yourself, the configuration step is basically a long menu system in a terminal that lets you select whichever hardware modules you need, and the build then only compiles the options you selected. Most of the oddball options are for very rare hardware that was basically built wrong, in nonstandard ways; most modules are written against a standard way of doing things, so a single module usually supports a large number of devices from many manufacturers. When you select some oddball option, the kernel just includes that source code.

Nothing stops you from manually adding extra code as modules at this point. That is how Android works: vendors ship orphan proprietary kernels, where that extra source code is never merged into the mainline Linux kernel.
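If you're curious how that selection played out for the kernel your distro shipped, you can poke at /lib/modules for the running kernel. A rough sketch, assuming a standard distro layout:

```python
# Rough sketch, assuming a standard distro layout under /lib/modules.
import os

release = os.uname().release                  # running kernel, e.g. "6.8.0-45-generic"
moddir = f"/lib/modules/{release}"

# modules.builtin lists drivers compiled directly into the kernel image;
# loadable .ko files live under the kernel/ subdirectory.
with open(os.path.join(moddir, "modules.builtin")) as f:
    builtin = f.read().splitlines()

loadable = sum(
    len([m for m in files if m.endswith((".ko", ".ko.zst", ".ko.xz"))])
    for _, _, files in os.walk(os.path.join(moddir, "kernel"))
)

print(f"{len(builtin)} drivers built into the kernel image")
print(f"{loadable} drivers shipped as loadable modules")
```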
Some of the Nvidia driver source is now open: there is an open kernel module that can be built from that code. The actual underlying issue is that it still depends on proprietary pieces, namely the signed GSP firmware blob and the CUDA toolchain itself (nvcc and the user-space libraries), which remain closed.
Android is a scheme that leverages proprietary orphan kernels to take device ownership away from consumers. Similarly, nvcc is kept proprietary so Nvidia retains the option to control ownership and device deprecation.
In practice, that means the proprietary pieces can lag behind new kernel releases, so you may need to run an older kernel to make it work.
In any case, there are two primary software stacks people use: ollama and llama.cpp. Ollama is generally easier to use, but it runs a model in a single hardware space, which means the model needs to fit entirely into VRAM; that is a bad option for 2GB. Llama.cpp splits the model between GPU and CPU and runs both at once. Oobabooga Textgen WebUI is a good front end for this, or you can just run a llama.cpp server directly. Both can expose a local network API that is OpenAI-compatible, so everything else just works the same. That part is not hard to do.

Getting the sampling settings dialed in, and the trial and error of finding a good model split that loads, is a bit more challenging. It is probably better to look for people describing what they run on similar hardware and try those settings as a baseline entry point. In general, a very large quantized model is FAR better than a small model at full precision.
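As an example of the OpenAI-compatible part: once a llama.cpp server is up, any OpenAI client can just point at it. A minimal sketch, assuming the server is listening on localhost:8080; the port, API key, and model name are placeholders, since local servers typically ignore the last two:

```python
# Minimal sketch, assuming a llama.cpp server (llama-server) is already running
# locally with a model loaded and exposing its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # assumed local endpoint
    api_key="sk-no-key-required",         # the local server ignores the key
)

resp = client.chat.completions.create(
    model="local-model",                  # name is ignored by most local servers
    messages=[{"role": "user",
               "content": "Why do quantized models help on 2GB of VRAM?"}],
    temperature=0.7,
)
print(resp.choices[0].message.content)
```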
Thanks for this complete explanation