I’ve built some projects recently that include integrations with LLMs. Specifically, I’ve found an interest in agentic applications where the LLM has some responsibility over the control flow of the application. Integrating these features into my existing development workflow led me to explore running local LLMs in depth.
Why Run an LLM Locally?
When I talk about running an LLM locally, I mean that I’m running a temporary instance of a model on my development machine. This is not intended to be advice on self-hosting an AI application.
Let’s be clear: it’s going to be a long time before running a local LLM produces the kind of results you get from querying ChatGPT or Claude. (You would need an insanely powerful homelab to come close.) If all you need is a quick chat with an LLM, a hosted service will be far more convenient than setting up a local one.
So when might you want to run your own LLM?
- When privacy is critical
- When costs need to stay predictable
- When response time and quality aren’t critical
In my case, I’m still experimenting with agent-building techniques. That has me paranoid about accidentally introducing loops or other mistakes that could drive up the bill on a pay-as-you-go API key. And when I’m iterating on a side project, I don’t care too much about response time or quality.
Options for Running Models
Ollama
Ollama seems to be emerging as the go-to option for local LLMs at the moment. It has a substantial library of models available and a clean, easy-to-use CLI. The library covers all of the most popular open-weight model families in a wide variety of parameter counts and quantizations (more on this below), including Llama, Mistral, Qwen, and DeepSeek. The CLI reminds me of Docker’s, with simple `pull`, `list`, and `run` commands. It also supports more advanced functionality like creating and pushing your own models.
Ollama makes it simple to get a model up and running quickly. After installing the app, all it takes is `ollama pull llama3.2` followed by `ollama run llama3.2`. This is why I think most people in most situations should consider this tool first.
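Beyond the CLI, Ollama also runs a local REST API (on port 11434 by default) that application code can call, which is how I hook it into agent experiments. Here’s a minimal sketch in Python using the requests library; the model name and prompt are just placeholders for whatever you’ve pulled:

```python
import requests

# Ollama's local server listens on port 11434 by default.
OLLAMA_URL = "http://localhost:11434/api/chat"

response = requests.post(
    OLLAMA_URL,
    json={
        "model": "llama3.2",  # any model you've pulled with `ollama pull`
        "messages": [
            {"role": "user", "content": "Summarize what a quantized model is in one sentence."}
        ],
        "stream": False,  # return a single JSON object instead of a stream
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["message"]["content"])
```

Leaving out the `"stream": False` flag switches the endpoint back to its default streaming behavior, which is what you want for chat-style interfaces.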
Llama.cpp
Llama.cpp is implemented in pure C/C++ (hence the name), which lets it run nearly anywhere with reasonable performance. Its main benefits are its portability and its built-in utilities.
The low-level implementation allows it to run on nearly any platform, which can be a big help on resource-constrained systems like Raspberry Pis or older consumer PCs. It is also capable of running on Android devices and even directly in the browser via a handy WebAssembly wrapper.
The llama.cpp application offers a wide range of utilities directly out of the box. It integrates directly with Hugging Face, one of the more popular model repositories. The other tools I found interesting were the benchmarking and perplexity-measurement commands, which help you understand how different model configurations will behave on the exact hardware that will be executing them.
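To give a flavor of how this fits into application code: recent llama.cpp builds include an HTTP server (llama-server) that exposes an OpenAI-compatible chat endpoint, on port 8080 by default. The sketch below assumes you’ve already started that server with a GGUF model loaded; the model field is effectively a placeholder when only one model is being served:

```python
import requests

# llama-server exposes an OpenAI-compatible API; port 8080 is its default.
LLAMA_SERVER_URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "local-model",  # placeholder; the server typically ignores it with one model loaded
    "messages": [
        {"role": "user", "content": "Explain perplexity in one sentence."}
    ],
    "temperature": 0.7,
}

resp = requests.post(LLAMA_SERVER_URL, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

Because the endpoint follows the OpenAI schema, the calling code doesn’t need to care which local runner is actually behind it.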
Llamafiles
Llamafiles are an interesting development out of Mozilla that lets you run a local LLM from a single executable file. No application necessary. They actually use llama.cpp under the hood. They are a handy option for quickly sharing and distributing models with other developers. The process is simple: download a llamafile, make it executable, and run it. A browser interface for interacting with the model is automatically hosted on localhost.
Llamafiles are certainly less popular at the moment than formats like GGUF, which llama.cpp uses. You can find some sample models linked in the llamafile GitHub repo or by filtering on Hugging Face.
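Since the server a llamafile starts also speaks the OpenAI chat-completions protocol (again on port 8080 by default), you can even point the official openai Python client at it. A small sketch based on that assumption; the dummy API key and placeholder model name are only there to satisfy the client:

```python
from openai import OpenAI

# Point the standard OpenAI client at the llamafile's local server.
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-no-key-required",  # the local server doesn't check this
)

completion = client.chat.completions.create(
    model="LLaMA_CPP",  # placeholder model name; the local server serves whatever it loaded
    messages=[{"role": "user", "content": "What can you tell me about llamafiles?"}],
)
print(completion.choices[0].message.content)
```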
Choosing the Right Model for Your Needs
Once you’ve settled on a method for running an LLM locally, the next step is choosing a model that fits both your needs and the capabilities of your machine. Not all models are created equal—some are optimized for power, others for efficiency, and the right choice depends on your specific use case.
Parameters and Quantization
The size of an LLM is typically described in terms of its parameter count. You’ll see sizes like 7B, 13B, and 65B denoting billions of parameters. Bigger models produce more coherent and nuanced responses, but they also demand significantly more memory and processing power. If you’re just experimenting or running models on a laptop, smaller parameter counts (7B or less) are the best starting point.
Most models are also available in various quantized formats. Quantizations come in flavors like Q4, Q6, and Q8. Quantization compresses models by reducing numerical precision. This trades response accuracy for better performance on less powerful hardware. Lower quantization levels like Q4 will run faster and require less memory but slightly degrade response quality, while higher levels like Q8 offer better fidelity at a much higher resource cost.
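A rough back-of-the-envelope calculation makes the trade-off concrete: the memory needed for the weights alone is roughly the parameter count times the bits per parameter. The sketch below just does that arithmetic; real quantization formats use slightly more bits than their nominal label, and the runtime and KV cache add overhead on top, so treat these as lower bounds:

```python
def estimate_weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Rough estimate of the memory needed just to hold the model weights."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal gigabytes


# Approximate weight sizes for a 7B-parameter model at different precisions.
for label, bits in [("FP16", 16), ("Q8", 8), ("Q6", 6), ("Q4", 4)]:
    print(f"7B @ {label}: ~{estimate_weight_memory_gb(7, bits):.1f} GB")
# FP16 ≈ 14 GB, Q8 ≈ 7 GB, Q6 ≈ 5.2 GB, Q4 ≈ 3.5 GB (weights only)
```

This is also a quick sanity check on whether a given model and quantization will fit in your machine’s RAM or VRAM before you download it.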
Capabilities and Tool Use
Not all models come with the same built-in capabilities. Some can directly use external tools like code interpreters, API calls, or search utilities. If you’re planning to use an LLM as part of an agentic application, look for models explicitly designed for tool use. Many open-weight models lack tool integration out of the box; I was surprised to find DeepSeek in this camp. Llama 3.2 is a good starting point if you want a local LLM with basic tool-calling.
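A quick way to check whether a model actually supports tool calling is to hand it a tool definition and see whether it responds with structured tool calls rather than plain text. Here’s a rough sketch against Ollama’s chat endpoint; the get_weather tool is a made-up example, not a real integration:

```python
import requests

# A hypothetical tool definition in the JSON-schema style that Ollama's chat API accepts.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
        "tools": tools,
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()
message = resp.json()["message"]

# Models with tool support reply with structured tool calls; others just answer in text.
for call in message.get("tool_calls", []):
    print(call["function"]["name"], call["function"]["arguments"])
```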
Another thing to consider is a model’s skill set. Depending on its training data, different models excel at different tasks: some are better at interpreting code, while others thrive on standard language tasks. Benchmarking sites like LiveBench maintain leaderboards tracking which models perform best in different categories.
Other Considerations
As you look through model repositories you’ll quickly find that model files are huge. Small models usually weigh in at a few gigabytes, while larger ones can run to dozens of gigabytes. If you’re experimenting with different models, it’s easy to clutter your system with versions you no longer need, so keeping an eye on storage can save headaches down the road. Ollama manages its own directory of model versions on your machine to help avoid this problem.
A final word of caution: running local LLM models means executing code downloaded from the internet, sometimes in the form of pre-built binaries. Always verify the source and try to stick to trusted repositories like Hugging Face, Ollama, and official developer GitHub repositories.
With the right model and a bit of setup, running a local LLM can be an incredibly useful tool — whether for privacy, cost savings, or tinkering with cutting-edge AI tools.