Local AI: Run Large Language Models Securely on Your Own Infrastructure
The usage of Large Language Models (LLMs) like OpenAI's ChatGPT has exploded since ChatGPT's release in November 2022. Since then, many more companies have released similar LLMs, such as Google's Gemini, Anthropic's Claude and xAI's Grok. While this variety is a win for users, it presents a significant challenge for organizations: data sovereignty.
Having this many choices is great, but picking a provider also means deciding which company gets to handle all the data you send when interacting with their chatbot; you are effectively trusting them with every piece of information you input. From a personal standpoint, the privacy trade-off for a casual question might seem minor. For a business, however, the stakes are much higher. Organizations must consider a critical security question: can we risk employees uploading proprietary source code, financial spreadsheets, or sensitive client documents to a third-party cloud?
Luckily, you can still use the power of LLMs without sending your data to OpenAI and the like. In this blog post, I will show you how to set up LLMs locally on your laptop, PC or server, and how you can privately use an LLM on your own documents while still enjoying the benefits LLMs provide, such as high-speed text summarization, creative brainstorming, and sophisticated data synthesis. By hosting these models on your own hardware, you gain the freedom to process sensitive datasets – like financial records or private intellectual property – without the fear of them being used for training or stored on a third-party server. It effectively bridges the gap between state-of-the-art intelligence and uncompromising data sovereignty.
Before diving into the setup and its requirements, I want to provide some background on LLMs and the terms you will encounter when looking into this topic online. The setup itself is relatively simple, but once you dive in you will notice there are lots of different model types, file types, parameter sizes, quantization methods and so on.
Model Files
There is a wide variety of publicly available LLM models on the internet. A popular site for distributing them is huggingface.co/models, where both individuals and companies upload their models; it is, however, not the most intuitive website, and most tools also provide an easier way to download models directly. Finding free-to-download LLM models is easy; choosing a specific model or file requires more effort. Popular free models to look into are Phi (Microsoft, small models), Qwen (Alibaba), Llama (Meta), Mistral, DeepSeek and Gemma (Google). Just try some of the most popular models and see how they work for your specific use case.
File Types
Most LLM models you find online will be either a .gguf or a .safetensors file. This file contains the whole model and can therefore be quite large. Make sure to use a tool that supports one of these file types; otherwise you limit your choice of models and ultimately the usefulness of your local deployment.
The Role of the LLM
Most LLMs have been trained for a specific role or use case, which is usually indicated in the file name. "Instruct" in the name indicates that the model is meant for chat-like use, like ChatGPT. "Text", or no specific indication, usually means the model is a base model that can only continue text. You might also encounter a "reason(ing)" or "r" version, which is similar to instruct but responds with the reasoning steps it took to reach its answer.
Parameter Size and Quantization
When selecting a model you will see that there are even more versions of the same model to choose from. For example, if you know you want to use Qwen2.5, you can choose from 0.5b, 1.5b, 3b, 7b, 14b, 32b and 72b variants. This number indicates the parameter count of the LLM (b stands for billion); the larger the number, the larger the model and, generally, the more complex the tasks it can solve. However, the file size grows dramatically as the parameter count increases, and with it the amount of RAM/VRAM required to load the model.

Since these files can get very large, there are methods that try to reduce the size while keeping roughly the same performance. This is called quantization; in the image below you can see the different quantized versions of Qwen2.5 32b. The reason to pick a quantized version can simply be that you are limited by your hardware but still want a certain number of parameters. Remember that this comes at some loss in the quality of the LLM. To keep it simple, just look at the file size and see what works with your hardware, i.e. what fits into memory; more on that later.
What Do You Need?
It is possible to run LLMs on pretty much any computer, even embedded devices like a Raspberry Pi or a smart surveillance camera, although the performance and the models you can choose from are rather limited on such small devices. The two most important hardware factors are CPU RAM and GPU VRAM. GPUs, which are best suited for running and training LLMs, are limited in the amount of working memory they contain, so for the most performance – or, increasingly, simply the ability to run a model at all – one or more dedicated GPUs (NVIDIA is generally better supported) are necessary. CPUs (general-purpose processors) can also run the models, but because LLMs rely heavily on matrix calculations, performance will suffer compared to GPUs, whose many small compute units are optimized for exactly that workload.
A good rule of thumb for the required (V)RAM is the file size of the model plus some overhead.
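As a rough, hedged example of that rule of thumb: an 8-billion-parameter model stored at 4-bit quantization needs about 8 × 10⁹ × 0.5 bytes ≈ 4 GB for the weights alone, which lines up with the 4–5 GB file sizes you typically see for 8b models at Q4_K_M. On top of that, the runtime needs extra memory for the context window and bookkeeping, so plan for the file size plus roughly 20–30% headroom; the exact overhead varies per model and tool.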
There are tricks to improve performance on CPUs, but that is out of scope for now.
Note that it is still possible to load larger models that do not fit in your RAM; the model will then spill over into swap memory, i.e. onto your SSD/HDD, which makes the LLM incredibly slow and will not provide a good experience.
Setup
For this example we'll go with qwen3:8b-q4_K_M. My test system consists of a 12-core CPU and 32 GB of RAM running Linux, with no dedicated GPU.
The first tool I will use is Ollama, a CLI tool for interacting with an LLM. It also exposes the LLM over HTTP, so it can be accessed through curl, Python or JavaScript. To install Ollama, go to https://ollama.com/download and select your operating system. After the install is complete, run the command "ollama run qwen3:8b-q4_K_M", which will automatically download the model and start it so that you can ask questions. That's it: you now have a fully functional LLM running locally without sending any data to a datacenter. The picture below shows that qwen3 is a thinking model, so it reasons about the steps to take before providing an answer.
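Because Ollama also listens on HTTP (by default on port 11434), you can query the same model programmatically instead of through the CLI. Below is a minimal sketch in Python using the requests library against Ollama's /api/generate endpoint; the model name and prompt are simply the example from above and can be anything you have pulled.

import requests

# Minimal sketch: ask the locally running Ollama server a question.
# Assumes Ollama is running on its default port (11434) and that the
# qwen3:8b-q4_K_M model from the example above has already been pulled.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3:8b-q4_K_M",
        "prompt": "Summarize the benefits of running LLMs locally in two sentences.",
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=300,
)
print(response.json()["response"])

Nothing in this request leaves your machine: the URL points at localhost, so data would only travel elsewhere if you deliberately swapped in a remote endpoint.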
Ollama is great for quickly launching an LLM and for building an LLM agent through Python or JavaScript code; you can even combine it with the popular LangChain/LangGraph packages in Python. Ollama also supports creating embeddings (for documents), but this requires some programming, as sketched below. There are other tools that provide a user interface, allowing you to create an experience similar to ChatGPT in the browser.
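As a sketch of what that programming looks like – assuming the langchain-ollama integration package is installed (pip install langchain-ollama) and that an embedding model such as nomic-embed-text has been pulled with Ollama first – chatting and embedding locally only takes a few lines:

from langchain_ollama import ChatOllama, OllamaEmbeddings

# Chat with the locally running model through LangChain.
chat = ChatOllama(model="qwen3:8b-q4_K_M")
answer = chat.invoke("Give me three ideas for summarizing internal reports.")
print(answer.content)

# Create embeddings locally, e.g. as a building block for document search.
# "nomic-embed-text" is an example model: first run "ollama pull nomic-embed-text".
embedder = OllamaEmbeddings(model="nomic-embed-text")
vector = embedder.embed_query("Quarterly revenue grew by 12 percent.")
print(len(vector))  # dimensionality of the embedding vector

The same chat model can then be plugged into LangChain/LangGraph agents or paired with a local vector store, still without any external API key.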
GPT4ALL
GPT4ALL provides a user interface where you can easily install models, create custom databases with information the LLM can access, and chat in an interface that nicely shows the conversation. Do note that it collects anonymous usage data by default, but you can turn this off in the settings or run it completely offline. What makes it great is that you can easily upload local documents and then ask the LLM questions about them.
Simply download GPT4ALL from https://www.nomic.ai/gpt4all. Once it is running, go to Chats or Models and install a model; it will show you how much RAM your system needs for the model to run. This time I chose DeepSeek-R1-Distill-Qwen-14B. I also went to the LocalDocs tab and created my own document store, selecting a folder containing a single 115-page PDF of the basic ruleset for Dungeons and Dragons. This creates embeddings, i.e. it translates the words in the documents into vectors, so that the LLM can reference this information in the chat. Note that, depending on the amount and size of the documents and on the system you are running on, embedding can take quite a while. By default it will only embed files with the following extensions: .docx, .pdf, .txt, .md, .rst. You can add more extensions, but GPT4ALL cannot guarantee that they will work as intended.
Now go to Chats, select the model you downloaded and, in the top right, select the LocalDocs collection you just created. Simply ask a question about something in the documents and the model will reference them and provide you with an answer. This can be a bit slow, so be patient. I asked DeepSeek: “What does the ruleset say about Duergar?” Its answer, shown in the image below, matches the description of the Duergar in the ruleset.
Why Should You Use a Local LLM?
Data protection
As mentioned in the introduction, the power of a local LLM is that all the questions you ask and all the documents you upload stay within your local environment. No data is sent to a data center, and the model itself does not store any data you put into it during inference. This means you have full control over the data, allowing you to safely ask the LLM questions about proprietary source code or confidential documents.
This is also great for use cases where a system has to be air-gapped, i.e. disconnected from the internet. Once the models are downloaded, both Ollama and GPT4ALL keep all data within your local network and do not rely on any internet connection.
Cost savings
By running the model locally you avoid paying for APIs, which can become expensive with high volumes and constant querying. You do have the initial investment and of course the running costs (electricity etc.), but it no longer matters whether you generate 10 tokens or 10 million tokens, since you no longer pay per token. By making this switch you go from variable, unpredictable API costs to a fixed investment. An additional benefit is that paid LLM APIs often impose rate limits on the number of requests you are allowed to make, which can lead to downtime or to additional costs for a higher rate limit. When running locally there is no rate limit on the API; the only limiting factor is your hardware infrastructure.
Both Ollama and GPT4ALL can expose the models through their own API, which is very useful when you already have applications that query LLMs through an API, reducing the costs those applications would otherwise incur by calling paid LLM APIs.
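Ollama, for instance, also exposes an OpenAI-compatible endpoint, so an application that already talks to a hosted LLM through the openai client library can often be pointed at your local server by changing little more than the base URL. A minimal sketch, assuming the openai Python package and the qwen3 model from earlier:

from openai import OpenAI

# Point an existing OpenAI-style integration at the local Ollama server.
# The api_key is required by the client but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

completion = client.chat.completions.create(
    model="qwen3:8b-q4_K_M",
    messages=[{"role": "user", "content": "Draft a short status update for the security team."}],
)
print(completion.choices[0].message.content)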
Latency
With GPUs backing the locally hosted LLM, latency can be greatly reduced compared to making API calls to a remote data center: you talk to a local system dedicated to you, which eliminates the longer network path and the shared nature of hosted services, increasing the responsiveness of the model.
Trade-offs
Of course there are also disadvantages to hosting the LLM yourself. You will need to maintain the LLM model, check for updates and possibly perform technical troubleshooting; this requires time and effort and is therefore part of the total cost of ownership (TCO) of the solution.
Furthermore, the hardware costs (CAPEX) can be a large upfront investment instead of a pay-as-you-go model (OPEX).
Ollama requires more setup but gives you a lot of flexibility. GPT4ALL provides a more complete package with a UI and lets you effortlessly upload documents, but with less flexibility. If, for example, you want to give your employees a ChatGPT-like experience, but locally, GPT4ALL is a good fit. If you have applications that rely on a generic interface to the LLM, Ollama is more appropriate, since it gives you a generalized API that most tools can interact with.
Start Your Local LLM Journey Today
The shift toward local AI represents a fundamental change in how we interact with technology – moving from a model of ‘data for services’ to one of true digital sovereignty. As open-source models continue to close the gap with their closed-source counterparts, the argument for hosting your own intelligence becomes ever more compelling. By starting your local LLM journey today, you aren’t just protecting your data; you are building a private foundation for the future of your digital workflow. If your organization is evaluating how to adopt AI securely and responsibly, we can help you assess your options and design a setup that balances performance, compliance, and long-term cost.