Stop Paying for LLMs — Run Your Own Locally
If you’re tired of hitting daily limits, waiting on slow or rate-limited responses, or paying monthly just to use an online model, there’s a better option. With Ollama, you can run powerful LLMs directly on your own machine.
The only catch: you need a decent amount of RAM and ideally a GPU. But once it’s set up, you get full control, no limits, and everything works offline.
In this article, we’ll walk through how to set up Ollama, the tool that lets you run models locally. We’ll look at its useful features, the commands you need to know, how to pick the right model for each task, and how to connect it with other tools so you can get your work done entirely on your machine.
System Requirements
Before installing Ollama, it’s good to check whether your machine has enough resources to run local models smoothly. The better the hardware, the faster and more responsive the models will feel.
Hardware Requirements
CPU
- Minimum: 64-bit processor with 2+ cores
- Recommended: 4+ cores for smoother performance
Memory (RAM)
- Minimum: 8GB RAM
- Recommended: 16GB+ RAM, especially for models above 7B
Note: Larger models require more RAM to run efficiently
Storage
- Ollama installation: Around 4GB for the app and dependencies
- Model storage: Depends on the model size
- Small models (1–7B): ~1–8GB each
- Medium models (13–70B): ~8–80GB each
- Large models (70B+): 80GB+
Setting up Ollama
Setting up Ollama is pretty straightforward. On Linux, you can install it by running the one-line installer.
For Windows and macOS, you can download and install the official packages from the Ollama website.
Check out the download page for more details.
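For reference, the Linux install is a single command that fetches the official install script from the download page:
# Install Ollama on Linux
curl -fsSL https://ollama.com/install.sh | sh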
Once installed, Ollama automatically starts a local server on port 11434. With that in place, the base setup is done.
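A quick way to confirm the server is up is to hit the local API on its default port:
# Should respond with "Ollama is running"
curl http://localhost:11434
# List the models already downloaded to your machine
curl http://localhost:11434/api/tags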
Your next step is to pull a model that suits your task and start using it locally.
The Best Local Model for Your Task
You can download the model that fits your task using the command:
ollama pull <model-name>
Different models serve different purposes. Here’s a simple comparison of popular local models you can pull with Ollama.
1. Gemma — 2B / 7B / 9B
- Pull size:
  - 2B (gemma:2b) → ~1.5 GB
  - 7B (gemma:7b) → ~4.4 GB
  - 9B (gemma2:9b) → ~5.2 GB
- Best for: Quick answers, lightweight tasks, everyday writing
- Why: Small, fast, and works well even without a strong GPU.
2. Qwen — 1.8B → 72B
- Pull size:
  - 1.8B (qwen:1.8b) → ~1.1 GB
  - 7B (qwen:7b) → ~4.5 GB
  - 14B (qwen:14b) → ~8.8 GB
  - 32B (qwen:32b) → ~18 GB
  - 72B (qwen:72b) → ~41 GB
- Best for: Long-form writing, complex tasks, multilingual work
- Why: Strong at summarizing, reasoning, and long context.
3. Phi-3 — 3.8B / 14B
- Pull size:
  - Mini (3.8B) (phi3:mini) → ~2.2 GB
  - Medium (14B) (phi3:medium) → ~7.8 GB
- Best for: Laptops without GPUs, small projects, quick coding help
- Why: Lightweight and efficient, good for low-resource devices.
4. Mistral & Mixtral
- Pull size:
  - Mistral 7B (mistral:7b) → ~4.1 GB
  - Mixtral 8x7B (mixtral:8x7b) → ~26 GB
- Best for:
- Mistral 7B: Structured output, reasoning, balanced performance
- Mixtral 8x7B: High-end performance, complex reasoning, handling nuanced tasks (requires a powerful GPU)
- Why:
- Mistral: Stable and well-rounded with a great quality–speed balance.
- Mixtral: A "Mixture of Experts" (MoE) model that is significantly more powerful, but also much larger.
5. Llama 3 — 8B / 70B
- Pull size:
  - 8B (llama3:8b) → ~4.7 GB
  - 70B (llama3:70b) → ~39–40 GB
- Best for: Coding, debugging, deeper reasoning tasks
- Why: Strong all-around performance, especially in coding and problem-solving.
6. Codestral — 22B & Qwen Coder — 1.5B → 32B
- Pull size:
  - Codestral 22B (codestral:22b) → ~12–13 GB
  - Qwen Coder 1.5B (qwen2.5-coder:1.5b) → ~986 MB
  - Qwen Coder 7B (qwen2.5-coder:7b) → ~4.5 GB
  - Qwen Coder 32B (qwen2.5-coder:32b) → ~20 GB
- Best for: Code generation, refactoring, explaining code
- Why: Built specifically for software development workflows.
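For example, pulling and chatting with the Llama 3 8B model from the list above looks like this:
# Download the model (~4.7 GB)
ollama pull llama3:8b
# Start an interactive chat session
ollama run llama3:8b
# Or send a one-off prompt without entering the chat
ollama run llama3:8b "Explain the difference between a process and a thread"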
Customize Ollama for Your Use Case
Once you get the basics running, you can shape Ollama to work exactly the way you want.
You can create custom models, tune performance, adjust memory usage, or even change how Ollama runs in the background.
Here are the most useful ways to personalize your setup.
Build Your Own Custom Model
If you want a model that behaves in a specific way — for example, a coding assistant that always follows best practices — you can create one using a Modelfile.
This lets you:
- Choose a base model
- Adjust generation settings
- Add a system message
- Define a custom template
Here’s a simple example of a coding-focused custom Modelfile:
FROM codellama:7b
PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
SYSTEM """
You are a senior engineer specializing in clean, efficient code.
Always include error handling and proper comments, and follow best practices.
Explain complex concepts in simple terms.
"""
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>
"""
Once the Modelfile is ready, create the model from it:
ollama create coding-expert -f Modelfile.coding
This gives you a model preconfigured for cleaner and more consistent code output.
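You can then run it like any other local model:
# Chat with the custom coding model
ollama run coding-expert "Write a Python function that retries a failed HTTP request with exponential backoff"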
Speed Up Models with Quantization
Quantization reduces model size and speeds up inference without a big drop in quality.
It’s one of the easiest ways to boost performance.
Check available quantization levels:
ollama show llama2 --quantization
Create a quantized model:
ollama create llama2-q4 --quantize q4_0 -f Modelfile.llama2
Check model sizes:
ollama list | grep llama2
Tune the Ollama Server to Your Needs
You can also configure the Ollama local server that runs in the background:
# Enable detailed logging
OLLAMA_DEBUG=1 ollama serve
# Change host and port
OLLAMA_HOST=0.0.0.0:11435 ollama serve
# Use a custom models directory
OLLAMA_MODELS=/custom/path/models ollama serve
# Pick a specific GPU
CUDA_VISIBLE_DEVICES=0 ollama serve
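If you move the server to a different host or port, the CLI client reads the same OLLAMA_HOST variable, so point your commands at the new address the same way:
# Talk to a server running on a non-default port
OLLAMA_HOST=127.0.0.1:11435 ollama list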
Hidden Settings for Power Users
These environment variables aren’t obvious but give you deeper control:
# Allow more parallel requests
export OLLAMA_NUM_PARALLEL=4
# Change context length
export OLLAMA_CONTEXT_SIZE=8192
# Enable experimental features
export OLLAMA_EXPERIMENTAL=true
# Control GPU memory usage
export OLLAMA_GPU_MEMORY_FRACTION=0.8
# Adjust request timeout
export OLLAMA_REQUEST_TIMEOUT=300
These are great for heavy workloads or server-style setups.
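If you installed Ollama as a systemd service on Linux, the usual way to make these variables persistent is a service override (the exact unit name may differ on your setup):
# Open an override file for the Ollama service
sudo systemctl edit ollama.service
# In the editor, add your variables under [Service], for example:
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=4"
# Then reload and restart the service
sudo systemctl daemon-reload
sudo systemctl restart ollama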
Manage Memory and Performance
You can customize how much memory Ollama uses.
# Check active models and memory usage
ollama ps
# Preload a model for faster startup
ollama run llama2 --preload
# Limit memory usage
ollama run llama2 --memory-limit 4GB
# Remove unused models
ollama rm --force unused-model
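Another memory-related knob is the keep-alive window, which controls how long a model stays loaded after its last request:
# Keep models in memory for 30 minutes after the last request (default is 5m)
export OLLAMA_KEEP_ALIVE=30m
# Or unload immediately after each request to free RAM/VRAM
export OLLAMA_KEEP_ALIVE=0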
Get the Most Out of Your GPU
If you have an NVIDIA GPU, you can squeeze out more performance with a few settings:
# Monitor GPU usage
nvidia-smi -l 1
# Split GPU memory across models
export OLLAMA_GPU_SPLIT=true
# Enable tensor parallelism
export OLLAMA_TENSOR_PARALLEL=2
# CUDA debugging (optional)
export CUDA_LAUNCH_BLOCKING=1
Using Local LLMs as a Coding Agent
One of the best ways to use local LLMs in VS Code as a coding agent is to integrate them with GitHub Copilot. GitHub Copilot supports local LLMs directly within its chat/agent window.
To enable this, follow these steps:
- First, make sure Ollama is running.
- If it's not, run the ollama serve command in your terminal.
- Then, in VS Code, go to the GitHub Copilot extension settings.
- Navigate to "Manage Models" and enable the Ollama models available on your system.
Once that's done, you can interact with your local models freely in the chat window.
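Beyond Copilot, many editors and tools can also talk to Ollama through its OpenAI-compatible API. A minimal sketch, assuming you've already pulled llama3:8b:
# OpenAI-style chat completion against the local Ollama server
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:8b",
    "messages": [{"role": "user", "content": "Write a hello world program in Go"}]
  }'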
Get Everything Out of the Box
We discussed how to run Ollama and explored a few configurations you can tweak to get the best performance and speed up your daily tasks.
We introduced Ollama into our workflow to solve a bulk-processing LLM problem while working on freedevtool, a platform that provides over 125k free developer resources. To streamline the workflow, we are building a CLI tool called dprompts, which helps teams distribute large batches of LLM tasks across the local machines of developers and aggregate the results in a database. This approach removes the dependency on paid cloud LLMs and eliminates worries about cost, quotas, and rate limits.
You can learn more about dprompts and how to set it up by visiting the project’s GitHub repository and checking the README.

