Running Local LLMs: Ollama, LM Studio, and Beyond

You don't need an API key to use an LLM. Modern open-weight models run on consumer hardware, giving you AI capabilities with zero cloud dependency, no usage costs, and complete data privacy.

Why Run Models Locally?

Reason | Details
Privacy | Your prompts and data never leave your machine
Cost | No per-token billing; pay once for hardware
Offline access | Works without an internet connection
No rate limits | Query as fast as your GPU can handle
Customization | Fine-tune, quantize, or modify models freely
Latency | No network round-trip for local applications

The trade-off: local models are typically smaller and less capable than frontier cloud models (GPT-4o, Claude Opus, Gemini Ultra). But for many tasks — code completion, summarization, data extraction, chat — they are more than sufficient.

The Tools

Ollama

The simplest way to run LLMs locally. One command to install, one command to run a model.

# Install (Linux; on macOS and Windows, use the installer from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh

# Run a model
ollama run llama3.1

# Run with a specific size
ollama run llama3.1:70b

# List downloaded models
ollama list

# Serve as an API (OpenAI-compatible)
ollama serve

Ollama listens on http://localhost:11434 and exposes its OpenAI-compatible endpoints under the /v1 path, so any tool that works with the OpenAI API can point to http://localhost:11434/v1 instead.
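
Besides the OpenAI-compatible routes, Ollama also has its own native REST API under /api. The sketch below uses only the Python standard library to call the native chat endpoint. The helper names build_chat_request and ask are made up for this example, and it assumes a server on the default port with llama3.1 already pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default local address


def build_chat_request(prompt: str, model: str = "llama3.1") -> urllib.request.Request:
    """Build a non-streaming request for Ollama's native /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "llama3.1") -> str:
    """Send the prompt and return the reply text (needs a running Ollama server)."""
    with urllib.request.urlopen(build_chat_request(prompt, model), timeout=120) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

With the server running, ask("Explain TCP/IP in simple terms") should return the model's reply as a string. Building the request separately from sending it makes the payload easy to inspect without a server.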

Best for: Developers who want a CLI-first, API-first experience. Great for integrating local models into applications, scripts, and agent frameworks.

LM Studio

A desktop application with a GUI for discovering, downloading, and running models. Built on llama.cpp under the hood.

Key features:

  • Model discovery — Browse and download from Hugging Face directly
  • Chat interface — Test models interactively
  • Local API server — OpenAI-compatible endpoint
  • Parameter tuning — Adjust temperature, top-p, context length through the UI
  • Multi-model — Load and switch between models easily

Best for: Users who prefer a visual interface. Great for exploring and comparing different models.

llama.cpp

The foundational C++ inference engine that powers both Ollama and LM Studio. Use it directly for maximum control:

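# Build from source with CMake (recent versions no longer support the Makefile build)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Run a model
./build/bin/llama-cli -m models/llama-3.1-8b-q4_K_M.gguf -p "Explain TCP/IP in simple terms"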

Best for: Advanced users who need fine-grained control over inference parameters, custom builds, or integration into C/C++ applications.

vLLM

A high-throughput inference engine optimized for serving models to multiple users. Uses PagedAttention for efficient GPU memory management.

pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct

Best for: Serving models to multiple users or applications. Not ideal for single-user desktop use.

Hardware Requirements

How Much Do You Need?

The key constraint is RAM (or VRAM for GPU inference). A rough rule of thumb for quantized models:

Model Size | Q4 Quantized Size | RAM/VRAM Needed | Example Hardware
1-3B | ~2 GB | 4 GB+ | Any modern laptop
7-8B | ~4-5 GB | 8 GB+ | M1/M2 MacBook, RTX 3060
13B | ~8 GB | 12 GB+ | M2 Pro, RTX 3080 12 GB
34B | ~20 GB | 24 GB+ | M2 Max, RTX 4090
70B | ~40 GB | 48 GB+ | M2 Ultra, 2x RTX 4090
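
The rule of thumb behind these numbers is simple arithmetic: parameters times bits per weight, divided by 8 to get bytes, plus some headroom for the context cache and runtime buffers. Here is a rough sketch. The 4.5 bits per weight and the 20% overhead are assumptions chosen for illustration, not measured values, and real usage grows with context length.

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: float,
                       overhead_fraction: float = 0.2) -> float:
    """Approximate memory needed to run a model, in GB (1 GB = 1e9 bytes).

    overhead_fraction is an assumed allowance for the KV cache and
    runtime buffers on top of the raw weights.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead_fraction) / 1e9


# Assumed ~4.5 effective bits per weight for a typical 4-bit quantization.
for size in (8, 13, 34, 70):
    print(f"{size}B: about {estimate_memory_gb(size, 4.5):.1f} GB")
```

Under these assumptions an 8B model comes out at about 5.4 GB and a 70B model at about 47 GB, which lines up with the 8 GB+ and 48 GB+ rows.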

Apple Silicon Advantage

Apple's M-series chips are excellent for local LLMs because they have unified memory — the GPU and CPU share the same RAM pool. A MacBook Pro with 36 GB unified memory can comfortably run 34B-parameter models.

GPU vs CPU Inference

Aspect | GPU | CPU
Speed | 30-100+ tokens/sec | 5-20 tokens/sec
Memory | Limited to VRAM | Can use all system RAM
Cost | GPU hardware is expensive | Works on any machine

For interactive chat, you want at least 10 tokens/second. For batch processing, speed matters less.
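
To check where your own machine lands, you can compute tokens per second from the timing fields that Ollama's native API includes in a completed response: eval_count (generated tokens) and eval_duration (generation time in nanoseconds). A minimal sketch follows. The function names are hypothetical and the sample response values are invented for illustration.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response's timing fields.

    eval_count is the number of generated tokens; eval_duration is the
    time spent generating them, in nanoseconds.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def seconds_for_reply(num_tokens: int, speed_tok_s: float) -> float:
    """How long a reply of num_tokens takes at a given generation speed."""
    return num_tokens / speed_tok_s


# Invented sample: 300 tokens generated in 20 seconds.
sample = {"eval_count": 300, "eval_duration": 20_000_000_000}
print(tokens_per_second(sample))
print(seconds_for_reply(500, 10))
```

At the 10 tokens/second threshold, a 500-token answer takes 50 seconds to finish, which is why slower setups feel sluggish in chat but are fine for batch jobs.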

Quantization — Making Models Fit

Full-precision models are huge. A 70B model at FP16 needs ~140 GB. Quantization reduces the precision of model weights to make them smaller and faster:

Format | Bits | Size Reduction | Quality Impact
FP16 | 16 | Baseline | None
Q8_0 | 8 | ~50% | Negligible
Q5_K_M | 5 | ~65% | Very minor
Q4_K_M | 4 | ~75% | Minor (sweet spot)
Q3_K_M | 3 | ~80% | Noticeable on complex tasks
Q2_K | 2 | ~85% | Significant degradation

Q4_K_M is the recommended default — best balance of size, speed, and quality.
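
To make "reducing precision" concrete, here is a simplified sketch of block quantization in the spirit of 8-bit formats: each block of weights is stored as small integers plus one shared scale. This is an illustration only, not the actual GGUF kernels; real formats pack the values into binary blocks, and the K-quant variants use more elaborate schemes. The function names are made up for this example.

```python
def quantize_block(weights: list[float]) -> tuple[float, list[int]]:
    """Quantize one block of floats to signed 8-bit integers plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # map the largest weight to +/-127
    return scale, [round(w / scale) for w in weights]


def dequantize_block(scale: float, quantized: list[int]) -> list[float]:
    """Recover approximate floats from the integers and the shared scale."""
    return [scale * q for q in quantized]


block = [0.5, -1.27, 0.01, 0.0]
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(q, max_error)
```

Each restored weight is off by at most half a scale step. Fewer bits mean bigger steps and larger errors, which is why quality drops as you move down the table.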

GGUF is the standard file format for quantized models, used by llama.cpp, Ollama, and LM Studio.
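
If you want to sanity-check a downloaded file, a GGUF file starts with a small fixed header: the magic bytes "GGUF", a version number, a tensor count, and a metadata entry count. The sketch below reads those fields with the standard library. It assumes the little-endian layout used by GGUF version 2 and later, and the function name is hypothetical.

```python
import struct

# Assumed layout (GGUF v2+): 4-byte magic, uint32 version,
# uint64 tensor count, uint64 metadata key/value count, little-endian.
HEADER_FORMAT = "<4sIQQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def read_gguf_header(path: str) -> dict:
    """Read the fixed-size header at the start of a GGUF file."""
    with open(path, "rb") as f:
        raw = f.read(HEADER_SIZE)
    if len(raw) < HEADER_SIZE:
        raise ValueError("file is too short to be a GGUF file")
    magic, version, tensor_count, kv_count = struct.unpack(HEADER_FORMAT, raw)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic was {magic!r})")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

Calling read_gguf_header on a model file returns a small dict. A ValueError usually means the download is truncated or the file is in a different format.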

Popular Open-Weight Models

Model | Parameters | Strengths
Llama 3.1 | 8B, 70B, 405B | Best all-rounder from Meta
Mistral / Mixtral | 7B, 8x7B | Fast, strong coding and reasoning
Phi-3 / Phi-4 | 3.8B, 14B | Surprisingly capable for their size
Qwen 2.5 | 7B, 72B | Strong multilingual, good at code
DeepSeek Coder | 6.7B, 33B | Purpose-built for code
Gemma 2 | 9B, 27B | Google's open model, well-rounded
CodeLlama | 7B, 34B | Code-specialized Llama variant

Connecting Local Models to Your Workflow

Use with VS Code

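Point an extension such as Continue.dev at your local Ollama endpoint for chat and code completion without cloud dependency. Recent GitHub Copilot Chat releases can also use Ollama models for chat, though support varies by version, so check the extension's model settings.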

Use with MCP Servers

Run a local model as the LLM behind an agent workflow and connect it to local MCP servers for tools and data. The result is a fully offline AI assistant.

Use in Applications

The OpenAI-compatible API means you can swap the base URL https://api.openai.com/v1 for http://localhost:11434/v1 in any application using the OpenAI SDK:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="unused",  # Ollama doesn't need a key
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Explain DNS in one paragraph"}],
)
print(response.choices[0].message.content)
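
The same pattern works for the other tools in this article, since LM Studio, vLLM, and llama.cpp's server also expose OpenAI-compatible endpoints. Here is a small sketch of switching between them. The ports are the usual defaults and are assumptions: check your own setup if you changed them. The helper name client_kwargs is made up for this example.

```python
# Assumed default local addresses; adjust if your setup uses other ports.
LOCAL_BACKENDS = {
    "ollama": "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",
    "vllm": "http://localhost:8000/v1",
    "llamacpp": "http://localhost:8080/v1",
}


def client_kwargs(backend: str) -> dict:
    """Return keyword arguments for OpenAI(...) that target a local backend."""
    if backend not in LOCAL_BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    # Local servers typically ignore the key, but the SDK wants a non-empty value.
    return {"base_url": LOCAL_BACKENDS[backend], "api_key": "unused"}
```

Usage is a one-line change: client = OpenAI(**client_kwargs("lmstudio")). Keep in mind that model names differ between backends, so the model argument usually has to change too.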

When to Use Local vs Cloud

Use Case | Local | Cloud
Sensitive/private data | Yes | No
Offline environments | Yes | No
High-volume batch processing | Yes (no cost per token) | Expensive
State-of-the-art reasoning | Limited | Yes
Quick prototyping | Yes | Yes
Production at scale | Consider vLLM | Easier

Conclusion

Running LLMs locally is practical, private, and increasingly capable. Ollama makes it dead simple to get started — install it, pull a model, and you're running AI on your own hardware in under five minutes. For most development tasks, an 8B quantized model on a decent laptop is surprisingly effective.

Start with ollama run llama3.1 and see how far it takes you before reaching for a cloud API.