🚀 Local LLM Scaling Guide¶
From Ollama beginner to maximum performance expert
This guide covers the full spectrum of local LLM deployment options, from the simplest setup to bleeding-edge performance optimization.
Inference Backend Tiers¶
Tier 1: Ollama (Current - Easy)¶
What you're using now. Perfect for getting started.
| Pros | Cons |
|---|---|
| ✅ One-command install | ❌ Less control over parameters |
| ✅ Automatic GPU detection | ❌ Some performance overhead |
| ✅ Great model library | ❌ Limited quantization options |
| ✅ Background service | ❌ No multi-GPU support |
| ✅ OpenAI-compatible API | |
Best for: Quick setup, casual use, experimentation
Performance: ⭐⭐⭐ (Good)
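Since Ollama already exposes an OpenAI-compatible endpoint (on port 11434 by default), you can script against it now and keep the same client code as you move up the tiers. A minimal sketch using the `openai` Python package; the model name is just an example, use whatever you have pulled:

```python
# Talk to Ollama through its OpenAI-compatible endpoint (default port 11434).
# Assumes `pip install openai` and that a model such as "llama3.1" has been pulled.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible route
    api_key="ollama",                      # any non-empty string works locally
)

response = client.chat.completions.create(
    model="llama3.1",  # substitute whichever model you have pulled
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```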
Tier 2: LM Studio (Intermediate)¶
GUI-based model management with more control than Ollama.
Download: lmstudio.ai
Setup Steps¶
- Download and install LM Studio
- Browse models in the built-in catalog
- Download models directly (GGUF format)
- Start local server for API access
Features¶
- Visual model browser and downloader
- Real-time performance metrics
- Easy parameter tweaking
- OpenAI-compatible API server
- Multiple model sessions
API Usage¶
```powershell
# Start the server in LM Studio (default port 1234), then call it like any OpenAI-style API
$body = @{
    model       = "local-model"
    messages    = @(
        @{ role = "user"; content = "Hello!" }
    )
    temperature = 0.7
} | ConvertTo-Json -Depth 5

Invoke-RestMethod -Uri "http://localhost:1234/v1/chat/completions" -Method POST -Body $body -ContentType "application/json"
```
| Pros | Cons |
|---|---|
| ✅ User-friendly GUI | ❌ Slightly slower than CLI tools |
| ✅ Easy model management | ❌ Resource overhead from GUI |
| ✅ Good for experimentation | ❌ Less scriptable |
| ✅ Built-in chat interface | |
Best for: Users who prefer GUI, experimentation, model comparison
Performance: ⭐⭐⭐ (Good)
Tier 3: llama.cpp / koboldcpp (Advanced)¶
Maximum control and excellent performance. The foundation most other tools build on.
llama.cpp Setup (Windows)¶
Option A: Pre-built Binaries

- Download the latest release from [llama.cpp releases](https://github.com/ggerganov/llama.cpp/releases)
- Get the `cudart-llama-*-win-x64.zip` for CUDA support
- Extract and add the folder to your PATH
Option B: Build from Source (Maximum Performance)
```powershell
# Install prerequisites
winget install cmake
winget install ninja

# Clone repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with CUDA support for RTX 3090
cmake -B build -G Ninja -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release
```
RTX 3090 Optimization
Use -DCMAKE_CUDA_ARCHITECTURES=86 for Ampere GPUs (RTX 30xx series).
For RTX 40xx, use 89.
Running Models¶
```powershell
# Download a GGUF model from HuggingFace
# Example: TheBloke/Mixtral-8x7B-v0.1-GGUF

# Run with optimal settings for RTX 3090
./build/bin/llama-server `
    -m "path/to/model.gguf" `
    -c 8192 `
    -ngl 999 `
    --host 0.0.0.0 `
    --port 8080
```
Recommended Flags for RTX 3090¶
```powershell
# -c      : context length
# -ngl    : number of layers to offload to GPU (999 = all)
# -t      : CPU threads (for any layers not offloaded)
# --mlock : lock the model in RAM to avoid paging
# -b      : batch size
./llama-server `
    -m model.gguf `
    -c 8192 `
    -ngl 999 `
    -t 8 `
    --mlock `
    -b 512 `
    --host 0.0.0.0 `
    --port 8080
```
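Once the server is up, you can hit it from any HTTP client. A quick smoke test in Python against llama-server's built-in completion endpoint (it also serves an OpenAI-style `/v1/chat/completions` route), assuming the host/port flags above:

```python
# Quick smoke test against llama-server started with the flags above.
# Assumes `pip install requests`.
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Explain KV caching in one sentence.",
        "n_predict": 128,      # max tokens to generate
        "temperature": 0.7,
    },
    timeout=120,
)
print(resp.json()["content"])
```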
koboldcpp Alternative¶
KoboldCpp is a llama.cpp fork with extra features:
```powershell
# Download from GitHub releases
# Run with GUI
koboldcpp.exe --model model.gguf --gpulayers 999 --contextsize 8192
```
Features:

- Built-in web UI
- Story mode and adventure mode
- Easy model switching
- Streaming support
| Pros | Cons |
|---|---|
| ✅ Maximum performance | ❌ More setup required |
| ✅ Full quantization control | ❌ Manual model downloads |
| ✅ Active development | ❌ CLI knowledge needed |
| ✅ CUDA optimization | ❌ No model catalog |
Best for: Performance enthusiasts, production deployments
Performance: ⭐⭐⭐⭐ (Excellent)
Tier 4: vLLM / Text Generation Inference (Expert)¶
Production-grade serving with advanced features.
vLLM (Recommended for High Throughput)¶
```bash
# Docker setup (WSL2 required for Windows)
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tensor-parallel-size 1
```
Key Features:

- Continuous batching - handles multiple requests efficiently
- PagedAttention - optimized memory management
- Tensor parallelism - multi-GPU support
- OpenAI-compatible API
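Because the container above serves the standard OpenAI REST surface on port 8000, existing OpenAI client code just needs a different `base_url`. A minimal sketch (the `model` field must match whatever `--model` you launched with):

```python
# Query the vLLM OpenAI-compatible server started above (port 8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # must match the served --model
    messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```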
Text Generation Inference (TGI)¶
HuggingFace's production inference server:
```bash
docker run --gpus all --shm-size 1g \
    -p 8080:80 \
    -v ~/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --quantize gptq
```
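TGI's native REST API is a simple `generate` route (recent versions also expose an OpenAI-style Messages API). A quick check from Python, assuming the port mapping above:

```python
# Call the TGI container started above (mapped to host port 8080).
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is continuous batching?",
        "parameters": {"max_new_tokens": 128, "temperature": 0.7},
    },
    timeout=120,
)
print(resp.json()["generated_text"])
```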
| Pros | Cons |
|---|---|
| ✅ Production-ready | ❌ Complex setup |
| ✅ Highest throughput | ❌ Docker/Linux required |
| ✅ Multi-user support | ❌ Higher resource usage |
| ✅ Continuous batching | ❌ Overkill for single user |
Best for: Multiple users, API serving, production deployments
Performance: ⭐⭐⭐⭐⭐ (Maximum throughput)
Tier 5: ExLlamaV2 (Maximum Single-User Performance)¶
The fastest inference for consumer GPUs. Period.
Installation¶
```bash
# Clone and install
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install .

# For Windows with CUDA
pip install exllamav2 --extra-index-url https://download.pytorch.org/whl/cu121
```
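ExLlamaV2 builds its CUDA extension against your installed PyTorch, so it is worth confirming the import works and the GPU is visible before going further. A quick check (assumes a CUDA-enabled PyTorch is already installed):

```python
# Sanity check: the package imports and PyTorch can see the RTX 3090.
import torch
import exllamav2  # fails here if the package is not installed correctly

print("exllamav2 imported OK")
print("CUDA available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
```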
Running the Server¶
```python
# server.py
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator

config = ExLlamaV2Config("path/to/model")
config.max_seq_len = 8192

model = ExLlamaV2(config)
model.load()

cache = ExLlamaV2Cache(model)
tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)

# Use the generator...
```
Using TabbyAPI (Recommended)¶
TabbyAPI wraps ExLlamaV2 with an OpenAI-compatible server:
```bash
git clone https://github.com/theroyallab/tabbyAPI
cd tabbyAPI
pip install -r requirements.txt

# Edit config.yml with your model path
python main.py
```
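TabbyAPI then behaves like any other OpenAI-compatible endpoint, so the same client pattern applies. A sketch assuming the defaults from the sample config; both the port (5000) and the API-key handling are assumptions here, so check your `config.yml` and the generated token file:

```python
# Query TabbyAPI/ExLlamaV2 through its OpenAI-compatible API.
# Port and API key are assumptions - check config.yml / your generated API tokens.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="your-tabby-api-key")

response = client.chat.completions.create(
    model="local-model",  # TabbyAPI serves whichever model is loaded
    messages=[{"role": "user", "content": "Hello from EXL2!"}],
)
print(response.choices[0].message.content)
```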
EXL2 Quantization¶
ExLlamaV2 uses its own EXL2 format for best performance:
| Bits per Weight | Quality Loss | Speed | VRAM Savings |
|---|---|---|---|
| 8.0 bpw | Minimal | Fast | 50% |
| 6.0 bpw | Very Low | Faster | 62% |
| 4.0 bpw | Low | Fastest | 75% |
| 3.0 bpw | Moderate | Maximum | 81% |
Find EXL2 models on HuggingFace: search for "EXL2" (TheBloke's catalog is mostly GGUF/GPTQ/AWQ; uploaders such as LoneStriker and bartowski maintain large EXL2 collections).
| Pros | Cons |
|---|---|
| ✅ Fastest consumer inference | ❌ Complex setup |
| ✅ Excellent quantization | ❌ Fewer models available |
| ✅ Low VRAM usage | ❌ Python knowledge required |
| ✅ Flash Attention built-in | ❌ Less documentation |
Best for: Maximum performance from consumer GPUs
Performance: ⭐⭐⭐⭐⭐ (Fastest)
Hardware Scaling Guide¶
Current Setup: RTX 3090 (24GB)¶
Your hardware can handle:
| Model Size | Quantization | Fits in VRAM? | Performance |
|---|---|---|---|
| 7-8B | FP16 | ✅ Easily | Excellent |
| 13B | Q8_0 | ✅ Yes | Very Good |
| 34B | Q4_K_M | ✅ Yes | Good |
| 70B | Q4_K_M | ⚠️ Tight (needs offload) | Usable |
| Mixtral 8x7B | Q4_K_M | ✅ Yes | Very Good |
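If you want to sanity-check a model/quantization combo that isn't in the table, a rough rule of thumb is weights ≈ parameters × bits-per-weight ÷ 8, plus an overhead margin for KV cache, activations, and CUDA buffers. A back-of-the-envelope helper; the 1.2 overhead factor and the ~4.5 bpw for Q4_K_M are assumptions, not measurements:

```python
# Back-of-the-envelope VRAM estimate: weights + ~20% overhead for cache/buffers.
def estimate_vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bpw ~= 1 GB
    return weight_gb * overhead

for name, params, bpw in [
    ("Llama 3.1 8B @ FP16", 8, 16),
    ("34B @ Q4_K_M (~4.5 bpw)", 34, 4.5),
    ("70B @ Q4_K_M (~4.5 bpw)", 70, 4.5),
]:
    print(f"{name}: ~{estimate_vram_gb(params, bpw):.1f} GB")
```

Anything that lands near or above 24 GB is where offloading (or a smaller quant) comes in.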
Upgrade Paths¶
RTX 4090 (24GB) - Same Capacity, Faster¶
- 50-80% faster than 3090
- Same model capacity
- Better power efficiency
- ~$1,600-2,000
Dual RTX 3090 (NVLink) - 48GB VRAM¶
- Run 70B models at Q4/Q5 fully in VRAM
- Requires NVLink bridge
- Only works with specific motherboards
- Tensor parallelism support needed
- ~$800-1,000 for second card + bridge
A100 40GB/80GB - Professional Grade¶
- Enterprise features
- ECC memory
- Better for production
- 80GB can run 70B at 8-bit fully in VRAM (FP16 needs two cards)
- ~$8,000-15,000
Multi-GPU Tensor Parallelism¶
For models too large for single GPU:
```bash
# vLLM with 2 GPUs
docker run --gpus '"device=0,1"' \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2
```
Model Recommendations by Hardware¶
| VRAM | Recommended Models | Notes |
|---|---|---|
| 8GB | Llama 3.1 8B Q4, Mistral 7B Q4, Phi-3 Mini | Stick to 7B models |
| 12GB | Llama 3.1 8B Q8, Mixtral Q4 (partial offload), CodeLlama 13B Q4 | Sweet spot for 7-13B |
| 16GB | Mixtral 8x7B Q4 (partial offload), Llama 3.1 8B Q8 (room for long context), Qwen 14B Q6 | Good for MoE models |
| 24GB | Mixtral 8x7B Q4, Llama 3.1 70B Q4 (with offload), Command-R Q4 | Your tier - lots of options |
| 48GB | Llama 3.1 70B Q4/Q5 (fully in VRAM), Qwen 72B Q4, DeepSeek 67B Q4 | High quality large models |
| 80GB+ | Full precision 70B, Mixtral 8x22B, Claude-scale models | No compromises |
Model Quality Tiers (Subjective)¶
- GPT-4 Class: Claude 3, GPT-4 (API only)
- Near GPT-4: Llama 3.1 70B, Qwen 72B, DeepSeek V2
- GPT-3.5 Class: Mixtral 8x7B, Llama 3.1 8B
- Good Enough: Mistral 7B, Phi-3, Qwen 7B
Performance Tuning¶
Context Length vs Speed¶
| Context | Tokens/sec (8B) | Tokens/sec (70B Q4) | VRAM Delta |
|---|---|---|---|
| 2048 | 80+ | 20+ | Baseline |
| 4096 | 70+ | 18+ | +1-2 GB |
| 8192 | 60+ | 15+ | +3-4 GB |
| 16384 | 45+ | 10+ | +6-8 GB |
| 32768 | 30+ | 5+ | +12-15 GB |
Context vs Quality Tradeoff
Very long contexts can degrade model attention. For most tasks, 4-8K is plenty.
Batch Size Optimization¶
```bash
# llama.cpp batch size tuning
./llama-server -m model.gguf -b 512    # Default, good for single user
./llama-server -m model.gguf -b 1024   # Higher throughput, more VRAM
./llama-server -m model.gguf -b 256    # Lower VRAM, slightly slower
```
KV Cache Quantization¶
Reduce VRAM usage for long contexts:
```bash
# llama.cpp with quantized KV cache
# (quantizing the V cache typically also requires flash attention, hence -fa)
./llama-server -m model.gguf -fa --cache-type-k q8_0 --cache-type-v q8_0
```
| Cache Type | VRAM Savings | Quality Impact |
|---|---|---|
| FP16 (default) | None | None |
| Q8_0 | ~25% | Minimal |
| Q4_0 | ~50% | Noticeable on long contexts |
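The savings follow directly from how the KV cache scales: it stores one key and one value vector per layer per token, so its size is 2 × layers × kv_heads × head_dim × context × bytes-per-element, and quantizing the cache shrinks the bytes-per-element term. A quick sketch using Llama-3.1-8B-like dimensions (32 layers, 8 KV heads, head dim 128, taken from the published architecture); actual numbers vary a lot by model, since GQA models keep far smaller caches than older full-head models:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * context * bytes_per_element
def kv_cache_gb(layers, kv_heads, head_dim, context, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1024**3

# Llama-3.1-8B-style dimensions (assumed): 32 layers, 8 KV heads, head_dim 128
for ctx in (4096, 8192, 16384, 32768):
    print(f"context {ctx}: ~{kv_cache_gb(32, 8, 128, ctx):.1f} GB of FP16 KV cache")
```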
Flash Attention Setup¶
Flash Attention 2 significantly speeds up inference:
```bash
# For PyTorch-based tools
pip install flash-attn --no-build-isolation

# llama.cpp CUDA builds ship with flash attention compiled in;
# enable it at runtime with -fa (recent builds may turn it on automatically)
```
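The flash-attn wheel compiles against your exact PyTorch/CUDA combination, so after installing it is worth confirming it actually loads. A minimal check:

```python
# Confirm flash-attn built against the local PyTorch/CUDA install.
import torch
import flash_attn

print("torch CUDA:", torch.version.cuda)
print("flash-attn:", flash_attn.__version__)
print("GPU capability:", torch.cuda.get_device_capability(0))  # FA2 needs >= (8, 0); RTX 3090 is (8, 6)
```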
CUDA Graphs¶
CUDA graphs reduce kernel launch overhead by capturing a whole sequence of GPU kernels once and replaying it for subsequent tokens. Recent llama.cpp CUDA builds and vLLM use them automatically where supported, so there is usually nothing extra to configure.
Model Fine-Tuning Path¶
When to Fine-Tune¶
✅ Fine-tune when:

- You have domain-specific data
- Base models don't understand your terminology
- You need consistent output formatting
- You want to reduce prompting overhead
❌ Don't fine-tune when:

- Base models work well with good prompts
- You don't have quality training data
- You need general capability
- Training resources are limited
QLoRA for Consumer GPUs¶
Train on your RTX 3090 with QLoRA:
```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
```
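A useful sanity check before training: PEFT can report how small the trainable footprint actually is, which is why this fits on a single 24GB card.

```python
# Only the LoRA adapter weights are trainable; the 4-bit base model stays frozen.
model.print_trainable_parameters()
# Typically reports well under 1% of total parameters as trainable.
```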
Unsloth for 2x Faster Training¶
Unsloth patches models for faster fine-tuning:
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
)

# 2-3x faster training!
```
What Unsloth Enables on RTX 3090:

- Train Llama 3.1 8B with 16K context
- Train Mixtral 8x7B with QLoRA
- 2x faster than standard training
- 60% less VRAM usage
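After training, the LoRA adapter is all you need to keep: it is tiny compared with the base weights (typically tens of MB) and can be re-applied or merged later. A minimal sketch using the standard PEFT-style save; the directory name is just an example:

```python
# Save only the LoRA adapter plus the tokenizer, not the full base model.
model.save_pretrained("outputs/llama3-lora-adapter")
tokenizer.save_pretrained("outputs/llama3-lora-adapter")
```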
Quick Reference: Tool Comparison¶
| Tool | Ease | Speed | Control | Multi-GPU | Best For |
|---|---|---|---|---|---|
| Ollama | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ❌ | Beginners |
| LM Studio | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ❌ | GUI lovers |
| llama.cpp | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Limited | Power users |
| vLLM | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ✅ | Production |
| ExLlamaV2 | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Limited | Max perf |
Recommended Progression¶
```text
Week 1-2: Ollama
├── Learn model basics
├── Test different models
└── Understand limitations

Week 3-4: LM Studio or llama.cpp
├── More parameter control
├── Better performance
└── GGUF model ecosystem

Month 2+: ExLlamaV2 or vLLM
├── Maximum performance
├── Production deployment
└── Custom integrations
```
Resources¶
- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [ExLlamaV2 GitHub](https://github.com/turboderp/exllamav2)
- [vLLM Documentation](https://docs.vllm.ai)
- [HuggingFace Models](https://huggingface.co/models)
- [TheBloke's Quantizations](https://huggingface.co/TheBloke)
- [LocalLLaMA Subreddit](https://www.reddit.com/r/LocalLLaMA/)
- [Unsloth GitHub](https://github.com/unslothai/unsloth)
Go forth and infer locally! 🦙