🚀 Local LLM Scaling Guide¶
From Ollama beginner to maximum performance expert
This guide covers the full spectrum of local LLM deployment options, from the simplest setup to bleeding-edge performance optimization.
Inference Backend Tiers¶
Tier 1: Ollama (Current - Easy)¶
What you're using now. Perfect for getting started.
| Pros | Cons |
|---|---|
| ✅ One-command install | ❌ Less control over parameters |
| ✅ Automatic GPU detection | ❌ Some performance overhead |
| ✅ Great model library | ❌ Limited quantization options |
| ✅ Background service | ❌ No multi-GPU support |
| ✅ OpenAI-compatible API | |
Best for: Quick setup, casual use, experimentation
Performance: ⭐⭐⭐ (Good)
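Since Ollama already exposes an OpenAI-compatible endpoint (on port 11434 by default), you can script against it now and keep the same client code as you move up the tiers. A minimal sketch using the `openai` Python package; the model name is just an example, use whatever you have pulled:

```python
# Talk to Ollama through its OpenAI-compatible endpoint (default port 11434).
# Assumes `pip install openai` and that a model such as "llama3.1" has been pulled.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible route
    api_key="ollama",                      # any non-empty string works locally
)

response = client.chat.completions.create(
    model="llama3.1",  # substitute whichever model you have pulled
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```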
Tier 2: LM Studio (Intermediate)¶
GUI-based model management with more control than Ollama.
Download: lmstudio.ai
Setup Steps¶
- Download and install LM Studio
- Browse models in the built-in catalog
- Download models directly (GGUF format)
- Start local server for API access
Features¶
- Visual model browser and downloader
- Real-time performance metrics
- Easy parameter tweaking
- OpenAI-compatible API server
- Multiple model sessions
API Usage¶
```powershell
# Start the server in LM Studio (default port 1234), then call it like any OpenAI-style API
$body = @{
    model       = "local-model"
    messages    = @(
        @{ role = "user"; content = "Hello!" }
    )
    temperature = 0.7
} | ConvertTo-Json -Depth 5

Invoke-RestMethod -Uri "http://localhost:1234/v1/chat/completions" -Method POST -Body $body -ContentType "application/json"
```
| Pros | Cons |
|---|---|
| ✅ User-friendly GUI | ❌ Slightly slower than CLI tools |
| ✅ Easy model management | ❌ Resource overhead from GUI |
| ✅ Good for experimentation | ❌ Less scriptable |
| ✅ Built-in chat interface | |
Best for: Users who prefer GUI, experimentation, model comparison
Performance: ⭐⭐⭐ (Good)
Tier 3: llama.cpp / koboldcpp (Advanced)¶
Maximum control and excellent performance. The foundation most other tools build on.
llama.cpp Setup (Windows)¶
Option A: Pre-built Binaries

- Download the latest release from [llama.cpp releases](https://github.com/ggerganov/llama.cpp/releases)
- Get the `cudart-llama-*-win-x64.zip` for CUDA support
- Extract and add the folder to your PATH
Option B: Build from Source (Maximum Performance)
```powershell
# Install prerequisites
winget install cmake
winget install ninja

# Clone repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with CUDA support for RTX 3090
cmake -B build -G Ninja -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release
```
RTX 3090 Optimization
Use -DCMAKE_CUDA_ARCHITECTURES=86 for Ampere GPUs (RTX 30xx series).
For RTX 40xx, use 89.
Running Models¶
```powershell
# Download a GGUF model from HuggingFace
# Example: TheBloke/Mixtral-8x7B-v0.1-GGUF

# Run with optimal settings for RTX 3090
./build/bin/llama-server `
    -m "path/to/model.gguf" `
    -c 8192 `
    -ngl 999 `
    --host 0.0.0.0 `
    --port 8080
```
Recommended Flags for RTX 3090¶
```powershell
# -c      : context length
# -ngl    : number of layers to offload to GPU (999 = all)
# -t      : CPU threads (for any layers not offloaded)
# --mlock : lock the model in RAM to avoid paging
# -b      : batch size
./llama-server `
    -m model.gguf `
    -c 8192 `
    -ngl 999 `
    -t 8 `
    --mlock `
    -b 512 `
    --host 0.0.0.0 `
    --port 8080
```
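Once the server is up, you can hit it from any HTTP client. A quick smoke test in Python against llama-server's built-in completion endpoint (it also serves an OpenAI-style `/v1/chat/completions` route), assuming the host/port flags above:

```python
# Quick smoke test against llama-server started with the flags above.
# Assumes `pip install requests`.
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Explain KV caching in one sentence.",
        "n_predict": 128,      # max tokens to generate
        "temperature": 0.7,
    },
    timeout=120,
)
print(resp.json()["content"])
```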
koboldcpp Alternative¶
KoboldCpp is a llama.cpp fork with extra features:
```powershell
# Download from GitHub releases
# Run with GUI
koboldcpp.exe --model model.gguf --gpulayers 999 --contextsize 8192
```
Features:

- Built-in web UI
- Story mode and adventure mode
- Easy model switching
- Streaming support
| Pros | Cons |
|---|---|
| ✅ Maximum performance | ❌ More setup required |
| ✅ Full quantization control | ❌ Manual model downloads |
| ✅ Active development | ❌ CLI knowledge needed |
| ✅ CUDA optimization | ❌ No model catalog |
Best for: Performance enthusiasts, production deployments
Performance: ⭐⭐⭐⭐ (Excellent)
Tier 4: vLLM / Text Generation Inference (Expert)¶
Production-grade serving with advanced features.
vLLM (Recommended for High Throughput)¶
```bash
# Docker setup (WSL2 required for Windows)
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tensor-parallel-size 1
```
Key Features:

- Continuous batching - handles multiple requests efficiently
- PagedAttention - optimized memory management
- Tensor parallelism - multi-GPU support
- OpenAI-compatible API
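Because the container above serves the standard OpenAI REST surface on port 8000, existing OpenAI client code just needs a different `base_url`. A minimal sketch (the `model` field must match whatever `--model` you launched with):

```python
# Query the vLLM OpenAI-compatible server started above (port 8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # must match the served --model
    messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```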
Text Generation Inference (TGI)¶
HuggingFace's production inference server:
```bash
docker run --gpus all --shm-size 1g \
    -p 8080:80 \
    -v ~/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --quantize gptq
```
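TGI's native REST API is a simple `generate` route (recent versions also expose an OpenAI-style Messages API). A quick check from Python, assuming the port mapping above:

```python
# Call the TGI container started above (mapped to host port 8080).
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is continuous batching?",
        "parameters": {"max_new_tokens": 128, "temperature": 0.7},
    },
    timeout=120,
)
print(resp.json()["generated_text"])
```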
| Pros | Cons |
|---|---|
| ✅ Production-ready | ❌ Complex setup |
| ✅ Highest throughput | ❌ Docker/Linux required |
| ✅ Multi-user support | ❌ Higher resource usage |
| ✅ Continuous batching | ❌ Overkill for single user |
Best for: Multiple users, API serving, production deployments
Performance: ⭐⭐⭐⭐⭐ (Maximum throughput)
Tier 5: ExLlamaV2 (Maximum Single-User Performance)¶
The fastest inference for consumer GPUs. Period.
Installation¶
```bash
# Clone and install
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install .

# For Windows with CUDA
pip install exllamav2 --extra-index-url https://download.pytorch.org/whl/cu121
```
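ExLlamaV2 builds its CUDA extension against your installed PyTorch, so it is worth confirming the import works and the GPU is visible before going further. A quick check (assumes a CUDA-enabled PyTorch is already installed):

```python
# Sanity check: the package imports and PyTorch can see the RTX 3090.
import torch
import exllamav2  # fails here if the package is not installed correctly

print("exllamav2 imported OK")
print("CUDA available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
```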
Running the Server¶
```python
# server.py
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator

config = ExLlamaV2Config("path/to/model")
config.max_seq_len = 8192

model = ExLlamaV2(config)
model.load()

cache = ExLlamaV2Cache(model)
tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)

# Use the generator...
```
Using TabbyAPI (Recommended)¶
TabbyAPI wraps ExLlamaV2 with an OpenAI-compatible server:
```bash
git clone https://github.com/theroyallab/tabbyAPI
cd tabbyAPI
pip install -r requirements.txt

# Edit config.yml with your model path
python main.py
```
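TabbyAPI then behaves like any other OpenAI-compatible endpoint, so the same client pattern applies. A sketch assuming the defaults from the sample config; both the port (5000) and the API-key handling are assumptions here, so check your `config.yml` and the generated token file:

```python
# Query TabbyAPI/ExLlamaV2 through its OpenAI-compatible API.
# Port and API key are assumptions - check config.yml / your generated API tokens.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="your-tabby-api-key")

response = client.chat.completions.create(
    model="local-model",  # TabbyAPI serves whichever model is loaded
    messages=[{"role": "user", "content": "Hello from EXL2!"}],
)
print(response.choices[0].message.content)
```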
EXL2 Quantization¶
ExLlamaV2 uses its own EXL2 format for best performance:
| Bits per Weight | Quality Loss | Speed | VRAM Savings |
|---|---|---|---|
| 8.0 bpw | Minimal | Fast | 50% |
| 6.0 bpw | Very Low | Faster | 62% |
| 4.0 bpw | Low | Fastest | 75% |
| 3.0 bpw | Moderate | Maximum | 81% |
Find EXL2 models on HuggingFace: search for "EXL2" (TheBloke's catalog is mostly GGUF/GPTQ/AWQ; uploaders such as LoneStriker and bartowski maintain large EXL2 collections).
| Pros | Cons |
|---|---|
| ✅ Fastest consumer inference | ❌ Complex setup |
| ✅ Excellent quantization | ❌ Fewer models available |
| ✅ Low VRAM usage | ❌ Python knowledge required |
| ✅ Flash Attention built-in | ❌ Less documentation |
Best for: Maximum performance from consumer GPUs
Performance: ⭐⭐⭐⭐⭐ (Fastest)
Hardware Scaling Guide¶
Current Setup: RTX 3090 (24GB)¶
Your hardware can handle:
| Model Size | Quantization | Fits in VRAM? | Performance |
|---|---|---|---|
| 7-8B | FP16 | ✅ Easily | Excellent |
| 13B | Q8_0 | ✅ Yes | Very Good |
| 34B | Q4_K_M | ✅ Yes | Good |
| 70B | Q4_K_M | ⚠️ Tight (needs offload) | Usable |
| Mixtral 8x7B | Q4_K_M | ✅ Yes | Very Good |
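If you want to sanity-check a model/quantization combo that isn't in the table, a rough rule of thumb is weights ≈ parameters × bits-per-weight ÷ 8, plus an overhead margin for KV cache, activations, and CUDA buffers. A back-of-the-envelope helper; the 1.2 overhead factor and the ~4.5 bpw for Q4_K_M are assumptions, not measurements:

```python
# Back-of-the-envelope VRAM estimate: weights + ~20% overhead for cache/buffers.
def estimate_vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bpw ~= 1 GB
    return weight_gb * overhead

for name, params, bpw in [
    ("Llama 3.1 8B @ FP16", 8, 16),
    ("34B @ Q4_K_M (~4.5 bpw)", 34, 4.5),
    ("70B @ Q4_K_M (~4.5 bpw)", 70, 4.5),
]:
    print(f"{name}: ~{estimate_vram_gb(params, bpw):.1f} GB")
```

Anything that lands near or above 24 GB is where offloading (or a smaller quant) comes in.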
Upgrade Paths¶
RTX 4090 (24GB) - Same Capacity, Faster¶
- 50-80% faster than 3090
- Same model capacity
- Better power efficiency
- ~$1,600-2,000
Dual RTX 3090 (NVLink) - 48GB VRAM¶
- Run 70B models at Q4/Q5 fully in VRAM
- Requires NVLink bridge
- Only works with specific motherboards
- Tensor parallelism support needed
- ~$800-1,000 for second card + bridge
A100 40GB/80GB - Professional Grade¶
- Enterprise features
- ECC memory
- Better for production
- 80GB can run 70B at 8-bit fully in VRAM (FP16 needs two cards)
- ~$8,000-15,000
Multi-GPU Tensor Parallelism¶
For models too large for single GPU:
```bash
# vLLM with 2 GPUs
docker run --gpus '"device=0,1"' \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2
```
Model Recommendations by Hardware¶
| VRAM | Recommended Models | Notes |
|---|---|---|
| 8GB | Llama 3.1 8B Q4, Mistral 7B Q4, Phi-3 Mini | Stick to 7B models |
| 12GB | Llama 3.1 8B Q8, Mixtral Q4 (partial offload), CodeLlama 13B Q4 | Sweet spot for 7-13B |
| 16GB | Mixtral 8x7B Q4 (partial offload), Llama 3.1 8B Q8 (room for long context), Qwen 14B Q6 | Good for MoE models |
| 24GB | Mixtral 8x7B Q4, Llama 3.1 70B Q4 (with offload), Command-R Q4 | Your tier - lots of options |
| 48GB | Llama 3.1 70B Q4/Q5 (fully in VRAM), Qwen 72B Q4, DeepSeek 67B Q4 | High quality large models |
| 80GB+ | Full precision 70B, Mixtral 8x22B, Claude-scale models | No compromises |
Model Quality Tiers (Subjective)¶
- GPT-4 Class: Claude 3, GPT-4 (API only)
- Near GPT-4: Llama 3.1 70B, Qwen 72B, DeepSeek V2
- GPT-3.5 Class: Mixtral 8x7B, Llama 3.1 8B
- Good Enough: Mistral 7B, Phi-3, Qwen 7B
Performance Tuning¶
Context Length vs Speed¶
| Context | Tokens/sec (8B) | Tokens/sec (70B Q4) | VRAM Delta |
|---|---|---|---|
| 2048 | 80+ | 20+ | Baseline |
| 4096 | 70+ | 18+ | +1-2 GB |
| 8192 | 60+ | 15+ | +3-4 GB |
| 16384 | 45+ | 10+ | +6-8 GB |
| 32768 | 30+ | 5+ | +12-15 GB |
Context vs Quality Tradeoff
Very long contexts can degrade model attention. For most tasks, 4-8K is plenty.
Batch Size Optimization¶
```bash
# llama.cpp batch size tuning
./llama-server -m model.gguf -b 512    # Default, good for single user
./llama-server -m model.gguf -b 1024   # Higher throughput, more VRAM
./llama-server -m model.gguf -b 256    # Lower VRAM, slightly slower
```
KV Cache Quantization¶
Reduce VRAM usage for long contexts:
```bash
# llama.cpp with quantized KV cache
# (quantizing the V cache typically also requires flash attention, hence -fa)
./llama-server -m model.gguf -fa --cache-type-k q8_0 --cache-type-v q8_0
```
| Cache Type | VRAM Savings | Quality Impact |
|---|---|---|
| FP16 (default) | None | None |
| Q8_0 | ~25% | Minimal |
| Q4_0 | ~50% | Noticeable on long contexts |
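The savings follow directly from how the KV cache scales: it stores one key and one value vector per layer per token, so its size is 2 × layers × kv_heads × head_dim × context × bytes-per-element, and quantizing the cache shrinks the bytes-per-element term. A quick sketch using Llama-3.1-8B-like dimensions (32 layers, 8 KV heads, head dim 128, taken from the published architecture); actual numbers vary a lot by model, since GQA models keep far smaller caches than older full-head models:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * context * bytes_per_element
def kv_cache_gb(layers, kv_heads, head_dim, context, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1024**3

# Llama-3.1-8B-style dimensions (assumed): 32 layers, 8 KV heads, head_dim 128
for ctx in (4096, 8192, 16384, 32768):
    print(f"context {ctx}: ~{kv_cache_gb(32, 8, 128, ctx):.1f} GB of FP16 KV cache")
```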
Flash Attention Setup¶
Flash Attention 2 significantly speeds up inference:
```bash
# For PyTorch-based tools
pip install flash-attn --no-build-isolation

# llama.cpp CUDA builds ship with flash attention compiled in;
# enable it at runtime with -fa (recent builds may turn it on automatically)
```
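The flash-attn wheel compiles against your exact PyTorch/CUDA combination, so after installing it is worth confirming it actually loads. A minimal check:

```python
# Confirm flash-attn built against the local PyTorch/CUDA install.
import torch
import flash_attn

print("torch CUDA:", torch.version.cuda)
print("flash-attn:", flash_attn.__version__)
print("GPU capability:", torch.cuda.get_device_capability(0))  # FA2 needs >= (8, 0); RTX 3090 is (8, 6)
```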
CUDA Graphs¶
CUDA graphs reduce kernel launch overhead by capturing a whole sequence of GPU kernels once and replaying it for subsequent tokens. Recent llama.cpp CUDA builds and vLLM use them automatically where supported, so there is usually nothing extra to configure.
Model Fine-Tuning Path¶
When to Fine-Tune¶
✅ Fine-tune when:

- You have domain-specific data
- Base models don't understand your terminology
- You need consistent output formatting
- You want to reduce prompting overhead
❌ Don't fine-tune when:

- Base models work well with good prompts
- You don't have quality training data
- You need general capability
- Training resources are limited
QLoRA for Consumer GPUs¶
Train on your RTX 3090 with QLoRA:
```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
```
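A useful sanity check before training: PEFT can report how small the trainable footprint actually is, which is why this fits on a single 24GB card.

```python
# Only the LoRA adapter weights are trainable; the 4-bit base model stays frozen.
model.print_trainable_parameters()
# Typically reports well under 1% of total parameters as trainable.
```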
Unsloth for 2x Faster Training¶
Unsloth patches models for faster fine-tuning:
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
)

# 2-3x faster training!
```
What Unsloth Enables on RTX 3090:

- Train Llama 3.1 8B with 16K context
- Train Mixtral 8x7B with QLoRA
- 2x faster than standard training
- 60% less VRAM usage
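After training, the LoRA adapter is all you need to keep: it is tiny compared with the base weights (typically tens of MB) and can be re-applied or merged later. A minimal sketch using the standard PEFT-style save; the directory name is just an example:

```python
# Save only the LoRA adapter plus the tokenizer, not the full base model.
model.save_pretrained("outputs/llama3-lora-adapter")
tokenizer.save_pretrained("outputs/llama3-lora-adapter")
```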
Quick Reference: Tool Comparison¶
| Tool | Ease | Speed | Control | Multi-GPU | Best For |
|---|---|---|---|---|---|
| Ollama | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ❌ | Beginners |
| LM Studio | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ❌ | GUI lovers |
| llama.cpp | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Limited | Power users |
| vLLM | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ✅ | Production |
| ExLlamaV2 | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Limited | Max perf |
Recommended Progression¶
```text
Week 1-2: Ollama
├── Learn model basics
├── Test different models
└── Understand limitations

Week 3-4: LM Studio or llama.cpp
├── More parameter control
├── Better performance
└── GGUF model ecosystem

Month 2+: ExLlamaV2 or vLLM
├── Maximum performance
├── Production deployment
└── Custom integrations
```
Resources¶
- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [ExLlamaV2 GitHub](https://github.com/turboderp/exllamav2)
- [vLLM Documentation](https://docs.vllm.ai)
- [HuggingFace Models](https://huggingface.co/models)
- [TheBloke's Quantizations](https://huggingface.co/TheBloke)
- [LocalLLaMA Subreddit](https://www.reddit.com/r/LocalLLaMA/)
- [Unsloth GitHub](https://github.com/unslothai/unsloth)
Go forth and infer locally! 🦙