
🚀 Local LLM Scaling Guide

From Ollama beginner to maximum performance expert

This guide covers the full spectrum of local LLM deployment options, from the simplest setup to bleeding-edge performance optimization.


Inference Backend Tiers

Tier 1: Ollama (Current - Easy)

What you're using now. Perfect for getting started.

# That's literally it
ollama pull llama3.1:8b
ollama run llama3.1:8b

| Pros | Cons |
|---|---|
| ✅ One-command install | ❌ Less control over parameters |
| ✅ Automatic GPU detection | ❌ Some performance overhead |
| ✅ Great model library | ❌ Limited quantization options |
| ✅ Background service | ❌ No multi-GPU support |
| ✅ OpenAI-compatible API | |

Best for: Quick setup, casual use, experimentation

Performance: ⭐⭐⭐ (Good)
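
Ollama also exposes an OpenAI-compatible endpoint (by default at http://localhost:11434/v1), so you can script against it. A minimal sketch using the openai Python client package; the model name assumes you pulled llama3.1:8b as above, and the API key can be any placeholder string:

from openai import OpenAI

# Ollama's OpenAI-compatible endpoint; the key is unused but must be non-empty
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)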


Tier 2: LM Studio (Intermediate)

GUI-based model management with more control than Ollama.

Download: lmstudio.ai

Setup Steps

  1. Download and install LM Studio
  2. Browse models in the built-in catalog
  3. Download models directly (GGUF format)
  4. Start local server for API access

Features

  • Visual model browser and downloader
  • Real-time performance metrics
  • Easy parameter tweaking
  • OpenAI-compatible API server
  • Multiple model sessions

API Usage

# Start server in LM Studio (default port 1234)

$body = @{
    model = "local-model"
    messages = @(
        @{ role = "user"; content = "Hello!" }
    )
    temperature = 0.7
} | ConvertTo-Json -Depth 5  # -Depth avoids truncating the nested messages array

Invoke-RestMethod -Uri "http://localhost:1234/v1/chat/completions" -Method POST -Body $body -ContentType "application/json"
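
The same request from Python, for consistency with the later examples in this guide; it assumes the LM Studio server is running on its default port 1234 and uses the openai client package:

from openai import OpenAI

# LM Studio's local server is OpenAI-compatible; the key is a placeholder
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # LM Studio serves whichever model you have loaded
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
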

| Pros | Cons |
|---|---|
| ✅ User-friendly GUI | ❌ Slightly slower than CLI tools |
| ✅ Easy model management | ❌ Resource overhead from the GUI |
| ✅ Good for experimentation | ❌ Less scriptable |
| ✅ Built-in chat interface | |

Best for: Users who prefer GUI, experimentation, model comparison

Performance: ⭐⭐⭐ (Good)


Tier 3: llama.cpp / koboldcpp (Advanced)

Maximum control and excellent performance. The foundation most other tools build on.

llama.cpp Setup (Windows)

Option A: Pre-built Binaries

  1. Download latest release from llama.cpp releases
  2. For CUDA support, grab the CUDA build for your CUDA version plus the matching cudart-llama-*-win-x64.zip runtime DLLs
  3. Extract and add to PATH

Option B: Build from Source (Maximum Performance)

# Install prerequisites
winget install cmake
winget install ninja

# Clone repository  
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with CUDA support for RTX 3090
cmake -B build -G Ninja -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release

RTX 3090 Optimization

Use -DCMAKE_CUDA_ARCHITECTURES=86 for Ampere GPUs (RTX 30xx series). For RTX 40xx, use 89.

Running Models

# Download a GGUF model from HuggingFace
# Example: TheBloke/Mixtral-8x7B-v0.1-GGUF

# Run with settings tuned for a single RTX 3090:
#   -c 8192     context length
#   -ngl 999    offload all layers to the GPU
#   -t 8        CPU threads (for any layers left on the CPU)
#   --mlock     lock the model in RAM
#   -b 512      batch size
./build/bin/llama-server `
    -m "path/to/model.gguf" `
    -c 8192 `
    -ngl 999 `
    -t 8 `
    --mlock `
    -b 512 `
    --host 0.0.0.0 `
    --port 8080
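
llama-server also speaks the OpenAI chat-completions protocol, so the same Python client pattern works against it. A sketch assuming the server command above (port 8080); since llama-server loads a single model, the model field is typically just a label:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="llama-cpp")  # key is a placeholder

response = client.chat.completions.create(
    model="local-model",  # label only; llama-server serves the model passed via -m
    messages=[{"role": "user", "content": "Summarize llama.cpp in one sentence."}],
)
print(response.choices[0].message.content)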

koboldcpp Alternative

KoboldCpp is a llama.cpp fork with extra features:

# Download from GitHub releases
# Run with GUI
koboldcpp.exe --model model.gguf --gpulayers 999 --contextsize 8192

Features:

  • Built-in web UI
  • Story mode and adventure mode
  • Easy model switching
  • Streaming support

| Pros | Cons |
|---|---|
| ✅ Maximum performance | ❌ More setup required |
| ✅ Full quantization control | ❌ Manual model downloads |
| ✅ Active development | ❌ CLI knowledge needed |
| ✅ CUDA optimization | ❌ No model catalog |

Best for: Performance enthusiasts, production deployments

Performance: ⭐⭐⭐⭐ (Excellent)


Tier 4: vLLM / Text Generation Inference (Expert)

Production-grade serving with advanced features.

# Docker setup (WSL2 required for Windows)
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tensor-parallel-size 1

Key Features:

  • Continuous batching - handles multiple requests efficiently
  • PagedAttention - optimized memory management
  • Tensor parallelism - multi-GPU support
  • OpenAI-compatible API
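
Since vLLM's server is OpenAI-compatible, the same client pattern applies. A sketch against the Docker command above (port 8000); with vLLM the model field should match the model the server was launched with:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="vllm")  # placeholder unless you configured a key

response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)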

Text Generation Inference (TGI)

HuggingFace's production inference server:

docker run --gpus all --shm-size 1g \
    -p 8080:80 \
    -v ~/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --quantize gptq
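
TGI's native API is a simple /generate endpoint. A minimal sketch with requests against the container above (container port 80 mapped to 8080); the generation parameters are illustrative:

import requests

# TGI's native text-generation endpoint
payload = {
    "inputs": "Explain tensor parallelism in one paragraph.",
    "parameters": {"max_new_tokens": 200, "temperature": 0.7},
}
response = requests.post("http://localhost:8080/generate", json=payload, timeout=120)
print(response.json()["generated_text"])
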
| Pros | Cons |
|---|---|
| ✅ Production-ready | ❌ Complex setup |
| ✅ Highest throughput | ❌ Docker/Linux required |
| ✅ Multi-user support | ❌ Higher resource usage |
| ✅ Continuous batching | ❌ Overkill for a single user |

Best for: Multiple users, API serving, production deployments

Performance: ⭐⭐⭐⭐⭐ (Maximum throughput)


Tier 5: ExLlamaV2 (Maximum Single-User Performance)

The fastest inference for consumer GPUs. Period.

Installation

# Clone and install
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install .

# For Windows with CUDA
pip install exllamav2 --extra-index-url https://download.pytorch.org/whl/cu121

Running the Server

# server.py
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator

config = ExLlamaV2Config("path/to/model")
config.max_seq_len = 8192

model = ExLlamaV2(config)
model.load()

cache = ExLlamaV2Cache(model)
tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)

# Use the generator...

TabbyAPI wraps ExLlamaV2 with an OpenAI-compatible server:

git clone https://github.com/theroyallab/tabbyAPI
cd tabbyAPI
pip install -r requirements.txt

# Edit config.yml with your model path
python main.py
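
TabbyAPI then serves the usual OpenAI-style endpoints. A sketch assuming the default port 5000 and an API key from the config TabbyAPI generates (both are assumptions; check your config.yml):

from openai import OpenAI

# Port and API key come from your TabbyAPI config (5000 and a generated key by default)
client = OpenAI(base_url="http://localhost:5000/v1", api_key="your-tabby-api-key")

response = client.chat.completions.create(
    model="local-model",  # label for the model configured in config.yml
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)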

EXL2 Quantization

ExLlamaV2 uses its own EXL2 format for best performance:

| Bits per Weight | Quality Loss | Speed | VRAM Savings (vs FP16) |
|---|---|---|---|
| 8.0 bpw | Minimal | Fast | 50% |
| 6.0 bpw | Very low | Faster | 62% |
| 4.0 bpw | Low | Fastest | 75% |
| 3.0 bpw | Moderate | Maximum | 81% |
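
As a rough way to pick a bpw for your VRAM budget: weight memory is approximately parameters × bits-per-weight ÷ 8, plus a few GB for the KV cache and activations. A back-of-the-envelope sketch (the 2 GB overhead figure is an assumption, not a measurement):

def approx_vram_gb(params_billion: float, bits_per_weight: float, overhead_gb: float = 2.0) -> float:
    # Weights only: params (in billions) * bpw / 8 gives an estimate in GB
    return params_billion * bits_per_weight / 8 + overhead_gb

for bpw in (8.0, 6.0, 4.0, 3.0):
    print(f"34B model at {bpw} bpw ≈ {approx_vram_gb(34, bpw):.1f} GB")
# 4.0 bpw comes out around 19 GB - a comfortable fit for a 24 GB RTX 3090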

Find EXL2 models on HuggingFace by searching for "EXL2" in the model name.

| Pros | Cons |
|---|---|
| ✅ Fastest consumer inference | ❌ Complex setup |
| ✅ Excellent quantization | ❌ Fewer models available |
| ✅ Low VRAM usage | ❌ Python knowledge required |
| ✅ Flash Attention built-in | ❌ Less documentation |

Best for: Maximum performance from consumer GPUs

Performance: ⭐⭐⭐⭐⭐ (Fastest)


Hardware Scaling Guide

Current Setup: RTX 3090 (24GB)

Your hardware can handle:

| Model Size | Quantization | Fits in 24 GB VRAM? | Performance |
|---|---|---|---|
| 7-8B | FP16 | ✅ Easily | Excellent |
| 13B | Q8_0 | ✅ Yes | Very Good |
| 34B | Q4_K_M | ✅ Yes | Good |
| 70B | Q4_K_M | ⚠️ Tight (needs CPU offload) | Usable |
| Mixtral 8x7B | Q4_K_M | ✅ Yes | Very Good |

Upgrade Paths

RTX 4090 (24GB) - Same Capacity, Faster

  • 50-80% faster than the 3090
  • Same model capacity
  • Better power efficiency
  • ~$1,600-2,000

Second RTX 3090 (48GB Total) - More VRAM

  • Run 70B models at Q4/Q5 entirely in VRAM
  • NVLink bridge helps inter-GPU bandwidth, but tensor parallelism also works over PCIe
  • Only works with motherboards that have the slot spacing and PCIe lanes for two cards
  • Backend must support multi-GPU (tensor parallelism or layer splitting)
  • ~$800-1,000 for a used second card + bridge

A100 40GB/80GB - Professional Grade

  • Enterprise features
  • ECC memory
  • Better for production
  • 80GB can run a 70B model at 8-bit (FP16 weights alone need ~140 GB)
  • ~$8,000-15,000

Multi-GPU Tensor Parallelism

For models too large for single GPU:

# vLLM with 2 GPUs
docker run --gpus '"device=0,1"' \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2

Model Recommendations by Hardware

| VRAM | Recommended Models | Notes |
|---|---|---|
| 8GB | Llama 3.1 8B Q4, Mistral 7B Q4, Phi-3 Mini | Stick to 7B-class models |
| 12GB | Llama 3.1 8B Q8, Mixtral Q4 (partial offload), CodeLlama 13B Q4 | Sweet spot for 7-13B |
| 16GB | Mixtral 8x7B Q4 (partial offload), 13B models at Q8, Qwen 14B | Good for MoE models |
| 24GB | Mixtral 8x7B Q4, Llama 3.1 70B Q4 (with offload), Command-R Q4 | Your tier - lots of options |
| 48GB | Llama 3.1 70B Q4/Q5, Qwen 72B Q4, DeepSeek 67B | High quality large models |
| 80GB+ | Llama 3.1 70B Q8/FP8, Mixtral 8x22B | Very few compromises |

Model Quality Tiers (Subjective)

  1. GPT-4 Class: Claude 3, GPT-4 (API only)
  2. Near GPT-4: Llama 3.1 70B, Qwen 72B, DeepSeek V2
  3. GPT-3.5 Class: Mixtral 8x7B, Llama 3.1 8B
  4. Good Enough: Mistral 7B, Phi-3, Qwen 7B

Performance Tuning

Context Length vs Speed

| Context | Tokens/sec (8B) | Tokens/sec (70B Q4) | VRAM Delta |
|---|---|---|---|
| 2048 | 80+ | 20+ | Baseline |
| 4096 | 70+ | 18+ | +1-2 GB |
| 8192 | 60+ | 15+ | +3-4 GB |
| 16384 | 45+ | 10+ | +6-8 GB |
| 32768 | 30+ | 5+ | +12-15 GB |

Context vs Quality Tradeoff

Very long contexts can degrade model attention. For most tasks, 4-8K is plenty.

Batch Size Optimization

# llama.cpp batch size tuning
./llama-server -m model.gguf -b 512  # Default, good for single user
./llama-server -m model.gguf -b 1024 # Higher throughput, more VRAM
./llama-server -m model.gguf -b 256  # Lower VRAM, slightly slower

KV Cache Quantization

Reduce VRAM usage for long contexts:

# llama.cpp with quantized KV cache
# (quantizing the V cache requires flash attention, hence -fa)
./llama-server -m model.gguf -fa --cache-type-k q8_0 --cache-type-v q8_0
| Cache Type | VRAM Savings | Quality Impact |
|---|---|---|
| FP16 (default) | None | None |
| Q8_0 | ~25% | Minimal |
| Q4_0 | ~50% | Noticeable on long contexts |
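
To see why this matters at long contexts: the KV cache stores roughly 2 × n_layers × n_kv_heads × head_dim values per token. A quick estimate for a Llama-3.1-8B-shaped model (32 layers, 8 KV heads, head dim 128 - standard published figures, but verify against your model's config); the Q8/Q4 byte sizes are approximations:

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int, ctx: int, bytes_per_value: float) -> float:
    # K and V each hold n_layers * n_kv_heads * head_dim values per token
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_value / 1024**3

for cache_type, bytes_per_value in [("FP16", 2.0), ("Q8_0 (approx.)", 1.0), ("Q4_0 (approx.)", 0.5)]:
    print(f"{cache_type}: {kv_cache_gb(32, 8, 128, 32768, bytes_per_value):.1f} GB at 32K context")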

Flash Attention Setup

Flash Attention 2 significantly speeds up inference:

# For PyTorch-based tools
pip install flash-attn --no-build-isolation

# For llama.cpp, flash attention support is compiled into CUDA builds -
# enable it at runtime with the -fa flag on llama-server / llama-cli

CUDA Graphs

Reduces kernel launch overhead:

# ExLlamaV2
config.max_batch_size = 1  # Required for CUDA graphs
model.load(lazy=True)

Model Fine-Tuning Path

When to Fine-Tune

Fine-tune when:

  • You have domain-specific data
  • Base models don't understand your terminology
  • You need consistent output formatting
  • You want to reduce prompting overhead

Don't fine-tune when:

  • Base models work well with good prompts
  • You don't have quality training data
  • You need general capability
  • Training resources are limited

QLoRA for Consumer GPUs

Train on your RTX 3090 with QLoRA:

import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto"
)

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
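
Continuing from the snippet above, a quick sanity check that only the adapter weights are trainable (the printed figures below are illustrative, not exact):

# Only the LoRA adapter parameters should be marked trainable
model.print_trainable_parameters()
# Expect output along the lines of: trainable params: ~13M || all params: ~8B || trainable%: ~0.2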

Unsloth for 2x Faster Training

Unsloth patches models for faster fine-tuning:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
)

# 2-3x faster training!

What Unsloth enables on an RTX 3090:

  • Train Llama 3.1 8B with 16K context
  • Train Mixtral 8x7B with QLoRA
  • 2x faster than standard training
  • ~60% less VRAM usage


Quick Reference: Tool Comparison

| Tool | Ease | Speed | Control | Multi-GPU | Best For |
|---|---|---|---|---|---|
| Ollama | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ❌ | Beginners |
| LM Studio | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | — | GUI lovers |
| llama.cpp | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Limited | Power users |
| vLLM | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ✅ | Production |
| ExLlamaV2 | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Limited | Max perf |

Suggested Learning Path

Week 1-2: Ollama
├── Learn model basics
├── Test different models
└── Understand limitations

Week 3-4: LM Studio or llama.cpp
├── More parameter control
├── Better performance
└── GGUF model ecosystem

Month 2+: ExLlamaV2 or vLLM
├── Maximum performance
├── Production deployment
└── Custom integrations

Resources

  • llama.cpp - github.com/ggerganov/llama.cpp
  • KoboldCpp - github.com/LostRuins/koboldcpp
  • LM Studio - lmstudio.ai
  • vLLM - github.com/vllm-project/vllm
  • Text Generation Inference - github.com/huggingface/text-generation-inference
  • ExLlamaV2 - github.com/turboderp/exllamav2
  • TabbyAPI - github.com/theroyallab/tabbyAPI
  • Unsloth - github.com/unslothai/unsloth

Go forth and infer locally! 🦙