🧠 Local LLM Support

Local-first AI for instant responses - Ollama powers Skippy's primary conversation engine


✅ Active & Primary

Local LLM support via Ollama is enabled and serves as the primary engine. Skippy uses local models for instant responses, with Claude available in the background for heavy lifting on complex tasks.


Overview

Skippy uses a local-first architecture where Ollama-powered models handle most conversations instantly, while Claude runs in the background for complex tasks that need more reasoning power.

Benefits

  • Instant responses - No network latency, responses start in <100ms
  • Streaming TTS - Audio plays while text generates
  • Privacy - Conversations stay on your machine
  • Cost savings - No API costs for routine chat
  • Always available - Works offline

Current Architecture

User Message
┌─────────────────────────────────────┐
│         LOCAL-FIRST PIPELINE        │
├─────────────────────────────────────┤
│  1. Instant Starter (Phi-3 Mini)    │  ← <100ms
│  2. Full Response (Mistral 7B)      │  ← Streaming
│  3. Background Claude (if needed)   │  ← Complex tasks only
└─────────────────────────────────────┘
Audio + Text Response
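
The steps above map onto a small dispatch routine. The sketch below is illustrative only: speak stands in for the TTS call, and the helper functions are fleshed out under How It Works further down.

# Illustrative dispatch loop; helpers are sketched under "How It Works" below.
def handle_message(user_message, history, speak=print):
    speak(instant_starter(user_message))                  # 1. <100ms acknowledgment
    prompt = build_context(history, topic="chat", mood="snarky") + "\n" + user_message
    reply = stream_response(prompt, speak)                # 2. streamed sentence by sentence
    maybe_escalate(user_message, on_followup=speak)       # 3. background Claude if needed
    return reply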

Model Roles

Model            Role                    When Used
Phi-3 Mini       Quick acknowledgments   Every message (instant "DING! Let me think...")
Mistral 7B       Main conversation       All chat, stories, general questions
Qwen 2.5 0.5B    Fallback                If other models unavailable
Claude Sonnet    Heavy lifting           Code analysis, complex reasoning (background)
Claude Opus      Escalation              Very complex tasks (rare)

Current Configuration

Your active settings in config.json:

{
  "use_pipeline": true,
  "enable_local_fallback": true,
  "local_model": "qwen2.5:0.5b",
  "ollama_url": "http://localhost:11434",
  "pipeline_conversation_model": "mistral:7b-instruct",
  "pipeline_fast_model": "phi3:mini",
  "pipeline_enable_background_claude": true,
  "pipeline_max_tokens": 600,
  "pipeline_temperature": 0.8
}

Key Settings

Setting                             Current Value         Purpose
use_pipeline                        true                  Enable local-first pipeline
pipeline_conversation_model         mistral:7b-instruct   Main conversation model
pipeline_fast_model                 phi3:mini             Quick starter responses
pipeline_enable_background_claude   true                  Use Claude for complex tasks
pipeline_max_tokens                 600                   Max response length
pipeline_temperature                0.8                   Creativity (0.0-1.0)
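
A plausible way these settings reach Ollama (the mapping below is an assumption for illustration, not a guarantee of Skippy's internals) is to forward pipeline_max_tokens and pipeline_temperature as the num_predict and temperature options on /api/generate:

import json
import requests

# Load the active settings shown above (file path assumed to be ./config.json).
with open("config.json") as f:
    cfg = json.load(f)

resp = requests.post(
    f"{cfg['ollama_url']}/api/generate",
    json={
        "model": cfg["pipeline_conversation_model"],    # mistral:7b-instruct
        "prompt": "Introduce yourself in one sentence.",
        "stream": False,
        "options": {
            "num_predict": cfg["pipeline_max_tokens"],   # 600 -> max response length
            "temperature": cfg["pipeline_temperature"],  # 0.8 -> creativity
        },
    },
    timeout=120,
)
print(resp.json()["response"])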

Required Models

Pull these models for full functionality:

# Main conversation model
ollama pull mistral:7b-instruct

# Quick acknowledgment model
ollama pull phi3:mini

# Fallback model
ollama pull qwen2.5:0.5b

Verify Models

ollama list

Expected output:

NAME                    SIZE
mistral:7b-instruct     4.1 GB
phi3:mini               2.3 GB
qwen2.5:0.5b            397 MB
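
For a programmatic check (for example, before enabling the pipeline), Ollama's /api/tags endpoint returns the same list as ollama list. A minimal sketch, assuming the default URL and the three models above:

import requests

OLLAMA_URL = "http://localhost:11434"
REQUIRED = {"mistral:7b-instruct", "phi3:mini", "qwen2.5:0.5b"}

# /api/tags lists installed models, same data as `ollama list`.
tags = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5).json()
installed = {m["name"] for m in tags["models"]}

missing = REQUIRED - installed
if missing:
    print("Missing models - run `ollama pull` for:", ", ".join(sorted(missing)))
else:
    print("All pipeline models are installed.")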


How It Works

1. Instant Starter (~50-100ms)

When you send a message, Skippy immediately responds with a quick acknowledgment:

User: "Tell me a story"
Skippy: "DING! Let me think..." ← Plays immediately via TTS

This uses phi3:mini for speed, or falls back to templates.
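
A minimal sketch of that starter step, assuming a short timeout and a canned template as the fallback (the prompt wording and timeout here are placeholders):

import requests

FALLBACK_STARTER = "DING! Let me think..."  # canned template if the model can't answer in time

def instant_starter(user_message, ollama_url="http://localhost:11434"):
    """Return a one-line acknowledgment from phi3:mini, else the template."""
    try:
        r = requests.post(
            f"{ollama_url}/api/generate",
            json={
                "model": "phi3:mini",
                "prompt": f"Give one short, snarky acknowledgment for: {user_message}",
                "stream": False,
                "options": {"num_predict": 20},  # keep it to a few words
            },
            timeout=1.0,  # don't let the "instant" step block the pipeline
        )
        return r.json()["response"].strip() or FALLBACK_STARTER
    except Exception:
        return FALLBACK_STARTER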

2. Streaming Response

While the starter plays, Mistral generates the full response:

Skippy: "Once upon a time in a galaxy far away..." ← Streams sentence by sentence

Each sentence is sent to TTS as it's generated - you hear audio while text is still being created.
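
One way to implement that (a sketch: speak stands in for whatever the TTS layer exposes) is to buffer the streamed tokens and flush on sentence boundaries:

import json
import re
import requests

def stream_response(prompt, speak, ollama_url="http://localhost:11434"):
    """Stream tokens from Mistral, speak each finished sentence, return the full text."""
    buffer, full_text = "", ""
    with requests.post(
        f"{ollama_url}/api/generate",
        json={"model": "mistral:7b-instruct", "prompt": prompt, "stream": True},
        stream=True,
        timeout=120,
    ) as r:
        for line in r.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            token = chunk.get("response", "")
            buffer += token
            full_text += token
            # Flush complete sentences to TTS while generation continues.
            while (m := re.search(r"[.!?]\s", buffer)):
                sentence, buffer = buffer[:m.end()], buffer[m.end():]
                speak(sentence.strip())
            if chunk.get("done"):
                break
    if buffer.strip():
        speak(buffer.strip())  # speak whatever is left at the end
    return full_text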

3. Background Claude (Optional)

For complex requests (code, analysis), Claude runs in the background:

User: "Debug this Python code"
Skippy: "SIGH. Let me look at your mess..." ← Local instant response

[Background: Claude analyzes code]

Skippy: "Actually, I found the bug - line 42 has a null reference." ← Claude follow-up

Conversation Context

Skippy remembers recent conversation for context-aware responses:

## Recent Conversation
- Human said: "Hey Skippy, I'm stuck on a bug"
- You replied: "SIGH. Another bug? Let me see what mess you made..."
- Human said: "It's in the login function"
- You replied: "Ooh, authentication bugs. Classic monkey mistake..."

[Current topic: code]
[Mood: Default snarky superiority]

The local model receives this context to give relevant responses.
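
Building that block is plain string formatting. A sketch, assuming history entries carry role and text fields (the actual structure may differ):

def build_context(history, topic, mood, max_turns=4):
    """Format recent turns into the context block shown above."""
    lines = ["## Recent Conversation"]
    for turn in history[-max_turns:]:
        speaker = "Human said" if turn["role"] == "user" else "You replied"
        lines.append(f'- {speaker}: "{turn["text"]}"')
    lines += ["", f"[Current topic: {topic}]", f"[Mood: {mood}]"]
    return "\n".join(lines)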


Performance

Expected Latency

Component                        Time
Starter acknowledgment           ~50-100ms
First sentence generated         ~500ms
Full response (5-8 sentences)    ~3-5s
TTS audio starts                 ~200ms after text

Hardware Usage

With RTX 3090 (24GB VRAM):

  • Mistral 7B: ~4GB VRAM
  • Phi-3 Mini: ~2GB VRAM
  • Both can run simultaneously

Troubleshooting

Ollama Not Running

# Check status
ollama list

# Start if needed
ollama serve

Slow Responses

  1. Check GPU is being used: nvidia-smi
  2. Ensure models are loaded: ollama list
  3. Reduce pipeline_max_tokens for shorter responses

Model Not Found

# Pull missing model
ollama pull mistral:7b-instruct

Falling Back to Templates

If you see generic responses like "DING! Processing...", the local model may have failed. Check:

  1. Ollama is running
  2. Model is pulled
  3. ollama_url in config.json is correct

Advanced: Custom Models

You can use different models by updating config.json:

{
  "pipeline_conversation_model": "llama3.1:8b",
  "pipeline_fast_model": "qwen2.5:1.5b"
}
Model            Size     Good For
llama3.1:8b      4.7 GB   General conversation
codellama:7b     3.8 GB   Code-heavy tasks
neural-chat:7b   4.1 GB   Natural dialogue
qwen2.5:7b       4.4 GB   Multilingual support

Local-first: Your AI responds instantly, thinks deeply in the background.