🧠 Local LLM Support¶
Local-first AI for instant responses - Ollama powers Skippy's primary conversation engine
✅ Active & Primary
Local LLM via Ollama is enabled and serves as the primary engine. Skippy uses local models for instant responses, with Claude available in the background for heavy lifting on complex tasks.
Overview¶
Skippy uses a local-first architecture where Ollama-powered models handle most conversations instantly, while Claude runs in the background for complex tasks that need more reasoning power.
Benefits¶
- Instant responses - No network latency, responses start in <100ms
- Streaming TTS - Audio plays while text generates
- Privacy - Conversations stay on your machine
- Cost savings - No API costs for routine chat
- Always available - Works offline
Current Architecture¶
User Message
        ↓
┌────────────────────────────────────┐
│        LOCAL-FIRST PIPELINE        │
├────────────────────────────────────┤
│ 1. Instant Starter (Phi-3 Mini)    │ ← <100ms
│ 2. Full Response (Mistral 7B)      │ ← Streaming
│ 3. Background Claude (if needed)   │ ← Complex tasks only
└────────────────────────────────────┘
        ↓
Audio + Text Response
Model Roles¶
| Model | Role | When Used |
|---|---|---|
| Phi-3 Mini | Quick acknowledgments | Every message (instant "DING! Let me think...") |
| Mistral 7B | Main conversation | All chat, stories, general questions |
| Qwen 2.5 0.5B | Fallback | If other models unavailable |
| Claude Sonnet | Heavy lifting | Code analysis, complex reasoning (background) |
| Claude Opus | Escalation | Very complex tasks (rare) |
Current Configuration¶
Your active settings in config.json:
{
  "use_pipeline": true,
  "enable_local_fallback": true,
  "local_model": "qwen2.5:0.5b",
  "ollama_url": "http://localhost:11434",
  "pipeline_conversation_model": "mistral:7b-instruct",
  "pipeline_fast_model": "phi3:mini",
  "pipeline_enable_background_claude": true,
  "pipeline_max_tokens": 600,
  "pipeline_temperature": 0.8
}
Key Settings¶
| Setting | Current Value | Purpose |
|---|---|---|
| use_pipeline | true | Enable local-first pipeline |
| pipeline_conversation_model | mistral:7b-instruct | Main conversation model |
| pipeline_fast_model | phi3:mini | Quick starter responses |
| pipeline_enable_background_claude | true | Use Claude for complex tasks |
| pipeline_max_tokens | 600 | Max response length |
| pipeline_temperature | 0.8 | Creativity (0.0-1.0) |
Required Models¶
Pull these models for full functionality:
# Main conversation model
ollama pull mistral:7b-instruct
# Quick acknowledgment model
ollama pull phi3:mini
# Fallback model
ollama pull qwen2.5:0.5b
Verify Models¶
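Confirm that all three models are installed:
ollama list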
Expected output:
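(Exact IDs, sizes, and timestamps depend on the tags you pulled; the listing should simply include all three models, roughly like this.)
NAME                  ID    SIZE      MODIFIED
mistral:7b-instruct   …     ~4.1 GB   …
phi3:mini             …     ~2.2 GB   …
qwen2.5:0.5b          …     ~0.4 GB   …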
How It Works¶
1. Instant Starter (~50ms)¶
When you send a message, Skippy immediately responds with a quick acknowledgment:
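For example (the exact wording varies):
User: "Hey Skippy, I'm stuck on a bug"
Skippy: "DING! Let me think..." ← plays immediately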
This uses phi3:mini for speed, or falls back to templates.
2. Streaming Response¶
While the starter plays, Mistral generates the full response:
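Under the hood this is a streaming call to Ollama. A minimal sketch of the kind of request involved (the prompt here is a placeholder; num_predict and temperature mirror the pipeline_* settings above):
curl http://localhost:11434/api/generate -d '{
  "model": "mistral:7b-instruct",
  "prompt": "You are Skippy... Human said: Tell me a story",
  "stream": true,
  "options": { "num_predict": 600, "temperature": 0.8 }
}'
Each line of the streamed JSON carries a small "response" fragment.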
Each sentence is sent to TTS as it's generated - you hear audio while text is still being created.
3. Background Claude (Optional)¶
For complex requests (code, analysis), Claude runs in the background:
User: "Debug this Python code"
Skippy: "SIGH. Let me look at your mess..." ← Local instant response
[Background: Claude analyzes code]
Skippy: "Actually, I found the bug - line 42 has a null reference." ← Claude follow-up
Conversation Context¶
Skippy remembers recent conversation for context-aware responses:
## Recent Conversation
- Human said: "Hey Skippy, I'm stuck on a bug"
- You replied: "SIGH. Another bug? Let me see what mess you made..."
- Human said: "It's in the login function"
- You replied: "Ooh, authentication bugs. Classic monkey mistake..."
[Current topic: code]
[Mood: Default snarky superiority]
The local model receives this context to give relevant responses.
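One way this context can reach the model is through the request itself; a sketch assuming the context block is passed as the Ollama system text (whether Skippy uses the system field or prepends it to the prompt is an internal detail):
curl http://localhost:11434/api/generate -d '{
  "model": "mistral:7b-instruct",
  "system": "## Recent Conversation\n- Human said: Hey Skippy, I am stuck on a bug\n- You replied: SIGH. Another bug?\n[Current topic: code]\n[Mood: Default snarky superiority]",
  "prompt": "It is in the login function",
  "stream": true
}'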
Performance¶
Expected Latency¶
| Component | Time |
|---|---|
| Starter acknowledgment | ~50-100ms |
| First sentence generated | ~500ms |
| Full response (5-8 sentences) | ~3-5s |
| TTS audio starts | ~200ms after text |
Hardware Usage¶
With RTX 3090 (24GB VRAM):
- Mistral 7B: ~4GB VRAM
- Phi-3 Mini: ~2GB VRAM
- Both can run simultaneously
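To check which models are currently loaded (ollama ps) and how much VRAM they are using (nvidia-smi):
ollama ps
nvidia-smi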
Troubleshooting¶
Ollama Not Running¶
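If Skippy cannot reach Ollama, check that the server is answering at the configured ollama_url (assuming the default localhost port from the config above), and start it if needed:
# Check that the Ollama API responds
curl http://localhost:11434/api/tags
# Start the server if it is not running
ollama serve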
Slow Responses¶
- Check GPU is being used: nvidia-smi
- Ensure models are loaded: ollama list
- Reduce pipeline_max_tokens for shorter responses
Model Not Found¶
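If Ollama reports a missing model, pull it again (shown here for the main conversation model; substitute whichever model the error names) and re-check the list:
ollama pull mistral:7b-instruct
ollama list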
Falling Back to Templates¶
If you see generic responses like "DING! Processing...", the local model may have failed. Check:
- Ollama is running
- Model is pulled
- ollama_url in config.json is correct
Advanced: Custom Models¶
You can use different models by updating config.json:
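For example, to switch the main conversation model to llama3.1:8b (pull it first with ollama pull llama3.1:8b), change the relevant key and leave the rest of the settings as shown above:
{
  "pipeline_conversation_model": "llama3.1:8b"
}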
Recommended Alternatives¶
| Model | Size | Good For |
|---|---|---|
| llama3.1:8b | 4.7GB | General conversation |
| codellama:7b | 3.8GB | Code-heavy tasks |
| neural-chat:7b | 4.1GB | Natural dialogue |
| qwen2.5:7b | 4.4GB | Multilingual support |
See Also¶
- Architecture Overview - Full system design
- Voice Output - Streaming TTS details
- Configuration - All settings reference
Local-first: Your AI responds instantly, thinks deeply in the background.