🧠 Local LLM Support¶
Local-first AI for instant responses - Ollama powers Skippy's primary conversation engine
✅ Active & Primary
Local LLM via Ollama is enabled and serves as the primary engine. Skippy uses local models for instant responses, with Claude available in the background for heavy lifting on complex tasks.
Overview¶
Skippy uses a local-first architecture where Ollama-powered models handle most conversations instantly, while Claude runs in the background for complex tasks that need more reasoning power.
Benefits¶
- Instant responses - No network latency, responses start in <100ms
- Streaming TTS - Audio plays while text generates
- Privacy - Conversations stay on your machine
- Cost savings - No API costs for routine chat
- Always available - Works offline
Current Architecture¶
User Message
        ↓
┌────────────────────────────────────┐
│        LOCAL-FIRST PIPELINE        │
├────────────────────────────────────┤
│ 1. Instant Starter (Phi-3 Mini)    │ ← <100ms
│ 2. Full Response (Mistral 7B)      │ ← Streaming
│ 3. Background Claude (if needed)   │ ← Complex tasks only
└────────────────────────────────────┘
        ↓
Audio + Text Response
Model Roles¶
| Model | Role | When Used |
|---|---|---|
| Phi-3 Mini | Quick acknowledgments | Every message (instant "DING! Let me think...") |
| Mistral 7B | Main conversation | All chat, stories, general questions |
| Qwen 2.5 0.5B | Fallback | If other models unavailable |
| Claude Sonnet | Heavy lifting | Code analysis, complex reasoning (background) |
| Claude Opus | Escalation | Very complex tasks (rare) |
Current Configuration¶
Your active settings in config.json:
{
  "use_pipeline": true,
  "enable_local_fallback": true,
  "local_model": "qwen2.5:0.5b",
  "ollama_url": "http://localhost:11434",
  "pipeline_conversation_model": "mistral:7b-instruct",
  "pipeline_fast_model": "phi3:mini",
  "pipeline_enable_background_claude": true,
  "pipeline_max_tokens": 600,
  "pipeline_temperature": 0.8
}
Key Settings¶
| Setting | Current Value | Purpose |
|---|---|---|
| use_pipeline | true | Enable local-first pipeline |
| pipeline_conversation_model | mistral:7b-instruct | Main conversation model |
| pipeline_fast_model | phi3:mini | Quick starter responses |
| pipeline_enable_background_claude | true | Use Claude for complex tasks |
| pipeline_max_tokens | 600 | Max response length |
| pipeline_temperature | 0.8 | Creativity (0.0-1.0) |
Required Models¶
Pull these models for full functionality:
# Main conversation model
ollama pull mistral:7b-instruct
# Quick acknowledgment model
ollama pull phi3:mini
# Fallback model
ollama pull qwen2.5:0.5b
Verify Models¶
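Confirm that all three models are installed:
ollama list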
Expected output:
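(Exact IDs, sizes, and timestamps depend on the tags you pulled; the listing should simply include all three models, roughly like this.)
NAME                  ID    SIZE      MODIFIED
mistral:7b-instruct   …     ~4.1 GB   …
phi3:mini             …     ~2.2 GB   …
qwen2.5:0.5b          …     ~0.4 GB   …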
How It Works¶
1. Instant Starter (~50ms)¶
When you send a message, Skippy immediately responds with a quick acknowledgment:
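For example (the exact wording varies):
User: "Hey Skippy, I'm stuck on a bug"
Skippy: "DING! Let me think..." ← plays immediately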
This uses phi3:mini for speed, or falls back to templates.
2. Streaming Response¶
While the starter plays, Mistral generates the full response:
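Under the hood this is a streaming call to Ollama. A minimal sketch of the kind of request involved (the prompt here is a placeholder; num_predict and temperature mirror the pipeline_* settings above):
curl http://localhost:11434/api/generate -d '{
  "model": "mistral:7b-instruct",
  "prompt": "You are Skippy... Human said: Tell me a story",
  "stream": true,
  "options": { "num_predict": 600, "temperature": 0.8 }
}'
Each line of the streamed JSON carries a small "response" fragment.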
Each sentence is sent to TTS as it's generated - you hear audio while text is still being created.
3. Background Claude (Optional)¶
For complex requests (code, analysis), Claude runs in the background:
User: "Debug this Python code"
Skippy: "SIGH. Let me look at your mess..." ← Local instant response
[Background: Claude analyzes code]
Skippy: "Actually, I found the bug - line 42 has a null reference." ← Claude follow-up
Conversation Context¶
Skippy remembers recent conversation for context-aware responses:
## Recent Conversation
- Human said: "Hey Skippy, I'm stuck on a bug"
- You replied: "SIGH. Another bug? Let me see what mess you made..."
- Human said: "It's in the login function"
- You replied: "Ooh, authentication bugs. Classic monkey mistake..."
[Current topic: code]
[Mood: Default snarky superiority]
The local model receives this context to give relevant responses.
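One way this context can reach the model is through the request itself; a sketch assuming the context block is passed as the Ollama system text (whether Skippy uses the system field or prepends it to the prompt is an internal detail):
curl http://localhost:11434/api/generate -d '{
  "model": "mistral:7b-instruct",
  "system": "## Recent Conversation\n- Human said: Hey Skippy, I am stuck on a bug\n- You replied: SIGH. Another bug?\n[Current topic: code]\n[Mood: Default snarky superiority]",
  "prompt": "It is in the login function",
  "stream": true
}'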
Performance¶
Expected Latency¶
| Component | Time |
|---|---|
| Starter acknowledgment | ~50-100ms |
| First sentence generated | ~500ms |
| Full response (5-8 sentences) | ~3-5s |
| TTS audio starts | ~200ms after text |
Hardware Usage¶
With RTX 3090 (24GB VRAM):
- Mistral 7B: ~4GB VRAM
- Phi-3 Mini: ~2GB VRAM
- Both can run simultaneously
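To check which models are currently loaded (ollama ps) and how much VRAM they are using (nvidia-smi):
ollama ps
nvidia-smi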
Troubleshooting¶
Ollama Not Running¶
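If Skippy cannot reach Ollama, check that the server is answering at the configured ollama_url (assuming the default localhost port from the config above), and start it if needed:
# Check that the Ollama API responds
curl http://localhost:11434/api/tags
# Start the server if it is not running
ollama serve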
Slow Responses¶
- Check GPU is being used: nvidia-smi
- Ensure models are loaded: ollama list
- Reduce pipeline_max_tokens for shorter responses
Model Not Found¶
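If Ollama reports a missing model, pull it again (shown here for the main conversation model; substitute whichever model the error names) and re-check the list:
ollama pull mistral:7b-instruct
ollama list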
Falling Back to Templates¶
If you see generic responses like "DING! Processing...", the local model may have failed. Check:
- Ollama is running
- Model is pulled
- ollama_url in config.json is correct
Advanced: Custom Models¶
You can use different models by updating config.json:
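For example, to switch the main conversation model to llama3.1:8b (pull it first with ollama pull llama3.1:8b), change the relevant key and leave the rest of the settings as shown above:
{
  "pipeline_conversation_model": "llama3.1:8b"
}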
Recommended Alternatives¶
| Model | Size | Good For |
|---|---|---|
| llama3.1:8b | 4.7GB | General conversation |
| codellama:7b | 3.8GB | Code-heavy tasks |
| neural-chat:7b | 4.1GB | Natural dialogue |
| qwen2.5:7b | 4.4GB | Multilingual support |
See Also¶
- Architecture Overview - Full system design
- Voice Output - Streaming TTS details
- Configuration - All settings reference
Local-first: Your AI responds instantly, thinks deeply in the background.