Voice Input

Skippy supports push-to-talk voice input so you can talk instead of typing. Hold the button, speak your message, and release; Skippy transcribes and sends it automatically.


Quick Start

  1. Click and hold the 🎤 button
  2. Speak your message clearly
  3. Release the button
  4. Wait for transcription
  5. Message sends automatically

STT Providers

Local Whisper (Default)

The default provider. Runs entirely offline using OpenAI's Whisper model.

Feature      | Details
Privacy      | 100% offline, no data sent
Speed        | Fast after model loads
Accuracy     | Excellent for English
Requirements | ~75 MB to 3 GB disk (varies by model)

Model Sizes:

Model    | Size    | Speed   | Accuracy
tiny     | ~75 MB  | Fastest | Good
base     | ~150 MB | Fast    | Better
small    | ~500 MB | Medium  | Good
medium   | ~1.5 GB | Slower  | Very Good
large-v2 | ~3 GB   | Slowest | Best

Recommended: base

The base model offers the best balance of speed and accuracy for most users.

Change Model:

  • Tray menu → Voice → 🧠 Whisper Model → Select
  • Or edit whisper_model in config.json

Google Speech Recognition (Fallback)

Used when Whisper is unavailable or fails.

Feature      | Details
Privacy      | Audio sent to Google servers
Speed        | Depends on internet
Accuracy     | Very good
Requirements | Internet connection
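
The fallback behaviour described in this section can be sketched as follows. This is a minimal illustration, not Skippy's actual code: transcribe_with_fallback and its two callable parameters are hypothetical names, and it assumes that a Whisper failure surfaces as an exception.

```python
from typing import Callable, Tuple


def transcribe_with_fallback(
    audio_path: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (text, provider), trying the offline provider first."""
    try:
        return primary(audio_path), "local-whisper"
    except Exception as exc:  # e.g. model missing or failed to load
        print(f"Whisper failed ({exc}); falling back to Google")
    # Only reached when the primary provider raised
    return fallback(audio_path), "google"
```

Passing the providers in as plain callables keeps the offline path and the network path independent, so either one can be swapped or tested on its own.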

Recording Process

Visual Feedback

State      | Button           | Status Bar
Ready      | 🎤 (gray)        | Ready
Recording  | 🔴 (pulsing red) | 🎤 Listening...
Processing | 🎤 (gray)        | 🔄 Transcribing...
First Use  | 🎤 (gray)        | 🔄 Loading Whisper model...

Recording Flow

Hold Button → Recording Starts → Audio Captured
    ↓
Release Button → Recording Stops → Save WAV
    ↓
Transcription → Text in Input → Auto-Send
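
The hold-and-release part of this flow could look roughly like the sketch below. It assumes the sounddevice library's RawInputStream; the class name PushToTalkRecorder and its methods are hypothetical and are not taken from skippy.py.

```python
class PushToTalkRecorder:
    """Collect raw 16-bit mono audio between start() and stop()."""

    def __init__(self, sample_rate=16_000):
        self.sample_rate = sample_rate
        self._chunks = []
        self._stream = None

    def _on_audio(self, indata, frames, time, status):
        # sounddevice calls this from its audio thread for every block
        self._chunks.append(bytes(indata))

    def start(self):
        """Call when the button is pressed."""
        import sounddevice as sd  # imported lazily; needs an audio backend

        self._chunks = []
        self._stream = sd.RawInputStream(
            samplerate=self.sample_rate,
            channels=1,
            dtype="int16",
            callback=self._on_audio,
        )
        self._stream.start()

    def stop(self):
        """Call when the button is released; returns the captured PCM bytes."""
        if self._stream is not None:
            self._stream.stop()
            self._stream.close()
            self._stream = None
        return b"".join(self._chunks)
```

The bytes returned by stop() are raw PCM, ready to be written to a WAV file before transcription.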

Technical Details

  • Sample Rate: 16,000 Hz (speech recognition standard)
  • Format: 16-bit PCM WAV
  • Max Duration: 30 seconds
  • Silence Threshold: a recording whose audio level stays below 50 is treated as no speech detected
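
As an illustration of these parameters, the sketch below writes a 16 kHz, 16-bit mono WAV file with Python's standard wave module and applies a silence check. How Skippy measures "audio level" is not documented here, so the peak-amplitude interpretation is an assumption, and save_wav and is_silent are hypothetical helper names.

```python
import struct
import wave

SAMPLE_RATE = 16_000      # Hz
MAX_SECONDS = 30          # recordings are capped at this length
SILENCE_THRESHOLD = 50    # assumed: peak amplitude on the int16 scale


def save_wav(path, samples):
    """Write mono 16-bit PCM samples (ints in -32768..32767) to a WAV file."""
    samples = list(samples)[: SAMPLE_RATE * MAX_SECONDS]  # enforce max duration
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16 bits
        wav.setframerate(SAMPLE_RATE)
        # WAV data is little-endian, hence the "<" prefix
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))


def is_silent(samples, threshold=SILENCE_THRESHOLD):
    """True when the loudest sample stays below the threshold."""
    return max((abs(s) for s in samples), default=0) < threshold
```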

First-Time Setup

Model Loading

On first voice use, the Whisper model is downloaded and loaded:

  1. Press 🎤 and speak
  2. Status shows "Loading Whisper model..."
  3. Model loads (30-60 seconds first time)
  4. Subsequent uses skip this wait while the model stays loaded
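
The load-once behaviour in the steps above can be sketched with a small lazy wrapper. LazyModel is a hypothetical name used only for illustration; the loader you pass in might be something like lambda: WhisperModel("base") from faster-whisper.

```python
import threading


class LazyModel:
    """Load a model on first use and reuse it afterwards."""

    def __init__(self, loader):
        self._loader = loader          # callable that builds the model
        self._model = None
        self._lock = threading.Lock()  # guards against two first-use calls

    def get(self):
        with self._lock:
            if self._model is None:
                # Slow path: runs only on the first call
                self._model = self._loader()
            return self._model
```

The model stays in memory only as long as the wrapper does, which matches the note elsewhere on this page that the model unloads when Skippy closes.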

Model Location

Whisper models are cached in the Hugging Face cache directory, typically ~/.cache/huggingface/ (on Windows, %USERPROFILE%\.cache\huggingface\).

Microphone Permissions

Windows may prompt for microphone access:

  1. Go to Settings → Privacy → Microphone
  2. Enable "Allow apps to access your microphone"
  3. Ensure Python has permission

Troubleshooting Voice Input

"Voice input not available"

Cause: Missing dependencies

Fix:

pip install sounddevice numpy faster-whisper

"No speech detected"

Cause: Audio is too quiet or contains only silence

Fix:

  • Speak louder/closer to mic
  • Check microphone isn't muted
  • Test mic in Windows Sound settings

"Couldn't understand. Try again."

Cause: Speech was unclear, or Whisper misrecognized it

Fix:

  • Speak more clearly
  • Reduce background noise
  • Try a larger Whisper model

"Recognition error"

Cause: Google fallback failed (network issue)

Fix:

  • Check internet connection
  • Ensure Whisper is properly installed for offline use

Recording stops immediately

Cause: Microphone not capturing

Fix:

  1. Check default recording device in Windows
  2. Test with Voice Recorder app
  3. Verify sounddevice can access mic:
    import sounddevice as sd
    print(sd.query_devices())
    

Configuration

config.json Settings

{
    "voice_input_enabled": true,
    "stt_provider": "local-whisper",
    "whisper_model": "base"
}

Setting             | Options                                       | Default
voice_input_enabled | true/false                                    | true
stt_provider        | "local-whisper", "google"                     | "local-whisper"
whisper_model       | "tiny", "base", "small", "medium", "large-v2" | "base"
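
For illustration, the settings and defaults listed above could be read and validated as in the sketch below. load_voice_config is a hypothetical helper, and the choice to raise ValueError on an unknown value is an assumption rather than documented Skippy behaviour.

```python
import json
from pathlib import Path

DEFAULTS = {
    "voice_input_enabled": True,
    "stt_provider": "local-whisper",
    "whisper_model": "base",
}
VALID_PROVIDERS = {"local-whisper", "google"}
VALID_MODELS = {"tiny", "base", "small", "medium", "large-v2"}


def load_voice_config(path="config.json"):
    """Return the voice settings, filling in defaults for missing keys."""
    config = dict(DEFAULTS)
    file = Path(path)
    if file.exists():
        loaded = json.loads(file.read_text(encoding="utf-8"))
        # Keep only the voice-related keys; other settings are ignored here
        config.update({k: v for k, v in loaded.items() if k in DEFAULTS})
    if config["stt_provider"] not in VALID_PROVIDERS:
        raise ValueError(f"Unknown stt_provider: {config['stt_provider']!r}")
    if config["whisper_model"] not in VALID_MODELS:
        raise ValueError(f"Unknown whisper_model: {config['whisper_model']!r}")
    return config
```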

Change Provider

Via Tray Menu: Currently shows provider info only. Change via config.json.

Via Config:

{
    "stt_provider": "google"
}
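
If you prefer to script the change instead of editing the file by hand, a sketch along these lines would work. set_config_value is a hypothetical helper, not part of Skippy; it rewrites config.json with one key changed and every other setting preserved.

```python
import json
from pathlib import Path


def set_config_value(path, key, value):
    """Update one key in a JSON config file, keeping the other settings."""
    file = Path(path)
    config = json.loads(file.read_text(encoding="utf-8")) if file.exists() else {}
    config[key] = value
    file.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config
```

For example, set_config_value("config.json", "stt_provider", "google") switches to the Google provider.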


Audio File Handling

Temporary Files

Recordings are saved to:

%USERPROFILE%\SkippyBuddy\temp\voice_recording.wav

This file is overwritten on each recording.

Cleanup

Temporary audio files remain until:

  • Next recording (overwritten)
  • Manual deletion
  • System restart
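
Manual deletion can also be scripted. The sketch below is a hypothetical helper, not part of Skippy; pass it the recording path shown under Temporary Files.

```python
from pathlib import Path


def delete_recording(path):
    """Remove the temporary recording; return True if a file was deleted."""
    file = Path(path)
    if file.is_file():
        file.unlink()
        return True
    return False  # nothing to clean up
```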

Performance Tips

Fast Transcription

  • Use tiny or base model for speed
  • Keep recordings short (5-15 seconds)
  • Speak clearly without long pauses

Better Accuracy

  • Use small or medium model
  • Minimize background noise
  • Speak at normal pace

Memory Usage

  • base model uses ~300MB RAM
  • large-v2 can use 3GB+ RAM
  • Model unloads when Skippy closes

Advanced: Language Settings

Whisper is configured for English by default:

segments, info = model.transcribe(
    audio_path,
    language="en",  # Force English
    vad_filter=True  # Filter silence
)

To support other languages, change the language argument in skippy.py:

language=None    # Auto-detect language
# or
language="es"    # Spanish
language="fr"    # French
# etc.
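
One way to make the language configurable rather than hard-coded is to wrap the call in a small function, as sketched below. transcribe_file is a hypothetical name; the model argument is assumed to be a faster-whisper WhisperModel, whose transcribe method returns a lazy iterator of segments plus an info object that carries the detected language.

```python
def transcribe_file(model, audio_path, language="en"):
    """Transcribe a file and return (text, language_used).

    Pass language=None to let faster-whisper auto-detect the language.
    """
    segments, info = model.transcribe(
        audio_path,
        language=language,
        vad_filter=True,  # skip silent stretches
    )
    # segments is a generator; transcription happens as it is consumed
    text = " ".join(segment.text.strip() for segment in segments)
    return text, info.language
```

With auto-detection, info.language reports the language Whisper chose, which is useful for showing the user what was recognized.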