The Interaction Model
Building a truly interactive voice experience within the browser involves three core pillars: Speech-to-Text (STT), Framework-Based Intent Extraction, and Contextual Conversation Tracking.
Unlike modern AI assistants that rely on heavy server-side processing, this experiment is designed to be entirely local-first, utilizing a stateful pipeline to maintain conversation flow without external LLM calls.
1. Speech Recognition (STT)
We use the browser's SpeechRecognition interface (still webkit-prefixed in Chromium-based browsers, hence the fallback below). To make the UI feel alive, we enable interimResults, so the visualizer and transcript can react before the user finishes their sentence.
const recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
recognition.continuous = true;
recognition.interimResults = true;
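One way to consume those interim results is to split each batch into a final transcript and an in-progress preview. The event wiring below follows the Web Speech API; collectTranscript is an illustrative helper, not part of the original engine:

```typescript
// Minimal shape of one recognition alternative, for illustration.
type ResultLike = { isFinal: boolean; transcript: string };

// Split results into the committed (final) text and the live interim preview.
function collectTranscript(results: ResultLike[]): { final: string; interim: string } {
  let final = "";
  let interim = "";
  for (const r of results) {
    if (r.isFinal) final += r.transcript;
    else interim += r.transcript;
  }
  return { final, interim };
}

// In the browser, this would hang off the recognizer's result event:
// recognition.onresult = (e: SpeechRecognitionEvent) => {
//   const results = [...e.results].map(r => ({ isFinal: r.isFinal, transcript: r[0].transcript }));
//   const { final, interim } = collectTranscript(results);
//   // render `final` in the transcript, `interim` in the live preview
// };
```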
2. The Frame-Based NLP Engine
Instead of simple keyword matching, we use a Frame-Based Architecture. Every user request is parsed into a structured Frame:
- Intent: The core goal (e.g., experience, skills, why_hire).
- Entity: The specific subject (e.g., Travel LYKKE, Next.js).
- Focus: The dimension of the query (technical vs. impact).
- Tone: Adapting the response to a developer vs. a recruiter.
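A frame of that shape can be sketched as a plain object plus a rule-based parser. The field names follow the list above; the matching rules here are a toy approximation, not the real engine:

```typescript
// Hypothetical Frame shape, mirroring the four slots described above.
type Frame = {
  intent: "experience" | "skills" | "why_hire" | "unknown";
  entity?: string;
  focus?: "technical" | "impact";
};

// Toy rule-based slot filler: regexes stand in for the real matchers.
function parseFrame(utterance: string): Frame {
  const text = utterance.toLowerCase();
  const frame: Frame = { intent: "unknown" };
  if (/experience|worked|built/.test(text)) frame.intent = "experience";
  else if (/skills?|stack|technolog/.test(text)) frame.intent = "skills";
  else if (/why.*hire/.test(text)) frame.intent = "why_hire";
  if (text.includes("next.js")) frame.entity = "Next.js";
  if (/how|architecture|implement/.test(text)) frame.focus = "technical";
  else if (/impact|result/.test(text)) frame.focus = "impact";
  return frame;
}
```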
3. State & Context Tracking
The engine maintains a Conversation State to enable multi-turn interactions. This allows for:
- State Carry-over: If you ask "Tell me about your experience" and follow up with "How was it built?", the system carries over the last entity (e.g., DMC Circle) to provide context-aware technical details.
- Depth Escalation: If you stay on a topic, the engine escalates the response depth—moving from a concise summary to a technical deep-dive.
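Both behaviors reduce to a small state-update step. The sketch below assumes a ConversationState with lastIntent, lastEntity, and depthLevel fields (names inferred from this article, not confirmed against the source code):

```typescript
// Assumed shape of the conversation state described above.
type ConversationState = {
  lastIntent?: string;
  lastEntity?: string;
  depthLevel: number;
};

function updateState(
  state: ConversationState,
  frame: { intent: string; entity?: string }
): ConversationState {
  const sameTopic = frame.intent === state.lastIntent;
  return {
    lastIntent: frame.intent,
    // State carry-over: keep the previous entity when the follow-up omits one.
    lastEntity: frame.entity ?? state.lastEntity,
    // Depth escalation: go deeper while the topic holds, reset on a switch.
    depthLevel: sameTopic ? state.depthLevel + 1 : 0,
  };
}
```

So "Tell me about your experience [at DMC Circle]" followed by "How was it built?" keeps DMC Circle as the active entity while bumping depthLevel from 0 to 1.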
4. Variation Engine & Humanizer
To prevent "bot fatigue," responses are never static. We use a Variation Engine that assembles answers from randomized blocks and situational transitions:
function bridge(state: ConversationState): string {
  if (state.depthLevel > 1 && state.lastIntent === 'experience') {
    return pick(['Beyond that, ', 'Diving deeper, ', 'To expand on that, ']);
  }
  return "";
}
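The pick helper used by bridge is not shown in the snippet; a plausible minimal version is a uniform random choice:

```typescript
// Uniform random choice over a non-empty list of response fragments.
// (A sketch of the `pick` helper referenced by `bridge`; the real
// implementation may weight or de-duplicate recent picks.)
function pick<T>(options: T[]): T {
  return options[Math.floor(Math.random() * options.length)];
}
```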
Why Local-First?
- Low Latency: By resolving intents locally, we eliminate the 1-2 second round trip of LLM APIs, making conversational transitions feel natural.
- Privacy by Design: No audio data or transcripts ever leave your device.
- Zero Cost: Built entirely on native browser APIs.
Technical Stack
- React + Tailwind: For the responsive UI and glassmorphic micro-animations.
- Frame-Based NLP: A custom stateful engine implemented in TypeScript.
- Web Speech API: For native recognition and synthesis.
- Web Audio Analyser: For frequency-driven microphone visualization.