The Interaction Model
Building a truly interactive voice experience within the browser involves three core pillars: Speech-to-Text (STT), Framework-Based Intent Extraction, and Contextual Conversation Tracking.
Unlike modern AI assistants that rely on heavy server-side processing, this experiment is designed to be entirely local-first, utilizing a stateful pipeline to maintain conversation flow without external LLM calls.
1. Speech Recognition (STT)
We use the browser's SpeechRecognition interface (still webkit-prefixed in Chromium-based browsers, hence the fallback below). To make the UI feel alive, we enable interimResults, so the visualizer and transcript can react before the user finishes their sentence.
const recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
recognition.continuous = true;
recognition.interimResults = true;
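One way to consume those interim results is to split each batch into a final transcript and an in-progress preview. The event wiring below follows the Web Speech API; collectTranscript is an illustrative helper, not part of the original engine:

```typescript
// Minimal shape of one recognition alternative, for illustration.
type ResultLike = { isFinal: boolean; transcript: string };

// Split results into the committed (final) text and the live interim preview.
function collectTranscript(results: ResultLike[]): { final: string; interim: string } {
  let final = "";
  let interim = "";
  for (const r of results) {
    if (r.isFinal) final += r.transcript;
    else interim += r.transcript;
  }
  return { final, interim };
}

// In the browser, this would hang off the recognizer's result event:
// recognition.onresult = (e: SpeechRecognitionEvent) => {
//   const results = [...e.results].map(r => ({ isFinal: r.isFinal, transcript: r[0].transcript }));
//   const { final, interim } = collectTranscript(results);
//   // render `final` in the transcript, `interim` in the live preview
// };
```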
2. The Frame-Based NLP Engine
Instead of simple keyword matching, we use a Frame-Based Architecture. Every user request is parsed into a structured Frame:
- Intent: The core goal (e.g., experience, skills, why_hire).
- Entity: The specific subject (e.g., Travel LYKKE, Next.js).
- Focus: The dimension of the query (technical vs. impact).
- Tone: Adapting the response to a developer vs. a recruiter.
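A frame of that shape can be sketched as a plain object plus a rule-based parser. The field names follow the list above; the matching rules here are a toy approximation, not the real engine:

```typescript
// Hypothetical Frame shape, mirroring the four slots described above.
type Frame = {
  intent: "experience" | "skills" | "why_hire" | "unknown";
  entity?: string;
  focus?: "technical" | "impact";
};

// Toy rule-based slot filler: regexes stand in for the real matchers.
function parseFrame(utterance: string): Frame {
  const text = utterance.toLowerCase();
  const frame: Frame = { intent: "unknown" };
  if (/experience|worked|built/.test(text)) frame.intent = "experience";
  else if (/skills?|stack|technolog/.test(text)) frame.intent = "skills";
  else if (/why.*hire/.test(text)) frame.intent = "why_hire";
  if (text.includes("next.js")) frame.entity = "Next.js";
  if (/how|architecture|implement/.test(text)) frame.focus = "technical";
  else if (/impact|result/.test(text)) frame.focus = "impact";
  return frame;
}
```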
3. State & Context Tracking
The engine maintains a Conversation State to enable multi-turn interactions. This allows for:
- State Carry-over: If you ask "Tell me about your experience" and follow up with "How was it built?", the system carries over the last entity (e.g., DMC Circle) to provide context-aware technical details.
- Depth Escalation: If you stay on a topic, the engine escalates the response depth—moving from a concise summary to a technical deep-dive.
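Both behaviors reduce to a small state-update step. The sketch below assumes a ConversationState with lastIntent, lastEntity, and depthLevel fields (names inferred from this article, not confirmed against the source code):

```typescript
// Assumed shape of the conversation state described above.
type ConversationState = {
  lastIntent?: string;
  lastEntity?: string;
  depthLevel: number;
};

function updateState(
  state: ConversationState,
  frame: { intent: string; entity?: string }
): ConversationState {
  const sameTopic = frame.intent === state.lastIntent;
  return {
    lastIntent: frame.intent,
    // State carry-over: keep the previous entity when the follow-up omits one.
    lastEntity: frame.entity ?? state.lastEntity,
    // Depth escalation: go deeper while the topic holds, reset on a switch.
    depthLevel: sameTopic ? state.depthLevel + 1 : 0,
  };
}
```

So "Tell me about your experience [at DMC Circle]" followed by "How was it built?" keeps DMC Circle as the active entity while bumping depthLevel from 0 to 1.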
4. Variation Engine & Humanizer
To prevent "bot fatigue," responses are never static. We use a Variation Engine that assembles answers from randomized blocks and situational transitions:
function bridge(state: ConversationState): string {
  if (state.depthLevel > 1 && state.lastIntent === 'experience') {
    return pick(['Beyond that, ', 'Diving deeper, ', 'To expand on that, ']);
  }
  return "";
}
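The pick helper used by bridge is not shown in the snippet; a plausible minimal version is a uniform random choice:

```typescript
// Uniform random choice over a non-empty list of response fragments.
// (A sketch of the `pick` helper referenced by `bridge`; the real
// implementation may weight or de-duplicate recent picks.)
function pick<T>(options: T[]): T {
  return options[Math.floor(Math.random() * options.length)];
}
```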
Why Local-First?
- Low Latency: By resolving intents locally, we eliminate the 1-2 second round trip of LLM APIs, making conversational transitions feel natural.
- Privacy by Design: No audio data or transcripts ever leave your device.
- Zero Cost: Built entirely on native browser APIs.
Technical Stack
- React + Tailwind: For the responsive UI and glassmorphic micro-animations.
- Frame-Based NLP: A custom stateful engine implemented in TypeScript.
- Web Speech API: For native recognition and synthesis.
- Web Audio Analyser: For frequency-driven microphone visualization.