# Speech Configuration (STT / TTS)

The Speech tab controls the acoustic and phonetic identity of your physical Lyntaris deployments. Settings adjusted here directly alter how the Unity application hears audio from the physical microphone array and how the 3D Avatar's mouth moves.

**Crucial Architecture: Over-The-Air (OTA) Syncing.** Changes made on this page do not require a reboot of your physical Kiosks. The Unity application runs a FlowiseConfigSyncRunner connected via WebSockets. When you modify a Provider here, your remote hardware immediately drops its current STT/TTS connection and instantiates the newly selected provider classes, even mid-conversation.
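The provider swap described above can be sketched as follows. This is a minimal illustration only: the type names and the `applyConfigUpdate` helper are assumptions for the sketch, not the actual FlowiseConfigSyncRunner API.

```typescript
// Illustrative sketch of OTA provider swapping; names are hypothetical.

type SpeechProvider = { name: string; dispose(): void };

interface SpeechConfig {
  sttProvider: string;
  ttsProvider: string;
}

function applyConfigUpdate(
  current: SpeechProvider,
  update: SpeechConfig,
  factory: (name: string) => SpeechProvider
): SpeechProvider {
  // Only swap when the provider actually changed, so an unrelated
  // config push does not needlessly drop an in-flight conversation.
  if (current.name === update.sttProvider) return current;
  current.dispose(); // tear down the live STT/TTS connection
  return factory(update.sttProvider); // instantiate the newly selected class
}
```

The key design point is that the swap happens on the running connection object, which is why no reboot is needed.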

## Speech-to-Text (STT)

STT defines the "Ears" of the system, transcribing raw PCM audio buffers captured by the Kiosk's hardware microphone into text strings for the LLM.

- **Soniox (Recommended for Physical Hardware):** Lyntaris uses Soniox via WebSockets because of its extremely low latency and strong noise suppression. It excels in busy trade shows where standard REST APIs fail.
- **Auto Language Detection:** If users frequently switch languages mid-sentence, enabling this lets the engine re-evaluate its acoustic model on the fly. However, locking the language to a single choice (for example, English) significantly improves accuracy if you know your demographic.
- **Manual Finalization:** In highly chaotic environments, automatic silence detection (Voice Activity Detection, VAD) often cuts users off prematurely. Enabling Manual Finalization links the STT close event directly to a physical button press on the Kiosk UI, guaranteeing that the transcription is not finalized until the user explicitly finishes speaking.
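The finalization behavior above can be sketched as a simple gate. The event names and the `shouldFinalize` helper are illustrative assumptions, not actual Kiosk API identifiers.

```typescript
// Illustrative sketch of VAD vs. manual finalization; names are hypothetical.

type FinalizeEvent = "silence" | "button";

function shouldFinalize(event: FinalizeEvent, manualFinalization: boolean): boolean {
  if (manualFinalization) {
    // VAD silence events are ignored; only an explicit button press closes STT.
    return event === "button";
  }
  // Default mode: automatic silence detection finalizes the transcript.
  return event === "silence" || event === "button";
}
```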

## Text-to-Speech (TTS) & Lip-Sync

TTS defines the "Voice" of the system, streaming synthesized audio down to the physical speakers. Crucially, your choice of provider dictates how the Unity Avatar animates its mouth.

- **Azure TTS (Best for Procedural Animation):** When Azure is selected, Flowise requests standard audio bytes alongside an explicit "Viseme" array. Visemes are phonetic timestamps (e.g., "the mouth should form an 'O' shape at exactly 1.25 seconds"). The Unity Avatar (VisemeDataSync) reads these timestamps and mechanically drives the blendshapes of the 3D mesh, producing precise, deterministic procedural lip-sync.
- **ElevenLabs / MiniMax (Best for Emotive Voice):** These providers offer incredibly rich voice cloning and high emotional variance, but they lack native Viseme support. When either is selected, the Unity Avatar detects the missing timestamps and falls back to SALSA (Simple Automated Lip Sync Approximation), a system that "listens" to the raw audio waveform's amplitude in real time to estimate how wide to open the jaw.
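A minimal sketch of the fallback decision above, assuming a hypothetical `selectLipSyncMode` helper and a simplified viseme shape (the real VisemeDataSync payload may differ):

```typescript
// Illustrative sketch of the viseme-vs-SALSA fallback; names are hypothetical.

interface Viseme {
  timeSec: number; // when the mouth shape should occur, e.g. 1.25
  shape: string;   // e.g. "O"
}

type LipSyncMode = "viseme-driven" | "salsa-amplitude";

function selectLipSyncMode(visemes: Viseme[] | undefined): LipSyncMode {
  // Azure returns a viseme array alongside the audio; ElevenLabs and
  // MiniMax do not, so an empty or missing array triggers the SALSA fallback.
  return visemes && visemes.length > 0 ? "viseme-driven" : "salsa-amplitude";
}
```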

## SSML Integration and Lexicons

Because Lyntaris serves enterprise use-cases, perfect pronunciation of brand names is mandatory.

- **Lexicons:** For Azure and ElevenLabs, you can explicitly map written text (e.g., Lyntaris) to raw IPA phonemes (exact tongue and lip placement data). Flowise intercepts the LLM output and swaps the text for the phoneme string before sending it to the TTS engine.
- **SSML Injection:** Advanced users can configure the System Prompt (in the Orchestrator tab) to output SSML tags. If the LLM generates `<mstts:express-as style="cheerful">Welcome!</mstts:express-as>`, Flowise parses these tags and instructs the Azure TTS engine to alter the pitch and timbre of the generated audio buffer sent to Unity, making the Avatar sound genuinely happy.
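The lexicon swap described above can be sketched as a pre-TTS text transform. The lexicon shape, the `applyLexicon` helper, and the example IPA string are illustrative assumptions; the SSML `<phoneme>` element itself, however, is standard SSML.

```typescript
// Illustrative sketch of lexicon-based phoneme substitution before TTS.

type Lexicon = Record<string, string>; // written form -> IPA pronunciation

function applyLexicon(text: string, lexicon: Lexicon): string {
  let out = text;
  for (const [word, ipa] of Object.entries(lexicon)) {
    // Wrap each occurrence in an SSML <phoneme> element so the TTS
    // engine speaks the IPA instead of guessing from spelling.
    out = out
      .split(word)
      .join(`<phoneme alphabet="ipa" ph="${ipa}">${word}</phoneme>`);
  }
  return out;
}
```

Running the transform on LLM output before it reaches the TTS engine is what guarantees consistent brand-name pronunciation regardless of what the model writes.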
