BAYAAN
بیان
noun · bayaan (Urdu) · to express, to articulate, to be heard
a real-time sign language to speech translator · built for India · powered by Sarvam

Existing tools translate sign to text.
Bayaan translates sign to live spoken audio—injected directly into virtual meetings.

A communication barrier built into everyday life.

In India, an entire population communicates through Indian Sign Language. They join meetings, attend interviews, see doctors. The world they're trying to talk to doesn't speak their language.

Every interaction needs a workaround. Type into a phone. Pass a notepad. Find an interpreter. None of these work in a real conversation, and none of them work in a video call.

Deaf & Mute Population
63,000,000
people in India
Indian Sign Language Users
8,435,000
across India
Certified ISL Translators
250
in the entire country

1 translator for every 33,740 ISL users. Source: Ishaara. The supply gap is structural—and unsolvable with humans alone.

What exists, and what doesn't.

Sign language recognition isn't new. Indian academic research spans IIT Roorkee, IIT Delhi, and IIIT Hyderabad. Global companies like SignAll, KinTrans, and Hand Talk exist. But none of them lets a signer be heard in a live conversation.

EXISTS
Sign → Text
Gesture recognition that outputs subtitles. Useful for documentation and learning. Not built for live conversation.
MISSING
Sign → Live Spoken Audio in a Call
No production product injects real-time spoken voice into a live video meeting, in an Indian language, in a natural human voice. This is Bayaan.

Bayaan turns a hand into a voice.

A camera reads a person's hands during a video call. A machine learning model recognises the sign language gestures and assembles them into words. The transcript is sent to Sarvam's text-to-speech, which speaks the message in a natural Indian voice. That audio is fed directly into the meeting as if the person were speaking themselves.

The participant on the other end hears a voice. Not a notification. Not a subtitle. A voice.

How it works—proof of concept vs production.

The MVP proves the loop closes on a single laptop with a small vocabulary. The production architecture is what scales this to a product used in real meetings.

PROOF OF CONCEPT — what's running today
A Python application running locally. MediaPipe extracts hand landmarks, and a K-Nearest Neighbours classifier labels each gesture from inter-joint angle features. The vocabulary is a small letter set and one demo phrase.
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│     Webcam      │ ──▶ │    MediaPipe    │ ──▶ │     Feature     │
│   live video    │     │     21 hand     │     │   extraction    │
│  frame stream   │     │    landmarks    │     │  joint angles   │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
┌─────────────────┐     ┌─────────────────┐     ┌────────▼────────┐
│ Audio playback  │ ◀── │   Sarvam TTS    │ ◀── │       KNN       │
│ local speakers  │     │  text → speech  │     │   classifier    │
│  via BlackHole  │     │ Indian English  │     │   9 gestures    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
▼ injected into meeting via virtual audio device
▼ proves the loop closes end-to-end
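The feature extraction and classification steps can be sketched in plain Python. This is a minimal illustration, not the proof of concept's actual code: it assumes each hand arrives as 21 (x, y, z) landmarks in MediaPipe's documented order (0 is the wrist, then four points per finger), and it assumes three bend angles per finger as the feature vector. The function names `angle_at`, `joint_angles`, and `knn_predict` are hypothetical, as is the choice of k = 3.

```python
import math
from collections import Counter

# Assumed landmark order (MediaPipe Hands): 0 = wrist, 1-4 thumb,
# 5-8 index, 9-12 middle, 13-16 ring, 17-20 pinky.
FINGER_CHAINS = [
    (0, 1, 2, 3, 4),
    (0, 5, 6, 7, 8),
    (0, 9, 10, 11, 12),
    (0, 13, 14, 15, 16),
    (0, 17, 18, 19, 20),
]


def angle_at(a, b, c):
    """Angle in radians at point b between segments b->a and b->c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(p * q for p, q in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0.0:
        return 0.0  # degenerate case: two landmarks coincide
    # Clamp to guard against rounding just outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def joint_angles(landmarks):
    """Three bend angles per finger, giving a 15-value feature vector."""
    feats = []
    for chain in FINGER_CHAINS:
        for i in range(1, 4):
            feats.append(
                angle_at(landmarks[chain[i - 1]],
                         landmarks[chain[i]],
                         landmarks[chain[i + 1]])
            )
    return feats


def knn_predict(train, query, k=3):
    """train is a list of (feature_vector, label); majority vote of k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

Angles are a reasonable feature choice here because they do not change when the hand moves closer to or further from the camera, which raw landmark coordinates do.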
PRODUCTION — what it needs to become
A browser-based product. The signer joins a normal video call. Recognition runs in the browser via a production model trained on thousands of ISL signers. Sarvam handles translation and speech. Audio streams into the meeting as a participant—no plugins on the listener's side.
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│     Browser     │ ──▶ │     Gesture     │ ──▶ │  Word & Phrase  │
│  front camera   │     │      model      │     │    assembly     │
│     stream      │     │    full ISL     │     │  + spell check  │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
┌─────────────────┐     ┌─────────────────┐     ┌────────▼────────┐
│   Sarvam TTS    │ ◀── │     Sarvam      │ ◀── │     Meeting     │
│    regional     │     │    Translate    │     │    language     │
│      voice      │     │  EN → HI/BN/TA  │     │  selected once  │
└────────┬────────┘     └─────────────────┘     └─────────────────┘
┌────────▼────────┐     ┌─────────────────┐
│     WebRTC      │ ──▶ │  Live meeting   │
│  audio stream   │     │all participants │
│    to peers     │     │ hear the voice  │
└─────────────────┘     └─────────────────┘
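The language stage of the diagram (meeting language selected once, then translate, then speak) could be wired up roughly as follows. This is a sketch under stated assumptions: the translation and speech services are passed in as plain callables, and `SpeechPipeline`, `fake_translate`, and `fake_tts` are hypothetical names. The stand-in functions are not Sarvam's API; a real integration would replace them with calls to the actual service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpeechPipeline:
    meeting_language: str                     # chosen once per meeting, e.g. "hi"
    translate: Callable[[str, str], str]      # (text, target_language) -> text
    synthesize: Callable[[str, str], bytes]   # (text, language) -> audio bytes
    source_language: str = "en"               # assumed language of the transcript

    def speak(self, transcript: str) -> bytes:
        text = transcript.strip()
        if not text:
            return b""  # nothing to say, so no service calls are made
        # Skip translation when the meeting already uses the source language.
        if self.meeting_language != self.source_language:
            text = self.translate(text, self.meeting_language)
        return self.synthesize(text, self.meeting_language)


# Stand-ins for illustration only; they do not call any real service.
def fake_translate(text: str, target_language: str) -> str:
    return f"[{target_language}] {text}"


def fake_tts(text: str, language: str) -> bytes:
    return f"{language}:{text}".encode("utf-8")


pipeline = SpeechPipeline("hi", fake_translate, fake_tts)
audio = pipeline.speak("Hi, I am Tia.")
```

Injecting the two services as callables keeps the language choice in one place and makes the chain testable without network access.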

The experience for the signer and the listener.

The point isn't the technology. The point is that the conversation feels normal.

PROOF OF CONCEPT — single user demo
SIGNER
├─ opens Bayaan locally
├─ joins Google Meet in parallel window
├─ routes audio output to virtual mic (BlackHole)
├─ signs into webcam
└─▶ Bayaan recognises gesture → builds transcript
    └─▶ Sarvam speaks aloud
        └─▶ audio reaches Meet as the signer's "mic"
LISTENER
└─ hears a voice in the call. Conversation continues.
PRODUCTION — natural meeting experience
SIGNER
├─ joins a Bayaan-enabled meeting (or a Meet/Zoom integration)
├─ picks the meeting's spoken language (English, Hindi, Bengali, Tamil…)
├─ turns on camera and signs naturally
└─▶ Bayaan runs invisibly in the browser
    ├─▶ transcript appears as live captions
    └─▶ spoken audio streams to all participants
LISTENERS (everyone else in the call)
└─ hear the signer's voice in the meeting's chosen language.
   No plugins. No setup. Just a voice on the call.

The specification.

Problem
ISL users cannot participate in spoken conversations—particularly in virtual meetings—without an interpreter or constant typing. There are 250 certified translators for 8.4 million ISL users.
Primary user
An ISL signer in a virtual meeting—a job interview, a class, a doctor's consultation, a standup—where the other participants do not understand sign language.
Secondary user
Hearing participants who need to understand and respond to the signer in real time, in the meeting's chosen language.
Core value
Conversation, not transcription. The signer is heard, not just read. The interaction looks and feels like any other voice call.
Key features
  • Real-time gesture recognition from a standard webcam
  • Word-by-word transcript assembly with spell correction
  • TTS output in the meeting's chosen Indian language via Sarvam
  • Native integration with Meet, Zoom, Teams—or a Bayaan-hosted room
  • Captioning and audio simultaneously
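One way the transcript assembly and spell correction features could fit together is sketched below. It assumes the recogniser emits one label per video frame (or None when no hand is visible), that a letter is committed only after it has been held for a few consecutive frames, and that a gap with no hand marks a word break. The names `frames_to_letters`, `correct_word`, and `assemble`, the `hold` value, and the use of `difflib` for correction are all illustrative choices, not the product's confirmed design.

```python
import difflib


def frames_to_letters(predictions, hold=3):
    """Commit a letter once it has been predicted `hold` frames in a row.

    None means no hand was detected; a held None becomes a word break.
    """
    out = []
    run_label, run_len = object(), 0  # sentinel that matches no prediction
    for p in predictions:
        if p == run_label:
            run_len += 1
        else:
            run_label, run_len = p, 1
        if run_len == hold:  # fires once per run, so a long hold adds one letter
            if p is None:
                if out and out[-1] != " ":
                    out.append(" ")
            else:
                out.append(p)
    return "".join(out).strip()


def correct_word(word, vocabulary, cutoff=0.6):
    """Snap a word to its closest vocabulary entry, or keep it unchanged."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def assemble(predictions, vocabulary, hold=3):
    """Per-frame labels in, corrected transcript out."""
    raw = frames_to_letters(predictions, hold)
    return " ".join(correct_word(w, vocabulary) for w in raw.split())
```

The hold requirement filters out single-frame misclassifications, at the cost of adding `hold` frames of delay before each letter appears. Repeating a letter requires the signer to break the gesture briefly between the two.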
MVP scope
  • Custom letter recognition (~9 gestures)
  • Local Python app, single user
  • English TTS via Sarvam
  • Audio injection via virtual audio device (BlackHole)
  • Demo phrase: "Hi, I am Tia. Hire me."
Non-goals (MVP)
  • Full ISL alphabet and vocabulary
  • Mobile or browser-native experience
Success metrics
  • Gesture recognition accuracy > 90% in varied lighting
  • End-to-end latency < 2 seconds (sign → spoken word)
  • Listener comprehension of signed message > 95% without context
Distribution
Direct-to-user web product. Partnerships with deaf schools and accessibility-focused employers. CSR and government-funded pilots.
Constraints
  • ISL is regionally variant—models must generalise across dialects
  • Latency is a hard ceiling; conversation breaks past ~3 seconds
  • Must run on a standard laptop or phone—no specialist hardware
  • Listener experience must require zero setup
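The two latency figures above (a target under 2 seconds and a ceiling of roughly 3 seconds) can be checked against per-stage timings with a small budget helper. The sketch below is illustrative: the function name `classify_latency`, the stage names, and the example timings are hypothetical placeholders, not measurements from Bayaan.

```python
TARGET_S = 2.0   # success metric: sign -> spoken word in under 2 seconds
CEILING_S = 3.0  # constraint: conversation breaks past roughly 3 seconds


def classify_latency(stage_seconds):
    """Sum per-stage timings and report where the pipeline stands.

    Returns (total_seconds, status, slowest_stage_name).
    """
    total = sum(stage_seconds.values())
    if total < TARGET_S:
        status = "on target"
    elif total <= CEILING_S:
        status = "usable, over target"
    else:
        status = "conversation breaks"
    slowest = max(stage_seconds, key=stage_seconds.get)
    return total, status, slowest


# Placeholder numbers for illustration only, not measured values.
example = {
    "capture": 0.05,
    "recognition": 0.30,
    "tts": 0.90,
    "playback": 0.20,
}
total, status, slowest = classify_latency(example)
```

Reporting the slowest stage alongside the total shows where optimisation effort would pay off first.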

What comes next.

NOW
Proof of concept
Local Python app, 9 letters, English output. Validates the loop: sign → recognition → speech → injection into a live meeting.
V0.1
Production gesture model
Replace KNN with a CNN/transformer trained on the full ISL alphabet. Robust across users, lighting, and dialects.
V0.2
Multilingual output
Chain Sarvam Translate before TTS. The meeting picks its language—Hindi, Bengali, Tamil, Telugu, Marathi, Kannada.
V0.3
Web app
Browser-native, WebRTC-powered. Zero install for the listener.
V1
Meeting platform integrations
Bayaan as a Zoom app, Google Meet add-on, Teams integration. Drops into existing workflows.
V2
Mobile experience
Front-camera signing on iOS and Android. A phone call replacement, not just a meeting tool.
V3
Two-way translation
Speech-to-sign rendered as an avatar or live ISL captioning. The conversation becomes fully symmetric.

What success looks like.

THE FUTURE WE'RE BUILDING TOWARD
A mute candidate sits down for a job interview.
The recruiter on the other end has no idea anything is unusual.
They talk. They listen. They hire.

That's it. That's the whole product.