BAYAAN — to express

THE PROBLEM

A communication barrier built into everyday life.

In India, an entire population communicates through Indian Sign Language. They join meetings, attend interviews, see doctors. The world they're trying to talk to doesn't speak their language.

Every interaction needs a workaround. Type into a phone. Pass a notepad. Find an interpreter. None of these work in a real conversation, and none of them work in a video call.

Deaf & Mute Population

63,000,000

people in India

Indian Sign Language Users

8,435,000

across India

Certified ISL Translators

250

in the entire country

1 translator for every 33,740 ISL users. Source: Ishaara. The supply gap is structural—and unsolvable with humans alone.

THE GAP

What exists, and what doesn't.

Sign language recognition isn't new. Indian academic research spans IIT Roorkee, IIT Delhi, and IIIT Hyderabad. Global companies like SignAll, KinTrans, and Hand Talk exist. But none of them puts a signer's words into a live call as speech.

EXISTS

Sign → Text

Gesture recognition that outputs subtitles. Useful for documentation and learning. Not built for live conversation.

MISSING

Sign → Live Spoken Audio in a Call

No production product injects real-time spoken voice into a live video meeting, in an Indian language, in a natural human voice. This is Bayaan.

WHAT IT IS

Bayaan turns a hand into a voice.

A camera reads a person's hands during a video call. A machine learning model recognises the sign language gestures and assembles them into words. The transcript is sent to Sarvam's text-to-speech, which speaks the message in a natural Indian voice, and that audio is fed directly into the meeting as if the person were speaking themselves.

The participant on the other end hears a voice. Not a notification. Not a subtitle. A voice.

SYSTEM FLOW

How it works—proof of concept vs production.

The MVP proves the loop closes on a single laptop with a small vocabulary. The production architecture is what scales this to a product used in real meetings.

PROOF OF CONCEPT — what's running today

A Python application running locally. MediaPipe extracts 21 hand landmarks from each webcam frame, and a K-Nearest Neighbours classifier labels the gesture from inter-joint angle features. The vocabulary is small: about 9 letter gestures and one demo phrase.

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ Webcam          │ ──▶ │ MediaPipe       │ ──▶ │ Feature         │
│ live video      │     │ 21 hand         │     │ extraction      │
│ frame stream    │     │ landmarks       │     │ joint angles    │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
┌─────────────────┐     ┌─────────────────┐     ┌────────▼────────┐
│ Audio playback  │ ◀── │ Sarvam TTS      │ ◀── │ KNN             │
│ local speakers  │     │ text → speech   │     │ classifier      │
│ via BlackHole   │     │ Indian English  │     │ 9 gestures      │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         ▼ injected into meeting via virtual audio device
         ▼ proves the loop closes end-to-end
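
The feature extraction and KNN steps in the proof of concept can be sketched in a few lines. This is a minimal illustration, not the code running in the demo: it assumes landmarks arrive as 21 (x, y, z) tuples in MediaPipe's hand ordering (0 is the wrist, then four points per finger), and the function names `joint_angle`, `angle_features`, and `knn_predict` are made up for this example.

```python
import math
from collections import Counter

# MediaPipe hand ordering: 0 is the wrist, then four points per finger.
FINGER_CHAINS = [
    [0, 1, 2, 3, 4],      # thumb
    [0, 5, 6, 7, 8],      # index
    [0, 9, 10, 11, 12],   # middle
    [0, 13, 14, 15, 16],  # ring
    [0, 17, 18, 19, 20],  # pinky
]


def joint_angle(a, b, c):
    """Angle in radians at point b, between segments b->a and b->c."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    if n1 == 0 or n2 == 0:
        return 0.0  # degenerate joint: two landmarks coincide
    cos = sum(x * y for x, y in zip(v1, v2)) / (n1 * n2)
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp rounding error


def angle_features(landmarks):
    """Map 21 (x, y, z) landmarks to 15 joint angles, 3 per finger."""
    if len(landmarks) != 21:
        raise ValueError("expected 21 hand landmarks")
    feats = []
    for chain in FINGER_CHAINS:
        for i in range(1, len(chain) - 1):
            a, b, c = (landmarks[chain[j]] for j in (i - 1, i, i + 1))
            feats.append(joint_angle(a, b, c))
    return feats


def knn_predict(train, query, k=3):
    """Majority vote among the k training samples nearest to query.

    train is a list of (features, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Angles are a reasonable choice for a small training set because they do not change when the hand moves closer to the camera, shifts in the frame, or rotates, so the classifier only has to learn the shape of the hand.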

PRODUCTION — what it needs to become

A browser-based product. The signer joins a normal video call. Recognition runs in the browser via a production model trained on thousands of ISL signers. Sarvam handles translation and speech. Audio streams into the meeting as a participant—no plugins on the listener's side.

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ Browser         │ ──▶ │ Gesture         │ ──▶ │ Word & Phrase   │
│ front camera    │     │ model           │     │ assembly        │
│ stream          │     │ full ISL        │     │ + spell check   │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
┌─────────────────┐     ┌─────────────────┐     ┌────────▼────────┐
│ Sarvam TTS      │ ◀── │ Sarvam          │ ◀── │ Meeting         │
│ regional        │     │ Translate       │     │ language        │
│ voice           │     │ EN → HI/BN/TA   │     │ selected once   │
└────────┬────────┘     └─────────────────┘     └─────────────────┘
         │
┌────────▼────────┐     ┌─────────────────┐
│ WebRTC          │ ──▶ │ Live meeting    │
│ audio stream    │     │ all participants│
│ to peers        │     │ hear the voice  │
└─────────────────┘     └─────────────────┘

USER FLOW

The experience for the signer and the listener.

The point isn't the technology. The point is that the conversation feels normal.

PROOF OF CONCEPT — single user demo

SIGNER
│
├─ opens Bayaan locally
├─ joins Google Meet in parallel window
├─ routes audio output to virtual mic (BlackHole)
├─ signs into webcam
│
└─▶ Bayaan recognises gesture → builds transcript
    │
    └─▶ Sarvam speaks aloud
        │
        └─▶ audio reaches Meet as the signer's "mic"

LISTENER
│
└─ hears a voice in the call. Conversation continues.

PRODUCTION — natural meeting experience

SIGNER
│
├─ joins a Bayaan-enabled meeting (or a Meet/Zoom integration)
├─ picks the meeting's spoken language (English, Hindi, Bengali, Tamil…)
├─ turns on camera and signs naturally
│
└─▶ Bayaan runs invisibly in the browser
    │
    ├─▶ transcript appears as live captions
    └─▶ spoken audio streams to all participants

LISTENERS (everyone else in the call)
└─ hear the signer's voice in the meeting's chosen language.
   No plugins. No setup. Just a voice on the call.

PRD · PRODUCT REQUIREMENTS

The specification.

Problem

ISL users cannot participate in spoken conversations—particularly in virtual meetings—without an interpreter or constant typing. There are 250 certified translators for 8.4 million ISL users.

Primary user

An ISL signer in a virtual meeting—a job interview, a class, a doctor's consultation, a standup—where the other participants do not understand sign language.

Secondary user

Hearing participants who need to understand and respond to the signer in real time, in the meeting's chosen language.

Core value

Conversation, not transcription. The signer is heard, not just read. The interaction looks and feels like any other voice call.

Key features

Real-time gesture recognition from a standard webcam
Word-by-word transcript assembly with spell correction
TTS output in the meeting's chosen Indian language via Sarvam
Native integration with Meet, Zoom, Teams—or a Bayaan-hosted room
Captioning and audio simultaneously
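
One way the transcript assembly and spell correction features could fit together is sketched below. It is an illustration under assumptions, not the shipped design: per-frame predictions are debounced into letters, a hypothetical "_" label stands in for a word-break gesture, and correction uses the standard library's `difflib.get_close_matches` against a small vocabulary. The `hold` and `cutoff` values are placeholders.

```python
import difflib


def debounce(frame_labels, hold=5):
    """Emit a letter once it has been predicted for `hold` frames in a row.

    None means no confident prediction and resets the run, so a letter is
    emitted once per run however long the sign is held.
    """
    letters = []
    current, run = None, 0
    for label in frame_labels:
        if label == current:
            run += 1
        else:
            current, run = label, 1
        if current is not None and run == hold:
            letters.append(current)
    return letters


def correct_word(word, vocabulary, cutoff=0.6):
    """Snap a spelled word to its closest vocabulary entry, if any."""
    word = word.lower()
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def assemble(letters, vocabulary, space="_"):
    """Join letters into words, split on the word-break label, and correct."""
    words = "".join(letters).split(space)
    return " ".join(correct_word(w, vocabulary) for w in words if w)
```

For example, `assemble(list("HI_HIER_ME"), ["hi", "hire", "me", "tia"])` repairs the transposed letters in the second word before the text reaches speech synthesis.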

MVP scope

Custom letter recognition (~9 gestures)
Local Python app, single user
English TTS via Sarvam
Audio injection via virtual audio device (BlackHole)
Demo phrase: "Hi, I am Tia. Hire me."
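
The audio injection step depends on playing speech to the virtual device rather than the laptop speakers. A minimal sketch of the device lookup follows. It assumes the device list is a sequence of dictionaries with `name` and `max_output_channels` keys, which is the shape that a library such as python-sounddevice reports from `query_devices()`. The function name `find_output_device` is hypothetical.

```python
def find_output_device(devices, name_hint="BlackHole"):
    """Return the index of the first output-capable device matching name_hint.

    devices: sequence of dicts with "name" and "max_output_channels" keys
    (an assumed shape, see the note above). Returns None when nothing matches.
    """
    hint = name_hint.lower()
    for index, dev in enumerate(devices):
        if hint in dev["name"].lower() and dev["max_output_channels"] > 0:
            return index
    return None


# Example device list, invented for illustration.
example_devices = [
    {"name": "MacBook Pro Microphone", "max_output_channels": 0},
    {"name": "MacBook Pro Speakers", "max_output_channels": 2},
    {"name": "BlackHole 2ch", "max_output_channels": 2},
]
blackhole_index = find_output_device(example_devices)
```

The meeting app then selects the same virtual device as its microphone, so whatever Bayaan plays is what the call hears.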

Non-goals (MVP)

Full ISL alphabet and vocabulary
Mobile or browser-native experience

Success metrics

Gesture recognition accuracy > 90% in varied lighting
End-to-end latency < 2 seconds (sign → spoken word)
Listener comprehension of signed message > 95% without context
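
The latency metric is easier to hold if each stage is timed separately. The sketch below is one possible way to do that, with assumed names (`LatencyBudget`, `stage`, `verdict`): it sums per-stage timings and compares the total with the 2 second target above and the roughly 3 second conversational ceiling listed under Constraints.

```python
import time
from contextlib import contextmanager

TARGET_S = 2.0   # success metric: sign to spoken word
CEILING_S = 3.0  # constraint: conversation breaks past about 3 seconds


class LatencyBudget:
    """Accumulates wall-clock time per pipeline stage for one utterance."""

    def __init__(self):
        self.stages = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.stages[name] = self.stages.get(name, 0.0) + elapsed

    def total(self):
        return sum(self.stages.values())

    def verdict(self):
        total = self.total()
        if total < TARGET_S:
            return "ok"
        if total <= CEILING_S:
            return "over_target"
        return "broken"
```

In use, each step is wrapped as `with budget.stage("recognition"): ...`, and the `stages` dictionary shows which step is consuming the budget.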

Distribution

Direct-to-user web product. Partnerships with deaf schools and accessibility-focused employers. CSR and government-funded pilots.

Constraints

ISL is regionally variant—models must generalise across dialects
Latency is a hard ceiling; conversation breaks past ~3 seconds
Must run on a standard laptop or phone—no specialist hardware
Listener experience must require zero setup

ROADMAP

What comes next.

NOW

Proof of concept

Local Python app, ~9 letter gestures, English output. Validates the loop: sign → recognition → speech → injection into a live meeting.

V0.1

Production gesture model

Replace KNN with a CNN/transformer trained on the full ISL alphabet. Robust across users, lighting, and dialects.

V0.2

Multilingual output

Chain Sarvam Translate before TTS. The meeting picks its language—Hindi, Bengali, Tamil, Telugu, Marathi, Kannada.
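
The chaining can be sketched independently of any vendor SDK. In the example below, `translate` and `synthesize` are hypothetical callables standing in for the translation and speech services. They are not Sarvam's actual API, and the language codes are placeholders.

```python
from typing import Callable

Translate = Callable[[str, str, str], str]   # (text, source, target) -> text
Synthesize = Callable[[str, str], bytes]     # (text, language) -> audio


def speak(
    text: str,
    meeting_lang: str,
    translate: Translate,
    synthesize: Synthesize,
    source_lang: str = "en",
) -> bytes:
    """Translate only when the meeting language differs, then synthesize."""
    if meeting_lang != source_lang:
        text = translate(text, source_lang, meeting_lang)
    return synthesize(text, meeting_lang)
```

Skipping translation when the meeting is already in the source language saves one network round trip, which matters given the latency ceiling.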

V0.3

Web app

Browser-native, WebRTC-powered. Zero install for the listener.

V1

Meeting platform integrations

Bayaan as a Zoom app, Google Meet add-on, Teams integration. Drops into existing workflows.

V2

Mobile experience

Front-camera signing on iOS and Android. A phone call replacement, not just a meeting tool.

V3

Two-way translation

Speech-to-sign rendered as an avatar or live ISL captioning. The conversation becomes fully symmetric.

END STATE

What success looks like.

THE FUTURE WE'RE BUILDING TOWARD

A mute candidate sits down for a job interview.
The recruiter on the other end has no idea anything is unusual.
They talk. They listen. They hire.

That's it. That's the whole product.