RelAI
Github: GCaggianese/RelAI
Real-time speech translation in the browser using OpenAI Realtime Translation, WebRTC, TypeScript, Vite, and a FastAPI ephemeral-token backend.
RelAI is a browser-based MVP for live speech translation.
It captures microphone audio with getUserMedia, sends it to OpenAI through a WebRTC peer connection, receives translated speech as a remote audio track, and displays both source and translated transcripts from realtime data-channel events.
The backend is intentionally small: it only creates ephemeral client secrets, so the long-lived OpenAI API key never needs to be exposed to the browser.
Status
Working MVP.
Implemented:
- Browser microphone capture
- WebRTC session setup with OpenAI Realtime Translation
- Translated audio playback
- Source transcript subtitles
- Translated transcript subtitles
- FastAPI backend for ephemeral client secrets
- Basic session lifecycle handling
- WebRTC connection-state logging
- Firefox/Zen compatibility warning
Deferred:
- Full interpreter mode
- Mobile wrapper
- Production deployment
- Persistent session history
- Advanced audio routing
- Authentication / user accounts
RelAI is not a polished product yet. It is a working prototype focused on validating the realtime speech translation loop end-to-end.
Why this project exists
Most AI translation demos hide the interesting parts behind a normal request/response API.
RelAI explores the lower-level path:
- live microphone streaming
- WebRTC offer/answer exchange
- browser media permissions
- remote translated audio playback
- transcript deltas over a data channel
- ephemeral browser credentials
- browser-specific WebRTC behavior
- failure handling for unstable realtime sessions
The interesting problem is not “call an AI API and translate text”.
The interesting problem is building a realtime browser audio pipeline where speech, translation, playback, subtitles, credentials, and WebRTC state all have to cooperate inside one live session.
Architecture
Browser
├── getUserMedia()
│   └── microphone audio track
│
├── RTCPeerConnection
│   ├── sends microphone audio to OpenAI
│   ├── receives translated audio track
│   └── creates DataChannel "oai-events"
│
├── HTMLAudioElement
│   └── plays translated remote audio stream
│
└── DataChannel events
    ├── session.input_transcript.delta
    │   └── source subtitles
    └── session.output_transcript.delta
        └── translated subtitles

FastAPI backend
└── POST /session
    └── creates ephemeral OpenAI Realtime Translation client secret
How it works
1. The browser captures microphone audio
The frontend requests microphone access through navigator.mediaDevices.getUserMedia().
The current audio constraints enable:
- echo cancellation
- noise suppression
- automatic gain control
The resulting microphone track is added directly to a WebRTC peer connection.
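A minimal sketch of the capture step in TypeScript (the constraint values mirror the list above; the exact code in translator.ts may differ):

// Sketch of the microphone capture step.
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true,
  },
});

// The captured audio track is what gets added to the peer connection.
const micTrack = stream.getAudioTracks()[0];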
2. The backend creates an ephemeral client secret
The frontend does not use the long-lived OpenAI API key.
Instead, it calls the local backend:
POST /session
with:
{
  "targetLanguage": "en"
}
The FastAPI server then calls OpenAI’s realtime translation client-secret endpoint using OPENAI_API_KEY from the server environment.
The browser receives only the ephemeral client secret.
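From the browser side this is an ordinary fetch; the response field name used here (client_secret) is an assumption about the backend's JSON shape, not a documented contract:

// Ask the local FastAPI backend for an ephemeral client secret.
async function fetchEphemeralSecret(targetLanguage: string): Promise<string> {
  const res = await fetch('/session', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ targetLanguage }),
  });
  if (!res.ok) {
    throw new Error(`Backend /session failed: ${res.status}`);
  }
  const data = await res.json();
  return data.client_secret; // hypothetical field name
}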
3. The frontend performs the WebRTC exchange
The frontend:
- Creates an RTCPeerConnection.
- Adds the microphone audio track.
- Creates the oai-events data channel.
- Generates an SDP offer.
- Sends that SDP offer to OpenAI using the ephemeral client secret.
- Receives the SDP answer.
- Sets the remote description.
- Starts receiving translated audio and transcript events.
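A condensed sketch of that sequence; the OpenAI endpoint URL and header details below are assumptions and may not match what translator.ts actually uses:

// Sketch of the WebRTC offer/answer exchange.
// OPENAI_REALTIME_URL is a placeholder for the real endpoint.
const OPENAI_REALTIME_URL = 'https://api.openai.com/v1/realtime';

async function connect(micTrack: MediaStreamTrack, ephemeralSecret: string) {
  const pc = new RTCPeerConnection();
  pc.addTrack(micTrack);

  // Realtime events (transcript deltas) will arrive on this channel.
  const events = pc.createDataChannel('oai-events');

  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  // Send the SDP offer to OpenAI, authenticated with the ephemeral secret.
  const res = await fetch(OPENAI_REALTIME_URL, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${ephemeralSecret}`,
      'Content-Type': 'application/sdp',
    },
    body: offer.sdp,
  });

  const answerSdp = await res.text();
  await pc.setRemoteDescription({ type: 'answer', sdp: answerSdp });

  return { pc, events };
}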
4. Translated audio is played as a remote track
When OpenAI returns a remote media stream, RelAI attaches it to an HTMLAudioElement and plays the translated audio in the browser.
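Continuing the sketch above, the playback hookup is roughly (the element id is illustrative):

// Attach the remote translated audio to an <audio> element for playback.
const audioEl = document.querySelector<HTMLAudioElement>('#translated-audio')!;

pc.ontrack = (event) => {
  // The remote stream carries the translated speech.
  audioEl.srcObject = event.streams[0];
  audioEl.play().catch((err) => console.warn('Autoplay blocked:', err));
};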
5. Subtitles arrive as realtime deltas
The data channel receives realtime events.
RelAI currently consumes:
session.input_transcript.delta
session.output_transcript.delta
Those deltas are appended live into the UI as source and translated subtitles.
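A sketch of the delta handling on the same data channel; the element ids and the delta field name in the event payload are assumptions:

// Append transcript deltas to the subtitle elements as they arrive.
const sourceEl = document.querySelector('#source-subtitles')!;
const targetEl = document.querySelector('#translated-subtitles')!;

events.onmessage = (msg) => {
  const event = JSON.parse(msg.data);
  switch (event.type) {
    case 'session.input_transcript.delta':
      sourceEl.insertAdjacentText('beforeend', event.delta); // 'delta' field is assumed
      break;
    case 'session.output_transcript.delta':
      targetEl.insertAdjacentText('beforeend', event.delta);
      break;
  }
};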
Modes
Translate mode
Currently active.
microphone speech -> translated audio + source/target subtitles
The UI shows:
- source transcript
- translated transcript
- translated audio playback
The target language selector controls the output language sent to the backend.
The source-language selector is currently UI-only; the realtime translation model handles the source speech regardless of the selection.
Interpreter mode
Interpreter mode was the original planned second mode.
The design was:
Session A -> translate into language A -> left ear
Session B -> translate into language B -> right ear
The goal was to support live bilingual interpretation with two parallel translation sessions and stereo panning.
The HTML still contains the interpreter-mode UI skeleton, but the current application intentionally disables it while the single-session translation path is stabilized.
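If interpreter mode returns, the stereo separation could be built on the Web Audio API. A rough sketch, not part of the current code:

// Hypothetical routing for interpreter mode: pan one translated stream
// hard left and the other hard right. Not wired into the current app.
function panToEar(stream: MediaStream, ctx: AudioContext, pan: -1 | 1) {
  const source = ctx.createMediaStreamSource(stream);
  const panner = new StereoPannerNode(ctx, { pan });
  source.connect(panner).connect(ctx.destination);
}

// Usage: session A into the left ear, session B into the right ear.
// panToEar(streamA, audioCtx, -1);
// panToEar(streamB, audioCtx, 1);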
Browser compatibility
Chromium-based browsers are recommended for the current MVP.
Observed during local testing:
- Chromium: stable
- Firefox / Zen Browser: may disconnect after a short time
RelAI detects Firefox-family browsers and displays a compatibility warning.
The suspected cause lies in browser-specific WebRTC behavior rather than the UI layer. The code includes WebRTC connection-state logging and a short grace period before treating soft disconnects as fatal.
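The grace-period idea, in sketch form (the timeout value and the endSessionWithError helper are illustrative, not the exact code):

// Log connection-state changes and give soft disconnects a short grace
// period before treating them as fatal.
declare function endSessionWithError(): void; // hypothetical helper

let disconnectTimer: number | undefined;

pc.onconnectionstatechange = () => {
  console.log('WebRTC connection state:', pc.connectionState);

  if (pc.connectionState === 'disconnected') {
    // Firefox/Zen sometimes drops to 'disconnected' briefly; wait before failing.
    disconnectTimer = window.setTimeout(() => endSessionWithError(), 3000);
  } else if (pc.connectionState === 'connected' && disconnectTimer !== undefined) {
    clearTimeout(disconnectTimer);
    disconnectTimer = undefined;
  } else if (pc.connectionState === 'failed') {
    endSessionWithError();
  }
};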
Stack
Frontend:
- Vite
- TypeScript
- WebRTC
- browser media APIs
- vanilla DOM UI
- CSS
Backend:
- FastAPI
- httpx
- python-dotenv
- Uvicorn
Repository layout
.
├── app
│   ├── index.html
│   ├── package.json
│   ├── package-lock.json
│   ├── src
│   │   ├── main.ts
│   │   ├── style.css
│   │   └── translator.ts
│   ├── tsconfig.json
│   └── vite.config.ts
├── server
│   ├── main.py
│   └── requirements.txt
├── README.md
└── LICENSE
Requirements
- Python 3
- Node.js + npm
- OpenAI API key with access to realtime translation
- A Chromium-based browser is recommended for testing
Running locally
1. Backend
cd server
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Create server/.env:
OPENAI_API_KEY=sk-...
OPENAI_SAFETY_IDENTIFIER=local-dev-user
Run the backend:
uvicorn main:app --reload
The backend runs on:
http://localhost:8000
2. Frontend
In another terminal:
cd app
npm install
npm run dev
Open:
http://localhost:5173
The Vite dev server proxies:
/session -> http://localhost:8000
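The relevant part of vite.config.ts looks roughly like this:

// vite.config.ts (sketch): forward /session to the FastAPI backend in dev.
import { defineConfig } from 'vite';

export default defineConfig({
  server: {
    proxy: {
      '/session': 'http://localhost:8000',
    },
  },
});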
Build
Frontend build:
cd app
npm run build
Preview production build locally:
npm run preview
Security notes
The browser never receives the long-lived OpenAI API key.
The credential flow is:
server/.env
↓
FastAPI /session
↓
OpenAI client-secret endpoint
↓
ephemeral browser secret
↓
WebRTC SDP exchange
The backend logs session metadata for debugging, but intentionally does not print the ephemeral secret value.
Known limitations
- Translate mode is the only active mode.
- Interpreter mode is present in the UI skeleton but disabled.
- Firefox/Zen may drop the WebRTC connection a few seconds into a session.
- Error recovery is basic.
- No production auth.
- No deployment config.
- No mobile wrapper.
- Source-language selection is not yet wired into the backend payload.
- The UI is intended only for local MVP testing.
Future work
Possible next steps:
- Re-enable interpreter mode after single-session stability improves.
- Add explicit Web Audio routing and stereo panning (interpreter mode).
- Improve reconnect behavior.
- Document the Firefox/Gecko WebRTC failure mode more precisely (or try to solve it).