Building Real-time Voice Communication with OpenAI using WebRTC and Ephemeral Keys

Andrew Erie
Your Tech Partner
Introduction
OpenAI's Realtime API offers an incredible opportunity to build voice-powered AI applications with ultra-low latency. However, implementing it securely in a client-side application presents a challenge: how do you connect to OpenAI without exposing your API key?
In this post, I'll walk through how I implemented a secure, real-time voice communication system using WebRTC and ephemeral keys. This approach enables direct peer-to-peer connections between the client and OpenAI's servers, bypassing the need to proxy audio through your backend while maintaining security.
The Challenge
When building voice applications with AI, you typically face these challenges:
- Security: You can't expose your OpenAI API key in client-side code
- Latency: Proxying audio through your server adds significant delay
- Bandwidth: Streaming audio through your server is expensive
- Complexity: Managing WebSocket connections and audio streaming is complex
The Solution: WebRTC with Ephemeral Keys
The solution leverages two key technologies:
- Ephemeral Keys: Temporary, limited-scope API keys generated server-side
- WebRTC: Direct peer-to-peer connection between client and OpenAI
Here's the high-level flow:
1. Client requests an ephemeral key from your backend
2. Backend uses your OpenAI API key to generate a temporary key
3. Client uses the ephemeral key to establish a WebRTC connection directly with OpenAI
4. Audio streams directly between client and OpenAI - your server is completely out of the loop
Implementation
Backend: Generating Ephemeral Keys
The backend implementation is surprisingly simple. You need just one endpoint that generates ephemeral keys:
// server/src/api/routes.ts
router.post('/realtime/session', async (req, res) => {
  try {
    const { voice = 'alloy', model = 'gpt-4o-mini-realtime-preview' } = req.body

    if (!process.env.OPENAI_API_KEY) {
      return res.status(500).json({ error: 'OpenAI API key not configured' })
    }

    // Generate ephemeral token from OpenAI
    const response = await fetch('https://api.openai.com/v1/realtime/sessions', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        model,
        voice,
      }),
    })

    if (!response.ok) {
      const error = await response.text()
      console.error('Failed to create ephemeral token:', error)
      return res.status(response.status).json({ error: 'Failed to create ephemeral token' })
    }

    const data = await response.json()
    res.json(data)
  } catch (error) {
    console.error('Error creating ephemeral token:', error)
    res.status(500).json({ error: 'Failed to create ephemeral token' })
  }
})
This endpoint:
- Accepts the desired voice and model configuration
- Uses your server-side OpenAI API key to request an ephemeral key
- Returns the ephemeral key to the client
The ephemeral key has limited permissions and expires quickly, making it safe to send to the client.
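For reference, the session object returned by this endpoint looks roughly like the sketch below. The field names match what the client code reads later in the post (`client_secret.value`), but treat the full shape as an approximation rather than a complete schema:

```typescript
// Approximate shape of the /v1/realtime/sessions response. This is a sketch
// based on the fields the client actually uses; the real payload has more.
interface RealtimeSessionResponse {
  id: string
  model: string
  voice: string
  client_secret: {
    value: string      // the ephemeral key the client will use
    expires_at: number // Unix timestamp; the key expires shortly after issue
  }
}

// Client-side helper: pull the ephemeral key out of the backend's response
function extractEphemeralKey(data: RealtimeSessionResponse): string {
  return data.client_secret.value
}
```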
Frontend: WebRTC Connection
The frontend implementation involves creating a WebRTC peer connection and using the ephemeral key for authentication:
// client/src/services/webrtc.ts
export class WebRTCService extends EventEmitter {
  private pc?: RTCPeerConnection
  private dataChannel?: RTCDataChannel
  private audioStream?: MediaStream

  async connect(config: WebRTCConfig): Promise<void> {
    // Create peer connection with STUN server
    this.pc = new RTCPeerConnection({
      iceServers: [{ urls: 'stun:stun.l.google.com:19302' }],
    })

    // Create data channel for real-time events
    this.dataChannel = this.pc.createDataChannel('oai-events', {
      ordered: true,
    })

    // Get user's microphone
    this.audioStream = await navigator.mediaDevices.getUserMedia({
      audio: {
        echoCancellation: true,
        noiseSuppression: true,
        autoGainControl: true,
        sampleRate: 24000,
        channelCount: 1,
      },
    })

    // Add audio track to peer connection
    const audioTrack = this.audioStream.getAudioTracks()[0]
    this.pc.addTrack(audioTrack, this.audioStream)

    // Handle incoming audio from OpenAI
    this.pc.ontrack = (event) => {
      if (event.track.kind === 'audio') {
        this.handleIncomingAudio(event.streams[0])
      }
    }

    // Create offer and exchange SDP with OpenAI
    const offer = await this.pc.createOffer()
    await this.pc.setLocalDescription(offer)

    // Exchange SDP with OpenAI using ephemeral key
    const response = await fetch(`https://api.openai.com/v1/realtime?model=${config.model}`, {
      method: 'POST',
      body: offer.sdp,
      headers: {
        'Authorization': `Bearer ${config.ephemeralKey}`,
        'Content-Type': 'application/sdp',
      },
    })

    const answerSdp = await response.text()
    await this.pc.setRemoteDescription({
      type: 'answer',
      sdp: answerSdp,
    })
  }
}
Establishing the Connection
The connection flow in the React hook demonstrates how everything comes together:
// client/src/hooks/useRealtimeSession.ts
const connect = useCallback(async (communicationMode: 'voice-to-voice') => {
  try {
    // Step 1: Get ephemeral token from your backend
    const response = await fetch('http://localhost:8080/api/realtime/session', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        voice: settings.openai.voice,
        model: settings.openai.model,
      }),
    })

    if (!response.ok) {
      throw new Error('Failed to create realtime session')
    }

    const data = await response.json()
    const ephemeralKey = data.client_secret.value

    // Step 2: Connect via WebRTC using the ephemeral key
    await webrtcService.connect({
      ephemeralKey,
      model: settings.openai.model,
      voice: settings.openai.voice,
      audioStream: mediaStreamRef.current,
    })

    // Connection established! Audio now flows directly between client and OpenAI
  } catch (error) {
    console.error('Failed to connect:', error)
  }
}, [settings])
Handling Real-time Events
Once connected, the data channel provides real-time events for transcriptions, responses, and function calls:
private setupDataChannel(): void {
  if (!this.dataChannel) return

  this.dataChannel.onopen = () => {
    // Send initial session configuration
    const config = {
      type: 'session.update',
      session: {
        modalities: ['text', 'audio'],
        voice: this.voice,
        input_audio_transcription: {
          model: 'whisper-1'
        },
        turn_detection: {
          type: 'server_vad',
          threshold: 0.5,
          prefix_padding_ms: 300,
          silence_duration_ms: 1000
        }
      }
    };
    this.sendMessage(config);
  };

  this.dataChannel.onmessage = (event) => {
    const message = JSON.parse(event.data);

    switch (message.type) {
      case 'conversation.item.input_audio_transcription.completed':
        // User's speech was transcribed
        this.emit('transcription', {
          transcript: message.transcript,
          item_id: message.item_id
        });
        break;

      case 'response.audio_transcript.delta':
        // Assistant's response text (real-time)
        this.emit('assistantTranscriptDelta', {
          delta: message.delta,
          item_id: message.item_id
        });
        break;

      case 'response.audio.delta':
        // Assistant's audio response
        if (message.delta) {
          const audioData = base64ToArrayBuffer(message.delta);
          this.emit('assistantAudioDelta', audioData);
        }
        break;
    }
  };
}
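This handler leans on two helpers that aren't shown in the post: `sendMessage` and `base64ToArrayBuffer`. Plausible implementations (assumptions, not the original code) look like this:

```typescript
// Decode the base64 payload carried in response.audio.delta events into raw
// audio bytes. atob is available in browsers and in Node 18+.
function base64ToArrayBuffer(base64: string): ArrayBuffer {
  const binary = atob(base64)
  const bytes = new Uint8Array(binary.length)
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i)
  }
  return bytes.buffer
}

// Serialize an event object and send it over the data channel. The structural
// parameter type keeps this testable without a real RTCDataChannel.
function sendMessage(
  channel: { readyState: string; send(data: string): void },
  message: object,
): void {
  if (channel.readyState === 'open') {
    channel.send(JSON.stringify(message))
  }
}
```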
Key Benefits
1. Ultra-Low Latency
Audio streams directly between the client and OpenAI servers. There's no intermediate hop through your backend, resulting in the lowest possible latency.
2. Reduced Server Load
Your server only handles the initial ephemeral key generation. All audio processing happens client-side and on OpenAI's infrastructure.
3. Enhanced Security
- Your API key never leaves the server
- Ephemeral keys have limited scope and expire quickly
- Each session gets its own unique key
4. Simplified Architecture
No need to implement complex WebSocket proxying or audio streaming on your backend. The WebRTC connection handles all the real-time communication.
Practical Tips
1. Handle Connection States Properly
this.pc.onconnectionstatechange = () => {
  const state = this.pc.connectionState
  if (state === 'connected') {
    this.isConnected = true
    this.emit('connected')
  } else if (state === 'failed' || state === 'closed') {
    this.isConnected = false
    this.emit('disconnected')
  }
}
2. Implement Proper Audio Cleanup
async disconnect(): Promise<void> {
  // Stop audio tracks
  if (this.audioStream) {
    this.audioStream.getTracks().forEach(track => track.stop());
  }

  // Close data channel
  if (this.dataChannel) {
    this.dataChannel.close();
  }

  // Close peer connection
  if (this.pc) {
    this.pc.close();
  }
}
3. Configure Audio Settings for Quality
const audioConstraints = {
  echoCancellation: true,
  noiseSuppression: true,
  autoGainControl: true,
  sampleRate: 24000, // OpenAI expects 24kHz
  channelCount: 1,   // Mono audio
}
4. Handle Network Interruptions
Implement reconnection logic to handle network drops gracefully. The ephemeral key approach makes this straightforward - just request a new key and reconnect.
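A minimal version of that reconnection loop might look like the sketch below. The `connect` callback stands in for "request a new ephemeral key, then call `webrtcService.connect`" (the names here are illustrative, not from the implementation above):

```typescript
// Exponential backoff: 500ms, 1s, 2s, 4s, ... capped at 10s, so retries stay
// polite during longer outages.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 10_000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs)
}

// Retry the full connect flow (fresh ephemeral key + new WebRTC session)
// until it succeeds or attempts run out. Returns true on success.
async function reconnectWithBackoff(
  connect: () => Promise<void>,
  maxAttempts = 5,
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await connect()
      return true
    } catch {
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)))
    }
  }
  return false
}
```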
Conclusion
Using WebRTC with ephemeral keys provides an elegant solution for building real-time voice applications with OpenAI. It combines the security of server-side API key management with the performance benefits of direct client-to-server communication.
This approach has enabled me to build a responsive voice assistant with minimal latency while keeping the implementation surprisingly simple. The combination of WebRTC's proven real-time capabilities and OpenAI's powerful AI creates an excellent foundation for voice-powered applications.
The key insight is that by leveraging ephemeral keys, we can safely move the real-time communication to the edge (the client) where it belongs, while maintaining security through temporary, scoped credentials. This pattern could be applied to many other real-time AI services as they emerge.