Building Real-time Voice Communication with OpenAI using WebRTC and Ephemeral Keys

Andrew Erie

Your Tech Partner

7 min read
WebRTC, OpenAI, Real-time, Voice, TypeScript

Introduction

OpenAI's Realtime API offers an incredible opportunity to build voice-powered AI applications with ultra-low latency. However, implementing it securely in a client-side application presents a challenge: how do you connect to OpenAI without exposing your API key?

In this post, I'll walk through how I implemented a secure, real-time voice communication system using WebRTC and ephemeral keys. This approach enables direct peer-to-peer connections between the client and OpenAI's servers, bypassing the need to proxy audio through your backend while maintaining security.

The Challenge

When building voice applications with AI, you typically face these challenges:

  1. Security: You can't expose your OpenAI API key in client-side code
  2. Latency: Proxying audio through your server adds significant delay
  3. Bandwidth: Streaming audio through your server is expensive
  4. Complexity: Managing WebSocket connections and audio streaming is complex

The Solution: WebRTC with Ephemeral Keys

The solution leverages two key technologies:

  1. Ephemeral Keys: Temporary, limited-scope API keys generated server-side
  2. WebRTC: Direct peer-to-peer connection between client and OpenAI

Here's the high-level flow:

  1. Client requests an ephemeral key from your backend
  2. Backend uses your OpenAI API key to generate a temporary key
  3. Client uses the ephemeral key to establish a WebRTC connection directly with OpenAI
  4. Audio streams directly between client and OpenAI - your server is completely out of the loop

Implementation

Backend: Generating Ephemeral Keys

The backend implementation is surprisingly simple. You need just one endpoint that generates ephemeral keys:

// server/src/api/routes.ts
router.post('/realtime/session', async (req, res) => {
	try {
		const { voice = 'alloy', model = 'gpt-4o-mini-realtime-preview' } = req.body
 
		if (!process.env.OPENAI_API_KEY) {
			return res.status(500).json({ error: 'OpenAI API key not configured' })
		}
 
		// Generate ephemeral token from OpenAI
		const response = await fetch('https://api.openai.com/v1/realtime/sessions', {
			method: 'POST',
			headers: {
				'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
				'Content-Type': 'application/json',
			},
			body: JSON.stringify({
				model,
				voice,
			}),
		})
 
		if (!response.ok) {
			const error = await response.text()
			console.error('Failed to create ephemeral token:', error)
			return res.status(response.status).json({ error: 'Failed to create ephemeral token' })
		}
 
		const data = await response.json()
		res.json(data)
	} catch (error) {
		console.error('Error creating ephemeral token:', error)
		res.status(500).json({ error: 'Failed to create ephemeral token' })
	}
})

This endpoint:

  • Accepts the desired voice and model configuration
  • Uses your server-side OpenAI API key to request an ephemeral key
  • Returns the ephemeral key to the client

The ephemeral key has limited permissions and expires quickly, making it safe to send to the client.
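For reference, the session response forwarded to the client looks roughly like this. The shape below is a sketch modeling only the fields this integration actually reads (the client code later extracts `data.client_secret.value`); treat the exact structure as an assumption that may change with the API:

```typescript
// Hypothetical shape of the /v1/realtime/sessions response; only the
// fields this integration relies on are modeled here.
interface RealtimeSessionResponse {
	client_secret: {
		value: string      // the ephemeral key handed to the client
		expires_at: number // Unix timestamp; ephemeral keys expire within minutes
	}
	model?: string
	voice?: string
}

// Extract the ephemeral key, failing loudly if the shape is unexpected.
function getEphemeralKey(data: RealtimeSessionResponse): string {
	if (!data.client_secret?.value) {
		throw new Error('Session response missing client_secret.value')
	}
	return data.client_secret.value
}
```

Validating the shape up front gives you a clear error at session creation instead of a confusing failure later during the SDP exchange.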

Frontend: WebRTC Connection

The frontend implementation involves creating a WebRTC peer connection and using the ephemeral key for authentication:

// client/src/services/webrtc.ts
export class WebRTCService extends EventEmitter {
	private pc?: RTCPeerConnection
	private dataChannel?: RTCDataChannel
	private audioStream?: MediaStream
 
	async connect(config: WebRTCConfig): Promise<void> {
		// Create peer connection with STUN server
		this.pc = new RTCPeerConnection({
			iceServers: [{ urls: 'stun:stun.l.google.com:19302' }],
		})
 
		// Create data channel for real-time events
		this.dataChannel = this.pc.createDataChannel('oai-events', {
			ordered: true,
		})
 
		// Get user's microphone
		this.audioStream = await navigator.mediaDevices.getUserMedia({
			audio: {
				echoCancellation: true,
				noiseSuppression: true,
				autoGainControl: true,
				sampleRate: 24000,
				channelCount: 1,
			},
		})
 
		// Add audio track to peer connection
		const audioTrack = this.audioStream.getAudioTracks()[0]
		this.pc.addTrack(audioTrack, this.audioStream)
 
		// Handle incoming audio from OpenAI
		this.pc.ontrack = (event) => {
			if (event.track.kind === 'audio') {
				this.handleIncomingAudio(event.streams[0])
			}
		}
 
		// Create offer and exchange SDP with OpenAI
		const offer = await this.pc.createOffer()
		await this.pc.setLocalDescription(offer)
 
		// Exchange SDP with OpenAI using ephemeral key
		const response = await fetch(`https://api.openai.com/v1/realtime?model=${config.model}`, {
			method: 'POST',
			body: offer.sdp,
			headers: {
				'Authorization': `Bearer ${config.ephemeralKey}`,
				'Content-Type': 'application/sdp',
			},
		})
 
		if (!response.ok) {
			throw new Error(`SDP exchange failed with status ${response.status}`)
		}

		const answerSdp = await response.text()
		await this.pc.setRemoteDescription({
			type: 'answer',
			sdp: answerSdp,
		})
	}
}

Establishing the Connection

The connection flow in the React hook demonstrates how everything comes together:

// client/src/hooks/useRealtimeSession.ts
const connect = useCallback(async (communicationMode: 'voice-to-voice') => {
	try {
		// Step 1: Get ephemeral token from your backend
		const response = await fetch('http://localhost:8080/api/realtime/session', {
			method: 'POST',
			headers: {
				'Content-Type': 'application/json',
			},
			body: JSON.stringify({
				voice: settings.openai.voice,
				model: settings.openai.model,
			}),
		})
 
		if (!response.ok) {
			throw new Error('Failed to create realtime session')
		}

		const data = await response.json()
		const ephemeralKey = data.client_secret.value
 
		// Step 2: Connect via WebRTC using the ephemeral key
		await webrtcService.connect({
			ephemeralKey,
			model: settings.openai.model,
			voice: settings.openai.voice,
			audioStream: mediaStreamRef.current,
		})
 
		// Connection established! Audio now flows directly between client and OpenAI
	} catch (error) {
		console.error('Failed to connect:', error)
	}
}, [settings]) // include settings so the hook picks up voice/model changes

Handling Real-time Events

Once connected, the data channel provides real-time events for transcriptions, responses, and function calls:

private setupDataChannel(): void {
  this.dataChannel.onopen = () => {
    // Send initial session configuration
    const config = {
      type: 'session.update',
      session: {
        modalities: ['text', 'audio'],
        voice: this.voice,
        input_audio_transcription: {
          model: 'whisper-1'
        },
        turn_detection: {
          type: 'server_vad',
          threshold: 0.5,
          prefix_padding_ms: 300,
          silence_duration_ms: 1000
        }
      }
    };
 
    this.sendMessage(config);
  };
 
  this.dataChannel.onmessage = (event) => {
    const message = JSON.parse(event.data);
 
    switch (message.type) {
      case 'conversation.item.input_audio_transcription.completed':
        // User's speech was transcribed
        this.emit('transcription', {
          transcript: message.transcript,
          item_id: message.item_id
        });
        break;
 
      case 'response.audio_transcript.delta':
        // Assistant's response text (real-time)
        this.emit('assistantTranscriptDelta', {
          delta: message.delta,
          item_id: message.item_id
        });
        break;
 
      case 'response.audio.delta':
        // Assistant's audio response
        if (message.delta) {
          const audioData = base64ToArrayBuffer(message.delta);
          this.emit('assistantAudioDelta', audioData);
        }
        break;
    }
  };
}

Key Benefits

1. Ultra-Low Latency

Audio streams directly between the client and OpenAI's servers. There's no intermediate hop through your backend, which keeps round-trip latency to a minimum.

2. Reduced Server Load

Your server only handles the initial ephemeral key generation. All audio processing happens client-side and on OpenAI's infrastructure.

3. Enhanced Security

  • Your API key never leaves the server
  • Ephemeral keys have limited scope and expire quickly
  • Each session gets its own unique key

4. Simplified Architecture

No need to implement complex WebSocket proxying or audio streaming on your backend. The WebRTC connection handles all the real-time communication.

Practical Tips

1. Handle Connection States Properly

this.pc.onconnectionstatechange = () => {
	const state = this.pc.connectionState
 
	if (state === 'connected') {
		this.isConnected = true
		this.emit('connected')
	} else if (state === 'failed' || state === 'closed') {
		this.isConnected = false
		this.emit('disconnected')
	}
}

2. Implement Proper Audio Cleanup

async disconnect(): Promise<void> {
  // Stop audio tracks
  if (this.audioStream) {
    this.audioStream.getTracks().forEach(track => track.stop());
  }
 
  // Close data channel
  if (this.dataChannel) {
    this.dataChannel.close();
  }
 
  // Close peer connection
  if (this.pc) {
    this.pc.close();
  }
}

3. Configure Audio Settings for Quality

const audioConstraints = {
	echoCancellation: true,
	noiseSuppression: true,
	autoGainControl: true,
	sampleRate: 24000, // OpenAI expects 24kHz
	channelCount: 1, // Mono audio
}

4. Handle Network Interruptions

Implement reconnection logic to handle network drops gracefully. The ephemeral key approach makes this straightforward - just request a new key and reconnect.
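One simple pattern is exponential backoff around the full connect sequence (fetch a fresh ephemeral key, then re-establish WebRTC). The helper below is a generic sketch, not part of the codebase above; `connectFn` stands in for your own connect routine:

```typescript
// Retry an async connect function with exponential backoff.
// Resolves once it succeeds, or throws after maxAttempts failures.
async function reconnectWithBackoff<T>(
	connectFn: () => Promise<T>,
	maxAttempts = 5,
	baseDelayMs = 500,
): Promise<T> {
	let lastError: unknown
	for (let attempt = 0; attempt < maxAttempts; attempt++) {
		try {
			// Each attempt should fetch a NEW ephemeral key inside connectFn,
			// since the previous key may already have expired.
			return await connectFn()
		} catch (error) {
			lastError = error
			const delay = baseDelayMs * 2 ** attempt
			await new Promise((resolve) => setTimeout(resolve, delay))
		}
	}
	throw lastError
}
```

Because the ephemeral key flow is stateless on your backend, a reconnect is just the original connect sequence run again.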

Conclusion

Using WebRTC with ephemeral keys provides an elegant solution for building real-time voice applications with OpenAI. It combines the security of server-side API key management with the performance benefits of direct client-to-server communication.

This approach has enabled me to build a responsive voice assistant with minimal latency while keeping the implementation refreshingly lean. The combination of WebRTC's proven real-time capabilities and OpenAI's powerful AI creates an excellent foundation for voice-powered applications.

The key insight is that by leveraging ephemeral keys, we can safely move the real-time communication to the edge (the client) where it belongs, while maintaining security through temporary, scoped credentials. This pattern could be applied to many other real-time AI services as they emerge.