System Architecture - AssemblyAI Real-Time Transcription Browser Example

Overview

The AssemblyAI Real-Time Transcription Browser Example uses a three-tier architecture that separates concerns between the backend server, frontend client, and AssemblyAI’s streaming service.

Architecture Components

1. Express Server (Backend)

The Express server acts as a security layer and token provider:

server.js

const express = require("express");
const path = require("path");
const { generateTempToken } = require("./tokenGenerator");

const app = express();
const PORT = 8000;

app.use(express.static(path.join(__dirname, "public")));

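app.get("/token", async (req, res) => {
  try {
    // Request a temporary token that is valid for 60 seconds
    const token = await generateTempToken(60);
    res.json({ token });
  } catch (error) {
    console.error("Token generation failed:", error);
    res.status(500).json({ error: "Failed to generate token" });
  }
});

// Start the server so the static files and /token route are reachable
app.listen(PORT, () => {
  console.log(`Server running at http://localhost:${PORT}`);
});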

The server’s primary responsibility is generating temporary tokens for secure client-side connections to AssemblyAI. It never exposes your API key to the browser.
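The tokenGenerator module is not shown in this section. The following is a minimal sketch of what such a helper could look like, assuming the token is requested from AssemblyAI's v3 streaming token endpoint with the API key in the Authorization header and the lifetime passed as expires_in_seconds. The endpoint URL, the ASSEMBLYAI_API_KEY environment variable, and the injectable fetchImpl option are assumptions made for illustration and should be checked against the current AssemblyAI documentation.

```javascript
// tokenGenerator.js (illustrative sketch, not the project's actual file)

// Assumed endpoint for temporary streaming tokens.
const TOKEN_URL = "https://streaming.assemblyai.com/v3/token";

/**
 * Request a temporary token that the browser can use instead of the API key.
 * `options.apiKey` and `options.fetchImpl` are hypothetical parameters that
 * make the function testable; by default it reads the key from the
 * environment and uses the global fetch (available in Node 18+).
 */
async function generateTempToken(expiresInSeconds, options = {}) {
  const apiKey = options.apiKey ?? process.env.ASSEMBLYAI_API_KEY;
  const fetchImpl = options.fetchImpl ?? globalThis.fetch;
  if (!apiKey) {
    throw new Error("ASSEMBLYAI_API_KEY is not set");
  }

  const url = `${TOKEN_URL}?expires_in_seconds=${encodeURIComponent(expiresInSeconds)}`;
  const response = await fetchImpl(url, {
    headers: { Authorization: apiKey }, // the key is only ever sent server-side
  });
  if (!response.ok) {
    throw new Error(`Token request failed with status ${response.status}`);
  }

  // Assumed response shape: { token: "..." }
  const data = await response.json();
  return data.token;
}

module.exports = { generateTempToken };
```

Reading the key from the environment keeps it out of source control as well as out of the browser, and any failure surfaces as a thrown error that the /token route turns into a 500 response.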

2. Browser Client (Frontend)

The client handles three main responsibilities:

Audio capture using the Web Audio API and AudioWorklet
Token retrieval from the Express server
WebSocket communication with AssemblyAI’s real-time service

index.js

async function run() {
  // microphone and ws are module-level references shared with the stop handler
  microphone = createMicrophone();
  await microphone.requestPermission();

  // Get temporary token from server
  const response = await fetch("http://localhost:8000/token");
  if (!response.ok) {
    throw new Error(`Token request failed with status ${response.status}`);
  }
  const data = await response.json();

  // Connect to AssemblyAI with token
  const endpoint = `wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&formatted_finals=true&token=${data.token}`;
  ws = new WebSocket(endpoint);
}
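The endpoint string above is assembled by hand. As a small sketch of an alternative, the query string could be built with URLSearchParams so that the token is always URL-encoded. buildStreamingUrl is a hypothetical helper, not part of the example application; the base URL and parameter names are the ones used in the snippet above.

```javascript
// Base URL and parameter names are taken from the example above.
const STREAMING_URL = "wss://streaming.assemblyai.com/v3/ws";

/**
 * Build the WebSocket endpoint for one streaming session.
 * (Hypothetical helper for illustration.)
 */
function buildStreamingUrl(token, { sampleRate = 16000, formattedFinals = true } = {}) {
  const params = new URLSearchParams({
    sample_rate: String(sampleRate),
    formatted_finals: String(formattedFinals),
    token, // URLSearchParams percent-encodes reserved characters
  });
  return `${STREAMING_URL}?${params.toString()}`;
}

// Usage: "TEMP_TOKEN" is a placeholder for the value returned by /token.
const endpoint = buildStreamingUrl("TEMP_TOKEN");
console.log(endpoint);
// prints "wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&formatted_finals=true&token=TEMP_TOKEN"
```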

3. AssemblyAI Streaming Service

The AssemblyAI service receives audio data over the WebSocket connection and returns transcripts in real time as turn-based messages.
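The Web Audio API produces Float32 samples, while streaming speech services commonly expect raw 16-bit PCM. As a rough sketch of that conversion step, assuming the service accepts 16-bit signed little-endian samples at the 16000 Hz rate given in the connection URL (an assumption to verify against the AssemblyAI documentation), the samples could be converted before sending. floatTo16BitPCM is a hypothetical name.

```javascript
/**
 * Convert Float32 samples in the range [-1, 1] to 16-bit signed
 * little-endian PCM. Returns an ArrayBuffer suitable for ws.send().
 * (Hypothetical helper; the expected encoding is an assumption.)
 */
function floatTo16BitPCM(float32Samples) {
  const buffer = new ArrayBuffer(float32Samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp so that out-of-range samples do not wrap around
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    // Negative samples scale to -32768, positive samples to 32767
    const scaled = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
    view.setInt16(i * 2, scaled, true); // true = little-endian
  }
  return buffer;
}

// Usage with a short synthetic chunk (values chosen for illustration)
const pcm = new Int16Array(floatTo16BitPCM(new Float32Array([0, 1, -1])));
console.log(Array.from(pcm)); // prints [ 0, 32767, -32768 ]
```

Reading the buffer back through Int16Array assumes a little-endian host, which is the case on the platforms where browsers and Node.js commonly run.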

Data Flow

Turn-Based Transcription

AssemblyAI returns transcripts as “turns”: natural speech segments, each identified by a turn_order number:

index.js

const turns = {}; // keyed by turn_order

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === "Turn") {
    const { turn_order, transcript } = msg;
    turns[turn_order] = transcript;

    // Display turns in order
    const orderedTurns = Object.keys(turns)
      .sort((a, b) => Number(a) - Number(b))
      .map((k) => turns[k])
      .join(" ");

    messageEl.innerText = orderedTurns;
  }
};

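Turns may arrive out of order due to network conditions or processing delays. The application stores each transcript in an object keyed by turn_order, so a later message for the same turn overwrites the earlier text, and it sorts the keys numerically before display.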
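To make the ordering behavior concrete, the handler logic can be separated from the WebSocket and the DOM and run against simulated messages. applyTurn and renderTurns are hypothetical names, and the message contents are made up for illustration; only the type, turn_order, and transcript fields come from the example above.

```javascript
/** Store the latest transcript for a Turn message; ignore other types. */
function applyTurn(turns, msg) {
  if (msg.type !== "Turn") return turns;
  turns[msg.turn_order] = msg.transcript; // later updates overwrite earlier ones
  return turns;
}

/** Join all stored transcripts in ascending turn_order. */
function renderTurns(turns) {
  return Object.keys(turns)
    .sort((a, b) => Number(a) - Number(b))
    .map((k) => turns[k])
    .join(" ");
}

// Simulated messages: turn 1 arrives first, then turn 0 is updated twice.
const turns = {};
const incoming = [
  { type: "Turn", turn_order: 1, transcript: "how are you" },
  { type: "Turn", turn_order: 0, transcript: "hello" },
  { type: "Turn", turn_order: 0, transcript: "Hello there." },
];
for (const msg of incoming) applyTurn(turns, msg);

console.log(renderTurns(turns)); // prints "Hello there. how are you"
```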

Security Architecture

Because temporary tokens are generated server-side, your AssemblyAI API key never leaves the server. Anyone who inspects the client code or its network traffic can obtain only a short-lived token, not the key itself.

The token-based security model ensures:

API keys remain secret on the server
Clients receive time-limited access tokens
Tokens expire automatically (60 seconds in this example)
Each client session requires a new token

Connection Lifecycle

Initialization: User clicks “Record” button
Token Request: Client fetches temporary token from Express server
WebSocket Connection: Client connects to AssemblyAI using token
Audio Streaming: AudioWorklet processes and sends audio chunks
Transcription: AssemblyAI returns Turn messages with transcripts
Termination: User clicks “Stop”, client sends Terminate message and closes connection

The application maintains state through boolean flags (isRecording) and object references (ws, microphone) to coordinate between components.
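The lifecycle above can be sketched as a small controller that owns the isRecording flag and the ws reference. This is an illustrative outline rather than the application's actual code: createSession and its injected dependencies (getToken, openSocket, and a microphone object with startRecording and stopRecording) are hypothetical, and the { type: "Terminate" } message shape is an assumption based on the step names above. A real client would also wait for the socket's open event before sending audio.

```javascript
/**
 * Hypothetical session controller. Dependencies are injected so the
 * lifecycle can be exercised without a browser:
 *   getToken()        -> Promise<string>
 *   openSocket(token) -> object with send() and close()
 *   microphone        -> object with startRecording(onChunk), stopRecording()
 */
function createSession({ getToken, openSocket, microphone }) {
  let isRecording = false;
  let ws = null;

  async function start() {
    if (isRecording) return;          // Initialization: ignore repeated clicks
    const token = await getToken();   // Token Request
    ws = openSocket(token);           // WebSocket Connection
    microphone.startRecording((chunk) => {
      if (ws) ws.send(chunk);         // Audio Streaming
    });
    isRecording = true;
  }

  function stop() {
    if (!isRecording) return;
    microphone.stopRecording();
    ws.send(JSON.stringify({ type: "Terminate" })); // assumed message shape
    ws.close();                       // Termination
    ws = null;
    isRecording = false;
  }

  return { start, stop, isRecording: () => isRecording };
}
```

Keeping the flag and the socket reference inside one closure means the “Record” and “Stop” handlers cannot drift out of sync with each other.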
