Overview
Browser LLMs enable you to run large language models directly in the user's browser using WebGPU acceleration. This means zero server costs, complete data privacy, and surprisingly good performance - typically around 10-30 tokens per second on modern hardware.
This guide covers three major frameworks for running LLMs in the browser: WebLLM (MLC AI), Transformers.js (Hugging Face), and BrowserAI. Each has different strengths depending on your use case.
Why Browser LLMs?
Running LLMs in the browser offers several compelling advantages over server-side deployment:
- Zero Server Costs - No GPUs to rent, no API fees. The user's hardware does all the work.
- Complete Privacy - User data never leaves their device. Perfect for sensitive applications.
- Offline Capable - Once the model is cached, it works without internet.
- Instant Scaling - Each user brings their own compute. Handle millions of users without infrastructure changes.
- Low Latency - No network round trips. Responses start immediately.
The tradeoffs? Larger models won't fit, initial load times can be slow (downloading ~1-4GB), and performance varies by user hardware. But for many applications, these are acceptable.
WebGPU Acceleration
WebGPU is the key technology that makes browser LLMs practical. It's the successor to WebGL, providing low-level GPU access similar to Vulkan, Metal, and DirectX 12.
// Check WebGPU support
async function checkWebGPU() {
if (!navigator.gpu) {
console.log('WebGPU not supported');
return false;
}
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) {
console.log('No GPU adapter found');
return false;
}
const device = await adapter.requestDevice();
console.log('WebGPU ready!', device);
return true;
}
WebGPU is supported in Chrome 113+ and Edge 113+, with Firefox and Safari support rolling out; check caniuse.com for the current status. For browsers without WebGPU, Transformers.js can fall back to WASM, but performance drops significantly.
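If you need to support browsers without WebGPU, you can feature-detect and choose the device at pipeline creation time. A minimal sketch using Transformers.js (covered in detail below); the "wasm" device value is the fallback backend provided by onnxruntime-web:
import { pipeline } from "@huggingface/transformers";
// Prefer WebGPU when available, otherwise fall back to WASM
const device = navigator.gpu ? "webgpu" : "wasm";
const generator = await pipeline(
"text-generation",
"onnx-community/Qwen2.5-0.5B-Instruct",
{ device }
);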
Frameworks Overview
Three frameworks dominate the browser LLM space:
- WebLLM - MLC AI's solution, optimized specifically for LLM inference. Best raw performance for chat models.
- Transformers.js - Hugging Face's port of the Transformers library. Supports many model types beyond LLMs.
- BrowserAI - Unified API wrapping WebLLM and Transformers.js. Simplest developer experience.
WebLLM
WebLLM by MLC AI is purpose-built for LLM inference. It uses Apache TVM's ML compilation to optimize models specifically for WebGPU, delivering the best performance.
Installation
npm install @mlc-ai/web-llm
Basic Usage
import { CreateMLCEngine } from "@mlc-ai/web-llm";
// Initialize engine with progress callback
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC", {
initProgressCallback: (progress) => {
console.log(`Loading: ${Math.round(progress.progress * 100)}%`);
}
});
// Chat completion (OpenAI-compatible API)
const response = await engine.chat.completions.create({
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "What is WebGPU?" }
],
temperature: 0.7,
max_tokens: 256
});
console.log(response.choices[0].message.content);
Streaming Responses
// Stream tokens as they're generated
const stream = await engine.chat.completions.create({
messages: [{ role: "user", content: "Write a haiku about coding" }],
stream: true
});
let fullResponse = "";
for await (const chunk of stream) {
const token = chunk.choices[0]?.delta?.content || "";
fullResponse += token;
// Update the UI with each token here (process.stdout is Node-only and not available in the browser)
}
console.log("\n\nFull response:", fullResponse);
Architecture
WebLLM uses an ahead-of-time compilation pipeline: models are quantized and compiled with Apache TVM into WebGPU kernels plus a small WebAssembly runtime, and the browser then downloads and caches the compiled model library and weight shards before running inference entirely on the client.
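You normally just reference a prebuilt model ID, but the output of that pipeline can also be wired up explicitly through a custom app config. A sketch, assuming the AppConfig/ModelRecord field names from recent @mlc-ai/web-llm releases; the URLs are placeholders, so check the WebLLM docs for the exact shape:
import { CreateMLCEngine, prebuiltAppConfig } from "@mlc-ai/web-llm";
// Each entry pairs a weight repo (model) with its compiled WebGPU library (model_lib)
const appConfig = {
...prebuiltAppConfig,
model_list: [
...prebuiltAppConfig.model_list,
{
model: "https://huggingface.co/your-org/YourModel-q4f16_1-MLC", // placeholder weight repo
model_id: "YourModel-q4f16_1-MLC",
model_lib: "https://your-cdn.example/YourModel-q4f16_1-webgpu.wasm" // placeholder compiled library
}
]
};
const engine = await CreateMLCEngine("YourModel-q4f16_1-MLC", { appConfig });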
Try the Live Demo
We've built a complete WebLLM chat interface that demonstrates these concepts with a professional dashboard, model selection, real-time streaming, and performance statistics.
WebLLM Chat Demo
Experience browser-based AI with our interactive demo. Select a model, wait for it to load, and start chatting - all running locally on your device with complete privacy.
Transformers.js
Transformers.js is Hugging Face's JavaScript port of their popular Python library. It supports not just LLMs but also vision models, speech recognition, embeddings, and more.
Installation
npm install @huggingface/transformers
Text Generation
import { pipeline } from "@huggingface/transformers";
// Create text generation pipeline
const generator = await pipeline(
"text-generation",
"onnx-community/Qwen2.5-0.5B-Instruct",
{ device: "webgpu" }
);
// Generate text
const output = await generator("Write a poem about the sea:", {
max_new_tokens: 100,
temperature: 0.8,
do_sample: true
});
console.log(output[0].generated_text);
Other Capabilities
// Embeddings
const embedder = await pipeline(
"feature-extraction",
"Xenova/all-MiniLM-L6-v2"
);
const embeddings = await embedder("Hello world");
// Speech Recognition
const transcriber = await pipeline(
"automatic-speech-recognition",
"Xenova/whisper-tiny.en"
);
const result = await transcriber(audioUrl); // URL of an audio file (or a Float32Array of samples)
// Image Classification
const classifier = await pipeline(
"image-classification",
"Xenova/vit-base-patch16-224"
);
const labels = await classifier(imageUrl);
BrowserAI
BrowserAI provides a unified, simplified API that wraps both WebLLM and Transformers.js. It handles model loading, caching, and provides a consistent interface.
Installation
npm install @browserai/browserai
Simple Usage
import { BrowserAI } from "@browserai/browserai";
const ai = new BrowserAI();
// Load model (automatically picks best backend)
await ai.loadModel("llama-3.2-1b-instruct", {
onProgress: (p) => console.log(`${p.percent}%`)
});
// Generate response
const response = await ai.generateText(
"Explain quantum computing in simple terms",
{ maxTokens: 200, temperature: 0.7 }
);
console.log(response);
Framework Comparison
| Feature | WebLLM | Transformers.js | BrowserAI |
|---|---|---|---|
| Performance | Best (~30 tok/s) | Good (~15 tok/s) | Good (~25 tok/s) |
| Model Types | LLMs only | LLM, Vision, Audio, Embedding | Depends on backend |
| API Style | OpenAI-compatible | Pipeline-based | Simplified unified |
| Backend | WebGPU (TVM) | WebGPU, WebGL, WASM | WebLLM or Transformers.js |
| Model Format | MLC compiled | ONNX | Either |
| Bundle Size | ~2MB | ~500KB | ~2.5MB |
| Best For | Chat apps | Multi-modal apps | Quick prototypes |
Supported Models
Browser LLMs are limited by VRAM. Models need to fit in the user's GPU memory (typically 4-8GB on consumer hardware). Here are commonly used models:
Recommended for Browser
- Llama 3.2 1B/3B - Best quality for size, Q4 quantization fits easily
- Qwen 2.5 0.5B/1.5B - Excellent for instruction following
- Phi-3 Mini - Microsoft's efficient 3.8B model
- Gemma 2B - Google's compact model
- TinyLlama 1.1B - Fast and lightweight
- SmolLM 135M/360M - Ultra-compact for simple tasks
// WebLLM model IDs (see prebuiltAppConfig.model_list in @mlc-ai/web-llm for the full list)
const models = [
"Llama-3.2-1B-Instruct-q4f16_1-MLC",
"Llama-3.2-3B-Instruct-q4f16_1-MLC",
"Qwen2.5-1.5B-Instruct-q4f16_1-MLC",
"Phi-3.5-mini-instruct-q4f16_1-MLC",
"gemma-2-2b-it-q4f16_1-MLC",
"TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC",
"SmolLM2-360M-Instruct-q4f16_1-MLC"
];
// Transformers.js model IDs (ONNX format on HuggingFace)
const onnxModels = [
"onnx-community/Qwen2.5-0.5B-Instruct",
"onnx-community/Llama-3.2-1B-Instruct",
"Xenova/phi-1_5_dev",
"Xenova/TinyLlama-1.1B-Chat-v1.0"
];
Performance Tips
Getting the best performance from browser LLMs requires attention to several factors:
1. Use Quantized Models
Always use Q4 quantized models (4-bit). They're ~4x smaller with minimal quality loss. The q4f16_1 suffix indicates 4-bit weights with 16-bit activations.
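A quick back-of-envelope check makes the size difference concrete (weights only; the KV cache and runtime buffers add more on top):
// Rough weight size for a model: parameters * bits per weight / 8 bytes
function approxWeightsGB(params, bitsPerWeight = 4) {
return (params * bitsPerWeight) / 8 / 1e9;
}
console.log(approxWeightsGB(1e9)); // Llama 3.2 1B at Q4 ≈ 0.5 GB
console.log(approxWeightsGB(3e9)); // Llama 3.2 3B at Q4 ≈ 1.5 GB
console.log(approxWeightsGB(1e9, 16)); // the same 1B model in fp16 ≈ 2 GB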
2. Pre-cache Models
// Service Worker caching for instant reloads
// sw.js
self.addEventListener('fetch', (event) => {
if (event.request.url.includes('.wasm') ||
event.request.url.includes('.bin')) {
event.respondWith(
caches.match(event.request).then(response => {
return response || fetch(event.request).then(r => {
const clone = r.clone();
caches.open('model-cache').then(cache => {
cache.put(event.request, clone);
});
return r;
});
})
);
}
});
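The service worker above only takes effect once it's registered from the page; a minimal sketch (the /sw.js path is whatever you deploy it as):
// main.js - register the caching service worker at startup
if ("serviceWorker" in navigator) {
navigator.serviceWorker
.register("/sw.js")
.then(() => console.log("Model cache service worker registered"))
.catch((err) => console.error("Service worker registration failed:", err));
}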
3. Use Web Workers
Run inference in a Web Worker to keep the UI responsive:
// worker.js
import { CreateMLCEngine } from "@mlc-ai/web-llm";
let engine = null;
self.onmessage = async (e) => {
if (e.data.type === "init") {
engine = await CreateMLCEngine(e.data.model);
self.postMessage({ type: "ready" });
}
if (e.data.type === "generate") {
const response = await engine.chat.completions.create({
messages: e.data.messages,
stream: true
});
for await (const chunk of response) {
self.postMessage({
type: "token",
content: chunk.choices[0]?.delta?.content
});
}
self.postMessage({ type: "done" });
}
};
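On the main thread, create it as a module worker (so the import at the top of worker.js resolves) and exchange messages with it. A sketch; the file path and element ID are placeholders:
// main.js - drive the inference worker
const worker = new Worker(new URL("./worker.js", import.meta.url), { type: "module" });
const output = document.getElementById("output");
worker.onmessage = (e) => {
if (e.data.type === "ready") {
worker.postMessage({
type: "generate",
messages: [{ role: "user", content: "Explain WebGPU in one sentence." }]
});
} else if (e.data.type === "token") {
output.textContent += e.data.content || "";
}
};
worker.postMessage({ type: "init", model: "Llama-3.2-1B-Instruct-q4f16_1-MLC" });
WebLLM also ships a worker wrapper (see CreateWebWorkerMLCEngine in its docs) if you'd rather not hand-roll the message protocol.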
4. Limit Context Length
Shorter prompts = faster inference. Keep context under 2K tokens when possible.
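One lightweight way to enforce this is to trim older turns before each request. A sketch that assumes the first message is the system prompt and uses the common ~4-characters-per-token approximation rather than a real tokenizer:
// Keep the system prompt plus as many recent messages as fit the budget
function trimHistory(messages, maxTokens = 2000) {
const approxTokens = (m) => Math.ceil(m.content.length / 4);
const [system, ...rest] = messages;
let used = approxTokens(system);
const kept = [];
for (const msg of rest.reverse()) { // walk from newest to oldest
used += approxTokens(msg);
if (used > maxTokens) break;
kept.unshift(msg); // restore chronological order
}
return [system, ...kept];
}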
5. Monitor Memory
// performance.memory is non-standard (Chromium-only) and reports the JS heap,
// not GPU memory, so treat this as a rough sanity check before loading
if (performance.memory) {
const { usedJSHeapSize, jsHeapSizeLimit } = performance.memory;
const available = jsHeapSizeLimit - usedJSHeapSize;
console.log(`Available heap: ${(available / 1e9).toFixed(2)} GB`);
}
Quick Start Example
Here's a complete, working HTML file you can save and open in Chrome to test browser LLMs:
<!DOCTYPE html>
<html>
<head>
<title>Browser LLM Demo</title>
<style>
body { font-family: system-ui; max-width: 800px; margin: 2rem auto; padding: 1rem; }
#output { white-space: pre-wrap; background: #1a1a2e; color: #eee;
padding: 1rem; border-radius: 8px; min-height: 200px; }
button { padding: 0.5rem 1rem; font-size: 1rem; cursor: pointer; }
#status { color: #06b6d4; margin: 1rem 0; }
</style>
</head>
<body>
<h1>Browser LLM Demo</h1>
<div id="status">Click "Load Model" to start...</div>
<button onclick="loadModel()">Load Model</button>
<button onclick="generate()" disabled id="genBtn">Generate</button>
<div id="output"></div>
<script type="module">
import { CreateMLCEngine } from
"https://esm.run/@mlc-ai/web-llm";
let engine = null;
const status = document.getElementById("status");
const output = document.getElementById("output");
const genBtn = document.getElementById("genBtn");
window.loadModel = async () => {
status.textContent = "Loading model (this may take a few minutes)...";
engine = await CreateMLCEngine(
"SmolLM2-360M-Instruct-q4f16_1-MLC",
{ initProgressCallback: (p) => {
status.textContent = `Loading: ${(p.progress * 100).toFixed(1)}%`;
}}
);
status.textContent = "Model ready!";
genBtn.disabled = false;
};
window.generate = async () => {
output.textContent = "";
const stream = await engine.chat.completions.create({
messages: [{
role: "user",
content: "Write a short poem about technology."
}],
stream: true,
max_tokens: 150
});
for await (const chunk of stream) {
output.textContent += chunk.choices[0]?.delta?.content || "";
}
};
</script>
</body>
</html>