Knowledge Base / Browser AI

Browser LLMs

Overview

Browser LLMs enable you to run large language models directly in the user's browser using WebGPU acceleration. This means zero server costs, complete data privacy, and surprisingly good performance - typically around 10-30 tokens per second on modern hardware.

This guide covers three major frameworks for running LLMs in the browser: WebLLM (MLC AI), Transformers.js (Hugging Face), and BrowserAI. Each has different strengths depending on your use case.

Key Benefits
No API costs, complete privacy (data never leaves the browser), offline capable, and no server infrastructure to maintain.

Why Browser LLMs?

Running LLMs in the browser offers several compelling advantages over server-side deployment:

  • Zero Server Costs - No GPUs to rent, no API fees. The user's hardware does all the work.
  • Complete Privacy - User data never leaves their device. Perfect for sensitive applications.
  • Offline Capable - Once the model is cached, it works without internet.
  • Instant Scaling - Each user brings their own compute. Handle millions of users without infrastructure changes.
  • Low Latency - No network round trips. Responses start immediately.

The tradeoffs? Larger models won't fit, initial load times can be slow (downloading ~1-4GB), and performance varies by user hardware. But for many applications, these are acceptable.

WebGPU Acceleration

WebGPU is the key technology that makes browser LLMs practical. It's the successor to WebGL, providing low-level GPU access similar to Vulkan, Metal, and DirectX 12.

JavaScript
// Check WebGPU support
async function checkWebGPU() {
    if (!navigator.gpu) {
        console.log('WebGPU not supported');
        return false;
    }
    
    const adapter = await navigator.gpu.requestAdapter();
    if (!adapter) {
        console.log('No GPU adapter found');
        return false;
    }
    
    const device = await adapter.requestDevice();
    console.log('WebGPU ready!', device);
    return true;
}

WebGPU is supported in Chrome 113+, Edge 113+, and Firefox (behind a flag). Safari support is coming. For browsers without WebGPU, Transformers.js can fall back to WebGL or WASM, but performance drops significantly.

Browser Support
WebGPU requires Chrome/Edge 113+ or Firefox Nightly. Always include a fallback message for unsupported browsers.
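
In practice that means gating model loading on the check above and showing a fallback message when WebGPU is missing. A minimal sketch (the status element ID is illustrative, and checkWebGPU is the helper defined earlier; top-level await assumes a module script):

JavaScript
// Gate model loading on WebGPU support and surface a friendly fallback
const supported = await checkWebGPU();
if (!supported) {
    document.getElementById("status").textContent =
        "This demo needs WebGPU. Please use Chrome or Edge 113+ (or enable the flag in Firefox).";
} else {
    // Safe to load a model here
}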

Frameworks Overview

Three frameworks dominate the browser LLM space:

  • WebLLM - MLC AI's solution, optimized specifically for LLM inference. Best raw performance for chat models (~30 tok/sec, LLM focus).
  • Transformers.js - Hugging Face's port of the Transformers library. Supports many model types beyond LLMs (~15 tok/sec, all model types).
  • BrowserAI - Unified API wrapping WebLLM and Transformers.js. Simplest developer experience (~25 tok/sec, easy API).

WebLLM

WebLLM by MLC AI is purpose-built for LLM inference. It uses Apache TVM's ML compilation to optimize models specifically for WebGPU, delivering the best performance.

Installation

Bash
npm install @mlc-ai/web-llm

Basic Usage

JavaScript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Initialize engine with progress callback
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC", {
    initProgressCallback: (progress) => {
        console.log(`Loading: ${Math.round(progress.progress * 100)}%`);
    }
});

// Chat completion (OpenAI-compatible API)
const response = await engine.chat.completions.create({
    messages: [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: "What is WebGPU?" }
    ],
    temperature: 0.7,
    max_tokens: 256
});

console.log(response.choices[0].message.content);
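
Because the API is OpenAI-compatible, multi-turn chat works the same way as with the OpenAI SDK: keep a running messages array and append each assistant reply before the next request. A minimal sketch reusing the engine created above:

JavaScript
// Multi-turn conversation: maintain the full message history between calls
const messages = [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is WebGPU?" }
];

const first = await engine.chat.completions.create({ messages });
messages.push(first.choices[0].message); // keep the assistant's turn

messages.push({ role: "user", content: "How is it different from WebGL?" });
const second = await engine.chat.completions.create({ messages });
console.log(second.choices[0].message.content);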

Streaming Responses

JavaScript
// Stream tokens as they're generated
const stream = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Write a haiku about coding" }],
    stream: true
});

let fullResponse = "";
for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content || "";
    fullResponse += token;
    // Append the token to your UI here (process.stdout is Node-only and unavailable in the browser)
}

console.log("\n\nFull response:", fullResponse);

Architecture

WebLLM uses a sophisticated pipeline to run models in the browser:

User Input (text prompt) → WebLLM Engine (MLC runtime) → WebGPU (GPU acceleration) → Response (streamed tokens)

Try the Live Demo

We've built a complete WebLLM chat interface that demonstrates these concepts with a professional dashboard, model selection, real-time streaming, and performance statistics.

WebLLM Chat Demo

Experience browser-based AI with our interactive demo. Select a model, wait for it to load, and start chatting - all running locally on your device with complete privacy.


Transformers.js

Transformers.js is Hugging Face's JavaScript port of their popular Python library. It supports not just LLMs but also vision models, speech recognition, embeddings, and more.

Installation

Bash
npm install @huggingface/transformers

Text Generation

JavaScript
import { pipeline } from "@huggingface/transformers";

// Create text generation pipeline
const generator = await pipeline(
    "text-generation",
    "onnx-community/Qwen2.5-0.5B-Instruct",
    { device: "webgpu" }
);

// Generate text
const output = await generator("Write a poem about the sea:", {
    max_new_tokens: 100,
    temperature: 0.8,
    do_sample: true
});

console.log(output[0].generated_text);
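
Transformers.js can also stream tokens as they are generated via its TextStreamer utility. A sketch under the assumption that the v3 streaming API is available (option names follow the library's snake_case style; check the current docs if the signature has changed):

JavaScript
import { pipeline, TextStreamer } from "@huggingface/transformers";

const generator = await pipeline(
    "text-generation",
    "onnx-community/Qwen2.5-0.5B-Instruct",
    { device: "webgpu" }
);

// Stream decoded text to a callback instead of waiting for the full output
const streamer = new TextStreamer(generator.tokenizer, {
    skip_prompt: true,
    callback_function: (text) => console.log(text)
});

await generator("Write a poem about the sea:", {
    max_new_tokens: 100,
    streamer
});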

Other Capabilities

JavaScript
// Embeddings
const embedder = await pipeline(
    "feature-extraction", 
    "Xenova/all-MiniLM-L6-v2"
);
const embeddings = await embedder("Hello world");

// Speech Recognition
const transcriber = await pipeline(
    "automatic-speech-recognition",
    "Xenova/whisper-tiny.en"
);
const result = await transcriber(audioBlob);

// Image Classification
const classifier = await pipeline(
    "image-classification",
    "Xenova/vit-base-patch16-224"
);
const labels = await classifier(imageUrl);
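
A common follow-up for the embedding pipeline is comparing two texts with cosine similarity. A minimal sketch using the pooling and normalization options this pipeline accepts; with normalized vectors, the dot product is the cosine similarity:

JavaScript
import { pipeline } from "@huggingface/transformers";

const embedder = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

// Mean-pooled, L2-normalized sentence embeddings
const a = await embedder("The cat sits on the mat", { pooling: "mean", normalize: true });
const b = await embedder("A cat is resting on a rug", { pooling: "mean", normalize: true });

// Dot product of normalized vectors = cosine similarity
const score = [...a.data].reduce((sum, v, i) => sum + v * b.data[i], 0);
console.log(`Similarity: ${score.toFixed(3)}`); // closer to 1 means more similar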

BrowserAI

BrowserAI provides a unified, simplified API that wraps both WebLLM and Transformers.js. It handles model loading, caching, and provides a consistent interface.

Installation

Bash
npm install @browserai/browserai

Simple Usage

JavaScript
import { BrowserAI } from "@browserai/browserai";

const ai = new BrowserAI();

// Load model (automatically picks best backend)
await ai.loadModel("llama-3.2-1b-instruct", {
    onProgress: (p) => console.log(`${p.percent}%`)
});

// Generate response
const response = await ai.generateText(
    "Explain quantum computing in simple terms",
    { maxTokens: 200, temperature: 0.7 }
);

console.log(response);

Framework Comparison

Feature       | WebLLM            | Transformers.js               | BrowserAI
Performance   | Best (~30 tok/s)  | Good (~15 tok/s)              | Good (~25 tok/s)
Model Types   | LLMs only         | LLM, Vision, Audio, Embedding | Depends on backend
API Style     | OpenAI-compatible | Pipeline-based                | Simplified unified
Backend       | WebGPU (TVM)      | WebGPU, WebGL, WASM           | WebLLM or Transformers.js
Model Format  | MLC compiled      | ONNX                          | Either
Bundle Size   | ~2MB              | ~500KB                        | ~2.5MB
Best For      | Chat apps         | Multi-modal apps              | Quick prototypes

Supported Models

Browser LLMs are limited by VRAM. Models need to fit in the user's GPU memory (typically 4-8GB on consumer hardware). Here are commonly used models:

Recommended for Browser

  • Llama 3.2 1B/3B - Best quality for size, Q4 quantization fits easily
  • Qwen 2.5 0.5B/1.5B - Excellent for instruction following
  • Phi-3 Mini - Microsoft's efficient 3.8B model
  • Gemma 2B - Google's compact model
  • TinyLlama 1.1B - Fast and lightweight
  • SmolLM 135M/360M - Ultra-compact for simple tasks
JavaScript
// WebLLM model IDs (check webllm.ai for full list)
const models = [
    "Llama-3.2-1B-Instruct-q4f16_1-MLC",
    "Llama-3.2-3B-Instruct-q4f16_1-MLC",
    "Qwen2.5-1.5B-Instruct-q4f16_1-MLC",
    "Phi-3.5-mini-instruct-q4f16_1-MLC",
    "gemma-2-2b-it-q4f16_1-MLC",
    "TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC",
    "SmolLM2-360M-Instruct-q4f16_1-MLC"
];

// Transformers.js model IDs (ONNX format on HuggingFace)
const onnxModels = [
    "onnx-community/Qwen2.5-0.5B-Instruct",
    "onnx-community/Llama-3.2-1B-Instruct",
    "Xenova/phi-1_5_dev",
    "Xenova/TinyLlama-1.1B-Chat-v1.0"
];
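
Because the right model depends on the user's hardware, one pragmatic approach is to pick a smaller model on low-memory devices. A sketch using the Chromium-only navigator.deviceMemory hint and model IDs from the lists above (the thresholds are illustrative, not benchmarks):

JavaScript
// Pick a model based on the Device Memory hint
// (navigator.deviceMemory is Chromium-only and capped at 8 GB)
function pickModel() {
    const memGB = navigator.deviceMemory || 2; // assume a modest device if unknown
    if (memGB >= 8) return "Llama-3.2-3B-Instruct-q4f16_1-MLC";
    if (memGB >= 4) return "Llama-3.2-1B-Instruct-q4f16_1-MLC";
    return "SmolLM2-360M-Instruct-q4f16_1-MLC";
}

console.log(`Selected model: ${pickModel()}`);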

Performance Tips

Getting the best performance from browser LLMs requires attention to several factors:

1. Use Quantized Models

Always use Q4 quantized models (4-bit). They're ~4x smaller with minimal quality loss. The q4f16_1 suffix indicates 4-bit weights with 16-bit activations.
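
As a rough rule of thumb, 4-bit weights work out to about half a gigabyte per billion parameters. A quick sketch of the arithmetic, ignoring embeddings, the tokenizer, and runtime overhead such as the KV cache:

JavaScript
// Approximate download size of a Q4 model: params × 4 bits ÷ 8 bits per byte
function approxQ4SizeGB(paramsBillions) {
    return (paramsBillions * 1e9 * 4) / 8 / 1e9;
}

console.log(approxQ4SizeGB(1)); // ~0.5 GB for a 1B model
console.log(approxQ4SizeGB(3)); // ~1.5 GB for a 3B model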

2. Pre-cache Models

JavaScript
// Service Worker caching for instant reloads
// sw.js
self.addEventListener('fetch', (event) => {
    if (event.request.url.includes('.wasm') || 
        event.request.url.includes('.bin')) {
        event.respondWith(
            caches.match(event.request).then(response => {
                return response || fetch(event.request).then(r => {
                    const clone = r.clone();
                    caches.open('model-cache').then(cache => {
                        cache.put(event.request, clone);
                    });
                    return r;
                });
            })
        );
    }
});
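
The worker above only takes effect once the page registers it; a minimal sketch, assuming sw.js is served from the site root:

JavaScript
// main.js - register the service worker so cached model files are served on repeat visits
if ("serviceWorker" in navigator) {
    navigator.serviceWorker.register("/sw.js")
        .then(() => console.log("Model cache service worker registered"))
        .catch((err) => console.error("Service worker registration failed:", err));
}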

3. Use Web Workers

Run inference in a Web Worker to keep the UI responsive:

JavaScript
// worker.js
import { CreateMLCEngine } from "@mlc-ai/web-llm";

let engine = null;

self.onmessage = async (e) => {
    if (e.data.type === "init") {
        engine = await CreateMLCEngine(e.data.model);
        self.postMessage({ type: "ready" });
    }
    
    if (e.data.type === "generate") {
        const response = await engine.chat.completions.create({
            messages: e.data.messages,
            stream: true
        });
        
        for await (const chunk of response) {
            self.postMessage({ 
                type: "token", 
                content: chunk.choices[0]?.delta?.content 
            });
        }
        self.postMessage({ type: "done" });
    }
};
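
The page then drives the worker with postMessage. A minimal sketch of the main-thread side (the worker path, model ID, and output element are illustrative):

JavaScript
// main.js - talk to the inference worker without blocking the UI
const worker = new Worker(new URL("./worker.js", import.meta.url), { type: "module" });

worker.onmessage = (e) => {
    if (e.data.type === "ready") {
        worker.postMessage({
            type: "generate",
            messages: [{ role: "user", content: "Hello!" }]
        });
    } else if (e.data.type === "token") {
        document.getElementById("output").textContent += e.data.content || "";
    } else if (e.data.type === "done") {
        console.log("Generation finished");
    }
};

// Kick off model loading in the worker
worker.postMessage({ type: "init", model: "Llama-3.2-1B-Instruct-q4f16_1-MLC" });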

4. Limit Context Length

Shorter prompts = faster inference. Keep context under 2K tokens when possible.
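
One simple way to enforce this is to keep the system prompt and drop the oldest turns once the history grows too long. A sketch with a purely illustrative character budget (exact token counts depend on the tokenizer):

JavaScript
// Trim old turns so the prompt stays small; assumes messages[0] is the system prompt
function trimHistory(messages, maxChars = 6000) {
    const [system, ...rest] = messages;
    const size = (msgs) => msgs.reduce((n, m) => n + m.content.length, 0);
    while (rest.length > 1 && size(rest) > maxChars) {
        rest.shift(); // drop the oldest turn first
    }
    return [system, ...rest];
}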

5. Monitor Memory

JavaScript
// Check available JS heap before loading (performance.memory is non-standard,
// Chromium-only, and reports the JS heap rather than GPU memory, so treat it as a rough signal)
if (performance.memory) {
    const { usedJSHeapSize, jsHeapSizeLimit } = performance.memory;
    const available = jsHeapSizeLimit - usedJSHeapSize;
    console.log(`Available heap: ${(available / 1e9).toFixed(2)} GB`);
}

Mobile Limitations
Mobile browsers have less VRAM and may throttle WebGPU. Use smaller models (≤1B params) and test thoroughly on target devices.

Quick Start Example

Here's a complete, working HTML file you can save and open in Chrome to test browser LLMs:

HTML
<!DOCTYPE html>
<html>
<head>
    <title>Browser LLM Demo</title>
    <style>
        body { font-family: system-ui; max-width: 800px; margin: 2rem auto; padding: 1rem; }
        #output { white-space: pre-wrap; background: #1a1a2e; color: #eee; 
                  padding: 1rem; border-radius: 8px; min-height: 200px; }
        button { padding: 0.5rem 1rem; font-size: 1rem; cursor: pointer; }
        #status { color: #06b6d4; margin: 1rem 0; }
    </style>
</head>
<body>
    <h1>Browser LLM Demo</h1>
    <div id="status">Click "Load Model" to start...</div>
    <button onclick="loadModel()">Load Model</button>
    <button onclick="generate()" disabled id="genBtn">Generate</button>
    <div id="output"></div>

    <script type="module">
        import { CreateMLCEngine } from 
            "https://esm.run/@mlc-ai/web-llm";

        let engine = null;
        const status = document.getElementById("status");
        const output = document.getElementById("output");
        const genBtn = document.getElementById("genBtn");

        window.loadModel = async () => {
            status.textContent = "Loading model (this may take a few minutes)...";
            
            engine = await CreateMLCEngine(
                "SmolLM2-360M-Instruct-q4f16_1-MLC",
                { initProgressCallback: (p) => {
                    status.textContent = `Loading: ${(p.progress * 100).toFixed(1)}%`;
                }}
            );
            
            status.textContent = "Model ready!";
            genBtn.disabled = false;
        };

        window.generate = async () => {
            output.textContent = "";
            
            const stream = await engine.chat.completions.create({
                messages: [{ 
                    role: "user", 
                    content: "Write a short poem about technology." 
                }],
                stream: true,
                max_tokens: 150
            });

            for await (const chunk of stream) {
                output.textContent += chunk.choices[0]?.delta?.content || "";
            }
        };
    </script>
</body>
</html>
Try It Now
Save the HTML above as a file, open it in Chrome 113+, and experience browser-based LLMs firsthand. The SmolLM2-360M model downloads quickly (~250MB) and runs on most hardware.