Overview
Browser LLMs enable you to run large language models directly in the user's browser using WebGPU acceleration. This means zero server costs, complete data privacy, and surprisingly good performance - typically around 10-30 tokens per second on modern hardware.
This guide covers three major frameworks for running LLMs in the browser: WebLLM (MLC AI), Transformers.js (Hugging Face), and BrowserAI. Each has different strengths depending on your use case.
Why Browser LLMs?
Running LLMs in the browser offers several compelling advantages over server-side deployment:
- Zero Server Costs - No GPUs to rent, no API fees. The user's hardware does all the work.
- Complete Privacy - User data never leaves their device. Perfect for sensitive applications.
- Offline Capable - Once the model is cached, it works without internet.
- Instant Scaling - Each user brings their own compute. Handle millions of users without infrastructure changes.
- Low Latency - No network round trips. Responses start immediately.
The tradeoffs? Larger models won't fit, initial load times can be slow (downloading ~1-4GB), and performance varies by user hardware. But for many applications, these are acceptable.
WebGPU Acceleration
WebGPU is the key technology that makes browser LLMs practical. It's the successor to WebGL, providing low-level GPU access similar to Vulkan, Metal, and DirectX 12.
// Check WebGPU support
async function checkWebGPU() {
if (!navigator.gpu) {
console.log('WebGPU not supported');
return false;
}
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) {
console.log('No GPU adapter found');
return false;
}
const device = await adapter.requestDevice();
console.log('WebGPU ready!', device);
return true;
}
WebGPU is supported in Chrome 113+ and Edge 113+, with Firefox and Safari support rolling out; check caniuse.com for the current status. For browsers without WebGPU, Transformers.js can fall back to WASM, but performance drops significantly.
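If you need to support browsers without WebGPU, you can feature-detect and choose the device at pipeline creation time. A minimal sketch using Transformers.js (covered in detail below); the "wasm" device value is the fallback backend provided by onnxruntime-web:
import { pipeline } from "@huggingface/transformers";
// Prefer WebGPU when available, otherwise fall back to WASM
const device = navigator.gpu ? "webgpu" : "wasm";
const generator = await pipeline(
"text-generation",
"onnx-community/Qwen2.5-0.5B-Instruct",
{ device }
);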
Frameworks Overview
Three frameworks dominate the browser LLM space:
- WebLLM - MLC AI's solution, optimized specifically for LLM inference. Best raw performance for chat models.
- Transformers.js - Hugging Face's port of the Transformers library. Supports many model types beyond LLMs.
- BrowserAI - Unified API wrapping WebLLM and Transformers.js. Simplest developer experience.
WebLLM
WebLLM by MLC AI is purpose-built for LLM inference. It uses Apache TVM's ML compilation to optimize models specifically for WebGPU, delivering the best performance.
Installation
npm install @mlc-ai/web-llm
Basic Usage
import { CreateMLCEngine } from "@mlc-ai/web-llm";
// Initialize engine with progress callback
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC", {
initProgressCallback: (progress) => {
console.log(`Loading: ${Math.round(progress.progress * 100)}%`);
}
});
// Chat completion (OpenAI-compatible API)
const response = await engine.chat.completions.create({
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "What is WebGPU?" }
],
temperature: 0.7,
max_tokens: 256
});
console.log(response.choices[0].message.content);
Streaming Responses
// Stream tokens as they're generated
const stream = await engine.chat.completions.create({
messages: [{ role: "user", content: "Write a haiku about coding" }],
stream: true
});
let fullResponse = "";
for await (const chunk of stream) {
const token = chunk.choices[0]?.delta?.content || "";
fullResponse += token;
// Update the UI with each token here (process.stdout is Node-only and not available in the browser)
}
console.log("\n\nFull response:", fullResponse);
Architecture
WebLLM uses an ahead-of-time compilation pipeline: models are quantized and compiled with Apache TVM into WebGPU kernels plus a small WebAssembly runtime, and the browser then downloads and caches the compiled model library and weight shards before running inference entirely on the client.
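You normally just reference a prebuilt model ID, but the output of that pipeline can also be wired up explicitly through a custom app config. A sketch, assuming the AppConfig/ModelRecord field names from recent @mlc-ai/web-llm releases; the URLs are placeholders, so check the WebLLM docs for the exact shape:
import { CreateMLCEngine, prebuiltAppConfig } from "@mlc-ai/web-llm";
// Each entry pairs a weight repo (model) with its compiled WebGPU library (model_lib)
const appConfig = {
...prebuiltAppConfig,
model_list: [
...prebuiltAppConfig.model_list,
{
model: "https://huggingface.co/your-org/YourModel-q4f16_1-MLC", // placeholder weight repo
model_id: "YourModel-q4f16_1-MLC",
model_lib: "https://your-cdn.example/YourModel-q4f16_1-webgpu.wasm" // placeholder compiled library
}
]
};
const engine = await CreateMLCEngine("YourModel-q4f16_1-MLC", { appConfig });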
Try the Live Demo
We've built a complete WebLLM chat interface that demonstrates these concepts with a professional dashboard, model selection, real-time streaming, and performance statistics.
WebLLM Chat Demo
Experience browser-based AI with our interactive demo. Select a model, wait for it to load, and start chatting - all running locally on your device with complete privacy.
Transformers.js
Transformers.js is Hugging Face's JavaScript port of their popular Python library. It supports not just LLMs but also vision models, speech recognition, embeddings, and more.
Installation
npm install @huggingface/transformers
Text Generation
import { pipeline } from "@huggingface/transformers";
// Create text generation pipeline
const generator = await pipeline(
"text-generation",
"onnx-community/Qwen2.5-0.5B-Instruct",
{ device: "webgpu" }
);
// Generate text
const output = await generator("Write a poem about the sea:", {
max_new_tokens: 100,
temperature: 0.8,
do_sample: true
});
console.log(output[0].generated_text);
Other Capabilities
// Embeddings
const embedder = await pipeline(
"feature-extraction",
"Xenova/all-MiniLM-L6-v2"
);
const embeddings = await embedder("Hello world");
// Speech Recognition
const transcriber = await pipeline(
"automatic-speech-recognition",
"Xenova/whisper-tiny.en"
);
const result = await transcriber(audioUrl); // URL of an audio file (or a Float32Array of samples)
// Image Classification
const classifier = await pipeline(
"image-classification",
"Xenova/vit-base-patch16-224"
);
const labels = await classifier(imageUrl);
BrowserAI
BrowserAI provides a unified, simplified API that wraps both WebLLM and Transformers.js. It handles model loading, caching, and provides a consistent interface.
Installation
npm install @browserai/browserai
Simple Usage
import { BrowserAI } from "@browserai/browserai";
const ai = new BrowserAI();
// Load model (automatically picks best backend)
await ai.loadModel("llama-3.2-1b-instruct", {
onProgress: (p) => console.log(`${p.percent}%`)
});
// Generate response
const response = await ai.generateText(
"Explain quantum computing in simple terms",
{ maxTokens: 200, temperature: 0.7 }
);
console.log(response);
Framework Comparison
| Feature | WebLLM | Transformers.js | BrowserAI |
|---|---|---|---|
| Performance | Best (~30 tok/s) | Good (~15 tok/s) | Good (~25 tok/s) |
| Model Types | LLMs only | LLM, Vision, Audio, Embedding | Depends on backend |
| API Style | OpenAI-compatible | Pipeline-based | Simplified unified |
| Backend | WebGPU (TVM) | WebGPU, WebGL, WASM | WebLLM or Transformers.js |
| Model Format | MLC compiled | ONNX | Either |
| Bundle Size | ~2MB | ~500KB | ~2.5MB |
| Best For | Chat apps | Multi-modal apps | Quick prototypes |
Supported Models
Browser LLMs are limited by VRAM. Models need to fit in the user's GPU memory (typically 4-8GB on consumer hardware). Here are commonly used models:
Recommended for Browser
- Llama 3.2 1B/3B - Best quality for size, Q4 quantization fits easily
- Qwen 2.5 0.5B/1.5B - Excellent for instruction following
- Phi-3 Mini - Microsoft's efficient 3.8B model
- Gemma 2B - Google's compact model
- TinyLlama 1.1B - Fast and lightweight
- SmolLM 135M/360M - Ultra-compact for simple tasks
// WebLLM model IDs (see prebuiltAppConfig.model_list in @mlc-ai/web-llm for the full list)
const models = [
"Llama-3.2-1B-Instruct-q4f16_1-MLC",
"Llama-3.2-3B-Instruct-q4f16_1-MLC",
"Qwen2.5-1.5B-Instruct-q4f16_1-MLC",
"Phi-3.5-mini-instruct-q4f16_1-MLC",
"gemma-2-2b-it-q4f16_1-MLC",
"TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC",
"SmolLM2-360M-Instruct-q4f16_1-MLC"
];
// Transformers.js model IDs (ONNX format on HuggingFace)
const onnxModels = [
"onnx-community/Qwen2.5-0.5B-Instruct",
"onnx-community/Llama-3.2-1B-Instruct",
"Xenova/phi-1_5_dev",
"Xenova/TinyLlama-1.1B-Chat-v1.0"
];
Performance Tips
Getting the best performance from browser LLMs requires attention to several factors:
1. Use Quantized Models
Always use Q4 quantized models (4-bit). They're ~4x smaller with minimal quality loss. The q4f16_1 suffix indicates 4-bit weights with 16-bit activations.
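A quick back-of-envelope check makes the size difference concrete (weights only; the KV cache and runtime buffers add more on top):
// Rough weight size for a model: parameters * bits per weight / 8 bytes
function approxWeightsGB(params, bitsPerWeight = 4) {
return (params * bitsPerWeight) / 8 / 1e9;
}
console.log(approxWeightsGB(1e9)); // Llama 3.2 1B at Q4 ≈ 0.5 GB
console.log(approxWeightsGB(3e9)); // Llama 3.2 3B at Q4 ≈ 1.5 GB
console.log(approxWeightsGB(1e9, 16)); // the same 1B model in fp16 ≈ 2 GB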
2. Pre-cache Models
// Service Worker caching for instant reloads
// sw.js
self.addEventListener('fetch', (event) => {
if (event.request.url.includes('.wasm') ||
event.request.url.includes('.bin')) {
event.respondWith(
caches.match(event.request).then(response => {
return response || fetch(event.request).then(r => {
const clone = r.clone();
caches.open('model-cache').then(cache => {
cache.put(event.request, clone);
});
return r;
});
})
);
}
});
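The service worker above only takes effect once it's registered from the page; a minimal sketch (the /sw.js path is whatever you deploy it as):
// main.js - register the caching service worker at startup
if ("serviceWorker" in navigator) {
navigator.serviceWorker
.register("/sw.js")
.then(() => console.log("Model cache service worker registered"))
.catch((err) => console.error("Service worker registration failed:", err));
}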
3. Use Web Workers
Run inference in a Web Worker to keep the UI responsive:
// worker.js
import { CreateMLCEngine } from "@mlc-ai/web-llm";
let engine = null;
self.onmessage = async (e) => {
if (e.data.type === "init") {
engine = await CreateMLCEngine(e.data.model);
self.postMessage({ type: "ready" });
}
if (e.data.type === "generate") {
const response = await engine.chat.completions.create({
messages: e.data.messages,
stream: true
});
for await (const chunk of response) {
self.postMessage({
type: "token",
content: chunk.choices[0]?.delta?.content
});
}
self.postMessage({ type: "done" });
}
};
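On the main thread, create it as a module worker (so the import at the top of worker.js resolves) and exchange messages with it. A sketch; the file path and element ID are placeholders:
// main.js - drive the inference worker
const worker = new Worker(new URL("./worker.js", import.meta.url), { type: "module" });
const output = document.getElementById("output");
worker.onmessage = (e) => {
if (e.data.type === "ready") {
worker.postMessage({
type: "generate",
messages: [{ role: "user", content: "Explain WebGPU in one sentence." }]
});
} else if (e.data.type === "token") {
output.textContent += e.data.content || "";
}
};
worker.postMessage({ type: "init", model: "Llama-3.2-1B-Instruct-q4f16_1-MLC" });
WebLLM also ships a worker wrapper (see CreateWebWorkerMLCEngine in its docs) if you'd rather not hand-roll the message protocol.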
4. Limit Context Length
Shorter prompts = faster inference. Keep context under 2K tokens when possible.
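One lightweight way to enforce this is to trim older turns before each request. A sketch that assumes the first message is the system prompt and uses the common ~4-characters-per-token approximation rather than a real tokenizer:
// Keep the system prompt plus as many recent messages as fit the budget
function trimHistory(messages, maxTokens = 2000) {
const approxTokens = (m) => Math.ceil(m.content.length / 4);
const [system, ...rest] = messages;
let used = approxTokens(system);
const kept = [];
for (const msg of rest.reverse()) { // walk from newest to oldest
used += approxTokens(msg);
if (used > maxTokens) break;
kept.unshift(msg); // restore chronological order
}
return [system, ...kept];
}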
5. Monitor Memory
// performance.memory is non-standard (Chromium-only) and reports the JS heap,
// not GPU memory, so treat this as a rough sanity check before loading
if (performance.memory) {
const { usedJSHeapSize, jsHeapSizeLimit } = performance.memory;
const available = jsHeapSizeLimit - usedJSHeapSize;
console.log(`Available heap: ${(available / 1e9).toFixed(2)} GB`);
}
Quick Start Example
Here's a complete, working HTML file you can save and open in Chrome to test browser LLMs:
<!DOCTYPE html>
<html>
<head>
<title>Browser LLM Demo</title>
<style>
body { font-family: system-ui; max-width: 800px; margin: 2rem auto; padding: 1rem; }
#output { white-space: pre-wrap; background: #1a1a2e; color: #eee;
padding: 1rem; border-radius: 8px; min-height: 200px; }
button { padding: 0.5rem 1rem; font-size: 1rem; cursor: pointer; }
#status { color: #06b6d4; margin: 1rem 0; }
</style>
</head>
<body>
<h1>Browser LLM Demo</h1>
<div id="status">Click "Load Model" to start...</div>
<button onclick="loadModel()">Load Model</button>
<button onclick="generate()" disabled id="genBtn">Generate</button>
<div id="output"></div>
<script type="module">
import { CreateMLCEngine } from
"https://esm.run/@mlc-ai/web-llm";
let engine = null;
const status = document.getElementById("status");
const output = document.getElementById("output");
const genBtn = document.getElementById("genBtn");
window.loadModel = async () => {
status.textContent = "Loading model (this may take a few minutes)...";
engine = await CreateMLCEngine(
"SmolLM2-360M-Instruct-q4f16_1-MLC",
{ initProgressCallback: (p) => {
status.textContent = `Loading: ${(p.progress * 100).toFixed(1)}%`;
}}
);
status.textContent = "Model ready!";
genBtn.disabled = false;
};
window.generate = async () => {
output.textContent = "";
const stream = await engine.chat.completions.create({
messages: [{
role: "user",
content: "Write a short poem about technology."
}],
stream: true,
max_tokens: 150
});
for await (const chunk of stream) {
output.textContent += chunk.choices[0]?.delta?.content || "";
}
};
</script>
</body>
</html>