Distilling glm-5.2 into a local Qwen3-8B
A ~5 GB local model that reasons and answers like a frontier teacher — with more headroom than the 4B. Built by distilling glm-5.2 general-knowledge responses into a Qwen3-8B student with a 4-bit QLoRA fine-tune, then merged, converted, and quantized to GGUF for Ollama. It keeps the base's 128K context and native tool calling, and still runs on a single 16 GB GPU.
This page walks through how a small, fully local model was built to reason and answer like a much larger one — using knowledge distillation. A strong teacher (glm-5.2) generates high-quality answers to a few thousand prompts, and a Qwen3-8B student is fine-tuned to imitate them. It's the same recipe as the 4B build, scaled to an 8B base for more capacity. Everything GPU-heavy runs on thelab-genesis, a dedicated training box fitted with an RTX 5060 Ti (16 GB).
The goal was a small, fully local model that reasons and answers like a much larger one on general-knowledge questions — and that still supports native tool calling. The teacher is called through Ollama's cloud routing; every GPU-heavy step — training, merging, conversion — is local.
Teacher: glm-5.2 (via Ollama, thinking disabled)
Student: Qwen3-8B (unsloth/Qwen3-8B, 4-bit QLoRA)
Trainer: Unsloth + TRL (SFT, rank-16 LoRA)
Convert: llama.cpp (convert_hf_to_gguf.py → f16 → q4_k_m)
Serve: Ollama (adi-qwen3-8b-glm5.2-general:latest)
Hardware: RTX 5060 Ti 16 GB (single GPU)
Distillation needs a diverse set of questions to ask the teacher. Rather than hand-writing them, pull human-written instructions from the Dolly-15k dataset, filter out anything needing an attached context passage, dedupe, and keep 2,000.
from datasets import load_dataset

ds = load_dataset("databricks/databricks-dolly-15k", split="train")
# Categories whose instructions depend on an attached context passage
skip = {"closed_qa", "information_extraction", "summarization"}
seen, prompts = set(), []
for row in ds:
    if row.get("category") in skip:
        continue
    q = row["instruction"].strip()
    # Drop empty, very short, very long, or duplicate (case-insensitive) prompts
    if not q or len(q) < 15 or len(q) > 400 or q.lower() in seen:
        continue
    seen.add(q.lower())
    prompts.append(q)
    if len(prompts) >= 2000:
        break
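To check the filtering rules without downloading Dolly-15k, the same logic can be wrapped in a function and run on toy rows. This is a sketch rather than the original script: `filter_prompts` and the sample rows are hypothetical, and only the thresholds (15 to 400 characters, a cap of 2,000, three skipped categories) come from the loop above.

```python
SKIP_CATEGORIES = {"closed_qa", "information_extraction", "summarization"}

def filter_prompts(rows, limit=2000, min_len=15, max_len=400):
    """Return up to `limit` unique instructions that need no context passage."""
    seen, prompts = set(), []
    for row in rows:
        if row.get("category") in SKIP_CATEGORIES:
            continue
        q = row["instruction"].strip()
        key = q.lower()  # case-insensitive dedupe key
        if not (min_len <= len(q) <= max_len) or key in seen:
            continue
        seen.add(key)
        prompts.append(q)
        if len(prompts) >= limit:
            break
    return prompts

# Hypothetical rows shaped like Dolly-15k records
rows = [
    {"category": "open_qa", "instruction": "Why is the sky blue during the day?"},
    {"category": "open_qa", "instruction": "why is the sky blue during the day?"},  # duplicate
    {"category": "summarization", "instruction": "Summarize the passage below in one line."},
    {"category": "brainstorming", "instruction": "Too short"},
]
kept = filter_prompts(rows)
print(kept)
```

Only the first row survives: the second is a case-insensitive duplicate, the third belongs to a skipped category, and the fourth is under 15 characters.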
Each prompt is sent to glm-5.2 through Ollama's native /api/chat endpoint with thinking disabled — you want the teacher's clean final answers for the student to imitate, not chain-of-thought. Output is written as Qwen chat-format JSONL. The same distilled set feeds both the 4B and 8B students.
import requests

# SYSTEM (the system prompt) and prompt (one filtered instruction) are defined elsewhere
payload = {
    "model": "glm-5.2",
    "messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": prompt},
    ],
    "think": False,  # native API reliably disables thinking
    "stream": False,
    "options": {"num_predict": 2048, "temperature": 0.7},
}
r = requests.post("http://localhost:11434/api/chat", json=payload, timeout=300)
r.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
answer = r.json()["message"]["content"].strip()
The think: false flag is honored by Ollama's native /api/chat endpoint but not reliably by the OpenAI-compatible /v1 path, where a thinking model can burn its whole token budget on hidden reasoning. Result here: 2,068 answers for ~1.36M tokens, far cheaper than a thinking run.
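The "Qwen chat-format JSONL" output is not shown in the original, so the following is only a sketch of one plausible layout: one JSON object per line holding a `messages` list of system, user, and assistant turns, which is the shape a chat template typically consumes. The helper names, the file name, and the sample strings are hypothetical.

```python
import json
import os
import tempfile

def to_chat_record(system, prompt, answer):
    """One training example as a system/user/assistant message list."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ]
    }

def append_jsonl(path, record):
    """Append one record as a single JSON line (UTF-8, non-ASCII kept readable)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Load every non-empty line back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo with placeholder text, written to a throwaway temp directory
path = os.path.join(tempfile.mkdtemp(), "distilled.jsonl")
append_jsonl(path, to_chat_record("You are ADI.", "What is 2 + 2?", "4"))
records = read_jsonl(path)
print(records[0]["messages"][-1])
```

Appending line by line means an interrupted teacher run loses at most the answer in flight, and the file can be resumed by skipping prompts already present.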
A rank-16 LoRA adapter is trained on the distilled pairs over a 4-bit-quantized base. Only the small adapter moves — the base stays frozen — so an 8B fits comfortably on the 16 GB card.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-8B",
    max_seq_length = 4096,
    load_in_4bit = True,  # 4-bit QLoRA; fine for dense Qwen3
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16, lora_dropout=0,
    use_gradient_checkpointing="unsloth",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
# 3 epochs · 777 steps · rank 16
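As a back-of-the-envelope check on "only the small adapter moves" and on the 777-step figure, the arithmetic can be sketched as follows. Two things here are assumptions, not facts from this page: the layer dimensions (taken to be Qwen3-8B's published configuration, worth verifying against the model's config.json) and the effective batch size of 8, which is simply one value that reproduces 777 steps from 2,068 examples over 3 epochs.

```python
import math

# Assumed Qwen3-8B dimensions (verify against config.json before relying on them)
HIDDEN, INTERMEDIATE, LAYERS = 4096, 12288, 36
Q_OUT = 32 * 128   # 32 query heads x head_dim 128
KV_OUT = 8 * 128   # 8 key/value heads x head_dim 128 (grouped-query attention)

# (in_features, out_features) for each targeted projection
SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(rank, shapes=SHAPES, layers=LAYERS):
    """LoRA adds A (rank x in) and B (out x rank) for each targeted matrix."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * layers

def optimizer_steps(n_examples, effective_batch, epochs):
    """Step count when the last partial batch of each epoch is kept."""
    return math.ceil(n_examples / effective_batch) * epochs

print(lora_params(16))              # 43646976
print(optimizer_steps(2068, 8, 3))  # 777
```

Under these assumptions the rank-16 adapter holds about 43.6M trainable weights, roughly half a percent of an 8B base, which is why the optimizer state stays small enough for a 16 GB card.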
The adapter is merged into the base as 16-bit safetensors, then converted to GGUF with llama.cpp's own converter and quantized to q4_k_m. Qwen3-8B uses the well-supported Qwen3 architecture, so the conversion is uneventful.
# merge adapter → 16-bit safetensors (Unsloth)
# then convert with llama.cpp directly:
python3 llama.cpp/convert_hf_to_gguf.py ./merged \
    --outfile adi-qwen3-8b-glm5.2-general-f16.gguf \
    --outtype f16
# quantize f16 → q4_k_m (~16 GB → ~5 GB)
./llama.cpp/build/bin/llama-quantize \
    adi-qwen3-8b-glm5.2-general-f16.gguf \
    adi-qwen3-8b-glm5.2-general-q4_k_m.gguf q4_k_m
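The "~16 GB → ~5 GB" figure follows from bits per weight. The estimate below is a rough sketch: the 8.2B parameter count and the 4.85 average bits per weight for q4_k_m are assumptions (q4_k_m mixes 4-bit and 6-bit blocks, so its average lands a little under 5 bits), and file metadata is ignored.

```python
def gguf_size_gb(n_params, bits_per_weight):
    """Approximate file size in decimal GB, ignoring metadata overhead."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 8.2e9     # assumed Qwen3-8B parameter count
F16_BPW = 16.0       # two bytes per weight
Q4_K_M_BPW = 4.85    # assumed average for q4_k_m

f16_gb = gguf_size_gb(N_PARAMS, F16_BPW)
q4_gb = gguf_size_gb(N_PARAMS, Q4_K_M_BPW)
print(round(f16_gb, 1))  # 16.4
print(round(q4_gb, 1))   # 5.0
```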
A Modelfile points Ollama at the quantized GGUF. After ollama create, the model is live in the fleet and reachable through the standard OpenAI-compatible endpoint alongside every other ADI model.
# Modelfile
FROM ./adi-qwen3-8b-glm5.2-general-q4_k_m.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
SYSTEM """You are ADI, a precise and knowledgeable assistant.
Answer clearly and completely, reasoning step by step where it helps."""
# build it locally from the Modelfile…
ollama create adi-qwen3-8b-glm5.2-general -f Modelfile
# …or pull the published GGUF straight from Hugging Face
ollama run hf.co/AdvancedDataIntelligence/adi-qwen3-8b-glm5.2-general-GGUF:Q4_K_M \
    "Explain the CAP theorem in two sentences."
Once created, the model appears in ollama list as adi-qwen3-8b-glm5.2-general:latest (~5 GB).
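Reaching the model through the OpenAI-compatible endpoint mentioned above might look like the sketch below. It assumes Ollama is listening on its default local address and uses only the standard library; `build_chat_request` and `chat` are hypothetical helper names, and the live call is left commented out because it needs a running server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # assumed default local Ollama address
MODEL = "adi-qwen3-8b-glm5.2-general:latest"

def build_chat_request(prompt, model=MODEL, base_url=BASE_URL):
    """Build (without sending) a POST to the chat completions route."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, timeout=300):
    """Send the request and return the assistant's text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# With Ollama running locally:
# print(chat("Explain the CAP theorem in two sentences."))
```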