Distilling glm-5.2 into a local Qwen3.5-4B
A 2.7 GB local model that answers like a frontier teacher. Built by distilling glm-5.2 general-knowledge responses into a Qwen3.5-4B student with a bf16 LoRA fine-tune — then merged, converted, quantized to GGUF, and served from Ollama. The whole pipeline runs on a single 16 GB GPU.
This page walks through how a small, fully local model was built to reason and answer like a much larger one — using knowledge distillation. A strong teacher (glm-5.2) generates high-quality answers to a few thousand prompts, and a small student (Qwen3.5-4B) is fine-tuned to imitate them. Everything GPU-heavy runs on thelab-genesis, a dedicated training box fitted with an RTX 5060 Ti (16 GB).
The goal was a small, fully local model that reasons and answers like a much larger one on general-knowledge questions — and that still supports native tool calling. The teacher is called through Ollama's cloud routing; every GPU-heavy step — training, merging, conversion — is local.
Teacher: glm-5.2 (via Ollama, thinking disabled)
Student: Qwen3.5-4B (unsloth/Qwen3.5-4B, bf16 LoRA)
Trainer: Unsloth + TRL (transformers 5.5.0, torch 2.10)
Convert: llama.cpp (convert_hf_to_gguf.py → bf16 → q4_k_m)
Serve: Ollama (adi-qwen3.5-4b-glm5.2-general:latest)
Hardware: RTX 5060 Ti 16 GB (single GPU, ~10 GB used during training)
Distillation needs a diverse set of questions to ask the teacher. Rather than hand-writing them, pull human-written instructions from the Dolly-15k dataset, filter out anything needing an attached context passage, dedupe, and keep 2,000.
from datasets import load_dataset

ds = load_dataset("databricks/databricks-dolly-15k", split="train")

# Categories whose instructions depend on an attached context passage
skip = {"closed_qa", "information_extraction", "summarization"}
seen, prompts = set(), []
for row in ds:
    if row.get("category") in skip:
        continue
    q = row["instruction"].strip()
    key = q.lower()  # case-insensitive dedupe key
    # Drop empty, very short, very long, or already-seen instructions
    if not q or len(q) < 15 or len(q) > 400 or key in seen:
        continue
    seen.add(key)
    prompts.append(q)
    if len(prompts) >= 2000:
        break
Each prompt is sent to glm-5.2 through Ollama's native /api/chat endpoint with thinking disabled — for a 4B student you want the teacher's clean final answers, not chain-of-thought it's too small to imitate well. Output is written as Qwen chat-format JSONL.
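import requests

# SYSTEM (the system prompt) and prompt (one collected instruction) are defined elsewhere in the script
payload = {
    "model": "glm-5.2",
    "messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": prompt},
    ],
    "think": False,  # native API reliably disables thinking
    "stream": False,
    "options": {"num_predict": 2048, "temperature": 0.7},
}
r = requests.post("http://localhost:11434/api/chat", json=payload, timeout=300)
r.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
answer = r.json()["message"]["content"].strip()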
The think: false flag is honored by Ollama's native /api/chat endpoint but not reliably by the OpenAI-compatible /v1 path, where a thinking model can burn its whole token budget on hidden reasoning. Result here: 2,068 answers for ~1.36M tokens, far cheaper than a thinking run.
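The JSONL output mentioned above could be written with something like the sketch below. The record layout (a `messages` list of role/content entries), the helper names `to_chat_record` and `append_jsonl`, and the idea of appending one line per answer are assumptions for illustration; the source does not show the exact schema it used.

```python
import json

def to_chat_record(system: str, prompt: str, answer: str) -> dict:
    """Build one training example as a list of chat messages (assumed schema)."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ]
    }

def append_jsonl(path: str, record: dict) -> None:
    """Append one record as a single JSON line; ensure_ascii=False keeps non-ASCII text readable."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Appending after every teacher call, rather than writing once at the end, means a crash partway through a long run loses at most one answer.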
A rank-16 LoRA adapter is trained on the distilled pairs. Only 0.47% of parameters (~21M of 4.56B) are trained — the base stays frozen — which is what keeps it inside 16 GB.
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/Qwen3.5-4B",
    max_seq_length = 4096,
    load_in_4bit = False,   # bf16, NOT 4-bit for Qwen3.5
    load_in_16bit = True,
)
model = FastModel.get_peft_model(
    model, r=16, lora_alpha=16, lora_dropout=0,
    use_gradient_checkpointing="unsloth",
    target_modules=["q_proj","k_proj","v_proj","o_proj",
                    "gate_proj","up_proj","down_proj"],
)
# 3 epochs · lr 2e-4 · 777 steps · final loss 0.9346
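The figures quoted for this run can be sanity-checked with a few lines of arithmetic. The sketch below uses only numbers stated on this page (about 21M trainable parameters, 4.56B total, 2,068 examples, 3 epochs, 777 steps). The effective batch size of 8 is an assumption, chosen because it reproduces the reported step count; it is not stated in the source. `lora_params` is the usual count for a rank-r adapter on a single linear layer.

```python
import math

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

def trainable_fraction(trainable: float, total: float) -> float:
    """Share of parameters that receive gradients, as a percentage."""
    return 100.0 * trainable / total

def total_steps(num_examples: int, effective_batch: int, epochs: int) -> int:
    """Optimizer steps, assuming the last partial batch of each epoch counts as a step."""
    return math.ceil(num_examples / effective_batch) * epochs

# About 0.46; the quoted 0.47% presumably comes from the unrounded parameter count
frac = trainable_fraction(21e6, 4.56e9)

# Assumed effective batch size of 8 (not stated in the source)
steps = total_steps(2068, effective_batch=8, epochs=3)
```

As a made-up illustration (not Qwen3.5-4B's real layer shape), `lora_params(4096, 4096, 16)` adds 131,072 weights to a single square projection, which is why adapters on all seven projection types still total only tens of millions of parameters.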
transformers >= 5.2.0 is required to recognize the architecture, but Unsloth caps at <= 5.5.0. transformers 5.5.0 satisfies both constraints and is the version pinned here. Keep numpy < 2.3 too, or numba breaks.
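With a version window this tight, a guard at the top of the training script can fail fast when the environment drifts. This is a minimal sketch: the helper names are hypothetical, the bounds are the ones quoted above, and the parser only handles plain numeric versions such as 5.5.0 (a release-candidate suffix would need a real version parser).

```python
from importlib.metadata import PackageNotFoundError, version

def parse_version(text: str) -> tuple:
    """Turn '5.5.0' into (5, 5, 0); only plain dotted integers are supported."""
    return tuple(int(part) for part in text.split(".")[:3])

def transformers_ok(installed: str) -> bool:
    """True when the version sits in the quoted window: >= 5.2.0 and <= 5.5.0."""
    return (5, 2, 0) <= parse_version(installed) <= (5, 5, 0)

def check_environment() -> None:
    """Raise before any model is loaded if transformers is missing or out of range."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        raise RuntimeError("transformers is not installed")
    if not transformers_ok(installed):
        raise RuntimeError(f"transformers {installed} is outside 5.2.0..5.5.0")
```

Calling `check_environment()` before importing Unsloth turns a confusing architecture or tokenizer error into a one-line message about the installed version.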
The adapter is merged into the base as 16-bit safetensors, then converted to GGUF with llama.cpp's own converter and quantized to q4_k_m. llama.cpp already understands Qwen3.5's SSM/Mamba-hybrid layers and the MTP head — it converts cleanly.
# merge adapter → 16-bit safetensors (Unsloth)
# then convert with llama.cpp directly:
python3 llama.cpp/convert_hf_to_gguf.py ./gguf_out \
    --outfile adi-qwen3.5-4b-glm5.2-general-bf16.gguf \
    --outtype bf16

# quantize bf16 → q4_k_m (9 GB → 2.7 GB)
./llama.cpp/build/bin/llama-quantize \
    adi-qwen3.5-4b-glm5.2-general-bf16.gguf \
    adi-qwen3.5-4b-glm5.2-general-q4_k_m.gguf q4_k_m
Installing llama.cpp's requirements.txt downgrades transformers, and the converter then fails with the error "TokenizersBackend does not exist". Fix: reinstall transformers 5.5.0 afterwards, since that version reads the merged tokenizer correctly. Never go above 5.5.0, because that breaks Unsloth.
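The size drop from quantization follows from the average number of bits stored per weight. The sketch below uses the 4.56B parameter count quoted earlier on this page and decimal gigabytes. It ignores GGUF metadata and the fact that q4_k_m keeps some tensors at higher precision, so the result is an average, not a description of the format.

```python
def model_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate file size in decimal GB for a given average bits per weight."""
    return num_params * bits_per_weight / 8 / 1e9

def effective_bits_per_weight(size_gb: float, num_params: float) -> float:
    """Average bits stored per parameter, inferred from a file size."""
    return size_gb * 1e9 * 8 / num_params

PARAMS = 4.56e9                                   # total parameters, as quoted on this page
bf16_gb = model_size_gb(PARAMS, 16)               # about 9.1 GB, in line with the ~9 GB bf16 file
q4_bits = effective_bits_per_weight(2.7, PARAMS)  # about 4.7 bits per weight on average
```

The average lands above 4 bits because the quantized file stores block scales alongside the 4-bit values and leaves a few tensors in wider formats.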
A Modelfile points Ollama at the quantized GGUF. After ollama create, the model is live in the fleet and reachable through the standard OpenAI-compatible endpoint alongside every other ADI model.
# Modelfile
FROM ./adi-qwen3.5-4b-glm5.2-general-q4_k_m.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
SYSTEM """You are ADI, a precise and knowledgeable assistant.
Answer clearly and completely, reasoning step by step where it helps."""
ollama create adi-qwen3.5-4b-glm5.2-general -f Modelfile
ollama run adi-qwen3.5-4b-glm5.2-general "Explain the CAP theorem in two sentences."
The model then appears in ollama list as adi-qwen3.5-4b-glm5.2-general:latest (2.7 GB). The empty <think></think> block confirms that thinking-off carried through from the dataset.
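Calling the served model through Ollama's OpenAI-compatible endpoint could look like the following. It is a minimal sketch using only the standard library. The base URL assumes a default local Ollama install on port 11434, and the helper names `build_chat_request` and `ask` are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"   # assumption: default local Ollama address
MODEL = "adi-qwen3.5-4b-glm5.2-general"

def build_chat_request(prompt: str, model: str = MODEL) -> urllib.request.Request:
    """Build a POST request in the OpenAI chat-completions format."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt: str) -> str:
    """Send the prompt and return the reply text (requires Ollama to be running)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=300) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

`ask("Explain the CAP theorem in two sentences.")` mirrors the `ollama run` smoke test above, but goes through the same HTTP path that other clients in the fleet would use.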