Distilling glm-5.2 into a local Qwen2.5-14B
A mid-size local model that reasons and answers like a frontier teacher — built by distilling glm-5.2 general-knowledge responses into a Qwen2.5-14B student with a bf16 LoRA fine-tune, then merged, converted, and quantized to GGUF for Ollama. More capacity than the 4B/8B Qwen3 models while still fitting on a single 16 GB GPU.
This page walks through how a mid-size, fully local model was built to reason and answer like a much larger one — using knowledge distillation. A strong teacher (glm-5.2) generates high-quality answers to a few thousand prompts, and a Qwen2.5-14B student is fine-tuned to imitate them. Everything GPU-heavy runs on thelab-genesis, a dedicated training box fitted with an RTX 5060 Ti (16 GB).
The goal was a mid-size model with enough parametric capacity to absorb the teacher's reasoning style deeply, while still fitting a single 16 GB GPU. Qwen2.5-14B sits in a sweet spot: nearly double the parameters of an 8B, with more room to internalize complex answer patterns from the teacher.
Everything runs on thelab-genesis, fitted with an RTX 5060 Ti (16 GB). The teacher is called through Ollama's cloud routing; every GPU-heavy step — training, merging, conversion — is local.
Teacher glm-5.2 (via Ollama, thinking disabled)
Student Qwen2.5-14B (unsloth/Qwen2.5-14B, bf16 LoRA)
Trainer Unsloth + TRL (SFT, rank-16 LoRA)
Convert llama.cpp (convert_hf_to_gguf.py → f16 → q4_k_m)
Serve Ollama (adi-qwen2.5-14b-glm5.2-general:latest)
Hardware RTX 5060 Ti 16GB (single GPU)
ollama list as adi-qwen2.5-14b-glm5.2-general:latest.