ADI Qwen2.5 Line / Model Build

adi-qwen2.5-14b-glm5.2-general

Distilling glm-5.2 into a local Qwen2.5-14B

A mid-size local model that reasons and answers like a frontier teacher — built by distilling glm-5.2 general-knowledge responses into a Qwen2.5-14B student with a bf16 LoRA fine-tune, then merged, converted, and quantized to GGUF for Ollama. More capacity than the 4B/8B Qwen3 models while still fitting on a single 16 GB GPU.

Download GGUF View on Hugging Face

This page walks through how a mid-size, fully local model was built to reason and answer like a much larger one — using knowledge distillation. A strong teacher (glm-5.2) generates high-quality answers to a few thousand prompts, and a Qwen2.5-14B student is fine-tuned to imitate them. Everything GPU-heavy runs on thelab-genesis, a dedicated training box fitted with an RTX 5060 Ti (16 GB).

Qwen2.5-14B

glm-5.2 (teacher)

Unsloth + TRL

bf16 LoRA

llama.cpp

GGUF q4_k_m

Ollama

RTX 5060 Ti 16GB

What Got Built

The goal was a mid-size model with enough parametric capacity to absorb the teacher's reasoning style deeply, while still fitting a single 16 GB GPU. Qwen2.5-14B sits in a sweet spot: nearly double the parameters of an 8B, with more room to internalize complex answer patterns from the teacher.

14B

student parameters

~9 GB

final q4_k_m GGUF

128K

context window

bf16

LoRA precision

A note on what distillation does It transfers the teacher's reasoning style and answer quality, not net-new facts. A 14B model carries more parametric knowledge than the 4B or 8B builds, so it has more surface area to absorb the teacher's patterns. For raw factual recall beyond what the base knows, RAG is the right tool, not fine-tuning.

The Stack

Everything runs on thelab-genesis, fitted with an RTX 5060 Ti (16 GB). The teacher is called through Ollama's cloud routing; every GPU-heavy step — training, merging, conversion — is local.

Teacher    glm-5.2          (via Ollama, thinking disabled)
Student    Qwen2.5-14B      (unsloth/Qwen2.5-14B, bf16 LoRA)
Trainer    Unsloth + TRL    (SFT, rank-16 LoRA)
Convert    llama.cpp        (convert_hf_to_gguf.py → f16 → q4_k_m)
Serve      Ollama           (adi-qwen2.5-14b-glm5.2-general:latest)
Hardware   RTX 5060 Ti 16GB (single GPU)

Why bf16 LoRA The Qwen2.5 14B is a dense transformer but its larger size makes it more sensitive to quantization noise during training. bf16 LoRA keeps the adapter in full precision while the base stays frozen at 4-bit, giving a clean training signal without blowing the VRAM budget on a 16 GB card.

Downloads

Download GGUF View on Hugging Face

Verify

Model appears in ollama list as adi-qwen2.5-14b-glm5.2-general:latest.
Smoke-test prompts return coherent, well-structured answers — not repeated tokens or gibberish.
Step-by-step answers (e.g. "explain the CAP theorem") show the teacher's structured reasoning style with more depth than the 4B/8B builds.
The 14B base carries more parametric knowledge, so it handles a wider range of topics before hitting its limits.

If output is gibberish A freshly-converted GGUF can load but produce garbled text if Ollama's bundled llama.cpp runtime is older than the converter that built the file. Fix: update Ollama on the serving host.