adi-qwen3.5-4b-glm5.2-general

Distilling glm-5.2 into a local Qwen3.5-4B

A 2.7 GB local model that answers like a frontier teacher. Built by distilling glm-5.2 general-knowledge responses into a Qwen3.5-4B student with a bf16 LoRA fine-tune — then merged, converted, quantized to GGUF, and served from Ollama. The whole pipeline runs on a single 16 GB GPU.

This page walks through how a small, fully local model was built to reason and answer like a much larger one — using knowledge distillation. A strong teacher (glm-5.2) generates high-quality answers to a few thousand prompts, and a small student (Qwen3.5-4B) is fine-tuned to imitate them. Everything GPU-heavy runs on thelab-genesis, a dedicated training box fitted with an RTX 5060 Ti (16 GB).

Qwen3.5-4B · glm-5.2 (teacher) · Unsloth + TRL · bf16 LoRA · llama.cpp · GGUF q4_k_m · Ollama · RTX 5060 Ti 16GB

What Got Built

The goal was a small, fully local model that reasons and answers like a much larger one on general-knowledge questions — and that still supports native tool calling. The teacher is called through Ollama's cloud routing; every GPU-heavy step — training, merging, conversion — is local.

2,068      distilled training pairs
1.36M      teacher tokens spent
2h53m      LoRA training (777 steps)
0.93       final train loss
2.7 GB     final q4_k_m GGUF
3 epochs   over the dataset

A note on what distillation does: it transfers the teacher's reasoning style and answer quality, not net-new facts. A 4B model won't become an encyclopedia, but it will reason and respond noticeably more like its teacher on topics it already partly knows. For raw factual recall, RAG is the right tool, not fine-tuning.

The Stack

Every local step runs on thelab-genesis and its RTX 5060 Ti (16 GB); only the teacher calls leave the box, through Ollama's cloud routing. The stack, stage by stage:

Teacher    glm-5.2          (via Ollama, thinking disabled)
Student    Qwen3.5-4B       (unsloth/Qwen3.5-4B, bf16 LoRA)
Trainer    Unsloth + TRL    (transformers 5.5.0, torch 2.10)
Convert    llama.cpp        (convert_hf_to_gguf.py → bf16 → q4_k_m)
Serve      Ollama           (adi-qwen3.5-4b-glm5.2-general:latest)
Hardware   RTX 5060 Ti 16GB (single GPU, ~10GB used during training)

Heads up: Qwen3.5 quantization. Unsloth recommends bf16 LoRA, not 4-bit QLoRA, for Qwen3.5. Its delta-net / Mamba-hybrid layers quantize poorly during training, so 4-bit costs accuracy. bf16 LoRA uses ~10 GB on a 4B model, which is comfortable on a 16 GB card.
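As a rough check on the ~10 GB figure, the sketch below estimates the bf16 footprint from the parameter counts quoted on this page (4.56B base, ~21M trainable). The optimizer layout (two fp32 Adam moments per trainable parameter) is an assumption, and activations, CUDA context, and framework overhead are not modeled, so treat the result as a floor rather than a measurement.

```python
# Back-of-envelope VRAM estimate for bf16 LoRA training.
# Parameter counts come from this page; the optimizer layout is an assumption.
BASE_PARAMS = 4.56e9   # frozen base model
LORA_PARAMS = 21e6     # trainable adapter weights (approximate)
BYTES_BF16 = 2
BYTES_FP32 = 4


def weights_gb(params, bytes_per_param):
    """Size of a tensor collection in decimal gigabytes."""
    return params * bytes_per_param / 1e9


def lora_training_gb(base_params=BASE_PARAMS, lora_params=LORA_PARAMS):
    """Frozen base + adapter weights + adapter gradients + Adam moments."""
    frozen = weights_gb(base_params, BYTES_BF16)
    adapter = weights_gb(lora_params, BYTES_BF16)
    grads = weights_gb(lora_params, BYTES_BF16)
    adam = weights_gb(lora_params, 2 * BYTES_FP32)  # assumed: two fp32 moments
    return frozen + adapter + grads + adam


print(f"static footprint: {lora_training_gb():.2f} GB")
```

Under these assumptions the static footprint comes out a little above 9 GB, almost all of it the frozen base. The gap up to the observed ~10 GB is plausibly activations and runtime overhead.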

01 Build the Seed Prompts

Distillation needs a diverse set of questions to ask the teacher. Rather than hand-writing them, pull human-written instructions from the Dolly-15k dataset, filter out anything needing an attached context passage, dedupe, and keep 2,000.

from datasets import load_dataset

ds = load_dataset("databricks/databricks-dolly-15k", split="train")
# Categories whose instructions depend on an attached context passage.
skip = {"closed_qa", "information_extraction", "summarization"}

seen, prompts = set(), []
for row in ds:
    if row.get("category") in skip:
        continue
    q = row["instruction"].strip()
    key = q.lower()  # case-insensitive dedupe key
    # Drop very short, very long, and duplicate instructions.
    if not (15 <= len(q) <= 400) or key in seen:
        continue
    seen.add(key)
    prompts.append(q)
    if len(prompts) >= 2000:
        break
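To make the filter above easy to test without downloading Dolly, it can be wrapped in a function and run on a few hand-made rows. This is a sketch only: the function name select_prompts and the toy rows are illustrative, while the skip categories and length bounds match the loop above.

```python
# Same filtering rules as the loop above, wrapped so they can be unit-tested.
# select_prompts and the sample rows are illustrative names, not part of the build.
SKIP_CATEGORIES = {"closed_qa", "information_extraction", "summarization"}


def select_prompts(rows, limit=2000, min_len=15, max_len=400):
    """Return up to `limit` unique, context-free instructions."""
    seen, prompts = set(), []
    for row in rows:
        if row.get("category") in SKIP_CATEGORIES:
            continue
        q = row["instruction"].strip()
        key = q.lower()
        if not (min_len <= len(q) <= max_len) or key in seen:
            continue
        seen.add(key)
        prompts.append(q)
        if len(prompts) >= limit:
            break
    return prompts


sample_rows = [
    {"instruction": "What is the capital of France?", "category": "open_qa"},
    {"instruction": "what is the capital of france?", "category": "open_qa"},
    {"instruction": "Summarize the passage above.", "category": "summarization"},
    {"instruction": "Too short", "category": "open_qa"},
    {"instruction": "  Explain how a hash table handles collisions.  ",
     "category": "general_qa"},
]
print(select_prompts(sample_rows))
```

On the sample rows only the first and last instructions survive: the second is a case-insensitive duplicate, the third is in a skipped category, and the fourth is under 15 characters.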

02 Generate the Dataset from the Teacher

Each prompt is sent to glm-5.2 through Ollama's native /api/chat endpoint with thinking disabled — for a 4B student you want the teacher's clean final answers, not chain-of-thought it's too small to imitate well. Output is written as Qwen chat-format JSONL.

import requests

# SYSTEM (the system prompt) and prompt (one seed question) are defined elsewhere.
payload = {
    "model": "glm-5.2",
    "messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user",   "content": prompt},
    ],
    "think": False,          # native API reliably disables thinking
    "stream": False,
    "options": {"num_predict": 2048, "temperature": 0.7},
}
r = requests.post("http://localhost:11434/api/chat", json=payload, timeout=300)
r.raise_for_status()         # surface HTTP errors instead of parsing an error body
answer = r.json()["message"]["content"].strip()

Why the native API? The think: false flag is honored by Ollama's native /api/chat endpoint but not reliably by the OpenAI-compatible /v1 path, where a thinking model can burn its whole token budget on hidden reasoning. Result here: 2,068 answers for ~1.36M tokens, far cheaper than a thinking run.
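The page does not show how each answer is written out, so here is a minimal sketch of the "chat-format JSONL" step. It assumes one JSON object per line with a messages list of role/content turns; the exact schema used in the build, and whether the system prompt is stored with each record, are assumptions. The helper names to_record and to_jsonl are hypothetical.

```python
import json


def to_record(prompt, answer, system=None):
    """Build one chat-format training example (assumed schema)."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    messages.append({"role": "assistant", "content": answer})
    return {"messages": messages}


def to_jsonl(records):
    """Serialize records as JSON Lines: one compact JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


example = to_record("What is a LoRA adapter?",
                    "A small set of low-rank weight updates.")
print(to_jsonl([example]), end="")
```

Appending the string returned by to_jsonl to a file opened in "a" mode after each teacher call keeps a long generation run resumable if it is interrupted.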

03 Fine-Tune with Unsloth (bf16 LoRA)

A rank-16 LoRA adapter is trained on the distilled pairs. Only 0.47% of parameters (~21M of 4.56B) are trained — the base stays frozen — which is what keeps it inside 16 GB.

from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name      = "unsloth/Qwen3.5-4B",
    max_seq_length  = 4096,
    load_in_4bit    = False,   # bf16 — NOT 4-bit for Qwen3.5
    load_in_16bit   = True,
)

model = FastModel.get_peft_model(
    model, r=16, lora_alpha=16, lora_dropout=0,
    use_gradient_checkpointing="unsloth",
    target_modules=["q_proj","k_proj","v_proj","o_proj",
                    "gate_proj","up_proj","down_proj"],
)
# 3 epochs · lr 2e-4 · 777 steps · final loss 0.9346
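The numbers in that last comment can be cross-checked with a little arithmetic. The sketch below assumes an effective batch size of 8 (per-device batch times gradient accumulation); that value is inferred from 2,068 examples, 3 epochs, and 777 steps, and is not stated in the source.

```python
import math


def total_steps(num_examples, epochs, effective_batch):
    """Optimizer steps when each epoch ends with one partial batch."""
    return math.ceil(num_examples / effective_batch) * epochs


def seconds_per_step(hours, minutes, steps):
    """Average wall-clock time per optimizer step."""
    return (hours * 3600 + minutes * 60) / steps


# Effective batch size of 8 is an inference, not a documented setting.
print(total_steps(2068, 3, 8))
print(round(seconds_per_step(2, 53, 777), 1))
```

With those inputs the step count matches the reported 777, and the 2h53m run works out to roughly 13 seconds per step on the 5060 Ti.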

Heads up: version trap. Qwen3.5 is brand new: transformers >= 5.2.0 is required to recognize the architecture, but Unsloth caps at <= 5.5.0. The version used here, transformers 5.5.0, satisfies both. Keep numpy < 2.3 too, or numba breaks.
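A small guard at the top of the training script can catch a drifted environment before any GPU time is spent. This is a sketch: check_pins and parse_version are hypothetical helpers, and the bounds are simply the ones quoted in the note above.

```python
def parse_version(text):
    """Turn '5.5.0' or '5.5.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_pins(transformers_version, numpy_version):
    """Return a list of human-readable problems; empty means the pins hold."""
    problems = []
    tv = parse_version(transformers_version)
    if tv < (5, 2, 0):
        problems.append("transformers too old for Qwen3.5 (need >= 5.2.0)")
    if tv > (5, 5, 0):
        problems.append("transformers newer than Unsloth allows (need <= 5.5.0)")
    if parse_version(numpy_version) >= (2, 3):
        problems.append("numpy must be < 2.3 or numba breaks")
    return problems


print(check_pins("5.5.0", "2.2.6"))
```

In a real environment the two version strings would come from importlib.metadata.version("transformers") and importlib.metadata.version("numpy").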

04 Merge, Convert, and Quantize

The adapter is merged into the base as 16-bit safetensors, then converted to GGUF with llama.cpp's own converter and quantized to q4_k_m. llama.cpp already understands Qwen3.5's SSM/Mamba-hybrid layers and the MTP head — it converts cleanly.

# merge adapter → 16-bit safetensors (Unsloth)
# then convert with llama.cpp directly:
python3 llama.cpp/convert_hf_to_gguf.py ./gguf_out \
    --outfile adi-qwen3.5-4b-glm5.2-general-bf16.gguf \
    --outtype bf16

# quantize bf16 → q4_k_m  (9 GB → 2.7 GB)
./llama.cpp/build/bin/llama-quantize \
    adi-qwen3.5-4b-glm5.2-general-bf16.gguf \
    adi-qwen3.5-4b-glm5.2-general-q4_k_m.gguf q4_k_m
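The size drop from quantization can be sanity-checked as bits per weight. The sketch below uses the rounded figures from this page (9 GB, 2.7 GB, 4.56B parameters) and treats the whole file as weights, which slightly overstates the true figure because GGUF files also carry metadata and keep some tensors at higher precision.

```python
def bits_per_weight(file_gb, params):
    """Average bits stored per parameter, treating the whole file as weights."""
    return file_gb * 1e9 * 8 / params


def compression_ratio(before_gb, after_gb):
    """How many times smaller the quantized file is."""
    return before_gb / after_gb


PARAMS = 4.56e9  # parameter count quoted in the fine-tuning step
print(round(bits_per_weight(9.0, PARAMS), 1))   # bf16 file
print(round(bits_per_weight(2.7, PARAMS), 1))   # q4_k_m file
print(round(compression_ratio(9.0, 2.7), 1))
```

The bf16 file lands near 16 bits per weight, as expected, and the q4_k_m file a little under 5. That is in line with a mixed 4-bit scheme that keeps a few tensors at higher precision.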

Heads up: tokenizer backend. Installing llama.cpp's requirements.txt downgrades transformers and breaks the converter with the error "TokenizersBackend does not exist". Fix: reinstall transformers 5.5.0 afterwards, since that version reads the merged tokenizer correctly. Never go above 5.5.0, which breaks Unsloth.

05 Serve from Ollama

A Modelfile points Ollama at the quantized GGUF. After ollama create, the model is live in the fleet and reachable through the standard OpenAI-compatible endpoint alongside every other ADI model.

# Modelfile
FROM ./adi-qwen3.5-4b-glm5.2-general-q4_k_m.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
SYSTEM """You are ADI, a precise and knowledgeable assistant.
Answer clearly and completely, reasoning step by step where it helps."""

ollama create adi-qwen3.5-4b-glm5.2-general -f Modelfile
ollama run  adi-qwen3.5-4b-glm5.2-general "Explain the CAP theorem in two sentences."
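For callers that use the OpenAI-compatible endpoint mentioned above rather than the CLI, a request might look like the sketch below. It uses only the standard library and assumes Ollama is listening on its default localhost:11434; build_request and ask are hypothetical helper names.

```python
import json
import urllib.request

# Assumed: default Ollama host and port on the serving machine.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"
MODEL = "adi-qwen3.5-4b-glm5.2-general"


def build_request(prompt, model=MODEL, url=OLLAMA_URL):
    """Build an OpenAI-style chat completion request without sending it."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=120):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

Calling ask("Explain the CAP theorem in two sentences.") should return the same kind of answer as the ollama run command above, provided the model has been created on that host.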

Verify

Model appears in ollama list as adi-qwen3.5-4b-glm5.2-general:latest (2.7 GB).
Smoke-test prompts return coherent, well-structured answers — not repeated tokens or gibberish.
The empty <think></think> block confirms thinking-off carried through from the dataset.
Step-by-step answers (e.g. "what caused the 2008 financial crisis") show the teacher's structured reasoning style — the distillation took.
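Two of these checks are easy to automate. The sketch below separates the <think></think> block from the answer and flags obviously degenerate output. The repetition heuristic (one word making up more than half of a reply of at least eight words) is an illustrative threshold, not something used in the original build, and both function names are hypothetical.

```python
import re
from collections import Counter

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(text):
    """Return (thinking, answer); thinking is '' if the block is empty or absent."""
    match = THINK_BLOCK.search(text)
    if not match:
        return "", text.strip()
    answer = text[:match.start()] + text[match.end():]
    return match.group(1).strip(), answer.strip()


def looks_degenerate(answer, max_share=0.5, min_words=8):
    """Flag empty replies and replies dominated by one repeated word."""
    words = answer.lower().split()
    if not words:
        return True
    top_count = Counter(words).most_common(1)[0][1]
    return len(words) >= min_words and top_count / len(words) > max_share


raw = "<think>\n\n</think>\n\nThe CAP theorem limits distributed systems."
thinking, answer = split_think(raw)
print(repr(thinking), looks_degenerate(answer))
```

An empty thinking string together with a non-degenerate answer corresponds to the "thinking-off carried through" and "not repeated tokens" checks in the list above. Judging coherence and reasoning style still needs a human read.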

If output is gibberish: a freshly converted GGUF of a brand-new architecture can load but produce garbled text if Ollama's bundled llama.cpp runtime is older than the converter that built the file. Fix: update Ollama on the serving host.