ADI Qwen3 Line / Model Build

adi-qwen3-8b-glm5.2-general

Distilling glm-5.2 into a local Qwen3-8B

A ~5 GB local model that reasons and answers like a frontier teacher — with more headroom than the 4B. Built by distilling glm-5.2 general-knowledge responses into a Qwen3-8B student with a 4-bit QLoRA fine-tune, then merged, converted, and quantized to GGUF for Ollama. It keeps the base's 128K context and native tool calling, and still runs on a single 16 GB GPU.


This page walks through how a small, fully local model was built to reason and answer like a much larger one — using knowledge distillation. A strong teacher (glm-5.2) generates high-quality answers to a few thousand prompts, and a Qwen3-8B student is fine-tuned to imitate them. It's the same recipe as the 4B build, scaled to an 8B base for more capacity. Everything GPU-heavy runs on thelab-genesis, a dedicated training box fitted with an RTX 5060 Ti (16 GB).

Qwen3-8B · glm-5.2 (teacher) · Unsloth + TRL · 4-bit QLoRA · llama.cpp · GGUF q4_k_m · Ollama · RTX 5060 Ti 16GB

What Got Built

The goal was a small, fully local model that reasons and answers like a much larger one on general-knowledge questions — and that still supports native tool calling. The teacher is called through Ollama's cloud routing; every GPU-heavy step — training, merging, conversion — is local.

2,068     distilled training pairs
1.36M     teacher tokens spent
777       training steps (3 epochs)
~5 GB     final q4_k_m GGUF
128K      context window
8B        student parameters

A note on what distillation does: it transfers the teacher's reasoning style and answer quality, not net-new facts. An 8B model won't become an encyclopedia, though it carries more parametric knowledge than the 4B, but it will reason and respond noticeably more like its teacher on topics it already partly knows. For raw factual recall, RAG is the right tool, not fine-tuning.

The Stack

All GPU-heavy work runs on thelab-genesis and its single RTX 5060 Ti (16 GB). Only the teacher is remote, reached through Ollama's cloud routing.

Teacher    glm-5.2          (via Ollama, thinking disabled)
Student    Qwen3-8B         (unsloth/Qwen3-8B, 4-bit QLoRA)
Trainer    Unsloth + TRL    (SFT, rank-16 LoRA)
Convert    llama.cpp        (convert_hf_to_gguf.py → f16 → q4_k_m)
Serve      Ollama           (adi-qwen3-8b-glm5.2-general:latest)
Hardware   RTX 5060 Ti 16GB (single GPU)

Why 4-bit QLoRA here (and not on the 4B): unlike Qwen3.5-4B, whose gated-delta / Mamba-hybrid layers quantize poorly during training and need bf16 LoRA, Qwen3-8B is a standard dense transformer that trains cleanly in 4-bit QLoRA. That makes an 8B comfortable on a single 16 GB card: the base loads quantized, and only the small adapter trains in higher precision.
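To make the 16 GB claim concrete, here is a rough back-of-the-envelope sketch. The layer dimensions are assumptions taken from Qwen3-8B's published config (36 layers, hidden size 4096, MLP size 12288, key/value projection width 1024), the 8.2B parameter count is approximate, and the adapter is assumed to use 2 bytes per weight. It ignores activations, optimizer state, and quantization overhead, so treat the numbers as a lower bound rather than a measurement.

```python
# Rough memory arithmetic for 4-bit QLoRA on an 8B dense model.
# Assumptions (not stated on this page): Qwen3-8B dimensions, ~8.2e9 parameters.
PARAMS = 8.2e9
LAYERS, HIDDEN, MLP, KV_DIM = 36, 4096, 12288, 1024
RANK = 16

# (in_features, out_features) for each LoRA target module in one layer.
modules = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

# A rank-r adapter adds an (in x r) and an (r x out) matrix per module.
adapter_params = LAYERS * sum(RANK * (i + o) for i, o in modules.values())

base_4bit_gb = PARAMS * 0.5 / 1e9        # 4 bits  = 0.5 bytes per weight
base_bf16_gb = PARAMS * 2 / 1e9          # 16 bits = 2 bytes per weight
adapter_mb = adapter_params * 2 / 1e6    # assumed 2 bytes per adapter weight

print(f"adapter params: {adapter_params:,}")
print(f"4-bit base:     ~{base_4bit_gb:.1f} GB")
print(f"bf16 base:      ~{base_bf16_gb:.1f} GB")
print(f"adapter:        ~{adapter_mb:.0f} MB")
```

Under these assumptions the frozen base drops from about 16 GB in bf16, which would not fit on the card at all, to about 4 GB in 4-bit, while the trainable adapter is well under 1% of the base's parameter count.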

01 Build the Seed Prompts

Distillation needs a diverse set of questions to ask the teacher. Rather than hand-writing them, the build pulls human-written instructions from the Dolly-15k dataset, drops the categories that depend on an attached context passage, drops instructions shorter than 15 or longer than 400 characters, dedupes case-insensitively, and keeps the first 2,000.

from datasets import load_dataset

ds = load_dataset("databricks/databricks-dolly-15k", split="train")
# Categories whose instructions only make sense with an attached context passage.
skip = {"closed_qa", "information_extraction", "summarization"}

seen, prompts = set(), []
for row in ds:
    if row.get("category") in skip:
        continue
    q = row["instruction"].strip()
    key = q.lower()  # case-insensitive dedupe key
    # Keep instructions of 15 to 400 characters that have not been seen before.
    if not (15 <= len(q) <= 400) or key in seen:
        continue
    seen.add(key)
    prompts.append(q)
    if len(prompts) >= 2000:
        break

02 Generate the Dataset from the Teacher

Each prompt is sent to glm-5.2 through Ollama's native /api/chat endpoint with thinking disabled: the student should imitate the teacher's clean final answers, not its chain-of-thought. Each answer is written out as Qwen chat-format JSONL. The same distilled set feeds both the 4B and 8B students.

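import requests

# SYSTEM (the system prompt) and prompt (one seed question) are defined elsewhere.
payload = {
    "model": "glm-5.2",
    "messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user",   "content": prompt},
    ],
    "think": False,          # native API reliably disables thinking
    "stream": False,         # return one complete JSON response
    "options": {"num_predict": 2048, "temperature": 0.7},
}
r = requests.post("http://localhost:11434/api/chat", json=payload, timeout=300)
r.raise_for_status()         # fail on HTTP errors instead of parsing an error body
answer = r.json()["message"]["content"].strip()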

Why the native API: Ollama's native /api/chat endpoint honors the think: false flag, but the OpenAI-compatible /v1 path does not do so reliably, and there a thinking model can burn its whole token budget on hidden reasoning. Result here: 2,068 answers for ~1.36M tokens, far cheaper than a thinking run.
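The page does not show how each answer is written to disk, so the following is only a sketch of one common layout: one JSON object per line with a messages list of system, user, and assistant turns. The helper names to_chat_record and append_jsonl, and the exact record schema, are assumptions for illustration rather than the build's actual code.

```python
import json

def to_chat_record(system: str, prompt: str, answer: str) -> dict:
    """Wrap one teacher answer as a chat-style training example."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ]
    }

def append_jsonl(path: str, record: dict) -> None:
    """Append one record as a single JSON line, keeping non-ASCII text readable."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Appending one line per answer means an interrupted run can resume without rewriting what was already collected.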

03 Fine-Tune with Unsloth (4-bit QLoRA)

A rank-16 LoRA adapter is trained on the distilled pairs over a 4-bit-quantized base. Only the small adapter moves — the base stays frozen — so an 8B fits comfortably on the 16 GB card.

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name      = "unsloth/Qwen3-8B",
    max_seq_length  = 4096,
    load_in_4bit    = True,    # 4-bit QLoRA — fine for dense Qwen3
)

model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16, lora_dropout=0,
    use_gradient_checkpointing="unsloth",
    target_modules=["q_proj","k_proj","v_proj","o_proj",
                    "gate_proj","up_proj","down_proj"],
)
# 3 epochs · 777 steps · rank 16
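The 777-step figure can be sanity-checked against the dataset size. The page does not state the batch size, so the effective batch of 8 below is an inference: it is the value that reproduces 777 when all 2,068 pairs are used for training and a partial final batch still counts as a step. A per-device batch of 2 with 4 gradient-accumulation steps would be one hypothetical way to reach it.

```python
import math

def total_steps(num_examples: int, effective_batch: int, epochs: int) -> int:
    """Optimizer steps when a partial final batch still counts as a step."""
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs

# 2,068 pairs / 8 per step = 258.5, rounded up to 259 steps per epoch.
print(total_steps(num_examples=2068, effective_batch=8, epochs=3))  # 777
```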

04 Merge, Convert, and Quantize

The adapter is merged into the base as 16-bit safetensors, then converted to GGUF with llama.cpp's own converter and quantized to q4_k_m. Qwen3-8B uses the well-supported Qwen3 architecture, so the conversion is uneventful.

# merge adapter → 16-bit safetensors (Unsloth)
# then convert with llama.cpp directly:
python3 llama.cpp/convert_hf_to_gguf.py ./merged \
    --outfile adi-qwen3-8b-glm5.2-general-f16.gguf \
    --outtype f16

# quantize f16 → q4_k_m  (~16 GB → ~5 GB)
./llama.cpp/build/bin/llama-quantize \
    adi-qwen3-8b-glm5.2-general-f16.gguf \
    adi-qwen3-8b-glm5.2-general-q4_k_m.gguf q4_k_m
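The ~16 GB and ~5 GB figures follow from bits per weight. The sketch below assumes roughly 8.2B parameters and an average of about 4.85 bits per weight for q4_k_m, which mixes 4-bit and 6-bit tensors. Both numbers are approximations rather than values measured from this build, and file metadata is ignored.

```python
PARAMS = 8.2e9        # assumed approximate parameter count for Qwen3-8B
BITS_F16 = 16.0
BITS_Q4_K_M = 4.85    # assumed average bits per weight for q4_k_m

def gguf_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate file size in GB from parameter count and bits per weight."""
    return params * bits_per_weight / 8 / 1e9

f16_gb = gguf_size_gb(PARAMS, BITS_F16)
q4_gb = gguf_size_gb(PARAMS, BITS_Q4_K_M)

print(f"f16:    ~{f16_gb:.1f} GB")
print(f"q4_k_m: ~{q4_gb:.1f} GB")
```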

05 Serve from Ollama

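A Modelfile points Ollama at the quantized GGUF. After ollama create, the model is live in the fleet and reachable through the standard OpenAI-compatible endpoint alongside every other ADI model. The Modelfile sets num_ctx to 8192, so requests get an 8K window by default rather than the full 128K; raise num_ctx when longer context is needed.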

# Modelfile
FROM ./adi-qwen3-8b-glm5.2-general-q4_k_m.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
SYSTEM """You are ADI, a precise and knowledgeable assistant.
Answer clearly and completely, reasoning step by step where it helps."""

# build it locally from the Modelfile…
ollama create adi-qwen3-8b-glm5.2-general -f Modelfile

# …or pull the published GGUF straight from Hugging Face
ollama run hf.co/AdvancedDataIntelligence/adi-qwen3-8b-glm5.2-general-GGUF:Q4_K_M \
    "Explain the CAP theorem in two sentences."

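Once the model is created, a client can reach it through Ollama's OpenAI-compatible route. This is a minimal standard-library sketch, assuming Ollama is listening on its default local port (11434). build_chat_request and ask are hypothetical helper names, and a fleet deployment would likely use a different host.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"   # assumed: Ollama on its default local port
MODEL = "adi-qwen3-8b-glm5.2-general"

def build_chat_request(question: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    body = {"model": MODEL, "messages": [{"role": "user", "content": question}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(question: str, timeout: float = 120.0) -> str:
    """Send one question and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(question), timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the server running, ask("Explain the CAP theorem in two sentences.") should return the same kind of answer as the ollama run command above.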
Verify

Model appears in ollama list as adi-qwen3-8b-glm5.2-general:latest (~5 GB).
Smoke-test prompts return coherent, well-structured answers, not repeated tokens or gibberish.
Tool calling still works: the capability is inherited from the Qwen3-8B base and survives the fine-tune.
Step-by-step answers (e.g. "what caused the 2008 financial crisis") show the teacher's structured reasoning style, a sign the distillation took, with noticeably more depth than the 4B.
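Two of these checks can be partly automated. The helpers below are a hypothetical sketch, not part of the build: looks_degenerate only catches the repeated-token failure mode (it cannot judge whether varied text is gibberish), and tool_calls assumes the response shape of Ollama's native /api/chat, where tool calls appear under message.tool_calls. The thresholds are arbitrary starting points.

```python
from collections import Counter

def looks_degenerate(text: str, max_share: float = 0.3, min_words: int = 20) -> bool:
    """Flag output in which a single word makes up more than max_share of all words."""
    words = text.lower().split()
    if len(words) < min_words:
        return False  # too short to judge
    top_count = Counter(words).most_common(1)[0][1]
    return top_count / len(words) > max_share

def tool_calls(chat_response: dict) -> list:
    """Return the tool calls from an /api/chat response, or an empty list if none."""
    return chat_response.get("message", {}).get("tool_calls") or []
```

A reply that passes looks_degenerate still needs a human read for coherence and depth; the helper is only a cheap first filter.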

If output is gibberish: a freshly converted GGUF of a brand-new architecture can load but still produce garbled text if Ollama's bundled llama.cpp runtime is older than the converter that built the file. Fix: update Ollama on the serving host.