ADI Qwen2.5-Coder Line / Model Build

adi-qwen2.5-coder-7b-kimi2.7code-coder

Distilling kimi-k2.7-code into a local Qwen2.5-Coder-7B

A fully local coding model trained to imitate how a frontier teacher writes and reasons about code. It was built by distilling kimi-k2.7-code completions into a Qwen2.5-Coder-7B student with a LoRA fine-tune, then merged, converted, quantized to GGUF, and served from Ollama. Every GPU-heavy step runs on a single 16 GB GPU.

This page walks through the build, which relies on knowledge distillation: a strong coding teacher (kimi-k2.7-code) generates solutions to about 2,000 programming prompts, and a small student (Qwen2.5-Coder-7B) is fine-tuned to imitate them. Everything GPU-heavy runs on thelab-genesis, a dedicated training box fitted with an RTX 5060 Ti (16 GB).

Qwen2.5-Coder-7B
kimi-k2.7-code (teacher)
Unsloth + TRL
4-bit QLoRA
llama.cpp
GGUF q4_k_m
Ollama
RTX 5060 Ti 16GB

What Got Built

The goal was a small, fully local coding assistant that generates, debugs, and explains code like a much larger model — and that still handles tool-call formats. The teacher is called through Ollama Cloud; every GPU-heavy step — training, merging, conversion — runs locally on a single card.

2,000       distilled coding pairs
7B          Qwen2.5-Coder student
kimi-k2.7   coding teacher
QLoRA       4-bit rank-16 adapter
q4_k_m      final GGUF quant
16 GB       single-GPU training

A note on what distillation does: It transfers the teacher's problem-solving approach and code style (how it structures solutions, handles edge cases, and explains its reasoning), not net-new library knowledge. A 7B coder won't memorize every API, but it will tackle a task the way its much larger teacher would. For up-to-the-minute API details, pair it with retrieval, not more fine-tuning.

The Stack

Only the teacher is remote, reached through Ollama Cloud. Training, merging, and conversion all run locally on thelab-genesis, fitted with an RTX 5060 Ti (16 GB).

Teacher    kimi-k2.7-code:cloud   (via Ollama Cloud)
Student    Qwen2.5-Coder-7B       (unsloth/Qwen2.5-Coder-7B-Instruct, 4-bit QLoRA)
Trainer    Unsloth + TRL          (transformers + torch)
Convert    llama.cpp              (convert_hf_to_gguf.py → f16 → q4_k_m)
Serve      Ollama                 (adi-qwen2.5-coder-7b-kimi2.7code-coder:latest)
Hardware   RTX 5060 Ti 16GB       (single GPU)

Why Qwen2.5-Coder is the easy case: Unlike the Qwen3.5 line, Qwen2.5-Coder is a standard dense transformer with no delta-net / Mamba-hybrid layers and no MTP head. That means ordinary 4-bit QLoRA trains cleanly (a 7B fits in well under 16 GB), llama.cpp's Qwen2 path converts without surprises, and no special transformers version pin is needed.

01 Build the Seed Prompts

Distillation needs a diverse set of coding tasks to ask the teacher. The seed set is about 2,000 programming prompts spanning generation, debugging, refactoring, and explanation across multiple languages. Prompts are deduplicated case-insensitively, and anything shorter than 15 characters is dropped, which filters out empty or fragmentary instructions.

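# raw_coding_instructions: any iterable of dicts with an "instruction" field
seen, prompts = set(), []
for row in raw_coding_instructions:          # code-instruction source
    q = row["instruction"].strip()
    key = q.lower()                          # case-insensitive dedupe key
    if len(q) < 15 or key in seen:           # drop empty/short prompts and repeats
        continue
    seen.add(key)
    prompts.append(q)
    if len(prompts) >= 2000:                 # stop once the seed set is full
        break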
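
The same filter can be packaged as a reusable function. The sketch below is a minimal version under a few assumptions: the name `build_seed_prompts`, the whitespace-normalized dedupe key, and the optional `max_len` cap are illustrative additions, not part of the original pipeline.

```python
def build_seed_prompts(rows, limit=2000, min_len=15, max_len=None):
    """Return up to `limit` unique, length-filtered prompts from `rows`.

    `rows` is any iterable of dicts with an "instruction" field.
    `max_len` is a hypothetical upper bound; None disables it.
    """
    seen, prompts = set(), []
    for row in rows:
        q = (row.get("instruction") or "").strip()
        if len(q) < min_len or (max_len is not None and len(q) > max_len):
            continue
        # Collapse case and runs of whitespace so near-identical prompts match.
        key = " ".join(q.lower().split())
        if key in seen:
            continue
        seen.add(key)
        prompts.append(q)
        if len(prompts) >= limit:
            break
    return prompts


rows = [
    {"instruction": "Write a function that reverses a linked list."},
    {"instruction": "write a function  that reverses a linked list."},  # duplicate
    {"instruction": "Fix this."},                                       # too short
    {"instruction": "Explain what a Python decorator does and when to use one."},
]
seed_prompts = build_seed_prompts(rows)
print(len(seed_prompts), "prompts kept")
```

Normalizing whitespace before deduplication is a design choice rather than a requirement; it catches copies that differ only in spacing, at the cost of treating whitespace-sensitive prompts as identical.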

02 Generate the Dataset from the Teacher

Each prompt is sent to kimi-k2.7-code:cloud through Ollama's native /api/chat endpoint. A coding system prompt keeps answers focused on clean, complete solutions. Output is written as chat-format JSONL, and the run logs input/output tokens per pair so cost is tracked live.

import requests

payload = {
    "model": "kimi-k2.7-code:cloud",
    "messages": [
        {"role": "system", "content": CODER_SYSTEM},   # coding system prompt
        {"role": "user",   "content": prompt},
    ],
    "stream": False,                                   # one JSON response, not chunks
    "options": {"num_predict": 4096, "temperature": 0.3},
}
r = requests.post("http://localhost:11434/api/chat", json=payload, timeout=600)
r.raise_for_status()                                   # fail loudly on HTTP errors
answer = r.json()["message"]["content"].strip()
# per-pair log line: [n/2000] OK (chars) | tokens: in=.. out=..

Low temperature for code: Code generation benefits from consistency more than creativity. A low temperature (~0.3) keeps the teacher's solutions tight and more repeatable across runs, which is the behavior the student should imitate.
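
The JSONL writing and token logging described above might look like the following. This is a sketch, not the original script: the helper names are hypothetical, the `messages` record layout is one common chat format assumed here, and keeping the system prompt in each record is also an assumption. The `prompt_eval_count` and `eval_count` fields are the token counters Ollama includes in a non-streamed `/api/chat` response.

```python
import json


def to_chat_record(system, prompt, answer):
    """Build one chat-format training example (assumed layout)."""
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": answer},
    ]}


def token_usage(response_json):
    """Return (input_tokens, output_tokens) from an Ollama chat response."""
    return (response_json.get("prompt_eval_count", 0),
            response_json.get("eval_count", 0))


def append_jsonl(path, record):
    """Append one record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hand-written stand-in for r.json(), so the example runs without a server.
fake_response = {
    "message": {"role": "assistant", "content": "def add(a, b):\n    return a + b"},
    "prompt_eval_count": 42,
    "eval_count": 17,
}
record = to_chat_record("You are a coding assistant.", "Write add(a, b).",
                        fake_response["message"]["content"].strip())
tokens_in, tokens_out = token_usage(fake_response)
line = json.dumps(record, ensure_ascii=False)
print(f"[1/2000] OK ({len(line)} chars) | tokens: in={tokens_in} out={tokens_out}")
```

Because `json.dumps` escapes newlines inside strings, multi-line code answers still occupy exactly one line per record, which is what JSONL readers expect.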

03 Fine-Tune with Unsloth (4-bit QLoRA)

A rank-16 LoRA adapter is trained on the distilled pairs over a 4-bit-quantized base. Only a fraction of a percent of parameters are trained — the base stays frozen — so a 7B fits comfortably on the 16 GB card.

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name      = "unsloth/Qwen2.5-Coder-7B-Instruct",
    max_seq_length  = 4096,
    load_in_4bit    = True,    # 4-bit QLoRA is fine for Qwen2.5
)

model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16, lora_dropout=0,
    use_gradient_checkpointing="unsloth",
    target_modules=["q_proj","k_proj","v_proj","o_proj",
                    "gate_proj","up_proj","down_proj"],
)
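
To see why only a fraction of a percent of parameters is trained, the adapter size can be estimated by hand. A LoRA adapter of rank r on a linear layer with d_in inputs and d_out outputs adds r × (d_in + d_out) weights. The sketch below applies that to the seven target modules; the layer dimensions are assumptions taken from the published Qwen2.5-7B configuration (hidden size 3584, MLP size 18944, 4 key/value heads of dimension 128, 28 layers), and the base parameter count is rounded.

```python
def lora_param_count(r, shapes, num_layers):
    """Total LoRA weights: r * (d_in + d_out) per adapted matrix, per layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


# Assumed Qwen2.5-7B dimensions (check config.json of the actual checkpoint).
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 3584, 18944, 4 * 128, 28

shapes = {
    "q_proj":    (HIDDEN, HIDDEN),
    "k_proj":    (HIDDEN, KV_DIM),
    "v_proj":    (HIDDEN, KV_DIM),
    "o_proj":    (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj":   (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

trainable = lora_param_count(16, shapes, LAYERS)
BASE_PARAMS = 7.6e9  # approximate size of the frozen base
print(f"{trainable:,} trainable weights, "
      f"{100 * trainable / BASE_PARAMS:.2f}% of the base")
```

Under these assumptions the rank-16 adapter comes to roughly 40 million weights, about half a percent of the base model, which is why the optimizer state and gradients stay small enough for a 16 GB card.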

04 Merge, Convert, and Quantize

The adapter is merged into the base as 16-bit safetensors, then converted to GGUF with llama.cpp's own converter and quantized to q4_k_m. Qwen2.5-Coder uses the well-supported Qwen2 architecture, so the conversion is uneventful.

# merge adapter → 16-bit safetensors (Unsloth)
# then convert with llama.cpp directly:
python3 llama.cpp/convert_hf_to_gguf.py ./merged \
    --outfile adi-qwen2.5-coder-7b-kimi2.7code-coder-f16.gguf \
    --outtype f16

# quantize f16 → q4_k_m
./llama.cpp/build/bin/llama-quantize \
    adi-qwen2.5-coder-7b-kimi2.7code-coder-f16.gguf \
    adi-qwen2.5-coder-7b-kimi2.7code-coder-q4_k_m.gguf q4_k_m
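
A quick sanity check after conversion is to confirm that the output really is a GGUF file before handing it to Ollama. GGUF files start with the four ASCII bytes `GGUF` followed by a little-endian 32-bit format version. The helper below is a minimal sketch; the function name is hypothetical and it reads only the first eight bytes, so it says nothing about whether the tensors themselves are intact.

```python
import os
import struct
import tempfile


def read_gguf_version(path):
    """Return the GGUF format version, or raise ValueError if the magic is wrong."""
    with open(path, "rb") as f:
        head = f.read(8)
    if len(head) < 8 or head[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    (version,) = struct.unpack("<I", head[4:8])
    return version


# Demo on a tiny stand-in file so the example runs without a real model.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.gguf")
with open(demo_path, "wb") as f:
    f.write(b"GGUF" + struct.pack("<I", 3))

print("GGUF version:", read_gguf_version(demo_path))
```

In practice the path would be the quantized file produced above, for example `adi-qwen2.5-coder-7b-kimi2.7code-coder-q4_k_m.gguf`.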

05 Serve from Ollama

A Modelfile points Ollama at the quantized GGUF. After ollama create, the coder is live in the fleet and reachable through the standard OpenAI-compatible endpoint alongside every other ADI model.

# Modelfile
FROM ./adi-qwen2.5-coder-7b-kimi2.7code-coder-q4_k_m.gguf
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
SYSTEM """You are ADI, a precise coding assistant.
Write correct, complete, well-structured code and explain it briefly."""

# build the model from the Modelfile, then smoke-test it
ollama create adi-qwen2.5-coder-7b-kimi2.7code-coder -f Modelfile
ollama run adi-qwen2.5-coder-7b-kimi2.7code-coder "Write a Python LRU cache with a capacity limit."
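
Calling the served model through the OpenAI-compatible endpoint could look like this. It is a sketch under stated assumptions: Ollama is listening on its default local address, the helper names are hypothetical, and only the standard library is used so nothing extra needs installing. Ollama exposes the OpenAI-style chat route at `/v1/chat/completions`.

```python
import json
import urllib.request

OLLAMA_BASE = "http://localhost:11434"  # assumed default Ollama address
MODEL = "adi-qwen2.5-coder-7b-kimi2.7code-coder"


def build_chat_request(prompt, model=MODEL, temperature=0.3):
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{OLLAMA_BASE}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=600):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Building the request needs no server; ask() does.
req = build_chat_request("Write a Python LRU cache with a capacity limit.")
print(req.get_method(), req.full_url)
```

With the model created and Ollama running, `ask("Write a Python LRU cache with a capacity limit.")` returns the reply as a string.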

Verify

If output is gibberish: A freshly converted GGUF can load but produce garbled text if Ollama's bundled llama.cpp runtime is older than the converter that built the file. Fix: update Ollama on the serving host.