Distilling kimi-k2.7-code into a local Qwen2.5-Coder-7B
A fully local coding model that writes and reasons about code like a frontier teacher. Built by distilling kimi-k2.7-code completions into a Qwen2.5-Coder-7B student with a LoRA fine-tune — then merged, converted, quantized to GGUF, and served from Ollama. The whole pipeline runs on a single 16 GB GPU.
This page walks through how that model was built using knowledge distillation. A strong coding teacher (kimi-k2.7-code) generates high-quality solutions to roughly 2,000 programming prompts, and a small student (Qwen2.5-Coder-7B) is fine-tuned to imitate them.
The goal was a small, fully local coding assistant that generates, debugs, and explains code like a much larger model, and that still handles tool-call formats.
The teacher is reached through Ollama Cloud; every GPU-heavy step (training, merging, conversion) runs locally on thelab-genesis, a dedicated training box fitted with a single RTX 5060 Ti (16 GB).
Teacher:  kimi-k2.7-code:cloud (via Ollama Cloud)
Student:  Qwen2.5-Coder-7B (unsloth/Qwen2.5-Coder-7B-Instruct, 4-bit QLoRA)
Trainer:  Unsloth + TRL (transformers + torch)
Convert:  llama.cpp (convert_hf_to_gguf.py → f16 → q4_k_m)
Serve:    Ollama (adi-qwen2.5-coder-7b-kimi2.7code-coder:latest)
Hardware: RTX 5060 Ti 16 GB (single GPU)
transformers version pin is needed.
Distillation needs a diverse set of coding tasks to put to the teacher. The seed set is ~2,000 programming prompts spanning generation, debugging, refactoring, and explanation across multiple languages, deduplicated and length-filtered so that each is a standalone, self-contained instruction.
seen, prompts = set(), []
for row in raw_coding_instructions:  # code-instruction source
    q = row["instruction"].strip()
    # skip empty, very short, or already-seen (case-insensitive) prompts
    if not q or len(q) < 15 or q.lower() in seen:
        continue
    seen.add(q.lower())
    prompts.append(q)
    if len(prompts) >= 2000:
        break
Each prompt is sent to kimi-k2.7-code:cloud through Ollama's native /api/chat endpoint. A coding system prompt keeps answers focused on clean, complete solutions. Output is written as chat-format JSONL, and the run logs input/output tokens per pair so cost is tracked live.
import requests

# CODER_SYSTEM (the coding system prompt) and prompt are defined elsewhere in the run
payload = {
    "model": "kimi-k2.7-code:cloud",
    "messages": [
        {"role": "system", "content": CODER_SYSTEM},
        {"role": "user", "content": prompt},
    ],
    "stream": False,
    "options": {"num_predict": 4096, "temperature": 0.3},
}
r = requests.post("http://localhost:11434/api/chat", json=payload, timeout=600)
r.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
answer = r.json()["message"]["content"].strip()
# [n/2000] OK (chars) | tokens: in=.. out=..
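The step from a teacher response to a line of chat-format JSONL can be sketched as follows. This is a minimal illustration, not the run's actual script: the helper names to_chat_record and append_jsonl are hypothetical, and the record layout (a "messages" list holding only the user and assistant turns) is an assumption about what "chat-format" means here. The token counts are read from the prompt_eval_count and eval_count fields that Ollama includes in non-streaming /api/chat responses.

```python
import json


def to_chat_record(prompt, response):
    """Turn one /api/chat response into a training record plus token usage.

    The record layout is an assumed chat format, not confirmed by the source.
    """
    answer = response["message"]["content"].strip()
    record = {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ]
    }
    usage = {
        "in": response.get("prompt_eval_count", 0),   # prompt tokens
        "out": response.get("eval_count", 0),         # generated tokens
    }
    return record, usage


def append_jsonl(path, record):
    """Append one record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Appending one line per pair means an interrupted run keeps everything already collected, and summing the usage dictionaries gives a running token total for cost tracking.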
A rank-16 LoRA adapter is trained on the distilled pairs over a 4-bit-quantized base. Only a fraction of a percent of the parameters are trained while the base stays frozen, so a 7B model fits comfortably on the 16 GB card.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen2.5-Coder-7B-Instruct",
    max_seq_length = 4096,
    load_in_4bit = True,  # 4-bit QLoRA is fine for Qwen2.5
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16, lora_dropout=0,
    use_gradient_checkpointing="unsloth",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
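The "fraction of a percent" figure can be checked with a little arithmetic. The sketch below assumes the layer shapes from the published Qwen2.5-7B configuration (hidden size 3584, MLP size 18944, 28 layers, 4 key/value heads of dimension 128) and an approximate total of 7.6B parameters; none of these numbers come from this page, so treat them as assumptions. A LoRA adapter of rank r on a projection with in_features and out_features adds r × (in_features + out_features) weights.

```python
# Assumed Qwen2.5-7B shapes (from the published model config, not from this page)
HIDDEN = 3584
INTERMEDIATE = 18944
LAYERS = 28
KV_DIM = 4 * 128  # 4 key/value heads x head dimension 128

# (in_features, out_features) for each projection targeted by the adapter
MODULES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank):
    """Count adapter weights: A is (rank x in), B is (out x rank) per module."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in MODULES.values())
    return per_layer * LAYERS


trainable = lora_params(16)
fraction = trainable / 7.6e9  # approximate total parameter count (assumption)
print(f"{trainable:,} trainable parameters ({fraction:.2%} of the model)")
```

Under these assumptions the adapter holds about 40 million weights, a little over half a percent of the model, which is why optimizer state and gradients stay small enough for a 16 GB card.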
The adapter is merged into the base as 16-bit safetensors, then converted to GGUF with llama.cpp's own converter and quantized to q4_k_m. Qwen2.5-Coder uses the well-supported Qwen2 architecture, so the conversion is uneventful.
# merge adapter → 16-bit safetensors (Unsloth)
# then convert with llama.cpp directly:
python3 llama.cpp/convert_hf_to_gguf.py ./merged \
    --outfile adi-qwen2.5-coder-7b-kimi2.7code-coder-f16.gguf \
    --outtype f16

# quantize f16 → q4_k_m
./llama.cpp/build/bin/llama-quantize \
    adi-qwen2.5-coder-7b-kimi2.7code-coder-f16.gguf \
    adi-qwen2.5-coder-7b-kimi2.7code-coder-q4_k_m.gguf q4_k_m
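A rough size estimate shows why the f16 file is only an intermediate. The sketch below is back-of-the-envelope arithmetic under two assumptions that are not from this page: a total of about 7.6B parameters, and an average of about 4.85 bits per weight for q4_k_m (the real figure varies by model, because the format mixes quantization types across tensors). The helper name gguf_size_gb is hypothetical.

```python
def gguf_size_gb(n_params, bits_per_weight):
    """Approximate file size in GB, ignoring metadata and tokenizer data."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7.6e9        # assumed total parameter count
Q4_K_M_BPW = 4.85       # assumed average bits per weight, approximate

f16_gb = gguf_size_gb(N_PARAMS, 16)
q4_gb = gguf_size_gb(N_PARAMS, Q4_K_M_BPW)
print(f"f16: ~{f16_gb:.1f} GB, q4_k_m: ~{q4_gb:.1f} GB")
```

Under these assumptions the f16 weights alone come to about 15 GB, leaving almost no room for the KV cache on a 16 GB card, while the q4_k_m file is under 5 GB.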
A Modelfile points Ollama at the quantized GGUF. After ollama create, the coder is live in the fleet and reachable through the standard OpenAI-compatible endpoint alongside every other ADI model.
# Modelfile
FROM ./adi-qwen2.5-coder-7b-kimi2.7code-coder-q4_k_m.gguf
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
SYSTEM """You are ADI, a precise coding assistant.
Write correct, complete, well-structured code and explain it briefly."""
ollama create adi-qwen2.5-coder-7b-kimi2.7code-coder -f Modelfile
ollama run adi-qwen2.5-coder-7b-kimi2.7code-coder "Write a Python LRU cache with a capacity limit."
The model then appears in ollama list as adi-qwen2.5-coder-7b-kimi2.7code-coder:latest.
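Calling the served model through the OpenAI-compatible endpoint might look like the sketch below. It assumes Ollama is listening on its default local port (the same localhost:11434 used for the teacher calls) and uses its /v1/chat/completions route; the helper names build_request and ask are hypothetical, and only the standard library is used so the example has no extra dependencies.

```python
import json
import urllib.request

# Assumed: Ollama on its default local port, OpenAI-compatible route
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"
MODEL = "adi-qwen2.5-coder-7b-kimi2.7code-coder"


def build_request(prompt, model=MODEL):
    """Build an OpenAI-style chat completion request for the local model."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.3,
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=600) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the server running, ask("Write a Python LRU cache with a capacity limit.") would return the model's answer as a string. Any client that speaks the OpenAI chat API can be pointed at the same URL.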