Use Coding Agents with On-Premise Inference Services
TOC
IntroductionPrerequisitesHow the pieces fit togetherStep 1: Deploy and smoke-test the endpointStep 2: Enable tool calling on the runtimeStep 2b (optional): Configure reasoning models and reasoning effortServer-side flagsConfiguring reasoning effort and thinking behaviorServer-side defaultsRequest-time controlsStep 3: Connect your coding agentopencodeCodex CLIClaude CodeOption 1: LiteLLM proxyOption 2: claude-code-routerNotes for on-premise operationBest practicesRecommended model families for coding agentsChoose a model that fits your hardwareQuantized models from Unsloth on HuggingFaceHardware fit guideTune inference service performanceGetting started with vibe codingGetting started with MLOpsTroubleshootingReferencesIntroduction
Coding agents such as opencode, Codex CLI, and Claude Code are terminal-based assistants that read your repository, plan changes, edit files, and run commands on your behalf. They normally talk to a hosted model provider over the internet.
This document shows how to point those agents at a model you serve yourself on Alauda AI, so that your source code, prompts, and infrastructure configuration never leave your cluster. The same on-premise InferenceService that you deploy for any other workload can back an interactive coding agent, as long as it exposes an OpenAI-compatible API and has tool (function) calling enabled. opencode and Codex CLI can call that endpoint directly; Claude Code speaks the Anthropic Messages API (/v1/messages) and needs a lightweight translation proxy (see Claude Code).
This page builds directly on the deployment how-tos. It does not repeat how to create or expose an InferenceService; instead it links to them and focuses on the agent-specific configuration and tuning.
Coding agents and their configuration formats evolve quickly. The config snippets below are correct starting points for the versions available at the time of writing. Always confirm field names against the current upstream documentation of the agent you use.
Prerequisites
- A running, ready
InferenceServicethat serves an OpenAI-compatible API. See Create Inference Service using CLI. - Network access from the machine running the agent to the service endpoint. For access from a developer laptop outside the cluster, see Configure External Access for Inference Services.
- A model with tool/function calling support, served with the matching vLLM parser enabled (see Enable tool calling on the runtime). Without this, agents can chat but cannot edit files or run commands.
- The agent CLI installed locally (
opencode,codex, orclaude). - For Claude Code, a translation proxy (LiteLLM or claude-code-router) to bridge Claude Code's Anthropic Messages API to the OpenAI-compatible endpoint (see Claude Code).
How the pieces fit together
- opencode and Codex CLI speak the OpenAI Chat Completions API natively, so they can call the
InferenceServiceendpoint directly. - Claude Code speaks the Anthropic Messages API, which vLLM does not serve. It requires a small translation proxy in front of the OpenAI-compatible endpoint (see Claude Code).
Step 1: Deploy and smoke-test the endpoint
Deploy your model as an InferenceService following Create Inference Service using CLI, and if the agent runs outside the cluster, expose it following Configure External Access for Inference Services.
Before wiring up any agent, confirm the endpoint answers a chat request. Coding agents fail in confusing ways if the base URL, model name, or auth is wrong, so validate with curl first:
A normal JSON completion confirms the endpoint is reachable and the model name is correct. Note the three values you will reuse for every agent: base URL (ending in /v1), model name (the --served-model-name), and API key.
For reasoning models (DeepSeek R1, QwQ, Qwen3, etc.), also add the matching --reasoning-parser to the vLLM launch flags. See Configure reasoning models and reasoning effort.
Step 2: Enable tool calling on the runtime
Coding agents work by calling tools (read file, write file, run shell). This requires the model to emit tool calls and vLLM to parse them. Add the following flags to the vLLM launch command in your InferenceService (in the sample from Create Inference Service using CLI, they go on the python3 -m vllm.entrypoints.openai.api_server line):
- The parser must match the model. For example, Qwen2.5 and QwQ-32B commonly use
hermes; Qwen3-Coder usesqwen3_xml; Llama 3.x models usellama3_json; Mistral models usemistral. Check the vLLM tool calling documentation for the current parser list and the value that matches your model. - Some models need a specific chat template to emit tool calls correctly; pass
--chat-templateif the model card calls for it. - If you serve a reasoning model, also enable the matching
--reasoning-parserso the agent receives clean assistant content separated from reasoning traces.
Verify tool calling end-to-end by asking the agent to perform a trivial file operation (for example, "create hello.txt containing the word hi"). If the model replies in prose instead of editing the file, tool calling is not wired up correctly — recheck the parser and model.
Step 2b (optional): Configure reasoning models and reasoning effort
Some models (for example, DeepSeek R1, QwQ, Hunyuan, or Cohere Command A Reasoning) emit chain-of-thought reasoning before their final answer. vLLM separates the reasoning traces from the assistant content so your agent receives clean output — but you must enable the matching flags.
Server-side flags
Add --reasoning-parser to your vLLM launch command. If the same model also needs agent tool calls, pair it with the appropriate --tool-call-parser:
The table below shows common model families and their required parsers. Confirm against the vLLM tool calling documentation for the current list.
For model families not listed above, check the model card for reasoning instructions and the vLLM tool calling documentation for the matching parser pair.
Configuring reasoning effort and thinking behavior
Reasoning effort controls how much the model "thinks" before answering. For coding agents you typically want low reasoning effort to keep interactive latency acceptable — many short, low-reasoning turns beat a single long, high-reasoning one.
Server-side defaults
vLLM does not expose a generic --reasoning-effort launch flag. Server-wide control is achieved through the model's chat template: you can supply a custom Jinja template that disables thinking by default, then pass it with --chat-template. Alternatively, some models and vLLM versions expose per-model template kwargs; check the vLLM release notes for the specific key.
Request-time controls
Do not assume every vLLM-backed InferenceService accepts reasoning_effort. Support depends on the vLLM version, OpenAI-compatible server implementation, model, and chat template. If the service rejects unknown request fields, reasoning_effort can fail even when the model itself supports reasoning.
Prefer model-specific controls that your deployed vLLM service documents. For example, Qwen3-style templates commonly use chat_template_kwargs to enable or disable thinking:
When using the OpenAI Python client, pass vLLM-specific request fields through extra_body:
For parsers that support an explicit thinking budget, you can also cap reasoning tokens per request:
When using a translation proxy (LiteLLM or claude-code-router), confirm the proxy version passes through these vLLM/OpenAI extension fields before relying on them.
Only use reasoning_effort after you verify that your exact vLLM image and model template accept it. On supported deployments, it can be sent as a top-level Chat Completions field such as "reasoning_effort": "low"; on unsupported deployments, use chat_template_kwargs, thinking_token_budget, or max_tokens instead.
Step 3: Connect your coding agent
opencode
opencode reads configuration from opencode.json in the project root or ~/.config/opencode/opencode.json. Define a custom OpenAI-compatible provider that points at your endpoint:
- The model key (
qwen-2) must match the--served-model-nameof theInferenceService. - Export the key the config references, then select the model:
export ONPREM_API_KEY=sk-localand chooseonprem/qwen-2with the/modelscommand inside opencode.
Codex CLI
Codex CLI reads ~/.codex/config.toml. Register your endpoint as a model provider and select it:
base_urlmust end at/v1;modelmust match the--served-model-name.env_keynames the environment variable that holds the API key:export ONPREM_API_KEY=sk-local.- Use
wire_api = "chat"for vLLM's OpenAI Chat Completions API.
Claude Code
Claude Code communicates over the Anthropic Messages API (/v1/messages), while your InferenceService exposes an OpenAI-compatible endpoint (/v1/chat/completions). Bridge the two by running a translation proxy in front of your endpoint. Two common options:
- LiteLLM proxy, which exposes an Anthropic-compatible
/v1/messagesendpoint and routes to any backend model. - claude-code-router, a proxy built specifically to point Claude Code at OpenAI-compatible and other backends.
Both approaches handle the API translation for you. Pick whichever fits your workflow — LiteLLM is more general-purpose, while claude-code-router is tailored to Claude Code's needs.
Option 1: LiteLLM proxy
Start the LiteLLM proxy, pointing it at your InferenceService endpoint:
This exposes http://localhost:4000/v1/messages (Anthropic format) and forwards requests to your OpenAI-compatible backend.
Then point Claude Code at the proxy:
Option 2: claude-code-router
Create a config file at ~/.claude-code-router/config.json with your InferenceService as a provider:
Then start Claude Code through the router:
The router automatically sets the required ANTHROPIC_BASE_URL and other environment variables — no manual export needed. The model is selected by the Router.default field in the config (format: provider_name,model_name). You can also activate the router in your shell first with eval "$(ccr activate)" and then run claude directly. Inside a running session, switch models with /model provider_name,model_name.
Notes for on-premise operation
- The
ANTHROPIC_AUTH_TOKEN/ANTHROPIC_API_KEYvalues (used with the LiteLLM option) must be non-empty but their content does not matter if your proxy and endpoint do not check them; gate access at the endpoint or proxy (see Manage gateways for adding auth via Envoy AI Gateway). - The
CLAUDE_CODE_DISABLE_*flags are what actually keep an "on-prem" setup on-prem: without them, Claude Code can still emit non-essential requests to Anthropic-hosted endpoints and ask the model for features (1M context, very large outputs) the on-prem model cannot honor. claude-code-router sets some of these automatically. ANTHROPIC_MODELmust match the model name yourInferenceServiceexposes (the--served-model-name).- Optionally set
ANTHROPIC_SMALL_FAST_MODELto an on-prem model so background/low-cost requests stay on-prem too.
Claude Code's agentic quality depends heavily on the served model's tool-calling fidelity — prefer a strong instruction- and tool-tuned model, and confirm tool calls round-trip end-to-end before relying on it.
Best practices
Recommended model families for coding agents
Qwen3.6 and Gemma 4 are the two model families we currently recommend for on-premise coding agents. Both have strong instruction tuning and a wide range of sizes and quantization formats available; verify tool-calling parser support against the vLLM version you run.
Choose a model that fits your hardware
Start from the GPU memory you have, then pick the largest capable model that leaves headroom for the KV cache. A rough weight-size estimate is parameters × bytes-per-parameter — FP16 ≈ 2 bytes, FP8/INT8 ≈ 1 byte, INT4 ≈ 0.5 bytes per parameter — on top of which the KV cache and runtime overhead consume more memory. Leave 15–25% headroom.
Quantized models from Unsloth on HuggingFace
Unsloth publishes GGUF-quantized versions of the latest models, optimized for fast loading with vLLM. The table below lists the most useful ones for coding agents:
Note: GGUF-quantized models load in vLLM via
--quantization gguf. For AWQ or GPTQ INT4 variants, check huggingface.co/models — search forqwen3.6 AWQorgemma-4 GPTQto find community-quantized versions. Unsloth's QAT (quantization-aware training) models typically retain higher quality at aggressive bit-widths than post-hoc quantization.
Hardware fit guide
Additional selection guidance:
- Prefer code-specialized, instruction-tuned models that natively support tool/function calling. If the model card does not mention tool calling, the agent will not be able to edit files reliably.
- Confirm a matching vLLM parser exists for the model (see Enable tool calling on the runtime) before committing to it. Qwen3-Coder models use
qwen3_xml; verify Qwen3.6 and Gemma 4 parser support in the vLLM docs for your version. - Budget for context length. Coding agents send large prompts (system prompt + file and repo context). Pick a model whose context window covers your largest expected prompt, and remember that a longer
--max-model-lenconsumes more KV cache per request, reducing concurrency. - Quantization is a force multiplier on-premise. INT4 (AWQ/GPTQ) or GGUF quantization lets you fit a noticeably more capable model in the same VRAM, which usually matters more for agent quality than raw FP16 precision.
- MoE models are especially efficient. Qwen3.6-35B-A3B and Gemma 4-26B-A4B activate only 3–4B parameters per token while carrying a larger knowledge base, giving near-dense quality at a fraction of the VRAM cost.
Tune inference service performance
Coding-agent traffic has a distinctive shape: long, highly repetitive prompts (the same system prompt and repo context resent every turn), bursts of short interactive requests, and sensitivity to first-token latency. Tune for it:
- Enable prefix caching (
--enable-prefix-caching). This is the single highest-impact flag for coding agents: the shared prompt prefix is reused across turns instead of being recomputed, cutting prefill cost and latency dramatically. See Automatic Prefix Caching — vLLM. - Raise
--gpu-memory-utilizationtoward0.90–0.95to enlarge the KV cache, which increases concurrency and the context length you can sustain. - Right-size
--max-model-len. Set it to the largest context the agent actually needs, not the model's theoretical maximum — every extra token of capacity costs KV-cache memory. - Enable chunked prefill (
--enable-chunked-prefill) when long prompts cause latency spikes under concurrency, so decode steps are not starved by a large prefill. Note the CLI sample disables it by default. - Allow CUDA graphs for steady-state latency: the CLI sample sets
ENFORCE_EAGER=True(eager mode, which starts faster but runs slower). Once the service is stable, switch to non-eager to capture CUDA graphs, at the cost of longer startup. - Tune batching with
--max-num-seqsand--max-num-batched-tokensto balance throughput against per-request latency for your concurrency level. - Use FP8 KV cache (
--kv-cache-dtype fp8) to stretch context length and concurrency when memory is tight. - Shard large models across GPUs with
--tensor-parallel-sizewhen a model does not fit on one card. - Consider speculative decoding for lower interactive latency on agent loops — see Speculative Decoding for vLLM Inference Services.
- Mind autoscaling and cold starts. For interactive single-user agent use, keep
minReplicas: 1— scaling from zero adds a multi-minute cold start that is painful mid-task. For bursty multi-developer usage, configure autoscaling deliberately; see Configure Scaling for Inference Services and Set Up Autoscaling for Inference Services with KEDA. - Allow long requests. Agent turns can be long-running; size the Knative
serving.knative.dev/progress-deadlineannotation and your client timeouts accordingly. If requests are cut off, see Inference timeout troubleshooting.
Getting started with vibe coding
"Vibe coding" — iterating quickly by describing intent and letting the agent write the code — works well with a self-hosted model once the basics are right:
- Start with a Qwen3.6 or Gemma 4 model that fits comfortably on your GPU with headroom; a responsive smaller model beats a sluggish larger one for interactive flow. For 24 GB GPUs,
Qwen3.6-35B-A3B(MoE) is an excellent starting point. - Set a low temperature (around
0–0.2) for code generation to keep edits deterministic and reduce flailing. - Validate tool calling with one trivial task ("create a file and run it") before attempting anything real.
- Keep prompts focused — open or reference only the relevant files so the agent's context stays on-topic and prefill stays cheap.
- Work in small, reviewable steps and read each diff before accepting it. Commit often so you can roll back a bad suggestion cleanly.
Getting started with MLOps
Because the model runs inside your cluster, a coding agent backed by an on-premise InferenceService is a good fit for operating the platform itself — your manifests, configs, and proprietary code never leave the environment, which matters in regulated settings. Productive starting tasks:
- Generate or modify
InferenceServiceYAML — for example, "write anInferenceServicefor model X targeting a 24 GB GPU with prefix caching and tool calling enabled." - Add autoscaling, scheduling, or resource configuration — KEDA/KPA autoscaling, CUDA-version-aware scheduling, or Kueue/Volcano queueing.
- Author and adjust pipelines and monitoring for your model lifecycle.
- Close the loop: deploy a model with the agent, then use that same on-premise model to drive further platform operations.
For detailed MLOps workflows — managing InferenceServices, configuring gateways, tuning performance iteratively, and planning fine-tuning runs — see Run MLOps with Coding Agents and On-Premise LLMs.
Troubleshooting
- Agent chats but never edits files or runs commands. Tool calling is not enabled or the parser does not match the model — see Enable tool calling on the runtime.
model not found/ 404. The model name in the agent config does not match the--served-model-name, or the base URL does not end in/v1.- 401 / 403. The agent is sending the wrong (or no) API key for what the endpoint or gateway expects.
- Requests time out on long tasks. Increase the Knative
progress-deadlineannotation and the client timeout — see Inference timeout troubleshooting. - First request after idle is very slow. The service scaled to zero and is cold-starting; set
minReplicas: 1for interactive use.
References
- Run MLOps with Coding Agents and On-Premise LLMs
- Create Inference Service using CLI
- Configure External Access for Inference Services
- Configure Scaling for Inference Services
- Set Up Autoscaling for Inference Services with KEDA
- Speculative Decoding for vLLM Inference Services
- Extend Inference Runtimes
- Tool Calling — vLLM
- Reasoning Outputs — vLLM
- Automatic Prefix Caching — vLLM
- opencode documentation
- Codex CLI
- Claude Code documentation
- LiteLLM
- claude-code-router