Charging Back Token Usage
Introduction
This guide defines a custom cost model for AI Gateway token usage by following the platform's custom cost model workflow. It mirrors the structure of the official vGPU (Hami) cost model guide: add a collection configuration on the Cost Management agent cluster, add a display/storage configuration on the Cost Management server cluster, and then add a price to the cost model from the platform console.
Cost Management consumes the OpenTelemetry GenAI token metric collected by Metering Token Usage, keyed on the user_namespace label, and turns it into per-namespace bills. In a multi-cluster deployment, the Cost Management server and the Cost Management agent may live on different clusters; each step below states the cluster on which it must be performed.
Use Cases
- Bill each namespace or tenant for the AI Gateway tokens it consumed.
- Apply a per-model price so more expensive models cost more per token.
- Produce auditable chargeback reports from the same metric used for dashboards.
Prerequisites
- Metering Token Usage is configured: the PodMonitor is collecting the token metric, and the controller maps x-user-namespace:user_namespace so the series carries the user_namespace label. Confirm the metric is queryable from the agent cluster's platform Prometheus (or Thanos).
- Cost Management is installed: cost-server and cost-api on the server cluster, cost-agent on every cluster whose AI Gateway traffic should be billed. See Cost Management installation.
- kubectl access to both clusters with permission to write ConfigMaps in cpaas-system (agent cluster) and kube-public (server cluster).
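One way to run the queryability check from the first prerequisite is through the Prometheus HTTP API. The sketch below only builds the instant-query URL and reads the user_namespace values out of a response. The base URL is a hypothetical placeholder, and the sum by (user_namespace) aggregation is an assumption chosen for the check, not a query taken from the platform.

```python
from urllib.parse import urlencode

# Assumption: base URL of the agent cluster's Prometheus or Thanos query endpoint.
PROM_URL = "http://prometheus.example.internal:9090"

# The metric keeps its dotted OpenTelemetry name, so it is selected via __name__.
QUERY = 'sum by (user_namespace) ({__name__="gen_ai.client.token.usage_token_sum"})'


def instant_query_url(base_url: str, query: str) -> str:
    """Return the /api/v1/query URL for an instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def namespaces_in(response: dict) -> set[str]:
    """Extract the user_namespace label values from a Prometheus API JSON response."""
    return {
        series["metric"].get("user_namespace", "")
        for series in response.get("data", {}).get("result", [])
    }


print(instant_query_url(PROM_URL, QUERY))
```

Fetch the printed URL with any HTTP client that can reach the endpoint. A non-empty set from namespaces_in means the series carries the label that billing is keyed on.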
Steps
Add the collection configuration
Cluster: the Cost Management agent cluster — the cluster where the AI Gateway runs and cost-agent is installed. The cost-agent component is deployed as the slark-agent workload, which the restart command below targets.
Create a ConfigMap that tells the Cost Management agent which Prometheus query to evaluate and how to map its labels onto Cost Management dimensions. The agent discovers configurations carrying the cpaas.io/slark.collection.config: "true" label in cpaas-system.
Field reference:
- kind: the Cost Management collector kind, which also names the collector (keep it unique among collection configs). Pod is reserved for OpenCost-emitted pod metrics and Project for the platform's built-in CPU/Memory/Storage quota collectors. AI Gateway token usage uses a dedicated custom kind, AIGateway, matching the vGPU/NPU/pGPU custom-kind pattern; this is the value verified on the cluster to populate cost.usage with AIGatewayTokenUsage rows. After applying this configuration, confirm rows appear (see the cost.usage check in Troubleshooting) before moving on.
- category, item: identifiers used to link this collection configuration to its display/storage counterpart in the next step. Both values must match the corresponding fields in the display/storage configuration.
- period: the aggregation period. Use Hourly to bill by hour.
- usage.query: the PromQL query the agent evaluates every step. The platform stores the metric under its OpenTelemetry UTF-8 name with dots preserved, so select it with {__name__="gen_ai.client.token.usage_token_sum"} and reference the dotted gen_ai.request.model label with the quoted UTF-8 syntax. The header-mapped user_namespace label is the per-namespace billing key.
- usage.step: the query evaluation interval.
- usage.mappers: maps PromQL labels onto Cost Management's standard dimensions (name, namespace, cluster, project). Set cluster to an empty string so the agent fills in its own cluster identity automatically; setting it to a label name (such as cluster) only works if the source metric exposes that label, otherwise every row is dropped without a log entry.
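Because a mapper that points at a missing label drops rows silently, it can help to check the mappers against one sample series before applying the ConfigMap. The following is a local pre-flight sketch, not the agent's own logic. The name and project mapper values are placeholders; only the user_namespace key and the empty cluster value come from the field reference above.

```python
# Illustrative mapper values; only "namespace" and the empty "cluster" follow this guide.
MAPPERS = {
    "name": "gen_ai.request.model",  # placeholder choice
    "namespace": "user_namespace",   # the per-namespace billing key
    "cluster": "",                   # empty: the agent fills in its own cluster identity
    "project": "",                   # placeholder: left unmapped in this sketch
}


def unresolved_dimensions(mappers: dict[str, str], series_labels: dict[str, str]) -> list[str]:
    """Return the dimensions mapped to a label that the sample series does not carry."""
    return sorted(
        dimension
        for dimension, label in mappers.items()
        if label and label not in series_labels
    )


# Labels of one series as returned by the usage query (sample values).
sample = {"user_namespace": "team-a", "gen_ai.request.model": "gpt-4o"}

print(unresolved_dimensions(MAPPERS, sample))                            # []
print(unresolved_dimensions({**MAPPERS, "cluster": "cluster"}, sample))  # ['cluster']
```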
After applying the YAML, restart the slark-agent workload (the cost-agent component) to reload the configuration:
Add the display configuration
Cluster: the Cost Management server cluster — the cluster where cost-server is installed (typically the global control plane). The cost-server component is deployed as the slark-server workload, which the restart command below targets.
Create a ConfigMap that registers the billing item and its billing methods in the platform console. The server discovers configurations carrying the cpaas.io/slark.display.config: "true" label in kube-public.
Field reference:
- name: the billing item name shown in the cost model form. Must match the category value in the collection configuration.
- methods[].name: the billing method, listed under the billing item when adding a price.
- methods[].item: must match the item value in the collection configuration so the server can join the per-method price back to the usage rows.
- divisor: the unit conversion factor applied when displaying usage. Tokens are unitless, so set 1; for byte-sized items use 1073741824 to render as Gi-hours.
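The two matching rules (display name equals collection category, and a method item equals the collection item) are easy to get wrong by a typo. A minimal sketch of a check over the two sets of values follows. The dictionaries are hypothetical stand-ins holding only the linking fields, not the real ConfigMap schema, and the category value shown is a placeholder.

```python
# Hypothetical values; only the field names and the matching rules come from this guide.
collection = {"kind": "AIGateway", "category": "AITokens", "item": "AIGatewayTokenUsage"}
display = {
    "name": "AITokens",
    "methods": [{"name": "Token Usage", "item": "AIGatewayTokenUsage"}],
}


def link_errors(collection: dict, display: dict) -> list[str]:
    """List mismatches that would keep usage rows from joining a billing method."""
    errors = []
    if display.get("name") != collection.get("category"):
        errors.append("display name does not match collection category")
    method_items = {method.get("item") for method in display.get("methods", [])}
    if collection.get("item") not in method_items:
        errors.append("no billing method references the collection item")
    return errors


print(link_errors(collection, display))  # []
```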
The following table summarizes the billing methods registered by this configuration:
Do not edit the platform-installed slark-server-common-config (the default display configuration containing CPU, Memory and Storage). Adding a custom entry to it causes the server to fail validation at startup. Always add custom billing items as a separate ConfigMap labelled cpaas.io/slark.display.config: "true".
After applying the YAML, restart the slark-server workload (the cost-server component) to reload the configuration:
Add a price to the cost model
Cluster: any cluster — operated from the platform console served by cost-api.
In the platform console, navigate to Administrator → Metering and Billing → Cost Model, then create or edit a cost model. The newly registered AI tokens billing item is now selectable in the price form.
- Cost Model Name: any identifier, for example aml-cost-model.
- Linked Clusters: select the clusters whose AI Gateway traffic this model should price. An empty selection saves successfully but matches no usage data and produces no bills.
- Pricing rows:
  - Billing Item: AI tokens
  - Billing Method: Token Usage
  - Default Price: the per-token rate, in the platform's currency.
  - Price By Label (optional): per-model overrides, for example a higher rate for gen_ai.request.model="gpt-4o".
- Save.
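To make the relationship between Default Price and Price By Label concrete, the sketch below resolves a per-token rate for one usage row. It assumes an override applies when the row's gen_ai.request.model label matches and the default applies otherwise; how the platform resolves overrides internally is not stated here, and both rates are made-up numbers.

```python
from decimal import Decimal

DEFAULT_PRICE = Decimal("0.000002")               # hypothetical per-token rate
PRICE_BY_MODEL = {"gpt-4o": Decimal("0.000010")}  # hypothetical per-model override


def price_for(labels: dict[str, str]) -> Decimal:
    """Return the override for the row's model if one exists, else the default price."""
    model = labels.get("gen_ai.request.model")
    return PRICE_BY_MODEL.get(model, DEFAULT_PRICE)


print(price_for({"gen_ai.request.model": "gpt-4o"}))       # 0.000010
print(price_for({"gen_ai.request.model": "small-model"}))  # 0.000002
```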
Verification
The cost-server worker runs every five minutes. Drive a few authenticated requests through the AI Gateway with different x-user-namespace values, wait at least one worker cycle, and refresh the platform console.
- Cost Details (Administrator → Metering and Billing → Cost Details) shows per-namespace AI tokens line items. Filter by namespace or by date to drill in.
- Cost Statistics (Administrator → Metering and Billing → Cost Statistics) aggregates the same data by cluster, project, and time range.
For server-side verification before opening the UI, query ClickHouse on the server cluster:
Expect one row per (namespace, hour) combination that consumed tokens. The cost column stores the platform's micro-currency unit, so usage × default_price × 1_000_000 should match the value shown.
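The cross-check above can be worked through with exact decimal arithmetic. This is a sketch with made-up numbers: 12,500 tokens at a hypothetical default price of 0.000002 per token.

```python
from decimal import Decimal

MICRO = 1_000_000  # micro-currency units per currency unit, as described above


def expected_cost_micro(usage_tokens: int, price_per_token: Decimal) -> int:
    """Compute usage x price x 1_000_000, the value expected in the cost column."""
    return int(Decimal(usage_tokens) * price_per_token * MICRO)


print(expected_cost_micro(12_500, Decimal("0.000002")))  # 25000
```

Decimal is used instead of float so that small per-token rates do not pick up binary rounding error before the comparison.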
Troubleshooting
Learn More
Next Steps
After bills are generated, set per-model price overrides under Price By Label to reflect the relative cost of each model, and schedule periodic export of cost.bills for finance reporting.