ClickHouse
ClickHouse is the analytics database that stores detailed LLM token usage events. It runs as a single-node StatefulSet in the backend namespace alongside the orchestrator and PostgreSQL.
Connection Details
| Property | Value |
|---|---|
| Service name | backend-clickhouse |
| Namespace | backend |
| HTTP port | 8123 |
| Native port | 9000 |
| Database | default |
| Username | default |
| Image | clickhouse/clickhouse-server:24.8-alpine |
| Storage | 10 Gi persistent volume |
| CPU | 100m request / no limit |
| Memory | 256 Mi request / 1 Gi limit |
Getting the password
The password is stored in AWS Secrets Manager at crawbl/dev/backend/clickhouse and synced to the Kubernetes secret backend-clickhouse-auth by External Secrets:
kubectl get secret backend-clickhouse-auth -n backend \
-o jsonpath='{.data.password}' | base64 -d
Connecting to ClickHouse
Port-forward (recommended for development)
# Forward the native port (for clickhouse-client)
kubectl port-forward svc/backend-clickhouse 9000:9000 -n backend
# Forward the HTTP port (for curl / DBeaver / DataGrip)
kubectl port-forward svc/backend-clickhouse 8123:8123 -n backend
Using clickhouse-client (CLI)
Install the client:
# macOS
brew install clickhouse
# Or use the official Docker image
docker run -it --rm --network host clickhouse/clickhouse-client
Connect (after port-forwarding):
clickhouse-client \
--host localhost \
--port 9000 \
--user default \
--password "$(kubectl get secret backend-clickhouse-auth -n backend -o jsonpath='{.data.password}' | base64 -d)"
Using the HTTP interface (curl)
CH_PASS=$(kubectl get secret backend-clickhouse-auth -n backend \
-o jsonpath='{.data.password}' | base64 -d)
curl "http://localhost:8123/?user=default&password=${CH_PASS}" \
--data "SELECT count() FROM llm_usage"
Using a GUI tool (DBeaver / DataGrip)
| Setting | Value |
|---|---|
| Host | localhost |
| Port | 8123 (HTTP) or 9000 (native) |
| Database | default |
| User | default |
| Password | (from the secret above) |
| Driver | ClickHouse |
DBeaver has a built-in ClickHouse driver. In DataGrip, select "ClickHouse" as the data source type.
Tables
llm_usage (raw events)
Every individual LLM call is recorded here. This is the detailed event log.
| Column | Type | Description |
|---|---|---|
event_id | UUID | Unique event identifier (auto-generated) |
event_time | DateTime64(3) | When the LLM call happened (UTC, millisecond precision) |
user_id | UUID | The platform user who triggered the call |
workspace_id | UUID | The workspace the call was made from |
conversation_id | UUID | The conversation context |
message_id | String | The message that triggered the call |
agent_id | String | Agent slug (e.g., "wally", "manager") |
agent_db_id | UUID | Agent's database ID |
model | String | LLM model used (e.g., "gpt-4o", "claude-3-sonnet") |
provider | String | LLM provider (e.g., "openai", "anthropic") |
prompt_tokens | Int32 | Input tokens consumed |
completion_tokens | Int32 | Output tokens generated |
total_tokens | Int32 | Sum of prompt + completion |
tool_use_prompt_tokens | Int32 | Tokens used in tool call prompts |
thoughts_tokens | Int32 | Tokens used for extended thinking |
cached_tokens | Int32 | Tokens served from provider cache |
cost_usd | Decimal64(6) | Computed cost in USD |
call_sequence | Int32 | Order within a single turn (0-indexed) |
turn_id | String | Turn identifier |
session_id | String | Session/conversation ID |
Engine: MergeTree, partitioned by toYYYYMM(event_time), ordered by (user_id, workspace_id, event_time, event_id).
TTL: 13 months -- events older than 13 months are automatically deleted.
llm_usage_daily (aggregated rollups)
Daily summaries auto-populated by a materialized view. Use this for dashboard queries instead of scanning raw events.
| Column | Type | Description |
|---|---|---|
day | Date | The calendar day |
user_id | UUID | The user |
workspace_id | UUID | The workspace |
model | String | The LLM model |
agent_id | String | The agent slug |
total_prompt_tokens | Int64 | Sum of input tokens for the day |
total_completion_tokens | Int64 | Sum of output tokens for the day |
total_tokens | Int64 | Sum of all tokens for the day |
total_cost_usd | Decimal128(6) | Sum of costs for the day |
request_count | Int64 | Number of LLM calls for the day |
Engine: SummingMergeTree, partitioned by toYYYYMM(day), ordered by (user_id, workspace_id, day, model, agent_id).
TTL: 25 months.
The SummingMergeTree engine automatically merges rows with the same key by summing the numeric columns. This means you should always use sum() in queries to get accurate results, as un-merged parts may still contain duplicate partial rows.
Supporting Services
pricing_refresh River job
Model pricing is refreshed daily by the pricing_refresh River worker running inside the orchestrator process — no standalone pod or CronJob. The worker uses modelpricingrepo to insert new rows into the Postgres model_pricing table, fetching current per-token prices from LiteLLM.
| Property | Value |
|---|---|
| Queue | pricing_refresh |
| Schedule | Daily periodic (River periodic job) |
| Runs in | Orchestrator pod |
| Writes to | Postgres orchestrator.model_pricing |
Check recent runs via riverui or Postgres:
kubectl port-forward svc/backend-postgresql 5432:5432 -n backend
psql -h localhost -U crawbl -d crawbl \
-c "SELECT id, state, attempt, finalized_at FROM orchestrator.river_job WHERE kind = 'pricing_refresh' ORDER BY finalized_at DESC LIMIT 10;"
Example Queries
Total usage for a user this month
SELECT
sum(total_tokens) AS tokens,
sum(cost_usd) AS cost_usd,
count() AS requests
FROM llm_usage
WHERE user_id = '<user-uuid>'
AND event_time >= toStartOfMonth(now());
Usage breakdown by model
SELECT
model,
sum(prompt_tokens) AS input_tokens,
sum(completion_tokens) AS output_tokens,
sum(total_tokens) AS total_tokens,
sum(cost_usd) AS cost_usd,
count() AS requests
FROM llm_usage
WHERE workspace_id = '<workspace-uuid>'
AND event_time >= toStartOfMonth(now())
GROUP BY model
ORDER BY cost_usd DESC;
Daily usage trend (last 30 days)
SELECT
day,
sum(total_tokens) AS tokens,
sum(total_cost_usd) AS cost_usd,
sum(request_count) AS requests
FROM llm_usage_daily
WHERE user_id = '<user-uuid>'
AND day >= today() - 30
GROUP BY day
ORDER BY day;
Most active agents by cost
SELECT
agent_id,
sum(total_tokens) AS tokens,
sum(cost_usd) AS cost_usd,
count() AS requests
FROM llm_usage
WHERE workspace_id = '<workspace-uuid>'
AND event_time >= toStartOfMonth(now())
GROUP BY agent_id
ORDER BY cost_usd DESC;
Hourly request rate (spot traffic spikes)
SELECT
toStartOfHour(event_time) AS hour,
count() AS requests,
sum(total_tokens) AS tokens
FROM llm_usage
WHERE event_time >= now() - INTERVAL 24 HOUR
GROUP BY hour
ORDER BY hour;
Check if data is flowing (health check)
SELECT
count() AS events_last_hour,
max(event_time) AS latest_event
FROM llm_usage
WHERE event_time >= now() - INTERVAL 1 HOUR;
If events_last_hour is 0 and users are actively chatting, check the orchestrator logs for River usage_write job errors and verify ClickHouse connectivity from the orchestrator pod.
Secrets Reference
| Secret | Location (AWS SM) | K8s Secret | Used by |
|---|---|---|---|
| ClickHouse password | crawbl/dev/backend/clickhouse | backend-clickhouse-auth | StatefulSet, orchestrator (usagewriter worker) |
| Postgres password | crawbl/dev/backend/postgresql | backend-postgresql-auth | Orchestrator (pricing_refresh River job, migrations) |
Troubleshooting
"No data in ClickHouse"
- Check if the orchestrator is running and healthy:
kubectl get pods -n backend -l app.kubernetes.io/name=orchestrator - Check orchestrator logs for usagewriter errors:
kubectl logs deployment/orchestrator -n backend | grep -i usage - Verify River job queue has no stuck
usage_writejobs: queryorchestrator.river_jobwherekind = 'usage_write'andstate IN ('retryable', 'discarded') - Verify ClickHouse is healthy:
kubectl exec -n backend backend-clickhouse-0 -- clickhouse-client -q "SELECT 1"
"Costs are showing as 0"
- Check if
model_pricinghas entries: query Postgres forSELECT count(*) FROM orchestrator.model_pricing - Check if the
pricing_refreshRiver job ran successfully: queryorchestrator.river_jobwherekind = 'pricing_refresh'and checkstateandfinalized_at - Verify the pricing cache loaded: check orchestrator startup logs for "pricing cache refreshed"
"Usage counters seem wrong"
- Compare Postgres counters with ClickHouse raw events -- ClickHouse is the source of truth for detailed data
- Check if the agent runtime is actually emitting UsageEvents: look for "llm usage" in orchestrator logs
- Remember: Postgres counters are incremented in real-time, ClickHouse data may be delayed by up to the batch flush interval (500ms)