Перейти к основному содержимому

ClickHouse

ClickHouse is the analytics database that stores detailed LLM token usage events. It runs as a single-node StatefulSet in the backend namespace alongside the orchestrator and PostgreSQL.


Connection Details

PropertyValue
Service namebackend-clickhouse
Namespacebackend
HTTP port8123
Native port9000
Databasedefault
Usernamedefault
Imageclickhouse/clickhouse-server:24.8-alpine
Storage10 Gi persistent volume
CPU100m request / no limit
Memory256 Mi request / 1 Gi limit

Getting the password

The password is stored in AWS Secrets Manager at crawbl/dev/backend/clickhouse and synced to the Kubernetes secret backend-clickhouse-auth by External Secrets:

kubectl get secret backend-clickhouse-auth -n backend \
-o jsonpath='{.data.password}' | base64 -d

Connecting to ClickHouse

# Forward the native port (for clickhouse-client)
kubectl port-forward svc/backend-clickhouse 9000:9000 -n backend

# Forward the HTTP port (for curl / DBeaver / DataGrip)
kubectl port-forward svc/backend-clickhouse 8123:8123 -n backend

Using clickhouse-client (CLI)

Install the client:

# macOS
brew install clickhouse

# Or use the official Docker image
docker run -it --rm --network host clickhouse/clickhouse-client

Connect (after port-forwarding):

clickhouse-client \
--host localhost \
--port 9000 \
--user default \
--password "$(kubectl get secret backend-clickhouse-auth -n backend -o jsonpath='{.data.password}' | base64 -d)"

Using the HTTP interface (curl)

CH_PASS=$(kubectl get secret backend-clickhouse-auth -n backend \
-o jsonpath='{.data.password}' | base64 -d)

curl "http://localhost:8123/?user=default&password=${CH_PASS}" \
--data "SELECT count() FROM llm_usage"

Using a GUI tool (DBeaver / DataGrip)

SettingValue
Hostlocalhost
Port8123 (HTTP) or 9000 (native)
Databasedefault
Userdefault
Password(from the secret above)
DriverClickHouse
подсказка

DBeaver has a built-in ClickHouse driver. In DataGrip, select "ClickHouse" as the data source type.


Tables

llm_usage (raw events)

Every individual LLM call is recorded here. This is the detailed event log.

ColumnTypeDescription
event_idUUIDUnique event identifier (auto-generated)
event_timeDateTime64(3)When the LLM call happened (UTC, millisecond precision)
user_idUUIDThe platform user who triggered the call
workspace_idUUIDThe workspace the call was made from
conversation_idUUIDThe conversation context
message_idStringThe message that triggered the call
agent_idStringAgent slug (e.g., "wally", "manager")
agent_db_idUUIDAgent's database ID
modelStringLLM model used (e.g., "gpt-4o", "claude-3-sonnet")
providerStringLLM provider (e.g., "openai", "anthropic")
prompt_tokensInt32Input tokens consumed
completion_tokensInt32Output tokens generated
total_tokensInt32Sum of prompt + completion
tool_use_prompt_tokensInt32Tokens used in tool call prompts
thoughts_tokensInt32Tokens used for extended thinking
cached_tokensInt32Tokens served from provider cache
cost_usdDecimal64(6)Computed cost in USD
call_sequenceInt32Order within a single turn (0-indexed)
turn_idStringTurn identifier
session_idStringSession/conversation ID

Engine: MergeTree, partitioned by toYYYYMM(event_time), ordered by (user_id, workspace_id, event_time, event_id).

TTL: 13 months -- events older than 13 months are automatically deleted.


llm_usage_daily (aggregated rollups)

Daily summaries auto-populated by a materialized view. Use this for dashboard queries instead of scanning raw events.

ColumnTypeDescription
dayDateThe calendar day
user_idUUIDThe user
workspace_idUUIDThe workspace
modelStringThe LLM model
agent_idStringThe agent slug
total_prompt_tokensInt64Sum of input tokens for the day
total_completion_tokensInt64Sum of output tokens for the day
total_tokensInt64Sum of all tokens for the day
total_cost_usdDecimal128(6)Sum of costs for the day
request_countInt64Number of LLM calls for the day

Engine: SummingMergeTree, partitioned by toYYYYMM(day), ordered by (user_id, workspace_id, day, model, agent_id).

TTL: 25 months.

к сведению

The SummingMergeTree engine automatically merges rows with the same key by summing the numeric columns. This means you should always use sum() in queries to get accurate results, as un-merged parts may still contain duplicate partial rows.


Supporting Services

pricing_refresh River job

Model pricing is refreshed daily by the pricing_refresh River worker running inside the orchestrator process — no standalone pod or CronJob. The worker uses modelpricingrepo to insert new rows into the Postgres model_pricing table, fetching current per-token prices from LiteLLM.

PropertyValue
Queuepricing_refresh
ScheduleDaily periodic (River periodic job)
Runs inOrchestrator pod
Writes toPostgres orchestrator.model_pricing

Check recent runs via riverui or Postgres:

kubectl port-forward svc/backend-postgresql 5432:5432 -n backend
psql -h localhost -U crawbl -d crawbl \
-c "SELECT id, state, attempt, finalized_at FROM orchestrator.river_job WHERE kind = 'pricing_refresh' ORDER BY finalized_at DESC LIMIT 10;"

Example Queries

Total usage for a user this month

SELECT
sum(total_tokens) AS tokens,
sum(cost_usd) AS cost_usd,
count() AS requests
FROM llm_usage
WHERE user_id = '<user-uuid>'
AND event_time >= toStartOfMonth(now());

Usage breakdown by model

SELECT
model,
sum(prompt_tokens) AS input_tokens,
sum(completion_tokens) AS output_tokens,
sum(total_tokens) AS total_tokens,
sum(cost_usd) AS cost_usd,
count() AS requests
FROM llm_usage
WHERE workspace_id = '<workspace-uuid>'
AND event_time >= toStartOfMonth(now())
GROUP BY model
ORDER BY cost_usd DESC;

Daily usage trend (last 30 days)

SELECT
day,
sum(total_tokens) AS tokens,
sum(total_cost_usd) AS cost_usd,
sum(request_count) AS requests
FROM llm_usage_daily
WHERE user_id = '<user-uuid>'
AND day >= today() - 30
GROUP BY day
ORDER BY day;

Most active agents by cost

SELECT
agent_id,
sum(total_tokens) AS tokens,
sum(cost_usd) AS cost_usd,
count() AS requests
FROM llm_usage
WHERE workspace_id = '<workspace-uuid>'
AND event_time >= toStartOfMonth(now())
GROUP BY agent_id
ORDER BY cost_usd DESC;

Hourly request rate (spot traffic spikes)

SELECT
toStartOfHour(event_time) AS hour,
count() AS requests,
sum(total_tokens) AS tokens
FROM llm_usage
WHERE event_time >= now() - INTERVAL 24 HOUR
GROUP BY hour
ORDER BY hour;

Check if data is flowing (health check)

SELECT
count() AS events_last_hour,
max(event_time) AS latest_event
FROM llm_usage
WHERE event_time >= now() - INTERVAL 1 HOUR;
подсказка

If events_last_hour is 0 and users are actively chatting, check the orchestrator logs for River usage_write job errors and verify ClickHouse connectivity from the orchestrator pod.


Secrets Reference

SecretLocation (AWS SM)K8s SecretUsed by
ClickHouse passwordcrawbl/dev/backend/clickhousebackend-clickhouse-authStatefulSet, orchestrator (usagewriter worker)
Postgres passwordcrawbl/dev/backend/postgresqlbackend-postgresql-authOrchestrator (pricing_refresh River job, migrations)

Troubleshooting

"No data in ClickHouse"

  1. Check if the orchestrator is running and healthy: kubectl get pods -n backend -l app.kubernetes.io/name=orchestrator
  2. Check orchestrator logs for usagewriter errors: kubectl logs deployment/orchestrator -n backend | grep -i usage
  3. Verify River job queue has no stuck usage_write jobs: query orchestrator.river_job where kind = 'usage_write' and state IN ('retryable', 'discarded')
  4. Verify ClickHouse is healthy: kubectl exec -n backend backend-clickhouse-0 -- clickhouse-client -q "SELECT 1"

"Costs are showing as 0"

  1. Check if model_pricing has entries: query Postgres for SELECT count(*) FROM orchestrator.model_pricing
  2. Check if the pricing_refresh River job ran successfully: query orchestrator.river_job where kind = 'pricing_refresh' and check state and finalized_at
  3. Verify the pricing cache loaded: check orchestrator startup logs for "pricing cache refreshed"

"Usage counters seem wrong"

  1. Compare Postgres counters with ClickHouse raw events -- ClickHouse is the source of truth for detailed data
  2. Check if the agent runtime is actually emitting UsageEvents: look for "llm usage" in orchestrator logs
  3. Remember: Postgres counters are incremented in real-time, ClickHouse data may be delayed by up to the batch flush interval (500ms)