Metrics Guide

A step-by-step guide to understanding, querying, and acting on metrics in the Crawbl platform. Written for engineers who have never touched Prometheus, MetricsQL, or any monitoring tool before.


Chapter 1: Getting Started

What are metrics and why do they matter?

A metric is a number that describes something about a running system, measured over time.

Think of it like a car dashboard:

  • Speedometer -- how fast are you going right now?
  • Odometer -- how far have you driven total?
  • Fuel gauge -- how much gas is left?

Metrics are the same thing, but for software. They tell you things like how many database connections are open, how much memory Redis is using, or whether PostgreSQL is even running.

info

Without metrics, the only way to know something is wrong is when a user reports it. With metrics, you can see problems forming before they become outages.


What is VictoriaMetrics?

VictoriaMetrics is a database that stores numbers over time. That is literally it.

Every 30 seconds, it reaches out to each service in our cluster, asks "what are your current numbers?", and stores the response with a timestamp. You can then query those numbers to see trends, spot problems, or verify that a deploy went smoothly.

note

VictoriaMetrics is compatible with Prometheus (the industry standard), so any PromQL guide or Stack Overflow answer you find will work here too. It also supports MetricsQL, a superset that adds extra convenience functions.


How metrics get collected

The collection process has three steps:

  1. A service exposes a /metrics endpoint. This is a plain HTTP page that lists all the service's current numbers in a specific text format.
  2. VictoriaMetrics scrapes that endpoint every 30 seconds. It discovers which pods to scrape by looking for a Kubernetes annotation: prometheus.io/scrape: "true".
  3. The numbers are stored with a timestamp. You can then query them over any time range.
tip

You do not need to configure anything when a new service adds the annotation -- VictoriaMetrics finds it automatically.
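For reference, the "specific text format" from step 1 is the Prometheus exposition format: one line per value, with optional HELP and TYPE comments. The metric names and values below are illustrative:

```
# HELP redis_connected_clients Number of client connections
# TYPE redis_connected_clients gauge
redis_connected_clients 7
# HELP redis_commands_processed_total Total commands processed
# TYPE redis_commands_processed_total counter
redis_commands_processed_total 4823917
```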


What we currently scrape

| Service | How it exposes metrics |
| --- | --- |
| PostgreSQL | Bitnami metrics sidecar (translates internal DB stats) |
| Redis | Bitnami metrics sidecar (translates Redis INFO output) |
| cert-manager | Built-in Prometheus endpoint |
| Envoy Gateway | Built-in Prometheus endpoint |
| Cilium (networking) | Built-in Prometheus endpoint |

How to open the metrics UI

VictoriaMetrics ships with a built-in query interface called vmui.

Open it here: https://dev.metrics.crawbl.com/vmui

No login is required for the dev environment.

| UI Element | What it does |
| --- | --- |
| Query bar (top) | Type your query here and press Enter |
| Time range picker (right) | Defaults to the last hour |
| Graph tab | Plots results as a chart over time |
| Table tab | Shows raw numbers (use for instant queries) |
| JSON tab | Shows the raw API response |

Your first query

Check if monitoring is working at all.

up

What this does: returns 1 for every service that VictoriaMetrics can reach, and 0 for any service that is down. You should see results for PostgreSQL, Redis, cert-manager, and other scraped targets.

tip

If you are ever unsure whether monitoring is working, up is the first thing to check.


Chapter 2: Understanding Metrics

What a metric looks like

Every metric has four parts: a name, labels, a value, and a timestamp.

pg_up{namespace="backend", pod="backend-postgresql-0"} = 1

| Part | Example | Purpose |
| --- | --- | --- |
| Name | pg_up | What is being measured |
| Labels | {namespace="backend", pod="backend-postgresql-0"} | Which specific thing is being measured |
| Value | 1 | The actual number |
| Timestamp | (hidden) | When this measurement was taken |

Labels are filters

Labels let you narrow down results. Filter using these operators inside {}:

| Operator | Meaning | Example |
| --- | --- | --- |
| = | Exact match | {namespace="backend"} |
| != | Not equal | {state!="idle"} |
| =~ | Regex match | {pod=~"backend-redis.*"} |
| !~ | Regex exclude | {datname!~"template.*"} |

You can combine multiple filters -- they are ANDed together:

pg_stat_activity_count{namespace="backend", state="active"}

What this does: returns only active database connections in the backend namespace.


Counters vs gauges

There are two fundamental types of metrics. Understanding the difference saves you from confusion later.

| | Gauge | Counter |
| --- | --- | --- |
| Analogy | Speedometer | Odometer |
| Behavior | Goes up and down | Only goes up (resets on restart) |
| Represents | A current value | A cumulative total |
| How to read | Read directly | Use rate() or increase() |
| Name pattern | No _total suffix | Usually ends in _total |

Gauge examples:

  • redis_memory_used_bytes -- memory Redis is using right now
  • pg_stat_activity_count -- database connections right now
  • redis_connected_clients -- clients connected right now

Counter examples:

  • redis_commands_processed_total -- total commands ever processed
  • redis_keyspace_hits_total -- total cache hits since start
  • pg_stat_database_tup_inserted -- total rows inserted since start
info

A raw counter value is rarely useful on its own. "Redis has processed 4,823,917 commands" does not tell you much. What you want is "how many commands per second?" -- that is where rate() comes in.


Why rate() matters for counters

rate() takes a counter and calculates how fast it is increasing per second:

rate(redis_commands_processed_total[5m])

What this does: looks at the last 5 minutes of redis_commands_processed_total values and calculates the per-second rate of increase. The result might be 42.5, meaning Redis is handling about 42.5 commands per second.

warning

Only use rate() on counters. Applying rate() to a gauge produces meaningless results. If a metric name ends in _total, it is a counter -- use rate() or increase(). Most counters follow that naming convention, though a few (like the pg_stat_database_tup_* metrics above) omit the suffix; when in doubt, check whether the value only ever goes up.


Chapter 3: What We Monitor

PostgreSQL Metrics

These come from the Bitnami postgres-exporter sidecar. It connects to PostgreSQL and translates internal database statistics into scrapeable metrics.

tip

Type pg_ in the vmui query bar to see all available PostgreSQL metrics.

Metric reference

| Metric | Type | What it tells you |
| --- | --- | --- |
| pg_up | Gauge | Is the database alive? (1 = yes, 0 = no) |
| pg_database_size_bytes | Gauge | Size of a database in bytes |
| pg_stat_activity_count | Gauge | Number of connections, broken down by state |
| pg_settings_max_connections | Gauge | The max_connections config value |
| pg_stat_database_tup_inserted | Counter | Total rows inserted |
| pg_stat_database_tup_updated | Counter | Total rows updated |
| pg_stat_database_tup_deleted | Counter | Total rows deleted |
| pg_locks_count | Gauge | Active locks, broken down by mode |
| pg_database_connection_limit | Gauge | Per-database connection limit (-1 = unlimited) |

pg_up

Is the database alive?

pg_up

What this does: returns 1 if PostgreSQL is reachable, 0 if it is down or the exporter cannot connect.

tip

This is the first thing to check during any database incident.


pg_database_size_bytes

How big is our data?

pg_database_size_bytes{datname="crawbl"} / 1024 / 1024

What this does: shows the size of the crawbl database in megabytes. The datname="crawbl" filter excludes system databases like postgres and template1.

tip

Set the time range to "Last 7d" or "Last 30d" to see growth trends.


pg_stat_activity_count

How many connections are open?

pg_stat_activity_count

What this does: shows the number of database connections, broken down by state.

| State | Meaning |
| --- | --- |
| active | Currently running a query |
| idle | Open but doing nothing |
| idle in transaction | Inside a transaction but not executing (can hold locks) |
warning

If idle in transaction is high, investigate immediately -- these connections hold locks and block other queries.


pg_settings_max_connections

What is the connection limit?

pg_settings_max_connections

What this does: returns the max_connections value from PostgreSQL's configuration (default: 100). Compare against pg_stat_activity_count to see how close you are to the limit.


Row throughput (inserts / updates / deletes)

How much write activity is happening?

rate(pg_stat_database_tup_inserted{datname="crawbl"}[5m])

What this does: shows rows inserted per second over the last 5 minutes. Replace tup_inserted with tup_updated or tup_deleted to see other operations.

note

These are counters -- always use rate() to get meaningful per-second values.


pg_locks_count

Are queries waiting on locks?

pg_locks_count

What this does: shows the number of active locks, broken down by mode.

warning

A high count of ExclusiveLock or AccessExclusiveLock can indicate contention that slows down queries significantly.


pg_database_connection_limit

Per-database connection limit

pg_database_connection_limit{datname="crawbl"}

What this does: shows the connection limit configured for a specific database (as opposed to the server-wide max_connections). A value of -1 means no per-database limit.

caution

PostgreSQL does not perform well as it approaches max_connections. If connection counts exceed 80% of the limit, investigate connection pooling or idle connections before it becomes an outage.


Redis Metrics

These come from the Bitnami redis-exporter sidecar. It translates the output of Redis's INFO command into metrics.

tip

Type redis_ in the vmui query bar to see all available Redis metrics.

Metric reference

| Metric | Type | What it tells you |
| --- | --- | --- |
| redis_up | Gauge | Is Redis alive? (1 = yes, 0 = no) |
| redis_connected_clients | Gauge | Number of connected clients |
| redis_memory_used_bytes | Gauge | Current memory usage |
| redis_commands_processed_total | Counter | Total commands processed |
| redis_keyspace_hits_total | Counter | Total cache hits |
| redis_keyspace_misses_total | Counter | Total cache misses |
| redis_connected_slaves | Gauge | Number of connected replicas |
| redis_db_keys | Gauge | Number of keys stored |
| redis_evicted_keys_total | Counter | Total keys evicted due to memory pressure |

redis_up

Is Redis alive?

redis_up

What this does: returns 1 if Redis is reachable, 0 if it is down.


redis_connected_clients

How many clients are connected?

redis_connected_clients

What this does: shows the current number of connected clients. Compare against redis_config_maxclients to see capacity.


redis_memory_used_bytes

How much memory is Redis using?

redis_memory_used_bytes / 1024 / 1024

What this does: shows current memory usage in megabytes.

info

The Redis pod in dev has a 2Gi persistent volume for snapshots, but the working set lives in memory. Monitor this metric to make sure in-memory data is not growing unbounded. If it climbs toward the pod's memory limit, check that TTLs are set correctly.


redis_commands_processed_total

What is the throughput?

rate(redis_commands_processed_total[5m])

What this does: shows commands processed per second over the last 5 minutes. Spikes correspond to busy periods in the application.


redis_keyspace_hits_total / redis_keyspace_misses_total

Is the cache effective?

rate(redis_keyspace_hits_total[5m])

What this does: shows cache hits per second. A high hit rate means the cache is working well -- most lookups are served from Redis instead of hitting the database.


redis_connected_slaves

Replication status

redis_connected_slaves

What this does: shows the number of connected replicas.

note

In our dev environment this should be 0 (standalone mode, no replicas). A non-zero value means something unexpected is happening.


redis_db_keys

How many keys are stored?

redis_db_keys

What this does: shows the number of keys in each database. Useful for understanding how much data is cached.


redis_evicted_keys_total

Are keys being evicted?

rate(redis_evicted_keys_total[5m])

What this does: shows the rate of key evictions per second.

caution

If this is non-zero, Redis is running out of memory and deleting keys to make space. The cache is too small for the workload -- investigate immediately.


Other Metrics

VictoriaMetrics also scrapes cert-manager, Envoy Gateway, and Cilium. These are primarily useful for infrastructure debugging.

| Service | Key metric | What it tells you |
| --- | --- | --- |
| cert-manager | certmanager_certificate_ready_status | 1 = cert valid and ready, 0 = renewal failed |
| Envoy Gateway | envoy_cluster_upstream_cx_active | Active connections per Envoy cluster (backend service) |
| Cilium | cilium_* | Network-level metrics (flows, drops, policy verdicts) |
tip

VictoriaMetrics does not currently expose self-monitoring metrics (vm_*) in this deployment. To check VictoriaMetrics health directly:

curl https://dev.metrics.crawbl.com/health

Chapter 4: Real-World Scenarios

Each scenario gives you the exact query to copy-paste, explains what it does, and tells you what the answer means.


"Is everything running?"

up

What this does: every scraped service appears with a value of 1 (healthy) or 0 (down). This is your health check dashboard in a single query.

Reading the result
  • If all values are 1 -- everything is healthy
  • If any value is 0 -- that service needs immediate attention

"Is our database about to run out of connections?"

sum(pg_stat_activity_count) / scalar(pg_settings_max_connections) * 100

What this does: divides current connections by the maximum allowed and multiplies by 100 to get a percentage.

Reading the result
  • Below 50% -- comfortable, no action needed
  • Between 50-80% -- worth monitoring, consider optimizing idle connections
  • Above 80% -- danger territory, investigate connection pooling immediately

"How fast is our database growing?"

pg_database_size_bytes{datname="crawbl"} / 1024 / 1024

What this does: returns the database size in megabytes.

tip

Set the time range to "Last 7d" or "Last 30d" in vmui to see the growth trend on the graph.


"Is Redis being effective as a cache?"

rate(redis_keyspace_hits_total[5m]) / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) * 100

What this does: returns the cache hit rate as a percentage.

Reading the result
  • Above 90% -- healthy, cache is working well
  • Between 70-90% -- acceptable, but could improve
  • Below 70% -- most lookups hit the database instead of cache. Investigate whether keys expire too quickly or the cache is being evicted
note

If the result is empty or shows NaN, it means both redis_keyspace_hits_total and redis_keyspace_misses_total are zero -- no cache lookups have occurred yet. This is normal in a low-traffic dev environment. The query will produce results once the application starts using Redis for caching.


"How much memory is Redis using?"

redis_memory_used_bytes / 1024 / 1024

What this does: returns memory usage in megabytes.

Reading the result
  • If steady -- normal operation
  • If climbing steadily -- check whether TTLs are configured and eviction policies are appropriate
  • If near pod memory limit -- risk of OOM kill, investigate immediately

"Are there lock contention issues in the database?"

pg_locks_count

What this does: shows all active locks broken down by type.

Reading the result
  • Low numbers across all types -- normal
  • Sustained high ExclusiveLock -- queries are blocking each other

To investigate further, check for long-running transactions:

pg_stat_activity_max_tx_duration

What this does: shows the duration (in seconds) of the longest-running transaction.


"What is the database write throughput?"

rate(pg_stat_database_tup_inserted{datname="crawbl"}[5m])

What this does: shows rows inserted per second.

To see all write operations combined:

rate(pg_stat_database_tup_inserted{datname="crawbl"}[5m])
+ rate(pg_stat_database_tup_updated{datname="crawbl"}[5m])
+ rate(pg_stat_database_tup_deleted{datname="crawbl"}[5m])

What this does: sums inserts, updates, and deletes into a single writes-per-second value.


"Are our TLS certificates about to expire?"

(certmanager_certificate_expiration_timestamp_seconds - time()) / 86400

What this does: returns the number of days until each certificate expires.

Reading the result
  • Above 30 days -- healthy, cert-manager is renewing properly
  • Between 7-30 days -- monitor closely
  • Below 7 days -- investigate why cert-manager is not renewing

"Show me everything that is being monitored"

count by (__name__) ({__name__=~".+"})

What this does: lists every metric name and how many time series it has.

tip

Run this in Table view (not Graph) to browse what is available without overloading the UI.


Chapter 5: Advanced Queries

Once you are comfortable with the basics, these functions let you ask more sophisticated questions.


rate()

How fast is this counter growing per second?

rate(redis_commands_processed_total[5m])

What this does: takes a counter and returns the per-second rate of increase, averaged over the window. The [5m] window smooths out spikes.

tip
  • Use [5m] for stable, smoothed results (recommended default)
  • Use [1m] for more responsive but noisier results
warning

rate() only works on counters. Never apply it to a gauge -- the results will be meaningless. A name ending in _total is a reliable sign of a counter, though a few counters (such as pg_stat_database_tup_inserted) lack the suffix.


increase()

How much did this counter grow in the last hour?

increase(redis_commands_processed_total[1h])

What this does: returns the total increase over the window. Example result: "Redis processed 152,847 commands in the last hour."

tip

Use increase() when you want absolute numbers rather than per-second rates.


avg_over_time()

What is the average over a time period?

avg_over_time(pg_stat_activity_count{state="active"}[30m])

What this does: returns the average value of a gauge over the window. Example: "We averaged 3.2 active database connections over the last 30 minutes."


max_over_time()

What was the peak?

max_over_time(redis_memory_used_bytes[1h]) / 1024 / 1024

What this does: returns the highest value a gauge reached during the window. Useful for capacity planning.

tip

Combine with min_over_time() to see the full range of a metric over a period.


sum by (label)

Group and total

sum by (state) (pg_stat_activity_count)

What this does: aggregates multiple series into one, grouped by the specified label. This gives you total connections per state (active, idle, etc.) regardless of which pod they belong to.


topk(N, query)

Show me the top N

topk(5, rate(redis_commands_processed_total[5m]))

What this does: returns only the top N series by value. Useful when you have many targets and want to find the busiest ones.


Math operations

You can use standard arithmetic on metrics:

sum(pg_stat_activity_count) / scalar(pg_settings_max_connections) * 100

What this does: calculates connection usage as a percentage.

| Operator | Meaning |
| --- | --- |
| + | Add |
| - | Subtract |
| * | Multiply |
| / | Divide |
| % | Modulo |

Comparison operators for alert-style queries

sum(pg_stat_activity_count) / scalar(pg_settings_max_connections) * 100 > 80

What this does: returns results only when the value exceeds 80. If the result is empty, you are below the threshold.

tip

Think of it as "show me a problem only if there is one." Other comparison operators: <, >=, <=, ==, !=.


Chapter 6: Adding Metrics to the Orchestrator

When you are ready to add custom metrics to the Go orchestrator (or any new service), follow these steps.


Step 1: Add the Prometheus client library

go get github.com/prometheus/client_golang/prometheus
go get github.com/prometheus/client_golang/prometheus/promhttp

Step 2: Define and register your metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "crawbl_http_requests_total",
        Help: "Total number of HTTP requests handled.",
    }, []string{"method", "path", "status"})

    requestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "crawbl_http_request_duration_seconds",
        Help:    "HTTP request latency in seconds.",
        Buckets: prometheus.DefBuckets,
    }, []string{"method", "path"})
)
tip

promauto registers metrics automatically -- you do not need to call prometheus.MustRegister() separately.


Step 3: Expose the /metrics HTTP endpoint

import "github.com/prometheus/client_golang/prometheus/promhttp"

// In your router setup:
http.Handle("/metrics", promhttp.Handler())

Step 4: Instrument your handlers

func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    timer := prometheus.NewTimer(requestDuration.WithLabelValues(r.Method, r.URL.Path))
    defer timer.ObserveDuration()

    // ... handle the request, capturing the response code in statusCode ...

    // Requires the "strconv" import.
    requestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(statusCode)).Inc()
}

Step 5: Add pod annotations in the Helm values

podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "7171"
  prometheus.io/path: "/metrics"

Replace 7171 with whatever port your service listens on. VictoriaMetrics will discover the pod automatically on the next scrape cycle (within 30 seconds).


Step 6: Verify it is working

After deploying, run this query in vmui:

{__name__=~"crawbl_.*"}

What this does: finds all metrics with the crawbl_ prefix.

If you see your metrics, the scrape is working. If not, check that the pod has the correct annotations:

kubectl get pod <pod-name> -n backend -o jsonpath='{.metadata.annotations}'

Chapter 7: Quick Reference Card

Copy-paste these queries directly into vmui.

| I want to know... | Query |
| --- | --- |
| Is everything running? | up |
| Is Postgres alive? | pg_up |
| Is Redis alive? | redis_up |
| Database size in MB | pg_database_size_bytes{datname="crawbl"} / 1024 / 1024 |
| Active DB connections | pg_stat_activity_count{state="active"} |
| All DB connections | pg_stat_activity_count |
| Connection usage % | sum(pg_stat_activity_count) / scalar(pg_settings_max_connections) * 100 |
| Longest running transaction | pg_stat_activity_max_tx_duration |
| Database locks | pg_locks_count |
| DB write rate (rows/sec) | rate(pg_stat_database_tup_inserted{datname="crawbl"}[5m]) |
| Redis memory in MB | redis_memory_used_bytes / 1024 / 1024 |
| Redis commands/sec | rate(redis_commands_processed_total[5m]) |
| Redis cache hit rate % | rate(redis_keyspace_hits_total[5m]) / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) * 100 |
| Redis connected clients | redis_connected_clients |
| Redis keys stored | redis_db_keys |
| Redis evictions/sec | rate(redis_evicted_keys_total[5m]) |
| TLS cert days until expiry | (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 |
| All Postgres metrics | {__name__=~"pg_.*"} |
| All Redis metrics | {__name__=~"redis_.*"} |
| Every metric (discovery) | count by (__name__) ({__name__=~".+"}) |

Retention and Limits

| Setting | Value |
| --- | --- |
| Retention period | 14 days |
| Scrape interval | 30 seconds |
| Deduplication window | 30 seconds |
| Storage volume | 10 Gi |

Metrics older than 14 days are automatically deleted. There is no archive.

warning

If you need to preserve data for a specific investigation, export it before it ages out:

curl -s 'https://dev.metrics.crawbl.com/api/v1/export?match={__name__=~"pg_.*"}' > pg_metrics_export.jsonl