Metrics Guide
A step-by-step guide to understanding, querying, and acting on metrics in the Crawbl platform. Written for engineers who have never touched Prometheus, MetricsQL, or any monitoring tool before.
Chapter 1: Getting Started
What are metrics and why do they matter?
A metric is a number that describes something about a running system, measured over time.
Think of it like a car dashboard:
- Speedometer -- how fast are you going right now?
- Odometer -- how far have you driven total?
- Fuel gauge -- how much gas is left?
Metrics are the same thing, but for software. They tell you things like how many database connections are open, how much memory Redis is using, or whether PostgreSQL is even running.
Without metrics, the only way to know something is wrong is when a user reports it. With metrics, you can see problems forming before they become outages.
What is VictoriaMetrics?
VictoriaMetrics is a database that stores numbers over time. That is literally it.
Every 30 seconds, it reaches out to each service in our cluster, asks "what are your current numbers?", and stores the response with a timestamp. You can then query those numbers to see trends, spot problems, or verify that a deploy went smoothly.
VictoriaMetrics is compatible with Prometheus (the industry standard), so any PromQL guide or Stack Overflow answer you find will work here too. It also supports MetricsQL, a superset that adds extra convenience functions.
How metrics get collected
The collection process has three steps:
1. A service exposes a /metrics endpoint. This is a plain HTTP page that lists all the service's current numbers in a specific text format.
2. VictoriaMetrics scrapes that endpoint every 30 seconds. It discovers which pods to scrape by looking for a Kubernetes annotation: prometheus.io/scrape: "true".
3. The numbers are stored with a timestamp. You can then query them over any time range.
You do not need to configure anything when a new service adds the annotation -- VictoriaMetrics finds it automatically.
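The text format in step 1 is simple enough to emit by hand. Here is a minimal Go sketch of what a /metrics page contains, using only the standard library (the metric names and label values are invented for illustration; real services should use the Prometheus client library, as shown in Chapter 6):

```go
package main

import "fmt"

// expositionPage builds the body a /metrics endpoint returns: one line
// per series in the form name{labels} value, with optional # HELP and
// # TYPE comment lines describing each metric.
func expositionPage() string {
	return "# HELP example_up Whether the example service is up.\n" +
		"# TYPE example_up gauge\n" +
		"example_up{service=\"demo\"} 1\n" +
		"# TYPE example_requests_total counter\n" +
		"example_requests_total{service=\"demo\"} 42\n"
}

func main() {
	// A real service would serve this text over HTTP, e.g.:
	//   http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
	//       fmt.Fprint(w, expositionPage())
	//   })
	fmt.Print(expositionPage())
}
```

This is exactly what VictoriaMetrics fetches and timestamps on every 30-second scrape cycle.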
What we currently scrape
| Service | How it exposes metrics |
|---|---|
| PostgreSQL | Bitnami metrics sidecar (translates internal DB stats) |
| Redis | Bitnami metrics sidecar (translates Redis INFO output) |
| cert-manager | Built-in Prometheus endpoint |
| Envoy Gateway | Built-in Prometheus endpoint |
| Cilium (networking) | Built-in Prometheus endpoint |
How to open the metrics UI
VictoriaMetrics ships with a built-in query interface called vmui.
Open it here: https://dev.metrics.crawbl.com/vmui
No login is required for the dev environment.
| UI Element | What it does |
|---|---|
| Query bar (top) | Type your query here and press Enter |
| Time range picker (right) | Defaults to the last hour |
| Graph tab | Plots results as a chart over time |
| Table tab | Shows raw numbers (use for instant queries) |
| JSON tab | Shows the raw API response |
Your first query
Check if monitoring is working at all.
up
What this does: returns 1 for every service that VictoriaMetrics can reach, and 0 for any service that is down. You should see results for PostgreSQL, Redis, cert-manager, and other scraped targets.
If you are ever unsure whether monitoring is working, up is the first thing to check.
Chapter 2: Understanding Metrics
What a metric looks like
Every metric has four parts: a name, labels, a value, and a timestamp.
pg_up{namespace="backend", pod="backend-postgresql-0"} = 1
| Part | Example | Purpose |
|---|---|---|
| Name | pg_up | What is being measured |
| Labels | {namespace="backend", pod="backend-postgresql-0"} | Which specific thing is being measured |
| Value | 1 | The actual number |
| Timestamp | (hidden) | When this measurement was taken |
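To make the anatomy concrete, here is a toy Go parser for a single sample line. Note that on the wire the value simply follows the labels after a space (no "=" sign; the "=" above is just for readability). This is a deliberate simplification -- it ignores escaping, timestamps, and samples without labels:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSample splits a simplified exposition line of the form
//   name{key="value",key2="value2"} 1
// into its name, labels, and numeric value.
func parseSample(line string) (string, map[string]string, float64) {
	open := strings.Index(line, "{")
	end := strings.Index(line, "}")
	name := line[:open]
	labels := map[string]string{}
	for _, pair := range strings.Split(line[open+1:end], ",") {
		kv := strings.SplitN(strings.TrimSpace(pair), "=", 2)
		labels[kv[0]] = strings.Trim(kv[1], `"`)
	}
	value, _ := strconv.ParseFloat(strings.TrimSpace(line[end+1:]), 64)
	return name, labels, value
}

func main() {
	name, labels, value := parseSample(`pg_up{namespace="backend", pod="backend-postgresql-0"} 1`)
	fmt.Println(name, labels["pod"], value)
}
```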
Labels are filters
Labels let you narrow down results. Filter using these operators inside {}:
| Operator | Meaning | Example |
|---|---|---|
| = | Exact match | {namespace="backend"} |
| != | Not equal | {state!="idle"} |
| =~ | Regex match | {pod=~"backend-redis.*"} |
| !~ | Regex exclude | {datname!~"template.*"} |
You can combine multiple filters -- they are ANDed together:
pg_stat_activity_count{namespace="backend", state="active"}
What this does: returns only active database connections in the backend namespace.
Counters vs gauges
There are two fundamental types of metrics. Understanding the difference saves you from confusion later.
| | Gauge | Counter |
|---|---|---|
| Analogy | Speedometer | Odometer |
| Behavior | Goes up and down | Only goes up (resets on restart) |
| Represents | A current value | A cumulative total |
| How to read | Read directly | Use rate() or increase() |
| Name pattern | No _total suffix | Usually ends in _total |
Gauge examples:
- redis_memory_used_bytes -- memory Redis is using right now
- pg_stat_activity_count -- database connections right now
- redis_connected_clients -- clients connected right now
Counter examples:
- redis_commands_processed_total -- total commands ever processed
- redis_keyspace_hits_total -- total cache hits since start
- pg_stat_database_tup_inserted -- total rows inserted since start
A raw counter value is rarely useful on its own. "Redis has processed 4,823,917 commands" does not tell you much. What you want is "how many commands per second?" -- that is where rate() comes in.
Why rate() matters for counters
rate() takes a counter and calculates how fast it is increasing per second:
rate(redis_commands_processed_total[5m])
What this does: looks at the last 5 minutes of redis_commands_processed_total values and calculates the per-second rate of increase. The result might be 42.5, meaning Redis is handling about 42.5 commands per second.
Only use rate() on counters. Applying rate() to a gauge produces meaningless results. If a metric name ends in _total, it is a counter -- use rate() or increase(). If it does not end in _total, it is probably a gauge you can read directly -- but the suffix is a convention, not a guarantee: some counters, such as pg_stat_database_tup_inserted, lack it.
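The arithmetic behind rate() can be sketched in Go. This is a simplification -- the real function uses every sample in the window and extrapolates at the edges -- but it shows the two essential ideas: divide the increase by elapsed seconds, and handle counter resets:

```go
package main

import "fmt"

// sample is one scraped (timestamp, value) pair for a counter.
type sample struct {
	ts    float64 // unix seconds
	value float64
}

// counterRate approximates rate(): the per-second increase between the
// first and last sample in a window. If the counter went down (the
// service restarted and the counter reset to 0), the naive delta would
// be negative, so we count from zero instead.
func counterRate(first, last sample) float64 {
	delta := last.value - first.value
	if delta < 0 {
		delta = last.value // counter reset: count from zero
	}
	return delta / (last.ts - first.ts)
}

func main() {
	// 12,750 commands over a 300-second (5m) window.
	fmt.Println(counterRate(sample{0, 100000}, sample{300, 112750}))
}
```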
Chapter 3: What We Monitor
PostgreSQL Metrics
These come from the Bitnami postgres-exporter sidecar. It connects to PostgreSQL and translates internal database statistics into scrapeable metrics.
Type pg_ in the vmui query bar to see all available PostgreSQL metrics.
Metric reference
| Metric | Type | What it tells you |
|---|---|---|
| pg_up | Gauge | Is the database alive? (1 = yes, 0 = no) |
| pg_database_size_bytes | Gauge | Size of a database in bytes |
| pg_stat_activity_count | Gauge | Number of connections, broken down by state |
| pg_settings_max_connections | Gauge | The max_connections config value |
| pg_stat_database_tup_inserted | Counter | Total rows inserted |
| pg_stat_database_tup_updated | Counter | Total rows updated |
| pg_stat_database_tup_deleted | Counter | Total rows deleted |
| pg_locks_count | Gauge | Active locks, broken down by mode |
| pg_database_connection_limit | Gauge | Per-database connection limit (-1 = unlimited) |
pg_up
Is the database alive?
pg_up
What this does: returns 1 if PostgreSQL is reachable, 0 if it is down or the exporter cannot connect.
This is the first thing to check during any database incident.
pg_database_size_bytes
How big is our data?
pg_database_size_bytes{datname="crawbl"} / 1024 / 1024
What this does: shows the size of the crawbl database in megabytes. The datname="crawbl" filter excludes system databases like postgres and template1.
Set the time range to "Last 7d" or "Last 30d" to see growth trends.
pg_stat_activity_count
How many connections are open?
pg_stat_activity_count
What this does: shows the number of database connections, broken down by state.
| State | Meaning |
|---|---|
| active | Currently running a query |
| idle | Open but doing nothing |
| idle in transaction | Inside a transaction but not executing (can hold locks) |
If idle in transaction is high, investigate immediately -- these connections hold locks and block other queries.
pg_settings_max_connections
What is the connection limit?
pg_settings_max_connections
What this does: returns the max_connections value from PostgreSQL's configuration (default: 100). Compare against pg_stat_activity_count to see how close you are to the limit.
Row throughput (inserts / updates / deletes)
How much write activity is happening?
rate(pg_stat_database_tup_inserted{datname="crawbl"}[5m])
What this does: shows rows inserted per second over the last 5 minutes. Replace tup_inserted with tup_updated or tup_deleted to see other operations.
These are counters -- always use rate() to get meaningful per-second values.
pg_locks_count
Are queries waiting on locks?
pg_locks_count
What this does: shows the number of active locks, broken down by mode.
A high count of ExclusiveLock or AccessExclusiveLock can indicate contention that slows down queries significantly.
pg_database_connection_limit
Per-database connection limit
pg_database_connection_limit{datname="crawbl"}
What this does: shows the connection limit configured for a specific database (as opposed to the server-wide max_connections). A value of -1 means no per-database limit.
PostgreSQL does not perform well as it approaches max_connections. If connection counts exceed 80% of the limit, investigate connection pooling or idle connections before it becomes an outage.
Redis Metrics
These come from the Bitnami redis-exporter sidecar. It translates the output of Redis's INFO command into metrics.
Type redis_ in the vmui query bar to see all available Redis metrics.
Metric reference
| Metric | Type | What it tells you |
|---|---|---|
redis_up | Gauge | Is Redis alive? (1 = yes, 0 = no) |
redis_connected_clients | Gauge | Number of connected clients |
redis_memory_used_bytes | Gauge | Current memory usage |
redis_commands_processed_total | Counter | Total commands processed |
redis_keyspace_hits_total | Counter | Total cache hits |
redis_keyspace_misses_total | Counter | Total cache misses |
redis_connected_slaves | Gauge | Number of connected replicas |
redis_db_keys | Gauge | Number of keys stored |
redis_evicted_keys_total | Counter | Total keys evicted due to memory pressure |
redis_up
Is Redis alive?
redis_up
What this does: returns 1 if Redis is reachable, 0 if it is down.
redis_connected_clients
How many clients are connected?
redis_connected_clients
What this does: shows the current number of connected clients. Compare against redis_config_maxclients to see capacity.
redis_memory_used_bytes
How much memory is Redis using?
redis_memory_used_bytes / 1024 / 1024
What this does: shows current memory usage in megabytes.
The Redis pod in dev has a 2Gi persistent volume. Monitor this metric to make sure in-memory data is not growing unbounded. If it climbs toward the pod's memory limit, check that TTLs are set correctly.
redis_commands_processed_total
What is the throughput?
rate(redis_commands_processed_total[5m])
What this does: shows commands processed per second over the last 5 minutes. Spikes correspond to busy periods in the application.
redis_keyspace_hits_total / redis_keyspace_misses_total
Is the cache effective?
rate(redis_keyspace_hits_total[5m])
What this does: shows cache hits per second. A high hit rate means the cache is working well -- most lookups are served from Redis instead of hitting the database.
redis_connected_slaves
Replication status
redis_connected_slaves
What this does: shows the number of connected replicas.
In our dev environment this should be 0 (standalone mode, no replicas). A non-zero value means something unexpected is happening.
redis_db_keys
How many keys are stored?
redis_db_keys
What this does: shows the number of keys in each database. Useful for understanding how much data is cached.
redis_evicted_keys_total
Are keys being evicted?
rate(redis_evicted_keys_total[5m])
What this does: shows the rate of key evictions per second.
If this is non-zero, Redis is running out of memory and deleting keys to make space. The cache is too small for the workload -- investigate immediately.
Other Metrics
VictoriaMetrics also scrapes cert-manager, Envoy Gateway, and Cilium. These are primarily useful for infrastructure debugging.
| Service | Key metric | What it tells you |
|---|---|---|
| cert-manager | certmanager_certificate_ready_status | 1 = cert valid and ready, 0 = renewal failed |
| Envoy Gateway | envoy_cluster_upstream_cx_active | Active connections per Envoy cluster (backend service) |
| Cilium | cilium_* | Network-level metrics (flows, drops, policy verdicts) |
VictoriaMetrics does not currently expose self-monitoring metrics (vm_*) in this deployment. To check VictoriaMetrics health directly:
curl https://dev.metrics.crawbl.com/health
Chapter 4: Real-World Scenarios
Each scenario gives you the exact query to copy-paste, explains what it does, and tells you what the answer means.
"Is everything running?"
up
What this does: every scraped service appears with a value of 1 (healthy) or 0 (down). This is your health check dashboard in a single query.
- If all values are 1 -- everything is healthy
- If any value is 0 -- that service needs immediate attention
"Is our database about to run out of connections?"
sum(pg_stat_activity_count) / scalar(pg_settings_max_connections) * 100
What this does: divides current connections by the maximum allowed and multiplies by 100 to get a percentage.
- Below 50% -- comfortable, no action needed
- Between 50-80% -- worth monitoring, consider optimizing idle connections
- Above 80% -- danger territory, investigate connection pooling immediately
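The same calculation and thresholds, as a Go sketch (the function names and severity strings are illustrative, not part of any API):

```go
package main

import "fmt"

// connectionUsage mirrors the MetricsQL query above: current
// connections as a percentage of the configured limit.
func connectionUsage(current, max float64) float64 {
	return current / max * 100
}

// severity maps the percentage to this guide's thresholds.
func severity(pct float64) string {
	switch {
	case pct > 80:
		return "danger: investigate connection pooling immediately"
	case pct >= 50:
		return "worth monitoring"
	default:
		return "comfortable"
	}
}

func main() {
	pct := connectionUsage(85, 100) // 85 connections against max_connections=100
	fmt.Printf("%.0f%% -- %s\n", pct, severity(pct))
}
```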
"How fast is our database growing?"
pg_database_size_bytes{datname="crawbl"} / 1024 / 1024
What this does: returns the database size in megabytes.
Set the time range to "Last 7d" or "Last 30d" in vmui to see the growth trend on the graph.
"Is Redis being effective as a cache?"
rate(redis_keyspace_hits_total[5m]) / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) * 100
What this does: returns the cache hit rate as a percentage.
- Above 90% -- healthy, cache is working well
- Between 70-90% -- acceptable, but could improve
- Below 70% -- most lookups hit the database instead of cache. Investigate whether keys expire too quickly or the cache is being evicted
If the result is empty or shows NaN, it means both redis_keyspace_hits_total and redis_keyspace_misses_total are zero -- no cache lookups have occurred yet. This is normal in a low-traffic dev environment. The query will produce results once the application starts using Redis for caching.
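The hit-rate formula and its zero-traffic edge case translate directly into Go. A sketch (names are illustrative), returning ok=false where the MetricsQL version would return an empty or NaN result:

```go
package main

import "fmt"

// hitRate returns cache hits as a percentage of all lookups. With no
// lookups at all, the division would be 0/0, so we signal "no data"
// explicitly instead -- the normal state of a quiet dev environment.
func hitRate(hitsPerSec, missesPerSec float64) (pct float64, ok bool) {
	total := hitsPerSec + missesPerSec
	if total == 0 {
		return 0, false
	}
	return hitsPerSec / total * 100, true
}

func main() {
	if pct, ok := hitRate(93, 7); ok {
		fmt.Printf("hit rate: %.1f%%\n", pct)
	}
}
```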
"How much memory is Redis using?"
redis_memory_used_bytes / 1024 / 1024
What this does: returns memory usage in megabytes.
- If steady -- normal operation
- If climbing steadily -- check whether TTLs are configured and eviction policies are appropriate
- If near pod memory limit -- risk of OOM kill, investigate immediately
"Are there lock contention issues in the database?"
pg_locks_count
What this does: shows all active locks broken down by type.
- Low numbers across all types -- normal
- Sustained high ExclusiveLock -- queries are blocking each other
To investigate further, check for long-running transactions:
pg_stat_activity_max_tx_duration
What this does: shows the duration (in seconds) of the longest-running transaction.
"What is the database write throughput?"
rate(pg_stat_database_tup_inserted{datname="crawbl"}[5m])
What this does: shows rows inserted per second.
To see all write operations combined:
rate(pg_stat_database_tup_inserted{datname="crawbl"}[5m])
+ rate(pg_stat_database_tup_updated{datname="crawbl"}[5m])
+ rate(pg_stat_database_tup_deleted{datname="crawbl"}[5m])
What this does: sums inserts, updates, and deletes into a single writes-per-second value.
"Are our TLS certificates about to expire?"
(certmanager_certificate_expiration_timestamp_seconds - time()) / 86400
What this does: returns the number of days until each certificate expires.
- Above 30 days -- healthy, cert-manager is renewing properly
- Between 7-30 days -- monitor closely
- Below 7 days -- investigate why cert-manager is not renewing
"Show me everything that is being monitored"
count by (__name__) ({__name__=~".+"})
What this does: lists every metric name and how many time series it has.
Run this in Table view (not Graph) to browse what is available without overloading the UI.
Chapter 5: Advanced Queries
Once you are comfortable with the basics, these functions let you ask more sophisticated questions.
rate()
How fast is this counter growing per second?
rate(redis_commands_processed_total[5m])
What this does: takes a counter and returns the per-second rate of increase, averaged over the window. The [5m] window smooths out spikes.
- Use [5m] for stable, smoothed results (recommended default)
- Use [1m] for more responsive but noisier results
rate() only works on counters. Never apply it to a gauge -- the results will be meaningless. If the metric name does not end in _total, it is probably a gauge.
increase()
How much did this counter grow in the last hour?
increase(redis_commands_processed_total[1h])
What this does: returns the total increase over the window. Example result: "Redis processed 152,847 commands in the last hour."
Use increase() when you want absolute numbers rather than per-second rates.
avg_over_time()
What is the average over a time period?
avg_over_time(pg_stat_activity_count{state="active"}[30m])
What this does: returns the average value of a gauge over the window. Example: "We averaged 3.2 active database connections over the last 30 minutes."
max_over_time()
What was the peak?
max_over_time(redis_memory_used_bytes[1h]) / 1024 / 1024
What this does: returns the highest value a gauge reached during the window. Useful for capacity planning.
Combine with min_over_time() to see the full range of a metric over a period.
sum by (label)
Group and total
sum by (state) (pg_stat_activity_count)
What this does: aggregates multiple series into one, grouped by the specified label. This gives you total connections per state (active, idle, etc.) regardless of which pod they belong to.
topk(N, query)
Show me the top N
topk(5, rate(redis_commands_processed_total[5m]))
What this does: returns only the top N series by value. Useful when you have many targets and want to find the busiest ones.
Math operations
You can use standard arithmetic on metrics:
sum(pg_stat_activity_count) / scalar(pg_settings_max_connections) * 100
What this does: calculates connection usage as a percentage.
| Operator | Meaning |
|---|---|
+ | Add |
- | Subtract |
* | Multiply |
/ | Divide |
% | Modulo |
Comparison operators for alert-style queries
sum(pg_stat_activity_count) / scalar(pg_settings_max_connections) * 100 > 80
What this does: returns results only when the value exceeds 80. If the result is empty, you are below the threshold.
Think of it as "show me a problem only if there is one." Other comparison operators: <, >=, <=, ==, !=.
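The "empty result means healthy" behavior can be mimicked in Go -- an illustration of the semantics, not how VictoriaMetrics implements it. Series at or below the threshold are dropped entirely rather than reported as false:

```go
package main

import "fmt"

// aboveThreshold mimics a MetricsQL comparison operator: only series
// whose value exceeds the threshold survive, so a healthy system
// yields an empty result.
func aboveThreshold(series map[string]float64, threshold float64) map[string]float64 {
	out := map[string]float64{}
	for name, v := range series {
		if v > threshold {
			out[name] = v
		}
	}
	return out
}

func main() {
	// Hypothetical per-pod connection usage percentages.
	usage := map[string]float64{"backend-postgresql-0": 92.0, "backend-postgresql-1": 41.0}
	fmt.Println(aboveThreshold(usage, 80)) // only the problem pod survives
}
```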
Chapter 6: Adding Metrics to the Orchestrator
When you are ready to add custom metrics to the Go orchestrator (or any new service), follow these steps.
Step 1: Add the Prometheus client library
go get github.com/prometheus/client_golang/prometheus
go get github.com/prometheus/client_golang/prometheus/promhttp
Step 2: Define and register your metrics
import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "crawbl_http_requests_total",
		Help: "Total number of HTTP requests handled.",
	}, []string{"method", "path", "status"})

	requestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "crawbl_http_request_duration_seconds",
		Help:    "HTTP request latency in seconds.",
		Buckets: prometheus.DefBuckets,
	}, []string{"method", "path"})
)
promauto registers metrics automatically -- you do not need to call prometheus.MustRegister() separately.
Step 3: Expose the /metrics HTTP endpoint
import "github.com/prometheus/client_golang/prometheus/promhttp"
// In your router setup:
http.Handle("/metrics", promhttp.Handler())
Step 4: Instrument your handlers
import "strconv"

func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	timer := prometheus.NewTimer(requestDuration.WithLabelValues(r.Method, r.URL.Path))
	defer timer.ObserveDuration()
	// ... handle the request, capturing statusCode along the way
	// (e.g. by wrapping http.ResponseWriter to record WriteHeader calls) ...
	requestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(statusCode)).Inc()
}
Step 5: Add pod annotations in the Helm values
podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "7171"
  prometheus.io/path: "/metrics"
Replace 7171 with whatever port your service listens on. VictoriaMetrics will discover the pod automatically on the next scrape cycle (within 30 seconds).
Step 6: Verify it is working
After deploying, run this query in vmui:
{__name__=~"crawbl_.*"}
What this does: finds all metrics with the crawbl_ prefix.
If you see your metrics, the scrape is working. If not, check that the pod has the correct annotations:
kubectl get pod <pod-name> -n backend -o jsonpath='{.metadata.annotations}'
Chapter 7: Quick Reference Card
Copy-paste these queries directly into vmui.
| I want to know... | Query |
|---|---|
| Is everything running? | up |
| Is Postgres alive? | pg_up |
| Is Redis alive? | redis_up |
| Database size in MB | pg_database_size_bytes{datname="crawbl"} / 1024 / 1024 |
| Active DB connections | pg_stat_activity_count{state="active"} |
| All DB connections | pg_stat_activity_count |
| Connection usage % | sum(pg_stat_activity_count) / scalar(pg_settings_max_connections) * 100 |
| Longest running transaction | pg_stat_activity_max_tx_duration |
| Database locks | pg_locks_count |
| DB write rate (rows/sec) | rate(pg_stat_database_tup_inserted{datname="crawbl"}[5m]) |
| Redis memory in MB | redis_memory_used_bytes / 1024 / 1024 |
| Redis commands/sec | rate(redis_commands_processed_total[5m]) |
| Redis cache hit rate % | rate(redis_keyspace_hits_total[5m]) / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) * 100 |
| Redis connected clients | redis_connected_clients |
| Redis keys stored | redis_db_keys |
| Redis evictions/sec | rate(redis_evicted_keys_total[5m]) |
| TLS cert days until expiry | (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 |
| All Postgres metrics | {__name__=~"pg_.*"} |
| All Redis metrics | {__name__=~"redis_.*"} |
| Every metric (discovery) | count by (__name__) ({__name__=~".+"}) |
Retention and Limits
| Setting | Value |
|---|---|
| Retention period | 14 days |
| Scrape interval | 30 seconds |
| Deduplication window | 30 seconds |
| Storage volume | 10 Gi |
Metrics older than 14 days are automatically deleted. There is no archive.
If you need to preserve data for a specific investigation, export it before it ages out:
curl -s 'https://dev.metrics.crawbl.com/api/v1/export?match={__name__=~"pg_.*"}' > pg_metrics_export.jsonl