Monitoring & Logging

This document describes how to set up observability for a Civitas Core V2 deployment. Monitoring and logging are not part of the Civitas Core deployment itself but rely on standard Kubernetes interfaces that integrate with your existing observability stack.

Metrics

Enabling Metrics

Enable Prometheus metrics scraping across all components by setting:

# deployment/environments/<env>/global.yaml.gotmpl
global:
  metrics:
    enabled: true

Kubernetes Metrics Server

The Metrics Server provides resource usage data (CPU, memory) for pods and nodes. It is required for kubectl top and HorizontalPodAutoscaler to work.

# Verify Metrics Server is installed
kubectl get deployment metrics-server -n kube-system

# Check resource usage
kubectl top pods -n <instanceSlug>
kubectl top nodes
info

Metrics Server is listed as an optional but recommended prerequisite. Without it, HPA-based autoscaling will not function.
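As an illustration of that dependency, a HorizontalPodAutoscaler scaling on CPU utilization reads the resource metrics that the Metrics Server provides. A minimal sketch (the target deployment name and thresholds are illustrative, not part of the Civitas Core charts):

```yaml
# Illustrative HPA; requires the Metrics Server to serve resource metrics.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: portal                 # hypothetical target deployment
  namespace: <instanceSlug>
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: portal
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # scale out above 80% of requested CPU
```

Without the Metrics Server, this HPA reports unknown metrics and never scales.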

Civitas Core V2 exposes metrics via standard Kubernetes interfaces (ServiceMonitor, PodMonitor) that are compatible with the Prometheus Operator (often deployed via the kube-prometheus-stack Helm chart).
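For orientation, a ServiceMonitor created by one of these components looks roughly like the sketch below (names and labels are illustrative; the actual objects are generated by the component charts). The Prometheus Operator watches for such resources and turns them into scrape configuration:

```yaml
# Illustrative ServiceMonitor shape; not a manifest you need to create yourself.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd                   # hypothetical name
  namespace: <instanceSlug>
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: etcd
  endpoints:
    - port: metrics            # named port on the target Service
      interval: 30s
```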

Components with Prometheus Integration

| Component | Metrics Type | Enabled By |
| --- | --- | --- |
| PostgreSQL (CloudNativePG) | PodMonitor | cluster.monitoring.enablePodMonitor: true |
| Etcd | ServiceMonitor | metrics.enabled: true and metrics.serviceMonitor.enabled: true in etcd values |
| Kafka (Strimzi) | JMX metrics (configurable) | Strimzi metrics configuration in cluster values |
| Keycloak | Metrics endpoint | Keycloak exposes /metrics when configured |
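Taken together, the per-component settings from the table above might appear in your environment values roughly as follows (a sketch; the exact file locations depend on your deployment layout):

```yaml
# CloudNativePG cluster values
cluster:
  monitoring:
    enablePodMonitor: true

# etcd values
metrics:
  enabled: true
  serviceMonitor:
    enabled: true
```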

Setup Example

If you don't have a Prometheus stack yet, install it via Helm:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false

The *SelectorNilUsesHelmValues=false flags ensure that Prometheus discovers ServiceMonitors and PodMonitors from all namespaces, including those created by Civitas Core components.
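If you prefer to keep these settings under version control, the same flags can live in a values file instead of --set arguments; a sketch:

```yaml
# monitoring-values.yaml (equivalent to the --set flags above)
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
```

Then install with helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring --create-namespace -f monitoring-values.yaml.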

Access Grafana:

kubectl port-forward svc/monitoring-grafana 3000:80 -n monitoring
# Default credentials: admin / prom-operator

The kube-prometheus-stack ships with a set of pre-installed dashboards. To get additional visibility, import the following community dashboards from Grafana.com (Grafana → Dashboards → Import → Enter ID):

| Dashboard | ID | Covers |
| --- | --- | --- |
| Kubernetes / Views / Global | 15757 | Cluster-wide resource overview |
| Node Exporter Full | 1860 | Detailed per-node system metrics |
| Kubernetes / Views / Pods | 15760 | Per-pod CPU, memory, network |
| NGINX Ingress Controller | 14314 | Request rates, latencies, error rates |
| cert-manager | 20842 | Certificate status and expiry |
| CloudNativePG | 20417 | PostgreSQL instances managed by CNPG |

Key Metrics to Monitor

| Metric / Signal | Source | What It Tells You |
| --- | --- | --- |
| Pod CPU/memory usage | Metrics Server / Prometheus | Resource pressure, scaling needs |
| Pod restart count | kube_pod_container_status_restarts_total | Application stability |
| PostgreSQL replication lag | CNPG PodMonitor | Data consistency across replicas |
| PostgreSQL active connections | CNPG PodMonitor | Connection pool pressure |
| Kafka consumer lag | Strimzi metrics / Kafka UI | Message processing delays |
| Kafka broker disk usage | Strimzi metrics | Storage capacity planning |
| HTTP error rates (4xx/5xx) | Ingress controller metrics | Application errors |
| Certificate expiry | cert-manager metrics | TLS certificate renewal |
| Linkerd success rate | Linkerd viz (if installed) | Service-to-service reliability |
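Several of these signals translate into short PromQL queries. The examples below are sketches that assume kube-state-metrics and the standard exporters from kube-prometheus-stack; label and metric names may differ in your setup:

```promql
# Pod restarts over the last 15 minutes
increase(kube_pod_container_status_restarts_total{namespace="<instanceSlug>"}[15m])

# Pods currently not ready
kube_pod_status_ready{condition="false", namespace="<instanceSlug>"} == 1

# Share of 5xx responses at the NGINX ingress controller
sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))
  / sum(rate(nginx_ingress_controller_requests[5m]))
```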

Logging

Accessing Logs

Civitas Core V2 components write logs to stdout/stderr. Kubernetes captures these logs and makes them available via kubectl logs.

# Single pod
kubectl logs <pod-name> -n <instanceSlug>

# Follow logs in real-time
kubectl logs -f <pod-name> -n <instanceSlug>

# All pods of a component
kubectl logs -l app.kubernetes.io/name=<component> -n <instanceSlug>

# Previous container (after crash)
kubectl logs <pod-name> -n <instanceSlug> --previous

For centralized log aggregation, we recommend Grafana Loki with Grafana Alloy as the log collector. Alloy is the successor to Promtail and provides a unified agent for logs, metrics, and traces. Loki integrates natively with Grafana for querying and visualization.

warning

Adjust the configuration values below to fit your environment, especially for production deployments. The example is suitable for development or testing but will typically need changes for production (e.g. object storage for Loki, retention policies).

Setup Example

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install Loki
helm install loki grafana/loki \
  --namespace monitoring --create-namespace \
  --set loki.commonConfig.replication_factor=1 \
  --set loki.storage.type=filesystem \
  --set singleBinary.replicas=1 \
  --set singleBinary.persistence.enabled=true \
  --set singleBinary.persistence.size=10Gi

# Install Alloy as log collector
helm install alloy grafana/alloy \
  --namespace monitoring

Configure Alloy to ship logs to Loki by providing a configuration that discovers pod logs and forwards them. See the Alloy documentation for details.
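A minimal Alloy configuration along these lines might look as follows (a sketch; the Loki URL assumes the loki-gateway Service in the monitoring namespace, and production setups usually add relabeling to attach pod labels):

```alloy
// Discover all pods in the cluster
discovery.kubernetes "pods" {
  role = "pod"
}

// Tail the discovered pods' logs via the Kubernetes API
loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.default.receiver]
}

// Ship collected logs to Loki
loki.write "default" {
  endpoint {
    url = "http://loki-gateway.monitoring.svc/loki/api/v1/push"
  }
}
```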

After installation, add Loki as a data source in Grafana (URL: http://loki-gateway). You can then query logs using LogQL:

# All logs from a specific component
{namespace="<instanceSlug>", app_kubernetes_io_name="portal"}

# Error logs across all components
{namespace="<instanceSlug>"} |= "error"

# Keycloak authentication events
{namespace="<instanceSlug>", app_kubernetes_io_instance="keycloak-app"} |= "LOGIN"

To visualize logs in Grafana, import these community dashboards (Grafana → Dashboards → Import → Enter ID):

| Dashboard | ID | Covers |
| --- | --- | --- |
| Loki & Alloy Logs | 18842 | Log volume, streams, and ingestion overview |
| Loki Kubernetes Logs | 15141 | Namespace/pod-level log browsing |
| Loki Log Search | 13639 | Interactive log search and filtering |

Key Log Signals to Watch

| Signal | Where to Look | Indicates |
| --- | --- | --- |
| CrashLoopBackOff events | kubectl get events | Application crash, check --previous logs |
| Database connection errors | Portal, Keycloak, Authz logs | PostgreSQL unavailable or misconfigured |
| Certificate errors | Ingress controller logs, cert-manager logs | TLS provisioning failures |
| 401 / 403 responses | APISIX, Keycloak logs | Authentication/authorization issues |
| Kafka consumer errors | Apicurio, Redpanda Connect logs | Message processing failures |
| OPA decision denials | OPA logs | Policy violations |

Alerting Recommendations

If using Prometheus with Alertmanager, consider the following minimal alerting rules:

| Alert | Condition | Severity |
| --- | --- | --- |
| PodCrashLooping | Pod restart count > 3 in 15 minutes | Critical |
| PodNotReady | Pod not ready for > 5 minutes | Warning |
| PVCAlmostFull | PVC usage > 85% | Warning |
| PostgreSQLReplicaLag | Replication lag > 30 seconds | Warning |
| CertificateExpiringSoon | Certificate expires within 14 days | Warning |
| HighErrorRate | HTTP 5xx rate > 5% over 5 minutes | Critical |
| KafkaConsumerLag | Consumer lag growing continuously | Warning |

These can be implemented as PrometheusRule resources when using the Prometheus Operator.
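As a sketch, the PodCrashLooping rule from the table above could be expressed like this (the resource name is hypothetical, and depending on your Prometheus configuration the rule may need a label such as release so it is picked up):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: civitas-core-alerts        # hypothetical name
  namespace: monitoring
spec:
  groups:
    - name: civitas-core
      rules:
        - alert: PodCrashLooping
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
```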