# Monitoring & Logging
This document describes how to set up observability for a Civitas Core V2 deployment. Monitoring and logging are not part of the Civitas Core deployment itself but rely on standard Kubernetes interfaces that integrate with your existing observability stack.
## Metrics

### Enabling Metrics
Enable Prometheus metrics scraping across all components by setting:

```yaml
# deployment/environments/<env>/global.yaml.gotmpl
global:
  metrics:
    enabled: true
```
### Kubernetes Metrics Server

The Metrics Server provides resource usage data (CPU, memory) for pods and nodes. It is required for `kubectl top` and the HorizontalPodAutoscaler to work.

```shell
# Verify Metrics Server is installed
kubectl get deployment metrics-server -n kube-system

# Check resource usage
kubectl top pods -n <instanceSlug>
kubectl top nodes
```
The Metrics Server is listed as an optional but recommended prerequisite; without it, HPA-based autoscaling will not function.
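With the Metrics Server in place, a HorizontalPodAutoscaler can scale a deployment on observed CPU usage. A minimal sketch (the target deployment name `portal`, the replica bounds, and the 70% utilization target are illustrative examples, not Civitas Core defaults):

```yaml
# hpa.yaml — illustrative HorizontalPodAutoscaler backed by Metrics Server data
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: portal-hpa
  namespace: <instanceSlug>
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: portal          # hypothetical target deployment
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```

Apply with `kubectl apply -f hpa.yaml` and observe the decision loop via `kubectl get hpa -n <instanceSlug>`.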
### Recommended Metrics Stack: Prometheus + Grafana
Civitas Core V2 exposes metrics via standard Kubernetes interfaces (ServiceMonitor, PodMonitor) that are compatible with the Prometheus Operator (often deployed via the kube-prometheus-stack Helm chart).
#### Components with Prometheus Integration
| Component | Metrics Type | Enabled By |
|---|---|---|
| PostgreSQL (CloudNativePG) | PodMonitor | `cluster.monitoring.enablePodMonitor: true` |
| Etcd | ServiceMonitor | `metrics.enabled: true` and `metrics.serviceMonitor.enabled: true` in etcd values |
| Kafka (Strimzi) | JMX metrics (configurable) | Strimzi metrics configuration in cluster values |
| Keycloak | Metrics endpoint | Keycloak exposes `/metrics` when configured |
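As a sketch of how the toggles in the table look in practice, values fragments for the first two components might read as follows (key paths per the table; the exact values file location depends on your deployment layout):

```yaml
# CloudNativePG cluster values — enables the PodMonitor
cluster:
  monitoring:
    enablePodMonitor: true
```

```yaml
# Etcd values — enables the metrics endpoint and its ServiceMonitor
metrics:
  enabled: true
  serviceMonitor:
    enabled: true
```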
#### Setup Example

If you don't have a Prometheus stack yet, install it via Helm:

```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false
```
The `*SelectorNilUsesHelmValues=false` flags ensure that Prometheus discovers ServiceMonitors and PodMonitors from all namespaces, including those created by Civitas Core components.
Access Grafana:

```shell
kubectl port-forward svc/monitoring-grafana 3000:80 -n monitoring
# Default credentials: admin / prom-operator
```
#### Recommended Grafana Dashboards
The kube-prometheus-stack ships with a set of pre-installed dashboards. To get additional visibility, import the following community dashboards from Grafana.com (Grafana → Dashboards → Import → Enter ID):
| Dashboard | ID | Covers |
|---|---|---|
| Kubernetes / Views / Global | 15757 | Cluster-wide resource overview |
| Node Exporter Full | 1860 | Detailed per-node system metrics |
| Kubernetes / Views / Pods | 15760 | Per-pod CPU, memory, network |
| NGINX Ingress Controller | 14314 | Request rates, latencies, error rates |
| cert-manager | 20842 | Certificate status and expiry |
| CloudNativePG | 20417 | PostgreSQL instances managed by CNPG |
### Key Metrics to Monitor
| Metric / Signal | Source | What It Tells You |
|---|---|---|
| Pod CPU/memory usage | Metrics Server / Prometheus | Resource pressure, scaling needs |
| Pod restart count | `kube_pod_container_status_restarts_total` | Application stability |
| PostgreSQL replication lag | CNPG PodMonitor | Data consistency across replicas |
| PostgreSQL active connections | CNPG PodMonitor | Connection pool pressure |
| Kafka consumer lag | Strimzi metrics / Kafka UI | Message processing delays |
| Kafka broker disk usage | Strimzi metrics | Storage capacity planning |
| HTTP error rates (4xx/5xx) | Ingress controller metrics | Application errors |
| Certificate expiry | cert-manager metrics | TLS certificate renewal |
| Linkerd success rate | Linkerd viz (if installed) | Service-to-service reliability |
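Several of these signals map to short PromQL queries. A few hedged examples (metric and label names follow kube-state-metrics, ingress-nginx, and cert-manager conventions and may differ in your setup):

```promql
# Pods that restarted within the last 15 minutes
increase(kube_pod_container_status_restarts_total{namespace="<instanceSlug>"}[15m]) > 0

# 5xx error ratio of the NGINX ingress controller over 5 minutes
sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))
  / sum(rate(nginx_ingress_controller_requests[5m]))

# Certificates expiring within 14 days
(certmanager_certificate_expiration_timestamp_seconds - time()) < 14 * 24 * 3600
```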
## Logging

### Accessing Logs
Civitas Core V2 components write logs to stdout/stderr. Kubernetes captures these logs and makes them available via `kubectl logs`.

```shell
# Single pod
kubectl logs <pod-name> -n <instanceSlug>

# Follow logs in real time
kubectl logs -f <pod-name> -n <instanceSlug>

# All pods of a component
kubectl logs -l app.kubernetes.io/name=<component> -n <instanceSlug>

# Previous container (after a crash)
kubectl logs <pod-name> -n <instanceSlug> --previous
```
### Recommended Logging Stack: Loki
For centralized log aggregation, we recommend Grafana Loki with Grafana Alloy as the log collector. Alloy is the successor to Promtail and provides a unified agent for logs, metrics, and traces. Loki integrates natively with Grafana for querying and visualization.
Adjust the following configuration values to your needs, especially for production deployments. The example below is suitable for development or testing environments but may require changes for production (e.g. object storage for Loki, retention policies).
#### Setup Example

```shell
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install Loki
helm install loki grafana/loki \
  --namespace monitoring --create-namespace \
  --set loki.commonConfig.replication_factor=1 \
  --set loki.storage.type=filesystem \
  --set singleBinary.replicas=1 \
  --set singleBinary.persistence.enabled=true \
  --set singleBinary.persistence.size=10Gi

# Install Alloy as log collector
helm install alloy grafana/alloy \
  --namespace monitoring
```
Configure Alloy to ship logs to Loki by providing a configuration that discovers pod logs and forwards them. See the Alloy documentation for details.
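A minimal Alloy configuration along those lines might look like this — a sketch, assuming Loki is reachable at `http://loki-gateway` inside the cluster; consult the Alloy component reference for the authoritative syntax and for label relabeling options:

```alloy
// Discover all pods in the cluster
discovery.kubernetes "pods" {
  role = "pod"
}

// Tail pod logs via the Kubernetes API and forward them to Loki
loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.default.receiver]
}

// Ship everything to the Loki gateway
loki.write "default" {
  endpoint {
    url = "http://loki-gateway/loki/api/v1/push"
  }
}
```

When installing via the Helm chart, this configuration is supplied through the chart's values (as a ConfigMap); see the Alloy Helm chart documentation for the exact key.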
After installation, add Loki as a data source in Grafana (URL: `http://loki-gateway`). You can then query logs using LogQL:
```logql
# All logs from a specific component
{namespace="<instanceSlug>", app_kubernetes_io_name="portal"}

# Error logs across all components
{namespace="<instanceSlug>"} |= "error"

# Keycloak authentication events
{namespace="<instanceSlug>", app_kubernetes_io_instance="keycloak-app"} |= "LOGIN"
```
#### Recommended Grafana Dashboards for Loki
To visualize logs in Grafana, import these community dashboards (Grafana → Dashboards → Import → Enter ID):
| Dashboard | ID | Covers |
|---|---|---|
| Loki & Alloy Logs | 18842 | Log volume, streams, and ingestion overview |
| Loki Kubernetes Logs | 15141 | Namespace/pod-level log browsing |
| Loki Log Search | 13639 | Interactive log search and filtering |
### Key Log Signals to Watch
| Signal | Where to Look | Indicates |
|---|---|---|
| CrashLoopBackOff events | `kubectl get events` | Application crash, check `--previous` logs |
| Database connection errors | Portal, Keycloak, Authz logs | PostgreSQL unavailable or misconfigured |
| Certificate errors | Ingress controller logs, cert-manager logs | TLS provisioning failures |
| 401 / 403 responses | APISIX, Keycloak logs | Authentication/authorization issues |
| Kafka consumer errors | Apicurio, Redpanda Connect logs | Message processing failures |
| OPA decision denials | OPA logs | Policy violations |
## Alerting Recommendations
If using Prometheus with Alertmanager, consider the following minimal alerting rules:
| Alert | Condition | Severity |
|---|---|---|
| PodCrashLooping | Pod restart count > 3 in 15 minutes | Critical |
| PodNotReady | Pod not ready for > 5 minutes | Warning |
| PVCAlmostFull | PVC usage > 85% | Warning |
| PostgreSQLReplicaLag | Replication lag > 30 seconds | Warning |
| CertificateExpiringSoon | Certificate expires within 14 days | Warning |
| HighErrorRate | HTTP 5xx rate > 5% over 5 minutes | Critical |
| KafkaConsumerLag | Consumer lag growing continuously | Warning |
These can be implemented as `PrometheusRule` resources when using the Prometheus Operator.
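As an illustration, the PodCrashLooping alert from the table could be expressed as a `PrometheusRule` along these lines (the resource name, labels, and annotations are examples, not shipped defaults):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: civitas-core-alerts     # hypothetical name
  namespace: monitoring
spec:
  groups:
    - name: civitas-core
      rules:
        - alert: PodCrashLooping
          # More than 3 restarts within 15 minutes
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
```

The remaining alerts in the table follow the same pattern, each with its own `expr` and `severity`.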