# Troubleshooting

## Quick Checks
Run these commands first to get an overview of the deployment state:
```bash
# Check all pods in the instance namespace
kubectl get pods -n <instanceSlug> -o wide

# Check all Helm releases
helm list -n <instanceSlug> -a

# Check the 30 most recent events
kubectl get events -n <instanceSlug> --sort-by='.lastTimestamp' | tail -30

# Check ingress resources
kubectl get ingress -n <instanceSlug>

# Helmfile status overview
helmfile -f deployment/helmfile.yaml status -e <environment>
```
## Where to Find Logs

| Component | How to Access Logs |
|---|---|
| Any pod | `kubectl logs <pod-name> -n <instanceSlug>` |
| Previous crashed pod | `kubectl logs <pod-name> -n <instanceSlug> --previous` |
| Multi-container pod | `kubectl logs <pod-name> -c <container-name> -n <instanceSlug>` |
| Linkerd sidecar | `kubectl logs <pod-name> -c linkerd-proxy -n <instanceSlug>` |
| All pods of a component | `kubectl logs -l app.kubernetes.io/name=<component> -n <instanceSlug>` |
| CloudNativePG operator | `kubectl logs -l app.kubernetes.io/name=cloudnative-pg -n <instanceSlug>` |
| Strimzi operator | `kubectl logs -l app.kubernetes.io/name=strimzi-kafka-operator -n <instanceSlug>` |
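To watch a component while reproducing an issue, you can stream logs from all of its replicas at once. This is a convenience sketch using the same label selector as the table above; `<component>` is a placeholder, and `--prefix` labels each line with the originating pod:

```bash
# Stream the last 100 lines from every replica of a component,
# prefixing each line with its pod and container name
kubectl logs -l app.kubernetes.io/name=<component> -n <instanceSlug> \
  --follow --tail=100 --timestamps --prefix
```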
## Where to Find Metrics

If `global.metrics.enabled` is set to `true`, metrics are exposed via:
| Source | Endpoint | Notes |
|---|---|---|
| PostgreSQL | PodMonitor (auto-discovered by Prometheus) | Enabled in the production profile |
| Etcd | ServiceMonitor at `/metrics` | Requires `metrics.enabled: true` in the etcd values |
| Kubernetes Metrics Server | `kubectl top pods -n <instanceSlug>` | Requires the Metrics Server to be installed in the cluster |
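If Prometheus is not picking up metrics, it is worth confirming that the monitor objects actually exist. This sketch assumes the Prometheus Operator CRDs (`PodMonitor`/`ServiceMonitor`) are installed in the cluster:

```bash
# List the scrape targets Prometheus would discover in this namespace
kubectl get podmonitor,servicemonitor -n <instanceSlug>

# Quick per-container resource snapshot (requires the Metrics Server)
kubectl top pods -n <instanceSlug> --containers
```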
See Monitoring & Logging for the full observability setup.
## Common Failure Modes

### Pods Not Starting

**Symptom:** Pods stuck in `Pending`, `CrashLoopBackOff`, or `ImagePullBackOff`.
**Diagnosis:**

```bash
# Check pod events
kubectl describe pod <pod-name> -n <instanceSlug>

# Check pod logs
kubectl logs <pod-name> -n <instanceSlug>

# Check events in the namespace
kubectl get events -n <instanceSlug> --sort-by='.lastTimestamp'
```
**Common causes:**

- **Pending**: Insufficient resources (CPU/memory). Scale your cluster or reduce resource requests.
- **ImagePullBackOff**: Container image not accessible. Check image names in `components/<name>/images.yaml` and ensure network/registry access.
- **CrashLoopBackOff**: Application error. Check logs for configuration issues or missing dependencies.

If you see a `keycloak-config-cli` pod stuck in `CrashLoopBackOff` with errors related to missing secrets, ensure that you added the secrets as described in the prerequisites documentation.
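When several pods are unhealthy at once, a one-liner that prints each pod's waiting reason gives a faster overview than describing pods one by one. This is a generic kubectl jsonpath sketch, not specific to this deployment:

```bash
# Print "<pod>  <waiting reason>" for every pod in the namespace,
# surfacing states such as ImagePullBackOff or CrashLoopBackOff
kubectl get pods -n <instanceSlug> -o jsonpath=\
'{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'
```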
### Helmfile Sync Fails

**Symptom:** `helmfile sync` or `helmfile apply` exits with an error, even though the Kubernetes cluster itself shows no abnormalities.
**Diagnosis:**

```bash
# Run with verbose output
helmfile -f deployment/helmfile.yaml sync -e local --debug

# Validate templates without deploying
helmfile -f deployment/helmfile.yaml lint -e local

# Render templates to check for errors
helmfile -f deployment/helmfile.yaml template -e local
```
**Common causes:**

- **Template rendering error**: Syntax error in `.gotmpl` files. Check the error message for the file and line number.
- **Timeout**: A release took too long to become ready. The default timeout is 300 seconds (configurable in `defaults/helm-defaults.yaml`). Increase it if needed.
- **CRD not installed**: Some components (PostgreSQL, Kafka) require CRDs from their operators. Ensure the operator part deploys before the cluster part.
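Before re-running a failed sync, it can help to preview what would change and to narrow the run to a single release. `helmfile diff` requires the helm-diff plugin, and `<release>` is a placeholder for one of your release names:

```bash
# Preview changes without applying them (needs the helm-diff plugin)
helmfile -f deployment/helmfile.yaml diff -e local

# Re-run only the failing release with verbose output
helmfile -f deployment/helmfile.yaml -l name=<release> sync -e local --debug
```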
### PostgreSQL (CloudNativePG) Issues

**Symptom:** PostgreSQL cluster pods are not ready, or dependent components fail to connect.
**Diagnosis:**

```bash
# Check CNPG cluster status
kubectl get cluster -n <instanceSlug>

# Check CNPG operator logs
kubectl logs -l app.kubernetes.io/name=cloudnative-pg -n <instanceSlug>

# Check PostgreSQL pod logs
kubectl logs <postgres-cluster-pod> -n <instanceSlug>

# Check database role setup job
kubectl get jobs -n <instanceSlug> | grep database
kubectl logs job/<database-role-setup-job> -n <instanceSlug>
```
**Common causes:**

- **PVC not bound**: StorageClass not available or insufficient capacity. Check `kubectl get pvc -n <instanceSlug>`.
- **Database role setup fails**: The role setup job may fail if the PostgreSQL cluster is not yet ready. The job retries automatically.
- **Connection refused**: Ensure the service name matches `postgres-cluster-rw` (the read-write service endpoint).
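To rule out application-side problems, you can test connectivity to the read-write service directly. This sketch assumes CloudNativePG stored application credentials in a secret named `postgres-cluster-app` (the operator's usual `<cluster>-app` convention) and uses the operator's default `app` user and database; adjust all names to your setup:

```bash
# Read the application user's password from the CNPG-generated secret (assumed name)
kubectl get secret postgres-cluster-app -n <instanceSlug> \
  -o jsonpath='{.data.password}' | base64 -d

# Open a throwaway psql session against the read-write endpoint
# (you will be prompted for the password printed above)
kubectl run psql-check -it --rm --restart=Never --image=postgres:16 \
  -n <instanceSlug> -- psql -h postgres-cluster-rw -U app -d app
```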
### Kafka (Strimzi) Issues

**Symptom:** Kafka pods not starting, or producers/consumers failing to connect.
**Diagnosis:**

```bash
# Check Kafka cluster status
kubectl get kafka -n <instanceSlug>

# Check Strimzi operator logs
kubectl logs -l app.kubernetes.io/name=strimzi-kafka-operator -n <instanceSlug>

# Check Kafka broker logs
kubectl logs <kafka-cluster-kafka-0> -n <instanceSlug>

# Verify bootstrap service
kubectl get svc kafka-cluster-kafka-bootstrap -n <instanceSlug>
```
**Common causes:**

- **Broker not ready**: The Strimzi operator may still be reconciling. Check the operator logs for progress.
- **Topic creation fails**: Ensure `auto.create.topics.enable` is `true`, or create topics via Strimzi `KafkaTopic` CRDs.
- **PVC issues**: Kafka requires persistent storage. Verify PVCs are bound.
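A quick way to verify that the brokers are actually serving requests is to list topics from inside a broker pod. This assumes the Strimzi image layout (`/opt/kafka/bin`) and a plain listener on port 9092; with TLS-only listeners you would need client certificates instead:

```bash
# List topics using the broker's own CLI tools
kubectl exec -n <instanceSlug> <kafka-cluster-kafka-0> -- \
  /opt/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
```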
### Keycloak Issues

**Symptom:** Keycloak not accessible, login fails, or dependent components cannot authenticate.
**Diagnosis:**

```bash
# Check the Keycloak pod
kubectl logs -l app.kubernetes.io/instance=keycloak-app -n <instanceSlug>

# Check the keycloak-config-cli job (configures realms, clients, etc.)
kubectl logs -l app.kubernetes.io/name=keycloak-config-cli -n <instanceSlug>

# Verify the admin secret exists
kubectl get secret keycloak-admin-user -n <instanceSlug>
```
**Common causes:**

- **keycloak-config-cli fails**: Missing `keycloak-smtp` secret. Create it as described in the prerequisites.
- **Database connection error**: PostgreSQL must be running and the `keycloak` database must exist. Check the database role setup job.
- **Ingress not working**: Verify DNS records point to the ingress controller and that the `idm.<domain>` subdomain is configured.
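From outside the cluster, the OIDC discovery document is a cheap end-to-end check of both the ingress and the realm configuration. The realm name is a placeholder here; use whichever realm keycloak-config-cli provisions:

```bash
# Should return a JSON document with issuer, token, and auth endpoints
curl -fsS https://idm.<domain>/realms/<realm>/.well-known/openid-configuration
```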
### APISIX / Etcd Issues

**Symptom:** API gateway not routing traffic, or APISIX pods in `CrashLoopBackOff`.
**Diagnosis:**

```bash
# Check APISIX pods
kubectl logs -l app.kubernetes.io/name=apisix -n <instanceSlug>

# Check the etcd cluster
kubectl logs -l app.kubernetes.io/name=etcd -n <instanceSlug>

# Verify etcd is healthy
kubectl exec -n <instanceSlug> etcd-0 -- etcdctl endpoint health
```
**Common causes:**

- **Etcd not ready**: APISIX depends on etcd. If etcd is not healthy, APISIX cannot store or retrieve route configurations.
- **RBAC misconfiguration**: APISIX connects to etcd with specific credentials. Check that the etcd user and role for APISIX exist.
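To dig into the RBAC side, you can inspect the etcd users and roles directly. If etcd authentication is enabled, `etcdctl` additionally needs credentials (e.g. `--user root:<password>`); the role name below is a placeholder:

```bash
# List etcd users, then inspect the role APISIX is expected to use
kubectl exec -n <instanceSlug> etcd-0 -- etcdctl user list
kubectl exec -n <instanceSlug> etcd-0 -- etcdctl role get <apisix-role>
```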
### Linkerd / Service Mesh Issues

**Symptom:** Inter-service communication fails, mTLS errors, or sidecars not injected.
**Diagnosis:**

```bash
# Check if Linkerd sidecars are injected
kubectl get pods -n <instanceSlug> -o custom-columns=NAME:.metadata.name,CONTAINERS:.spec.containers[*].name

# Check Linkerd proxy logs
kubectl logs <pod-name> -c linkerd-proxy -n <instanceSlug>

# Verify the namespace annotation
kubectl get namespace <instanceSlug> -o jsonpath='{.metadata.annotations.linkerd\.io/inject}'
```
**Common causes:**

- **Sidecars not injected**: The namespace is missing the `linkerd.io/inject: enabled` annotation. Set `global.serviceMesh.patchNamespaces: true` or annotate manually.
- **Init job failures with sidecars**: Ensure `proxy.nativeSidecar=true` is set in the Linkerd configuration for proper init container handling.
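If you have the `linkerd` CLI available locally, its built-in checks cover most data-plane problems in one command; the second command additionally requires the linkerd-viz extension to be installed:

```bash
# Validate the data-plane proxies in the namespace
linkerd check --proxy -n <instanceSlug>

# Live success rate and latency per pod (linkerd-viz extension)
linkerd viz stat pods -n <instanceSlug>
```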
## Increasing Timeouts

If deployments time out during `helmfile sync`, increase the timeout in `defaults/helm-defaults.yaml`:

```yaml
helmDefaults:
  wait: true
  waitForJobs: true
  timeout: 600 # seconds (default: 300)
```
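If only one slow release (for example a stateful component) keeps hitting the limit, helmfile also accepts a per-release `timeout` that overrides the default. The release name below is a hypothetical example:

```yaml
# Per-release override in deployment/helmfile.yaml (sketch)
releases:
  - name: kafka-cluster   # example release name
    timeout: 900          # overrides helmDefaults.timeout for this release only
```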