Version: V2-Next

Troubleshooting

Quick Checks

Run these commands first to get an overview of the deployment state:

```bash
# Check all pods in the instance namespace
kubectl get pods -n <instanceSlug> -o wide

# Check all Helm releases
helm list -n <instanceSlug> -a

# Check the 30 most recent events, newest last
kubectl get events -n <instanceSlug> --sort-by='.lastTimestamp' | tail -30

# Check ingress resources
kubectl get ingress -n <instanceSlug>

# Helmfile status overview
helmfile -f deployment/helmfile.yaml status -e <environment>
```
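
The checks above can be bundled into a small wrapper for repeated use — an illustrative sketch, assuming `kubectl`, `helm`, and `helmfile` are on the `PATH` and the script is run from the repository root:

```shell
#!/usr/bin/env bash
# quick-checks.sh — run all quick checks in one pass (hypothetical helper, not part of the repo).
set -euo pipefail

NS="${1:?usage: quick-checks.sh <instanceSlug> [environment]}"
ENV="${2:-local}"

kubectl get pods -n "$NS" -o wide
helm list -n "$NS" -a
kubectl get events -n "$NS" --sort-by='.lastTimestamp' | tail -30
kubectl get ingress -n "$NS"
helmfile -f deployment/helmfile.yaml status -e "$ENV"
```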

Where to Find Logs

| Component | How to Access Logs |
|---|---|
| Any pod | `kubectl logs <pod-name> -n <instanceSlug>` |
| Previous crashed pod | `kubectl logs <pod-name> -n <instanceSlug> --previous` |
| Multi-container pod | `kubectl logs <pod-name> -c <container-name> -n <instanceSlug>` |
| Linkerd sidecar | `kubectl logs <pod-name> -c linkerd-proxy -n <instanceSlug>` |
| All pods of a component | `kubectl logs -l app.kubernetes.io/name=<component> -n <instanceSlug>` |
| CloudNativePG operator | `kubectl logs -l app.kubernetes.io/name=cloudnative-pg -n <instanceSlug>` |
| Strimzi operator | `kubectl logs -l app.kubernetes.io/name=strimzi-kafka-operator -n <instanceSlug>` |

Where to Find Metrics

If `global.metrics.enabled` is set to `true`, metrics are exposed via:

| Source | Endpoint | Notes |
|---|---|---|
| PostgreSQL | `PodMonitor` (auto-discovered by Prometheus) | Enabled in the production profile |
| Etcd | `ServiceMonitor` at `/metrics` | Requires `metrics.enabled: true` in the etcd values |
| Kubernetes Metrics Server | `kubectl top pods -n <instanceSlug>` | Requires the Metrics Server installed in the cluster |
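
To confirm the monitor objects actually exist in the namespace — a quick check, assuming the Prometheus Operator CRDs are installed in the cluster:

```shell
# List PodMonitor/ServiceMonitor resources that Prometheus can discover
kubectl get podmonitor,servicemonitor -n <instanceSlug>

# Spot-check that the Metrics Server pipeline is working
kubectl top pods -n <instanceSlug>
```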

See Monitoring & Logging for the full observability setup.


Common Failure Modes

Pods Not Starting

Symptom: Pods stuck in Pending, CrashLoopBackOff, or ImagePullBackOff.

Diagnosis:

```bash
# Check pod events
kubectl describe pod <pod-name> -n <instanceSlug>

# Check pod logs
kubectl logs <pod-name> -n <instanceSlug>

# Check events in the namespace
kubectl get events -n <instanceSlug> --sort-by='.lastTimestamp'
```

Common causes:

  • Pending: Insufficient resources (CPU/memory). Scale your cluster or reduce resource requests.
  • ImagePullBackOff: Container image not accessible. Check image names in components/<name>/images.yaml and ensure network/registry access.
  • CrashLoopBackOff: Application error. Check logs for configuration issues or missing dependencies.
> **Tip:** If you see a `keycloak-config-cli` pod stuck in `CrashLoopBackOff` with errors related to missing secrets, ensure that you added the secrets as described in the prerequisites documentation.
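
For `Pending` pods caused by resource pressure, requests can be lowered per component — a hypothetical values override; the exact keys depend on the component's chart, so check its `values.yaml` before applying:

```yaml
# Hypothetical resource override for a single component (chart-specific keys may differ)
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    memory: 512Mi
```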

Helmfile Sync Fails

Symptom: `helmfile sync` or `helmfile apply` exits with an error, even though the Kubernetes cluster itself shows no abnormalities.

Diagnosis:

```bash
# Run with verbose output
helmfile -f deployment/helmfile.yaml sync -e local --debug

# Validate templates without deploying
helmfile -f deployment/helmfile.yaml lint -e local

# Render templates to check for errors
helmfile -f deployment/helmfile.yaml template -e local
```

Common causes:

  • Template rendering error: Syntax error in .gotmpl files. Check the error message for file and line number.
  • Timeout: A release took too long to become ready. The default timeout is 300 seconds (configurable in defaults/helm-defaults.yaml). Increase it if needed.
  • CRD not installed: Some components (PostgreSQL, Kafka) require CRDs from their operators. Ensure the operator part deploys before the cluster part.
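
When a CRD race is suspected, the operator release can be synced on its own before everything else — a sketch using helmfile's `--selector` flag; the release name `cloudnative-pg` is an assumption, so check your helmfile for the actual release names:

```shell
# Sync only the operator release first so its CRDs are registered
helmfile -f deployment/helmfile.yaml -e local -l name=cloudnative-pg sync

# Then sync the remaining releases
helmfile -f deployment/helmfile.yaml -e local sync
```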

PostgreSQL (CloudNativePG) Issues

Symptom: PostgreSQL cluster pods not ready, or dependent components failing to connect.

Diagnosis:

```bash
# Check CNPG cluster status
kubectl get cluster -n <instanceSlug>

# Check CNPG operator logs
kubectl logs -l app.kubernetes.io/name=cloudnative-pg -n <instanceSlug>

# Check PostgreSQL pod logs
kubectl logs <postgres-cluster-pod> -n <instanceSlug>

# Check database role setup job
kubectl get jobs -n <instanceSlug> | grep database
kubectl logs job/<database-role-setup-job> -n <instanceSlug>
```

Common causes:

  • PVC not bound: StorageClass not available or insufficient capacity. Check kubectl get pvc -n <instanceSlug>.
  • Database role setup fails: The role setup job may fail if the PostgreSQL cluster is not yet ready. The job will retry automatically.
  • Connection refused: Ensure the service name matches postgres-cluster-rw (the read-write service endpoint).
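
Connectivity to the read-write service can be probed from inside the cluster with a throwaway pod — a sketch; the `postgres:16` image tag is an assumption, and any image that ships `pg_isready` works:

```shell
# Probe the postgres-cluster-rw service from within the namespace
kubectl run pg-check --rm -i --restart=Never \
  --image=postgres:16 -n <instanceSlug> -- \
  pg_isready -h postgres-cluster-rw -p 5432
```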

Kafka (Strimzi) Issues

Symptom: Kafka pods not starting, or producers/consumers failing to connect.

Diagnosis:

```bash
# Check Kafka cluster status
kubectl get kafka -n <instanceSlug>

# Check Strimzi operator logs
kubectl logs -l app.kubernetes.io/name=strimzi-kafka-operator -n <instanceSlug>

# Check Kafka broker logs
kubectl logs <kafka-cluster-kafka-0> -n <instanceSlug>

# Verify bootstrap service
kubectl get svc kafka-cluster-kafka-bootstrap -n <instanceSlug>
```

Common causes:

  • Broker not ready: Strimzi operator may still be reconciling. Check operator logs for progress.
  • Topic creation fails: Ensure auto.create.topics.enable is true or create topics via Strimzi KafkaTopic CRDs.
  • PVC issues: Kafka requires persistent storage. Verify PVCs are bound.
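
Instead of relying on auto-creation, topics can be declared via Strimzi's `KafkaTopic` CRD — a minimal sketch, assuming the Kafka cluster is named `kafka-cluster` (matching the service names above); the topic name and sizing are placeholders:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: example-topic                   # hypothetical topic name
  namespace: <instanceSlug>
  labels:
    strimzi.io/cluster: kafka-cluster   # must match the Kafka cluster name
spec:
  partitions: 3
  replicas: 1
```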

Keycloak Issues

Symptom: Keycloak not accessible, login fails, or dependent components cannot authenticate.

Diagnosis:

```bash
# Check Keycloak pod
kubectl logs -l app.kubernetes.io/instance=keycloak-app -n <instanceSlug>

# Check keycloak-config-cli job (configures realms, clients, etc.)
kubectl logs -l app.kubernetes.io/name=keycloak-config-cli -n <instanceSlug>

# Verify admin secret exists
kubectl get secret keycloak-admin-user -n <instanceSlug>
```

Common causes:

  • keycloak-config-cli fails: Missing keycloak-smtp secret. Create it as described in prerequisites.
  • Database connection error: PostgreSQL must be running and the keycloak database must exist. Check the database role setup job.
  • Ingress not working: Verify DNS records point to the ingress controller and that the idm.<domain> subdomain is configured.
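
The ingress path can be checked end to end — a sketch; replace `<domain>` with your instance domain, and note that `-k` skips TLS verification, which is only appropriate for debugging:

```shell
# Confirm the ingress resource exists and has an address assigned
kubectl get ingress -n <instanceSlug>

# Check that the idm subdomain answers over HTTPS (prints the HTTP status code)
curl -sk -o /dev/null -w '%{http_code}\n' https://idm.<domain>/
```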

APISIX / Etcd Issues

Symptom: API gateway not routing traffic, or APISIX pods in CrashLoopBackOff.

Diagnosis:

```bash
# Check APISIX pods
kubectl logs -l app.kubernetes.io/name=apisix -n <instanceSlug>

# Check etcd cluster
kubectl logs -l app.kubernetes.io/name=etcd -n <instanceSlug>

# Verify etcd is healthy
kubectl exec -n <instanceSlug> etcd-0 -- etcdctl endpoint health
```

Common causes:

  • Etcd not ready: APISIX depends on etcd. If etcd is not healthy, APISIX cannot store or retrieve route configurations.
  • RBAC misconfiguration: APISIX connects to etcd with specific credentials. Check that the etcd user and role for APISIX exist.
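
The etcd users and roles can be inspected directly — a sketch; depending on how etcd auth is configured, `etcdctl` may additionally need `--user` or certificate flags, and the role name `apisix` is an assumption:

```shell
# List all etcd users and roles
kubectl exec -n <instanceSlug> etcd-0 -- etcdctl user list
kubectl exec -n <instanceSlug> etcd-0 -- etcdctl role list

# Inspect the permissions of the role APISIX uses (role name is an assumption)
kubectl exec -n <instanceSlug> etcd-0 -- etcdctl role get apisix
```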

Linkerd / Service Mesh Issues

Symptom: Inter-service communication fails, mTLS errors, or sidecars not injected.

Diagnosis:

```bash
# Check if Linkerd sidecars are injected
kubectl get pods -n <instanceSlug> -o custom-columns=NAME:.metadata.name,CONTAINERS:.spec.containers[*].name

# Check Linkerd proxy logs
kubectl logs <pod-name> -c linkerd-proxy -n <instanceSlug>

# Verify namespace annotation
kubectl get namespace <instanceSlug> -o jsonpath='{.metadata.annotations.linkerd\.io/inject}'
```

Common causes:

  • Sidecars not injected: Namespace missing linkerd.io/inject: enabled annotation. Set global.serviceMesh.patchNamespaces: true or annotate manually.
  • Init job failures with sidecars: Ensure proxy.nativeSidecar=true is set in Linkerd configuration for proper init container handling.
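
If the annotation is missing, it can be applied manually; existing pods only pick up the sidecar after they are re-created:

```shell
# Enable Linkerd injection for the namespace
kubectl annotate namespace <instanceSlug> linkerd.io/inject=enabled --overwrite

# Restart workloads so pods are re-created with the proxy injected
kubectl rollout restart deployment -n <instanceSlug>
```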

Increasing Timeouts

If deployments time out during `helmfile sync`, increase the timeout in `defaults/helm-defaults.yaml`:

```yaml
helmDefaults:
  wait: true
  waitForJobs: true
  timeout: 600 # seconds (default: 300)
```