# Troubleshooting

## Quick Checks
Run these commands first to get an overview of the deployment state:
```bash
# Check all pods in the instance namespace
kubectl get pods -n <instanceSlug> -o wide

# Check all Helm releases
helm list -n <instanceSlug> -a

# Check the 30 most recent events
kubectl get events -n <instanceSlug> --sort-by='.lastTimestamp' | tail -30

# Check ingress resources
kubectl get ingress -n <instanceSlug>

# Helmfile status overview
helmfile -f deployment/helmfile.yaml status -e <environment>
```
## Where to Find Logs

| Component | How to Access Logs |
|---|---|
| Any pod | `kubectl logs <pod-name> -n <instanceSlug>` |
| Previous crashed pod | `kubectl logs <pod-name> -n <instanceSlug> --previous` |
| Multi-container pod | `kubectl logs <pod-name> -c <container-name> -n <instanceSlug>` |
| Linkerd sidecar | `kubectl logs <pod-name> -c linkerd-proxy -n <instanceSlug>` |
| All pods of a component | `kubectl logs -l app.kubernetes.io/name=<component> -n <instanceSlug>` |
| CloudNativePG operator | `kubectl logs -l app.kubernetes.io/name=cloudnative-pg -n <instanceSlug>` |
| Strimzi operator | `kubectl logs -l app.kubernetes.io/name=strimzi-kafka-operator -n <instanceSlug>` |
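To watch a component while reproducing an issue, you can stream logs from all of its replicas at once. This is a convenience sketch using the same label selector as the table above; `<component>` is a placeholder, and `--prefix` labels each line with the originating pod:

```bash
# Stream the last 100 lines from every replica of a component,
# prefixing each line with its pod and container name
kubectl logs -l app.kubernetes.io/name=<component> -n <instanceSlug> \
  --follow --tail=100 --timestamps --prefix
```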
## Where to Find Metrics

If `global.metrics.enabled` is set to `true`, metrics are exposed via:
| Source | Endpoint | Notes |
|---|---|---|
| PostgreSQL | PodMonitor (auto-discovered by Prometheus) | Enabled in the production profile |
| Etcd | ServiceMonitor at `/metrics` | Requires `metrics.enabled: true` in the etcd values |
| Kubernetes Metrics Server | `kubectl top pods -n <instanceSlug>` | Requires the Metrics Server to be installed in the cluster |
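If Prometheus is not picking up metrics, it is worth confirming that the monitor objects actually exist. This sketch assumes the Prometheus Operator CRDs (`PodMonitor`/`ServiceMonitor`) are installed in the cluster:

```bash
# List the scrape targets Prometheus would discover in this namespace
kubectl get podmonitor,servicemonitor -n <instanceSlug>

# Quick per-container resource snapshot (requires the Metrics Server)
kubectl top pods -n <instanceSlug> --containers
```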
See Monitoring & Logging for the full observability setup.
## Common Failure Modes

### Pods Not Starting

**Symptom:** Pods stuck in `Pending`, `CrashLoopBackOff`, or `ImagePullBackOff`.
**Diagnosis:**

```bash
# Check pod events
kubectl describe pod <pod-name> -n <instanceSlug>

# Check pod logs
kubectl logs <pod-name> -n <instanceSlug>

# Check events in the namespace
kubectl get events -n <instanceSlug> --sort-by='.lastTimestamp'
```
**Common causes:**

- **Pending**: Insufficient resources (CPU/memory). Scale your cluster or reduce resource requests.
- **ImagePullBackOff**: Container image not accessible. Check image names in `components/<name>/images.yaml` and ensure network/registry access.
- **CrashLoopBackOff**: Application error. Check logs for configuration issues or missing dependencies.

If you see a `keycloak-config-cli` pod stuck in `CrashLoopBackOff` with errors related to missing secrets, ensure that you added the secrets as described in the prerequisites documentation.
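When several pods are unhealthy at once, a one-liner that prints each pod's waiting reason gives a faster overview than describing pods one by one. This is a generic kubectl jsonpath sketch, not specific to this deployment:

```bash
# Print "<pod>  <waiting reason>" for every pod in the namespace,
# surfacing states such as ImagePullBackOff or CrashLoopBackOff
kubectl get pods -n <instanceSlug> -o jsonpath=\
'{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'
```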
### Helmfile Sync Fails

**Symptom:** `helmfile sync` or `helmfile apply` exits with an error, even though the Kubernetes cluster itself shows no abnormalities.
**Diagnosis:**

```bash
# Run with verbose output
helmfile -f deployment/helmfile.yaml sync -e local --debug

# Validate templates without deploying
helmfile -f deployment/helmfile.yaml lint -e local

# Render templates to check for errors
helmfile -f deployment/helmfile.yaml template -e local
```
**Common causes:**

- **Template rendering error**: Syntax error in `.gotmpl` files. Check the error message for the file and line number.
- **Timeout**: A release took too long to become ready. The default timeout is 300 seconds (configurable in `defaults/helm-defaults.yaml`). Increase it if needed.
- **CRD not installed**: Some components (PostgreSQL, Kafka) require CRDs from their operators. Ensure the operator part deploys before the cluster part.
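Before re-running a failed sync, it can help to preview what would change and to narrow the run to a single release. `helmfile diff` requires the helm-diff plugin, and `<release>` is a placeholder for one of your release names:

```bash
# Preview changes without applying them (needs the helm-diff plugin)
helmfile -f deployment/helmfile.yaml diff -e local

# Re-run only the failing release with verbose output
helmfile -f deployment/helmfile.yaml -l name=<release> sync -e local --debug
```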
### PostgreSQL (CloudNativePG) Issues

**Symptom:** PostgreSQL cluster pods are not ready, or dependent components fail to connect.
**Diagnosis:**

```bash
# Check CNPG cluster status
kubectl get cluster -n <instanceSlug>

# Check CNPG operator logs
kubectl logs -l app.kubernetes.io/name=cloudnative-pg -n <instanceSlug>

# Check PostgreSQL pod logs
kubectl logs <postgres-cluster-pod> -n <instanceSlug>

# Check database role setup job
kubectl get jobs -n <instanceSlug> | grep database
kubectl logs job/<database-role-setup-job> -n <instanceSlug>
```
**Common causes:**

- **PVC not bound**: StorageClass not available or insufficient capacity. Check `kubectl get pvc -n <instanceSlug>`.
- **Database role setup fails**: The role setup job may fail if the PostgreSQL cluster is not yet ready. The job retries automatically.
- **Connection refused**: Ensure the service name matches `postgres-cluster-rw` (the read-write service endpoint).
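To rule out application-side problems, you can test connectivity to the read-write service directly. This sketch assumes CloudNativePG stored application credentials in a secret named `postgres-cluster-app` (the operator's usual `<cluster>-app` convention) and uses the operator's default `app` user and database; adjust all names to your setup:

```bash
# Read the application user's password from the CNPG-generated secret (assumed name)
kubectl get secret postgres-cluster-app -n <instanceSlug> \
  -o jsonpath='{.data.password}' | base64 -d

# Open a throwaway psql session against the read-write endpoint
# (you will be prompted for the password printed above)
kubectl run psql-check -it --rm --restart=Never --image=postgres:16 \
  -n <instanceSlug> -- psql -h postgres-cluster-rw -U app -d app
```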
### Kafka (Strimzi) Issues

**Symptom:** Kafka pods not starting, or producers/consumers failing to connect.
**Diagnosis:**

```bash
# Check Kafka cluster status
kubectl get kafka -n <instanceSlug>

# Check Strimzi operator logs
kubectl logs -l app.kubernetes.io/name=strimzi-kafka-operator -n <instanceSlug>

# Check Kafka broker logs
kubectl logs <kafka-cluster-kafka-0> -n <instanceSlug>

# Verify bootstrap service
kubectl get svc kafka-cluster-kafka-bootstrap -n <instanceSlug>
```
**Common causes:**

- **Broker not ready**: The Strimzi operator may still be reconciling. Check the operator logs for progress.
- **Topic creation fails**: Ensure `auto.create.topics.enable` is `true`, or create topics via Strimzi `KafkaTopic` CRDs.
- **PVC issues**: Kafka requires persistent storage. Verify PVCs are bound.
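A quick way to verify that the brokers are actually serving requests is to list topics from inside a broker pod. This assumes the Strimzi image layout (`/opt/kafka/bin`) and a plain listener on port 9092; with TLS-only listeners you would need client certificates instead:

```bash
# List topics using the broker's own CLI tools
kubectl exec -n <instanceSlug> <kafka-cluster-kafka-0> -- \
  /opt/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
```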
### Keycloak Issues

**Symptom:** Keycloak not accessible, login fails, or dependent components cannot authenticate.
**Diagnosis:**

```bash
# Check the Keycloak pod
kubectl logs -l app.kubernetes.io/instance=keycloak-app -n <instanceSlug>

# Check the keycloak-config-cli job (configures realms, clients, etc.)
kubectl logs -l app.kubernetes.io/name=keycloak-config-cli -n <instanceSlug>

# Verify the admin secret exists
kubectl get secret keycloak-admin-user -n <instanceSlug>
```
**Common causes:**

- **keycloak-config-cli fails**: Missing `keycloak-smtp` secret. Create it as described in the prerequisites.
- **Database connection error**: PostgreSQL must be running and the `keycloak` database must exist. Check the database role setup job.
- **Ingress not working**: Verify DNS records point to the ingress controller and that the `idm.<domain>` subdomain is configured.
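From outside the cluster, the OIDC discovery document is a cheap end-to-end check of both the ingress and the realm configuration. The realm name is a placeholder here; use whichever realm keycloak-config-cli provisions:

```bash
# Should return a JSON document with issuer, token, and auth endpoints
curl -fsS https://idm.<domain>/realms/<realm>/.well-known/openid-configuration
```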
### APISIX / Etcd Issues

**Symptom:** API gateway not routing traffic, or APISIX pods in `CrashLoopBackOff`.
**Diagnosis:**

```bash
# Check APISIX pods
kubectl logs -l app.kubernetes.io/name=apisix -n <instanceSlug>

# Check the etcd cluster
kubectl logs -l app.kubernetes.io/name=etcd -n <instanceSlug>

# Verify etcd is healthy
kubectl exec -n <instanceSlug> etcd-0 -- etcdctl endpoint health
```
**Common causes:**

- **Etcd not ready**: APISIX depends on etcd. If etcd is not healthy, APISIX cannot store or retrieve route configurations.
- **RBAC misconfiguration**: APISIX connects to etcd with specific credentials. Check that the etcd user and role for APISIX exist.
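To dig into the RBAC side, you can inspect the etcd users and roles directly. If etcd authentication is enabled, `etcdctl` additionally needs credentials (e.g. `--user root:<password>`); the role name below is a placeholder:

```bash
# List etcd users, then inspect the role APISIX is expected to use
kubectl exec -n <instanceSlug> etcd-0 -- etcdctl user list
kubectl exec -n <instanceSlug> etcd-0 -- etcdctl role get <apisix-role>
```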
### Linkerd / Service Mesh Issues

**Symptom:** Inter-service communication fails, mTLS errors, or sidecars not injected.
**Diagnosis:**

```bash
# Check if Linkerd sidecars are injected
kubectl get pods -n <instanceSlug> -o custom-columns=NAME:.metadata.name,CONTAINERS:.spec.containers[*].name

# Check Linkerd proxy logs
kubectl logs <pod-name> -c linkerd-proxy -n <instanceSlug>

# Verify the namespace annotation
kubectl get namespace <instanceSlug> -o jsonpath='{.metadata.annotations.linkerd\.io/inject}'
```
**Common causes:**

- **Sidecars not injected**: The namespace is missing the `linkerd.io/inject: enabled` annotation. Set `global.serviceMesh.patchNamespaces: true` or annotate manually.
- **Init job failures with sidecars**: Ensure `proxy.nativeSidecar=true` is set in the Linkerd configuration for proper init container handling.
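If you have the `linkerd` CLI available locally, its built-in checks cover most data-plane problems in one command; the second command additionally requires the linkerd-viz extension to be installed:

```bash
# Validate the data-plane proxies in the namespace
linkerd check --proxy -n <instanceSlug>

# Live success rate and latency per pod (linkerd-viz extension)
linkerd viz stat pods -n <instanceSlug>
```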
## Increasing Timeouts

If deployments time out during `helmfile sync`, increase the timeout in `defaults/helm-defaults.yaml`:

```yaml
helmDefaults:
  wait: true
  waitForJobs: true
  timeout: 600 # seconds (default: 300)
```
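If only one slow release (for example a stateful component) keeps hitting the limit, helmfile also accepts a per-release `timeout` that overrides the default. The release name below is a hypothetical example:

```yaml
# Per-release override in deployment/helmfile.yaml (sketch)
releases:
  - name: kafka-cluster   # example release name
    timeout: 900          # overrides helmDefaults.timeout for this release only
```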