
Kubernetes Pod Failure Root Cause Analysis

A Kubernetes troubleshooting guide for systematically debugging Pods stuck in CrashLoopBackOff using kubectl, with fixes organized by root cause

Kubernetes pod failure, kubectl debugging, CrashLoopBackOff fix, Pod status analysis, k8s troubleshooting, ImagePullBackOff, OOMKilled, liveness probe, readiness probe, kubectl describe pod

Problem

After deploying a new version through the CI/CD pipeline, Pods are stuck in CrashLoopBackOff state and the service won't come up. The kubectl get pods output shows the RESTARTS count continuously increasing, while the previous version was working fine. A container that worked without issues in the development environment fails only in the Kubernetes cluster, making root cause identification difficult. Service downtime is growing, so rapid root cause analysis and recovery are needed.

Required Tools

kubectl

CLI tool for managing Kubernetes clusters. Diagnoses Pod status using commands like get, describe, logs, and exec.

Kubernetes Dashboard

A web-based Kubernetes UI. Allows visual inspection of Pod status, events, and resource usage.

stern

A tool that can tail logs from multiple Pods simultaneously. Monitors logs from all related Pods at once using the Deployment name.

kubectl debug

A debugging feature supported in Kubernetes 1.18+. Attaches an ephemeral debug container to a running Pod, or creates a disposable copy of the Pod for debugging.

Solution Steps

1

Check Pod status with kubectl get pods

First, assess the current state of all Pods. Check the STATUS column for abnormal states such as CrashLoopBackOff, ImagePullBackOff, Pending, or Error. A high RESTARTS count means the Pod is repeatedly restarting. Using the -o wide option also shows which node the Pod is placed on, helping rule out node-related issues.

# Check all Pod statuses in the namespace
kubectl get pods -n production

# Example output:
# NAME                        READY   STATUS             RESTARTS   AGE
# api-server-7d8f9b6c4-x2k9p  0/1    CrashLoopBackOff   8          12m
# api-server-7d8f9b6c4-m3j5q  0/1    CrashLoopBackOff   8          12m
# web-frontend-5c4d8f9-h7k2l  1/1    Running            0          3d

# Detailed info (including node and IP)
kubectl get pods -n production -o wide

# Filter by specific label
kubectl get pods -n production -l app=api-server

# Check abnormal Pods across all namespaces
kubectl get pods -A | grep -v Running | grep -v Completed

2

Analyze events and details with kubectl describe pod

The describe command shows all events from Pod creation to the present; the error messages in the Events section are the key clue.

- Failed to pull image: image tag error or registry authentication issue
- FailedScheduling: scheduling failed due to insufficient resources (CPU/memory)
- Unhealthy: liveness/readiness probe failure
- FailedMount: ConfigMap/Secret mount failure

The Exit Code in Last State under the Containers section is also an important clue.

# View detailed Pod information
kubectl describe pod api-server-7d8f9b6c4-x2k9p -n production

# Key checkpoints:
# 1) Events section (at the bottom)
#    - Warning  BackOff  pod/api-server...  Back-off restarting failed container
#    - Warning  Failed   pod/api-server...  Error: ImagePullBackOff
#
# 2) Containers section
#    - State:       Waiting (Reason: CrashLoopBackOff)
#    - Last State:  Terminated (Exit Code: 1, Reason: Error)
#    - Ready:       False
#
# 3) Conditions section
#    - Ready: False
#    - ContainersReady: False

# Also check Deployment-level events
kubectl describe deployment api-server -n production

# View recent events separately
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20
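Rather than scanning the full describe output, the exit code and termination reason can also be pulled directly with jsonpath. This is a minimal sketch reusing the example Pod name from this guide:

```shell
# Extract the last exit code and termination reason directly via jsonpath
kubectl get pod api-server-7d8f9b6c4-x2k9p -n production \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

kubectl get pod api-server-7d8f9b6c4-x2k9p -n production \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# Common values: Exit Code 1 (application error),
# 137 (SIGKILL, typically OOMKilled), 143 (SIGTERM)
```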

3

Check container error logs with kubectl logs

When a container restarts, kubectl logs shows the new instance, so you must use the --previous flag to see the crashed instance's logs. For multi-container Pods, use the -c option to specify a particular container. The stern tool can tail logs from all Pods belonging to a Deployment in real time, consolidated into one stream.

# Current container logs (up until the crash)
kubectl logs api-server-7d8f9b6c4-x2k9p -n production

# Previous (crashed) container logs (very important!)
kubectl logs api-server-7d8f9b6c4-x2k9p -n production --previous

# View only the last 100 lines
kubectl logs api-server-7d8f9b6c4-x2k9p -n production --previous --tail=100

# Specific container logs in a multi-container Pod
kubectl logs api-server-7d8f9b6c4-x2k9p -n production -c api-container

# Real-time log monitoring for entire Deployment with stern
stern api-server -n production --since 10m

# Common error patterns found in logs:
# - "Error: Cannot find module '/app/server.js'" -> Dockerfile path error
# - "ECONNREFUSED 127.0.0.1:5432" -> DB connection failure (check Service/ConfigMap)
# - "JavaScript heap out of memory" -> Insufficient memory limit

4

Inspect resource limits and requests

If OOMKilled (Exit Code 137) occurs, the memory limit is insufficient. Compare the container's actual memory usage with the configured limit to set an appropriate value. CPU throttling can cause response delays that fail the liveness probe, resulting in Pod restarts. Requests are the scheduling criterion and limits are the forced termination criterion, so set both values appropriately.

# Check real-time resource usage of Pods (requires metrics-server)
kubectl top pods -n production

# Example output:
# NAME                         CPU(cores)   MEMORY(bytes)
# api-server-7d8f9b6c4-x2k9p   450m         512Mi

# Check resource settings for the Deployment
kubectl get deployment api-server -n production -o yaml | grep -A 10 resources

# Example: Modify resource limits
kubectl edit deployment api-server -n production
# Or modify directly with kubectl patch:
kubectl patch deployment api-server -n production --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "1Gi"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "512Mi"}
]'

# Check available resources on the node (Allocatable vs Allocated)
kubectl describe node <node-name> | grep -A 5 "Allocated resources"

5

Verify and adjust liveness/readiness probes

Incorrect probe settings can cause even healthy containers to be judged unhealthy and restarted.

- liveness probe failure: kubelet restarts the container (a cause of CrashLoopBackOff)
- readiness probe failure: the Pod is removed from Service endpoints (traffic blocked)
- startup probe: guarantees startup time for containers with slow initialization

Set initialDelaySeconds generously, and tune timeoutSeconds and periodSeconds to match your application's characteristics.

# Check current probe settings
kubectl get deployment api-server -n production -o yaml | grep -A 15 'liveness\|readiness\|startup'

# Example of appropriate probe settings (deployment YAML)
# spec.template.spec.containers[0]:
#   livenessProbe:
#     httpGet:
#       path: /health
#       port: 3000
#     initialDelaySeconds: 30    # Account for app startup time
#     periodSeconds: 10          # Check every 10 seconds
#     timeoutSeconds: 5          # Must respond within 5 seconds
#     failureThreshold: 3        # Restart after 3 consecutive failures
#   readinessProbe:
#     httpGet:
#       path: /ready
#       port: 3000
#     initialDelaySeconds: 5
#     periodSeconds: 5
#     failureThreshold: 3
#   startupProbe:               # Recommended for apps with slow initialization
#     httpGet:
#       path: /health
#       port: 3000
#     failureThreshold: 30       # 30 x 10s = up to 5 minutes wait
#     periodSeconds: 10

# Temporarily remove probes to check if another cause is responsible
kubectl patch deployment api-server -n production --type='json' -p='[
  {"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}
]'
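To confirm whether probes are actually the trigger before removing them, the event stream can be filtered by reason. A sketch using the namespace from this guide (probe failures are recorded with reason Unhealthy):

```shell
# Watch probe failures as they happen (covers both liveness and readiness)
kubectl get events -n production --field-selector reason=Unhealthy --watch
```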

Core Code

kubectl debugging command checklist to execute in order during Pod failures. Run from step 1 sequentially; in emergency situations, perform step 7 rollback first.

# ========================================
# K8s Pod Troubleshooting Command Checklist
# ========================================

# 1. Check status
kubectl get pods -n production -o wide
kubectl get events -n production --sort-by='.lastTimestamp'

# 2. Detailed analysis
kubectl describe pod <pod-name> -n production

# 3. Check logs (including crashed containers)
kubectl logs <pod-name> -n production --previous --tail=200

# 4. Check resource usage
kubectl top pods -n production

# 5. Enter debug container directly (k8s 1.18+)
kubectl debug -it <pod-name> -n production --image=busybox --target=api-container

# 6. Test network/DNS with an ephemeral Pod
kubectl run debug-pod --rm -it --image=curlimages/curl -n production -- sh
# Inside: curl http://api-server-service:3000/health
#          nslookup api-server-service

# 7. Rollback to previous stable version (emergency)
kubectl rollout undo deployment/api-server -n production
kubectl rollout status deployment/api-server -n production
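If the immediately previous revision is also broken, rollout undo alone is not enough. One approach is to inspect the revision history and roll back to a known-good revision; the revision number below is illustrative:

```shell
# Inspect the Deployment's revision history
kubectl rollout history deployment/api-server -n production

# Roll back to a specific known-good revision (2 is an example)
kubectl rollout undo deployment/api-server -n production --to-revision=2
```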

Common Mistakes

Image tag error (using the latest tag or specifying a non-existent tag)

The latest tag may run an unexpected image because a node can serve a stale cached copy. Use specific semantic version tags like v1.2.3, and specify imagePullPolicy: IfNotPresent. For private registries, also verify the imagePullSecrets configuration.
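The fix can be applied without editing the full manifest. A sketch reusing this guide's Deployment/container names; the registry URL and secret name `regcred` are placeholders:

```shell
# Pin the image to an explicit version tag
kubectl set image deployment/api-server \
  api-container=registry.example.com/api-server:v1.2.3 -n production

# Verify which image the Pod actually resolved to
kubectl get pod <pod-name> -n production \
  -o jsonpath='{.spec.containers[0].image}'

# For private registries: confirm the pull secret exists and is referenced
kubectl get secret regcred -n production
kubectl get deployment api-server -n production \
  -o jsonpath='{.spec.template.spec.imagePullSecrets}'
```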

Environment variables are empty due to missing ConfigMap/Secret mounts

If you see "MountVolume.SetUp failed" or "couldn't find key" events in kubectl describe pod, check whether the ConfigMap/Secret exists in the namespace. Verify with kubectl get configmap -n production and kubectl get secret -n production.
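The checks above can be extended to verify the keys themselves, not just the objects' existence. A sketch where `app-config` and `app-secret` are placeholder names:

```shell
# Confirm the objects exist in the namespace
kubectl get configmap,secret -n production

# Inspect the keys of a ConfigMap ('app-config' is a placeholder name)
kubectl describe configmap app-config -n production

# List a Secret's key names without decoding the values
kubectl get secret app-secret -n production -o jsonpath='{.data}'

# Cross-check which ConfigMaps/Secrets the Deployment actually references
kubectl get deployment api-server -n production -o yaml \
  | grep -B 2 -A 3 'configMapKeyRef\|secretKeyRef\|configMap:\|secretName:'
```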

OOMKilled (Exit Code 137) due to insufficient resource limits

Check actual memory usage with kubectl top pods, and set limits.memory generously (1.5-2x the actual usage). For Java applications, also verify the relationship between JVM heap size (-Xmx) and the container memory limit.
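For JVM workloads, one approach is to let the JVM derive its heap from the container limit instead of hard-coding -Xmx, using the HotSpot MaxRAMPercentage flag (available since JDK 8u191/10). A sketch using this guide's Deployment name:

```shell
# Let the JVM size its heap from the cgroup memory limit
kubectl set env deployment/api-server -n production \
  JAVA_TOOL_OPTIONS='-XX:MaxRAMPercentage=75.0'

# Keep headroom: heap + metaspace + native memory must fit under the limit
kubectl patch deployment api-server -n production --type='json' -p='[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/resources/limits/memory",
   "value": "1Gi"}
]'
```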

Pod restarts before app startup because liveness probe initialDelaySeconds is too short

Measure the actual startup time of your application and set initialDelaySeconds generously. In Kubernetes 1.18+, using a startupProbe provides more flexibility for applications with variable startup times.
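A startupProbe can be added in place with a JSON patch, mirroring the example settings from step 5 (the /health path and port 3000 follow this guide's examples):

```shell
# Add a startupProbe without editing the full manifest
kubectl patch deployment api-server -n production --type='json' -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/startupProbe",
   "value": {"httpGet": {"path": "/health", "port": 3000},
             "failureThreshold": 30, "periodSeconds": 10}}
]'
# While the startupProbe is failing, liveness/readiness probes are disabled,
# giving the app up to 30 x 10s = 5 minutes to start
```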
