March 12, 2026 · 9 min read

How to Evaluate a Kubernetes Consultant: 15 Questions to Ask

Hiring a Kubernetes consultant is not like hiring a general cloud consultant. The skill set is specific — the right engineer can cut your K8s costs by 40% and eliminate most production incidents in 90 days. The wrong one will create a cluster configuration that looks sophisticated but breaks in ways you won’t understand, leaving you worse off than when you started.

This guide gives you 15 specific questions to evaluate any K8s consultant, along with what strong and weak answers look like, red flags to disqualify candidates, and how to structure the engagement for your situation.


Why K8s Consulting Is Different

Kubernetes expertise is a genuine specialization. An engineer can be excellent at general cloud architecture and mediocre at Kubernetes. The reasons:

K8s has sharp failure modes. Misconfigured probes cause cascading CrashLoopBackOff incidents. RBAC mistakes expose your entire cluster to a compromised pod. etcd corruption can bring down an entire cluster. These failure modes require specific knowledge to prevent and diagnose.

K8s best practices change with each major version. What was correct K8s configuration in 1.21 may be deprecated or dangerous in 1.29. A consultant who learned Kubernetes three years ago and hasn’t kept up is a liability.

The gap between “I know K8s” and “I operate K8s at scale” is enormous. Running a few deployments on a personal cluster is not the same as managing a 500-node multi-cluster production environment under SLA.


The 15 Questions

Technical Depth Questions

Question 1: “Walk me through how you’d debug a pod stuck in CrashLoopBackOff.”

Strong answer: Starts with kubectl describe pod to check events and exit codes. Checks kubectl logs --previous for the last crash output. Distinguishes between OOMKill (exit code 137), application error (non-zero exit code from app), and liveness probe failure (reports in events). Knows to check if it’s a startup issue vs runtime issue. Mentions checking resource limits if memory-related.

Weak answer: “I’d look at the logs.” (Generic, no specifics about what to look for or the debugging process.)
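
The debugging sequence a strong candidate describes can be sketched as a short command walkthrough (pod and namespace names here are placeholders):

```shell
# 1. Check events, restart count, and the last container state (exit code)
kubectl describe pod my-app-7d9f -n production

# 2. Read logs from the crashed container, not the freshly restarted one
kubectl logs my-app-7d9f -n production --previous

# 3. Interpret the exit code from the last state:
#    137          -> OOMKilled (compare usage against the memory limit)
#    non-zero app -> application error (read the stack trace in the logs)
#    events show "Liveness probe failed" -> the probe, not the app, kills it

# 4. If OOMKilled, inspect the configured requests and limits
kubectl get pod my-app-7d9f -n production \
  -o jsonpath='{.spec.containers[0].resources}'
```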


Question 2: “Explain the difference between a liveness probe and a readiness probe. When have you seen each one misconfigured in production?”

Strong answer: Clearly distinguishes the two (liveness restarts the container, readiness removes from service endpoints). Has specific examples: “We had a Java app where the liveness probe was set with failureThreshold: 1 and it was causing cascading restarts during garbage collection pauses.” Or: “Readiness probe wasn’t configured for a deployment, and pods were receiving traffic before the app finished loading its configuration.”

Weak answer: Can explain the conceptual difference from the docs but has no real-world examples of misconfiguration patterns.
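
The distinction is easiest to see in manifest form. A sketch with sane defaults (endpoint paths and timings are illustrative, not prescriptive):

```yaml
containers:
  - name: app
    # Liveness: the kubelet restarts the container when this fails.
    # failureThreshold: 1 is the classic misconfiguration; allow a few
    # consecutive failures so a GC pause doesn't trigger a restart.
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
    # Readiness: failing removes the pod from Service endpoints
    # (no restart), so traffic waits until startup work is done.
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
```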


Question 3: “How do you handle Kubernetes version upgrades for a production cluster?”

Strong answer: Describes a structured process: check the changelog for breaking changes and removed API versions, upgrade a test environment first, upgrade the control plane before the nodes (one minor version at a time), drain nodes one at a time during the node upgrade, and have a rollback plan. Knows that kubectl get all only reports the API versions currently being served, so finding deprecated APIs requires scanning manifests and Helm releases; mentions tools like Pluto or kubent (kube-no-trouble) for this.

Weak answer: “I’d follow the official documentation.” (No mention of real-world complications or risk mitigation.)
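
The pre-upgrade and node-rotation steps a strong candidate describes might look like the following sketch (cluster names, paths, and versions are placeholders; the managed-control-plane example assumes EKS):

```shell
# Scan Helm releases and local manifests for APIs removed in the target version
pluto detect-helm --target-versions k8s=v1.29.0
pluto detect-files -d ./manifests --target-versions k8s=v1.29.0

# Control plane first (managed example: EKS), then node groups
aws eks update-cluster-version --name staging --kubernetes-version 1.29

# Nodes one at a time: cordon, drain, upgrade, uncordon
kubectl cordon node-1
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
# ...upgrade the node, then:
kubectl uncordon node-1
```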


Question 4: “How would you right-size resource requests and limits for a production deployment you’ve never seen before?”

Strong answer: Runs VPA in Off mode (recommendations only), waits 24-48 hours for recommendations to stabilize, checks p99 CPU and memory usage against requests, sets requests near p50 and limits at p99 or higher, and monitors for OOMKills after the change. Mentions Goldilocks as a UI for viewing VPA recommendations. Explicitly says they wouldn’t just apply VPA Auto mode to production.

Weak answer: “I’d look at the metrics and set something reasonable.” (No specific methodology or tools mentioned.)
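
The VPA-in-recommendation-mode approach translates to a manifest like this (the Deployment name is a placeholder):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"   # recommend only; never evict or restart pods
```

After a day or two of traffic, the recommendations are readable with kubectl describe vpa my-app-vpa and can inform manual request/limit changes.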


Question 5: “Describe an RBAC configuration you’d implement for a multi-team cluster.”

Strong answer: Dedicated namespace per team, dedicated service accounts per workload (not default), LimitRange and ResourceQuota per namespace, Role + RoleBinding for namespace-scoped access (not ClusterRole by default), automated auditing with kubectl auth can-i --list. Mentions blocking cluster-admin for application service accounts.

Weak answer: “I’d create roles for each team.” (No specifics about service accounts, namespace isolation, or audit practices.)
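
A minimal per-team namespace setup along these lines might look like the following sketch (team names, groups, and quota values are illustrative):

```yaml
# Namespace-scoped read/write for team-a's own workloads only
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-a-developer
  namespace: team-a
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "deployments", "services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-developer-binding
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-devs
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-a-developer
  apiGroup: rbac.authorization.k8s.io
---
# Cap what the namespace can consume
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    pods: "100"
```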


Question 6: “What’s your approach to K8s cost optimization? What tools do you use?”

Strong answer: Mentions Kubecost for visibility, VPA for right-sizing, spot/preemptible nodes for tolerant workloads, ResourceQuota to prevent waste, namespace-level downscaling for dev/staging with tools like kube-downscaler. Gives specific numbers: “We typically find 30-40% of compute is unused on first audit.”

Weak answer: “I’d scale down unused resources.” (No tools, no specific methodology, no numbers.)
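
The dev/staging downscaling piece can be as simple as an annotation, assuming kube-downscaler is installed in the cluster (the workload name and schedule are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: staging-api          # hypothetical workload
  namespace: staging
  annotations:
    # kube-downscaler: scale down outside these hours
    downscaler/uptime: "Mon-Fri 08:00-19:00 Europe/Berlin"
spec:
  replicas: 2
  # ...rest of the Deployment spec unchanged
```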


Question 7: “How do you implement GitOps for Kubernetes? What tool would you recommend and why?”

Strong answer: Explains GitOps principles (Git as source of truth, pull-based reconciliation). Recommends ArgoCD as the current standard for most use cases, with specific mention of App of Apps pattern, ApplicationSets for multi-cluster. Would consider Flux for simpler setups or teams with strong Git workflow preferences. Can articulate specific trade-offs between ArgoCD and Flux.

Weak answer: “ArgoCD or Flux, whichever you prefer.” (No articulation of trade-offs or specific use cases for each.)
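
A minimal ArgoCD Application is the building block that both the App of Apps pattern and ApplicationSets generate; a sketch (repo URL, paths, and names are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/k8s-manifests.git
    targetRevision: main
    path: apps/my-app
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift back to the Git state
```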


Question 8: “How would you secure a K8s cluster before putting production workloads on it?”

Strong answer: Mentions CIS Benchmark as the baseline, kube-bench for assessment. Covers the specific items: Pod Security Standards enforcement, network policies with default deny, RBAC review, etcd encryption at rest, audit logging. Mentions admission controllers (Kyverno or OPA Gatekeeper) for policy enforcement. Has actually run kube-bench against clusters.

Weak answer: “I’d set up RBAC and network policies.” (Incomplete, no mention of benchmarking tools or specific controls.)
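
The “default deny” network posture mentioned above is one manifest per namespace; traffic is then re-allowed with explicit, narrower policies. A sketch (namespace name is a placeholder):

```yaml
# Deny all ingress and egress for every pod in this namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}        # empty selector matches all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
```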


Process and Methodology Questions

Question 9: “What does your engagement kickoff process look like?”

Strong answer: Describes a structured discovery process: cluster audit with kube-bench, resource utilization review, cost analysis with Kubecost, RBAC review, current deployment processes. Delivers a written findings report before making any changes. Sets clear scope and deliverables for the engagement.

Weak answer: “We’d start working on your cluster right away.” (No discovery, no audit before changes — red flag.)


Question 10: “How do you handle knowledge transfer at the end of an engagement?”

Strong answer: Mentions documentation of everything implemented (runbooks, architecture diagrams, decision logs), hands-on workshops with the internal team, documented processes for common operations, and optionally a short-term retainer for questions after handoff.

Weak answer: “We provide documentation at the end.” (Vague, no specifics about format or training.)


Question 11: “Have you worked with [our stack]? EKS? GitLab CI? Terraform?”

Strong answer: Honest about what they’ve worked with and what they haven’t. Describes how they’d approach unfamiliar tools (“I haven’t used GitLab’s K8s integration specifically, but I’ve worked extensively with GitHub Actions and the principles are the same — I’d review GitLab’s documentation and could have a working pipeline within a day”).

Weak answer: Claims expertise in everything without specifics, or waffles about whether they’ve actually used the tools in question.


Track Record Questions

Question 12: “Can you describe a production Kubernetes incident you’ve handled? What was the root cause, and what did you implement to prevent recurrence?”

Strong answer: Has a specific, detailed story. “We had an etcd disk saturation issue that caused API server timeouts during a high-deploy period. Root cause was too-frequent audit log rotation and several jobs generating excessive API calls. We implemented API call rate limiting per service account and moved audit logs to a separate volume.”

Weak answer: Generic incident description without specific root cause or preventive measures, or no incident experience at all (concerning for a claimed production consultant).


Question 13: “What Kubernetes certifications do you hold, and how recent are your hands-on cluster operations?”

Strong answer: CKA (Certified Kubernetes Administrator) at minimum. CKAD and CKS are strong additional signals. Can describe active, current cluster management — not a certification from 3 years ago with no recent hands-on work. Note: certifications are signals, not requirements — experienced practitioners without CKA exist, but the lack of any certification combined with vague answers about recent work is a red flag.

Weak answer: Certification from 4+ years ago with no mention of recent hands-on cluster operations, or dismissal of certifications without demonstrable deep experience.


Question 14: “Can you provide a case study or reference from a previous K8s engagement?”

Strong answer: Can provide at least one detailed case study: what the client situation was, what work was done, specific measurable outcomes (cost reduction percentage, incident frequency before/after, deployment frequency improvement). Can provide a reference contact.

Weak answer: Only vague descriptions of past work, no specific outcomes, no references available.


Commercial Questions

Question 15: “How do you structure your engagements? T&M, fixed-scope, or retainer?”

Strong answer: Can clearly describe different engagement models and when each is appropriate:

  • T&M (Time & Materials): best for discovery and advisory work where scope is unknown
  • Fixed-scope: best for defined deliverables (e.g., “implement GitOps with ArgoCD across 3 clusters”)
  • Retainer (Managed Ops): best for ongoing operations, incident response, and continuous optimization

Red flag: only offers one model and insists on it for all engagements regardless of scope.


Red Flags

Vague answers without specifics. A consultant who says “I’d look at the metrics” or “it depends” without being able to articulate specific tools, commands, or methodology has not operated production Kubernetes. Technical consulting requires demonstrable specifics.

No real-world incident experience. Anyone claiming to be a production K8s consultant who can’t describe a specific incident they’ve handled in detail has not operated clusters under real load. Everyone who runs production K8s has incident stories.

No mention of runbooks or documentation. Consultants who work without documentation create a dependency on themselves. This is either incompetence or a deliberate strategy to ensure repeat work.

Promises with no caveats. “I can cut your K8s costs by 60% in 30 days” without understanding your current setup. Legitimate consultants are specific about what they need to assess before making commitments.

One-size-fits-all tooling recommendations. Recommending the same tool stack for every client regardless of context suggests limited experience. A consultant who always recommends Istio for service mesh (regardless of cluster size or team experience) may not have experience with the range of real-world K8s implementations.


Green Flags

  • Names specific tools unprompted (Kubecost, Goldilocks, kube-bench, ArgoCD, Kyverno) and can articulate why they chose each
  • Has a documented methodology — a repeatable process for K8s assessments, not a custom approach for every engagement
  • Asks questions about your requirements before proposing solutions
  • Is honest about what they don’t know and how they’d learn it
  • Has CKA certification and recent hands-on K8s operations (within the last 12 months)
  • Can share a reference case study with specific, measurable outcomes

Talk to Our Team

Ready to evaluate whether our team is the right fit for your cluster?

→ Talk to our certified K8s experts at kubernetes.ae — we’ll answer all 15 of these questions on our first call.
