
A team tells me they’re running Kubernetes in production. I ask to see it. Often what I find is a cluster. Pods are running, an ingress responds, the demo works. That’s a cluster. It isn’t a platform, and the gap between the two is where the 3am pages live.
A production-ready platform clears a higher bar. Here’s the bar I hold work to, and the things I check when I audit a cluster someone is about to bet a business on.
The cluster itself is defined in code. Terraform, Cluster API, or the managed equivalent. You can destroy it and recreate it from a repository without anyone remembering the magic clicks. Click-ops clusters can’t be reliably rebuilt, can’t be reviewed, and drift the moment two people touch them. If your cluster only exists because of an afternoon someone spent in the portal, you don’t have a platform. You have a pet.
Application state lives in Git, and a controller reconciles the cluster to match it. Argo CD or Flux. The benefit isn’t fashion. It’s that the desired state is auditable, revertible, and the same whether a human or a pipeline triggered it. Running kubectl apply from a laptop leaves no audit trail and no rollback story. When something breaks at 3am, you want to revert a commit, not reconstruct what the last person ran.
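To make the reconcile idea concrete, here is a toy sketch of the comparison a GitOps controller performs. It is not how Argo CD or Flux are implemented. The `plan_reconcile` function and the "kind/namespace/name" key format are my own illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical model: each object is keyed by "kind/namespace/name"
# and its manifest is a plain dict.
Manifest = Dict[str, object]
State = Dict[str, Manifest]


def plan_reconcile(desired: State, live: State) -> List[Tuple[str, str]]:
    """Return the (action, key) pairs needed to make `live` match `desired`."""
    actions: List[Tuple[str, str]] = []
    # Anything in Git but missing or different in the cluster gets applied.
    for key in sorted(desired):
        if key not in live:
            actions.append(("create", key))
        elif live[key] != desired[key]:
            actions.append(("update", key))
    # Anything in the cluster but not in Git is drift and gets pruned.
    for key in sorted(live):
        if key not in desired:
            actions.append(("prune", key))
    return actions


if __name__ == "__main__":
    desired = {
        "Deployment/shop/web": {"replicas": 3},
        "Service/shop/web": {"port": 80},
    }
    live = {
        "Deployment/shop/web": {"replicas": 5},  # someone scaled by hand
        "ConfigMap/shop/old": {},                # leftover nobody committed
    }
    for action, key in plan_reconcile(desired, live):
        print(action, key)
```

The point of the sketch is that the hand-scaled Deployment and the uncommitted ConfigMap both show up as drift. Reverting a commit changes `desired`, and the same loop walks the cluster back.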
External Secrets Operator pulling from Azure Key Vault, or the equivalent. Secrets are referenced, rotated centrally, and never committed. Sealed Secrets is fine for a small cluster, but it doesn’t scale: you end up with encrypted blobs in Git that nobody can rotate without a re-seal dance, and the key management becomes its own problem. At platform scale, sync from a managed vault.
cert-manager issues and renews TLS, from Let’s Encrypt for public endpoints or an internal CA for east-west traffic. The test is simple. If a certificate expiring would cause an outage, the platform isn’t ready. Expiry should be a non-event because renewal is automatic and monitored.
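The monitoring half of that test can be sketched in a few lines. This is a minimal illustration, not cert-manager behaviour: the function names and the 21-day `warn_days` threshold are assumptions chosen for the example, and in practice the expiry timestamp would come from whatever you already scrape certificates with.

```python
from datetime import datetime, timedelta, timezone


def days_until_expiry(not_after: datetime, now: datetime) -> int:
    """Whole days remaining before expiry (negative once expired)."""
    return (not_after - now).days


def renewal_status(not_after: datetime, now: datetime, warn_days: int = 21) -> str:
    """Classify a certificate by how close it is to expiring.

    warn_days is a hypothetical threshold: if automatic renewal works,
    a certificate should never get this close to its expiry date.
    """
    remaining = days_until_expiry(not_after, now)
    if remaining < 0:
        return "expired"
    if remaining < warn_days:
        return "renewal-overdue"
    return "ok"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(renewal_status(now + timedelta(days=60), now))
    print(renewal_status(now + timedelta(days=5), now))
```

An alert on "renewal-overdue" is the useful one. It fires while the certificate still works, which is what makes expiry a non-event.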
A single ingress controller, one routing convention, predictable hostnames. The failure mode I see is three ingress controllers nobody chose, installed by three different Helm charts, fighting over the same traffic. Pick one. Make routing boring and predictable. Boring ingress is good ingress.
Separate teams and environments get namespaces with network policies and resource quotas, not a brand new cluster each. Spinning up a cluster per team feels isolated and quietly becomes a fleet nobody can patch, upgrade, or afford. Network policies enforce the boundaries between namespaces. A cluster per tenant is a last resort for a hard isolation requirement, not a default.
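As a rough sketch of what a tenant namespace carries, the helper below builds the three objects as plain dictionaries: the Namespace, a default-deny NetworkPolicy, and a ResourceQuota. The `tenant_manifests` function, the object names, and the quota figures are hypothetical. The resource kinds and field names follow the standard Kubernetes API.

```python
import json
from typing import Any, Dict, List


def tenant_manifests(namespace: str, cpu: str, memory: str, max_pods: int) -> List[Dict[str, Any]]:
    """Build baseline objects for one tenant namespace (illustrative only)."""
    ns = {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": namespace}}
    # An empty podSelector matches every pod in the namespace; listing both
    # policy types with no rules denies all ingress and egress by default.
    default_deny = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }
    # Caps the total resources the namespace can request.
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "tenant-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": cpu,
                "requests.memory": memory,
                "pods": str(max_pods),
            }
        },
    }
    return [ns, default_deny, quota]


if __name__ == "__main__":
    for manifest in tenant_manifests("team-payments", "8", "16Gi", 50):
        print(json.dumps(manifest, indent=2))
```

A default-deny policy also blocks DNS lookups, so a real tenant needs explicit allow rules layered on top. The value of starting from deny is that every permitted path is written down in a reviewable place.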
Metrics in Prometheus, logs aggregated somewhere you can query them, dashboards that show the state of the platform at a glance. Traces are increasingly standard and worth adding for any service mesh or chatty microservice estate. The test: when a service degrades, can you tell why in minutes from the data you already collect? If diagnosis means SSHing to nodes and reading raw logs, observability isn’t done.
Velero or similar for cluster resources, plus a real backup of persistent volume data. Cluster state lives in etcd; your PVs hold the data that matters. Both need backing up, and the restore needs testing. A backup nobody has restored is a guess.
Admission control rejecting non-compliant workloads, image scanning blocking known-vulnerable images, and runtime detection watching for the things that slip through. A cluster that will schedule any image from anywhere with no policy is one bad pull from a problem. Pod Security Standards, a policy engine, and a scanner in the pipeline are table stakes.
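The kind of rule an admission policy enforces looks roughly like the sketch below. In a real cluster this logic lives in a policy engine or a validating webhook, not in a standalone script. The `violations` function, the three rules, and the `registry.example.com` allowlist are assumptions for illustration.

```python
from typing import Any, Dict, List

# Hypothetical allowlist; substitute the registries you actually trust.
ALLOWED_REGISTRIES = ("registry.example.com/",)


def violations(pod: Dict[str, Any]) -> List[str]:
    """Return the reasons a pod spec would be rejected (empty list = admit)."""
    problems: List[str] = []
    for container in pod.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")
        image = container.get("image", "")
        if not image.startswith(ALLOWED_REGISTRIES):
            problems.append(f"{name}: image not from an allowed registry")
        # Look only at the final path segment so a registry port
        # (host:5000/app) is not mistaken for a tag.
        last = image.rsplit("/", 1)[-1]
        if "@" not in last and (":" not in last or last.endswith(":latest")):
            problems.append(f"{name}: image tag is missing or 'latest'")
        if container.get("securityContext", {}).get("privileged", False):
            problems.append(f"{name}: privileged containers are not allowed")
    return problems


if __name__ == "__main__":
    pod = {
        "spec": {
            "containers": [
                {
                    "name": "web",
                    "image": "docker.io/library/nginx:latest",
                    "securityContext": {"privileged": True},
                }
            ]
        }
    }
    for problem in violations(pod):
        print("DENY", problem)
```

Each rule is small and boring on purpose. The protection comes from the cluster refusing the workload before it is scheduled, not from someone noticing it afterwards.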
You can see what each namespace or team is spending. Without it, the cluster grows, the bill grows, and nobody can attribute either. Per-namespace cost visibility turns “the cluster is expensive” into “this workload is expensive,” which is a problem someone can actually own.
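The attribution itself is simple arithmetic once usage is labelled by namespace. Here is a minimal sketch, assuming you can already export CPU core-hours and memory GiB-hours per namespace. The `cost_by_namespace` function, the input shape, and the hourly rates are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (namespace, cpu_core_hours, memory_gib_hours): a hypothetical export format.
UsageRow = Tuple[str, float, float]


def cost_by_namespace(usage: Iterable[UsageRow], cpu_rate: float, mem_rate: float) -> Dict[str, float]:
    """Sum cost per namespace from usage rows and per-unit hourly rates."""
    totals: Dict[str, float] = defaultdict(float)
    for namespace, cpu_hours, mem_hours in usage:
        totals[namespace] += cpu_hours * cpu_rate + mem_hours * mem_rate
    # Round for reporting; keep raw figures elsewhere if you need precision.
    return {ns: round(total, 2) for ns, total in totals.items()}


if __name__ == "__main__":
    rows = [
        ("shop", 100.0, 200.0),
        ("shop", 50.0, 0.0),
        ("batch", 10.0, 40.0),
    ]
    report = cost_by_namespace(rows, cpu_rate=0.04, mem_rate=0.005)
    for namespace, cost in sorted(report.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{namespace}: {cost:.2f}")
```

Sorting the report by cost puts the most expensive namespace at the top, which is usually where the first conversation needs to happen.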
There’s an on-call rotation and written runbooks for the failures you can anticipate. A platform with no defined owner at 3am isn’t production. It’s a liability waiting for its first real incident. The runbook for “ingress is down” should exist before the night it’s needed.
Most clusters we audit that are described as production fail three or four of these. No GitOps, so deployments are untracked. Sealed Secrets straining at scale. No tested restore. No on-call. Each gap is survivable until the day it isn’t.
That’s not production. That’s a cluster doing production work until something exposes the difference.
The gaps rarely surface one at a time. They compound. The deploy that broke things can’t be reverted because there’s no GitOps history, the data you’d restore from was never backed up, and the person who could fix it isn’t on a rotation because there isn’t one. A single incident walks straight through three missing controls at once. Production-readiness isn’t a checklist you pass for a certificate. It’s the set of properties that decide whether your worst night is a twenty-minute revert or a multi-day rebuild.
If you’re standing up a Kubernetes platform, or you suspect the one you have wouldn’t survive a real incident, our DevOps team builds and audits platforms against exactly this bar. Book a consultation and we’ll walk through your setup and tell you, honestly, which of these you’re missing.