Kubernetes Rolling Updates and Hidden Capacity Costs

The Kubernetes documentation describes rolling updates in one sentence: “Rolling updates allow Deployments’ update to take place with zero downtime by incrementally updating Pods instances with new ones.”

The mechanism itself is simple. Kubernetes gradually creates Pods running the new version of the application and terminates Pods running the previous one. Nothing is stopped all at once; instances are replaced step by step.

“Zero downtime”, however, is not something a rolling update gives us for free. During the rollout, old and new versions of the application run at the same time, which means the two versions must be able to coexist safely. If the new version requires a database migration, that migration has to be backward compatible. Otherwise the new Pods may work correctly while the old Pods start failing, or the old Pods may keep serving traffic based on assumptions that are no longer valid. In that situation the rollout strategy is not the problem; the application is simply not safe to run in a mixed-version state.

This article focuses on the part of rolling updates that is easy to underestimate in production: how maxUnavailable, maxSurge, terminating Pods, and cluster capacity interact during a real deployment.

Traffic routing during the rollout is handled by the Service. A Service does not care whether a Pod is old or new; it only cares whether the Pod matches its selector and is ready to receive traffic. With proper readiness probes, Kubernetes avoids sending requests to Pods that are still starting or unhealthy.

Up to this point, the behavior is predictable: create new Pods, wait until they are ready, shift traffic, remove old Pods. The complications begin with the two parameters that control this process: maxUnavailable and maxSurge. They decide how aggressive or conservative the rollout will be, how many Pods can be unavailable, how many extra Pods can exist temporarily, and how much spare capacity the cluster needs while all of this happens.

Why maxUnavailable and maxSurge matter

When a rolling update starts, Kubernetes has to answer two questions:

How many existing Pods can be taken down during the update?
How many extra new Pods can be created before the old ones are removed?

maxUnavailable answers the first question. It defines how many Pods are allowed to be unavailable during the rollout. If the value is too high, Kubernetes may remove too much capacity at once, leaving the application without enough ready Pods to handle traffic.

maxSurge answers the second. It defines how many extra Pods Kubernetes may create temporarily during the rollout. A higher surge lets new Pods come up before old ones are removed, which is usually safer from an availability perspective — but the cluster must have enough CPU, memory, and scheduling capacity for those temporary Pods.

These two values are often treated as rollout-speed settings only. They are more than that: they directly affect availability, resource usage, and how safe the deployment is.

Why percentages become a problem at high replica counts

The default values look harmless:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 25%
    maxSurge: 25%

Both percentages are calculated from the desired replica count; maxUnavailable rounds down and maxSurge rounds up. For a 10-replica Deployment, 25% means up to 2 unavailable Pods and up to 3 surge Pods. For a 150-replica Deployment, the same line in YAML means something very different:

maxUnavailable: 25%  -> up to 37 Pods may be unavailable
maxSurge: 25%        -> up to 38 extra Pods may be created
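The rounding can be checked with a few lines of Python. This is only a sketch of the arithmetic, not the controller's actual code, and the helper name resolve_percent is made up for this example.

```python
def resolve_percent(replicas: int, percent: int, round_up: bool) -> int:
    """Translate a percentage of the desired replica count into a Pod count."""
    scaled = replicas * percent      # integer math avoids float rounding issues
    if round_up:
        return -(-scaled // 100)     # ceiling division, as for maxSurge
    return scaled // 100             # floor division, as for maxUnavailable


for replicas in (10, 150):
    max_unavailable = resolve_percent(replicas, 25, round_up=False)
    max_surge = resolve_percent(replicas, 25, round_up=True)
    print(f"{replicas} replicas: maxUnavailable={max_unavailable}, maxSurge={max_surge}")
```

For 10 replicas this gives 2 and 3; for 150 replicas it gives 37 and 38.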

The intuitive upper bound for the Pod count during the rollout is:

150 + 38 = 188 Pods

That is not the full picture. Kubernetes creates new Pods and terminates old Pods at the same time, and a Pod that is being terminated does not disappear immediately. A Pod marked for deletion stays in the Terminating state until its containers exit or its terminationGracePeriodSeconds expires, whichever comes first. From the Deployment controller’s point of view it no longer counts, so a replacement can be created right away, but the old Pod still exists on the node and still consumes resources.

The real temporary Pod count is therefore closer to:

desired replicas + maxSurge + terminating Pods

If Kubernetes has already marked 37 old Pods for termination while surge Pods are being created, the same Deployment can briefly look like this:

150 + 38 + 37 = 225 Pods

Not all of these Pods are available or receiving traffic. But from a resource perspective they still matter: terminating Pods keep consuming CPU, memory, sidecar resources, node capacity, and Pod IPs until they fully exit.

This is why replicas + maxSurge is not a reliable upper bound for resource consumption during a rolling update. With 150 replicas, a value like 25% can produce dozens of surge Pods and dozens of terminating Pods at the same time. If the application also has a long shutdown period, heavy sidecars, slow connection draining, or large resource requests, a single rollout can put far more pressure on the cluster than the manifest suggests.
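The two bounds can be put side by side in a short Python sketch. The function name rollout_pod_counts is hypothetical, and the number of terminating Pods is an input taken from the example above, not something Kubernetes guarantees as a limit.

```python
def rollout_pod_counts(replicas: int, max_surge: int, terminating: int) -> dict:
    """Compare the bound the controller enforces with what the nodes may hold."""
    controller_bound = replicas + max_surge        # Pods the controller counts
    return {
        "controller_bound": controller_bound,
        # Terminating Pods are not counted by the controller but still run on nodes.
        "pods_on_nodes": controller_bound + terminating,
    }


counts = rollout_pod_counts(replicas=150, max_surge=38, terminating=37)
print(counts)
```

With the article's numbers, the controller bound is 188 while the nodes may briefly host 225 Pods.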

What this affects in production

The extra Pods are not only a number in kubectl get pods. They influence several parts of the deployment process and the cluster itself.

1. Cluster resource capacity

New Pods need CPU, memory, ephemeral storage, Pod IPs, and volumes. At the same time, old terminating Pods may still hold their resources until the grace period ends. The temporary usage follows replicas + maxSurge + terminating Pods, not replicas + maxSurge. For a 150-replica Deployment with large resource requests or heavy sidecars, the difference is significant.
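To make the difference concrete, the sketch below multiplies both Pod counts by a per-Pod resource request. The request values (500m CPU, 1 GiB memory) are assumptions chosen for illustration, not figures from a real workload; the Pod counts are the ones from the 150-replica example.

```python
from dataclasses import dataclass


@dataclass
class PodRequests:
    cpu_millicores: int
    memory_mib: int


def total_requests(pod_count: int, per_pod: PodRequests) -> PodRequests:
    """Sum the resource requests of pod_count identical Pods."""
    return PodRequests(
        cpu_millicores=pod_count * per_pod.cpu_millicores,
        memory_mib=pod_count * per_pod.memory_mib,
    )


per_pod = PodRequests(cpu_millicores=500, memory_mib=1024)  # hypothetical requests

for label, pods in (("replicas + maxSurge", 188), ("plus terminating Pods", 225)):
    total = total_requests(pods, per_pod)
    print(f"{label}: {pods} Pods, "
          f"{total.cpu_millicores / 1000:.1f} CPU, {total.memory_mib / 1024:.0f} GiB")
```

Under these assumptions the gap between the two estimates is 18.5 CPU cores and 37 GiB of requested memory, which is capacity the cluster has to find somewhere.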

2. Scheduling pressure

Even when Kubernetes wants to create new Pods, they still have to be scheduled onto Nodes. Without enough free capacity, they stay in Pending. The cause can be a CPU or memory shortage, but also node selectors, affinity rules, topology spread constraints, taints and tolerations, volume attachment limits, or namespace quotas. In a small Deployment this delays a few Pods; in a high-replica Deployment it can delay dozens at once.

3. Cluster autoscaler behavior

A large surge may not fit on the existing Nodes, in which case the cluster autoscaler has to add new ones first. The rollout is then waiting not only for the application to start, but for infrastructure to scale: cloud provisioning time, node bootstrap, image pulls, CNI setup, and node readiness. If the cluster is limited by cloud quotas, node group limits, or available IP addresses, the rollout can slow down or get stuck entirely.
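A rough way to see when a surge forces a scale-up is to compare the surge with the free room on existing Nodes. The sketch below is a deliberate simplification: it assumes identical Pods, treats capacity as whole "slots" per Node, and ignores affinity, taints, and quotas. The function name and the input numbers are hypothetical.

```python
def extra_nodes_needed(pods_to_place: int, free_slots: int, pods_per_node: int) -> int:
    """Estimate how many new Nodes are needed for Pods that do not fit today."""
    if pods_per_node <= 0:
        raise ValueError("pods_per_node must be positive")
    overflow = max(0, pods_to_place - free_slots)   # Pods with nowhere to go
    return -(-overflow // pods_per_node)            # ceiling division


# 38 surge Pods, room for 10 on existing Nodes, 8 Pods per new Node (assumed values)
print(extra_nodes_needed(pods_to_place=38, free_slots=10, pods_per_node=8))
```

With these inputs, 28 Pods do not fit and four new Nodes are needed before the rollout can proceed at full surge.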

4. Rollout duration

An aggressive configuration makes the rollout faster only when the cluster has capacity and the application starts quickly. When it does not, the same configuration makes the rollout slower. New Pods wait for scheduling, image pulls, readiness probes, or minReadySeconds; old Pods take time to exit because of terminationGracePeriodSeconds, preStop hooks, sidecar shutdown, or connection draining. Rollout duration depends as much on the cluster and the application as on the Deployment strategy.

5. CI/CD pipeline duration

Most pipelines wait for the rollout to finish:

kubectl rollout status deployment/<deployment-name>

Anything that slows the rollout (pending Pods, node scale-up, image pulls, slow termination) also slows the pipeline. Sometimes the application is already healthy, but the pipeline keeps waiting because the Deployment has not reached the completed state yet. In worse cases, the rollout outlasts the pipeline timeout or stalls for longer than progressDeadlineSeconds, and the deployment step fails even though nothing is actually broken. Rollout settings are part of pipeline design, not only manifest design.
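One way to make the wait explicit in a pipeline is to pass a timeout to kubectl rollout status and treat a non-zero exit code as "not finished in time" instead of "broken". The Python sketch below assumes kubectl is on the PATH and already configured for the target cluster; the function names are illustrative, and the deployment and namespace names in the usage comment are made up.

```python
import subprocess


def rollout_status_command(deployment: str, namespace: str, timeout_seconds: int) -> list:
    """Build the kubectl command that waits for a Deployment rollout."""
    return [
        "kubectl", "rollout", "status", f"deployment/{deployment}",
        "--namespace", namespace,
        f"--timeout={timeout_seconds}s",
    ]


def wait_for_rollout(deployment: str, namespace: str, timeout_seconds: int) -> bool:
    """Return True if the rollout completed before the timeout."""
    command = rollout_status_command(deployment, namespace, timeout_seconds)
    # check=False: a timeout or a failed rollout is reported via the return code.
    result = subprocess.run(command, check=False)
    return result.returncode == 0


# Example call (hypothetical names):
# finished = wait_for_rollout("checkout", "shop", timeout_seconds=900)
```

Choosing the timeout then becomes a conscious decision: it has to cover scheduling, possible node scale-up, image pulls, and readiness for the number of Pods the rollout replaces.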

6. Application availability and traffic load

maxUnavailable defines how much serving capacity can disappear during the rollout. If 37 of 150 Pods are allowed to be unavailable, the remaining ready Pods absorb their traffic. With enough headroom this is fine. If the application already runs close to its limits, removing dozens of ready Pods increases latency, error rates, CPU usage, and queue depth. maxSurge helps by bringing capacity up before taking it down — but only if the new Pods become ready quickly and the cluster can actually run them. The question is not whether Kubernetes keeps enough Pods nominally available; it is whether the remaining ready Pods can safely handle production traffic while the rollout is in progress.
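The effect on the remaining Pods can be estimated with simple arithmetic. The sketch below assumes traffic is spread evenly across ready Pods and that no surge Pods are ready yet, which is the pessimistic case; the baseline utilization values are hypothetical.

```python
def load_multiplier(replicas: int, unavailable: int) -> float:
    """Factor by which per-Pod load grows when some Pods stop serving."""
    ready = replicas - unavailable
    if ready <= 0:
        raise ValueError("no ready Pods left to serve traffic")
    return replicas / ready


multiplier = load_multiplier(replicas=150, unavailable=37)
print(f"per-Pod load multiplier: {multiplier:.2f}")

for baseline in (0.50, 0.80):  # assumed utilization before the rollout
    print(f"baseline {baseline:.0%} -> {baseline * multiplier:.0%} of per-Pod capacity")
```

With 113 of 150 Pods left, each one handles roughly a third more traffic. A Pod that normally runs at 50% stays comfortable; one that runs at 80% is pushed past its capacity.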

7. Shutdown and connection draining

Terminating Pods matter most for applications with long-running requests, persistent connections, message consumers, or graceful shutdown logic. A Pod marked for deletion may still be finishing requests, draining connections, committing offsets, closing database connections, or running a preStop hook. This is exactly what graceful shutdown is for — but it also means the Pod keeps consuming resources while its replacement is already being created. In a high-replica Deployment, many Pods can be in this state at the same time.

8. Cost and temporary over-provisioning

If the rollout forces the autoscaler to add Nodes, those Nodes may stay alive for some time after the rollout finishes. For small services this is negligible. For high-replica Deployments with large Pods, heavy sidecars, or expensive node types, a single rollout can produce a visible, if temporary, increase in infrastructure cost.

9. Operational visibility and troubleshooting

Engineers who expect the Pod count to stay under replicas + maxSurge will see more Pods than that during the rollout, which leads to familiar questions: Why are there more Pods than expected? Why is the cluster scaling up during a routine deployment? Why is the pipeline still waiting? Why are new Pods pending? Why did the Deployment exceed its progress deadline? In most cases the answer is that a rolling update involves more than desired replicas and surge replicas. Terminating Pods, scheduling capacity, readiness behavior, shutdown time, autoscaling, and pipeline timeouts are all part of the real rollout behavior.
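When these questions come up, it helps to count Pods by state instead of reading the raw list. The sketch below classifies Pods from the JSON that kubectl get pods -o json returns: a Pod with metadata.deletionTimestamp set is terminating, and every other Pod is grouped by status.phase. The function name and the sample data are made up for illustration.

```python
import json
from collections import Counter


def classify_pods(pod_list: dict) -> Counter:
    """Count Pods by phase, treating Pods marked for deletion as Terminating."""
    counts = Counter()
    for pod in pod_list.get("items", []):
        if pod.get("metadata", {}).get("deletionTimestamp"):
            counts["Terminating"] += 1
        else:
            counts[pod.get("status", {}).get("phase", "Unknown")] += 1
    return counts


# Hypothetical sample in the shape of `kubectl get pods -o json` output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "web-new-1"}, "status": {"phase": "Running"}},
  {"metadata": {"name": "web-new-2"}, "status": {"phase": "Pending"}},
  {"metadata": {"name": "web-old-1", "deletionTimestamp": "2024-01-01T00:00:00Z"},
   "status": {"phase": "Running"}}
]}
""")

print(dict(classify_pods(sample)))
```

In a real rollout the input would come from kubectl filtered by the Deployment's label selector, and the Terminating and Pending counts are usually the first numbers worth looking at.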

Conclusion

For high-replica Deployments, maxUnavailable and maxSurge should never be reviewed as percentages alone. Translate them into actual Pod counts, include terminating Pods in the calculation, and consider the temporary resource impact on the cluster before the rollout starts — not while debugging it.