Application Scalability, Part 2: Kubernetes

Tomasz Urbaszek
Feb 7, 2021


Illustration by Alejandra Ramos

These days, for many, Kubernetes is nearly synonymous with cloud-native. In many cases, this means abandoning on-premises infrastructure to gain the advantages of infrastructure-as-a-service. One of the most interesting advantages of such an approach is the ease of scalability and the accompanying cost optimization. In the previous article, I covered the basic concepts of application scaling. This time, we will dive deep into Kubernetes’ autoscaling capabilities, with a special focus on the horizontal type.

Vertical Pod Autoscaler

When we talk about autoscaling in Kubernetes, we generally mean scaling pods. Kubernetes offers a mechanism for automated vertical scaling of pods: the Vertical Pod Autoscaler (VPA). If you are not familiar with the idea of a pod, you can simply think of it as a single VM running a Docker container with your application.
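To make the idea of a pod and its resources a bit more concrete, here is a minimal sketch of a pod definition. The pod name, image, and resource numbers are made up for illustration; the important part is the resources section, because requests are what vertical scaling adjusts.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod                    # hypothetical name
spec:
  containers:
  - name: my-app
    image: example.com/my-app:1.0     # hypothetical image
    resources:
      requests:                       # what the scheduler reserves for the container
        cpu: 250m
        memory: 256Mi
      limits:                         # upper bound the container may use
        cpu: 500m
        memory: 512Mi
```

In practice you rarely create bare pods like this; a deployment usually embeds the same container spec in its pod template.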

What VPA does is scale up your existing pods if they need more resources (CPU or memory) and scale them down if they are over-requesting resources. All of this is done automagically by Kubernetes. Deploying VPA is as simple as applying this short custom resource to your cluster:

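apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  # The workload whose pods VPA should resize
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"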

As you can see, all you have to do is specify the deployment that VPA should scale (here, my-app). As we discussed in the previous article, one of the significant drawbacks of vertical scaling is the short downtime of your service while scaling is performed. This is also addressed in the VPA documentation:

Updating running pods is an experimental feature of VPA. Whenever VPA updates the pod resources the pod is recreated, which causes all running containers to be restarted. The pod may be recreated on a different node. (Source)
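If those restarts are a concern, one option is to run VPA in recommendation-only mode, where it computes suggested resource requests but never touches running pods. Below is a minimal sketch, assuming the same my-app deployment; the VPA name is made up for illustration.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa-recommender   # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    # "Off" means VPA only publishes recommendations
    # and does not recreate any pods
    updateMode: "Off"
```

The recommendations end up in the status of the VPA object, so you can review them and apply new requests yourself at a convenient time.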

Horizontal Pod Autoscaler

As we already know, vertical scaling is not as fruitful as horizontal scaling. Happily, Kubernetes clusters also have a powerful built-in Horizontal Pod Autoscaler (HPA), which, as the name suggests, performs horizontal scaling automatically. This autoscaler is much more powerful than VPA, as it allows a lot of customization and extension that we will cover later on.

The HPA continuously monitors resource usage across the pods of a single deployment. In the case of vanilla Kubernetes clusters, the only resource you can use is CPU utilization. Once the average CPU utilization (across the pods in a single deployment) rises above a defined target, the HPA will start to scale up. Likewise, if the average falls below the target, the deployment will be scaled down.

Here’s a simple example of a horizontal pod autoscaler definition:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  # The workload whose replica count HPA should manage
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

Let’s split it into smaller pieces, as it contains more information than the previously discussed VPA.

  • spec.scaleTargetRef specifies the deployment that will be scaled by this HPA.
  • spec.minReplicas and spec.maxReplicas define the minimum and maximum number of pods in the deployment.
  • finally, spec.metrics defines the metrics that will be used to make the decision to scale up or down.
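To get a feeling for how the decision is made, it helps to look at the formula that the Kubernetes documentation gives for the HPA. The worked numbers below are my own illustration, using the 50% CPU target and the limit of 4 replicas from the definition above.

```latex
\[
\text{desiredReplicas} =
  \left\lceil \text{currentReplicas} \times
  \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
\]

% Scale up: 2 pods at 90% average CPU, target 50%
\[
\left\lceil 2 \times \frac{90}{50} \right\rceil = \lceil 3.6 \rceil = 4
\]

% Scale down: 4 pods at 20% average CPU, target 50%
\[
\left\lceil 4 \times \frac{20}{50} \right\rceil = \lceil 1.6 \rceil = 2
\]
```

The result is always kept between minReplicas and maxReplicas, so in this example the deployment would never grow beyond 4 pods, no matter how high the CPU utilization gets.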

Of course, CPU on its own is not a lot of information. Luckily, the HPA capabilities can be extended by custom metric adapters. However, the problem with metric adapters is that they usually provide a predefined set of metrics with unchangeable behavior (e.g., the polling interval for the desired metric), and they are more often “cluster state” metrics than “application metrics.” Thus, they are not always the best decision source for your application.
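For illustration, here is a sketch of what an HPA driven by a custom metric could look like. It assumes that some metric adapter is installed in the cluster and exposes a per-pod metric; the metric name queue_messages_per_pod, the HPA name, and the target value are all made up for this example.

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-custom-hpa          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods                     # per-pod custom metric instead of CPU
    pods:
      metric:
        name: queue_messages_per_pod   # hypothetical metric served by an adapter
      target:
        type: AverageValue
        averageValue: "30"         # aim for roughly 30 messages per pod
```

Whether such a metric is available, and how often it is refreshed, depends entirely on the adapter, which is exactly the limitation described above.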

What is more, the built-in HPA mechanism still has one small drawback: it doesn’t allow scaling a deployment to 0 replicas. This means that even if there’s no job to do, our deployment has to include at least one pod. And in some specific cases, that may sound like a potential waste of resources and money. In the next part of this short series, we will take a closer look at a few interesting ways to improve Kubernetes horizontal scalability.

Tips for HPA

At the end of last year, I was heavily occupied with researching the Kubernetes HPA mechanism. During this time, I learned a few lessons that I would like to share with you.

  • Design your Docker image properly: this may sound trivial. However, not everyone is familiar with the problem of system signal propagation between Docker containers and the underlying machines. And this is crucial to make sure that pods with your application are terminated gracefully. Happily, there are simple, easy-to-use solutions to this issue. Take a look at tini or dumb-init to get a grasp of the problem.
  • Use the PreStop hook: once your image handles the SIGTERM signal properly, you can use the PreStop hook to perform an action before your application pod is terminated (e.g., cancel some running processes gracefully). For more information, I recommend familiarizing yourself with the container lifecycle.
  • Adjust terminationGracePeriodSeconds: if you know that your application needs some time after receiving a stop command or the SIGTERM signal, you should consider adjusting the terminationGracePeriodSeconds of your pods. This will increase the probability that whatever you need to do before the termination of a pod gets done. Although it may sound like a simple thing, it’s worth taking a look at the process of pod termination.

Being aware of all those moving parts may help you ensure that your HPA works smoothly and that there’s no unexpected behavior (e.g., lost tasks or data) when your application is shut down.
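The last two tips can be sketched in a single deployment definition. This is only an illustration under a few assumptions: the image name is made up, the 60-second grace period is an arbitrary value, and the sleep command stands in for whatever cleanup your application really needs.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  # replicas is left out on purpose so that the HPA controls the pod count
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      # Total time a pod gets to shut down before it is killed
      # (the default is 30 seconds)
      terminationGracePeriodSeconds: 60
      containers:
      - name: my-app
        image: example.com/my-app:1.0   # hypothetical image
        lifecycle:
          preStop:
            exec:
              # Placeholder for real cleanup; runs before SIGTERM is sent
              command: ["/bin/sh", "-c", "sleep 10"]
```

Keep in mind that the time spent in the PreStop hook counts toward the grace period, so the value should cover both the hook and the time your application needs to handle SIGTERM.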

Cluster Autoscaler

At the beginning, I said that Kubernetes autoscaling is mostly about pod scalability. That’s generally true from a software engineer’s perspective. But there’s also a lower level of what can be scaled in Kubernetes. The Cluster Autoscaler (CA) is a Kubernetes component that monitors the whole cluster and, if needed, increases or decreases the number of cluster nodes. By enabling the CA, we can increase the probability that any requested or scheduled cluster workload will be successfully executed. What is the difference between HPA and CA?

If there are not enough resources, CA will try to bring up some nodes, so that the HPA-created pods have a place to run. (Source)

So basically, the CA works at the node level, ensuring that the cluster has enough available resources to create the pods required to run the components of our application. The whole configuration of CA comes down to enabling this feature on your cluster (this is provider specific; here is a GKE example) with the required minimum and maximum number of nodes. And that’s it, there is not a lot to configure here.


In this part, we took a look at the basic autoscaling options of Kubernetes clusters: VPA, HPA, and CA. Hopefully, the next time you enable HPA on your cluster, you will be aware of its limitations and of some additional settings that can improve the reliability of your deployment. In the next part, I will cover a few interesting ways to extend Kubernetes’ horizontal scaling capabilities and make autoscaling even more powerful.

Next in the series, Part 3: Knative and KEDA.

This blog post was originally published at


