How proper default settings can save money and trees

Jérémy Keusters · ML6team · Jun 29, 2021


Here at ML6, we always try to stay up to date with the latest tools and innovations across the different fields in which we work. One of the tools on our tech radar is Terraform, an infrastructure-as-code tool created by HashiCorp that lets you build, change and version infrastructure in a safe and efficient way.

Terraform logo

A few months back, I came across an interesting flaw that occurs when using Google Kubernetes Engine in combination with Terraform, and decided to write a blog post about it. I'll first give a short overview of what Terraform does and what its main advantages are, after which I'll outline the flaw I encountered and how I ended up fixing it. Knowledge of Terraform is not needed, but a basic understanding of Kubernetes concepts will help you follow what's going on.

The main goal of using Terraform is to set up high-quality, secure infrastructure in the easiest and fastest way possible, so that we can focus on developing the actual AI solution. Another advantage is that common, proven infrastructure can be shared and re-used, allowing us to capture internal knowledge for kickstarting new projects. From a security viewpoint, this also allows us to bake our security policies right into our Terraform boilerplates. Furthermore, Terraform lets us version-control infrastructure, create multiple environments in a structured way and prevent human error.

For non-believers, it might feel like Terraform slows down your initial development. This is often true, but thanks to the higher internal quality you are building, you will be able to add new functionality at a faster pace later on. This phenomenon is described by Martin Fowler as the ‘impact of internal quality’ and can be visualised with the pseudo-graph below.

Image by Martin Fowler (https://martinfowler.com/articles/is-quality-worth-cost.html)

One of the great things about Terraform is that it has a hosted registry containing all kinds of modules that help you to quickly deploy common infrastructure configurations. These modules are created and updated by different providers, such as AWS and Google Cloud. At the time of writing, the Terraform registry contains over 6124 modules from 1162 different providers. As ML6 is a Google Premier Partner, we primarily use Google Cloud and therefore often make use of the Google Cloud Terraform modules. One of those modules is the Google Cloud Kubernetes Engine module, which allows us to create Google Cloud Kubernetes clusters through Terraform.
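To give an idea of what this looks like in practice, below is a minimal sketch of a cluster definition using that module. The project, network and IP range names are placeholders, and the exact set of required variables can differ between module versions:

```hcl
module "gke" {
  source  = "terraform-google-modules/kubernetes-engine/google"
  version = "~> 14.0"

  # Placeholder values: replace these with your own project and network setup.
  project_id        = "my-gcp-project"
  name              = "inference-cluster"
  region            = "europe-west1"
  network           = "my-vpc"
  subnetwork        = "my-subnet"
  ip_range_pods     = "pods-range"
  ip_range_services = "services-range"
}
```

Running terraform apply with a definition like this is enough to get a fully working cluster, which is exactly the convenience these registry modules are meant to provide.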

I recently had to adapt an internal project running on Google Kubernetes Engine to make use of Terraform. The architecture, shown in the schema below, is rather simple. It consists of one Kubernetes cluster containing two node pools:

  • A GPU node pool, which can scale between zero and one instances and which every now and then runs a machine learning inference task that requires a GPU.
  • A system node pool, which takes care of all the system pods that Kubernetes requires.
Overview of the architecture

The goal of this particular architecture is to be able to run a machine learning inference task every once in a while, without keeping a GPU instance running 24/7. You can never scale a Kubernetes cluster down to zero nodes, as you must always have at least one node available to run the system pods, which is why we added the system node pool. If you’re interested in this kind of flexible architecture, I recommend reading this excellent blog post by Alfonso Palacios on Medium, which explains how to create such an architecture yourself in Google Cloud.
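For reference, these two node pools could be declared through the module's node_pools variable, roughly as sketched below. This goes in the same module block as before; the GPU machine type and accelerator type are my own assumptions for illustration, not the exact values from the project:

```hcl
module "gke" {
  source  = "terraform-google-modules/kubernetes-engine/google"
  version = "~> 14.0"

  # ... same project and network settings as in the previous snippet ...

  node_pools = [
    {
      # System node pool: a single small node that hosts the Kubernetes system pods.
      name         = "system-pool"
      machine_type = "e2-small"
      min_count    = 1
      max_count    = 1
    },
    {
      # GPU node pool: scales down to zero nodes when no inference task is running.
      name              = "gpu-pool"
      machine_type      = "n1-standard-4"   # assumption
      accelerator_type  = "nvidia-tesla-t4" # assumption
      accelerator_count = 1
      min_count         = 0
      max_count         = 1
    },
  ]
}
```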

But enough about the architecture; let’s dive into the issue I encountered. I replicated the exact same architecture through Terraform in Google Cloud, and it seemed to be working fine, until I got a notification saying that my cluster had one or more unschedulable pods.

Unschedulable pods warning

This was rather bizarre, as the architecture was exactly the same as the one that had been created manually through the Google Cloud Platform interface in the past. It meant that, for some reason, the system node pool could suddenly no longer fit all the system pods on a single instance. As an official ML6 agent, I decided to start an investigation. 🕵️‍♂️

The first thing I did was check the details and possible actions according to GCP, from which I learned that the unschedulable pod’s name was calico-typha-5754cbfbdd-fvpzg. Although this information was helpful, it didn’t immediately ring a bell.

Unschedulable pods warning details and possible actions

After looking a bit further, I noticed that there were some other pods on the node that also had a calico prefix, but that were running successfully. I then compared the pods running on the original cluster against the ones running on the new cluster created through Terraform, and concluded that no pods with a calico prefix were running on the original cluster, which was created through the Google Cloud Platform interface. If you Google for calico, you’ll discover that it is a network policy provider that you can enable in Google Kubernetes Engine (GKE) by activating the network policy enforcement feature.

GKE’s network policy enforcement feature can be used to control and limit communication between your cluster’s pods and services. It is a very interesting feature for more complex (multi-tenant or multi-tier) applications, as it allows you to configure firewall rules between the services and pods running in your cluster, e.g. disabling direct communication between a frontend and a billing microservice, or between namespaces of different tenants. However, you have to manually define a network policy using the Kubernetes Network Policy API to create such pod-level firewall rules. If you don’t define anything, the network policy add-on does nothing other than take up your valuable resources. According to the Google documentation, this option increases the memory footprint by approximately 128 MB and requires approximately 300 millicores of CPU. You can imagine how problematic that is in my use case, where I’m running everything on a single ‘e2-small’ machine, which has only 2 GB of memory and shared vCPUs to begin with.
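Since this post is about Terraform anyway, here is a hypothetical example of such a pod-level firewall rule, written with Terraform’s Kubernetes provider (the namespace and app labels are made up for illustration). It only allows pods labelled app=api to reach the billing pods, implicitly blocking direct traffic from the frontend:

```hcl
resource "kubernetes_network_policy" "billing_allow_api_only" {
  metadata {
    name      = "billing-allow-api-only"
    namespace = "default" # assumption: everything runs in the default namespace
  }

  spec {
    # The policy applies to the billing pods.
    pod_selector {
      match_labels = {
        app = "billing"
      }
    }

    # Only allow ingress traffic coming from pods labelled app=api.
    ingress {
      from {
        pod_selector {
          match_labels = {
            app = "api"
          }
        }
      }
    }

    policy_types = ["Ingress"]
  }
}
```

Without at least one policy like this, the Calico pods that the add-on deploys have nothing to enforce.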

At this point I knew that, for a still unknown reason, GKE’s network policy was enabled when creating the infrastructure through Terraform, but not when the infrastructure was created through the Google Cloud Platform interface. This feature increased the amount of resources required beyond what the node could offer, hence the unschedulable pod. I had found the root cause of the error, but I still had to figure out how this could happen and how I could avoid it in the future.

Of course, it was obvious by now that this had something to do with the Terraform Google Kubernetes Engine module. After consulting its documentation, I noticed that there is an option to enable the network policy add-on, but that it is enabled by default without any further documentation. In my opinion, this raised three main concerns:

  • First of all, it means that every user gets this overhead without even realising it. After all, the only reason I discovered it is that I could no longer run all my system pods on a single e2-small instance in this use-case-specific architecture. In another architecture with larger instances, I probably wouldn’t have noticed. Google actually documented the resource requirements, stating that you need at least two e2-medium instances for the network policy feature, and even recommends at least three e2-medium instances.
  • Secondly, the overhead could be acceptable if it improved security out of the box, but as explained earlier, this add-on does not contribute anything to the security of your system as long as it is not configured.
  • This brings us to the third and last argument: the network policy add-on is not enabled by default when creating a cluster directly through the Google Cloud Platform interface. In other words, Google realises that this is an overhead for most people and decided not to enable it by default.

Given all these arguments, I felt there were no real pros to enabling the network policy add-on by default when creating a GKE cluster through Terraform. Of course, I could simply have overridden the default setting in Terraform and disabled the network policy manually, but I wanted to avoid other people stumbling into the same issue in the future. I therefore created an issue in the Terraform Google Kubernetes Engine module GitHub repository, in which I tried to convince the maintainers to change the default behaviour using the arguments above. This was received positively, after which I created a pull request that changed the source code so that the network policy add-on would be disabled by default. The pull request eventually got accepted, and the rest is (git) history. As a consequence, starting from v14.0, the network policy add-on is no longer enabled by default.
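If you are on a pre-v14.0 version of the module, the manual override mentioned above is a single attribute on the module block. A minimal sketch, assuming the module definition from earlier:

```hcl
module "gke" {
  source  = "terraform-google-modules/kubernetes-engine/google"
  version = "~> 13.0" # versions before 14.0 enable the add-on by default

  # ... project, network and node pool settings as before ...

  # Explicitly disable the network policy add-on and its Calico pods.
  network_policy = false
}
```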

Changing this default setting has an impact in multiple ways. The most obvious win is financial, as this change can potentially decrease your GKE cost per cluster by up to 83%*, but a less obvious win can also be found on the environmental side. Although Google Cloud is taking great initiatives to be as green as possible, an instance running on Google Cloud still consumes power, and not all of that power is carbon-free. According to Google’s documentation, the Belgian (europe-west1) datacenter runs on 68% carbon-free energy on average, which is great, but still means that 32% of the consumed energy comes from carbon-emitting sources. Decreasing the required resources thus also lowers the impact on the environment. An important note here is that other Google Cloud datacenters consume far less carbon-free energy, or don’t even report how much carbon-free energy they consume.

Google’s St. Ghislain, Belgium (europe-west1) data center

I think it is safe to say that this small change has a big impact in multiple ways. You might have already heard of the Boy Scout rule:

“Always leave the campground cleaner than you found it.”

The idea behind that rule is that when you find a mess on the ground, you clean it up regardless of who made it, because you want to intentionally improve the environment for the next group of campers. The same principle applies to software, and it is exactly what I did here. If everyone left code a little better than they found it, the world would be an even greater place!

* Based on the particular architecture discussed here: running the network policy enforcement feature would require the recommended three e2-medium instances (€66.20/month in total, europe-west1), compared to running the system pods on a single e2-small instance (€11.03/month, europe-west1). That amounts to a saving of (66.20 − 11.03) / 66.20 ≈ 83%.
