The life aquatic with Aqua Security

Drowning is what happened. This is my personal experience with trying to implement some simple security guidelines.

Terraform

I was working on a project that uses Terraform for its Infrastructure as Code (IaC). I do not mind using Terraform. I am quite sure I am not using it to its full potential, and so far I am still trying to find its edges. One of those edges is that it is quite difficult to create resources conditionally. It is not impossible, and there are a few ways to do it, but none of them are nice to maintain or easy to explain to other people who still need to learn Terraform as well as DevOps. So I stayed away from them.
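For reference, the usual workaround is the count trick below. This is only a minimal sketch with the Terraform google provider; the variable and bucket names are made up for illustration.

variable "create_bucket" {
  type    = bool
  default = false
}

# count turns the resource into a list of zero or one instances, so every
# reference to it elsewhere needs an index, e.g. google_storage_bucket.optional[0].
resource "google_storage_bucket" "optional" {
  count    = var.create_bucket ? 1 : 0
  name     = "some-optional-bucket"
  location = "EU"
}

It works, but the indexing leaks into every place the resource is referenced, which is exactly the kind of thing that is hard to explain to someone new to Terraform.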

AstroNvim

I use AstroNvim to edit everything. It comes with a nice Terraform plugin setup from the community. As I was editing all these files, it quite helpfully gave me security-related advisories that were ultimately sourced from Aqua Security.

Advisories

The advisories ranged from disabling project-wide SSH access, to disabling legacy APIs and enabling Secure Boot on VM instances, to the one that caused me to sink down to the bottom of the ocean: creating a service account to run your nodes as inside a Google Kubernetes Engine cluster.
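For context, the first few of those translate into small node pool settings, something like the sketch below. The cluster and pool names are made up; these are the kind of checks tfsec/Trivy, Aqua's scanners, raise.

resource "google_container_node_pool" "example" {
  name    = "example-pool"
  cluster = google_container_cluster.example.name

  node_config {
    # Advisory: disable the legacy (v0.1/v1beta1) metadata server endpoints.
    metadata = {
      disable-legacy-endpoints = "true"
    }

    # Advisory: enable Secure Boot on the node VMs.
    shielded_instance_config {
      enable_secure_boot = true
    }
  }
}

Those were easy wins. The service account one was not.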

I will spare you the suspense and tell you the conclusion immediately: don't. The details, though, are what make for a good story down the road, one of those war stories where you can reminisce about the good ol' days of yore when you had to manually create Kubernetes clusters.

So I first created an account with the necessary role permission and... actually, no. The first thing I did was change something project-wide in my IAM, making it impossible for me to log in anymore and do anything. I will just link to what happened. I skipped right past that warning, and exactly what it describes is what happened to me. I also did it a second time, to another project, because I did not understand what had just happened.
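For anyone wondering what that class of mistake looks like in Terraform terms, here is a sketch rather than a reproduction of the exact change I made: the authoritative policy resource replaces everything, the member resource only adds.

# Authoritative: applying this REPLACES the project's entire IAM policy.
# Every binding not listed in the data block is removed, including your own access.
data "google_iam_policy" "only_this" {
  binding {
    role    = "roles/viewer"
    members = ["user:someone@example.com"]
  }
}

resource "google_project_iam_policy" "dangerous" {
  project     = var.gcp_proj_id
  policy_data = data.google_iam_policy.only_this.policy_data
}

# Additive: this only adds one member to one role and leaves the rest of
# the project's bindings alone.
resource "google_project_iam_member" "safe" {
  project = var.gcp_proj_id
  role    = "roles/viewer"
  member  = "serviceAccount:some-account@my-project.iam.gserviceaccount.com"
}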

Take 2

Okay, after my team lead helped me restore my project, I was moving onward again. You can never go back, only forward. So I figured out not to do that, and to use a binding rather than setting the policy directly. I created a service account, gave it the role it needed, and it still did not work. It spat out an error, but a different one than before. Somewhere in a GitHub issue I found the answer, or at least I think that is the one that led me in the right direction. I suddenly saw these two “hidden” accounts: they do exist, and Google uses them internally to do everything in your project, but they do not show up in the UI. I added the necessary rights to these accounts so they could access my custom service account and, all of a sudden, things worked. I was happy. I changed all the nodes (VM instances) to use the custom service account, and things were working. The relevant Terraform code:

resource "google_service_account" "cluster_service_account" {
  account_id   = "${var.env_prefix}-cluster-account"
  display_name = "A service account with minimal access for the cluster"
}

resource "google_project_iam_member" "cluster_account_container_member" {
  project = var.gcp_proj_id
  role    = "roles/container.defaultNodeServiceAccount"
  member  = "serviceAccount:${google_service_account.cluster_service_account.email}"

}

resource "google_project_iam_member" "cluster_default_container_engine_member" {
  project = var.gcp_proj_id
  role    = "roles/container.serviceAgent"
  member  = "serviceAccount:service-${var.gcp_proj_nr}@container-engine-robot.iam.gserviceaccount.com"
}

resource "google_project_iam_member" "cluster_default_cloudservices_member" {
  project = var.gcp_proj_id
  role    = "roles/container.serviceAgent"
  member  = "serviceAccount:${var.gcp_proj_nr}@cloudservices.gserviceaccount.com"
}

resource "google_artifact_registry_repository_iam_member" "cluster_artifact_reader" {
  repository = google_artifact_registry_repository.fioed.name
  role       = "roles/artifactregistry.reader"
  member     = "serviceAccount:${google_service_account.cluster_service_account.email}"
}

Recreating it again

Another customer had to be brought online soon, and I wanted to redo the creation one more time as preparation for the real thing. Almost like a dress rehearsal for a play: I just wanted to go through the steps so it would not all be new again, and any problems that might still be there could be fixed ahead of time. So I cut a new release and destroyed my infrastructure to start over.

It would not even create the GKE cluster this time. It just failed. Mind you, there was about a month and a half between me setting up the custom service account and starting over. Somewhere deep in the GCP docs it describes what can happen if someone edits the default service accounts, meaning those “hidden” accounts, and a light bulb slowly lit up, like someone turning up a dimmer switch. I suddenly remembered I had done exactly that while figuring out what it would take to use a custom service account in our cluster.

So I disabled the APIs and re-enabled them. That got me past the error and I could create the cluster again. Perfect. Next up: creating the first pods. They would not pull their images; suddenly I saw ErrImagePull everywhere. I retried the API dance and recreated the service accounts, but nothing helped. Then, again by chance, I read a sentence mentioning that the default Compute Engine service account gets assigned a role that lets it automatically read your Artifact Registry repositories. I did not know where that role had to be configured, so I decided to add it to the service account, for some reason.

Quick side note: apparently service accounts are both Principals and Resources. They are like quantum IAM permissions, always in a superposition until you interact with them.
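To make that concrete, here is a small sketch with the Terraform google provider, reusing the names from the code above; the specific roles are just illustrative. The same service account shows up once as the resource a policy hangs off, and once as the principal being granted a role somewhere else.

# The custom account as a Resource: someone else is granted a role *on it*
# (here, the GKE robot account is allowed to use it).
resource "google_service_account_iam_member" "robot_can_use_custom_account" {
  service_account_id = google_service_account.cluster_service_account.name
  role               = "roles/iam.serviceAccountUser"
  member             = "serviceAccount:service-${var.gcp_proj_nr}@container-engine-robot.iam.gserviceaccount.com"
}

# The custom account as a Principal: it is the member being granted a role
# on another resource, in this case the project.
resource "google_project_iam_member" "custom_account_log_writer" {
  project = var.gcp_proj_id
  role    = "roles/logging.logWriter"
  member  = "serviceAccount:${google_service_account.cluster_service_account.email}"
}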

My command kept failing, and then someone suggested doing a project-wide IAM change again through a CLI command. I had learned my lesson and did it via the UI this time, to make sure I would not do anything foolish. I added the role for the default Compute Engine service account that gets created, <project-nr>-compute@developer.gserviceaccount.com for anyone who is interested. Make sure that account has the Artifact Registry Reader role assigned.
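The Terraform equivalent of that binding, reusing the variables from the code above, would look roughly like this. I added mine by hand in the console, so treat this as a sketch rather than what actually runs in our project.

# Give the default Compute Engine service account read access to Artifact
# Registry, so the nodes can pull images and stop showing ErrImagePull.
resource "google_project_iam_member" "compute_default_artifact_reader" {
  project = var.gcp_proj_id
  role    = "roles/artifactregistry.reader"
  member  = "serviceAccount:${var.gcp_proj_nr}-compute@developer.gserviceaccount.com"
}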

Conclusion

It is not worth it to have a custom service account for your GKE cluster. It instantly makes your project's IAM bespoke and very difficult to maintain and regulate because of all the hidden logic. I will say: if you are a high-value target, it does make sense, since you can probably afford a person, or a whole team, to do IAM all day and manage all these complex rules and permissions to keep everything working. It also adds a layer of protection, since the regular off-the-shelf tooling will no longer work: you get to give this custom account permissions that are far more restrictive than the regular default service accounts. The Kubernetes service agent, or whatever it is that creates your cluster, still creates the cluster and the first node pool of VM instances, but that is it. The rest of your cluster gets created by your custom account, and the custom account is what runs on the VMs. You could make it so limited that it has only the read-only, bare-minimum permissions needed to run the services, and it is more secure that way.

However, for almost all other cases, meaning most of us, it is not worth it. The opposite, even: it is very costly to implement, and the security benefit it gives is negligible.