ACME solver issues

This time I took a deep dive into cert-manager, which can automatically provision Let's Encrypt certificates inside a Kubernetes cluster.

Why

First, let me start off by saying that I do not do these things solely because I think they are fun. In this case I had a problem that I wanted to get to the bottom of. I was using cert-manager to automatically provision Let's Encrypt certificates inside a GKE cluster. The Ingress we made created the actual GCP resources: an HTTP LoadBalancer that redirected all traffic to HTTPS, and an HTTPS LoadBalancer that did all the routing work.

The so-called acme-http01-solver pod added a rule to the Ingress, and that rule ended up on the HTTPS LoadBalancer. But there was no certificate there yet, so the error messages kept coming back about an HTTP -> HTTPS redirect ending in an EOF on the SSL handshake.

cert-manager internals

The internals of the cert-manager project are quite interesting. It does as much as it can through pure k8s resources, using Custom Resource Definitions to trigger things and to expose its own API. It is quite handy. The business logic that decides which Ingress to use, or whether to create a new one, is not hard to follow. In our case we had specified the Ingress that had to be used, so cert-manager updated that one, adding a rule that directs the challenge path to the HTTP solver pod.
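
To make this concrete, here is a sketch of the kind of rule cert-manager adds to the Ingress it is pointed at. The host, service name, port and token below are illustrative placeholders, not values from our cluster:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress # the Ingress we told cert-manager to reuse
spec:
  rules:
    - host: example.com
      http:
        paths:
          # added by cert-manager for the duration of the challenge
          - path: /.well-known/acme-challenge/<token>
            pathType: ImplementationSpecific
            backend:
              service:
                name: cm-acme-http-solver-xxxxx # the solver Service cert-manager creates
                port:
                  number: 8089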

Manual workaround

I did a manual workaround in order to get the certs issued. I took note of the settings as they were, added the BackendConfig to the HTTP LoadBalancer, and then specified a simple host/path rule to send all traffic to the acme-http01-solver pod. After obtaining the cert, I reverted to the settings it had before. This worked nicely.

Potential fixes

One thing that struck me as odd was that this was never an issue for our dev environments/clusters. So I looked at the differences and saw that there we told cert-manager to issue a temporary self-signed certificate to the Ingress. That meant there was a certificate in place so the redirected challenge request no longer failed, and therefore we could apply the same “fix” to the actual production certificates rather than only to the staging Let's Encrypt certificates.
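
For reference, this is roughly what that looks like on the Certificate resource, using the cert-manager.io/issue-temporary-certificate annotation; the names below are placeholders:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-com-tls
  annotations:
    # ask cert-manager for a temporary self-signed cert while the real one is pending
    cert-manager.io/issue-temporary-certificate: "true"
spec:
  secretName: example-com-tls
  dnsNames:
    - example.com
  issuerRef:
    name: letsencrypt-prod
    kind: Issuer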

Another fix would be to not specify the name of the Ingress that will eventually use the created cert, and to only specify the class. In this case that would be gce; the only thing I am unsure of is whether it will put the certificate into the right Secret.
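
As a sketch, the Issuer would then look something like this; the server, email and names are placeholders:

apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: gce
            # instead of pinning it to an existing Ingress with:
            # name: my-ingress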

Another fix would be to add an HTTP-only rule for the .well-known paths, but the problem is that we cannot control how the LoadBalancers are created; that is the logic Google built to translate GKE resources into GCP resources. Also, cert-manager does not need a fix, nor does it have a place for one, because it only creates Kubernetes resources, in this case an Ingress.

Final potential fix

The final option is actually to create a FrontendConfig like the following:

apiVersion: networking.gke.io/v1beta1
kind: FrontendConfig
metadata:
  name: my-frontend-config
spec:
  redirects:
    - name: redirect-http-to-https
      pathMatch: "/"
      redirectType: MOVED_PERMANENTLY
      stripQuery: false
      statusCode: 301
      pathPrefixRedirect: "/"
      redirectScheme: "HTTPS"
      # specify to not redirect certain paths
      pathPattern:
        - path: "/.well-known/acme-challenge/**" # Exclude "/.well-known/acme-challenge" path
          scheme: "HTTP" # Keep it HTTP

This should mean that there is no redirect to HTTPS for the challenge path, while the traffic is still sent on to the pod. I have not verified this, but it seems to be the nicest solution.
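
For completeness, a FrontendConfig only takes effect once the Ingress references it through an annotation, roughly like this; the Ingress name is a placeholder:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    networking.gke.io/v1beta1.FrontendConfig: "my-frontend-config"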

Conclusion

This post belies how much effort it took to figure all of this out and to understand the problem fully enough to come up with solutions.

In order to fully arrive at the conclusions of this article, you have to understand: