What a fantastic ride

I was put in charge of writing some extra tests in our framework covering our Docker registry endpoints. We built that framework around Locust. Naturally, I first started learning Locust, and it is pretty nice to use. You create simple classes that house the flow of the requests you want to execute and you call them one by one, stating what counts as a success and what as a failure.
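To give an idea, a flow class in Locust looks roughly like this. The endpoint and image names here are made up for illustration, not our actual ones:

from locust import HttpUser, task, between

class RegistryUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def pull_manifest(self):
        # mark success/failure explicitly instead of relying on the status code alone
        with self.client.get(
            "/v2/library/alpine/manifests/latest",
            headers={"Accept": "application/vnd.docker.distribution.manifest.v2+json"},
            catch_response=True,
            name="pull manifest",
        ) as response:
            if response.status_code == 200:
                response.success()
            else:
                response.failure(f"unexpected status {response.status_code}")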

The goal was to prove that our Python code was horrible and needed to be switched to a Golang implementation ASAP.

Rough start

I could not even start our Docker container, because some combination of Werkzeug, Flask and Locust made it all stop working. So I first had to untangle that mess. It turned out that some older Flask code imported a function that no longer exists at the provided location.

For all who are interested, the actual error is: cannot import name 'BaseResponse' from 'werkzeug.wrappers'.
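If you hit the same thing: newer Werkzeug releases dropped BaseResponse in favour of Response, so the usual workaround is either pinning Werkzeug to an older release or shimming the import. Something along these lines, as a sketch rather than necessarily the fix we ended up with:

# Newer Werkzeug renamed BaseResponse to Response; fall back gracefully.
try:
    from werkzeug.wrappers import BaseResponse
except ImportError:
    from werkzeug.wrappers import Response as BaseResponse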

After that initial rough start, I began by mapping out how Docker actually works: what happens when you do docker pull or docker login, for example. Turns out they are all just HTTP calls to a REST API backend. That returns some data, and with that data we continue onward to more calls, until all the data has been fetched for Docker to actually create the containers and/or images.
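Stripped down to its essence, a docker pull is roughly the following sequence against the registry's v2 API. The registry URL and image name below are placeholders:

import requests

REGISTRY = "https://registry.example.com"  # placeholder registry URL
IMAGE = "library/alpine"                   # placeholder image name
TAG = "latest"

session = requests.Session()

# docker pull first asks for the manifest of the tag...
manifest = session.get(
    f"{REGISTRY}/v2/{IMAGE}/manifests/{TAG}",
    headers={"Accept": "application/vnd.docker.distribution.manifest.v2+json"},
)
manifest.raise_for_status()

# ...and then downloads every layer blob listed in it, by digest.
for layer in manifest.json()["layers"]:
    blob = session.get(f"{REGISTRY}/v2/{IMAGE}/blobs/{layer['digest']}")
    blob.raise_for_status()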

Docker API

I wrote a simple Python PoC for the Docker API client. In principle I could use that code to pull any image I want, but that is not what it is for. So I included all of that code in our Locust framework to make sure the test was always set up correctly, and that the images it created were deleted afterwards.

That is where I ran into the second problem: images cannot be removed from a Docker registry by default, you have to enable that feature. When I started talking to our devs about it, they said to just forget about it. Do the setup once, so that the image that is needed exists in a shared test repository, and continue onward.

So I ripped that entire code out of Locust and began anew.

Concurrency issues

Next up came the problem that I wanted to fetch credentials only once and share them amongst the distributed workers. There were several hosts, each running multiple workers as separate processes. On each of those hosts I wanted one worker process to make a single call to get a nice token and share that token in memory with the rest. In comes Python's SharedMemory. I finally got it to work after fixing all my race conditions, which came down to there being no synchronisation flag to make sure everybody waited on each other.
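The idea boiled down to something like the sketch below. The names and the token call are made up; the real code lived inside the Locust workers and is long gone:

from multiprocessing import Event, Process
from multiprocessing.shared_memory import SharedMemory

TOKEN_BLOCK = "auth_token"  # made-up shared memory name
BLOCK_SIZE = 4096

def fetch_token():
    # placeholder for the real call to the registry auth endpoint
    return b"example-token"

def leader(ready):
    # one process fetches the token and writes it into shared memory
    shm = SharedMemory(name=TOKEN_BLOCK)
    token = fetch_token()
    shm.buf[:len(token)] = token
    shm.close()
    ready.set()  # the synchronisation flag that was missing at first

def follower(ready):
    ready.wait()  # everybody waits until the leader has written the token
    shm = SharedMemory(name=TOKEN_BLOCK)
    token = bytes(shm.buf).rstrip(b"\x00")
    shm.close()
    print(token)

if __name__ == "__main__":
    ready = Event()
    shm = SharedMemory(name=TOKEN_BLOCK, create=True, size=BLOCK_SIZE)  # created once per host
    procs = [Process(target=leader, args=(ready,))]
    procs += [Process(target=follower, args=(ready,)) for _ in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    shm.close()
    shm.unlink()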

After all that code, the rest of the devs went: that is cool, but we do not need it. Just call the login at the start of each flow; it will create credentials, and if there already are credentials it will return them. So again, rip out the code written so far and start anew.

Finally on my way

Started again with the new flow, and now I had a nice test up and running. The data it returned was a bit baffling and showed that our Python code was not the bottleneck, as previously thought and hoped for. It was our nginx reverse proxy setup. So we split the nginx pods out onto their own and updated the config to handle things a bit better, basically giving it more workers and connections.

Okay, after fixing the nginx pods I ran the tests again, and it turned out the Docker registry itself was a bottleneck. It just could not cope in terms of memory usage and freeing memory up again. We use Redis as our cache layer and Google Cloud Storage (GCS) as the bucket that actually stores the data served by the Docker registry.

Breathing room

We had so many services jammed together in one pod it was crazy. Basically one pod ran the following services:
– Nginx
– Redis
– Docker registry
– Flask app

On top of that there was no control over which pod ran which services, so it could be that one pod ran 2 nginx + redis + docker registry + flask, whilst another ran only docker registry + flask. So back to basics: get one service per pod, and split the Docker registry off onto its own node.

Now that that was cleared up, the next bottleneck seemed to be Redis? So I turned to Redis and its config, and found out we were actually not using the staging Redis but the production Redis?!?!?!

I quickly changed that config and made it so there was one node running a dedicated Redis for staging. So now nginx, the Flask app, the Docker registry and Redis each run as their own service, with the registry and Redis on dedicated nodes.

Okay, now can we finally move on and find out whether the Python code itself really is that slow?

gunicorn

Well, not so fast. Turns out that gunicorn was behaving badly and could do with some optimisation. gunicorn uses different worker classes, and if we do not pick the right one with the right parameters it can actually end up blocking. The reason I went down this rabbit hole was the gunicorn logs stating it had run out of workers.

After much experimenting with which parameters work best, it turns out the combination that worked for us was the following:

CONCURRENCY_SETTING=$(python3 -c 'import multiprocessing as mp; print(mp.cpu_count() * 2)')
exec /usr/local/bin/gunicorn \
    -n internal_auth_secret \
    -w "${CONCURRENCY_SETTING}" \
    -k gevent \
    --worker-connections=1000 \
    -b 0.0.0.0:8000 \
    -t 180 \
    internal_auth_secret:app

Meaning: use the gevent worker class, with 1000 worker connections per worker, and set the total number of workers to twice the number of cores available on whatever host we are running on. This also makes the setting dynamic: if we ever upgrade the hardware of the node underlying the pod, the worker count grows with it automatically, without us having to remember to update it.

Conclusion

So, after fixing the infrastructure setup by correctly allocating memory and CPU to each of the services and separating them out so that each gets the appropriate amount it needs; after making sure our nginx was configured correctly; after configuring the services in staging to actually point at services in staging rather than production; and after configuring and fine-tuning gunicorn, there was still a slight bottleneck.

Yeey, finally the Python code is slow and dumb and we can move on to Golang. Hold on, let us first see what the bottleneck actually is. I made some call graphs using the following module: https://github.com/daneads/pycallgraph2. They showed that the bottleneck was partly in our shared library code that handles authentication, and partly in the way we were determining when to call that particular function. Finally the culprit had been located.
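Generating such a call graph is pretty straightforward; roughly like this, where the entry point is a made-up stand-in for our shared auth code:

from pycallgraph2 import PyCallGraph
from pycallgraph2.output import GraphvizOutput

def handle_auth_request():
    # stand-in for the real entry point into the shared auth library
    pass

# render the resulting call graph to a PNG via graphviz
graphviz = GraphvizOutput(output_file="auth_callgraph.png")

with PyCallGraph(output=graphviz):
    handle_auth_request()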

Fixing the shared library code was easy: just improve the for loops and make small optimisations in terms of what to store, so we do not constantly look up the same values. Cache more in Redis, and use a Redis connection pool rather than opening a new connection for every single query.
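The connection pool change is basically the standard redis-py pattern; something like this, with a placeholder host name:

import redis

# one pool for the whole process instead of a new connection per query
POOL = redis.ConnectionPool(host="redis.staging.internal", port=6379, db=0)  # placeholder host

def get_redis():
    # clients share the pool, so no connection setup cost on every call
    return redis.Redis(connection_pool=POOL)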

Fixing the problem of knowing when to call that function in the shared code was literally one if/else statement added to the logic that previously just declared the variable. It was a code fix of 44 characters that improved the total time spent considerably. The longest request before this fix spent 465ms on the shared library code path; after fixing both issues it was only around 60ms. So instead of being able to handle roughly 2 requests per second, we could now handle roughly 15 per second, per worker, per worker_connection.

After that roller-coaster of a ride, I made sure we could handle millions of requests coming in rather than just a couple of hundred. The next optimisations lie in network I/O and other factors. Even if we moved to a Golang implementation, it might gain us 1ms at most in terms of code, and even that is highly optimistic and probably not realistic. The rest lies in the fact that we have nginx going to a Docker registry, talking to another service running somewhere else on the network, which in turn talks to Redis. Those round trip times start to add up.

However, that is for another time. Right now we have enough to get through the next years of running our service. If we need more, we just scale the entire setup to include more nodes, until the bottleneck becomes network throughput/bandwidth. Then we will revisit this.

#100DaysToOffload #DevOps #python