What a job it is to do system monitoring at scale

Recently I started a small adventure to implement the process described below.

The reason for this was the relatively recent SSH backdoor (the xz backdoor) that was almost not caught. It was only caught because a developer who was extensively testing a PostgreSQL lab noticed a significant, sudden increase in SSH login times.

This is written more as a reference for myself to look back on than as a suggestion to implement it this way, or a claim that this is the only or best way to do it.

My workplace loved this as a feature.

To implement this feature, a process was needed that would harvest the data, send it somewhere to be aggregated, and then visualize it and potentially let loose some machine learning on it. Because of my interests I already knew of tools used in this space: Fluent Bit, Prometheus, OpenTelemetry, Grafana, Kibana, and some more. I decided to start with a mock-up of sorts: a flow of how things would look hooked up.

Prometheus

Prometheus did not operate the way I expected. I thought it would expose a port / URI / API that you could connect to or send your structured data to. That is not how it works: Prometheus pulls. Rather than waiting for an agent to send data in (à la Datadog, New Relic and their ilk), it goes out to the endpoints it is configured with and gets the data from them.
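The pull model in practice looks like this: a minimal `prometheus.yml` where Prometheus scrapes a target on an interval. The job name and target address here are made-up placeholders for illustration, not the setup described in this post.

```yaml
# prometheus.yml -- Prometheus pulls: it scrapes each configured target
# on an interval, rather than waiting for agents to push data in.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "node"                   # hypothetical job name
    static_configs:
      - targets: ["localhost:9100"]    # e.g. a node_exporter endpoint
```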

This did not necessarily throw a wrench in the works, but it did give me pause. I wanted to see if a central place could be used. Prometheus also supports something called remote_write and remote_read, and using that I found a place to store the data: Google Cloud Spanner. This is a relatively cheap service that lets all data be sent to a central store, which multiple Prometheus instances can then read from and write to.
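As a sketch, this is what remote_write and remote_read look like in `prometheus.yml`, assuming a remote-storage adapter sits between Prometheus and the store. The hostname, port, and paths are placeholders, not the actual endpoints used here.

```yaml
# prometheus.yml fragment: ship samples to a central store and read them back.
remote_write:
  - url: "http://adapter:1234/write"   # placeholder adapter endpoint

remote_read:
  - url: "http://adapter:1234/read"    # placeholder adapter endpoint
    read_recent: true                  # also answer recent-data queries from remote storage
```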

Fluent Bit

Fluent Bit can be configured to do a remote_write directly, which made it nicer to work with. For the storage adapter I went with truestreet by Google. It ran in a container and sent the data from each device to the central store. Then, in essence, all that is needed to run Prometheus and the rest anywhere else is another one of these exact same truestreet containers to do remote_read against.

Fluent Bit also has a nice Node Exporter metrics plugin that automatically exports almost all of the statistics that were relevant or interesting.
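A sketch of the Fluent Bit side, using its `node_exporter_metrics` input and `prometheus_remote_write` output plugins. The hostname, port, and URI of the truestreet container are assumptions here, not the values from the actual deployment.

```ini
# fluent-bit.conf -- collect host statistics and remote_write them to the adapter.
[SERVICE]
    flush           5

[INPUT]
    name            node_exporter_metrics   # exports host stats (CPU, memory, disk, ...)
    tag             node_metrics
    scrape_interval 15

[OUTPUT]
    name   prometheus_remote_write
    match  node_metrics
    host   truestreet.internal              # assumed hostname of the truestreet container
    port   1760                             # assumed port
    uri    /write                           # assumed write path
```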

Grafana

I went with Grafana to visualize the data. It can also run anomaly detection and custom scripts, but that is for later. What I did encounter is that installing, configuring, and running Grafana in a secure manner is an ungodly amount of work, so for now I just run it locally on my dev machine. That seemed to be the easiest option.
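For reference, Grafana can be pointed at Prometheus through a provisioning file, which saves clicking through the UI on every fresh install. The file path and the Prometheus URL below are assumptions for a local setup.

```yaml
# provisioning/datasources/prometheus.yml (assumed path under Grafana's provisioning dir)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090    # assumed address of the local Prometheus
    isDefault: true
```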

Recap

To recap: I ended up running Fluent Bit as a process on the device itself, talking to a container running truestreet that sent the data to a Google Cloud Spanner instance. Then, anywhere else I wanted the data available to look into, all I needed was a Docker Compose file that spun up:

- a truestreet container
- Prometheus, configured to do remote_read against the truestreet container
- Grafana, with Prometheus as the data source
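The read-side stack above can be sketched as a compose file. This is a sketch under assumptions: the truestreet image name, its flags and port, and the volume paths are placeholders, not a tested configuration.

```yaml
# docker-compose.yml -- read-side stack: truestreet + Prometheus + Grafana.
services:
  truestreet:
    image: truestreet:latest            # placeholder image name
    command: ["-project", "my-project", "-instance", "my-instance", "-database", "metrics"]  # assumed flags
    ports:
      - "1760:1760"                     # assumed adapter port

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml  # configured with remote_read against truestreet
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"                     # Grafana UI; add Prometheus as the data source
```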