System monitoring at scale: what a lot of work
Recently I started a small adventure to implement the following process:
- Log device statistics across a range of metrics such as CPU, memory, network and disk I/O.
- Aggregate these in a central place
- Run forecasting / anomaly detection on this data
The motivation was the relatively recent SSH backdoor that was almost not caught. It was caught only because a developer who was extensively testing a PostgreSQL lab noticed a sudden, significant increase in SSH login time.
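To make the three steps listed above concrete, here is a minimal sketch of the idea in Python, using only the standard library. It is not the implementation from this project: the `MetricStore` class, the metric name `ssh_login_seconds`, the sample values and the z-score threshold are all assumptions chosen for illustration. It keeps a rolling window of samples per device and metric in one central place and flags a new sample that sits far from the recent mean.

```python
from collections import defaultdict, deque
from statistics import mean, stdev


class MetricStore:
    """Hypothetical central store: (device, metric) -> rolling window of samples."""

    def __init__(self, window=30):
        # Each series keeps only the most recent `window` samples.
        self.series = defaultdict(lambda: deque(maxlen=window))

    def ingest(self, device, metric, value, threshold=3.0, min_samples=10):
        """Record a sample; return True if it deviates strongly from the window."""
        history = self.series[(device, metric)]
        anomalous = False
        if len(history) >= min_samples:
            mu = mean(history)
            sigma = stdev(history)
            # A perfectly flat history (sigma == 0) is skipped to avoid
            # dividing by zero; a real system would need a better rule.
            anomalous = sigma > 0 and abs(value - mu) / sigma > threshold
        # Store the sample after scoring it, so it cannot mask itself.
        history.append(value)
        return anomalous


# Made-up SSH login times in seconds, with a sudden jump at the end.
samples = [0.30, 0.31, 0.29, 0.30, 0.32, 0.28,
           0.30, 0.31, 0.29, 0.30, 0.31, 0.80]

store = MetricStore()
flags = [store.ingest("host-a", "ssh_login_seconds", v) for v in samples]
print(flags)
```

A rolling z-score is about the simplest detector that could work here. It ignores seasonality and trends, which is where proper forecasting models come in.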