Sharding Prometheus using service discovery
Thanos is great for querying across multiple Prometheus instances and needs very little maintenance after the initial setup, but it is still not horizontally scalable, at least not out of the box. Although the querier and store components are scalable, a significant amount of effort is spent on setting up individual Prometheus instances to scrape the targets. This means setting up a new Prometheus instance every time a major metrics component is introduced into the infrastructure, like cAdvisor, node exporter, Envoy, or application-level proxies like redis-proxy (Predixy), mysql-proxy (ProxySQL), etc.
These components are generally expected to be isolated from each other, as shown below.
But even then, every time the business scales, all three components scale along with it, which eventually leads to sharding those components manually. This means we have to manually figure out how many metrics each target generates and try to balance the targets across the shards.
To improve this, we can shard them automatically by tweaking the service discovery part of Prometheus. Prometheus supports custom service discovery through file-based service discovery, so we can write our own service discovery that balances the targets automatically between different shards, i.e. between cadvisor-1 and cadvisor-2.

As shown above, the service discovery runs elsewhere and distributes the Prometheus target configuration via EFS, and each of the Prometheus shards scrapes its targets using file_sd_config. This has already reduced a lot of developer hours previously spent on manually sharding and balancing the targets, and since Thanos can aggregate the metrics from multiple Prometheus instances, there is no effect on querying either.
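To make this concrete, below is a minimal sketch of what such a file-based discovery shard writer could look like. The shard count, the EFS mount path, the cadvisor file names, and the hard-coded task ARNs are all illustrative assumptions, not the actual prometheus-ecs-discovery implementation.

```go
// Minimal sketch: hash each ECS task ARN to a shard and write one
// file_sd_config JSON file per shard onto a shared EFS mount.
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
	"os"
)

// targetGroup mirrors the JSON structure Prometheus expects from file_sd_config.
type targetGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

// shardFor maps a task ARN to a shard index by hashing it.
func shardFor(taskARN string, shardCount int) int {
	h := fnv.New32a()
	h.Write([]byte(taskARN))
	return int(h.Sum32() % uint32(shardCount))
}

func main() {
	const shardCount = 2              // e.g. cadvisor-1 and cadvisor-2
	const outDir = "/mnt/efs/file_sd" // assumed EFS mount shared with the Prometheus shards

	// In the real discovery loop these would come from the ECS API
	// (ListTasks/DescribeTasks); hard-coded here to keep the sketch self-contained.
	discovered := map[string]string{ // task ARN -> scrape address
		"arn:aws:ecs:ap-south-1:123456789012:task/cluster/abc123": "10.0.1.12:9779",
		"arn:aws:ecs:ap-south-1:123456789012:task/cluster/def456": "10.0.2.34:9779",
	}

	// Bucket each target into a shard by hashing its task ARN.
	groups := make([][]targetGroup, shardCount)
	for arn, addr := range discovered {
		i := shardFor(arn, shardCount)
		groups[i] = append(groups[i], targetGroup{
			Targets: []string{addr},
			Labels:  map[string]string{"task_arn": arn},
		})
	}

	// Write one file_sd JSON file per shard; shard N scrapes only its own file.
	for i, g := range groups {
		out, err := json.MarshalIndent(g, "", "  ")
		if err != nil {
			panic(err)
		}
		path := fmt.Sprintf("%s/cadvisor-shard-%d.json", outDir, i)
		if err := os.WriteFile(path, out, 0o644); err != nil {
			panic(err)
		}
	}
}
```

Each Prometheus shard then points its file_sd_config at only its own file (e.g. cadvisor-shard-0.json in this sketch), so adding a shard is just a matter of bumping the shard count and bringing up one more Prometheus instance.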
Implementation
zomato/prometheus-ecs-discovery#8
Currently, the above sharding technique is implemented for ECS, but it can easily be extended to other types of discovery like EKS.
Future
In the future, we are planning to extend this to other service discovery mechanisms like EKS, EC2, etc. The current sharding hashes the task ARN, which is randomly generated. This distributes the targets evenly across the shards, but it has a few side effects: aggregate queries, e.g. the average of all 5xx for a service, span multiple Prometheus instances, which adds work on the Thanos querier to further aggregate the data fetched from each of the shards.

If our hashing were based on the service name instead of the random task ID, all of a service's metrics would end up on a single shard, reducing the aggregation overhead in the querier. But even such service-name-based sharding is not truly horizontally scalable, because a big service can create unequal utilization across shards. So instead of a hashing mechanism, a better approach would be to sort services by name and distribute the targets equally between the shards, but then auto-scaling of a service would shuffle targets between the Prometheus shards, which would be bad. Sharding can also be done at the metric level instead of the target level, which adds a bit more complexity by introducing another proxy, but provides a more granular sharding approach. Even though we already have a vertical auto-scaler for Prometheus, in the future we plan to scale horizontally beyond a maximum vertical scaling threshold to further cut down the management cost involved in monitoring.
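For illustration, the difference between the two hashing strategies discussed above is essentially just the hash key. The sketch below shows a hypothetical service-name-based variant; the function and service names are made up for the example and are not part of the current implementation.

```go
// Hypothetical variant: key the hash on the ECS service name instead of the
// task ARN, so every task of a service lands on the same Prometheus shard,
// at the risk of uneven shard sizes for big services.
package main

import (
	"fmt"
	"hash/fnv"
)

func shardByService(serviceName string, shardCount int) int {
	h := fnv.New32a()
	h.Write([]byte(serviceName))
	return int(h.Sum32() % uint32(shardCount))
}

func main() {
	// All tasks of one service map to the same shard, so an aggregate like the
	// average 5xx rate of that service can be answered from a single shard.
	fmt.Println(shardByService("checkout-service", 2))
	fmt.Println(shardByService("payments-service", 2))
}
```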
So even though the current implementation is not the ideal way to shard Prometheus, it can serve as a base framework for building one, and it already solves an important problem for our team.