I have multiple Prometheus instances reporting to my AM and have a Dead Man's Switch alert defined for all of them. When a single Prometheus goes offline, Karma is effectively disabled: all I can see is the warning about the one offline instance, and I cannot dismiss it to see the alerts being reported by the other, healthy Prometheus instances. In the screenshot below you can see there are 3 other alerts being reported, but there is no way to see them (these are not the healthcheck alerts). Is this intended?
Cannot reproduce. If I configure all alertmanagers in karma to have a healthcheck with an invalid filter, I do get toast messages with errors, but I still see all alerts.
It seems that you're getting this error while karma is unable to talk to your alertmanager. What does your karma log say when it happens?
I am running into a similar situation: a single AlertManager fed by multiple Prometheus instances, each running a dead man's switch alert (used as separate health checks in karma). I think this line: https://github.com/prymitive/karma/blob/main/ui/src/Components/Grid/index.tsx#L31 might prevent the alert grid from being displayed if one of the dead man's switches fails the (single) AlertManager's health checks.
I'm not sure of the best general-purpose change (if any): a health check has failed, but many other Prometheus instances are still sending alerts, so that AlertManager is still functioning. Just wanted to clarify the situation above (if I am not off track).
So the health check in karma works by checking if there's at least one alert matching the given filter.
So if you have multiple Prometheus servers and at least one of them is still alive and sending that alert, then it's fine.
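For example, a single shared health check that matches only the alert name will keep passing as long as *any* Prometheus is still delivering that alert (server name and URI below are illustrative, following the `healthcheck: filters:` shape from karma's configuration docs):

```yaml
alertmanager:
  servers:
    - name: central                       # illustrative server name
      uri: https://alertmanager.example.com
      healthcheck:
        filters:
          # One shared check: passes while ANY Prometheus still
          # delivers a DeadMansSwitch alert to this AlertManager.
          deadmansswitch:
            - alertname=DeadMansSwitch
```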
If you see health check popups when only one Prometheus is down, then that suggests either:
a) something isn't working and other Prometheus servers are not sending those alerts
b) your health check is only matching alerts generated by that specific Prometheus
What does your alert rule look like, and what's your health check definition? Do you have a health check per Prometheus by any chance?
I believe we fall into case b) above, where we have a specific match per Prometheus (using a label match). We have a central AlertManager (in one cloud provider) receiving alerts from multiple deployments of our software in other cloud providers' clusters. We are using the healthcheck to signal a failure to communicate with an individual Prometheus instance, as it covers its deployment (maybe not an expected use case).
```yaml
healthcheck:
  filters:
    X:
      - alertname=DeadMansSwitch
      - deployment=X
    Y:
      - alertname=DeadMansSwitch
      - deployment=Y
```
So I forked and made this change: ddowker#1 (and built an image) to let the rest of the alerts for that single AlertManager still be shown when the errors are health-check related. It is not a general (quality) fix and is specific to our use case.
Overall I do not see this as a bug but potentially another use case (or feature request). It did seem that the example in the documentation https://github.com/prymitive/karma/blob/main/docs/CONFIGURATION.md?plain=1#L308 was a bit similar to what I was doing (but maybe the 'instance' line acts differently).
I am fine with whatever you want to do with this issue. Just wanted to point out how the originator may have fallen into this situation.
So the goal of a Dead Man's Switch is to alert when all alerting is down; it's there to tell you "there are no alerts because whatever generates alerts seems to be down", rather than to monitor each individual Prometheus.
Ideally you should have cross-Prometheus monitoring so that, as long as you have at least one Prometheus running, it can alert about downtime of the other Prometheus instances.
And you would use the Dead Man's Switch in karma to tell when all Prometheus servers are down.
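A minimal sketch of such a cross-monitoring rule, assuming each Prometheus scrapes its peers under a job named `prometheus-peers` (the job name, duration, and labels are all assumptions, not part of this thread):

```yaml
groups:
  - name: prometheus-meta
    rules:
      - alert: PrometheusPeerDown
        # Fires on any surviving Prometheus when a peer stops
        # answering scrapes; job name "prometheus-peers" is assumed.
        expr: up{job="prometheus-peers"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus peer {{ $labels.instance }} is down"
```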
That being said, we could change the logic to only show `<FatalError />` if health checks are showing errors AND there are no alerts in alertmanager.
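A minimal sketch of that condition (the function and parameter names here are hypothetical, not karma's actual code):

```typescript
// Hypothetical helper: decide whether karma should replace the whole
// grid with a fatal error screen. When there are still alerts to show,
// health check failures would surface as non-blocking toasts instead.
function shouldShowFatalError(
  healthcheckErrors: string[],
  totalAlerts: number
): boolean {
  return healthcheckErrors.length > 0 && totalAlerts === 0;
}
```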
That makes sense.
In our case the remote Prometheus instances run in different customer clusters and their alerts are sent up to our single AlertManager (to get a global view of the health of the deployments). Per customer we could have cross-Prometheus monitoring, but in our setup there will still be a number of isolated islands (one per customer) showing up under the one central AlertManager's filter rules. I think your potential change would handle our scenario.