Can't view alerts in the presence of any Healthcheck failure

This issue has been tracked since 2022-05-16.

I have multiple Prometheus instances reporting to my AM and have a Dead Man's Switch alert defined for all of them. When a single Prometheus goes offline, Karma is effectively disabled: all I can see is the warning about the one offline instance, and I cannot dismiss it to see the alerts being reported by the other, healthy Prometheus instances. In the screenshot below you can see there are 3 other alerts being reported (these are not the healthcheck alerts), but there is no way to see them. Is this intended?

[Screenshot: Screen Shot 2022-05-16 at 8.01.54 AM]

prymitive wrote this answer on 2022-05-18

Cannot reproduce. If I configure all Alertmanagers in karma to have a healthcheck with an invalid filter, I do get toast messages with errors, but I still see all alerts.
It seems that you're getting this error while karma is unable to talk to your Alertmanager. What does your karma log say when it happens?

twbecker wrote this answer on 2022-05-18

No errors of any kind in the log. To be clear, this Karma instance is only talking to a single Alertmanager. That Alertmanager has multiple Prometheus instances reporting to it, none of which are firing the Dead Man's Switch.

prymitive wrote this answer on 2022-05-18

Anything logged in your browser console when you open developer tools?

twbecker wrote this answer on 2022-05-19

Nope, zero.

ddowker wrote this answer on 2022-08-09

I am running into a similar situation: a single Alertmanager being fed from multiple Prometheus instances, all running Dead Man's Switch alerts (used as separate health checks in karma). I think this line: might prevent the alert grid from being displayed if one of the dead man switches fails the (single) Alertmanager's health checks.

I'm not sure of the best general-purpose change (if any). A health check has failed, but many other Prometheus instances are still sending alerts, so that Alertmanager is still functioning. Just wanted to clarify the situation above (if I am not off track).

prymitive wrote this answer on 2022-08-10

So the health check in karma works by checking whether there's at least one alert matching a given filter.
So if you have multiple Prometheus servers and at least one of them is still alive and sending that alert, then it's fine.
If you see health check popups when only one Prometheus is down, that suggests either:
a) something isn't working and the other Prometheus servers are not sending those alerts, or
b) your health check is only matching alerts generated by that specific Prometheus

What does your alert rule look like, and what's your health check definition? Do you have a health check per Prometheus by any chance?
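For contrast, a single shared health check that matches the Dead Man's Switch alert from any Prometheus would only fail once every instance stops sending it. A sketch of such a karma healthcheck config (the filter name "prometheus" is illustrative, not from this thread):

```yaml
healthcheck:
  filters:
    prometheus:
      - alertname=DeadMansSwitch
```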

ddowker wrote this answer on 2022-08-10

I believe we fall into case b) above: we have a specific match per Prometheus (using a label match). We have a central Alertmanager (in one cloud provider) receiving alerts from multiple deployments of our software in other cloud providers' clusters. We are using the healthcheck to signal the failure to communicate with an individual Prometheus instance as it covers its deployment (maybe not an expected use case).

healthcheck:
  filters:
    X:
      - alertname=DeadMansSwitch
      - deployment=X
    Y:
      - alertname=DeadMansSwitch
      - deployment=Y

So I forked and made this change: ddowker#1 (and made an image) to allow errors that are health check related while still showing the rest of the alerts for that single Alertmanager. It is not a general (quality) fix and is specific to our use case.

Overall I do not see this as a bug but potentially another use case (or feature request). It did seem that the example in the documentation was a bit similar to what I was doing (but maybe the 'instance' line acts differently).

I am fine with whatever you want to do with this issue. Just wanted to point out how the originator may have fallen into this situation.

prymitive wrote this answer on 2022-08-10

So the goal of a Dead Man's Switch is to alert when all alerting is down; it's there to tell you "there are no alerts because whatever generates alerts seems to be down", rather than to be a monitor for each individual Prometheus.

Ideally you should have cross-Prometheus monitoring so that, as long as you have at least one Prometheus running, they can alert about downtime of the other Prometheus instances.
And you would use the Dead Man's Switch in karma to tell when all Prometheus servers are down.
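Such cross-Prometheus monitoring can be sketched as an ordinary Prometheus alerting rule (assuming each Prometheus scrapes its peers under a scrape job named "prometheus"; all names here are illustrative, not from this thread):

```yaml
groups:
  - name: meta-monitoring
    rules:
      # Fires on any Prometheus whose peers can no longer scrape it.
      - alert: PrometheusPeerDown
        expr: up{job="prometheus"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus {{ $labels.instance }} is down"
```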

That being said, we could change the logic to only show <FatalError /> if health checks are showing errors AND there are no alerts in Alertmanager.
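The proposed condition could be sketched roughly as follows (a hypothetical TypeScript helper, not karma's actual code; the type and function names are made up for illustration):

```typescript
// Hypothetical sketch: show the fatal error screen only when health
// checks are failing AND the upstream returned zero alerts; otherwise
// keep rendering the alert grid and surface failures as toasts.
interface UpstreamStatus {
  healthcheckErrors: string[]; // names of failing health check filters
  totalAlerts: number; // alert count returned by this Alertmanager
}

function shouldShowFatalError(status: UpstreamStatus): boolean {
  return status.healthcheckErrors.length > 0 && status.totalAlerts === 0;
}
```

With this logic, a failing per-Prometheus health check alongside 3 live alerts would no longer blank out the grid.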

ddowker wrote this answer on 2022-08-10

That makes sense.
In our case the remote Prometheus instances run in different customer clusters and send alerts up to our single Alertmanager (to get a global view of the health of the deployments). Per customer we could have cross-Prometheus monitoring, but in our setup there will still be a number of isolated islands (one per customer) that show up under the one central Alertmanager's filter rules. I think your potential change would handle our scenario.

More Details About Repo
Owner Name: prymitive
Repo Name: karma
Full Name: prymitive/karma
Language: TypeScript
Created Date: 2018-09-09
Updated Date: 2023-03-17
Star Count: 1921
Watcher Count: 33
Fork Count: 166
Issue Count: 2