No metrics for probeVersion failures (bug ?)

This issue has been tracked since 2022-03-11.

I sometimes get 503 errors in the logs:

karma-5666566999-n28pv karma level=error msg="Request failed" error="request to https://***redacted***alertmanager:9093/metrics failed with 503 Service Unavailable" alertmanager=***redacted*** uri=https://***redacted***alertmanager:9093/

It seems that the error message comes from /internal/alertmanager/models.go#L92

And probeVersion is called at /internal/alertmanager/models.go#L366.

Question 1: could you confirm this?


In this code, I also noticed that when an error occurs, probeVersion returns "" with some logging, but:

  • fetching the status (line 379) is not blocked. How can it work if a 503 error was returned when trying to retrieve the Alertmanager version?
  • there is no metric to show that probing the version failed.

Question 2 / Bug?: when probing the version fails but Karma goes on to retrieve silences and alerts, is this a bug?

Question 3 / Feature request: could you create a metric that shows when probing the Alertmanager version failed? Shouldn't it stop at line 370?

For this feature request, maybe you could create a metric named karma_alertmanager_probed_version, with the version as a label and the value set to 1, or 0 if something failed?
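
To make the proposal concrete, here is a minimal stdlib-only sketch of what one sample of such a metric could look like in Prometheus text exposition format. The function name probeGauge and the "alertmanager" label are assumptions made for illustration; they are not taken from the karma code base, and a real implementation would more likely register a gauge through the Prometheus Go client.

```go
package main

import (
	"errors"
	"fmt"
)

// probeGauge renders one sample of the proposed
// karma_alertmanager_probed_version metric (hypothetical, not part of karma).
// The value is 1 when a version was detected and 0 when probing failed,
// in which case the version label is left empty.
func probeGauge(alertmanager, version string, probeErr error) string {
	value := 1
	if probeErr != nil || version == "" {
		version = ""
		value = 0
	}
	// %q yields a double-quoted string, which matches the label value
	// syntax for plain ASCII values like the ones used here.
	return fmt.Sprintf(
		"karma_alertmanager_probed_version{alertmanager=%q,version=%q} %d",
		alertmanager, version, value,
	)
}

func main() {
	fmt.Println(probeGauge("am1", "0.23.0", nil))
	fmt.Println(probeGauge("am1", "", errors.New("503 Service Unavailable")))
}
```

With this shape, a failed probe shows up as a series with value 0, which is what an alert would match on.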

ngc104 wrote this answer on 2022-05-11

Hello,

Any ideas about my questions/feature request?

prymitive wrote this answer on 2022-05-18

It's not a bug: if karma cannot detect the alertmanager version, it assumes the latest compatible version.
What do you need a metric for? It sounds like your alertmanager (or whatever it's behind) is failing with 503.
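
The fallback described above can be sketched roughly as follows. This is not karma's actual code: effectiveVersion, the probe callback and the "latest" placeholder are hypothetical names used only to illustrate why collection continues after a failed probe.

```go
package main

import (
	"errors"
	"fmt"
)

// effectiveVersion returns the probed version when the probe succeeds,
// and the fallback otherwise. The boolean reports whether probing worked,
// so the caller can log the failure without stopping the collection.
func effectiveVersion(probe func() (string, error), fallback string) (string, bool) {
	version, err := probe()
	if err != nil || version == "" {
		return fallback, false
	}
	return version, true
}

func main() {
	// Simulate a probe that fails the way the log line in this issue does.
	failing := func() (string, error) {
		return "", errors.New("request failed with 503 Service Unavailable")
	}
	version, ok := effectiveVersion(failing, "latest")
	fmt.Println(version, ok)
}
```

Under this model a 503 on the version probe only changes which version is assumed; fetching alerts and silences is attempted either way.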

ngc104 wrote this answer on 2022-05-18

I have not had 503 errors for a while.

When I wrote this issue, maybe there was a problem with Alertmanager that I could not reproduce at the time, and that blocked Karma from retrieving the version but not the alerts and silences.

As you say, Karma assumes the latest compatible version when it cannot retrieve the Alertmanager version. This is why Karma was still working and I noticed nothing but the log with the 503 error.

No problems for a while: should we close this issue?

About the feature request and the metric: it could be a counter that increments every time Karma fails to connect to Alertmanager. A native metric would be easier than creating a custom metric with Promtail matching on 503. But I have had no problems for a while: do I still need it? I don't know...

prymitive wrote this answer on 2022-05-18

There are karma_alertmanager_errors_total & karma_alertmanager_up metrics already exported.
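
As a rough illustration of how these two metrics could be inspected, the sketch below pulls every sample of a given metric name out of a Prometheus text exposition payload. The helper name samples and the label sets in the example payload are assumptions, not copied from a real karma instance, and the parser deliberately ignores optional timestamps.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// samples returns every sample of the given metric name found in a
// Prometheus text exposition payload, keyed by the full series string.
// Assumption: lines have the form "name{labels} value" with no timestamp.
func samples(exposition, name string) map[string]float64 {
	out := map[string]float64{}
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		idx := strings.LastIndex(line, " ")
		if idx < 0 {
			continue
		}
		series, raw := line[:idx], line[idx+1:]
		if series != name && !strings.HasPrefix(series, name+"{") {
			continue // a different metric
		}
		value, err := strconv.ParseFloat(raw, 64)
		if err != nil {
			continue
		}
		out[series] = value
	}
	return out
}

func main() {
	// Hypothetical payload; label names and values are illustrative only.
	payload := `# TYPE karma_alertmanager_up gauge
karma_alertmanager_up{alertmanager="am1"} 0
karma_alertmanager_errors_total{alertmanager="am1",endpoint="alerts"} 3
`
	for series, value := range samples(payload, "karma_alertmanager_up") {
		if value == 0 {
			fmt.Println("down:", series)
		}
	}
}
```

In practice the same check is usually written as an alerting rule on karma_alertmanager_up == 0 or on the rate of karma_alertmanager_errors_total rather than in application code.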

ngc104 wrote this answer on 2022-05-18

Thanks, I'll give these metrics a try.

More Details About Repo
Owner Name prymitive
Repo Name karma
Full Name prymitive/karma
Language TypeScript
Created Date 2018-09-09
Updated Date 2023-03-17
Star Count 1921
Watcher Count 33
Fork Count 166
Issue Count 2
