I sometimes get 503 errors in the logs:
karma-5666566999-n28pv karma level=error msg="Request failed" error="request to https://***redacted***alertmanager:9093/metrics failed with 503 Service Unavailable" alertmanager=***redacted*** uri=https://***redacted***alertmanager:9093/
It seems that this error message comes from /internal/alertmanager/models.go#L92.
probeVersion is called at /internal/alertmanager/models.go#L366.
Question 1: could you confirm this?
In this code, I also notice that when an error occurs, probeVersion logs it and returns "", but:
Question 2 / Bug?: when probing the version fails, Karma still goes on retrieving silences and alerts. Is this a bug? Shouldn't it stop at line 370?
Question 3 / Feature request: could you create a metric that shows when probing the Alertmanager version failed?
For this feature request, maybe you could create a metric named karma_alertmanager_probed_version, with the version as a label and the value set to 0 if something failed?
I have not had 503 errors for a while.
When I wrote this issue, maybe there was a problem with Alertmanager that I could not reproduce at the time; it affected Karma's version probe but not the retrieval of alerts and silences.
As you say, Karma assumes the latest compatible version when it cannot retrieve the Alertmanager version. This is why Karma kept working and I noticed nothing except the 503 error in the log.
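The fallback described here can be illustrated with a short sketch. This is not Karma's Go implementation: `fetch`, `probe_version`, `effective_version` and `LATEST_SUPPORTED_VERSION` are hypothetical names, and the constant is only a placeholder for "the latest version Karma supports". The point is that a failed probe is logged and treated as non-fatal, so alerts and silences are still fetched.

```python
import logging

# Hypothetical placeholder for "the latest Alertmanager version Karma supports".
LATEST_SUPPORTED_VERSION = "latest"

log = logging.getLogger("karma-sketch")


def probe_version(fetch) -> str:
    """Return the Alertmanager version, or "" if the probe fails (after logging)."""
    try:
        return fetch()
    except Exception as err:  # broad on purpose in this sketch, e.g. an HTTP 503
        log.error("Request failed: %s", err)
        return ""


def effective_version(fetch) -> str:
    """Non-fatal probe: fall back to the latest supported version on failure."""
    return probe_version(fetch) or LATEST_SUPPORTED_VERSION


def failing_fetch() -> str:
    # Simulates the upstream answering with 503 Service Unavailable.
    raise RuntimeError("503 Service Unavailable")


print(effective_version(lambda: "0.25.0"))
print(effective_version(failing_fetch))
```

Under these assumptions the only visible trace of a failed probe is the log line, which matches what was observed in this issue.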
Since there has been no problem for a while, should we close this issue?
About the feature request: the metric could be a counter that increments every time Karma fails to connect to Alertmanager. A native metric would be easier than building a custom metric with Promtail matching on 503 in the logs. But since I have had no problems for a while, do I still need it? I don't know...
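A counter variant of the idea could look roughly like this sketch. The metric name `karma_alertmanager_probe_failures_total` and the `alertmanager` label are hypothetical, chosen only for illustration; whether Karma already exports something equivalent should be checked against its documentation. The counter only ever goes up, and it is rendered in the Prometheus text exposition format.

```python
from __future__ import annotations

import threading
from collections import defaultdict


def _escape(value: str) -> str:
    # Exposition format: escape backslash first, then double quote and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


class FailureCounter:
    """Monotonic per-Alertmanager failure counter (hypothetical metric)."""

    NAME = "karma_alertmanager_probe_failures_total"  # hypothetical metric name

    def __init__(self) -> None:
        self._lock = threading.Lock()  # requests may fail concurrently
        self._counts: dict[str, int] = defaultdict(int)

    def inc(self, alertmanager: str) -> None:
        # Called once per failed request to the given Alertmanager.
        with self._lock:
            self._counts[alertmanager] += 1

    def render(self) -> str:
        with self._lock:
            snapshot = dict(self._counts)
        lines = [f"# TYPE {self.NAME} counter"]
        for name in sorted(snapshot):
            lines.append(f'{self.NAME}{{alertmanager="{_escape(name)}"}} {snapshot[name]}')
        return "\n".join(lines) + "\n"


counter = FailureCounter()
counter.inc("am1")
print(counter.render())
```

With such a counter, an alert could use a PromQL expression like `increase(karma_alertmanager_probe_failures_total[15m]) > 0` instead of a Promtail rule matching "503" in the log text.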