Alertmanager sends resolved notification when problem not solved #3871

Open · ysfnsrv opened this issue Jun 11, 2024 · 3 comments

ysfnsrv commented Jun 11, 2024

I want to monitor the status of Docker containers.
The problem is as follows: I stop a test Docker container and get a notification in Slack that there is a stopped container. Great!
But after exactly 5 minutes I get a message that the problem is resolved, as if the container were running again.

If I change group_interval: 5m to group_interval: 15m in alertmanager.yml, I get the same wrong notification, only now after 15 minutes.
Confused, I commented out the group_interval: 15m line, and the notification came after 5 minutes again.
The point is that the Docker container is still stopped and has not been started again, yet for some reason this erroneous "resolved" notification arrives.

Translated with DeepL.com (free version)

APP 12:47 PM
[FIRING:1] ContainerKilled (NGINX Docker Maintainers [email protected] /docker/084586b71d3605ea6657d2cb4530348438226d14af7d0a563427bb8bc6a51e46 nginx 192.168.100.1:8080 cadvisor-intra myngin5 warning)
12:52
[RESOLVED] ContainerKilled (NGINX Docker Maintainers [email protected] /docker/084586b71d3605ea6657d2cb4530348438226d14af7d0a563427bb8bc6a51e46 nginx 192.168.100.1:8080 cadvisor-intra myngin5 warning)

  • System information:

    Linux 6.5.0-1022-azure x86_64

  • Alertmanager version:

alertmanager, version 0.23.0 (branch: debian/sid, revision: 0.23.0-4ubuntu0.2)
build user: [email protected]
build date: 20230502-12:28:45
go version: go1.18.1
platform: linux/amd64

  • Prometheus version:

prometheus, version 2.31.2+ds1 (branch: debian/sid, revision: 2.31.2+ds1-1ubuntu1.22.04.2)
build user: [email protected]
build date: 20230502-12:17:56
go version: go1.18.1
platform: linux/amd64

  • Alertmanager configuration file:
route:
  receiver: 'slack-notifications'
  group_by: ['alertname']
  group_wait: 30s
  #group_interval: 15m
  repeat_interval: 1h

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxxxxxx/xxxx'
        channel: '#alerts'
        send_resolved: true

templates:
  - '/etc/prometheus/alertmanager_templates/prod.tmpl'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

  • Prometheus configuration file:

global:
  external_labels:
    monitor: ''

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - "/etc/prometheus/rules/prod.yml"

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 60s
    scrape_timeout: 60s
  - job_name: 'cadvisor-intra'
    static_configs:
      - targets: ['192.168.100.1:8080']

  • Prometheus rule file (/etc/prometheus/rules/prod.yml):
groups:
  - name: ContainerHealthAlerts
    rules:
      - alert: CadvisorContainerDown
        expr: up{job="cadvisor"} == 0
        labels:
          severity: 'critical'
        annotations:
          summary: 'Alert: Cadvisor container is down'
          description: 'The Cadvisor container is down or not responding.'

      - alert: ContainerKilled
        expr: 'time() - container_last_seen > 60'
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: Container killed (instance {{ $labels.instance }})
          description: "A container has disappeared\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

      - alert: ContainerAbsent
        expr: 'absent(container_last_seen)'
        for: 7m
        labels:
          severity: warning
        annotations:
          summary: Container absent (instance {{ $labels.instance }})
          description: "A container is absent for 7 min\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

      - alert: ContainerHighMemoryUsage
        expr: '(sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80'
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Container High Memory usage (instance {{ $labels.instance }})
          description: "Container Memory usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

      - alert: ContainerHighThrottleRate
        expr: 'rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1'
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Container high throttle rate (instance {{ $labels.instance }})
          description: "Container is being throttled\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

      - alert: ContainerLowCpuUtilization
        expr: '(sum(rate(container_cpu_usage_seconds_total{name!=""}[3m])) BY (instance, name) * 100) < 20'
        for: 7d
        labels:
          severity: info
        annotations:
          summary: Container Low CPU utilization (instance {{ $labels.instance }})
          description: "Container CPU utilization is under 20% for 1 week. Consider reducing the allocated CPU.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

      - alert: ContainerLowMemoryUsage
        expr: '(sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) < 20'
        for: 7d
        labels:
          severity: info
        annotations:
          summary: Container Low Memory usage (instance {{ $labels.instance }})
          description: "Container Memory usage is under 20% for 1 week. Consider reducing the allocated memory.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  • Logs:
Jun 11 12:46:36 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:46:36.395Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=ContainerKilled[5029f3e][active]
Jun 11 12:47:06 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:47:06.396Z caller=dispatch.go:516 level=debug component=dispatcher aggrGroup="{}:{alertname=\"ContainerKilled\"}" msg=flushing alerts=[ContainerKilled[5029f3e][active]]
Jun 11 12:48:16 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:48:16.391Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=ContainerKilled[5029f3e][active]
Jun 11 12:49:56 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:49:56.392Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=ContainerKilled[5029f3e][active]
Jun 11 12:50:46 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:50:46.393Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=ContainerKilled[5029f3e][resolved]
Jun 11 12:52:06 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:52:06.396Z caller=dispatch.go:516 level=debug component=dispatcher aggrGroup="{}:{alertname=\"ContainerKilled\"}" msg=flushing alerts=[ContainerKilled[5029f3e][resolved]]
@grobinson-grafana (Contributor)

Your logs show that the alert was resolved, and this resolved alert was sent to the Alertmanager, which then sent a resolved notification.

Jun 11 12:50:46 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:50:46.393Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=ContainerKilled[5029f3e][resolved]

You need to understand why your alert resolved by looking at the query. I'm afraid this isn't an issue with Alertmanager.
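
A quick way to confirm this from the Prometheus side is to evaluate the alert expression and the built-in ALERTS series in the expression browser. This is a minimal debugging sketch, not part of the original reply; the name="myngin5" selector is assumed from the container labels in the Slack message above.

```
# If this returns no samples, the alert condition is no longer true for the
# stopped container, which is exactly what makes Prometheus send "resolved":
time() - container_last_seen{name="myngin5"} > 60

# Prometheus exposes active alerting rules as the ALERTS series; once the
# alert resolves, this series simply stops being returned:
ALERTS{alertname="ContainerKilled"}
```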

@ysfnsrv (Author) commented Jun 12, 2024

Yes, I totally agree with you; that is why I don't understand what the problem is... I stopped the Docker container and didn't start it again.
So I don't understand why this notification arrives, and what sends the message that the problem is solved.

@grobinson-grafana (Contributor)

You need to look at your ContainerKilled alert in Prometheus and understand why it resolved. My guess is that the query time() - container_last_seen > 60 no longer returned a result (i.e. evaluated to false). If you need additional help, you can ask on the prometheus-users mailing list.
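
A minimal sketch of that check, assuming the container name label from the Slack message (hypothetical queries, not part of the original reply):

```
# Does the series still exist at all? If this returns nothing, the metric has
# disappeared from the scrape, the `> 60` comparison matches no samples, and
# the alert resolves on its own:
container_last_seen{name="myngin5"}

# The value the alert expression compares against 60, without the threshold:
time() - container_last_seen{name="myngin5"}
```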
