Alertmanager sends resolved notification when problem not solved #3871

Open · ysfnsrv opened this issue Jun 11, 2024 · 3 comments

ysfnsrv commented Jun 11, 2024

I want to monitor the status of Docker containers.
The problem is as follows: I stop a test Docker container and get a notification in Slack that there is a stopped container. Great!
But after exactly 5 minutes I get a message that the problem is resolved, as if the container were running again.

If I change group_interval: 5m to group_interval: 15m in alertmanager.yml, I get the same wrong notification, only now after 15 minutes.
Confused, I commented out the group_interval: 15m line, and the notification came after 5 minutes again.
The point is that the Docker container is still stopped and has not been started again, yet for some reason this erroneous "resolved" notification arrives.

Translated with DeepL.com (free version)

APP 12:47 PM
[FIRING:1] ContainerKilled (NGINX Docker Maintainers [email protected] /docker/084586b71d3605ea6657d2cb4530348438226d14af7d0a563427bb8bc6a51e46 nginx 192.168.100.1:8080 cadvisor-intra myngin5 warning)
12:52
[RESOLVED] ContainerKilled (NGINX Docker Maintainers [email protected] /docker/084586b71d3605ea6657d2cb4530348438226d14af7d0a563427bb8bc6a51e46 nginx 192.168.100.1:8080 cadvisor-intra myngin5 warning)

  • System information:

    Linux 6.5.0-1022-azure x86_64

  • Alertmanager version:

alertmanager, version 0.23.0 (branch: debian/sid, revision: 0.23.0-4ubuntu0.2)
build user: [email protected]
build date: 20230502-12:28:45
go version: go1.18.1
platform: linux/amd64

  • Prometheus version:

prometheus, version 2.31.2+ds1 (branch: debian/sid, revision: 2.31.2+ds1-1ubuntu1.22.04.2)
build user: [email protected]
build date: 20230502-12:17:56
go version: go1.18.1
platform: linux/amd64

  • Alertmanager configuration file:
route:
  receiver: 'slack-notifications'
  group_by: ['alertname']
  group_wait: 30s
  #group_interval: 15m
  repeat_interval: 1h

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxxxxxx/xxxx'
        channel: '#alerts'
        send_resolved: true

templates:
  - '/etc/prometheus/alertmanager_templates/prod.tmpl'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

  • Prometheus configuration file:

global:
  external_labels:
    monitor: ''

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - "/etc/prometheus/rules/prod.yml"

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 60s
    scrape_timeout: 60s
  - job_name: 'cadvisor-intra'
    static_configs:
      - targets: ['192.168.100.1:8080']

  • Prometheus rule file (/etc/prometheus/rules/prod.yml):
groups:
  - name: ContainerHealthAlerts
    rules:
      - alert: CadvisorContainerDown
        expr: up{job="cadvisor"} == 0
        labels:
          severity: 'critical'
        annotations:
          summary: 'Alert: Cadvisor container is down'
          description: 'The Cadvisor container is down or not responding.'

      - alert: ContainerKilled
        expr: 'time() - container_last_seen > 60'
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: Container killed (instance {{ $labels.instance }})
          description: "A container has disappeared\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

      - alert: ContainerAbsent
        expr: 'absent(container_last_seen)'
        for: 7m
        labels:
          severity: warning
        annotations:
          summary: Container absent (instance {{ $labels.instance }})
          description: "A container is absent for 7 min\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

      - alert: ContainerHighMemoryUsage
        expr: '(sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80'
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Container High Memory usage (instance {{ $labels.instance }})
          description: "Container Memory usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

      - alert: ContainerHighThrottleRate
        expr: 'rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1'
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Container high throttle rate (instance {{ $labels.instance }})
          description: "Container is being throttled\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

      - alert: ContainerLowCpuUtilization
        expr: '(sum(rate(container_cpu_usage_seconds_total{name!=""}[3m])) BY (instance, name) * 100) < 20'
        for: 7d
        labels:
          severity: info
        annotations:
          summary: Container Low CPU utilization (instance {{ $labels.instance }})
          description: "Container CPU utilization is under 20% for 1 week. Consider reducing the allocated CPU.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

      - alert: ContainerLowMemoryUsage
        expr: '(sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) < 20'
        for: 7d
        labels:
          severity: info
        annotations:
          summary: Container Low Memory usage (instance {{ $labels.instance }})
          description: "Container Memory usage is under 20% for 1 week. Consider reducing the allocated memory.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  • Logs:
Jun 11 12:46:36 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:46:36.395Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=ContainerKilled[5029f3e][active]
Jun 11 12:47:06 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:47:06.396Z caller=dispatch.go:516 level=debug component=dispatcher aggrGroup="{}:{alertname=\"ContainerKilled\"}" msg=flushing alerts=[ContainerKilled[5029f3e][active]]
Jun 11 12:48:16 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:48:16.391Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=ContainerKilled[5029f3e][active]
Jun 11 12:49:56 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:49:56.392Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=ContainerKilled[5029f3e][active]
Jun 11 12:50:46 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:50:46.393Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=ContainerKilled[5029f3e][resolved]
Jun 11 12:52:06 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:52:06.396Z caller=dispatch.go:516 level=debug component=dispatcher aggrGroup="{}:{alertname=\"ContainerKilled\"}" msg=flushing alerts=[ContainerKilled[5029f3e][resolved]]
@grobinson-grafana (Contributor)

Your logs show that the alert was resolved, and this resolved alert was sent to the Alertmanager, which then sent a resolved notification.

Jun 11 12:50:46 dmp-monitoring prometheus-alertmanager[40364]: ts=2024-06-11T08:50:46.393Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=ContainerKilled[5029f3e][resolved]

You need to understand why your alert resolved by looking at the query. I'm afraid this isn't an issue with Alertmanager.
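
A quick way to confirm this from the Prometheus side is to evaluate the alert expression and the built-in ALERTS series in the expression browser. This is a minimal debugging sketch, not part of the original reply; the name="myngin5" selector is assumed from the container labels in the Slack message above.

```
# If this returns no samples, the alert condition is no longer true for the
# stopped container, which is exactly what makes Prometheus send "resolved":
time() - container_last_seen{name="myngin5"} > 60

# Prometheus exposes active alerting rules as the ALERTS series; once the
# alert resolves, this series simply stops being returned:
ALERTS{alertname="ContainerKilled"}
```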

@ysfnsrv (Author) commented Jun 12, 2024

Yes, I totally agree with you; that is why I don't understand what the problem is... I stopped the Docker container and didn't start it again.
So I don't understand why this notification arrives, and what sends the message that the problem is solved.

@grobinson-grafana (Contributor)

You need to look at your ContainerKilled alert in Prometheus and understand why it resolved. My guess is that the query time() - container_last_seen > 60 no longer returned a result (i.e. evaluated to false). If you need additional help, you can ask on the prometheus-users mailing list.
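
A minimal sketch of that check, assuming the container name label from the Slack message (hypothetical queries, not part of the original reply):

```
# Does the series still exist at all? If this returns nothing, the metric has
# disappeared from the scrape, the `> 60` comparison matches no samples, and
# the alert resolves on its own:
container_last_seen{name="myngin5"}

# The value the alert expression compares against 60, without the threshold:
time() - container_last_seen{name="myngin5"}
```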
