
Different aggregation groups can share the same nflog #3808

Open
grobinson-grafana opened this issue Apr 15, 2024 · 4 comments
@grobinson-grafana (Contributor)

What did you do?

In certain cases, it is possible for two (or more) different aggregation groups to share the same nflog entry. This happens when all of the following conditions hold:

  1. There must be at least two routes with the same matchers and group labels (group_by).
  2. At least one of the routes must have continue: true.
  3. Both routes must use the same receiver.

For example, take the following configuration:

receivers:
  - name: test1
    webhook_configs:
      - url: http://127.0.0.1:8080/test1
  - name: test2
    webhook_configs:
      - url: http://127.0.0.1:8080/test2
route:
  receiver: test1
  group_wait: 15s
  group_interval: 30s
  routes:
    - continue: true
      receiver: test1
      matchers:
      - foo=bar
    - continue: true
      receiver: test1
      matchers:
      - foo=bar

This configuration meets all three conditions, so both aggregation groups share the same nflog entry.

When an Alertmanager is run with this configuration, the msg=flushing and msg="Notify success" lines show that two aggregation groups are created and then flushed:

ts=2024-04-15T08:33:35.749Z caller=dispatch.go:164 level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
ts=2024-04-15T08:33:50.750Z caller=dispatch.go:516 level=debug component=dispatcher aggrGroup="{}/{foo=\"bar\"}:{}" msg=flushing alerts=[[3fff2c2][active]]
ts=2024-04-15T08:33:50.749Z caller=dispatch.go:516 level=debug component=dispatcher aggrGroup="{}/{foo=\"bar\"}:{}" msg=flushing alerts=[[3fff2c2][active]]
ts=2024-04-15T08:33:50.753Z caller=notify.go:863 level=debug component=dispatcher receiver=test1 integration=webhook[0] aggrGroup="{}/{foo=\"bar\"}:{}" alerts=[[3fff2c2][active]] msg="Notify success" attempts=1 duration=2.743458ms
ts=2024-04-15T08:33:50.753Z caller=notify.go:863 level=debug component=dispatcher receiver=test1 integration=webhook[0] aggrGroup="{}/{foo=\"bar\"}:{}" alerts=[[3fff2c2][active]] msg="Notify success" attempts=1 duration=2.85275ms

However, reading the nflog file (after shutdown) shows just one entry on disk (the file is a binary snapshot, so only the group key and receiver name are legible):

N
>
{}/{foo="bar"}:{}
test1webhook*
            ����˗�2	����ދ��
                               ���˗�%

By contrast, when each route uses a different receiver, there are two entries on disk:

N
>
{}/{foo="bar"}:{}
test1webhook*
            ������2	����ދ��
                               �����N
>
{}/{foo="bar"}:{}
test2webhook*
            ���в�2	����ދ��
                               ��в�%

I expect this to cause even more issues if the routes have different timing options (i.e. group_wait, group_interval and repeat_interval) or different active and mute time intervals:

route:
  receiver: test1
  group_wait: 15s
  group_interval: 30s
  routes:
    - continue: true
      receiver: test1
      matchers:
      - foo=bar
    - continue: true
      receiver: test1
      matchers:
      - foo=bar
      group_wait: 30s
      group_interval: 5m
      mute_time_intervals:
        - evenings

What did you expect to see?

I expected to see a separate entry for each aggregation group. Here is the nflog from the first example, but running the code in this branch instead:

P
@
{}/{foo="bar"}/1:{}
test1webhook*
            ���ػ��2	����ދ��
                               Ԯ�ػ��P
@
{}/{foo="bar"}/0:{}
test1webhook*
            ������2	����ދ��
                               Ԯ����%
grobinson-grafana added a commit to grobinson-grafana/alertmanager that referenced this issue Apr 15, 2024
This commit replaces the code in route.Key() with that of route.ID(),
and removes route.ID(). The motivation behind this change is to fix
a number of bugs caused by conflicting group keys such as
"Different aggregation groups can share the same nflog" (prometheus#3808)
and also prevent an issue where groups are incorrectly marked as muted
when they are not.

Signed-off-by: George Robinson <[email protected]>
@grobinson-grafana (Contributor, Author)

This is related to #3817.

@filippog

Hello, I found this issue while investigating why repeat_interval is seemingly not honored. For example, this is the top route in the Alertmanager configuration:

    - match:
        alertname: 'SystemdUnitFailed'
      repeat_interval: 24h
      continue: true

yet SystemdUnitFailed alerts re-notify every 4h. The alert is generic enough that alerts constantly come and go from the group (which is grouped by alertname). My understanding is that repeat_interval is applied to individual alerts and not to the group. Is that the case? At any rate, could this issue be the cause of the behavior above? Thank you!

@grobinson-grafana (Contributor, Author)

I don't think this is the same issue. Repeat interval is applied to the group, as Alertmanager sends notifications for groups.

@filippog

> I don't think this is the same issue. Repeat interval is applied to the group, as Alertmanager sends notifications for groups.

Thank you for the quick reply and confirmation. I'll investigate more and open a new issue as needed.
