
Different aggregation groups can share the same nflog #3808

Open
grobinson-grafana opened this issue Apr 15, 2024 · 4 comments
@grobinson-grafana (Contributor)

What did you do?

In certain cases, it is possible for two (or more) different aggregation groups to share the same nflog entry. This happens when all of the following conditions hold:

  1. There must be at least two routes with the same matchers and group labels (group_by).
  2. At least one of the routes must have continue: true.
  3. Both routes must use the same receiver.

For example, take the following configuration:

receivers:
  - name: test1
    webhook_configs:
      - url: http://127.0.0.1:8080/test1
  - name: test2
    webhook_configs:
      - url: http://127.0.0.1:8080/test2
route:
  receiver: test1
  group_wait: 15s
  group_interval: 30s
  routes:
    - continue: true
      receiver: test1
      matchers:
      - foo=bar
    - continue: true
      receiver: test1
      matchers:
      - foo=bar

This configuration meets all three conditions, so both aggregation groups share the same nflog entry.

When an Alertmanager is run with this configuration, the msg=flushing and msg="Notify success" lines show that two aggregation groups are created and then flushed:

ts=2024-04-15T08:33:35.749Z caller=dispatch.go:164 level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
ts=2024-04-15T08:33:50.750Z caller=dispatch.go:516 level=debug component=dispatcher aggrGroup="{}/{foo=\"bar\"}:{}" msg=flushing alerts=[[3fff2c2][active]]
ts=2024-04-15T08:33:50.749Z caller=dispatch.go:516 level=debug component=dispatcher aggrGroup="{}/{foo=\"bar\"}:{}" msg=flushing alerts=[[3fff2c2][active]]
ts=2024-04-15T08:33:50.753Z caller=notify.go:863 level=debug component=dispatcher receiver=test1 integration=webhook[0] aggrGroup="{}/{foo=\"bar\"}:{}" alerts=[[3fff2c2][active]] msg="Notify success" attempts=1 duration=2.743458ms
ts=2024-04-15T08:33:50.753Z caller=notify.go:863 level=debug component=dispatcher receiver=test1 integration=webhook[0] aggrGroup="{}/{foo=\"bar\"}:{}" alerts=[[3fff2c2][active]] msg="Notify success" attempts=1 duration=2.85275ms

However, reading the nflog file (after shutdown) shows just one entry on disk (the file is a binary snapshot, so only the group key and receiver name are legible):

N
>
{}/{foo="bar"}:{}
test1webhook*
            ����˗�2	����ދ��
                               ���˗�%

By contrast, when each route uses a different receiver, there are two entries on disk:

N
>
{}/{foo="bar"}:{}
test1webhook*
            ������2	����ދ��
                               �����N
>
{}/{foo="bar"}:{}
test2webhook*
            ���в�2	����ދ��
                               ��в�%

I expect this to cause even more issues if the routes have different timing options (i.e. group_wait, group_interval and repeat_interval) or different active and mute time intervals:

route:
  receiver: test1
  group_wait: 15s
  group_interval: 30s
  routes:
    - continue: true
      receiver: test1
      matchers:
      - foo=bar
    - continue: true
      receiver: test1
      matchers:
      - foo=bar
      group_wait: 30s
      group_interval: 5m
      mute_time_intervals:
        - evenings

What did you expect to see?

I expected to see a separate entry for each aggregation group. Here is the nflog from the first example, but running the code in this branch instead:

P
@
{}/{foo="bar"}/1:{}
test1webhook*
            ���ػ��2	����ދ��
                               Ԯ�ػ��P
@
{}/{foo="bar"}/0:{}
test1webhook*
            ������2	����ދ��
                               Ԯ����%
grobinson-grafana added a commit to grobinson-grafana/alertmanager that referenced this issue Apr 15, 2024
This commit replaces the code in route.Key() with that of route.ID(),
and removes route.ID(). The motivation behind this change is to fix
a number of bugs caused by conflicting group keys such as
"Different aggregation groups can share the same nflog" (prometheus#3808)
and also prevent an issue where groups are incorrectly marked as muted
when they are not.

Signed-off-by: George Robinson <[email protected]>
@grobinson-grafana (Contributor, Author)

This is related to #3817.

@filippog

Hello, I found this issue while investigating why repeat_interval is seemingly not honored. For example, this is the top route in the Alertmanager configuration:

    - match:
        alertname: 'SystemdUnitFailed'
      repeat_interval: 24h
      continue: true

yet SystemdUnitFailed alerts re-notify every 4h. The alert is generic enough that alerts constantly come and go from the group (which is grouped by alertname). My understanding is that repeat_interval is applied to individual alerts and not to the group. Is that the case? At any rate, could this issue be the cause of the behavior above? Thank you!

@grobinson-grafana (Contributor, Author)

I don't think this is the same issue. Repeat interval is applied to the group, as Alertmanager sends notifications for groups.

@filippog

> I don't think this is the same issue. Repeat interval is applied to the group, as Alertmanager sends notifications for groups.

Thank you for the quick reply and confirmation. I'll investigate more and open a new issue as needed.
