
Add alerting rules for operator (#526) #828

Merged: 4 commits from issue-526-alerting-rules into master on Dec 23, 2023
Conversation

@Amper (Contributor) commented Dec 20, 2023

@Amper force-pushed the issue-526-alerting-rules branch from a5d7fb0 to 00f92c5 on December 20, 2023 16:10
@Amper force-pushed the issue-526-alerting-rules branch 2 times, most recently from 009c3b3 to 0525e13, on December 20, 2023 16:22
@Amper force-pushed the issue-526-alerting-rules branch from 0525e13 to 18f9d77 on December 20, 2023 16:35
@Amper marked this pull request as ready for review on December 22, 2023 07:32
@Amper requested a review from Haleygo as a code owner on December 22, 2023 07:32
@Haleygo (Contributor) previously approved these changes Dec 22, 2023 and left a comment:


lgtm

expr: sum(rate(operator_log_messages_total{level="error", job=~".*((victoria.*)|vm)-?operator"}[5m])) > 0
for: 15m
labels:
  severity: high
A reviewer (Contributor) commented on this hunk:

Should we use the same set of severities as the other VictoriaMetrics rules here, e.g. "critical"?

@Amper (Contributor, Author) replied:

I don't think it's critical, because it doesn't directly affect deployments or monitoring, but you're right, we should use the same severity names. Maybe change it to warning, wdyt? (+ @f41gh7)

@f41gh7 (Collaborator) replied:

I'm fine with warning.

@Amper (Contributor, Author) replied:

done
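
For reference, a sketch of the rule after the agreed change; per the thread, only the severity label moved from high to warning:

expr: sum(rate(operator_log_messages_total{level="error", job=~".*((victoria.*)|vm)-?operator"}[5m])) > 0
for: 15m
labels:
  severity: warning  # changed from "high" per the review discussion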

  dashboard: "{{ $externalURL }}/d/1H179hunk/victoriametrics-operator?ds={{ $labels.dc }}&orgId=1&viewPanel=10"
  summary: "Too many errors at reconcile loop of operator: {{ $value}}"
- alert: HighQueueDepth
  expr: (sum(workqueue_depth{job=~".*((victoria.*)|vm)-?operator"}) by (name)) > 10
@f41gh7 (Collaborator) commented on this hunk:

Based on my experience, workqueue_depth for vmuser and vmalertmanagerconfig can be slow and can trigger this alert. I propose excluding them from the matching.

vmuser reconcile is slow because of the additional secret creation; the same applies to vmalertmanagerconfig.

@Amper (Contributor, Author) replied:

done
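
A sketch of what the exclusion could look like, using a negative regex matcher on the name label (which L86's by (name) confirms exists); the exact queue names "vmuser" and "vmalertmanagerconfig" are an assumption here, not confirmed by the thread:

- alert: HighQueueDepth
  # hypothetical queue names; adjust to the actual workqueue name label values
  expr: (sum(workqueue_depth{job=~".*((victoria.*)|vm)-?operator", name!~"vmuser|vmalertmanagerconfig"}) by (name)) > 10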

@f41gh7 (Collaborator) left a comment:

LGTM

@f41gh7 merged commit eac9561 into master on Dec 23, 2023 (2 checks passed)
@f41gh7 deleted the issue-526-alerting-rules branch on December 23, 2023 17:56
@f41gh7 (Collaborator) commented Dec 23, 2023

Thanks!
