Icinga sends notifications for hosts a second after getting into soft state (1 out of 3 tries) #10262

Open
mihaiste opened this issue Dec 3, 2024 · 5 comments


@mihaiste

mihaiste commented Dec 3, 2024

Hello everybody,

Some context on the issue my team is facing.
We have an Icinga-based environment set up on Kubernetes, consisting of 2 masters and 2 satellites.
Our environment uses a bottom-up connection model: the agents (monitored VMs) connect to the satellites, and the satellites connect to the masters.
The host template we are using for the monitored VMs is the following:
    {
      "accept_config": true,
      "check_command": "cluster-zone",
      "check_interval": "120",
      "max_check_attempts": "3",
      "retry_interval": "60",
      "enable_active_checks": true,
      "enable_flapping": true,
      "enable_passive_checks": false,
      "enable_perfdata": false,
      "has_agent": true,
      "master_should_connect": false,
      "object_type": "template",
      "vars": {
        "entity_of": "",
        "entity_type": "",
        "subscriptions": [ "INIT" ]
      },
      "volatile": false
    }

Our notification object is configured as:
    {
      "apply_to": "host",
      "assign_filter": "host.vars.team=%22MyTeam%22&host.zone=%22satellite%22",
      "imports": [ "template_mail-host-notification" ],
      "object_name": "mail-host-notification",
      "object_type": "apply",
      "period": "24x7",
      "states": [ "Down", "Up" ],
      "types": [
        "Acknowledgement",
        "DowntimeEnd",
        "DowntimeRemoved",
        "DowntimeStart",
        "FlappingEnd",
        "FlappingStart",
        "Problem"
      ],
      "users": [ "my.user" ],
      "notification_interval": "3600",
      "times_begin": "0"
    }
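
For readers following along: the `assign_filter` value above is percent-encoded, which makes it hard to read at a glance. The sketch below decodes it with the Python standard library. It assumes that `&` joins conditions with a logical AND and it does not handle parentheses or `|`; `decode_filter` is a hypothetical helper name, not part of the Director.

```python
from urllib.parse import unquote

# Filter string copied from the notification object above.
assign_filter = "host.vars.team=%22MyTeam%22&host.zone=%22satellite%22"


def decode_filter(raw: str) -> list[str]:
    """Split on '&' first, then percent-decode each condition.

    Splitting before decoding keeps an encoded '&' (%26) inside a value
    from being mistaken for a separator.
    """
    return [unquote(part) for part in raw.split("&")]


conditions = decode_filter(assign_filter)
for condition in conditions:
    print(condition)
```

Read this way, the rule should apply only to hosts whose `vars.team` is `MyTeam` and whose zone is `satellite`.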

Describe the bug

We have a CI/CD pipeline that updates or enforces the configuration in the Icinga Web Director component.
When the Director configuration is applied, a few hosts change state to DOWN, entering a soft state first.
Although our host configuration sets max_check_attempts to 3, Icinga sometimes sends notifications for these hosts exactly 1 second after running the first check (see the screenshots).

To Reproduce

The issue does not reproduce on every Director apply.

Expected behavior

Icinga should send a notification only once the object reaches a hard state.
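
To make the expectation concrete, here is a deliberately simplified model of the soft/hard state logic. It is a sketch of the behavior we expect, not Icinga's actual implementation: the `HostState` class is hypothetical, and re-notifications, flapping, and downtimes are ignored.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    """Simplified soft/hard state tracking for a single host."""

    max_check_attempts: int
    attempt: int = 1
    state: str = "UP"
    state_type: str = "HARD"

    def process(self, result: str) -> bool:
        """Apply one check result ("UP" or "DOWN").

        Returns True if a notification is expected for this result.
        """
        if result == "UP":
            # A recovery is only notified if the problem had become hard.
            notify = self.state == "DOWN" and self.state_type == "HARD"
            self.state, self.state_type, self.attempt = "UP", "HARD", 1
            return notify
        if self.state == "UP":
            self.state, self.attempt = "DOWN", 1
        elif self.state_type == "SOFT":
            self.attempt += 1
        else:
            return False  # already hard DOWN, nothing new to report
        if self.attempt >= self.max_check_attempts:
            self.state_type = "HARD"
            return True
        self.state_type = "SOFT"
        return False


# What we observe: one failed check, then the host is fine again.
observed = HostState(max_check_attempts=3)
observed_notifications = [observed.process(r) for r in ["DOWN", "UP"]]

# A real outage: three failed checks in a row, then a recovery.
outage = HostState(max_check_attempts=3)
outage_notifications = [outage.process(r) for r in ["DOWN", "DOWN", "DOWN", "UP"]]

print(observed_notifications, outage_notifications)
```

With max_check_attempts set to 3 and retry_interval set to 60, this model reaches a hard state roughly 120 seconds after the first failed check, so a notification 1 second after the first failure does not fit it.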

Screenshots

icinga
icinga2

Your Environment

Include as many relevant details about the environment you experienced the problem in

  • Version used (icinga2 --version): v2.14.2

  • Operating System and version: N/A (deployed on Kubernetes)

  • Enabled features (icinga2 feature list):
    Disabled features: command compatlog debuglog elasticsearch gelf graphite influxdb influxdb2 journald livestatus opentsdb perfdata syslog mainlog
    Enabled features: api checker icingadb notification

  • Icinga Web 2 version and modules (System - About):
    Icinga Web 2 - 2.12.1
    Loaded Modules
    icingadb - 1.1.3
    cube - 1.3.3
    director - 1.11.1
    incubator - 0.22.0
    reporting - 1.0.2
    x509 - 1.3.2

  • Config validation (icinga2 daemon -C):
    [2024-12-03 09:57:51 +0000] information/cli: Icinga application loader (version: v2.14.2)
    [2024-12-03 09:57:51 +0000] information/cli: Loading configuration file(s).
    [2024-12-03 09:57:51 +0000] information/ConfigItem: Committing config item(s).
    [2024-12-03 09:57:51 +0000] information/ApiListener: My API identity: satellite-0
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 1 NotificationComponent.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 7 Downtimes.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 1 CheckerComponent.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 59 Users.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 2 TimePeriods.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 1837 Services.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 162 Zones.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 5 NotificationCommands.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 2770 Notifications.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 1 IcingaApplication.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 236 Hosts.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 16 HostGroups.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 162 Endpoints.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 1 ApiUser.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 1 ApiListener.
    [2024-12-03 09:57:52 +0000] information/ConfigItem: Instantiated 540 CheckCommands.
    [2024-12-03 09:57:52 +0000] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
    [2024-12-03 09:57:52 +0000] information/cli: Finished validating the configuration file(s).

  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.
    Here is the zones.conf from one of the satellites:
    object Endpoint "satellite-0" {
      // this is me
    }

    // the masters
    object Endpoint "master-0" {
      host = "master-0"
      port = "443"
    }

    // the masters
    object Endpoint "master-1" {
      host = "master-1"
      port = "443"
    }

    // the other satellites
    object Endpoint "satellite-1" {
      host = "satellite-1"
      port = "443"
    }

    object Zone "master" {
      endpoints = [ "master-1", "master-0" ]
    }

    object Zone "satellite" {
      endpoints = [ "satellite-1", "satellite-0" ]
      parent = "master"
    }

    object Zone "global-templates" {
      global = true
    }

    object Zone "director-global" {
      global = true
    }

Additional context

I'm not sure what other details would be useful here; please advise.

Thanks!

@oxzi
Member

oxzi commented Dec 4, 2024

Thanks for creating this issue.

Could you please post your (redacted) Notification object for the Host in question? You should be able to find it with icinga2 object list -t Notification or further filtering based on your Host and Notification name.
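
As an alternative to the CLI, the same objects can usually be fetched through the Icinga 2 REST API. The sketch below only builds the request and does not send it. The base URL, the host name, and the `build_notification_query` helper are assumptions for illustration, authentication is omitted, and the selected `attrs` are just a guess at what is relevant here.

```python
import json
import urllib.request


def build_notification_query(base_url: str, host_name: str) -> urllib.request.Request:
    """Build a query for all Notification objects attached to one host.

    The API accepts a POST carrying an X-HTTP-Method-Override: GET header,
    so the filter can travel in the JSON body.
    """
    body = {
        "filter": "host.name == filter_host",
        "filter_vars": {"filter_host": host_name},
        "attrs": ["states", "types", "times", "interval"],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/objects/notifications",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Accept": "application/json",
            "X-HTTP-Method-Override": "GET",
        },
        method="POST",
    )


# Hypothetical endpoint and host name; replace with real values.
req = build_notification_query("https://icinga.example:5665", "host-x")
print(req.full_url)
print(req.data.decode("utf-8"))
```

Sending it would additionally need credentials for an ApiUser and a TLS context that trusts the Icinga CA, both left out here.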

Nevertheless, soft states should not result in a notification. Thus, could you please post the (redacted) icinga2.log around the time the state changed?

As the Director is involved in a CI/CD scenario, is the Host object in question being altered or even re-added? If so, could you please post the icinga2.log regarding the object creation including state changes?

Btw, please upgrade your Icinga 2 to the latest version 2.14.3 immediately as the 2.14.2 contains a known critical vulnerability: https://icinga.com/blog/icinga2-security-pre-announcement/, https://icinga.com/blog/critical-icinga-2-security-releases-2-14-3/, https://icinga.com/blog/uncovering-a-client-certificate-verification-bypass-in-icinga/, https://github.com/Icinga/icinga2/releases/tag/v2.14.3.

@mihaiste
Author

mihaiste commented Dec 6, 2024

Hello,

I apologize for the delayed reply.
We have investigated the topic further and will come back with the requested data as soon as possible. Exporting the logs from cold storage is a bit of a hassle.

Thanks for understanding!

> Btw, please upgrade your Icinga 2 to the latest version 2.14.3 immediately as the 2.14.2 contains a known critical vulnerability: https://icinga.com/blog/icinga2-security-pre-announcement/, https://icinga.com/blog/critical-icinga-2-security-releases-2-14-3/, https://icinga.com/blog/uncovering-a-client-certificate-verification-bypass-in-icinga/, https://github.com/Icinga/icinga2/releases/tag/v2.14.3.

We have upgraded the version, thank you very much for the tip!

@mihaiste
Author

mihaiste commented Dec 9, 2024

Hello,

Coming back with some additional details and the requested information, which will hopefully shed some light on our environment.
We have narrowed things down to two more specific scenarios in which we get notified for hosts in a soft state.

Scenario 1 (this is the one described in the original post):

  1. Host X is OK and checked with cluster-zone command.
  2. Host X is down and gets into soft state (1st try out of 3).
  3. Exactly one second later, the notification appears in Icinga Web on the host history and in the History->Notifications.
  4. The notification is sent by one of the master entities.
  5. At the next run of the host check, everything is fine => host X is OK.

Requested screenshot of the host notification object:
icinga_host_notif

Requested logs from all components (2 masters and 2 satellites):
scenario_1-with_notif_in_webui.zip

Scenario 2:

  1. Host X is OK and checked with cluster-zone command.
  2. Host X is down and gets into soft state (1st try out of 3).
  3. NO notification appears in Icinga Web on the host history and/or in the History->Notifications.
  4. A notification for host X being down is sent by one of the master entities.
  5. At the next run of the host check, everything is fine => host X is OK.
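
Both scenarios boil down to the same check: a notification that follows a soft state change of the same host within a very short window. A minimal sketch of that check is below. The `Event` record and the sample timestamps are made up for illustration and are not taken from the attached logs; real log lines would have to be parsed into this shape first.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    when: datetime
    host: str
    kind: str             # "state_change" or "notification"
    state_type: str = ""  # "SOFT" or "HARD", only set for state changes


def premature_notifications(events, window=timedelta(seconds=5)):
    """Return notifications sent within `window` after a SOFT state change
    of the same host."""
    flagged = []
    last_change = {}
    for event in sorted(events, key=lambda e: e.when):
        if event.kind == "state_change":
            last_change[event.host] = event
        elif event.kind == "notification":
            change = last_change.get(event.host)
            if (
                change is not None
                and change.state_type == "SOFT"
                and event.when - change.when <= window
            ):
                flagged.append(event)
    return flagged


# Illustrative timeline only: soft DOWN, then a notification one second later.
t0 = datetime(2024, 12, 9, 10, 0, 0)
sample = [
    Event(t0, "host-x", "state_change", "SOFT"),
    Event(t0 + timedelta(seconds=1), "host-x", "notification"),
    Event(t0 + timedelta(seconds=60), "host-x", "state_change", "HARD"),
]
flagged = premature_notifications(sample)
print(len(flagged))
```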

Screenshots:
icinga-host
mail-from-icinga

Requested screenshot of the host notification object:
icinga-host-notification2

Logs from all components (2 masters and 2 satellites):
scenario_2-without_notif_in_webui.zip

Let me know if anything else is needed to get to the bottom of this mystery :)

mihaiste removed their assignment Dec 9, 2024

@aval13

aval13 commented Dec 9, 2024

Hello,

Just to add a bit more information about this issue.
The CI/CD pipeline may only change service templates, commands, notifications and assign rules. Some objects may be forcibly rewritten (due to Director limitations) or left untouched.
Host objects are not touched by the CI/CD.
Nothing touches existing hosts.

As @mihaiste said, the design is top-down: we have 2 masters at the top (in the master zone), 2 satellites in the middle (in the satellite zone), and all the agents below the satellites, each connecting to both satellites.

Some time ago, when we were testing Icinga, we noticed that we were receiving notifications about the same event (not sure, but I believe it was for both Host and Service) from both a master and a satellite, which duplicated the emails. So we tweaked the email-sending NotificationCommand script so that the masters send only for master-related events and the satellites notify only for non-master events. Somehow this tweak fails for the cases we are seeing; we will need to look into that.
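
For context, the gating in that script is conceptually along these lines. This is a hypothetical reconstruction, not our actual script: `should_send` is an illustrative name, how the zone names reach the script is an assumption, and the extra guard on the state type (as exposed by the `$host.state_type$` macro) is an idea for a workaround, not something the current script does.

```python
def should_send(
    local_zone: str,
    host_zone: str,
    state_type: str,
    notification_type: str,
) -> bool:
    """Decide whether this node should actually send the email.

    local_zone:        zone of the node running the script
    host_zone:         zone of the host the event belongs to
    state_type:        "SOFT" or "HARD"
    notification_type: e.g. "PROBLEM", "RECOVERY"
    """
    # Optional extra guard (assumption): never mail a problem in a soft state.
    if notification_type == "PROBLEM" and state_type != "HARD":
        return False
    # Masters mail only for master-related events.
    if local_zone == "master":
        return host_zone == "master"
    # Satellites mail only for non-master events.
    return host_zone != "master"


print(should_send("master", "satellite", "SOFT", "PROBLEM"))
print(should_send("satellite", "satellite", "HARD", "PROBLEM"))
```

A guard like this would only hide the symptom in the emails; it would not explain why a master decides to notify in the first place.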

One last thing we can't figure out is why, for these events, the satellite correctly decides not to send a notification, but the master does.

The only guess I have at the moment is this: while the master is processing a (possibly new) configuration update received from the Director, the Host check executed on the satellite hiccups. The satellite correctly treats this as a soft state with nothing to do and reports the check result to the masters. The masters, busy with the new configuration, then somehow decide to notify even though the check is in a soft state.

Please let us know if any other information might be of use.

Thank you.

@w1ll-i-code

w1ll-i-code commented Jan 15, 2025

This happens because of #10179. Icinga2 has not written the state change to Ok out to the DB or triggered a notification during a shutdown, because the components were already disabled. It still retained the state change internally though. We had the same issue and #10191 fixed it for us, but it hasn't been merged yet.
