bosh/health_monitor at master · mouse3150/bosh


h4. Synopsis

BOSH Health Monitor (BHM) is a component that monitors the health of one or more BOSH deployments. It processes heartbeats and alerts from BOSH agents and notifies interested parties if something goes wrong.

h4. Heartbeats

Each agent sends periodic heartbeats to HM. Heartbeats are sent via the message bus and have the following format:

h6. Heartbeat processing

# If the agent is known to HM, its last heartbeat timestamp is updated. No analysis is attempted at this point; the agent analysis routine runs asynchronously to heartbeat processing.
# If the agent is unknown, it is registered with HM with a warning flag set (we call these rogue agents). The next director poll may add this agent to the list of managed agents and clear the flag. We might generate an alert if the flag hasn't been cleared after some (configurable) time.
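
The two heartbeat rules above can be sketched roughly as follows. This is only an illustration: @SketchAgentRegistry@, its @Agent@ struct and the @process_heartbeat@ method are hypothetical names, not BHM's actual classes.

```ruby
# Hypothetical sketch; class, struct and method names are illustrative only.
class SketchAgentRegistry
  Agent = Struct.new(:id, :last_heartbeat_at, :rogue, :first_seen_at)

  def initialize
    @agents = {}
  end

  # Known agents only get their timestamp refreshed (no analysis here);
  # unknown agents are registered with the rogue flag set.
  def process_heartbeat(agent_id, now = Time.now)
    agent = @agents[agent_id]
    if agent
      agent.last_heartbeat_at = now
    else
      agent = Agent.new(agent_id, now, true, now)
      @agents[agent_id] = agent
    end
    agent
  end

  def [](agent_id)
    @agents[agent_id]
  end
end

registry = SketchAgentRegistry.new
registry.process_heartbeat("agent-1", Time.at(100)) # registered as rogue
registry.process_heartbeat("agent-1", Time.at(130)) # timestamp updated
```

Keeping @first_seen_at@ separate from @last_heartbeat_at@ is one way to later decide whether the rogue flag has stayed set for longer than the configurable period.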

h4. Agents discovery

HM polls the director periodically to get the list of managed VMs:

When a new agent is discovered, it is registered and added to a managed deployment. HM does not actively contact or query the agent; it relies only on heartbeats and agent alerts.
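
A minimal sketch of what a director poll might do with its result, assuming the director response has already been reduced to a list of agent ids per deployment. That shape, the @ManagedAgent@ struct and @sync_agents@ are assumptions made for illustration.

```ruby
# Hypothetical sketch: the shape of the director response is an assumption.
ManagedAgent = Struct.new(:id, :deployment, :rogue)

# agents: Hash of agent_id => ManagedAgent
# deployment: deployment name
# director_agent_ids: agent ids the director reported for that deployment
def sync_agents(agents, deployment, director_agent_ids)
  director_agent_ids.each do |agent_id|
    agent = (agents[agent_id] ||= ManagedAgent.new(agent_id, nil, false))
    agent.deployment = deployment
    agent.rogue = false # the director knows this agent, so clear the flag
  end
  agents
end

agents = { "agent-1" => ManagedAgent.new("agent-1", nil, true) }
sync_agents(agents, "cf-prod", ["agent-1", "agent-2"])
```

Note that nothing here contacts the agents themselves; the sync only updates HM's own bookkeeping.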

h4. Agents analysis

This is a periodic operation that goes through all known agents. It first goes through all managed deployments, then analyzes rogue agents as well. The following procedure is used:

# If an agent has missed more than N heartbeats, an "Agent Missing" alert is generated.
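
The missed-heartbeat rule can be sketched as below. The interval and threshold values, and the @missing_agents@ helper, are assumptions chosen for the example rather than BHM defaults.

```ruby
# Hypothetical sketch: interval and threshold values are assumptions.
HEARTBEAT_INTERVAL = 60    # seconds between expected heartbeats (assumed)
MAX_MISSED_HEARTBEATS = 3  # the "N" from the rule above (assumed)

# last_heartbeats: Hash of agent_id => Time of the last heartbeat
# Returns the ids of agents that missed more than N heartbeats.
def missing_agents(last_heartbeats, now = Time.now)
  last_heartbeats.select do |_agent_id, last_seen|
    missed = ((now - last_seen) / HEARTBEAT_INTERVAL).floor
    missed > MAX_MISSED_HEARTBEATS
  end.keys
end

now = Time.at(1_000)
heartbeats = { "agent-1" => Time.at(990), "agent-2" => Time.at(700) }
missing_agents(heartbeats, now) # agent-2 has been silent for 300s (5 intervals)
```

Each id returned would then be turned into an "Agent Missing" alert and registered with the alert processor.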

h4. Alerts

An alert is a concept used by HM to flag and deliver information about important events. It includes the following data:

# Id
# Severity
# Source (usually deployment/job/index tuple)
# Timestamp
# Description
# Long description (optional)
# Tags (optional)
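
As an illustration, the fields listed above map naturally onto a Ruby @Struct@. The struct name and all values below are made up for the example; the numeric severity scale in particular is an assumption.

```ruby
# Hypothetical sketch: field names follow the list above; values are made up.
SketchAlert = Struct.new(:id, :severity, :source, :timestamp,
                         :description, :long_description, :tags)

alert = SketchAlert.new(
  "agent-missing-42",       # id: arbitrary string
  2,                        # severity: numeric scale is an assumption
  "mydeployment/router/0",  # source: deployment/job/index tuple
  Time.at(1_300_000_000),   # timestamp
  "Agent Missing",          # description
  nil,                      # long_description (optional)
  ["agent"]                 # tags (optional)
)
```

Any object that responds to the same seven accessors would work equally well.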

h6. Alert Processor

Alert Processor is a module that registers incoming alerts and routes them to interested parties via the appropriate delivery agent. It should conform to the following interface:

| *Method* | *Arguments* | *Description* |
| *register_alert* | alert (object responding to :id, :severity, :timestamp, :description, :long_description, :source and :tags) | Registers an alert and invokes a delivery agent. The delivery agent might or might not deliver the alert immediately, depending on the implementation, so Alert Processor shouldn't make any assumptions about delivery (e.g. an agent might queue up several alerts and send them asynchronously). |
| *add_delivery_agent* | delivery_agent, options | Adds a delivery agent to a processor |

An alert id can be an arbitrary string; however, Alert Processor might use it to keep track of registered alerts and avoid processing the same alert twice. This way other HM modules can blindly register any incoming alerts and leave the dedup step to the Alert Processor.

Alerts are only persisted in HM memory (at least in the initial version), so losing HM means losing any undelivered alerts that might have been queued by a delivery agent or the Alert Processor.

If the Alert Processor has more than one delivery agent associated with it, it notifies all of them in order (e.g. we might want to notify both Zabbix and Pager Duty).
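
The @register_alert@ / @add_delivery_agent@ interface, id-based dedup and in-order fan-out described above could look roughly like this. It is a sketch only: the class names are hypothetical, and the @deliver@ method on delivery agents is an assumption, since the delivery agent interface is not shown here.

```ruby
require "set"

# Hypothetical sketch of the interface described above. The delivery agent
# method name (`deliver`) is an assumption made for this example.
class SketchAlertProcessor
  def initialize
    @seen_ids = Set.new     # in-memory only, lost if HM goes down
    @delivery_agents = []
  end

  def add_delivery_agent(delivery_agent, options = {})
    @delivery_agents << [delivery_agent, options]
  end

  # Returns true if the alert was new and handed to every delivery agent,
  # false if an alert with the same id was already registered.
  def register_alert(alert)
    return false unless @seen_ids.add?(alert.id)
    @delivery_agents.each { |agent, _options| agent.deliver(alert) }
    true
  end
end

# In-memory delivery agent used only for this example.
class RecordingAgent
  attr_reader :delivered

  def initialize
    @delivered = []
  end

  def deliver(alert)
    @delivered << alert.id
  end
end

DemoAlert = Struct.new(:id, :severity, :description)

processor = SketchAlertProcessor.new
pager = RecordingAgent.new
email = RecordingAgent.new
processor.add_delivery_agent(pager)
processor.add_delivery_agent(email)

demo_alert = DemoAlert.new("a-1", 1, "Agent Missing")
processor.register_alert(demo_alert) # handed to both agents, in order
processor.register_alert(demo_alert) # duplicate id, skipped
```

Because dedup lives in the processor, callers can register every incoming alert without tracking ids themselves.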

h6. Delivery Agent

Delivery Agent is a module that takes care of an alert delivery mechanism (such as sending an email, raising a Pager Duty alert, writing to a journal or even silently discarding the alert). It should conform to the following interface:

The initial implementation will have email and Pager Duty delivery agents.

Alert Processor is not pluggable; it's just one of the HM classes. Delivery agents are pluggable, but they are generally not changed at runtime; they are initialized from the HM configuration file on HM startup.

h4. Alerts from agent

HM subscribes to agent alerts on a message bus:

BOSH Agent is responsible for mapping any underlying supervisor alert format to the expected JSON payload and sending it to HM.

HM is responsible for interpreting the JSON payload and mapping it to a sequence of HM actions, possibly creating an HM alert compatible with the Alert Processor module. HM never dedups incoming alerts outside of the Alert Processor (this adds some overhead to the incoming alert parser but shouldn't be too bad). Malformed payloads are ignored.
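
One possible way to ignore malformed payloads is sketched below. The required field names are purely illustrative assumptions, since the payload format is not reproduced in this section; @parse_agent_alert@ is likewise a hypothetical helper.

```ruby
require "json"

# Hypothetical sketch: the required field names are assumptions, since the
# payload format is not shown in this section.
REQUIRED_ALERT_FIELDS = %w[id severity title created_at].freeze

# Returns the parsed payload as a Hash, or nil if it is malformed.
def parse_agent_alert(raw_payload)
  payload = JSON.parse(raw_payload)
  return nil unless payload.is_a?(Hash)
  return nil unless REQUIRED_ALERT_FIELDS.all? { |field| payload.key?(field) }
  payload
rescue JSON::ParserError
  nil
end

good = '{"id":"a-1","severity":2,"title":"process down","created_at":1300000000}'
parse_agent_alert(good)      # parsed Hash
parse_agent_alert("{broken") # nil, payload is ignored
```

A @nil@ result is simply dropped by the caller; valid payloads go on to be mapped to an HM alert and registered, with dedup left to the Alert Processor.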

Job name and index are not included in an incoming agent alert; they are looked up in the director. If a heartbeat came from a rogue agent and we have no job name and/or index, we note that fact in the alert description but don't worry too much about it (service name and agent id should be enough). We might consider including the agent IP address as part of the heartbeat so we can track down rogue agents.
