
Sending metadata with metrics #118

Closed
AloisReitbauer opened this issue Nov 13, 2018 · 7 comments

@AloisReitbauer

Are there any plans to include metadata, like what is a good or bad value, as part of the data? This would be very helpful for analytics tools when interpreting the data streams.

@SuperQ (Member) commented Nov 13, 2018

I usually do that by defining the threshold as another metric.
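For concreteness, a minimal sketch of that pattern using the Python prometheus_client library; the metric names and the threshold value here are made up for illustration and are not from this thread:

from prometheus_client import Gauge, start_http_server

# The measured value and its "what is a bad value" metadata are two separate series.
queue_depth = Gauge('app_queue_depth', 'Current number of items in the work queue')
queue_depth_max = Gauge('app_queue_depth_warning_threshold',
                        'Queue depth above which the application is considered unhealthy')

queue_depth_max.set(1000)  # the threshold, exposed as its own metric
queue_depth.set(42)        # the live value, updated by the application at runtime

# Both series are served from the same /metrics endpoint; a real application
# would keep running here instead of exiting.
start_http_server(8000)

An alerting rule can then simply compare the two series, as in the disk-temperature example later in this thread.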

@AloisReitbauer (Author)

A standardized way would make it easier for a monitoring system to process this automatically.

@RichiH (Member) commented Nov 14, 2018 via email

@SuperQ (Member) commented Nov 15, 2018

Yes, a good set of naming conventions for alerting thresholds would be useful.

@StevenLeRoux

A threshold can be associated with a label value, not just with the metric as a whole.

For example, in our alerting system you can define a threshold for os.cpu{}, but you can also scope it down to a group or a host. When you use multiple hardware profiles, you can't define the same threshold for all machines, because 80% doesn't mean the same thing on a 4c/8t host as on a 32c/64t one.

Example:

os.cpu{} > 90%
os.cpu{profile=low} > 60%
os.cpu{host=1234} > 95%
temperature{} > 60°C
temperature{profile=GPU} > 70°C

This is also why we don't pervert the data model for alerting: alerting is an abstraction above the store. Mixing the two mostly fits the server-monitoring use case only, whereas time series can also carry business KPIs or anything else that can move from one host to another. A good example is managing canary tests and adjusting dynamic alerting thresholds according to the canary ratio.

@SuperQ (Member) commented Nov 15, 2018

@StevenLeRoux Yes, it's already a common pattern in Prometheus to have different thresholds per label.

Typically, users create these via recording rules in the monitoring system, because they are related to topology (prod, canary, team, zone). The target therefore has no idea what the values should be.

This is done to allow for simplified alerting rules.

But there are other use cases, so having this data come from the monitored target may also be a good idea. For example, say you have devices with different temperature operating profiles.

You might have a metric for the per-device temperature:

node_disk_temperature_celsius{device="/dev/sda"} 55.2
node_disk_temperature_celsius{device="/dev/sdb"} 57.2

Now, each device may be either an SSD or an HDD, with different operating requirements:

node_disk_temperature_critical_celsius{device="/dev/sda"} 50.0
node_disk_temperature_critical_celsius{device="/dev/sdb"} 60.0

This makes it easy to write the alert, because the labels line up:

- alert: DeviceTempTooHigh
  expr: node_disk_temperature_celsius > node_disk_temperature_critical_celsius
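A rough sketch of what the target side of this example could look like, assuming the Python prometheus_client library; read_disk_temperature() is a made-up placeholder, and the threshold values just mirror the numbers above:

import time
from prometheus_client import Gauge, start_http_server

temp = Gauge('node_disk_temperature_celsius', 'Current disk temperature in Celsius', ['device'])
temp_crit = Gauge('node_disk_temperature_critical_celsius', 'Critical disk temperature in Celsius', ['device'])

def read_disk_temperature(device):
    # Placeholder: a real exporter would read this from SMART data or a sensor.
    return 55.2

# Per-device thresholds, e.g. an SSD versus an HDD operating limit.
temp_crit.labels(device='/dev/sda').set(50.0)
temp_crit.labels(device='/dev/sdb').set(60.0)

start_http_server(9101)
while True:
    for device in ('/dev/sda', '/dev/sdb'):
        temp.labels(device=device).set(read_disk_temperature(device))
    time.sleep(15)

Because both series come from the same target with the same device label values, the DeviceTempTooHigh expression above matches them one-to-one.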

@RichiH (Member) commented Dec 1, 2020

RichiH closed this as completed Dec 1, 2020