
latency slo #238

Open
SanaZulfiqar73 opened this issue Aug 31, 2022 · 4 comments

@SanaZulfiqar73

Hi,

I am currently using slo-generator for an availability SLO and I am trying to add a latency SLO as well.
I am using Prometheus as the backend and prometheus-pushgateway as the exporter.
I want to understand how slo-generator exposes a latency SLO. I don't see any metrics related to latency.

thanks,

@lvaylet
Collaborator

lvaylet commented Sep 7, 2022

Hi and thanks for your interest in SLO Generator!

Can you share how you handle the Availability SLO? Then let's build upon that and come up with the Latency SLO.

Just note that SLO Generator does not expose anything by itself. You need to define your SLOs in YAML files so SLO Generator knows how to compute them and decide whether you are within your targets.

Here is an example for a Latency SLO based on Prometheus metrics:

apiVersion: sre.google.com/v2
kind: ServiceLevelObjective
metadata:
  name: prom-metrics-latency
  labels:
    service_name: prom
    feature_name: metrics
    slo_name: latency
spec:
  description: 99.99% of Prometheus requests return in less than 250ms
  backend: prometheus
  method: distribution_cut
  exporters:
  - prometheus
  service_level_indicator:
    expression: http_request_duration_seconds_bucket{handler="/metrics", code=~"2.."}
    threshold_bucket: 0.25 # in seconds, corresponds to the `le` (less than) PromQL label
  goal: 0.9999
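
For intuition, here is a minimal Python sketch, not slo-generator's actual implementation, of what a distribution_cut SLI computes from Prometheus cumulative histogram buckets (the function name and data shape are illustrative assumptions):

```python
# Hypothetical sketch of a distribution_cut SLI. Prometheus histogram
# buckets are cumulative: each `le` label holds the count of all
# observations at or below that upper bound (in seconds here).
def distribution_cut_sli(buckets: dict[str, float], threshold: float) -> float:
    """Fraction of requests at or below `threshold` seconds."""
    total = buckets["+Inf"]  # the +Inf bucket holds the total count
    # The largest cumulative count among buckets within the threshold
    # is the number of "good" (fast enough) requests.
    good = max(
        (count for le, count in buckets.items()
         if le != "+Inf" and float(le) <= threshold),
        default=0.0,
    )
    return good / total

# Example: 9999 of 10000 requests finished within 0.25 s.
buckets = {"0.1": 9000.0, "0.25": 9999.0, "0.5": 10000.0, "+Inf": 10000.0}
print(distribution_cut_sli(buckets, 0.25))  # 0.9999
```

This is why `threshold_bucket` must match an existing `le` boundary of your histogram: the cut can only happen at a bucket edge.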

Does that make sense? Otherwise feel free to ask for more details.

@lvaylet
Collaborator

lvaylet commented Sep 9, 2022

Hi @SanaZulfiqar73, please let me know if my last comment helped.

@SanaZulfiqar73
Author

Hey, thanks for the response. I had pretty much the same idea as you described.
So I am creating the availability SLOs using different YAML files for different endpoints. The SLO YAML file looks something like this:

apiVersion: cloud.google.com/v1
kind: ServiceLevelObjective
metadata:
  name: pets-availability-metrics
  labels:
    service_name: {{ .Values.service }}
    feature_name: get_pets
    slo_name: availability-pets
spec:
  description: 99.97% of Prometheus requests should have a valid HTTP status code
  goal: 0.9997
  backend: prometheus
  exporters:
    - prometheus
  method: good_bad_ratio
  service_level_indicator:
    filter_good: http_requests_total{status="200", path="/pets"}[window]
    filter_bad: http_requests_total{status="500", path="/pets"}[window]
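
For reference, the good_bad_ratio method boils down to simple arithmetic over the two queried series; a minimal sketch under assumed names, not slo-generator's actual code:

```python
# Hypothetical sketch of the good_bad_ratio computation: the SLI is the
# share of good events among all counted (good + bad) events in the window.
def good_bad_ratio_sli(good_events: float, bad_events: float) -> float:
    total = good_events + bad_events
    if total == 0:
        raise ValueError("no events in window, SLI is undefined")
    return good_events / total

# Example: 9997 successful requests and 3 errors over the window.
print(good_bad_ratio_sli(9997.0, 3.0))  # 0.9997
```

Note that with `status="200"` and `status="500"` filters, requests with other status codes (e.g. 404 or 503) fall into neither series and are simply excluded from the ratio.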

In config file, I am exposing these configurations to prometheus pushgateway:

apiVersion: v1
data:
  config.yaml: |
    backends:
      prometheus:
        url: http://prometheus.local:9090
    exporters:
      prometheus:
        # Pushgateway for the Prometheus backend
        url: http://prometheus-pushgateway:9091

        # Fields to export as Prometheus metrics
        metrics:
          - alert
          - sli_measurement
          - slo_target
          - error_budget_burn_rate
          - alerting_burn_rate_threshold
          - error_budget_minutes
          - error_budget_remaining_minutes
          - error_budget_measurement
          - good_events_count
          - bad_events_count
          - gap

    error_budget_policies:
      default:
        steps:
        - name: 1 hour
          window: 3600
          burn_rate_threshold: 9
          alert: true
          message_alert: Page to defend the SLO
          message_ok: Last hour on track
        - name: 15 days
          window: 1296000
          burn_rate_threshold: 1.5
          alert: true
          message_alert: Freeze release, unless related to reliability or security
          message_ok: Unfreeze release, per the agreed roll-out policy
        - name: 30 days
          window: 2592000
          burn_rate_threshold: 1
          alert: true
          message_alert: Freeze release, unless related to reliability or security
          message_ok: Unfreeze release, per the agreed roll-out policy
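
The `burn_rate_threshold` values above compare how fast the error budget is being spent against how fast the SLO allows it to be spent. A common definition, assumed here for illustration: the observed error rate divided by the error rate the goal permits.

```python
# Hypothetical sketch of an error-budget burn rate. A burn rate of 1
# means the budget is consumed exactly at the sustainable pace; 9 means
# nine times too fast (the 1-hour paging threshold above).
def burn_rate(sli: float, goal: float) -> float:
    return (1.0 - sli) / (1.0 - goal)

# With a 99.97% goal, an SLI of 99.70% burns budget ~10x faster than allowed:
print(burn_rate(0.997, 0.9997))
```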

In Prometheus, I am getting the availability value for the "/pets" endpoint from the "sli_measurement" metric exposed via the Prometheus pushgateway. Now I want to add latency for "/pets" as well, so I am creating another SLO YAML file like this:

apiVersion: sre.google.com/v2
kind: ServiceLevelObjective
metadata:
  name: pets-latency-metrics
  labels:
    service_name: {{ .Values.service }}
    feature_name: get_pets
    slo_name: pets-99th-latency
spec:
  description: 99% of Prometheus requests return in less than 100ms
  backend: prometheus
  method: distribution_cut
  exporters:
  - prometheus
  service_level_indicator:
    expression: http_request_duration_seconds_bucket{path="/pets"}
    threshold_bucket: 0.10 # in seconds, corresponds to the `le` (less than) PromQL label
  goal: 0.99

I was wondering if there is a way to have a single SLO YAML file for both latency and availability, and to get a separate latency metric like "sli_measurement".

Thanks again,

@lvaylet
Collaborator

lvaylet commented Oct 19, 2022

Hi @SanaZulfiqar73,

It is considered a best practice to have one SLO definition per file (= as many files as SLOs). You could try combining multiple SLO definitions in a single file by separating them with --- but I am not sure what the result would be.
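
If you do experiment with several definitions in one file, note that standard YAML tooling treats `---`-separated content as independent documents. A quick, naive sanity check (splitting on separator lines only; real YAML parsers handle this more robustly):

```python
# Hypothetical check: count the SLO documents in a combined YAML file.
# A `---` line on its own separates documents in a YAML stream.
combined = """\
apiVersion: sre.google.com/v2
kind: ServiceLevelObjective
metadata:
  name: pets-availability
---
apiVersion: sre.google.com/v2
kind: ServiceLevelObjective
metadata:
  name: pets-latency
"""
docs = [d for d in combined.split("\n---\n") if d.strip()]
print(len(docs))  # 2
```

Whether slo-generator iterates over all documents in such a file depends on its loader, which is why one SLO per file remains the safe choice.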

Regarding the sli_measurement for latency, that is exactly what the service_level_indicator field drives. With distribution_cut it queries a single time series, compared to the ratio of two time series used by good_bad_ratio. More details here regarding the actual implementation.

Does that help? If not, please let me know what you had in mind. Perhaps by sharing an example of how you'd like to write this latency SLO?
