Prometheus Rules

Prometheus Rules
- 1 etcd rules
- 2 mysqld rules

本文中记录着一些常用的Prometheus Rules，适用于kube-prometheus。

1 etcd rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app.kubernetes.io/name: kube-prometheus
    app.kubernetes.io/part-of: kube-prometheus
    prometheus: k8s
    role: alert-rules
  name: etcd-k8s-monitoring-rules
  namespace: monitoring
spec:
  groups:
    - name: etcd-k8s
      rules:
        - alert: EtcdNoLeader
          expr: etcd_server_has_leader == 0
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: Etcd no Leader (instance {{ $labels.instance }})
            description: "Etcd cluster have no leader\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: EtcdInsufficientMembers
          expr: count(etcd_server_id) % 2 == 0
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: Etcd insufficient Members (instance {{ $labels.instance }})
            description: "Etcd cluster should have an odd number of members\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: EtcdHighNumberOfLeaderChanges
          expr: increase(etcd_server_leader_changes_seen_total[10m]) > 2
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: Etcd high number of leader changes (instance {{ $labels.instance }})
            description: "Etcd leader changed more than 2 times during 10 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: EtcdHighNumberOfFailedGrpcRequests
          expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0.01
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance }})
            description: "More than 1% GRPC request failure detected in Etcd\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: EtcdHighNumberOfFailedGrpcRequests
          expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0.05
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance }})
            description: "More than 5% GRPC request failure detected in Etcd\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: EtcdGrpcRequestsSlow
          expr: histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{grpc_type="unary"}[1m])) by (grpc_service, grpc_method, le)) > 0.15
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: Etcd GRPC requests slow (instance {{ $labels.instance }})
            description: "GRPC requests slowing down, 99th percentile is over 0.15s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: EtcdHighNumberOfFailedHttpRequests
          expr: sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m])) BY (method) > 0.01
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance }})
            description: "More than 1% HTTP failure detected in Etcd\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: EtcdHighNumberOfFailedHttpRequests
          expr: sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m])) BY (method) > 0.05
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance }})
            description: "More than 5% HTTP failure detected in Etcd\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: EtcdHttpRequestsSlow
          expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[1m])) > 0.15
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: Etcd HTTP requests slow (instance {{ $labels.instance }})
            description: "HTTP requests slowing down, 99th percentile is over 0.15s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: EtcdMemberCommunicationSlow
          expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[1m])) > 0.15
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: Etcd member communication slow (instance {{ $labels.instance }})
            description: "Etcd member communication slowing down, 99th percentile is over 0.15s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: EtcdHighNumberOfFailedProposals
          expr: increase(etcd_server_proposals_failed_total[1h]) > 5
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: Etcd high number of failed proposals (instance {{ $labels.instance }})
            description: "Etcd server got more than 5 failed proposals past hour\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: EtcdHighFsyncDurations
          expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[1m])) > 0.5
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: Etcd high fsync durations (instance {{ $labels.instance }})
            description: "Etcd WAL fsync duration increasing, 99th percentile is over 0.5s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: EtcdHighCommitDurations
          expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[1m])) > 0.25
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: Etcd high commit durations (instance {{ $labels.instance }})
            description: "Etcd commit duration increasing, 99th percentile is over 0.25s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

2 mysqld rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app.kubernetes.io/name: kube-prometheus
    app.kubernetes.io/part-of: kube-prometheus
    prometheus: k8s
    role: alert-rules
  name: mysqld-monitoring-rules
  namespace: monitoring
spec:
  groups:
    - name: etcd-k8s
      rules:
        - alert: MysqlDown
          expr: mysql_up == 0
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: MySQL down (instance {{ $labels.instance }})
            description: "MySQL instance is down on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: MysqlTooManyConnections(>80%)
          expr: max_over_time(mysql_global_status_threads_connected[1m]) / mysql_global_variables_max_connections * 100 > 80
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: MySQL too many connections (> 80%) (instance {{ $labels.instance }})
            description: "More than 80% of MySQL connections are in use on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: MysqlHighThreadsRunning
          expr: max_over_time(mysql_global_status_threads_running[1m]) / mysql_global_variables_max_connections * 100 > 60
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: MySQL high threads running (instance {{ $labels.instance }})
            description: "More than 60% of MySQL connections are in running state on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: MysqlSlaveIoThreadNotRunning
          expr: mysql_slave_status_master_server_id > 0 and ON (instance) mysql_slave_status_slave_io_running == 0
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: MySQL Slave IO thread not running (instance {{ $labels.instance }})
            description: "MySQL Slave IO thread not running on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: MysqlSlaveSqlThreadNotRunning
          expr: mysql_slave_status_master_server_id > 0 and ON (instance) mysql_slave_status_slave_sql_running == 0
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: MySQL Slave SQL thread not running (instance {{ $labels.instance }})
            description: "MySQL Slave SQL thread not running on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: MysqlSlaveReplicationLag
          expr: mysql_slave_status_master_server_id > 0 and ON (instance) (mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay) > 30
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: MySQL Slave replication lag (instance {{ $labels.instance }})
            description: "MySQL replication lag on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: MysqlSlowQueries
          expr: increase(mysql_global_status_slow_queries[1m]) > 0
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: MySQL slow queries (instance {{ $labels.instance }})
            description: "MySQL server mysql has some new slow query.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: MysqlInnodbLogWaits
          expr: rate(mysql_global_status_innodb_log_waits[15m]) > 10
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: MySQL InnoDB log waits (instance {{ $labels.instance }})
            description: "MySQL innodb log writes stalling\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: MysqlRestarted
          expr: mysql_global_status_uptime < 60
          for: 0m
          labels:
            severity: info
          annotations:
            summary: MySQL restarted (instance {{ $labels.instance }})
            description: "MySQL has just been restarted, less than one minute ago on {{ $labels.instance }}.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Prometheus Rules（kube-prometheus）

Prometheus Rules

1 etcd rules

2 mysqld rules

评论区