
Building an Enterprise-Grade Monitoring System with Prometheus + Grafana + Alertmanager


Environment

HOST-NAME   IP                 K8S Role
master1     192.168.1.180/24   master
node1       192.168.1.181/24   worker

1. Install node-exporter

1.1 About node-exporter

node-exporter collects monitoring metrics from machines (physical servers, virtual machines, cloud hosts, and so on), covering CPU, memory, disk, network, file-descriptor counts, and more.
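As a quick preview of what it exposes, node-exporter can also be run standalone with Docker before the Kubernetes deployment below (a minimal sketch, assuming Docker is installed and port 9100 is free):

docker run -d --name node-exporter-test -p 9100:9100 prom/node-exporter:v0.16.0
curl -s http://localhost:9100/metrics | head
docker rm -f node-exporter-test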

1.2 Installing node-exporter

#Check the k8s nodes
[root@master1 .kube]# kubectl get nodes
NAME      STATUS   ROLES                  AGE   VERSION
master1   Ready    control-plane,master   12d   v1.20.6
node1     Ready    worker                 12d   v1.20.6

#Create a monitor-sa namespace
[root@master1 .kube]# kubectl create namespace monitor-sa
namespace/monitor-sa created

#Upload node-exporter.tar.gz to the home directory on both master1 and node1
[root@master1 ~]# docker load -i node-exporter.tar.gz
ad68498f8d86: Loading layer [==================================================>]  4.628MB/4.628MB
ad8512dce2a7: Loading layer [==================================================>]  2.781MB/2.781MB
cc1adb06ef21: Loading layer [==================================================>]   16.9MB/16.9MB
Loaded image: prom/node-exporter:v0.16.0
[root@master1 ~]# 


[root@node1 ~]# docker load -i node-exporter.tar.gz 
ad68498f8d86: Loading layer [==================================================>]  4.628MB/4.628MB
ad8512dce2a7: Loading layer [==================================================>]  2.781MB/2.781MB
cc1adb06ef21: Loading layer [==================================================>]   16.9MB/16.9MB
Loaded image: prom/node-exporter:v0.16.0
[root@node1 ~]# 

#Note: the node-exporter image can be found by searching Docker Hub:
https://hub.docker.com/
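If the nodes have direct internet access, the image can also be pulled instead of loading the tarball:

docker pull prom/node-exporter:v0.16.0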



[root@master1 prometheus]# cat > /root/prometheus/node-export.yaml <<END
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitor-sa
  labels:
    name: node-exporter
spec:
  selector:
    matchLabels:
     name: node-exporter
  template:
    metadata:
      labels:
        name: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
      - name: node-exporter
        image: prom/node-exporter:v0.16.0
        ports:
        - containerPort: 9100
        resources:
          requests:
            cpu: 0.15
        securityContext:
          privileged: true
        args:
        - --path.procfs
        - /host/proc
        - --path.sysfs
        - /host/sys
        - --collector.filesystem.ignored-mount-points
        - '"^/(sys|proc|dev|host|etc)($|/)"'
        volumeMounts:
        - name: dev
          mountPath: /host/dev
        - name: proc
          mountPath: /host/proc
        - name: sys
          mountPath: /host/sys
        - name: rootfs
          mountPath: /rootfs
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: dev
          hostPath:
            path: /dev
        - name: sys
          hostPath:
            path: /sys
        - name: rootfs
          hostPath:
            path: /
END

[root@master1 prometheus]# kubectl apply -f node-export.yaml 
daemonset.apps/node-exporter created

[root@master1 prometheus]# kubectl get pods -n monitor-sa -o wide
NAME                  READY   STATUS    RESTARTS   AGE   IP              NODE      NOMINATED NODE   READINESS GATES
node-exporter-92k4d   1/1     Running   0          58s   192.168.1.181   node1     <none>           <none>
node-exporter-d44k4   1/1     Running   0          58s   192.168.1.180   master1   <none>           <none>

1.3 Viewing node-exporter metrics

curl http://<host-ip>:9100/metrics

#node-exporter listens on port 9100 by default; this returns all monitoring data collected on the host

curl http://192.168.1.180:9100/metrics | grep node_cpu_seconds  

This shows CPU usage on host 192.168.1.180:

# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 72963.37
node_cpu_seconds_total{cpu="0",mode="iowait"} 9.35
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 0
node_cpu_seconds_total{cpu="0",mode="softirq"} 151.4
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 656.12
node_cpu_seconds_total{cpu="0",mode="user"} 267.1

#HELP explains what the metric means: here, the time the node's CPUs spent in each mode, in seconds.

#TYPE states the metric's data type: here, counter.

node_cpu_seconds_total{cpu="0",mode="idle"}: the total CPU time spent in the idle state on cpu0. CPU time spent is a value that only ever increases, and the TYPE line confirms that node_cpu_seconds_total is a counter.

counter: collects only metrics whose values increase monotonically.

curl http://192.168.1.180:9100/metrics | grep node_load
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.1

node_load1 reflects the host's load over the last minute. System load changes as resources are used, so node_load1 describes a current state: its value may rise or fall. The comment shows its metric type is gauge.

gauge: a metric whose value can increase or decrease.
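A minimal sketch of why the distinction matters: a counter is rarely graphed raw; it is normally wrapped in rate() to obtain a per-second value. Once the Prometheus server from section 2 is running, a query like the following (host and port are illustrative placeholders) returns per-node CPU usage as a percentage:

curl -s 'http://<prometheus-host>:9090/api/v1/query' \
  --data-urlencode 'query=100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100'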

2. Install Prometheus

2.1 Create a ServiceAccount and grant it permissions

#Create a ServiceAccount named monitor
[root@master1 prometheus]# kubectl create serviceaccount monitor -n monitor-sa
serviceaccount/monitor created
[root@master1 prometheus]# kubectl get serviceaccount -n monitor-sa
NAME      SECRETS   AGE
default   1         79m
monitor   1         30s

#Bind the monitor ServiceAccount to the cluster-admin ClusterRole with a ClusterRoleBinding
[root@master1 prometheus]# kubectl create clusterrolebinding monitor-clusterrolebinding -n monitor-sa --clusterrole=cluster-admin --serviceaccount=monitor-sa:monitor
clusterrolebinding.rbac.authorization.k8s.io/monitor-clusterrolebinding created
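cluster-admin keeps the tutorial simple, though production clusters usually bind a narrower role. The binding can be sanity-checked with kubectl's built-in authorization query (the expected answer is yes):

kubectl auth can-i list nodes --as=system:serviceaccount:monitor-sa:monitor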


2.2 Create the data directory

#Create the data storage directory on k8s worker node node1
[root@node1 ~]# mkdir /data
[root@node1 ~]# chmod 777 /data/
[root@node1 ~]# ls -ld /data
drwxrwxrwx. 2 root root 6 Jun  8 16:00 /data

2.3 Install the Prometheus service

2.3.1 Create the Prometheus ConfigMap

#Note: the heredoc delimiter is quoted ('END') so the shell leaves ${1} and $1/$2 in the relabel rules below unexpanded
[root@master1 prometheus]# cat > /root/prometheus/prometheus-cfg.yaml <<'END'
---
kind: ConfigMap
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus-config
  namespace: monitor-sa
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s       #interval between scrapes of each target
      scrape_timeout: 10s        #scrape timeout; the default is 10s
      evaluation_interval: 1m    #how often rules are evaluated for alerting; the default is 1m
    scrape_configs:              #scrape targets; each group is named by job_name and is configured either statically or via service discovery
    - job_name: 'kubernetes-node'
      kubernetes_sd_configs:       #use Kubernetes service discovery
      - role: node
      #the node role discovers every node in the cluster via the kubelet's default HTTP port
      relabel_configs:     #relabeling rules
      - source_labels: [__address__]      #the original label to match against: the address
        regex: '(.*):10250'               #match addresses on port 10250
        replacement: '${1}:9100'          #keep the IP captured from ip:10250
        target_label: __address__         #the rewritten address becomes <captured ip>:9100
        action: replace
      - action: labelmap
      #labels matching the regex below are kept; without this labelmap only the instance label would be shown
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: 'kubernetes-node-cadvisor'
       # scrape cAdvisor data: container resource usage exposed by the kubelet at /metrics/cadvisor
      kubernetes_sd_configs:
      - role:  node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap    #keep the matched labels
        regex: __meta_kubernetes_node_label_(.+)
        #keep the labels that match __meta_kubernetes_node_label
      - target_label: __address__
      #the discovered address, e.g. __address__="192.168.1.180:10250"
        replacement: kubernetes.default.svc:443
        #replace the discovered address with kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        #capture the value of the source label __meta_kubernetes_node_name
        target_label: __metrics_path__
        #and write the resulting path into __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-apiserver'
      kubernetes_sd_configs:
      - role: endpoints
      #use Kubernetes endpoints service discovery; scrapes the data served by the apiserver on port 6443
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
        #keep only endpoints whose Service carries the annotation prometheus.io/scrape: "true". Annotations are key/value pairs: the source label here is the key, and regex matches the value; when the value matches, the keep action retains the target and everything else is dropped.
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
        #rewrite __scheme__: if the source label __meta_kubernetes_service_annotation_prometheus_io_scheme (i.e. the prometheus.io/scheme annotation) matches the regex, its value replaces the value of __scheme__.
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
        # for metrics exposed on a custom path instead of /metrics, declare e.g. "prometheus.io/path = /mymetrics" on the Pod's Service; that value is assigned to __metrics_path__ so Prometheus scrapes the custom path. The annotation name and this source label must agree: if the Service declares prometheus.io/app-metrics-path: '/metrics', this source label must be written as __meta_kubernetes_service_annotation_prometheus_io_app_metrics_path.
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        # expose a custom application port: the address is joined with the "prometheus.io/port = <port>" declared on the Service and assigned to __address__, so Prometheus scrapes that port together with __metrics_path__; if the path is not the default /metrics, use the path relabeling above as well.
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name 
END
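To see how the kubernetes-service-endpoints job above is consumed, a hypothetical Service annotated for scraping could look like this (the name, namespace, port, and path are illustrative and not part of this deployment):

apiVersion: v1
kind: Service
metadata:
  name: my-app                      # hypothetical application Service
  namespace: default
  annotations:
    prometheus.io/scrape: "true"    # matched by the keep rule above
    prometheus.io/scheme: "http"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "8080"      # joined into __address__ by the replace rule
spec:
  selector:
    app: my-app
  ports:
  - port: 8080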


[root@master1 prometheus]# kubectl apply -f prometheus-cfg.yaml 
configmap/prometheus-config created

[root@master1 prometheus]# kubectl get configmap -n monitor-sa
NAME                DATA   AGE
kube-root-ca.crt    1      3h57m
prometheus-config   1      18s

[root@master1 prometheus]# kubectl describe configmap prometheus-config -n monitor-sa
Name:         prometheus-config
Namespace:    monitor-sa
Labels:       app=prometheus
Annotations:  <none>

Data
====
prometheus.yml:
----
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 1m
scrape_configs:
- job_name: 'kubernetes-node'
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - source_labels: [__address__]
    regex: '(.*):10250'
    replacement: '${1}:9100'
    target_label: __address__
    action: replace
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
- job_name: 'kubernetes-node-cadvisor'
  kubernetes_sd_configs:
  - role:  node
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - target_label: __address__
    replacement: kubernetes.default.svc:443
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
- job_name: 'kubernetes-apiserver'
  kubernetes_sd_configs:
  - role: endpoints
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    action: keep
    regex: default;kubernetes;https
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name 

Events:  <none>

2.3.2 Create the Prometheus Deployment

Upload the Prometheus image archive prometheus-2-2-1.tar.gz to the k8s worker node node1 and load it manually.

The image can also be downloaded from Docker Hub (https://hub.docker.com), or pulled directly:

docker pull prom/prometheus:v2.2.1
[root@node1 ~]# ls prometheus-2-2-1.tar.gz 
prometheus-2-2-1.tar.gz
[root@node1 ~]# du -sh prometheus-2-2-1.tar.gz 
110M    prometheus-2-2-1.tar.gz
[root@node1 ~]# docker load -i prometheus-2-2-1.tar.gz 
6a749002dd6a: Loading layer [==================================================>]  1.338MB/1.338MB
5f70bf18a086: Loading layer [==================================================>]  1.024kB/1.024kB
1692ded805c8: Loading layer [==================================================>]  2.629MB/2.629MB
035489d93827: Loading layer [==================================================>]  66.18MB/66.18MB
8b6ef3a2ab2c: Loading layer [==================================================>]   44.5MB/44.5MB
ff98586f6325: Loading layer [==================================================>]  3.584kB/3.584kB
017a13aba9f4: Loading layer [==================================================>]   12.8kB/12.8kB
4d04d79bb1a5: Loading layer [==================================================>]  27.65kB/27.65kB
75f6c078fa6b: Loading layer [==================================================>]  10.75kB/10.75kB
5e8313e8e2ba: Loading layer [==================================================>]  6.144kB/6.144kB
Loaded image: prom/prometheus:v2.2.1
[root@node1 ~]# 
[root@master1 prometheus]# cat > /root/prometheus/prometheus-deploy.yaml <<END
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitor-sa
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
      component: server
    #matchExpressions:
    #- {key: app, operator: In, values: [prometheus]}
    #- {key: component, operator: In, values: [server]}
  template:
    metadata:
      labels:
        app: prometheus
        component: server
      annotations:
        prometheus.io/scrape: 'false'
    spec:
      nodeName: node1
      serviceAccountName: monitor
      containers:
      - name: prometheus
        image: prom/prometheus:v2.2.1
        imagePullPolicy: IfNotPresent
        command:
          - prometheus
          - --config.file=/etc/prometheus/prometheus.yml
          - --storage.tsdb.path=/prometheus
          - --storage.tsdb.retention=720h
          - --web.enable-lifecycle     ##enable Prometheus hot reload
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/prometheus/prometheus.yml
          name: prometheus-config
          subPath: prometheus.yml
        - mountPath: /prometheus/
          name: prometheus-storage-volume
      volumes:
        - name: prometheus-config
          configMap:
            name: prometheus-config
            items:
              - key: prometheus.yml
                path: prometheus.yml
                mode: 0644
        - name: prometheus-storage-volume
          hostPath:
           path: /data
           type: Directory
END


[root@master1 prometheus]# kubectl apply -f prometheus-deploy.yaml 
deployment.apps/prometheus-server created

[root@master1 prometheus]# kubectl get pods -n monitor-sa -o wide
NAME                                 READY   STATUS    RESTARTS   AGE     IP               NODE      NOMINATED NODE   READINESS GATES
node-exporter-92k4d                  1/1     Running   0          4h14m   192.168.1.181    node1     <none>           <none>
node-exporter-d44k4                  1/1     Running   0          4h14m   192.168.1.180    master1   <none>           <none>
prometheus-server-657bd8cb4d-zrmk4   1/1     Running   0          42s     10.244.166.185   node1     <none>           <none>
[root@master1 prometheus]# 
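As a quick health check, Prometheus 2.x serves a /-/healthy endpoint; using the pod IP from the output above, it should answer HTTP 200 with a short health message:

curl http://10.244.166.185:9090/-/healthy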

[root@master1 prometheus]# cat > /root/prometheus/prometheus-svc.yaml <<END
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitor-sa
  labels:
    app: prometheus
spec:
  type: NodePort
  ports:
    - port: 9090
      targetPort: 9090
      protocol: TCP
  selector:
    app: prometheus
    component: server
END

[root@master1 prometheus]# kubectl apply -f prometheus-svc.yaml 
service/prometheus created
[root@master1 prometheus]# kubectl get service -n monitor-sa
NAME         TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
prometheus   NodePort   10.103.238.66   <none>        9090:31935/TCP   37s

[root@master1 prometheus]# kubectl get endpoints -n monitor-sa
NAME         ENDPOINTS             AGE
prometheus   10.244.166.185:9090   3m50s
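
The web UI is now reachable from outside the cluster on any node IP at the NodePort shown above, e.g. http://192.168.1.180:31935. The discovered scrape targets can also be inspected from the CLI via the service's cluster IP:

curl -s http://10.103.238.66:9090/api/v1/targets | head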


2.4 Hot-reloading the Prometheus configuration

#To make configuration changes take effect without stopping Prometheus,
#the configuration can be hot-reloaded as follows:
[root@master1 prometheus]# kubectl get pods -n monitor-sa -o wide -l app=prometheus
NAME                                 READY   STATUS    RESTARTS   AGE   IP               NODE    NOMINATED NODE   READINESS GATES
prometheus-server-657bd8cb4d-zrmk4   1/1     Running   0          64m   10.244.166.185   node1   <none>           <none>
[root@master1 prometheus]# 

#10.244.166.185 is the IP address of the Prometheus pod

To make the configuration take effect, hot-reload with:
[root@master1 prometheus]# curl -X POST http://10.244.166.185:9090/-/reload

#Hot reloading is fairly slow; alternatively Prometheus can be restarted the hard way.
#After modifying the prometheus-cfg.yaml file above, force-delete the resources:
kubectl delete -f prometheus-cfg.yaml
kubectl delete -f prometheus-deploy.yaml
Then update them again with apply:
kubectl apply -f prometheus-cfg.yaml
kubectl apply -f prometheus-deploy.yaml
Note:
in production, prefer the hot reload; a forced delete may lose monitoring data.
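
A hedged sketch of a safer reload flow: the prom/prometheus image ships with promtool, so the rendered configuration can be validated inside the pod before the reload is triggered (pod name taken from the earlier output):

kubectl exec -n monitor-sa prometheus-server-657bd8cb4d-zrmk4 -- \
  promtool check config /etc/prometheus/prometheus.yml
curl -X POST http://10.244.166.185:9090/-/reload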

3. Install Grafana

3.1 About Grafana

Grafana is a cross-platform, open-source analytics and visualization tool that displays collected data visually and notifies alert receivers in time. Its main features:

1. Visualization: fast, flexible client-side graphs; panel plugins offer many ways to visualize metrics and logs, and the official library is rich in dashboard plugins such as heatmaps, line charts, and tables;

2. Data sources: Graphite, InfluxDB, OpenTSDB, Prometheus, Elasticsearch, CloudWatch, KairosDB, and more;

3. Alerting: define alert rules visually on your most important metrics; Grafana evaluates them continuously and sends notifications via Slack, PagerDuty, and others when data crosses a threshold;

4. Mixed data sources: combine different data sources in the same chart, specifying the data source per query, including custom data sources;

5. Annotations: annotate graphs with rich events from different data sources; hovering over an event shows the full event metadata and tags.

3.2 Installing the Grafana image

Upload the Grafana image archive heapster-grafana-amd64_v5_0_4.tar.gz to the k8s worker node node1 and load it manually:


[root@node1 prometheus]# ls
heapster-grafana-amd64_v5_0_4.tar.gz
[root@node1 prometheus]# du -sh heapster-grafana-amd64_v5_0_4.tar.gz 
165M    heapster-grafana-amd64_v5_0_4.tar.gz
[root@node1 prometheus]# docker load -i heapster-grafana-amd64_v5_0_4.tar.gz 
6816d98be637: Loading layer [==================================================>]  4.642MB/4.642MB
523feee8e0d3: Loading layer [==================================================>]  161.5MB/161.5MB
43d2638621da: Loading layer [==================================================>]  230.4kB/230.4kB
f24c0fa82e54: Loading layer [==================================================>]   2.56kB/2.56kB
334547094992: Loading layer [==================================================>]  5.826MB/5.826MB
Loaded image: k8s.gcr.io/heapster-grafana-amd64:v5.0.4

This image can also be found by searching hub.docker.com.
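If the node can reach the registry directly, an equivalent pull (assuming the k8s.gcr.io registry is accessible from the node; mirrors also exist on Docker Hub) is:

docker pull k8s.gcr.io/heapster-grafana-amd64:v5.0.4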

