Building an Enterprise-Grade Monitoring System with Prometheus, Grafana, and Alertmanager
Environment
| HOST-NAME | IP | K8S Role |
| --- | --- | --- |
| master1 | 192.168.1.180/24 | master |
| node1 | 192.168.1.181/24 | worker |
1. Install node-exporter
1.1 Introduction to node-exporter
node-exporter collects monitoring metrics from machines (physical servers, virtual machines, cloud hosts, and so on), including CPU, memory, disk, network, and file metrics.
1.2 Install node-exporter
# Check the k8s nodes
[root@master1 .kube]# kubectl get nodes
NAME      STATUS   ROLES                  AGE   VERSION
master1   Ready    control-plane,master   12d   v1.20.6
node1     Ready    worker                 12d   v1.20.6
# Create a monitor-sa namespace
[root@master1 .kube]# kubectl create namespace monitor-sa
namespace/monitor-sa created
# Upload node-exporter.tar.gz to the home directory on master1 and node1, then load it
[root@master1 ~]# docker load -i node-exporter.tar.gz
ad68498f8d86: Loading layer [==================================================>] 4.628MB/4.628MB
ad8512dce2a7: Loading layer [==================================================>] 2.781MB/2.781MB
cc1adb06ef21: Loading layer [==================================================>] 16.9MB/16.9MB
Loaded image: prom/node-exporter:v0.16.0
[root@master1 ~]#
[root@node1 ~]# docker load -i node-exporter.tar.gz
ad68498f8d86: Loading layer [==================================================>] 4.628MB/4.628MB
ad8512dce2a7: Loading layer [==================================================>] 2.781MB/2.781MB
cc1adb06ef21: Loading layer [==================================================>] 16.9MB/16.9MB
Loaded image: prom/node-exporter:v0.16.0
[root@node1 ~]#
# Note: how to obtain the node-exporter image
Search on the Docker Hub website:
https://hub.docker.com/
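If the nodes have Internet access, the tarball can also be produced by pulling the image and saving it yourself (a small sketch, using the same tag as above):
docker pull prom/node-exporter:v0.16.0
docker save prom/node-exporter:v0.16.0 -o node-exporter.tar.gz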
[root@master1 prometheus]# cat > /root/prometheus/node-export.yaml <<END
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitor-sa
  labels:
    name: node-exporter
spec:
  selector:
    matchLabels:
      name: node-exporter
  template:
    metadata:
      labels:
        name: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
      - name: node-exporter
        image: prom/node-exporter:v0.16.0
        ports:
        - containerPort: 9100
        resources:
          requests:
            cpu: 0.15
        securityContext:
          privileged: true
        args:
        - --path.procfs
        - /host/proc
        - --path.sysfs
        - /host/sys
        - --collector.filesystem.ignored-mount-points
        - '"^/(sys|proc|dev|host|etc)($|/)"'
        volumeMounts:
        - name: dev
          mountPath: /host/dev
        - name: proc
          mountPath: /host/proc
        - name: sys
          mountPath: /host/sys
        - name: rootfs
          mountPath: /rootfs
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: dev
        hostPath:
          path: /dev
      - name: sys
        hostPath:
          path: /sys
      - name: rootfs
        hostPath:
          path: /
END
[root@master1 prometheus]# kubectl apply -f node-export.yaml
daemonset.apps/node-exporter created
[root@master1 prometheus]# kubectl get pods -n monitor-sa -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
node-exporter-92k4d 1/1 Running 0 58s 192.168.1.181 node1 <none> <none>
node-exporter-d44k4 1/1 Running 0 58s 192.168.1.180 master1 <none> <none>
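Optionally, confirm the DaemonSet itself is healthy; thanks to the toleration in the manifest it should also schedule a Pod on the master:
kubectl get daemonset -n monitor-sa
# DESIRED and READY should both equal the number of nodes (2 in this environment)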
1.3 View node-exporter metrics
curl http://<host-ip>:9100/metrics
# node-exporter listens on port 9100 by default; this returns all monitoring data collected on the host
curl http://192.168.1.180:9100/metrics | grep node_cpu_seconds
This shows the CPU usage of host 192.168.1.180:
# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 72963.37
node_cpu_seconds_total{cpu="0",mode="iowait"} 9.35
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 0
node_cpu_seconds_total{cpu="0",mode="softirq"} 151.4
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 656.12
node_cpu_seconds_total{cpu="0",mode="user"} 267.1
#HELP: explains the meaning of the metric; here it is the number of seconds the node's CPUs have spent in each mode
#TYPE: declares the data type of the metric; here it is a counter
node_cpu_seconds_total{cpu="0",mode="idle"}: the total CPU time consumed in idle mode on cpu0. CPU time consumed
only ever increases, which is why node_cpu_seconds_total is of type counter.
counter: a counter collects metrics that only ever increase
curl http://192.168.1.180:9100/metrics | grep node_load
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.1
node_load1 reflects the load of the host over the last minute. System load changes as system resources are used,
so node_load1 describes the current state: the value can go up as well as down. The comment shows that this
metric's type is gauge.
gauge: a gauge tracks metrics that can increase or decrease
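To see why the distinction matters in practice, compare how the two types are usually queried: a counter is almost always wrapped in rate() to turn the ever-growing total into a per-second value, while a gauge is used directly. A small sketch, assuming the Prometheus server installed in section 2 is already reachable on its NodePort (31935 in this walkthrough; yours may differ):
# counter: per-second CPU time spent in user mode over the last 5 minutes, per CPU
curl -s -g 'http://192.168.1.180:31935/api/v1/query?query=rate(node_cpu_seconds_total{mode="user"}[5m])'
# gauge: the current 1-minute load average, usable as-is
curl -s -g 'http://192.168.1.180:31935/api/v1/query?query=node_load1'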
2. Install the Prometheus server
2.1 Create a ServiceAccount and bind permissions
# Create a ServiceAccount named monitor
[root@master1 prometheus]# kubectl create serviceaccount monitor -n monitor-sa
serviceaccount/monitor created
[root@master1 prometheus]# kubectl get serviceaccount -n monitor-sa
NAME SECRETS AGE
default 1 79m
monitor 1 30s
# Bind the monitor ServiceAccount to the cluster-admin ClusterRole via a ClusterRoleBinding
[root@master1 prometheus]# kubectl create clusterrolebinding monitor-clusterrolebinding -n monitor-sa --clusterrole=cluster-admin --serviceaccount=monitor-sa:monitor
clusterrolebinding.rbac.authorization.k8s.io/monitor-clusterrolebinding created
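cluster-admin grants the ServiceAccount full cluster permissions, which is convenient here but broader than strictly needed. As an optional check that the binding took effect, a sketch using kubectl impersonation of the ServiceAccount:
kubectl auth can-i list nodes --as=system:serviceaccount:monitor-sa:monitor
# expected output: yes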
2.2 Create a data directory on node1
# Create the data storage directory on the node1 node of the k8s cluster
[root@node1 ~]# mkdir /data
[root@node1 ~]# chmod 777 /data/
[root@node1 ~]# ls -ld /data
drwxrwxrwx. 2 root root 6 Jun 8 16:00 /data
2.3 Install the Prometheus service
2.3.1 Create a ConfigMap to store the Prometheus configuration
[root@master1 prometheus]# cat > /root/prometheus/prometheus-cfg.yaml <<'END'
---
kind: ConfigMap
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus-config
  namespace: monitor-sa
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s     # interval at which metrics are scraped from targets
      scrape_timeout: 10s      # scrape timeout, default 10s
      evaluation_interval: 1m  # interval at which alerting rules are evaluated, default 1m
    scrape_configs:            # data sources (targets); each target is named by job_name and can be
                               # configured statically or via service discovery
    - job_name: 'kubernetes-node'
      kubernetes_sd_configs:   # use Kubernetes service discovery
      - role: node             # the node role discovers every node in the cluster via the default kubelet HTTP port
      relabel_configs:         # relabeling rules
      - source_labels: [__address__]   # original label, matches the address
        regex: '(.*):10250'            # match URLs carrying port 10250
        replacement: '${1}:9100'       # keep the IP from the matched ip:10250
        target_label: __address__      # the new address is the captured ip plus :9100
        action: replace
      - action: labelmap       # labels matching the regex below are kept; without the regex only the instance label would show
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: 'kubernetes-node-cadvisor'
      # scrape cAdvisor data: container resource usage exposed by the kubelet at /metrics/cadvisor
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap                           # keep the matched labels
        regex: __meta_kubernetes_node_label_(.+)   # keep labels matching __meta_kubernetes_node_label
      - target_label: __address__                  # discovered address, e.g. __address__="192.168.1.180:10250"
        replacement: kubernetes.default.svc:443    # replace it with kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)                                # capture the value of __meta_kubernetes_node_name
        target_label: __metrics_path__             # set __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-apiserver'
      kubernetes_sd_configs:
      - role: endpoints        # use Kubernetes endpoints discovery to scrape the apiserver on port 6443
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
        # only keep endpoints whose Service carries the annotation prometheus.io/scrape: "true".
        # Annotations are key/value pairs, so the source label here is the key and regex matches the value true;
        # endpoints whose value matches are kept (keep action), everything else is dropped.
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
        # reset the scheme: match the __meta_kubernetes_service_annotation_prometheus_io_scheme source label
        # (the prometheus.io/scheme annotation); if its value matches the regex, use it as the value of __scheme__.
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
        # custom metrics paths: if the application does not expose metrics at /metrics, declare
        # "prometheus.io/path = /mymetrics" on the Pod's Service; the declared path is assigned to
        # __metrics_path__ so Prometheus scrapes the custom path. The annotation name must match what the
        # Service uses: if the Service declares prometheus.io/app-metrics-path: '/metrics', the source label
        # here must be __meta_kubernetes_service_annotation_prometheus_io_app_metrics_path.
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        # expose the application's own port: join the address with the "prometheus.io/port = <port>"
        # annotation declared in the Service and assign the result to __address__. Prometheus then scrapes
        # that port together with __metrics_path__; if the path is not the default /metrics, combine this
        # with the path relabeling above to get the real exposed path.
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
END
[root@master1 prometheus]# kubectl apply -f prometheus-cfg.yaml
configmap/prometheus-config created
[root@master1 prometheus]# kubectl get configmap -n monitor-sa
NAME                DATA   AGE
kube-root-ca.crt    1      3h57m
prometheus-config   1      18s
[root@master1 prometheus]# kubectl describe configmap prometheus-config -n monitor-sa
Name:         prometheus-config
Namespace:    monitor-sa
Labels:       app=prometheus
Annotations:  <none>

Data
====
prometheus.yml:
----
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 1m
scrape_configs:
- job_name: 'kubernetes-node'
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - source_labels: [__address__]
    regex: '(.*):10250'
    replacement: '${1}:9100'
    target_label: __address__
    action: replace
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
- job_name: 'kubernetes-node-cadvisor'
  kubernetes_sd_configs:
  - role: node
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - target_label: __address__
    replacement: kubernetes.default.svc:443
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
- job_name: 'kubernetes-apiserver'
  kubernetes_sd_configs:
  - role: endpoints
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    action: keep
    regex: default;kubernetes;https
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name

Events:  <none>
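Before mounting this ConfigMap into the Deployment in the next step, it can be worth validating the rendered prometheus.yml. A sketch, assuming the promtool binary shipped with Prometheus releases is available on the workstation:
# extract the rendered configuration from the ConfigMap and check its syntax
kubectl get configmap prometheus-config -n monitor-sa \
  -o jsonpath='{.data.prometheus\.yml}' > /tmp/prometheus.yml
promtool check config /tmp/prometheus.yml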
2.3.2 Deploy Prometheus with a Deployment
Upload the Prometheus image archive prometheus-2-2-1.tar.gz to the k8s worker node node1 and load it manually.
The image can be found on Docker Hub (hub.docker.com) or pulled directly with:
docker pull prom/prometheus:v2.2.1
[root@node1 ~]# ls prometheus-2-2-1.tar.gz
prometheus-2-2-1.tar.gz
[root@node1 ~]# du -sh prometheus-2-2-1.tar.gz
110M prometheus-2-2-1.tar.gz
[root@node1 ~]# docker load -i prometheus-2-2-1.tar.gz
6a749002dd6a: Loading layer [==================================================>] 1.338MB/1.338MB
5f70bf18a086: Loading layer [==================================================>] 1.024kB/1.024kB
1692ded805c8: Loading layer [==================================================>] 2.629MB/2.629MB
035489d93827: Loading layer [==================================================>] 66.18MB/66.18MB
8b6ef3a2ab2c: Loading layer [==================================================>] 44.5MB/44.5MB
ff98586f6325: Loading layer [==================================================>] 3.584kB/3.584kB
017a13aba9f4: Loading layer [==================================================>] 12.8kB/12.8kB
4d04d79bb1a5: Loading layer [==================================================>] 27.65kB/27.65kB
75f6c078fa6b: Loading layer [==================================================>] 10.75kB/10.75kB
5e8313e8e2ba: Loading layer [==================================================>] 6.144kB/6.144kB
Loaded image: prom/prometheus:v2.2.1
[root@node1 ~]#
[root@master1 prometheus]# cat > /root/prometheus/prometheus-deploy.yaml <<END
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitor-sa
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
      component: server
    #matchExpressions:
    #- {key: app, operator: In, values: [prometheus]}
    #- {key: component, operator: In, values: [server]}
  template:
    metadata:
      labels:
        app: prometheus
        component: server
      annotations:
        prometheus.io/scrape: 'false'
    spec:
      nodeName: node1
      serviceAccountName: monitor
      containers:
      - name: prometheus
        image: prom/prometheus:v2.2.1
        imagePullPolicy: IfNotPresent
        command:
        - prometheus
        - --config.file=/etc/prometheus/prometheus.yml
        - --storage.tsdb.path=/prometheus
        - --storage.tsdb.retention=720h
        - --web.enable-lifecycle   # enable Prometheus hot reload
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/prometheus/prometheus.yml
          name: prometheus-config
          subPath: prometheus.yml
        - mountPath: /prometheus/
          name: prometheus-storage-volume
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
          items:
          - key: prometheus.yml
            path: prometheus.yml
            mode: 0644
      - name: prometheus-storage-volume
        hostPath:
          path: /data
          type: Directory
END
[root@master1 prometheus]# kubectl apply -f prometheus-deploy.yaml
deployment.apps/prometheus-server created
[root@master1 prometheus]# kubectl get pods -n monitor-sa -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
node-exporter-92k4d 1/1 Running 0 4h14m 192.168.1.181 node1 <none> <none>
node-exporter-d44k4 1/1 Running 0 4h14m 192.168.1.180 master1 <none> <none>
prometheus-server-657bd8cb4d-zrmk4 1/1 Running 0 42s 10.244.166.185 node1 <none> <none>
[root@master1 prometheus]#
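As an optional sanity check, the container logs should show Prometheus loading the configuration mounted from the ConfigMap and starting its web server (the exact log wording varies by version):
kubectl logs -n monitor-sa deploy/prometheus-server --tail=20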
[root@master1 prometheus]# cat > /root/prometheus/prometheus-svc.yaml <<END
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitor-sa
  labels:
    app: prometheus
spec:
  type: NodePort
  ports:
  - port: 9090
    targetPort: 9090
    protocol: TCP
  selector:
    app: prometheus
    component: server
END
[root@master1 prometheus]# kubectl apply -f prometheus-svc.yaml
service/prometheus created
[root@master1 prometheus]# kubectl get service -n monitor-sa
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
prometheus NodePort 10.103.238.66 <none> 9090:31935/TCP 37s
[root@master1 prometheus]# kubectl get endpoints -n monitor-sa
NAME ENDPOINTS AGE
prometheus 10.244.166.185:9090 3m50s
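With the NodePort Service in place, the Prometheus web UI is reachable on any node IP at the assigned port (31935 in the output above; the port is allocated randomly unless you pin it). Opening http://192.168.1.180:31935 in a browser and checking Status -> Targets should show the jobs defined in the ConfigMap. A rough sketch of the same check from the command line:
curl -s http://192.168.1.180:31935/api/v1/targets | grep -o '"job":"[^"]*"' | sort | uniq -c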
2.4 Hot-reload the Prometheus configuration
# Configuration changes can be hot-reloaded, i.e. applied without stopping Prometheus.
# To do so, first look up the Prometheus Pod IP:
[root@master1 prometheus]# kubectl get pods -n monitor-sa -o wide -l app=prometheus
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
prometheus-server-657bd8cb4d-zrmk4 1/1 Running 0 64m 10.244.166.185 node1 <none> <none>
[root@master1 prometheus]#
# 10.244.166.185 is the IP address of the Prometheus Pod
Then trigger the hot reload with:
[root@master1 prometheus]# curl -X POST http://10.244.166.185:9090/-/reload
# Hot reload is fairly slow; alternatively Prometheus can be restarted the hard way. After modifying the
# prometheus-cfg.yaml file above, force-delete the resources:
kubectl delete -f prometheus-cfg.yaml
kubectl delete -f prometheus-deploy.yaml
Then re-apply them:
kubectl apply -f prometheus-cfg.yaml
kubectl apply -f prometheus-deploy.yaml
Note:
In production, prefer hot reload; force-deleting may cause loss of monitoring data.
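To confirm that a hot reload actually picked up the new configuration, check the Prometheus logs for the configuration-load message emitted after the POST (the exact wording can differ between versions):
kubectl logs -n monitor-sa deploy/prometheus-server --tail=50 | grep -i 'loading configuration'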
3. Install Grafana
3.1 Introduction to Grafana
Grafana is a cross-platform, open-source metrics analysis and visualization tool. It can visualize the
collected data and notify the alert receivers in time. Its main features are:
1. Display: fast and flexible client-side graphs; panel plugins provide many different ways to visualize
metrics and logs, and the official library ships a rich set of dashboard panels such as heatmaps, line
charts, and other chart types.
2. Data sources: Graphite, InfluxDB, OpenTSDB, Prometheus, Elasticsearch, CloudWatch, KairosDB, and more.
3. Notifications: define alert rules for your most important metrics visually; Grafana evaluates them
continuously and sends notifications via Slack, PagerDuty, and others when the data reaches a threshold.
4. Mixed display: mix different data sources in the same chart; the data source can be specified per query,
including custom data sources.
5. Annotations: annotate charts with rich events from different data sources; hovering over an event shows
the full event metadata and tags.
3.2 Install Grafana
Upload the Grafana image heapster-grafana-amd64_v5_0_4.tar.gz to the k8s worker node node1 and load it
manually:
[root@node1 prometheus]# ls
heapster-grafana-amd64_v5_0_4.tar.gz
[root@node1 prometheus]# du -sh heapster-grafana-amd64_v5_0_4.tar.gz
165M heapster-grafana-amd64_v5_0_4.tar.gz
[root@node1 prometheus]# docker load -i heapster-grafana-amd64_v5_0_4.tar.gz
6816d98be637: Loading layer [==================================================>] 4.642MB/4.642MB
523feee8e0d3: Loading layer [==================================================>] 161.5MB/161.5MB
43d2638621da: Loading layer [==================================================>] 230.4kB/230.4kB
f24c0fa82e54: Loading layer [==================================================>] 2.56kB/2.56kB
334547094992: Loading layer [==================================================>] 5.826MB/5.826MB
Loaded image: k8s.gcr.io/heapster-grafana-amd64:v5.0.4
This image can also be found and downloaded via hub.docker.com.
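If you prefer pulling instead of loading a prepared tarball, the tag reported above can usually be pulled and re-saved for offline nodes (a sketch; whether k8s.gcr.io or a mirror is reachable depends on your network):
docker pull k8s.gcr.io/heapster-grafana-amd64:v5.0.4
docker save k8s.gcr.io/heapster-grafana-amd64:v5.0.4 -o heapster-grafana-amd64_v5_0_4.tar.gz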