Scenario:
With a full monitoring stack (EKS + Prometheus + Grafana + Alertmanager) already deployed on AWS and its data persisted,
define alerting rules on demand through PrometheusRule resources, then forward alerts in real time to the business platform via PrometheusAlert, closing the loop on cluster performance and resource-anomaly monitoring.
Solution overview:
To meet these requirements we will:
- Define custom alerting rules
- Use PrometheusAlert to deliver alert notifications
- Persist PrometheusAlert data on EFS and expose the service through an ALB
- Update the Alertmanager configuration in the Helm values
- Configure a Feishu template and run an alert test
Detailed steps:
Configure Prometheus alerting rules
Once the PrometheusRule resource is applied, Prometheus reloads the configuration automatically (if it does not, restart it manually); no Pod restart is required. The new rules then show up on the Rules page of the Prometheus UI. Note that the heredoc delimiter below is quoted so the shell does not expand the {{ $labels.* }} references.
cat << 'EOF' > prometheus_rule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-alerts
  namespace: monitoring
  labels:
    app: kube-prometheus-stack
    release: prom-stack
spec:
  groups:
    - name: pod-monitoring
      rules:
        - alert: PodNotRunning
          expr: |
            sum by (namespace, pod) (kube_pod_status_phase{phase!="Running"}) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is not running"
            description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has been in non-Running state for over 5 minutes."
        - alert: PodRestartingFrequently
          expr: |
            sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[5m])) > 3
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting frequently"
            description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting too frequently (more than 3 times in 5 minutes)."
        - alert: PodCrashLoopBackOff
          expr: |
            sum by (namespace, pod, container) (kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}) > 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} in CrashLoopBackOff state"
            description: "Container {{ $labels.container }} in pod {{ $labels.pod }} is crashing repeatedly."
        - alert: PodNotReady
          expr: |
            sum by (namespace, pod) (kube_pod_status_ready{condition="false"}) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is not ready"
            description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has been not ready for over 5 minutes."
EOF
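A minimal sketch for applying and checking the rule (assuming kubectl already points at the EKS cluster and the kube-prometheus-stack CRDs are installed):
kubectl apply -f prometheus_rule.yaml
kubectl get prometheusrules -n monitoring pod-alerts   # confirm the resource exists, then look for the pod-monitoring group on the Prometheus Rules page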
Configure PrometheusAlert
Alertmanager handles alert grouping and routing, but its built-in notification channels are limited. To deliver alerts to Feishu, DingTalk, WeChat and similar platforms, an adapter is needed, and PrometheusAlert fills that role. PrometheusAlert is an open-source alert forwarding hub for operations teams: it accepts alert messages from mainstream monitoring systems such as Prometheus, the log system Graylog and the visualization system Grafana, and can deliver them through channels including DingTalk, WeChat, Huawei Cloud SMS, Tencent Cloud SMS, Tencent Cloud voice calls, Alibaba Cloud SMS and Alibaba Cloud voice calls.
- Create a Feishu bot
- Prepare the configuration
- Start the PrometheusAlert service
- Connect it to Alertmanager
- Tune the alert template
Parameter reference:
- PA_LOGIN_USER=alertuser — login user
- PA_LOGIN_PASSWORD=123456 — login password
- PA_TITLE=prometheusAlert — page title
- PA_OPEN_FEISHU=1 — enable Feishu support
- PA_OPEN_DINGDING=1 — enable DingTalk support
- PA_OPEN_WEIXIN=1 — enable WeChat support
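Before deploying to the cluster you can sanity-check the image locally; a quick sketch assuming Docker is available, using the same image, port and environment variables as the Deployment below:
docker run -d --name prometheus-alert -p 8080:8080 \
  -e PA_LOGIN_USER=alertuser \
  -e PA_LOGIN_PASSWORD=123456 \
  -e PA_TITLE=prometheusAlert \
  -e PA_OPEN_FEISHU=1 \
  registry.cn-hangzhou.aliyuncs.com/tianxiang_app/prometheus-alert:latest
# Then open http://localhost:8080 and log in with the credentials above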
Create the PVC
cat << EOF > webhook-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: webhook-db-data
  namespace: monitoring
spec:
  storageClassName: efs-sc
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
EOF
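Apply the claim and make sure it binds (this assumes the EFS CSI driver and a StorageClass named efs-sc already exist in the cluster):
kubectl apply -f webhook-pvc.yaml
kubectl get pvc webhook-db-data -n monitoring   # STATUS should be Bound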
Create the webhook Deployment
cat << EOF > webhook-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webhook-deploy
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: webhook
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        k8s-app: webhook
    spec:
      containers:
        - name: webhook
          image: registry.cn-hangzhou.aliyuncs.com/tianxiang_app/prometheus-alert:latest
          imagePullPolicy: IfNotPresent
          env:
            - name: PA_LOGIN_USER
              value: alertuser
            - name: PA_LOGIN_PASSWORD
              value: "123456"
            - name: PA_TITLE
              value: prometheusAlert
            - name: PA_OPEN_FEISHU
              value: "1"
            - name: PA_OPEN_DINGDING
              value: "0"
            - name: PA_OPEN_WEIXIN
              value: "0"
          ports:
            - containerPort: 8080
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /app/db
              name: db-data
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
        - name: db-data
          persistentVolumeClaim:
            claimName: webhook-db-data
EOF
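Apply it and wait for the Pod to come up:
kubectl apply -f webhook-deploy.yaml
kubectl get pods -n monitoring -l k8s-app=webhook   # wait for STATUS Running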
Create the Service
cat << EOF > webhook-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: webhook-service
  namespace: monitoring
spec:
  type: ClusterIP
  selector:
    k8s-app: webhook
  ports:
    - port: 8080
      protocol: TCP
      targetPort: 8080
EOF
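Apply the Service; before the ALB exists you can already reach the PrometheusAlert UI through a port-forward (a quick check, assuming the Pod is Running):
kubectl apply -f webhook-svc.yaml
kubectl port-forward -n monitoring svc/webhook-service 8080:8080
# Open http://localhost:8080 and log in as alertuser / 123456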
Create the ALB Ingress
cat << EOF > webhook-alb.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webhook-ingress
  namespace: monitoring
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}]'
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/subnets: "subnet-09ca72eec6dbf351a,subnet-0558f4b0ec4424206"
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: webhook-service
                port:
                  number: 8080
EOF
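Apply the Ingress and grab the ALB hostname once the AWS Load Balancer Controller (assumed to be installed) has provisioned it; this hostname is what the Alertmanager webhook URL below points at:
kubectl apply -f webhook-alb.yaml
kubectl get ingress webhook-ingress -n monitoring   # use the ADDRESS column as the PrometheusAlert endpoint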
Create the alert notification template
Add the template below as a custom template in the PrometheusAlert web UI; its name must match the tpl parameter in the Alertmanager webhook URL (prometheus-fs in this setup). Replace the two placeholder URLs at the top with your Alertmanager and Grafana addresses.
{{ $alertmanagerURL := "alertmanager IP" -}}
{{ $alerts := .alerts -}}
{{ $grafanaURL := "grafana IP" -}}
{{ range $alert := $alerts -}}
{{ $groupKey := printf "%s|%s" $alert.labels.alertname $alert.status -}}
{{ $urimsg := "" -}}
{{ range $key,$value := $alert.labels -}}
{{ $urimsg = print $urimsg $key "%3D%22" $value "%22%2C" -}}
{{ end -}}
{{ if eq $alert.status "resolved" -}}
🟢 Kubernetes cluster recovery notice 🟢
{{ else -}}
🚨 Kubernetes cluster alert notice 🚨
{{ end -}}
---
🔔 **Alert name**: {{ $alert.labels.alertname }}
🚩 **Severity**: {{ $alert.labels.severity }}
{{ if eq $alert.status "resolved" }}✅ **Status**: {{ $alert.status }}{{ else }}🔥 **Status**: {{ $alert.status }}{{ end }}
🕒 **Started at**: {{ GetCSTtime $alert.startsAt }}
{{ if eq $alert.status "resolved" }}🕒 **Ended at**: {{ GetCSTtime $alert.endsAt }}{{ end }}
---
📌 **Alert details**
- **🏷️ Namespace**: {{ $alert.labels.namespace }}
- **📡 Pod**: {{ $alert.labels.pod }}
- **🌐 Pod IP**: {{ $alert.labels.pod_ip }}
- **🖥️ Node**: {{ $alert.labels.node }}
- **🔄 Owner kind**: {{ $alert.labels.owner_kind }}
- **🔧 Owner name**: {{ $alert.labels.owner_name }}
---
📝 **Description**
{{ $alert.annotations.message }}{{ $alert.annotations.summary }}{{ $alert.annotations.description }}
---
🚀 **Quick actions**
- **[Silence this alert]({{ $alertmanagerURL }}/#/silences/new?filter=%7B{{ SplitString $urimsg 0 -3 }}%7D)**
- **[Open the Grafana dashboard]({{ $grafanaURL }})**
---
📊 **Suggested actions**
1. Check the Pod logs for errors.
2. Check resource usage on node {{ $alert.labels.node }} to rule out a resource bottleneck.
3. If the problem persists, consider restarting the Pod or the node.
---
📅 **Alert timeline**
- **First fired**: {{ GetCSTtime $alert.startsAt }}
{{ if eq $alert.status "resolved" }}
- **Ended at**: {{ GetCSTtime $alert.endsAt }}
{{ end }}
---
📞 **Support**
If in doubt, contact the Kubernetes operations team or consult the relevant documentation.
---
{{ if eq $alert.status "resolved" }}
**✅ The alert has been resolved; please confirm the service is running normally!**
{{ else }}
**🔔 Please handle this promptly to avoid impact on the business!**
{{ end }}
---
{{ end -}}
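Once the template is saved in PrometheusAlert under the name prometheus-fs, you can exercise it directly with a hand-crafted Alertmanager-style payload; a rough sketch, where <alb-address> and <feishu_token> are placeholders you must fill in with the Ingress hostname and your Feishu bot token:
curl -s -X POST 'http://<alb-address>/prometheusalert?type=fs&tpl=prometheus-fs&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/<feishu_token>' \
  -H 'Content-Type: application/json' \
  -d '{"alerts":[{"status":"firing","labels":{"alertname":"PodNotRunning","severity":"critical","namespace":"default","pod":"demo"},"annotations":{"summary":"Pod demo is not running"},"startsAt":"2024-01-01T00:00:00Z"}]}'
# A formatted test message should appear in the Feishu group; labels the sample payload omits simply render empty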
Modify the Helm Alertmanager configuration
# helm values.yaml (this block sits under the chart's top-level alertmanager: key)
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    inhibit_rules:
      - source_matchers:
          - 'severity = critical'
        target_matchers:
          - 'severity =~ warning|info'
        equal:
          - 'namespace'
          - 'alertname'
      - source_matchers:
          - 'severity = warning'
        target_matchers:
          - 'severity = info'
        equal:
          - 'namespace'
          - 'alertname'
      - source_matchers:
          - 'alertname = InfoInhibitor'
        target_matchers:
          - 'severity = info'
        equal:
          - 'namespace'
      - target_matchers:
          - 'alertname = InfoInhibitor'
    route:
      group_by: ['namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'yunwei'
      routes:
        - receiver: 'yunwei'
          matchers:
            - alertname = "Watchdog"
    receivers:
      - name: 'yunwei'
        webhook_configs:
          - url: "http://k8s-monitori-webhooki-497c3c4f72-73633833.cn-northwest-1.elb.amazonaws.com.cn/prometheusalert?type=fs&tpl=prometheus-fs&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/feishu_token"
    templates:
      - '/etc/alertmanager/config/*.tmpl'
Update the Helm release
helm upgrade prom-stack prometheus-community/kube-prometheus-stack -n monitoring --version 75.10.0 -f values.yaml --reuse-values
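After the upgrade, confirm that Alertmanager reloaded the new routing configuration (the service name depends on the release name, so look it up first):
kubectl get svc -n monitoring | grep alertmanager
kubectl port-forward -n monitoring svc/<alertmanager-service> 9093:9093
# Open http://localhost:9093/#/status and check that the 'yunwei' receiver and its webhook URL appear in the loaded config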
Create a test Pod
kubectl run test-crash-2 --image=registry.cn-hangzhou.aliyuncs.com/raymond-pro/busybox:latest --command -- /bin/sh -c "sleep 10; exit 1"
Alert test
The test Pod exits after about 10 seconds, so Kubernetes keeps restarting it and it soon enters CrashLoopBackOff. Once the for durations defined in the rules have elapsed, the PodCrashLoopBackOff and PodRestartingFrequently alerts fire and the formatted notification should arrive in the Feishu group.
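A couple of commands to watch the progress (assuming the test Pod was created in the default namespace):
kubectl get pod test-crash-2 -w   # RESTARTS keeps climbing, STATUS becomes CrashLoopBackOff
# On the Prometheus Alerts page the alert moves from Pending to Firing, then Alertmanager forwards it to PrometheusAlert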
Summary
With this in place, the AWS EKS monitoring stack covers the full loop of metric collection → visualization → custom alerting → platform notification, giving the operations team a solid basis for continuously improving cluster performance and stability.

