Scenario:

With a full monitoring stack (EKS + Prometheus + Grafana + Alertmanager) already deployed on AWS and its data persisted,
define alerting rules on demand via PrometheusRule resources, then push alerts in real time to the business platform through PrometheusAlert, closing the loop on cluster performance and resource-anomaly monitoring.

Solution overview:

To meet the requirements, we will perform the following steps:

  1. Define custom alerting rules
  2. Use PrometheusAlert for alert notification
  3. Persist PrometheusAlert data on EFS and expose the service through an ALB
  4. Update the Alertmanager configuration in the Helm values
  5. Configure the Feishu template and run an alert test

Detailed steps:

Configure Prometheus alerting rules

After the rule file is applied, Prometheus reloads its configuration automatically (if the change is not picked up, trigger a reload manually); the Pods do not need to be restarted. The new rules then appear on the Prometheus Rules page (see the commands after the rule file below).

# Quote the heredoc delimiter so the shell does not expand the {{ $labels }} references below
cat << 'EOF' > prometheus_rule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-alerts
  namespace: monitoring
  labels:
    app: kube-prometheus-stack
    release: prom-stack
spec:
  groups:
  - name: pod-monitoring
    rules:
    - alert: PodNotRunning
      expr: |
        sum by (namespace, pod) (kube_pod_status_phase{phase!="Running"}) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} is not running"
        description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has been in non-Running state for over 5 minutes."
        
    - alert: PodRestartingFrequently
      expr: |
        sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[5m])) > 3
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} is restarting frequently"
        description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting too frequently (more than 3 times in 5 minutes)."
        
    - alert: PodCrashLoopBackOff
      expr: |
        sum by (namespace, pod, container) (kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}) > 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} in CrashLoopBackOff state"
        description: "Container {{ $labels.container }} in pod {{ $labels.pod }} is crashing repeatedly."
        
    - alert: PodNotReady
      expr: |
        sum by (namespace, pod) (kube_pod_status_ready{condition="false"}) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} is not ready"
        description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has been not ready for over 5 minutes."
EOF
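
Apply the rule file and confirm the operator picked it up; a quick sketch, assuming the kube-prometheus-stack Service is named after the prom-stack release (check the actual name with kubectl get svc -n monitoring):

kubectl apply -f prometheus_rule.yaml
kubectl -n monitoring get prometheusrule pod-alerts

# If the rules still do not appear, port-forward Prometheus and inspect the Rules page
kubectl -n monitoring port-forward svc/prom-stack-kube-prometheus-prometheus 9090:9090
# then open http://localhost:9090/rules in a browser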

Configure PrometheusAlert

Alertmanager handles alert processing, but its built-in notification channels are limited: delivering alerts to Feishu, DingTalk, WeChat, and similar platforms requires an adapter SDK. PrometheusAlert fills that role. It is an open-source alert-forwarding hub that accepts messages from Prometheus, the Graylog logging system, and Grafana, and can deliver them through DingTalk, WeChat, Feishu, Huawei Cloud SMS, Tencent Cloud SMS and voice calls, Alibaba Cloud SMS and voice calls, and more.

  1. Create a Feishu bot
  2. Prepare the configuration files
  3. Start the PrometheusAlert service
  4. Connect it to Alertmanager
  5. Debug the alert template

Parameter reference:

PA_LOGIN_USER=alertuser        # login username
PA_LOGIN_PASSWORD=123456       # login password
PA_TITLE=prometheusAlert       # web UI title
PA_OPEN_FEISHU=1               # enable Feishu support
PA_OPEN_DINGDING=1             # enable DingTalk support
PA_OPEN_WEIXIN=1               # enable WeChat support
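
For a quick trial outside the cluster, the same variables can be passed straight to the container; a minimal local sketch, assuming Docker is available (not required for the cluster deployment below):

docker run -d --name prometheus-alert -p 8080:8080 \
  -e PA_LOGIN_USER=alertuser \
  -e PA_LOGIN_PASSWORD=123456 \
  -e PA_TITLE=prometheusAlert \
  -e PA_OPEN_FEISHU=1 \
  registry.cn-hangzhou.aliyuncs.com/tianxiang_app/prometheus-alert:latest
# then open http://localhost:8080 and log in with the credentials above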

Create the PVC

cat << EOF > webhook-pvc.yaml 
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: webhook-db-data
  namespace: monitoring
spec:
  storageClassName: efs-sc      # assumes an existing StorageClass backed by the AWS EFS CSI driver
  accessModes:
    - ReadWriteMany             # EFS supports RWX, so the data survives Pod rescheduling across nodes
  resources:
    requests:
      storage: 10Gi             # required by the API; EFS itself is elastic and does not enforce the size
EOF
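
Apply the claim and make sure it binds before creating the Deployment; with the EFS CSI driver the claim normally binds right away, and a Pending status usually points at a missing StorageClass or driver:

kubectl apply -f webhook-pvc.yaml
kubectl -n monitoring get pvc webhook-db-data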

Create the webhook Deployment

cat << EOF > webhook-deploy.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webhook-deploy
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: webhook
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        k8s-app: webhook
    spec:
      containers:
      - env:
        - name: PA_LOGIN_USER
          value: alertuser
        - name: PA_LOGIN_PASSWORD
          value: "123456"
        - name: PA_TITLE
          value: prometheusAlert
        - name: PA_OPEN_FEISHU
          value: "1"
        - name: PA_OPEN_DINGDING
          value: "0"
        - name: PA_OPEN_WEIXIN
          value: "0"
        image: registry.cn-hangzhou.aliyuncs.com/tianxiang_app/prometheus-alert:latest
        imagePullPolicy: IfNotPresent
        name: webhook
        ports:
        - containerPort: 8080
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /app/db
          name: db-data
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - name: db-data
        persistentVolumeClaim:
          claimName: webhook-db-data
EOF
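
Roll it out and check that the Pod starts and mounts the EFS volume:

kubectl apply -f webhook-deploy.yaml
kubectl -n monitoring rollout status deploy/webhook-deploy
kubectl -n monitoring logs deploy/webhook-deploy --tail=20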

Create the Service

cat << EOF > webhook-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: webhook-service
  namespace: monitoring
spec:
  ports:
  - port: 8080          # Service port
    protocol: TCP
    targetPort: 8080    # PrometheusAlert listens on 8080 inside the container
  selector:
    k8s-app: webhook    # matches the Pod label set by the Deployment
  type: ClusterIP       # external access goes through the ALB Ingress below
EOF
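
Before wiring up the ALB, an in-cluster smoke test confirms the Service answers; the curlimages/curl image is an assumption, any image shipping curl works:

kubectl apply -f webhook-svc.yaml
kubectl -n monitoring run curl-test --rm -it --restart=Never \
  --image=curlimages/curl -- curl -sI http://webhook-service:8080/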

Create the ALB Ingress

cat << EOF > webhook-alb.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webhook-ingress
  namespace: monitoring
  annotations:
    kubernetes.io/ingress.class: alb              # legacy annotation; ingressClassName below is the current field
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}]'
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/subnets: "subnet-09ca72eec6dbf351a,subnet-0558f4b0ec4424206"   # replace with your own public subnet IDs

spec:
  ingressClassName: alb
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: webhook-service
            port:
              number: 8080

EOF
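
The AWS Load Balancer Controller provisions the ALB asynchronously; the DNS name appears in the ADDRESS column once it is ready:

kubectl apply -f webhook-alb.yaml
kubectl -n monitoring get ingress webhook-ingress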

Create the notification template

Save the template below in the PrometheusAlert web UI as a custom template whose name matches the tpl parameter in the Alertmanager webhook URL configured later (prometheus-fs), and replace the Alertmanager and Grafana address placeholders on the first lines with real values.

{{ $alertmanagerURL := "http://<alertmanager-address>" -}}
{{ $alerts := .alerts -}}
{{ $grafanaURL := "http://<grafana-address>" -}}
{{ range $alert := $alerts -}}
  {{ $urimsg := "" -}}
  {{ range $key,$value := $alert.labels -}}
    {{ $urimsg = print $urimsg $key "%3D%22" $value "%22%2C" -}}
  {{ end -}}

  {{ if eq $alert.status "resolved" -}}
🟢 Kubernetes Cluster Recovery Notice 🟢
  {{ else -}}
🚨 Kubernetes Cluster Alert 🚨
  {{ end -}}

---

🔔 **Alert name**: {{ $alert.labels.alertname }}

🚩 **Severity**: {{ $alert.labels.severity }}

{{ if eq $alert.status "resolved" }}✅ **Status**: {{ $alert.status }}{{ else }}🔥 **Status**: {{ $alert.status }}{{ end }}

🕒 **Started at**: {{ GetCSTtime $alert.startsAt }}
{{ if eq $alert.status "resolved" }}🕒 **Ended at**: {{ GetCSTtime $alert.endsAt }}{{ end }}

---

📌 **Alert details**

- **🏷️ Namespace**: {{ $alert.labels.namespace }}
- **📡 Pod**: {{ $alert.labels.pod }}
- **🌐 Pod IP**: {{ $alert.labels.pod_ip }}
- **🖥️ Node**: {{ $alert.labels.node }}
- **🔄 Controller kind**: {{ $alert.labels.owner_kind }}
- **🔧 Controller name**: {{ $alert.labels.owner_name }}

---

📝 **Description**

{{ $alert.annotations.message }}{{ $alert.annotations.summary }}{{ $alert.annotations.description }}

---

🚀 **Quick actions**

- **[Silence this alert]({{ $alertmanagerURL }}/#/silences/new?filter=%7B{{ SplitString $urimsg 0 -3 }}%7D)**
- **[Open the Grafana dashboard]({{ $grafanaURL }})**

---

📊 **Suggested actions**

1. Check the Pod logs for anomalies.
2. Check resource usage on node {{ $alert.labels.node }} to rule out a resource bottleneck.
3. If the problem persists, consider restarting the Pod or the node.

---

📅 **Alert timeline**

- **First fired**: {{ GetCSTtime $alert.startsAt }}
{{ if eq $alert.status "resolved" }}
- **Resolved at**: {{ GetCSTtime $alert.endsAt }}
{{ end }}

---

📞 **Support**

For questions, contact the Kubernetes operations team or consult the relevant documentation.

---

{{ if eq $alert.status "resolved" }}
**✅ The alert has been resolved; please confirm the service is running normally!**
{{ else }}
**🔔 Please handle this promptly to avoid impacting the service!**
{{ end }}

---

{{ end -}}
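
Once saved, the template can be exercised without waiting for a real alert by posting an Alertmanager-style payload to the same endpoint the webhook URL uses; a minimal sketch, with <alb-dns> standing for the Ingress address and <feishu_token> for the bot token:

curl -s -X POST \
  'http://<alb-dns>/prometheusalert?type=fs&tpl=prometheus-fs&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/<feishu_token>' \
  -H 'Content-Type: application/json' \
  -d '{"alerts":[{"status":"firing","labels":{"alertname":"PodNotRunning","severity":"critical","namespace":"default","pod":"demo-pod"},"annotations":{"summary":"manual test"},"startsAt":"2024-01-01T00:00:00Z"}]}'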

Update the Alertmanager configuration in Helm

The receivers section points Alertmanager at the PrometheusAlert webhook exposed through the ALB; the snippet nests under the chart's alertmanager key.

# helm values.yaml
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    inhibit_rules:
      - source_matchers:
          - 'severity = critical'
        target_matchers:
          - 'severity =~ warning|info'
        equal:
          - 'namespace'
          - 'alertname'
      - source_matchers:
          - 'severity = warning'
        target_matchers:
          - 'severity = info'
        equal:
          - 'namespace'
          - 'alertname'
      - source_matchers:
          - 'alertname = InfoInhibitor'
        target_matchers:
          - 'severity = info'
        equal:
          - 'namespace'
      - target_matchers:
          - 'alertname = InfoInhibitor'
    route:
      group_by: ['namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'yunwei'
      routes:
      - receiver: 'yunwei'
        matchers:
          - alertname = "Watchdog"
    receivers:
    - name: 'yunwei'
      webhook_configs:
       - url: "http://k8s-monitori-webhooki-497c3c4f72-73633833.cn-northwest-1.elb.amazonaws.com.cn/prometheusalert?type=fs&tpl=prometheus-fs&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/feishu_token"   # replace feishu_token with the real bot token
    templates:
    - '/etc/alertmanager/config/*.tmpl'

Upgrade the Helm release

helm upgrade prom-stack prometheus-community/kube-prometheus-stack \
  -n monitoring \
  --version 75.10.0 \
  -f values.yaml \
  --reuse-values
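
After the upgrade, confirm Alertmanager rendered the new route and receiver; the Service name below follows the prom-stack release naming and may differ (check with kubectl get svc -n monitoring):

kubectl -n monitoring port-forward svc/prom-stack-kube-prometheus-alertmanager 9093:9093
# then open http://localhost:9093/#/status and verify the yunwei receiver and webhook URL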

Create a test Pod

The container exits after 10 seconds, so the Pod keeps restarting and enters CrashLoopBackOff, which should fire the PodCrashLoopBackOff and PodRestartingFrequently rules defined earlier.

kubectl run test-crash-2 \
  --image=registry.cn-hangzhou.aliyuncs.com/raymond-pro/busybox:latest \
  --command -- /bin/sh -c "sleep 10; exit 1"
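
To follow the test and clean up afterwards:

kubectl get pod test-crash-2 -w        # watch the Pod cycle into CrashLoopBackOff
kubectl delete pod test-crash-2        # remove the test Pod once the notification arrives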

Alert test

Within a few minutes the Feishu group should receive the templated firing notification, followed by a recovery notice once the test Pod is deleted.

Summary

With this, the AWS EKS monitoring stack covers the full loop of metric collection → visualization → custom alerting → platform notification, and the operations team can build on it to keep tuning cluster performance and stability.




