Spotting Silent Pod Failures in Kubernetes with Grafana

This article discusses common issues in Kubernetes clusters, such as pod failures, node failures, and resource constraints, and shows how to set up an alert system using Grafana to detect them.

Lince Mathew

Nov 5, 2023

Unnoticed Pod Failures in Kubernetes

One of the critical issues in Kubernetes operations is pod deployment failure. Pods can fail for various reasons, such as CPU constraints, memory constraints, image pull errors, and node failures.
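As a rough illustration of how these failure causes surface, the sketch below scans a pod object (in the JSON shape printed by `kubectl get pod -o json`) for container states whose reason suggests a failure. The reason set and the sample pod are illustrative assumptions, not an exhaustive list.

```python
import json

# Reasons that usually mean a container cannot run.
# Illustrative only; real clusters report more reasons than these.
FAILURE_REASONS = {
    "ImagePullBackOff", "ErrImagePull", "CrashLoopBackOff",
    "CreateContainerConfigError", "OOMKilled", "Error",
}

def failing_containers(pod: dict) -> list[tuple[str, str]]:
    """Return (container name, reason) pairs for containers in a failure state."""
    failures = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        state = cs.get("state", {})
        last = cs.get("lastState", {})
        # Check the current waiting/terminated state first, then the last termination.
        for block in (state.get("waiting"), state.get("terminated"), last.get("terminated")):
            reason = (block or {}).get("reason")
            if reason in FAILURE_REASONS:
                failures.append((cs["name"], reason))
                break
    return failures

# Hypothetical pod status used only for demonstration.
sample_pod = json.loads("""
{
  "status": {
    "containerStatuses": [
      {"name": "api", "state": {"waiting": {"reason": "ImagePullBackOff"}}, "lastState": {}},
      {"name": "worker", "state": {"running": {}},
       "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
      {"name": "sidecar", "state": {"running": {}}, "lastState": {}}
    ]
  }
}
""")

print(failing_containers(sample_pod))
```

Nobody runs a check like this continuously by hand, which is why such failures tend to go unnoticed without an alerting pipeline.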

The main problem is that these failures, when they go unnoticed, degrade the applications running in production and ultimately leave users with a bad impression.

How to Spot Failures?

Discord is one of the primary communication channels for many teams. If Kubernetes cluster failures are reported on Discord, they attract the attention of developers, who can then fix them quickly. Creating a pathway from the Kubernetes cluster to a Discord server therefore lets us address failures that would otherwise go unnoticed.

Finding the Pathway

We explored various options for establishing a notification pathway from the Kubernetes cluster to the communication medium. There are multiple tools and products available for this, such as Botkube, Grafana and InfluxDB.

We chose Grafana over other options because it is an open-source analytics and monitoring platform. Grafana has an alert feature, a detailed dashboard for visualizing Kubernetes clusters, and the ability to customize alerts and set up thresholds. All of these features are available in the free version of Grafana.

Setting up Kubernetes Monitoring in Grafana

Set Up Grafana Cloud

We preferred Grafana Cloud over self-hosted Grafana OSS because the cloud offering provides benefits such as low maintenance, free storage space, additional layers of security, performance testing, IRM, load testing, Kubernetes monitoring, continuous profiling, and frontend observability. If you already have kubectl or Helm available for your Kubernetes cluster, this is the better option.

Grafana Cloud offers a number of advantages, but the free version has some limitations. For example, you can only store and process 10,000 metrics, there are storage limits for logs, and only three users can use it at a time. Additionally, the free version only retains history for 14 days. If Grafana Cloud is not an option, you can install Grafana on Linux, macOS, or Windows, or run it in Docker.

Further information about installation and configuration is available in the Grafana documentation.

Configure Data Source

To share data between the Kubernetes cluster and Grafana, first configure the data source. Data sources can be configured from the Kubernetes monitoring tab in Grafana Cloud.

There are various ways to connect, but we chose the Helm chart because it is recommended by Grafana and we already had the prerequisites, Helm and kubectl. After the cluster is configured, Grafana generates a Helm command for the installation.

helm repo add grafana https://grafana.github.io/helm-charts &&
  helm repo update &&
  helm upgrade --install grafana-k8s-monitoring grafana/k8s-monitoring \
    --namespace "default" --create-namespace --values - <<EOF
cluster:
  name: my-cluster
externalServices:
  prometheus:
    host: https://prometheus-prod-43-prod-ap-south-1.grafana.net
    basicAuth:
      username: "1352589"
      password: REPLACE_WITH_ACCESS_POLICY_TOKEN
  loki:
    host: https://logs-prod-028.grafana.net
    basicAuth:
      username: "7922240"
      password: REPLACE_WITH_ACCESS_POLICY_TOKEN
opencost:
  opencost:
    exporter:
      defaultClusterId: my-cluster
    prometheus:
      external:
        url: https://prometheus-prod-43-prod-ap-south-1.grafana.net/api/prom
EOF

The command adds the Grafana Helm repository and installs grafana-k8s-monitoring, a release of the k8s-monitoring chart from that repository. Grafana uses Prometheus as the data source for monitoring: Prometheus collects the data from the Kubernetes cluster, and Grafana Cloud provides an interface for it.
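Once the chart is installed, one way to confirm that metrics are arriving is to query the hosted Prometheus endpoint directly. The sketch below only builds the request. It assumes the Prometheus HTTP API is served under `/api/prom` on the host from the values above, and it uses `kube_pod_status_phase`, a metric exported by kube-state-metrics, as the example query. The username and token are the placeholders from the Helm command.

```python
import base64
import urllib.parse
import urllib.request

def build_query_request(host: str, username: str, token: str, promql: str) -> urllib.request.Request:
    """Build (but do not send) an instant-query request using basic auth."""
    # Assumption: the query API lives at <host>/api/prom/api/v1/query.
    query = urllib.parse.urlencode({"query": promql})
    url = f"{host.rstrip('/')}/api/prom/api/v1/query?{query}"
    credentials = base64.b64encode(f"{username}:{token}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {credentials}"})

req = build_query_request(
    "https://prometheus-prod-43-prod-ap-south-1.grafana.net",
    "1352589",
    "REPLACE_WITH_ACCESS_POLICY_TOKEN",
    'kube_pod_status_phase{phase="Failed"}',
)
print(req.full_url)
# To actually run the query (requires a real token):
# urllib.request.urlopen(req, timeout=10)
```

Keeping request construction separate from sending makes it easy to inspect the URL and headers before a real token is involved.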

Further information about configuring with the Helm chart is available in the chart's documentation.

Visualize Kubernetes Cluster

After the setup, Grafana will establish a connection with the Kubernetes cluster and we can visualize the cluster live in the Grafana dashboard.

Grafana visualizes various metrics, such as cluster status, pod or node status, and resource constraints. We have the freedom to modify the visualization, set up thresholds, and more in this dashboard.

Grafana Alert Manager

The Grafana alert management system continuously monitors the Kubernetes cluster and reports its status on the dashboard or via direct notification. Let's look at what happens behind the scenes. Running the Helm command creates multiple services in the cluster, such as grafana-agent, grafana-agent-log, kube-state-metrics, opencost, and prometheus. The grafana-agent, grafana-agent-log, and kube-state-metrics services collect metrics from the cluster, and opencost calculates cluster cost. To monitor each node in the cluster, Grafana deploys a pod on each node. Prometheus and the agents then watch for failures of the node or of the pods running on it.
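Because monitoring depends on a collector pod running on every node, it can be useful to verify that coverage. The following sketch takes pod objects in the shape returned by `kubectl get pods -o json` and lists the nodes that lack a running pod for a given component. Matching on the substring `grafana-agent` and the sample pod and node names are assumptions for illustration.

```python
def nodes_missing_component(pods: list[dict], nodes: list[str], component: str) -> list[str]:
    """Return the nodes with no Running pod whose name contains `component`."""
    covered = {
        p["spec"].get("nodeName")
        for p in pods
        if component in p["metadata"]["name"] and p["status"]["phase"] == "Running"
    }
    return [n for n in nodes if n not in covered]

# Hypothetical cluster snapshot.
pods = [
    {"metadata": {"name": "grafana-agent-abc12"}, "spec": {"nodeName": "node-a"},
     "status": {"phase": "Running"}},
    {"metadata": {"name": "grafana-agent-def34"}, "spec": {"nodeName": "node-b"},
     "status": {"phase": "Pending"}},
    {"metadata": {"name": "web-5d9c"}, "spec": {"nodeName": "node-c"},
     "status": {"phase": "Running"}},
]
nodes = ["node-a", "node-b", "node-c"]

print(nodes_missing_component(pods, nodes, "grafana-agent"))
```

Here node-b has an agent pod that is not yet running and node-c has none, so both would be reported as unmonitored.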

Together, these Grafana components use approximately 150 MB of memory and more than 1 CPU.

Alert Rule for Pod Failures

By default, Grafana creates a set of alert rules for various issues in the Kubernetes cluster. All of these rules are displayed in the alert rules panel.

We are mainly concerned with the rules in the Kubernetes-apps category, which warn about pod failures, node failures, and resource constraints. This category includes the rule for alerting about pod failures.

We need to send a notification via Discord when this rule fires in the Kubernetes cluster.
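The exact rule Grafana provisions is not reproduced here. As a hedged sketch of the general idea, a pod-failure rule typically fires when a pod has stayed in a non-running phase for longer than a hold period. The phases, the 15-minute hold, and the sample observations below are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

# Pod phases treated as unhealthy in this sketch.
BAD_PHASES = {"Pending", "Unknown", "Failed"}

def pods_to_alert(observations: dict, now: datetime, hold: timedelta) -> list[str]:
    """Return pods whose current unhealthy streak has lasted at least `hold`.

    `observations` maps a pod name to a time-ordered list of (timestamp, phase).
    """
    alerting = []
    for pod, samples in observations.items():
        streak_start = None
        # Walk backwards to find where the current unhealthy streak began.
        for ts, phase in reversed(samples):
            if phase not in BAD_PHASES:
                break
            streak_start = ts
        if streak_start is not None and now - streak_start >= hold:
            alerting.append(pod)
    return sorted(alerting)

def t(minute: int) -> datetime:
    return datetime(2023, 11, 5, 12, minute)

# Hypothetical phase history for three pods.
observations = {
    "web-1": [(t(0), "Running"), (t(10), "Pending"), (t(20), "Pending"), (t(30), "Pending")],
    "web-2": [(t(0), "Running"), (t(25), "Failed"), (t(30), "Failed")],
    "web-3": [(t(0), "Running"), (t(30), "Running")],
}

print(pods_to_alert(observations, now=t(30), hold=timedelta(minutes=15)))
```

The hold period keeps short-lived states, such as a pod that is briefly Pending during a rollout, from triggering a notification.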

Integrate Discord through Contact Points

Contact points are the channels through which Grafana Alerting sends notifications. We can use contact points to integrate Grafana Alerting with email, Slack, Discord, Telegram, and other notification services.

To integrate Discord, first create a webhook from the Discord server. We can create a webhook from the Integrations options in the server's settings.

Then create a contact point from the Contact points option in Grafana.

After providing a name, adding the webhook URL, and saving the contact point, we can test it by clicking the test button. This sends a test notification to the Discord server.
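To check the webhook independently of Grafana, a message can be posted to it directly. The sketch below builds a POST request with a JSON body containing a `content` field, which is the field Discord webhooks use for plain message text. The webhook URL is a placeholder, and the `User-Agent` value is an arbitrary choice.

```python
import json
import urllib.request

DISCORD_CONTENT_LIMIT = 2000  # Discord limits message content to 2000 characters

def build_discord_request(webhook_url: str, message: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a Discord webhook."""
    payload = {"content": message[:DISCORD_CONTENT_LIMIT]}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "User-Agent": "alert-test/1.0"},
        method="POST",
    )

# Placeholder URL; use the one copied from the server's Integrations settings.
discord_req = build_discord_request(
    "https://discord.com/api/webhooks/WEBHOOK_ID/WEBHOOK_TOKEN",
    "Test notification from the alerting setup",
)
print(discord_req.get_method(), discord_req.data.decode())
# To send it for real:
# urllib.request.urlopen(discord_req, timeout=10)
```

If a message sent this way appears in the channel, any later delivery problem is more likely to be in the Grafana contact point or its notification policy than in the webhook itself.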

Creating a contact point doesn't complete the alert system. Next, we need to assign notification policies to the contact point. Notification policies determine how alerts are routed to contact points. We can assign them from the policy section.
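Conceptually, notification policies form a tree: a default policy at the root and nested policies that match on alert labels. The sketch below is a simplified model of that routing, not Grafana's implementation. The contact point names and labels are hypothetical, and matchers are limited to exact label equality.

```python
from dataclasses import dataclass, field

@dataclass
class Policy:
    contact_point: str
    matchers: dict[str, str] = field(default_factory=dict)  # label -> required value
    children: list["Policy"] = field(default_factory=list)

    def matches(self, labels: dict[str, str]) -> bool:
        return all(labels.get(k) == v for k, v in self.matchers.items())

def route(policy: Policy, labels: dict[str, str]) -> str:
    """Return the contact point of the deepest policy that matches the labels."""
    for child in policy.children:
        if child.matches(labels):
            return route(child, labels)
    # No nested policy matched, so this policy's contact point handles the alert.
    return policy.contact_point

# Hypothetical tree: critical alerts go to Discord, everything else to email.
root = Policy(
    contact_point="email-default",
    children=[Policy(contact_point="discord-k8s", matchers={"severity": "critical"})],
)

print(route(root, {"alertname": "KubePodNotReady", "severity": "critical"}))
print(route(root, {"alertname": "KubePodNotReady", "severity": "info"}))
```

In this model an alert that matches no nested policy falls back to the root, which is why the default policy needs a sensible contact point of its own.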

Create a Custom Notification Template

The default Grafana notification template for Discord is a bit long. It has a lot of information and it is difficult to grasp the actual issue.

Go Template Language

A custom template can be created using the template option in Grafana. Templates are written in the Go template language. The following is the default Go template that produces the notification described above.

Template for notification title:

{{ define "discord.default.title" }}{{ template "__subject" . }}{{ end }}
{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
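The title template is dense, so here is an approximate Python rendering of the same logic to make its parts easier to follow. It is a reading aid rather than Grafana's renderer. The sample labels are hypothetical, and the ordering helper assumes label pairs are sorted by name with `alertname` placed first.

```python
def sorted_values(labels: dict[str, str]) -> list[str]:
    """Label values ordered by label name, with alertname first (assumed ordering)."""
    pairs = sorted(labels.items(), key=lambda kv: (kv[0] != "alertname", kv[0]))
    return [value for _, value in pairs]

def subject(status: str, firing_count: int,
            group_labels: dict[str, str], common_labels: dict[str, str]) -> str:
    # [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
    head = status.upper()
    if status == "firing":
        head += f":{firing_count}"
    # {{ .GroupLabels.SortedPairs.Values | join " " }} followed by a literal space
    title = f"[{head}] {' '.join(sorted_values(group_labels))} "
    # Append common labels that are not group labels, in parentheses.
    if len(common_labels) > len(group_labels):
        extra = {k: v for k, v in common_labels.items() if k not in group_labels}
        title += "(" + " ".join(sorted_values(extra)) + ")"
    return title

print(subject(
    "firing", 2,
    {"alertname": "KubePodNotReady"},
    {"alertname": "KubePodNotReady", "namespace": "default", "severity": "warning"},
))
```

The parenthesised part is what makes the default title long: every label shared by all alerts in the group ends up in it.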

Body template:

{{ define "discord.default.message" }}
{{ if gt (len .Alerts.Firing) 0 }}
Alerts Firing:
{{ template "__text_alert_list" .Alerts.Firing }}
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
Alerts Resolved:
{{ template "__text_alert_list" .Alerts.Resolved }}
{{ end }}
{{ end }}

{{ define "__text_alert_list" }}{{ range . }}Labels:
{{ range .Labels.SortedPairs }} - {{ .Name }} = {{ .Value }}
{{ end }}Annotations:
{{ range .Annotations.SortedPairs }} - {{ .Name }} = {{ .Value }}
{{ end }}Source: {{ .GeneratorURL }}
{{ end }}{{ end }}

We can define as many mini-templates as we need with the define keyword, and we can invoke one template from inside another with the template keyword, for example {{ template "__text_alert_list" . }}.

We modified the template to only include the following essential information:

Alert name
Alert severity
Name of the pod or node that is failing
Description of the issue
Summary

The Go template now looks like this:

For the title:

{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }}{{ end }}

For the body:

{{ define "__text_alert_list" }}
{{ range . }}
{{ range .Labels.SortedPairs }}{{ if or (eq .Name "horizontalpodautoscaler") (eq .Name "instance") (eq .Name "severity") }} - {{ .Name }} = {{ .Value }}
{{ end }}{{ end }}
{{ range .Annotations.SortedPairs }}{{ if or (eq .Name "summary") (eq .Name "description") }} - {{ .Name }} = {{ .Value }}{{ end }}
{{ end }}
{{ end }}{{ end }}

The indentation and spacing are important because they will affect the spacing and formatting of the final notification.
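To preview which lines the trimmed body keeps, the following Python sketch applies the same label and annotation filters to a sample alert. It approximates the filtering only and does not reproduce the template's exact whitespace. The alert values are made up for illustration.

```python
LABELS_TO_SHOW = ("horizontalpodautoscaler", "instance", "severity")
ANNOTATIONS_TO_SHOW = ("summary", "description")

def render_alert(alert: dict) -> str:
    """Return the ' - name = value' lines the trimmed template would keep."""
    lines = []
    for name, value in sorted(alert["labels"].items()):
        if name in LABELS_TO_SHOW:
            lines.append(f" - {name} = {value}")
    for name, value in sorted(alert["annotations"].items()):
        if name in ANNOTATIONS_TO_SHOW:
            lines.append(f" - {name} = {value}")
    return "\n".join(lines)

# Hypothetical alert with more labels and annotations than we want to show.
sample_alert = {
    "labels": {
        "alertname": "KubePodNotReady",
        "instance": "node-1",
        "namespace": "default",
        "severity": "critical",
    },
    "annotations": {
        "description": "Pod has been in a non-ready state for too long.",
        "runbook_url": "https://example.com/runbook",
        "summary": "Pod not ready",
    },
}

print(render_alert(sample_alert))
```

Labels such as namespace and annotations such as runbook_url are dropped, which is what shortens the Discord message.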

Testing and Customizing Alerts

Grafana provides an option to test templates. We can test notification templates by providing payloads and static data from the contact points.

Add the Go template in the Content section and a static payload in the Payload data section, and a preview of the notification is generated. To test how the template looks in Discord, we can also add custom annotations and labels in the custom testing option in contact points.
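A static payload can be written by hand or generated. The sketch below produces one as JSON, assuming the payload is an array of alert objects with `labels`, `annotations`, and `startsAt` fields. That shape and all sample values are assumptions, so compare the output with the example payload Grafana shows in the editor before relying on it.

```python
import json

def make_test_alert(labels: dict, annotations: dict, starts_at: str) -> dict:
    """Build one alert object for a template preview payload (assumed shape)."""
    return {"labels": labels, "annotations": annotations, "startsAt": starts_at}

payload = [
    make_test_alert(
        labels={"alertname": "KubePodNotReady", "instance": "node-1", "severity": "critical"},
        annotations={"summary": "Pod not ready",
                     "description": "Pod has been in a non-ready state for too long."},
        starts_at="2023-11-05T12:00:00Z",
    )
]

payload_json = json.dumps(payload, indent=2)
print(payload_json)
```

Including labels the template filters out, as well as ones it keeps, makes the preview a better check of the template's conditions.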

This will trigger a notification with the custom template in Discord.

Conclusion

Grafana Cloud's alert system is one of its most valuable features. Grafana also offers various other options for testing and optimizing the performance of the Kubernetes cluster. We can establish thresholds and conditions for when to trigger alerts or notifications and on which platform to deliver them. Grafana Cloud helps us identify and address major problems and failures in the Kubernetes cluster.