Kubernetes is the leading orchestration platform for containerized applications, offering robust features for managing large-scale deployments. Monitoring Kubernetes clusters is critical to ensure application performance, reliability, and resource optimization. Telegraf, an open-source server agent for collecting metrics, can be an invaluable tool for this purpose. In this blog, we’ll explore how to use Telegraf to monitor Kubernetes clusters, detailing the setup, configuration, and best practices.
Introduction to Telegraf and Kubernetes
What is Telegraf?
Telegraf is part of the TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor) and is designed to collect, process, and send metrics and events from various sources. It supports numerous input and output plugins, making it highly versatile for different monitoring needs.
What is Kubernetes?

Kubernetes is an open-source platform for automating the deployment, scaling, and management of containerized applications. It provides a resilient infrastructure to run applications consistently and efficiently across different environments.
Why Monitor Kubernetes Clusters?
Monitoring Kubernetes clusters is essential for several reasons:
- Resource Utilization: Track CPU, memory, and storage usage to optimize resource allocation.
- Performance Tracking: Ensure applications are performing as expected.
- Health and Stability: Detect and troubleshoot issues early to maintain application uptime.
- Scalability: Make informed decisions on scaling applications based on workload demands.
Setting Up Telegraf to Monitor Kubernetes
Step 1: Install Telegraf
Ensure that Telegraf is installed on your system. You can download Telegraf from the InfluxData downloads page, install it with your system's package manager, or run it as a container; inside Kubernetes you will typically deploy the official `telegraf` image, as shown in Step 3.
Step 2: Configure Telegraf for Kubernetes Monitoring
Telegraf can collect metrics from Kubernetes using various plugins. The most commonly used plugins for Kubernetes monitoring are `kubernetes`, `prometheus`, and `docker`.
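As a sketch of what the other two inputs look like (the values below are defaults or placeholders you would adapt to your cluster):

```toml
## Scrape Prometheus-format endpoints. With monitor_kubernetes_pods enabled,
## Telegraf discovers pods annotated with prometheus.io/scrape=true.
[[inputs.prometheus]]
  monitor_kubernetes_pods = true

## Collect container metrics from the Docker daemon on the node
## (only relevant on nodes that run the Docker engine).
[[inputs.docker]]
  endpoint = "unix:///var/run/docker.sock"
```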
Example Configuration for Kubernetes Input Plugin
Create a Telegraf configuration file tailored for Kubernetes monitoring:
```toml
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  logfile = "/var/log/telegraf/telegraf.log"
  hostname = ""
  omit_hostname = false

[[outputs.influxdb]]
  urls = ["http://influxdb:8086"]
  database = "telegraf"
  retention_policy = ""
  write_consistency = "any"
  timeout = "10s"

[[inputs.kubernetes]]
  ## This plugin talks to the kubelet API on each node, not to the cluster
  ## API server. $HOSTIP must be set in Telegraf's environment, e.g. via
  ## the Kubernetes downward API.
  url = "https://$HOSTIP:10250"
  bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"
  tls_ca = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
  ## The kubelet's serving certificate is often self-signed; enable
  ## verification in production if your kubelets use CA-signed certs.
  insecure_skip_verify = true
  response_timeout = "5s"
```
This configuration collects node, pod, and container metrics using the `kubernetes` input plugin. Note that this plugin is designed to scrape each node's kubelet (port 10250), not the API server, so `url` should point at the node's kubelet rather than at `kubernetes.default.svc`.
Step 3: Deploy Telegraf as a DaemonSet
Deploying Telegraf as a DaemonSet in Kubernetes ensures that an instance of Telegraf runs on each node in the cluster, collecting node-specific and pod-specific metrics.
Create a Kubernetes manifest file for the Telegraf DaemonSet:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: telegraf-config
  namespace: monitoring
data:
  telegraf.conf: |
    [agent]
      interval = "10s"
      round_interval = true

    [[outputs.influxdb]]
      urls = ["http://influxdb:8086"]
      database = "telegraf"
      retention_policy = ""

    [[inputs.kubernetes]]
      url = "https://$HOSTIP:10250"
      bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"
      tls_ca = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
      insecure_skip_verify = true
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: telegraf
  namespace: monitoring
spec:
  selector:
    matchLabels:
      name: telegraf
  template:
    metadata:
      labels:
        name: telegraf
    spec:
      # The default service-account token and CA are automounted at
      # /var/run/secrets/kubernetes.io/serviceaccount, so no extra
      # projected volume is needed for them.
      containers:
        - name: telegraf
          image: telegraf:1.30   # pin a specific tag rather than :latest
          env:
            - name: HOSTIP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
          resources:
            limits:
              memory: 200Mi
              cpu: 200m
            requests:
              memory: 100Mi
              cpu: 100m
          volumeMounts:
            - name: config
              mountPath: /etc/telegraf/telegraf.conf
              subPath: telegraf.conf
            - name: dockersock
              mountPath: /var/run/docker.sock
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: telegraf-config
        # Only meaningful on nodes running the Docker engine; omit on
        # containerd-only clusters.
        - name: dockersock
          hostPath:
            path: /var/run/docker.sock
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
```
Apply the manifest to your Kubernetes cluster:
```shell
kubectl apply -f telegraf-daemonset.yaml
```
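Once applied, you can verify that the DaemonSet has scheduled a pod on every node and that Telegraf starts cleanly (the `monitoring` namespace matches the manifest above):

```shell
kubectl -n monitoring get daemonset telegraf
kubectl -n monitoring logs daemonset/telegraf --tail=20
```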
Key Metrics to Monitor
Node Metrics
Monitoring node-level metrics helps ensure the underlying infrastructure is healthy and capable of running applications efficiently.
- CPU Usage: Percentage of CPU utilized by each node.
- Memory Usage: Amount of memory used by each node.
- Disk I/O: Read and write operations on the node’s disks.
- Network Traffic: Amount of data sent and received by each node.
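The kubelet-level `kubernetes` input reports most of these, but Telegraf's host plugins can collect the same node metrics directly from the OS, which is useful as a cross-check. A minimal sketch:

```toml
[[inputs.cpu]]
  percpu = true
  totalcpu = true

[[inputs.mem]]

[[inputs.disk]]

[[inputs.diskio]]

[[inputs.net]]
```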
Pod Metrics
Pod-level metrics provide insights into the performance and health of individual containers.
- CPU and Memory Usage: Resource consumption of each pod.
- Restarts: Number of times a pod has restarted, indicating potential issues.
- Status: Current status of pods (Running, Pending, Failed).
Cluster Metrics
Cluster-wide metrics give a holistic view of the overall health and performance of the Kubernetes cluster.
- Resource Allocation: Total CPU and memory requested vs. available.
- Node Health: Status of nodes (Ready, NotReady).
- Pod Health: Status of pods across namespaces.
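Pod phase, restart counts, and node conditions are inventory-style data that the `kube_inventory` plugin (which, unlike `inputs.kubernetes`, does query the API server) collects. A sketch, assuming the pod's service account has read access to these resources:

```toml
[[inputs.kube_inventory]]
  url = "https://kubernetes.default.svc"
  bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"
  ## restrict collection to the resource types you actually chart
  resource_include = ["nodes", "pods", "deployments"]
```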
Visualizing Metrics
Using Grafana
Grafana is a popular open-source platform for monitoring and observability, which can be used to visualize metrics collected by Telegraf.
1. Install Grafana: Deploy Grafana in your Kubernetes cluster, for example with the official Helm chart:

```shell
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install grafana grafana/grafana --namespace monitoring
```
2. Configure Data Source: Add InfluxDB as a data source in Grafana.
3. Create Dashboards: Create custom dashboards in Grafana to visualize node, pod, and cluster metrics.
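For example, a per-pod memory panel can be driven by an InfluxQL query against the measurements the `kubernetes` input writes (measurement, field, and tag names here follow the plugin's documented schema; `$timeFilter` and `$__interval` are Grafana template variables):

```sql
SELECT mean("memory_usage_bytes")
FROM "kubernetes_pod_container"
WHERE $timeFilter
GROUP BY time($__interval), "pod_name"
```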
Sample Grafana Dashboard
A sample Grafana dashboard for Kubernetes monitoring might include:
- CPU and Memory Usage: Visualize node and pod resource usage over time.
- Pod Restarts: Track the number of restarts per pod.
- Node and Pod Status: Display the status of nodes and pods.
- Network Traffic: Monitor network traffic across nodes and pods.
Best Practices for Monitoring Kubernetes with Telegraf
Optimize Configuration
- Polling Intervals: Set appropriate polling intervals to balance data granularity and performance.
- Minimal Metrics: Collect only the necessary metrics to reduce storage and processing overhead.
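Both points can be enforced declaratively in the Telegraf config: every input accepts a per-plugin `interval` override and metric-filtering selectors such as `fieldpass`. A sketch (the field names are illustrative):

```toml
[[inputs.kubernetes]]
  ## poll less often than the global default where coarse data is enough
  interval = "60s"
  ## keep only the fields you actually dashboard
  fieldpass = ["cpu_usage_nanocores", "memory_usage_bytes"]
```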
Secure Communication
- TLS/SSL: Use encrypted communication between Telegraf and the metric storage backend (e.g., InfluxDB).
- Authentication: Use authentication mechanisms to secure access to the Kubernetes API server.
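For the InfluxDB 1.x output this typically means HTTPS plus credentials; a sketch (the hostname, CA path, and environment variable names are placeholders):

```toml
[[outputs.influxdb]]
  urls = ["https://influxdb.example.com:8086"]
  database = "telegraf"
  username = "$INFLUX_USER"       # injected from a Kubernetes Secret
  password = "$INFLUX_PASSWORD"
  tls_ca = "/etc/telegraf/ca.pem"
  insecure_skip_verify = false
```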
Resource Management
- Resource Limits: Set resource limits for Telegraf containers to prevent them from consuming excessive resources.
- Horizontal Scaling: Scale Telegraf instances horizontally to handle increased load.
Regular Monitoring and Alerts
- Dashboards: Regularly monitor dashboards to track the health and performance of the cluster.
- Alerts: Set up alerts for critical metrics to detect and respond to issues promptly.
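Since Telegraf is part of the TICK stack, Kapacitor can evaluate alerts on the same data. A minimal TICKscript sketch (the threshold and log path are arbitrary):

```
stream
    |from()
        .measurement('kubernetes_pod_container')
    |alert()
        .crit(lambda: "memory_usage_bytes" > 500 * 1024 * 1024)
        .log('/var/log/kapacitor/pod_memory_alerts.log')
```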
Conclusion
Using Telegraf to monitor Kubernetes clusters provides a comprehensive and flexible approach to collecting and visualizing metrics. By deploying Telegraf as a DaemonSet and leveraging its powerful plugins, you can gain deep insights into the performance and health of your Kubernetes environment. Following best practices such as optimizing configurations, securing communications, and setting up robust monitoring and alerting systems will help ensure your Kubernetes clusters run smoothly and efficiently.
I hope this gave you some useful insights. Please feel free to drop any comments, questions, or suggestions. Thank you!