Top 10 Kubernetes Monitoring & Observability Tools
Effective monitoring and observability are critical for running Kubernetes clusters in production. The cloud-native ecosystem offers a rich set of tools for collecting metrics, visualizing data, and gaining insights into cluster and application performance. Here are the top 10 monitoring and observability tools that every Kubernetes operator should know.
Prometheus
The de facto standard for collecting metrics in Kubernetes clusters.
Prometheus is the cornerstone of Kubernetes monitoring, providing a powerful time-series database and query language for collecting and analyzing metrics. Its pull-based architecture and service discovery make it ideal for dynamic container environments.
Key Features:
- Time-series data collection
- Powerful query language (PromQL)
- Service discovery integration
- Alerting capabilities
- High availability support
Installation:
# Using Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack
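# Using kubectl (clone the repository first; a raw.githubusercontent.com URL cannot point at a directory)
git clone https://github.com/prometheus-operator/kube-prometheus.git
kubectl apply --server-side -f kube-prometheus/manifests/setup
kubectl wait --for condition=Established --all CustomResourceDefinition --namespace=monitoring
kubectl apply -f kube-prometheus/manifests/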
Configuration Example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
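To make the pull model concrete: on every scrape, Prometheus fetches a plain-text page of samples from each target. The sketch below parses a simplified subset of that text format. The function name `parse_exposition` and the sample payload are illustrative assumptions, and escaped quotes in label values and trailing timestamps are deliberately not handled.

```python
import re

# Matches: metric_name{label="value",...} 12.5   (the label block is optional)
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_exposition(text):
    """Parse a simplified subset of the Prometheus text format.

    Returns a list of (metric_name, labels_dict, value) tuples.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical scrape payload, for illustration only
SCRAPE = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="500"} 3
process_cpu_seconds_total 12.5
"""

for name, labels, value in parse_exposition(SCRAPE):
    print(name, labels, value)
```

Each parsed sample becomes one point in a time series identified by the metric name plus its label set.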
Grafana
Visualization layer commonly paired with Prometheus and Loki.
Grafana turns raw metrics, logs, and traces into dashboards and visualizations. It connects to multiple data sources, including Prometheus and Loki, and queries each one in its native query language.
Key Features:
- Rich dashboard creation
- Multiple data source support
- Alerting and notifications
- User management and permissions
- Plugin ecosystem
Installation:
# Using Helm
helm repo add grafana https://grafana.github.io/helm-charts
helm install grafana grafana/grafana
# Using kubectl (render the chart first; raw chart templates are not valid manifests)
helm template grafana grafana/grafana | kubectl apply -f -
Dashboard Configuration:
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
data:
  # File-provisioned dashboards hold the bare dashboard model,
  # without the "dashboard" wrapper used by the HTTP API.
  kubernetes-cluster.json: |
    {
      "title": "Kubernetes Cluster",
      "panels": [
        {
          "title": "CPU Usage",
          "type": "timeseries",
          "targets": [
            {
              "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"\"}[5m]))"
            }
          ]
        }
      ]
    }
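The panel query above relies on `rate()` over a 5-minute window of a counter. As a rough illustration of what that computes, here is a simplified sketch that handles counter resets but, unlike PromQL's `rate()`, does not extrapolate to the window edges. The function name `simple_rate` is an assumption made for this example.

```python
def simple_rate(samples):
    """Approximate the per-second rate of a counter over a window.

    `samples` is a list of (timestamp_seconds, counter_value) pairs in time order.
    A drop in value is treated as a counter reset (the counter restarted from zero).
    """
    if len(samples) < 2:
        return None  # a rate needs at least two points
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            increase += value  # reset: count from zero up to the new value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# CPU seconds consumed, sampled every 60 s: one core fully busy gives a rate of 1.0
print(simple_rate([(0, 100), (60, 160), (120, 220)]))
```

Applied to `container_cpu_usage_seconds_total`, the result reads as the number of CPU cores in use, which is why the panel sums the per-container rates.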
Loki
A log aggregation system by Grafana Labs, designed to integrate with Prometheus.
Loki provides efficient log aggregation and querying capabilities, designed to work seamlessly with Prometheus and Grafana. It’s optimized for Kubernetes environments and provides cost-effective log storage.
Key Features:
- Efficient log storage
- PromQL-like query language (LogQL)
- Kubernetes-native design
- Cost-effective scaling
- Integration with Grafana
Installation:
# Using Helm
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki
# Using kubectl (render the chart first; raw chart templates are not valid manifests)
helm template loki grafana/loki | kubectl apply -f -
Configuration Example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-config
data:
  loki.yaml: |
    auth_enabled: false
    server:
      http_listen_port: 3100
    ingester:
      lifecycler:
        address: 127.0.0.1
        ring:
          kvstore:
            store: inmemory
          replication_factor: 1
        final_sleep: 0s
      chunk_idle_period: 5m
      chunk_retain_period: 30s
    schema_config:
      configs:
        - from: 2020-05-15
          store: boltdb-shipper
          object_store: filesystem
          schema: v11
          index:
            prefix: index_
            period: 24h
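Loki keeps storage cheap largely because it indexes only the labels of each log stream, not the log text itself, and filters lines at query time. The toy model below sketches that idea in memory. `TinyLogStore` and its methods are hypothetical names for illustration and have nothing to do with Loki's actual implementation or API.

```python
from collections import defaultdict


class TinyLogStore:
    """Toy model of label-indexed log storage.

    Only label sets are indexed; log lines are stored per stream and
    scanned when a query runs.
    """

    def __init__(self):
        # frozenset of (label, value) pairs -> list of log lines
        self._streams = defaultdict(list)

    def push(self, labels, line):
        self._streams[frozenset(labels.items())].append(line)

    def query(self, matchers, contains=""):
        """Roughly `{k="v", ...} |= "contains"` in LogQL terms."""
        wanted = set(matchers.items())
        results = []
        for stream_labels, lines in self._streams.items():
            if wanted <= stream_labels:  # every matcher must be present on the stream
                results.extend(line for line in lines if contains in line)
        return results


store = TinyLogStore()
store.push({"app": "api", "namespace": "prod"}, "GET /health 200")
store.push({"app": "api", "namespace": "prod"}, "error: upstream timeout")
store.push({"app": "worker", "namespace": "prod"}, "error: job failed")

print(store.query({"app": "api"}, contains="error"))
```

The label selector narrows the search to a few streams first, so the text filter only has to scan a small share of the stored lines.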
Thanos
Long-term, highly available Prometheus setup for global metrics.
Thanos extends Prometheus with long-term storage capabilities and global querying across multiple clusters. It’s essential for organizations running multiple Kubernetes clusters or requiring long-term metric retention.
Key Features:
- Long-term metric storage
- Global querying
- High availability
- Multi-cluster support
- Cost-effective storage
Installation:
# Using Helm (the Thanos chart is published in the Bitnami repository)
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install thanos bitnami/thanos
# Using kubectl (render the chart first, then apply the generated manifests)
helm template thanos bitnami/thanos | kubectl apply -f -
Configuration Example:
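apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-objstore-config
data:
  # Object storage settings, read by the sidecar, store gateway, and compactor
  objstore.yml: |
    type: S3
    config:
      bucket: "thanos-metrics"
      endpoint: "s3.amazonaws.com"
      # Placeholder credentials; keep real keys in a Secret rather than a ConfigMap
      access_key: "your-access-key"
      secret_key: "your-secret-key"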
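High availability with Thanos usually means running two or more Prometheus replicas that scrape the same targets, then letting the query layer merge their series by ignoring a replica label. The sketch below illustrates that merge under simplifying assumptions: the label name `replica` and the first-value-wins rule are choices made for this example, and Thanos's real deduplication logic is more involved.

```python
def deduplicate(series_list, replica_label="replica"):
    """Merge series that differ only by the replica label.

    Each series is a (labels_dict, {timestamp: value}) pair. When several
    replicas report the same timestamp, the first replica seen wins.
    """
    merged = {}
    for labels, points in series_list:
        # Identity of the series once the replica label is dropped
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        target = merged.setdefault(key, {})
        for timestamp, value in points.items():
            target.setdefault(timestamp, value)  # keep the first value seen
    return [(dict(key), dict(sorted(points.items()))) for key, points in merged.items()]


replica_a = ({"job": "api", "replica": "a"}, {0: 1.0, 30: 2.0})
replica_b = ({"job": "api", "replica": "b"}, {30: 2.1, 60: 3.0})  # "a" missed the scrape at t=60

print(deduplicate([replica_a, replica_b]))
```

The merged series has no gap at t=60 even though one replica missed that scrape, which is the practical benefit of running replicas.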
VictoriaMetrics
Fast, scalable alternative to Prometheus with long-term storage.
VictoriaMetrics provides a high-performance, cost-effective alternative to Prometheus with built-in long-term storage and enhanced query performance.
Key Features:
- High-performance storage
- Built-in long-term retention
- Prometheus compatibility
- Cost-effective scaling
- Enhanced query performance
Installation:
# Using Helm
helm repo add vm https://victoriametrics.github.io/helm-charts/
helm install victoria-metrics vm/victoria-metrics-single
# Using kubectl (render the chart first; raw chart templates are not valid manifests)
helm template victoria-metrics vm/victoria-metrics-single | kubectl apply -f -
Configuration Example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: victoria-metrics-config
data:
  victoria-metrics.yaml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
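The `keep` rule in the configuration above decides which discovered pods get scraped. In Prometheus-style relabeling, the values of `source_labels` are joined with a separator (`;` by default) and the regex must match the entire joined string. A minimal sketch of that check follows; `keep_target` is an illustrative name, and Python's `re` module stands in for the RE2 syntax the real implementations use.

```python
import re


def keep_target(target_labels, source_labels, regex, separator=";"):
    """Return True if a `keep` relabel rule would retain this target.

    Missing labels count as empty strings, and the regex has to match
    the whole joined value (it is implicitly anchored).
    """
    joined = separator.join(target_labels.get(name, "") for name in source_labels)
    return re.fullmatch(regex, joined) is not None


SCRAPE_ANNOTATION = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"

annotated_pod = {SCRAPE_ANNOTATION: "true"}
plain_pod = {}  # no prometheus.io/scrape annotation

print(keep_target(annotated_pod, [SCRAPE_ANNOTATION], "true"))
print(keep_target(plain_pod, [SCRAPE_ANNOTATION], "true"))
```

Only pods annotated with `prometheus.io/scrape: "true"` pass the rule; everything else is dropped before scraping.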
OpenTelemetry Collector
Foundation for tracing and metrics instrumentation.
OpenTelemetry Collector provides a unified approach to collecting traces, metrics, and logs. It’s becoming the standard for observability data collection in cloud-native environments.
Key Features:
- Unified data collection
- Multiple format support
- Flexible processing
- Vendor-agnostic
- High performance
Installation:
# Using Helm
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
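# Recent chart versions require mode and image.repository to be set explicitly
helm install otel-collector open-telemetry/opentelemetry-collector --set mode=deployment --set image.repository=otel/opentelemetry-collector-contrib
# Using kubectl (render the chart first; raw chart templates are not valid manifests)
helm template otel-collector open-telemetry/opentelemetry-collector --set mode=deployment --set image.repository=otel/opentelemetry-collector-contrib | kubectl apply -f -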
Configuration Example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
    exporters:
      prometheus:
        endpoint: "0.0.0.0:9464"
      otlp:
        endpoint: "tempo:4317"
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
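Both pipelines above pass data through the `batch` processor, which groups items and sends a batch either when it reaches `send_batch_size` or when `timeout` elapses. The following toy class sketches that behaviour with an injected clock so it stays deterministic. `Batcher` and its method names are assumptions for illustration, not the Collector's implementation.

```python
class Batcher:
    """Toy batch processor: flush when the batch is full or has waited too long."""

    def __init__(self, send_batch_size=1024, timeout=1.0):
        self.send_batch_size = send_batch_size
        self.timeout = timeout
        self._items = []
        self._started = None  # time the first item of the current batch arrived

    def add(self, item, now):
        """Add one item; return the flushed batch (a list) or None."""
        if not self._items:
            self._started = now
        self._items.append(item)
        if len(self._items) >= self.send_batch_size:
            return self._flush()
        return None

    def tick(self, now):
        """Call periodically; flushes a non-empty batch once the timeout has passed."""
        if self._items and now - self._started >= self.timeout:
            return self._flush()
        return None

    def _flush(self):
        batch, self._items = self._items, []
        return batch


batcher = Batcher(send_batch_size=3, timeout=1.0)
print(batcher.add("span-1", now=0.0))
print(batcher.add("span-2", now=0.1))
print(batcher.add("span-3", now=0.2))  # third item fills the batch
```

Batching trades a little latency for fewer, larger export requests, which is why it is commonly placed in front of network exporters.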
kube-state-metrics
Exposes cluster state as metrics for Prometheus.
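kube-state-metrics translates Kubernetes objects into Prometheus metrics. It reports the desired and current state of objects such as Deployments, Pods, and Nodes (replica counts, conditions, resource requests and limits) rather than live resource usage.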
Key Features:
- Kubernetes object metrics
- Resource state tracking
- Custom resource support
- Prometheus integration
- Real-time updates
Installation:
# Using Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-state-metrics prometheus-community/kube-state-metrics
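# Using kubectl (clone the repository first; a raw.githubusercontent.com URL cannot point at a directory)
git clone https://github.com/kubernetes/kube-state-metrics.git
kubectl apply -f kube-state-metrics/examples/standard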
Configuration Example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      containers:
        - name: kube-state-metrics
          image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.8.0
          ports:
            - name: metrics
              containerPort: 8080
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            timeoutSeconds: 5
          readinessProbe:
            httpGet:
              path: /
              port: 8080
            initialDelaySeconds: 5
            timeoutSeconds: 5
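To illustrate what "translating objects into metrics" means, the sketch below renders two samples for a Deployment in the style kube-state-metrics uses. The metric names `kube_deployment_spec_replicas` and `kube_deployment_status_replicas_available` are real; the function `deployment_metrics` and the input dict are simplified assumptions for this example.

```python
def deployment_metrics(deployment):
    """Render a few kube-state-metrics-style samples for one Deployment.

    `deployment` is a plain dict shaped like the Kubernetes API object.
    """
    meta = deployment["metadata"]
    labels = f'namespace="{meta["namespace"]}",deployment="{meta["name"]}"'
    desired = deployment["spec"].get("replicas", 1)
    available = deployment.get("status", {}).get("availableReplicas", 0)
    return [
        f"kube_deployment_spec_replicas{{{labels}}} {desired}",
        f"kube_deployment_status_replicas_available{{{labels}}} {available}",
    ]


# Hypothetical Deployment with one replica still unavailable
web = {
    "metadata": {"name": "web", "namespace": "default"},
    "spec": {"replicas": 3},
    "status": {"availableReplicas": 2},
}

for line in deployment_metrics(web):
    print(line)
```

An alert comparing the two series (desired versus available replicas) is a common first rule to build on top of these metrics.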
Metrics Server
Lightweight aggregator of resource usage for HPA and dashboarding.
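Metrics Server collects CPU and memory usage for nodes and pods from the kubelets and serves it through the Metrics API, which the Horizontal Pod Autoscaler (HPA) and `kubectl top` consume. It keeps only the most recent values, so it complements a metrics store such as Prometheus rather than replacing it.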
Key Features:
- Resource usage metrics
- HPA support
- Lightweight design
- Real-time data
- Kubernetes-native
Installation:
# Using kubectl
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Using Helm
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm install metrics-server metrics-server/metrics-server
Configuration Example:
# Excerpt: the upstream components.yaml also defines RBAC, a Service, and the APIService registration
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metrics-server
  namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-server
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: metrics-server
  template:
    metadata:
      labels:
        app: metrics-server
    spec:
      serviceAccountName: metrics-server
      containers:
        - name: metrics-server
          image: registry.k8s.io/metrics-server/metrics-server:v0.6.4
          args:
            - --cert-dir=/tmp
            # Serve on the same port that is declared as containerPort below
            - --secure-port=4443
            # Skips kubelet certificate verification; suitable for test clusters only
            - --kubelet-insecure-tls
            - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
          ports:
            - name: main-port
              containerPort: 4443
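The HPA turns the usage numbers served by Metrics Server into a replica count using the formula from the Kubernetes documentation: desired = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with no change while the ratio is within a tolerance of 1.0 (0.1 by default). A minimal sketch of that calculation follows; `desired_replicas` is an illustrative name, and min/max bounds and stabilization windows are left out.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Replica count from the documented HPA formula.

    desired = ceil(current_replicas * current_value / target_value)

    The current count is kept while the usage ratio stays within
    `tolerance` of 1.0.
    """
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * current_value / target_value)


# 4 replicas averaging 90% CPU against a 60% target
print(desired_replicas(4, 90, 60))  # 6
```

The tolerance band keeps the autoscaler from flapping when usage hovers around the target.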
Kubewatch
Real-time resource change notifications via Slack or webhooks.
Kubewatch monitors Kubernetes events and sends real-time notifications to various channels, helping teams stay informed about cluster changes and potential issues.
Key Features:
- Real-time event monitoring
- Multiple notification channels
- Customizable filters
- Webhook support
- Slack integration
Installation:
# Using Helm (this chart comes from the Bitnami repository)
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install kubewatch bitnami/kubewatch
# Using kubectl (render the chart first, then apply the generated manifests)
helm template kubewatch bitnami/kubewatch | kubectl apply -f -
Configuration Example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubewatch-config
data:
  .kubewatch.yaml: |
    handler:
      slack:
        token: "your-slack-token"
        channel: "#kubernetes"
        title: "Kubernetes Event"
    # kubewatch expects a map of resource kinds to booleans;
    # create, update, and delete events are reported for each enabled kind
    resource:
      deployment: true
      pod: true
      services: true
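Conceptually, an event notifier like this filters the stream of cluster events by resource kind and action, then formats a message for each one that passes. The sketch below shows that filter-and-format step in isolation. The event dict keys (`kind`, `name`, `namespace`, `action`) and the function `build_notifications` are hypothetical and do not reflect Kubewatch's internals.

```python
def build_notifications(events, resources, actions, channel="#kubernetes"):
    """Filter cluster events and format a chat message for each match.

    `events` are plain dicts with the keys kind, name, namespace, and action.
    """
    messages = []
    for event in events:
        if event["kind"] not in resources or event["action"] not in actions:
            continue  # not a resource kind or action we were asked to watch
        text = f'{event["kind"]} {event["namespace"]}/{event["name"]} was {event["action"]}d'
        messages.append({"channel": channel, "text": text})
    return messages


events = [
    {"kind": "pod", "name": "web-7d9f", "namespace": "default", "action": "delete"},
    {"kind": "configmap", "name": "settings", "namespace": "default", "action": "update"},
    {"kind": "deployment", "name": "web", "namespace": "default", "action": "update"},
]

for message in build_notifications(events, {"deployment", "pod", "service"}, {"create", "update", "delete"}):
    print(message)
```

Keeping the watched kinds and actions narrow is the main lever for limiting notification noise.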
Weave Scope
Visualizes processes, containers, and services in real-time.
Weave Scope provides a visual interface for exploring and monitoring Kubernetes clusters, making it easier to understand application topology and troubleshoot issues.
Key Features:
- Visual cluster exploration
- Real-time topology mapping
- Container and process monitoring
- Interactive debugging
- Performance insights
Installation:
# Using kubectl (the cloud.weave.works endpoint is no longer available; use the release manifest)
kubectl apply -f https://github.com/weaveworks/scope/releases/download/v1.13.2/k8s-scope.yaml
# Using Helm
helm repo add weaveworks https://weaveworks.github.io/helm-charts
helm install weave-scope weaveworks/weave-scope
Configuration Example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: weave-scope-app
  namespace: weave
spec:
  replicas: 1
  selector:
    matchLabels:
      app: weave-scope-app
  template:
    metadata:
      labels:
        app: weave-scope-app
    spec:
      containers:
        - name: scope
          image: weaveworks/scope:latest
          ports:
            - containerPort: 4040
          env:
            - name: WEAVE_SCOPE_DISCOVERY_URL
              value: "weave-scope-app.weave.svc.cluster.local:4040"
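A topology map is, at its core, a graph built from observed connections between workloads. The sketch below builds such an adjacency map from a list of (source, destination) pairs. The function `build_topology` and the sample service names are assumptions for illustration and say nothing about how Weave Scope collects or stores its data.

```python
from collections import defaultdict


def build_topology(connections):
    """Build a service-level adjacency map from observed connections.

    `connections` is an iterable of (source, destination) names.
    Returns {source: sorted list of distinct destinations}.
    """
    edges = defaultdict(set)
    for source, destination in connections:
        if source != destination:  # ignore self-connections
            edges[source].add(destination)
    return {source: sorted(targets) for source, targets in edges.items()}


# Hypothetical connections observed between services
observed = [
    ("frontend", "api"),
    ("api", "postgres"),
    ("api", "redis"),
    ("frontend", "api"),  # duplicates collapse into one edge
]

print(build_topology(observed))
```

Rendering this map as a graph is what makes unexpected dependencies between services easy to spot.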
Recommended Stack:
- Metrics Collection: Prometheus + kube-state-metrics + Metrics Server
- Logging: Loki + Promtail
- Tracing: Jaeger or Zipkin with OpenTelemetry
- Visualization: Grafana
- Alerting: Prometheus Alertmanager + Grafana Alerts
- Long-term Storage: Thanos or VictoriaMetrics
Implementation Roadmap:
- Start with Core Metrics: Deploy Prometheus and kube-state-metrics
- Add Visualization: Install Grafana and create basic dashboards
- Implement Logging: Deploy Loki and configure log collection
- Set Up Alerting: Configure Alertmanager with meaningful alerts
- Add Tracing: Implement OpenTelemetry for distributed tracing
- Scale and Optimize: Add long-term storage and optimize performance
Best Practices:
- Resource Planning: Allocate sufficient resources for monitoring components
- Security: Implement proper RBAC and network policies
- Retention Policies: Configure appropriate data retention periods
- Alert Fatigue: Design meaningful alerts to avoid notification overload
- Documentation: Maintain clear documentation for dashboards and alerts
- Testing: Regularly test monitoring and alerting systems
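For resource planning and retention, the Prometheus documentation gives a rough capacity formula: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample on average. The sketch below applies it. The function name and the example ingestion rate are assumptions, and real usage depends on series churn and compression.

```python
def estimate_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rough Prometheus disk estimate.

    needed_disk_space = retention_seconds * ingested_samples_per_second * bytes_per_sample
    """
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical cluster ingesting 100,000 samples per second, kept for 15 days
needed = estimate_disk_bytes(retention_days=15, samples_per_second=100_000)
print(f"{needed / 1e9:.1f} GB")
```

Running the numbers before choosing a retention period shows quickly whether local disks are enough or whether long-term storage such as Thanos or VictoriaMetrics is the better fit.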
A comprehensive monitoring and observability strategy is essential for running Kubernetes clusters in production. The tools outlined above provide the foundation for understanding cluster health, application performance, and user experience.
Start with the core tools (Prometheus, Grafana, kube-state-metrics) and gradually expand your observability stack based on your specific needs. Remember that effective monitoring is not just about collecting data; it is about providing actionable insights that help you maintain reliable, performant applications.
For organizations running multiple clusters or requiring enterprise-grade features, consider managed solutions that build upon these open-source tools while providing additional features like global aggregation, advanced analytics, and professional support.
For more information about Kubernetes monitoring and observability, visit the official Kubernetes documentation and the CNCF observability landscape.