Monitoring Istio on AKS with Prometheus and Grafana

How to monitor Istio deployed on AKS with Managed Prometheus and Grafana

As more teams deploy Istio on AKS, I want to demonstrate how to leverage the Managed Prometheus and Grafana services in Azure to monitor the service mesh and associated services sitting behind it.

One of the benefits of leveraging Istio in your stack on AKS is that you can get metrics about calls made to your backend services without needing to manually instrument your application. These metrics come from the Envoy sidecars Istio injects and from the ingress gateway it deploys. With the Managed Prometheus and Grafana services in Azure, we have an easy path to a fully managed solution for storing and monitoring those exposed metrics.

A few prerequisites that I'm expecting:

  1. An AKS cluster
  2. Istio on AKS (Step 2 covers a basic install if you haven't done this yet)
  3. An Azure Monitor Workspace linked to Grafana - if you need guidance on this one, review my post on integrating Prometheus with AKS; a rough CLI sketch also follows this list
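
If you still need to wire up the Managed Prometheus and Grafana pieces, the Azure CLI flow looks roughly like the following. Treat it as a sketch: the flags assume a recent Azure CLI version, and the resource IDs are placeholders you'd swap for your own workspace and Grafana instance.

#SKETCH ONLY - ENABLE MANAGED PROMETHEUS ON THE CLUSTER AND LINK IT TO AN
#EXISTING AZURE MONITOR WORKSPACE AND AZURE MANAGED GRAFANA INSTANCE
az aks update \
  --name <aks-cluster-name> \
  --resource-group <resource-group> \
  --enable-azure-monitor-metrics \
  --azure-monitor-workspace-resource-id <azure-monitor-workspace-resource-id> \
  --grafana-resource-id <managed-grafana-resource-id>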

Step 1 - Setup Scraping for Managed Prometheus

You may have already completed this as part of the pre-requisites, but to ensure it's done properly I'll review it here.

When you set up your AKS cluster with Managed Prometheus in Azure, only a default set of AKS metrics is captured. It does not, however, scrape your custom pods. Therefore, we need to apply a custom config map to the cluster that tells the metrics agent to scrape your pods (there are a few options on how to do this noted in the docs - we are taking the recommended approach here, but review the different options if your scenario differs). We can do this by applying the following config map, which is configured to scrape all pods in the cluster that have the common Prometheus pod annotations telling the agent to scrape for metrics - these annotations are set on the proxies that Istio injects as well as on the ingress gateway:

Create a file named prometheus-config and copy the following contents:

scrape_configs:
  - job_name: 'kubernetespods-scrape'

    kubernetes_sd_configs:
    - role: pod

    relabel_configs:
    # Scrape only pods with the annotation: prometheus.io/scrape = true
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true

    # If prometheus.io/path is specified, scrape this path instead of /metrics
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)

    # If prometheus.io/port is specified, scrape this port instead of the default
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__

    # If prometheus.io/scheme is specified, scrape with this scheme instead of http
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
      action: replace
      regex: (http|https)
      target_label: __scheme__

    # Include the pod namespace as a label for each metric
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace

    # Include the pod name as a label for each metric
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: kubernetes_pod_name

    # [Optional] Include all pod labels as labels for each metric
    - action: labelmap
      regex: __meta_kubernetes_pod_label_(.+)

prometheus-config file that we apply to the kube-system namespace as a config map

Apply the config map to the kube-system namespace with the following command:

kubectl create configmap ama-metrics-prometheus-config --from-file=prometheus-config -n kube-system

To confirm that all is well after you apply the config map, you should see the AMA metrics agent pods in the kube-system namespace restart within about 2-3 minutes:

AMA Metric Pod Restarts to Pick Up Custom Config

If you run a kubectl logs command on that pod, you will find the following log message showing the scrape job was configured:

Custom Scrape Job is Configured
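
If you prefer to verify from the command line rather than the portal, a quick check could look like the sketch below. The exact ama-metrics pod names vary by cluster and agent version, so substitute one of the pod names returned by the first command.

#LIST THE METRICS AGENT PODS - THEY SHOULD SHOW A RECENT RESTART AFTER THE CONFIG MAP IS APPLIED
kubectl get pods -n kube-system | grep ama-metrics

#CHECK THE LOGS OF ONE OF THE ama-metrics PODS FOR THE CUSTOM SCRAPE JOB NAME
kubectl logs <ama-metrics-pod-name> -n kube-system | grep kubernetespods-scrape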

Step 2 - Deploy Istio and Associated Service

You may not need to follow this step if you already have Istio deployed. I'll take a path of deploying the default profile to get an ingress gateway and the core components deployed:

Install Istio Default Profile
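
The screenshot above shows the install output; if you're running this yourself with istioctl, the command is essentially the following (assuming istioctl is installed and your kubectl context points at the AKS cluster):

#INSTALL ISTIO USING THE DEFAULT PROFILE (DEPLOYS istiod AND THE INGRESS GATEWAY)
istioctl install --set profile=default -y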

From there, I'll deploy the sample httpbin service so there is something to interact with. I'll first label the namespace for injection and then run the deployment:

#CREATE HTTPBIN NAMESPACE
kubectl create ns httpbin

#LABEL NAMESPACE FOR INJECTION
kubectl label namespace httpbin istio-injection=enabled

#DEPLOY HTTPBIN SERVICE IN HTTPBIN NAMESPACE
kubectl apply -f https://raw.githubusercontent.com/istio/istio/master/samples/httpbin/httpbin.yaml -n httpbin

#DEPLOY GATEWAY AND VIRTUALSERVICE IN HTTPBIN NAMESPACE
kubectl apply -f https://raw.githubusercontent.com/istio/istio/master/samples/httpbin/httpbin-gateway.yaml -n httpbin

As a check, you can get your ingress gateway IP and run a curl. The httpbin pod has a /uuid path that will return a randomly generated uuid:

Check to Confirm Deployment
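
If you'd like to run the same check from a terminal, something like the following works, assuming the default profile's ingress gateway service name and namespace (istio-ingressgateway in istio-system):

#GET THE EXTERNAL IP OF THE ISTIO INGRESS GATEWAY
export SERVICE_IP=$(kubectl get svc istio-ingressgateway -n istio-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

#CALL THE /uuid PATH ON HTTPBIN THROUGH THE GATEWAY
curl http://$SERVICE_IP/uuid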

To generate some data, I'm going to run a curl in a loop that we can then use in our next few steps. Be sure to replace the SERVICE_IP:

#!/bin/bash
while true
do
  curl http://SERVICE_IP/uuid
  sleep 1
done

Curl Loop to Generate Data

Curl Loop Running

Step 3 - Import Istio Dashboards to Grafana

Now that we have Istio deployed and we've set up our Prometheus scraping, we should be able to see the metrics in our linked Grafana instance.

Navigate to the Explore tab from the home page and you should then see a number of envoy and istio metrics:

Navigate to Explore tab
Be Sure to Select your Prometheus Workspace as the data source. You should see istio metrics being populated
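
As a quick sanity check in Explore, you can also run a simple PromQL query against one of Istio's standard metrics. The query below is just an example - any of the istio_* or envoy_* metrics will confirm data is flowing:

#EXAMPLE PROMQL - REQUEST RATE PER DESTINATION SERVICE OVER THE LAST 5 MINUTES
sum(rate(istio_requests_total[5m])) by (destination_service)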

Now from here you can do anything you normally would with Grafana and Prometheus. To me, though, the key value-add of leveraging open source tools like Grafana is that someone has likely already built dashboards around these common metrics.

Istio has already published a number of dashboards, so we can import these into our Grafana instance. I deployed Istio version 1.20.0, so I'll want to import the dashboard versions that correspond.

I'll start with the Istio Service Dashboard. We can simply copy the dashboard ID to our clipboard and import it into our Grafana instance. Alternatively, to match our deployed Istio version, we can select Revisions and download the JSON for the revision that corresponds. I'll take the second path:

Download JSON for Specific Istio Version
Navigate to Dashboards and select New and Import
Paste the Downloaded JSON and select Load
Select your Prometheus Datasource and Import. You can modify the name and folders if desired

Since I have the curl script running in the background, I am able to see data populated. Notice that the "client" in this dashboard is the ingress gateway and the "service" is the httpbin backend service. The metrics look the same from both perspectives since the gateway forwards everything to the httpbin service in our scenario. The metrics also look accurate since the curl script is generating a request roughly every second:

Istio Service Dashboard - There is no data for TCP in this case since the gateway is an HTTP gateway

You can continue following the import process to get the main Istio Dashboards into Grafana.

Callout About Default Dashboards

A key point to consider is that default dashboards can make assumptions that you need to evaluate, especially at scale if you have multiple AKS clusters linked to one Azure Monitor Workspace and Grafana instance. Depending on your setup and use case, it may be necessary to extend the dashboard.

If we consider our example, what would happen if I deployed the exact same httpbin service to another AKS cluster and linked that cluster to the same Azure Monitor Workspace? The results would effectively double, since the names of the gateway and backend service would match exactly and the dashboard has no filter to distinguish "cluster 1" from "cluster 2".

There are many ways to address this. You could simply configure each cluster to use a different Azure Monitor Workspace, but that isn't really necessary. Another way is to add a filter that further scopes the queries.

In my small example, I will simply use the cluster name label but you could use custom labels on the pods themselves or other variables to do the filtering.

Every metric processed by the AMA Agent in your AKS cluster will have a cluster label that is the name of the cluster:

cluster Label is the name of the AKS cluster

Therefore, I could use this label to filter the metrics by each cluster if desired. I would need to add a variable to the dashboard and then use it in the queries. Here's what that could look like, and again you can customize or do it differently as needed.

First navigate to the dashboard settings:

Navigate to the Dashboard and select Dashboard Settings

From there, let's create a new variable. This can look different on your end depending on the type of variable you want to create. In this scenario, we simply want to get the name of the cluster available to us, so our config can look like the following:
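
As a rough sketch of that variable definition (the names here are just my choices - the important part is a query-type variable populated by a label_values query against the cluster label):

#GRAFANA DASHBOARD VARIABLE (QUERY TYPE) POPULATED FROM THE cluster LABEL
#Name: cluster
#Data source: your Managed Prometheus / Azure Monitor Workspace data source
#Query:
label_values(istio_requests_total, cluster)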

Once that is applied, you will see the variable populate at the top of the dashboard. However, we're not done yet, since we still need to include it in the queries of the visuals themselves:

Variable Appears in Dashboard

Select one of the visuals we want to update:

Edit a Visual to Include the New Query Variable

From there, we can edit the query and test to confirm it works:

Update Query to Include Cluster Filter. You can use either the Builder or Code
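
The exact query differs per visual, but the change itself is just adding the cluster matcher to the existing label selector. A hypothetical before and after for a request-rate style panel could look like this:

#BEFORE - HYPOTHETICAL PANEL QUERY WITH NO CLUSTER FILTER
sum(irate(istio_requests_total{reporter="destination", destination_service=~"$service"}[5m]))

#AFTER - THE SAME QUERY SCOPED TO THE CLUSTER SELECTED IN THE NEW VARIABLE
sum(irate(istio_requests_total{reporter="destination", destination_service=~"$service", cluster=~"$cluster"}[5m]))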

From there you can save the dashboard and view the update. To test it, I intentionally entered an incorrect cluster name; notice how the visual shows N/A, representing no data:

Confirming it works by providing a bad cluster name and seeing N/A for data
Updating the Cluster Name variable and Viewing Data Specific for that Cluster

Keep in mind we only did this for one visual, so you will need to make the corresponding updates to the other visuals.

Summary

The goal of this post is to show the power of using Managed Grafana and Prometheus with AKS, especially when running OSS tools like Istio that expose rich metrics out of the box. Dashboards for those metrics often already exist and can accelerate your monitoring.