Traditional VM-based monitoring doesn't translate well to containers. Here's how to implement monitoring that scales with your containerised workloads.
## The Container Monitoring Challenge
VMs are persistent. Containers are ephemeral. Traditional monitoring approaches fail because:
- Containers come and go constantly
- IP addresses change with each deployment
- Log files disappear when containers restart
- A per-node agent sees hosts, not the short-lived workloads scheduled onto them
## Azure Monitor for Containers

Container insights is Microsoft's native monitoring solution for AKS, enabled by attaching a Log Analytics workspace via the `oms_agent` block:
```hcl
resource "azurerm_kubernetes_cluster" "this" {
  name                = "aks-production"
  location            = azurerm_resource_group.this.location
  resource_group_name = azurerm_resource_group.this.name
  dns_prefix          = "aks-production"

  default_node_pool {
    name       = "default"
    node_count = 3
    vm_size    = "Standard_D2s_v3"
  }

  identity {
    type = "SystemAssigned"
  }

  oms_agent {
    log_analytics_workspace_id = azurerm_log_analytics_workspace.this.id
  }
}
```
What you get:
- Container-level CPU/memory metrics
- Pod and node health
- Container logs in Log Analytics
- Pre-built workbooks
## Querying Container Logs
```kusto
// Container errors in the last hour
ContainerLog
| where TimeGenerated > ago(1h)
| where LogEntry contains "error" or LogEntry contains "exception"
| project TimeGenerated, ContainerID, LogEntry
| order by TimeGenerated desc

// Pod restarts (the pod name column in KubePodInventory is "Name")
KubePodInventory
| where TimeGenerated > ago(24h)
| where PodRestartCount > 0
| summarize RestartCount = max(PodRestartCount) by Name, Namespace
| order by RestartCount desc

// High-memory pods
Perf
| where ObjectName == "K8SContainer"
| where CounterName == "memoryWorkingSetBytes"
| summarize AvgMemory = avg(CounterValue) by InstanceName
| where AvgMemory > 500000000 // ~500 MB
| order by AvgMemory desc
```
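The restart summarisation is simple enough to reproduce in plain Python, which is handy for unit-testing dashboard or alert logic against exported rows. The dict keys mirror the query's columns and are an assumption about your export format:

```python
def max_restarts(rows):
    """rows: iterable of dicts with Name, Namespace and PodRestartCount keys
    (an assumption about how you export KubePodInventory rows)."""
    counts = {}
    for row in rows:
        key = (row["Name"], row["Namespace"])
        # keep the highest restart count seen for each pod
        counts[key] = max(counts.get(key, 0), row["PodRestartCount"])
    # keep only pods that restarted, highest count first
    return dict(sorted(
        ((k, v) for k, v in counts.items() if v > 0),
        key=lambda kv: -kv[1],
    ))
```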
## Prometheus Integration

For cloud-native metrics, use Azure Monitor managed Prometheus:
```hcl
resource "azurerm_monitor_workspace" "this" {
  name                = "prometheus-workspace"
  resource_group_name = azurerm_resource_group.this.name
  location            = azurerm_resource_group.this.location
}

resource "azurerm_monitor_data_collection_rule" "prometheus" {
  name                = "dcr-prometheus"
  resource_group_name = azurerm_resource_group.this.name
  location            = azurerm_resource_group.this.location
  kind                = "Linux" # required for the Prometheus forwarder

  data_sources {
    prometheus_forwarder {
      name    = "prometheus-source"
      streams = ["Microsoft-PrometheusMetrics"]
    }
  }

  destinations {
    monitor_account {
      monitor_account_id = azurerm_monitor_workspace.this.id
      name               = "prometheus-destination"
    }
  }

  data_flow {
    streams      = ["Microsoft-PrometheusMetrics"]
    destinations = ["prometheus-destination"]
  }
}
```

Without a `prometheus_forwarder` data source the rule collects nothing, and the rule still needs to be linked to the cluster with an `azurerm_monitor_data_collection_rule_association`.
### Prometheus Queries
```promql
# Container CPU usage
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)

# Memory usage percentage
sum(container_memory_working_set_bytes{namespace="production"}) by (pod)
  / sum(container_spec_memory_limit_bytes{namespace="production"}) by (pod)
  * 100

# Request rate
sum(rate(http_requests_total{namespace="production"}[5m])) by (service)
```
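The memory-percentage query is just per-pod sums of two series, divided and scaled. A small Python sketch of the same arithmetic (the tuple sample shape is hypothetical) can be useful for testing thresholds offline:

```python
from collections import defaultdict

def memory_pct_by_pod(samples):
    """samples: iterable of (pod, working_set_bytes, limit_bytes) tuples."""
    usage = defaultdict(float)
    limit = defaultdict(float)
    for pod, working_set, spec_limit in samples:
        usage[pod] += working_set  # sum(...) by (pod) over the usage series
        limit[pod] += spec_limit   # sum(...) by (pod) over the limit series
    # divide and scale to a percentage, skipping pods with no limit set
    return {pod: usage[pod] / limit[pod] * 100 for pod in usage if limit[pod]}
```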
## Application-Level Monitoring

### OpenTelemetry in Containers
```python
# Python app with OpenTelemetry, exporting spans to a collector over OTLP/gRPC
from flask import Flask, jsonify
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

app = Flask(__name__)

# Configure the exporter (plaintext is fine for an in-cluster collector)
trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)

tracer = trace.get_tracer(__name__)

@app.route("/api/orders")
def get_orders():
    with tracer.start_as_current_span("get_orders"):
        return jsonify([])  # your code here
```
### Application Insights for Containers
```yaml
# Kubernetes deployment with App Insights
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: myapp:latest
          env:
            - name: APPLICATIONINSIGHTS_CONNECTION_STRING
              valueFrom:
                secretKeyRef:
                  name: appinsights
                  key: connection-string
```
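The injected connection string is a semicolon-separated list of `key=value` pairs (e.g. `InstrumentationKey=...;IngestionEndpoint=...`). If you ever need to inspect one in app code, a small parser is enough (the helper name is ours):

```python
def parse_connection_string(conn_str):
    """Split 'Key=Value;Key=Value' pairs into a dict."""
    return dict(
        part.split("=", 1)   # split on the first '=' only; values may contain '='
        for part in conn_str.split(";")
        if part              # tolerate a trailing semicolon
    )
```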
## Health Probes and Liveness

Configure proper health checks so Kubernetes restarts unhealthy containers and routes traffic only to ready ones:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: myapp:latest
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
```
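The endpoints themselves are trivial to serve. A minimal sketch using only the Python standard library, assuming the `/health/live` and `/health/ready` paths and port 8080 from the manifest above (the `READY` flag stands in for real dependency checks):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

READY = {"ok": True}  # flip to False while dependencies are still warming up

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health/live":
            self._respond(200, b"alive")          # process is up
        elif self.path == "/health/ready":
            if READY["ok"]:
                self._respond(200, b"ready")      # safe to receive traffic
            else:
                self._respond(503, b"not ready")  # kept out of the Service
        else:
            self._respond(404, b"not found")

    def _respond(self, code, body):
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep probe traffic out of the container logs

# In the container entrypoint:
# HTTPServer(("", 8080), HealthHandler).serve_forever()
```

Keep the liveness check cheap and dependency-free; put database or downstream checks behind the readiness endpoint only, so a flaky dependency drains traffic instead of triggering restart loops.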
## Alerting on Container Issues
```hcl
resource "azurerm_monitor_metric_alert" "container_restarts" {
  name                = "alert-container-restarts"
  resource_group_name = azurerm_resource_group.this.name
  scopes              = [azurerm_kubernetes_cluster.this.id]
  description         = "Alert when containers restart frequently"

  criteria {
    metric_namespace = "insights.container/pods"
    metric_name      = "restartingContainerCount"
    aggregation      = "Average"
    operator         = "GreaterThan"
    threshold        = 0
  }

  action {
    action_group_id = azurerm_monitor_action_group.ops.id
  }
}
```
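The condition is simple enough to unit-test before you rely on it: the alert fires when the average of the `restartingContainerCount` samples in the evaluation window exceeds the threshold. A hypothetical sketch:

```python
def alert_fires(samples, threshold=0):
    """Mirror of the metric alert's logic: fire when the window average
    of restartingContainerCount exceeds the threshold."""
    return bool(samples) and sum(samples) / len(samples) > threshold
```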
## Container Apps Monitoring

For Azure Container Apps:
```hcl
resource "azurerm_container_app" "api" {
  # ... other config ...

  template {
    container {
      name  = "api"
      image = "myapp:latest"

      # Built-in logging: JSON output for easier parsing
      env {
        name  = "ASPNETCORE_LOGGING__CONSOLE__FORMATTERTYPE"
        value = "json"
      }
    }
  }
}
```
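The one-JSON-object-per-line approach works in any runtime, not just ASP.NET Core. A minimal sketch for a Python container using the standard `logging` module (the field names are our choice):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line on stdout."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("api")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order created")  # one JSON line, easy to query from Log Analytics
```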
Query Container Apps logs:
```kusto
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "api"
| where Log_s contains "error"
| project TimeGenerated, Log_s
```
Need help implementing container monitoring? Get in touch - we help organisations build observable container platforms.