Azure
4 min read

Using AI to Optimise Cloud Infrastructure

AI · Infrastructure · Automation · Azure · DevOps

AI isn't just for chatbots. Here's how we're using AI tools to improve infrastructure management and cloud optimisation.

Log Analysis with LLMs

Instead of writing complex KQL queries, describe what you're looking for:

Azure OpenAI + Log Analytics

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-key>",
    api_version="2024-02-01",
)
logs = LogsQueryClient(DefaultAzureCredential())

# Use an LLM to generate KQL from natural language
def generate_kql(user_query):
    response = client.chat.completions.create(
        model="gpt-4",  # your Azure OpenAI deployment name
        messages=[
            {"role": "system", "content": """You are a KQL expert.
            Convert natural language queries to KQL for Azure Log Analytics.
            Return only the KQL query, no explanation."""},
            {"role": "user", "content": user_query}
        ]
    )
    return response.choices[0].message.content

# Example usage
kql = generate_kql("Show me all failed logins in the last 24 hours grouped by user")
# Returns something like:
# SigninLogs | where TimeGenerated > ago(24h) | where ResultType != 0 | summarize count() by UserPrincipalName

# Run the generated query against your workspace
result = logs.query_workspace(workspace_id, kql, timespan=timedelta(days=1))
```
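Don't trust generated queries blindly before execution. A minimal guardrail (my own sketch, not part of any Azure SDK) is to reject KQL management commands and queries against unexpected tables:

```python
ALLOWED_TABLES = {"SigninLogs", "AzureActivity", "AzureMetrics", "Heartbeat"}

def is_safe_kql(kql: str) -> bool:
    """Reject control commands and queries against unexpected tables."""
    stripped = kql.strip()
    # KQL management/control commands start with a dot, e.g. ".drop table"
    if stripped.startswith("."):
        return False
    # The first pipe segment names the source table
    first_segment = stripped.split("|")[0].strip()
    return first_segment in ALLOWED_TABLES

print(is_safe_kql("SigninLogs | where TimeGenerated > ago(24h)"))  # True
print(is_safe_kql(".drop table SigninLogs"))                       # False
```

Extend the allow-list to match the tables actually present in your workspace.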

Automated Anomaly Detection

```python
# Use the Azure Anomaly Detector service for time series anomaly detection
from azure.ai.anomalydetector import AnomalyDetectorClient
from azure.ai.anomalydetector.models import DetectRequest, TimeSeriesPoint
from azure.core.credentials import AzureKeyCredential

client = AnomalyDetectorClient(endpoint, AzureKeyCredential(key))

# Detect unusual cost patterns
cost_data = get_daily_costs()  # your cost data: [{"date": ..., "cost": ...}, ...]
request = DetectRequest(
    series=[TimeSeriesPoint(timestamp=d["date"], value=d["cost"]) for d in cost_data],
    granularity="daily",
    sensitivity=95
)

response = client.detect_last_point(request)
if response.is_anomaly:
    send_alert(f"Unusual spend detected: expected ~{response.expected_value}, "
               f"got {cost_data[-1]['cost']}")
```
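If you'd rather not call a managed service for something this simple, the same idea can be sketched locally with a z-score check (my own illustration, not the Anomaly Detector algorithm):

```python
from statistics import mean, stdev

def is_cost_anomaly(costs, threshold=3.0):
    """Flag the latest value if it sits more than `threshold`
    standard deviations from the mean of the preceding history."""
    history, latest = costs[:-1], costs[-1]
    if len(history) < 2:
        return False
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold

print(is_cost_anomaly([100, 102, 98, 101, 99, 300]))  # True: sudden spike
print(is_cost_anomaly([100, 102, 98, 101, 99, 103]))  # False
```

The managed service earns its keep once you need seasonality handling (weekend dips, month-end spikes), which this naive version ignores.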

Intelligent Cost Prediction

Predict next month's Azure costs:

```python
# Azure ML SDK v1 AutoML (AutoMLConfig lives in azureml-train-automl-client)
from azureml.core import Experiment, Workspace
from azureml.train.automl import AutoMLConfig

# Train a model on historical cost data
def train_cost_predictor():
    # Features: resource count, previous costs, day of week, etc.
    features = extract_cost_features()

    # Use Azure AutoML for best model selection
    automl_config = AutoMLConfig(
        task="forecasting",
        primary_metric="normalized_root_mean_squared_error",
        training_data=features,
        label_column_name="cost",
        time_column_name="date",
        forecast_horizon=30
    )

    experiment = Experiment(Workspace.from_config(), "cost-forecasting")
    run = experiment.submit(automl_config)
    return run.get_output()

# Predict and alert if significantly higher
best_run, model = train_cost_predictor()
predicted = model.predict(current_features)
if predicted > current_budget * 1.2:
    alert(f"Predicted cost £{predicted} exceeds budget by over 20%")
```
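Before reaching for AutoML, a plain linear trend over daily costs gives a surprisingly useful baseline. A hand-rolled sketch, assuming evenly spaced daily data:

```python
def forecast_next_month(daily_costs, days_ahead=30):
    """Project total spend for the next `days_ahead` days by fitting
    a least-squares line through historical daily costs."""
    n = len(daily_costs)
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(daily_costs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_costs))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var if var else 0.0
    intercept = mean_y - slope * mean_x
    # Sum the fitted line over the forecast horizon
    return sum(intercept + slope * (n + i) for i in range(days_ahead))

# Flat £100/day history projects to ~£3000 for the next 30 days
print(round(forecast_next_month([100.0] * 60)))  # 3000
```

If the AutoML forecast and this baseline disagree wildly, that itself is worth investigating before trusting either number.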

Automated Remediation

Use AI to suggest and implement fixes:

```python
import json

from azure.mgmt.advisor import AdvisorManagementClient

# Analyze Azure Advisor recommendations with an LLM
# (assumes an AzureOpenAI `client`, plus `credential` and `subscription_id`)
def analyze_recommendations():
    advisor = AdvisorManagementClient(credential, subscription_id)
    recommendations = list(advisor.recommendations.list())

    # Use the LLM to prioritize and explain
    analysis = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": """You are a cloud architect.
            Analyze these Azure Advisor recommendations and:
            1. Rank by impact and ease of implementation
            2. Identify any that conflict with each other
            3. Suggest an implementation order"""},
            {"role": "user", "content": json.dumps([r.as_dict() for r in recommendations])}
        ]
    )

    return analysis.choices[0].message.content
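Where you need the ranking to be reproducible rather than model-generated, a simple sort over Advisor's own `impact` field works as a deterministic fallback (the field names mirror `as_dict()` output; the ordering rule is my own):

```python
IMPACT_SCORE = {"High": 3, "Medium": 2, "Low": 1}

def rank_recommendations(recs):
    """Sort recommendation dicts by impact (descending), then category.
    Each dict is assumed to carry at least `impact` ('High'/'Medium'/'Low')
    and `category` keys."""
    return sorted(recs, key=lambda r: (-IMPACT_SCORE.get(r.get("impact"), 0),
                                       r.get("category", "")))

recs = [
    {"impact": "Low", "category": "Cost"},
    {"impact": "High", "category": "Security"},
    {"impact": "Medium", "category": "Cost"},
]
print([r["impact"] for r in rank_recommendations(recs)])  # ['High', 'Medium', 'Low']
```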

Intelligent Right-Sizing

Beyond simple thresholds, use ML for better VM recommendations:

```python
import json

# Collect comprehensive metrics (query_metrics is your own Azure Monitor helper)
metrics = {
    "cpu_avg": query_metrics("Percentage CPU"),
    "cpu_p95": query_metrics("Percentage CPU", aggregation="P95"),
    "memory_avg": query_metrics("Available Memory Bytes"),
    "network_in": query_metrics("Network In Total"),
    "disk_ops": query_metrics("Disk Operations/Sec")
}

# Use an LLM to analyze patterns (assumes an AzureOpenAI `client` as before)
analysis = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": """You are a cloud optimization expert.
        Analyze these VM metrics and recommend the optimal Azure VM size.
        Consider:
        - Workload patterns (steady vs bursty)
        - Memory vs compute bound
        - Network requirements
        - Cost optimization"""},
        {"role": "user", "content": f"Current VM: {vm_size}\nMetrics: {json.dumps(metrics)}"}
    ]
)
```
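A deterministic pre-filter helps here too: only escalate to the LLM when the numbers clearly show headroom or pressure. A rough rule of thumb (my own, not an Azure recommendation):

```python
def sizing_hint(cpu_avg: float, cpu_p95: float) -> str:
    """Crude right-sizing hint from CPU utilisation percentages."""
    if cpu_p95 < 20:
        return "downsize"       # even peaks leave 80% idle
    if cpu_avg > 70 or cpu_p95 > 90:
        return "upsize"
    if cpu_p95 - cpu_avg > 40:
        return "bursty: consider B-series"
    return "keep"

print(sizing_hint(cpu_avg=5, cpu_p95=12))   # downsize
print(sizing_hint(cpu_avg=80, cpu_p95=95))  # upsize
print(sizing_hint(cpu_avg=15, cpu_p95=70))  # bursty: consider B-series
```

VMs that land on "keep" never need an LLM call at all, which keeps the token bill in check.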

Infrastructure as Code Generation

Generate Terraform from descriptions:

```python
# Assumes an AzureOpenAI `client` as in the earlier examples
def generate_terraform(description):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": """You are a Terraform expert for Azure.
            Generate production-ready Terraform code following best practices:
            - Use variables for configurable values
            - Include resource naming conventions
            - Add appropriate tags
            - Consider security defaults"""},
            {"role": "user", "content": description}
        ]
    )
    return response.choices[0].message.content

# Example
terraform = generate_terraform("""
Create an Azure web application:
- Linux App Service Plan (Standard S1)
- Python web app with Application Insights
- Azure SQL Database (Standard S0)
- Storage account for static files
All resources in UK South, tagged with environment=production
""")
```
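As with KQL, generated HCL deserves a sanity check before it goes near a pipeline. A real workflow would run `terraform validate`; one cheap, deterministic extra check (my own crude text scan, not an HCL parser) is that every resource block mentions the required tag:

```python
import re

def resources_missing_tags(hcl: str, required_tag: str = "environment") -> list:
    """Return names of terraform resource blocks whose body never
    mentions the required tag. Crude text scan, not an HCL parser."""
    missing = []
    # Match resource headers; the text up to the next header is the body
    for match in re.finditer(r'resource\s+"[^"]+"\s+"([^"]+)"\s*\{', hcl):
        start = match.end()
        next_res = hcl.find('\nresource ', start)
        body = hcl[start:next_res if next_res != -1 else len(hcl)]
        if required_tag not in body:
            missing.append(match.group(1))
    return missing

hcl = '''
resource "azurerm_app_service_plan" "main" {
  tags = { environment = "production" }
}
resource "azurerm_storage_account" "static" {
  account_tier = "Standard"
}
'''
print(resources_missing_tags(hcl))  # ['static']
```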

ChatOps for Infrastructure

Integrate AI into your chat platform:

```python
import json

from slack_bolt import App

app = App(token=slack_bot_token, signing_secret=slack_signing_secret)

# Slack bot for infrastructure queries
@app.message("what's my azure spend")
def handle_spend_query(message, say):
    # Get costs
    costs = get_current_costs()

    # Generate a natural language summary
    # (assumes an AzureOpenAI `client` as in the earlier examples)
    summary = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Summarize Azure costs in a friendly, concise way"},
            {"role": "user", "content": json.dumps(costs)}
        ]
    )

    say(summary.choices[0].message.content)
```

Practical Applications

| Use Case | Benefit |
| --- | --- |
| Log analysis | Faster troubleshooting |
| Cost prediction | Budget planning |
| Right-sizing | Reduced waste |
| IaC generation | Faster deployments |
| Anomaly detection | Proactive alerts |

Getting Started

  1. Azure OpenAI for LLM capabilities
  2. Azure Machine Learning for custom models
  3. Logic Apps for orchestration
  4. Start small - one use case at a time

Don't try to automate everything. Pick high-value, repetitive tasks first.


Need help implementing AI-driven infrastructure optimization? Get in touch - we help organisations leverage AI for smarter cloud management.
