You've enabled Data Exfiltration Protection (DEP) on your Synapse workspace for security. Now your Spark pools can't reach on-premises data sources over VPN. Here's what to do.
The Problem
With DEP enabled:
- All outbound traffic from Spark pools is blocked
- Only approved managed private endpoints are allowed
- VPN traffic goes through your hub VNET
- There's no way to create a "private endpoint" to on-premises
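For reference, a minimal sketch of a workspace with DEP switched on. The workspace name, the `azurerm_storage_data_lake_gen2_filesystem.this` resource and the `var.sql_admin_password` variable are assumptions for illustration and are not defined in this post.

```hcl
# Sketch only: a Synapse workspace with a managed VNET and DEP enabled.
resource "azurerm_synapse_workspace" "this" {
  name                                 = "syn-dep-example" # hypothetical name
  resource_group_name                  = azurerm_resource_group.this.name
  location                             = azurerm_resource_group.this.location
  storage_data_lake_gen2_filesystem_id = azurerm_storage_data_lake_gen2_filesystem.this.id # assumed to exist
  sql_administrator_login              = "sqladminuser"
  sql_administrator_login_password     = var.sql_admin_password # hypothetical variable

  # DEP only applies when the workspace uses a managed VNET
  managed_virtual_network_enabled      = true
  data_exfiltration_protection_enabled = true

  identity {
    type = "SystemAssigned"
  }
}
```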
Understanding the Traffic Flow
Without DEP:
Spark Pool → Managed VNET → Peering → Hub VNET → VPN → On-Prem
(Works)
With DEP:
Spark Pool → Managed VNET → BLOCKED
(All egress denied except managed private endpoints)
Option 1: NAT VM Workaround
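Create a VM in your VNET that NATs traffic on to on-premises. A managed private endpoint cannot target a VM directly, so the VM sits behind a Standard Load Balancer that is exposed through a Private Link Service:
Spark → Managed PE → Private Link Service → Load Balancer → NAT VM → VPN → On-Prem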
Deploy NAT VM
resource "azurerm_network_interface" "nat" {
  name                = "nic-nat-vm"
  location            = azurerm_resource_group.this.location
  resource_group_name = azurerm_resource_group.this.name

  ip_configuration {
    name                          = "internal"
    subnet_id                     = azurerm_subnet.nat.id
    private_ip_address_allocation = "Static"
    private_ip_address            = "10.0.5.10"
  }

  # Enable IP forwarding for NAT
  # (azurerm 4.x names this argument ip_forwarding_enabled)
  enable_ip_forwarding = true
}
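resource "azurerm_linux_virtual_machine" "nat" {
  name                  = "vm-nat"
  resource_group_name   = azurerm_resource_group.this.name
  location              = azurerm_resource_group.this.location
  size                  = "Standard_B2s"
  admin_username        = "adminuser"
  network_interface_ids = [azurerm_network_interface.nat.id]

  # The resource needs either an SSH key or a password.
  # Assumption: the public key lives at this path; adjust to your environment.
  admin_ssh_key {
    username   = "adminuser"
    public_key = file("~/.ssh/id_rsa.pub")
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Standard_LRS"
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts"
    version   = "latest"
  }

  # Configure NAT on first boot (cloud-init). These rules are not persisted across reboots.
  # Assumption: the on-prem target is a SQL Server at 192.168.10.20:1433 (hypothetical).
  custom_data = base64encode(<<-EOF
    #!/bin/bash
    sysctl -w net.ipv4.ip_forward=1
    # Redirect connections arriving from the load balancer to the on-prem host
    iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 1433 -j DNAT --to-destination 192.168.10.20:1433
    # Source-NAT forwarded traffic so replies come back through this VM
    iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
    iptables -A FORWARD -i eth0 -j ACCEPT
    EOF
  )
}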
Create Managed Private Endpoint to NAT VM
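In Synapse, create a managed private endpoint pointing to a Private Link Service. The Private Link Service fronts a Standard Load Balancer with the NAT VM in its backend pool; the `azurerm_lb.nat` referenced here is assumed to be defined elsewhere in your configuration. The endpoint connection must be approved (or auto-approved) on the Private Link Service before traffic flows.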
# Private Link Service for NAT VM
resource "azurerm_private_link_service" "nat" {
  name                = "pls-nat-vm"
  resource_group_name = azurerm_resource_group.this.name
  location            = azurerm_resource_group.this.location

  load_balancer_frontend_ip_configuration_ids = [
    azurerm_lb.nat.frontend_ip_configuration[0].id
  ]

  nat_ip_configuration {
    name      = "primary"
    subnet_id = azurerm_subnet.nat.id
    primary   = true
  }
}

# Managed private endpoint in Synapse
resource "azurerm_synapse_managed_private_endpoint" "nat" {
  name                 = "pe-nat"
  synapse_workspace_id = azurerm_synapse_workspace.this.id
  target_resource_id   = azurerm_private_link_service.nat.id
  subresource_name     = "" # Private Link Service doesn't use subresource
}
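The Private Link Service above references `azurerm_lb.nat`, which is not shown. One possible shape for it is sketched here, assuming the on-prem service listens on TCP 1433 (a hypothetical SQL Server) and reusing the `internal` IP configuration name from the NAT VM's network interface. The subnet hosting the Private Link Service typically also needs its private link service network policies disabled.

```hcl
# Internal Standard Load Balancer in front of the NAT VM
# (Private Link Service requires the Standard SKU).
resource "azurerm_lb" "nat" {
  name                = "lb-nat"
  location            = azurerm_resource_group.this.location
  resource_group_name = azurerm_resource_group.this.name
  sku                 = "Standard"

  frontend_ip_configuration {
    name                          = "frontend"
    subnet_id                     = azurerm_subnet.nat.id
    private_ip_address_allocation = "Dynamic"
  }
}

resource "azurerm_lb_backend_address_pool" "nat" {
  name            = "nat-pool"
  loadbalancer_id = azurerm_lb.nat.id
}

# Put the NAT VM's NIC into the backend pool
resource "azurerm_network_interface_backend_address_pool_association" "nat" {
  network_interface_id    = azurerm_network_interface.nat.id
  ip_configuration_name   = "internal"
  backend_address_pool_id = azurerm_lb_backend_address_pool.nat.id
}

# Health probe on the forwarded port (assumed: 1433)
resource "azurerm_lb_probe" "nat" {
  name            = "tcp-1433"
  loadbalancer_id = azurerm_lb.nat.id
  protocol        = "Tcp"
  port            = 1433
}

resource "azurerm_lb_rule" "nat" {
  name                           = "forward-1433"
  loadbalancer_id                = azurerm_lb.nat.id
  protocol                       = "Tcp"
  frontend_port                  = 1433
  backend_port                   = 1433
  frontend_ip_configuration_name = "frontend"
  backend_address_pool_ids       = [azurerm_lb_backend_address_pool.nat.id]
  probe_id                       = azurerm_lb_probe.nat.id
}
```

Each additional on-prem port or host needs its own load balancer rule and a matching forwarding rule on the VM, which is a large part of why this option rates as high complexity.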
Option 2: Data Movement via Linked Services
Instead of direct access, move data through Azure:
On-Prem → Self-Hosted IR → Data Factory → ADLS → Synapse Spark
- Deploy Self-Hosted Integration Runtime on-premises
- Use Data Factory to copy data to Azure Data Lake
- Spark reads from ADLS (via managed private endpoint)
# Data Factory with managed VNET
resource "azurerm_data_factory" "this" {
  name                            = "adf-data-ingestion"
  resource_group_name             = azurerm_resource_group.this.name
  location                        = azurerm_resource_group.this.location
  managed_virtual_network_enabled = true
}

# Self-hosted IR for on-prem connectivity
resource "azurerm_data_factory_integration_runtime_self_hosted" "onprem" {
  name            = "ir-onprem"
  data_factory_id = azurerm_data_factory.this.id
}
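Two pieces round out this flow: the key used to register the on-prem runtime, and the managed private endpoint that lets Spark read the landed data. The following is a sketch under assumptions; `azurerm_storage_account.lake` is a hypothetical ADLS Gen2 account that this post does not define.

```hcl
# Registration key to enter when installing the self-hosted IR on the on-prem machine
output "ir_onprem_auth_key" {
  value     = azurerm_data_factory_integration_runtime_self_hosted.onprem.primary_authorization_key
  sensitive = true
}

# Managed private endpoint so Spark can reach ADLS Gen2 with DEP enabled
resource "azurerm_synapse_managed_private_endpoint" "adls" {
  name                 = "pe-adls"
  synapse_workspace_id = azurerm_synapse_workspace.this.id
  target_resource_id   = azurerm_storage_account.lake.id # hypothetical storage account
  subresource_name     = "dfs" # ADLS Gen2 endpoint
}
```

Terraform needs network access to the workspace endpoint to create managed private endpoints, so a workspace firewall rule for the machine running Terraform may have to exist first.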
Option 3: ExpressRoute with Microsoft Peering
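ExpressRoute Microsoft peering gives on-premises systems a private circuit to the public endpoints of Azure PaaS services such as ADLS. It does not let Spark call into on-premises; instead, on-premises systems push data to Azure and Spark reads it from there:
On-Prem → ExpressRoute → Microsoft Peering → Azure PaaS (e.g. ADLS) → Synapse Spark
Private endpoints in your own VNET are reached over ExpressRoute private peering rather than Microsoft peering.
This is complex and expensive, but the traffic stays off the public internet.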
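A minimal sketch of enabling Microsoft peering on an existing circuit. The circuit resource `azurerm_express_route_circuit.this` is assumed to exist, and every ASN, VLAN and prefix below is a placeholder (documentation address ranges) that you would replace with values agreed with your connectivity provider. A route filter is normally also needed before Microsoft advertises service prefixes over the peering; it is left out here.

```hcl
resource "azurerm_express_route_circuit_peering" "microsoft" {
  peering_type               = "MicrosoftPeering"
  express_route_circuit_name = azurerm_express_route_circuit.this.name # assumed to exist
  resource_group_name        = azurerm_resource_group.this.name

  # Placeholder values only
  peer_asn                      = 65001
  vlan_id                       = 300
  primary_peer_address_prefix   = "198.51.100.0/30"
  secondary_peer_address_prefix = "198.51.100.4/30"

  microsoft_peering_config {
    # Public prefixes your organisation owns and will advertise (placeholder)
    advertised_public_prefixes = ["203.0.113.0/24"]
  }
}
```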
Option 4: Accept the Limitation
Sometimes the right answer is to not use DEP:
- If on-prem connectivity is essential
- If the data isn't highly sensitive
- If other controls provide adequate protection
resource "azurerm_synapse_workspace" "this" {
  # ...
  managed_virtual_network_enabled = true

  # DEP is fixed when the workspace is created; changing it later means recreating the workspace
  data_exfiltration_protection_enabled = false # Allow outbound
}
Security Compensating Controls
If you disable DEP, implement other controls:
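- NSGs on subnets you control (for example the NAT subnet) - Restrict destinations; the managed VNET itself is Microsoft-operated and does not accept your NSGs
- Azure Firewall in your hub VNET - Inspect and log traffic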
- Private endpoints - For Azure resources
- Activity monitoring - Log all data access
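As one illustration of restricting destinations, here is a sketch of an NSG on a subnet you control, such as the NAT VM's subnet from Option 1. The on-prem range `192.168.10.0/24` and port 1433 are assumptions for illustration.

```hcl
resource "azurerm_network_security_group" "nat" {
  name                = "nsg-nat"
  location            = azurerm_resource_group.this.location
  resource_group_name = azurerm_resource_group.this.name

  # Allow only the on-prem database range (hypothetical CIDR and port)
  security_rule {
    name                       = "allow-onprem-sql"
    priority                   = 100
    direction                  = "Outbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "1433"
    source_address_prefix      = "*"
    destination_address_prefix = "192.168.10.0/24"
  }

  # Deny every other outbound destination
  security_rule {
    name                       = "deny-other-outbound"
    priority                   = 4096
    direction                  = "Outbound"
    access                     = "Deny"
    protocol                   = "*"
    source_port_range          = "*"
    destination_port_range     = "*"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
}

resource "azurerm_subnet_network_security_group_association" "nat" {
  subnet_id                 = azurerm_subnet.nat.id
  network_security_group_id = azurerm_network_security_group.nat.id
}
```

The blanket deny also blocks package updates from the VM, so you may want extra allow rules for whatever the VM legitimately needs.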
Comparison
| Approach | Complexity | Cost | Security |
|---|---|---|---|
| NAT VM | High | Medium | Good |
| Data Movement | Medium | Low-Medium | Good |
| ExpressRoute MS Peering | Very High | High | Best |
| Disable DEP | Low | Low | Lower |
Recommendation
For most organisations:
- Use Data Movement approach for scheduled data loads
- Keep DEP enabled for sensitive workloads
- Consider separate Synapse workspace for on-prem connected workloads (without DEP)