Azure Synapse with Data Exfiltration Protection (DEP) is the most secure configuration - but it blocks all outbound traffic from Spark pools. Here's how to make it work.
## What Data Exfiltration Protection Does
When enabled:
- All outbound traffic from the managed VNET is blocked
- Spark pools can only reach resources via managed private endpoints
- No internet access for `pip install` or external APIs
- Even Azure services need private endpoints
## The Catch-22
You want security, but your Spark jobs need to:
- Read from Azure Storage
- Write to SQL databases
- Log to Log Analytics
- Maybe call external APIs
Without proper configuration, everything fails with connection timeouts.
## Enabling DEP

```hcl
resource "azurerm_synapse_workspace" "this" {
  name                                 = "syn-production"
  resource_group_name                  = azurerm_resource_group.this.name
  location                             = azurerm_resource_group.this.location
  storage_data_lake_gen2_filesystem_id = azurerm_storage_data_lake_gen2_filesystem.this.id

  managed_virtual_network_enabled      = true
  data_exfiltration_protection_enabled = true # This is the key setting

  identity {
    type = "SystemAssigned"
  }
}
```
**Warning:** DEP can only be enabled at workspace creation; you cannot enable it later. In Terraform, changing `data_exfiltration_protection_enabled` on an existing workspace forces it to be destroyed and recreated.
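DEP only applies inside a managed virtual network, so the two flags have to be enabled together. A tiny validation helper (a sketch for a CI pre-check, not part of any provider) can catch the invalid combination before `terraform apply` fails:

```python
def dep_config_valid(managed_vnet_enabled: bool, dep_enabled: bool) -> bool:
    """DEP requires the managed VNET; every other combination is allowed."""
    return managed_vnet_enabled or not dep_enabled
```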
## Creating Managed Private Endpoints
For each resource your Spark pools need to access:
### Storage Account

```hcl
resource "azurerm_synapse_managed_private_endpoint" "storage" {
  name                 = "pe-storage-blob"
  synapse_workspace_id = azurerm_synapse_workspace.this.id
  target_resource_id   = azurerm_storage_account.data.id
  subresource_name     = "blob"
}

resource "azurerm_synapse_managed_private_endpoint" "storage_dfs" {
  name                 = "pe-storage-dfs"
  synapse_workspace_id = azurerm_synapse_workspace.this.id
  target_resource_id   = azurerm_storage_account.data.id
  subresource_name     = "dfs"
}
```
### Key Vault

```hcl
resource "azurerm_synapse_managed_private_endpoint" "keyvault" {
  name                 = "pe-keyvault"
  synapse_workspace_id = azurerm_synapse_workspace.this.id
  target_resource_id   = azurerm_key_vault.this.id
  subresource_name     = "vault"
}
```
### SQL Database

```hcl
resource "azurerm_synapse_managed_private_endpoint" "sql" {
  name                 = "pe-sql"
  synapse_workspace_id = azurerm_synapse_workspace.this.id
  target_resource_id   = azurerm_mssql_server.this.id
  subresource_name     = "sqlServer"
}
```
### Event Hub

```hcl
resource "azurerm_synapse_managed_private_endpoint" "eventhub" {
  name                 = "pe-eventhub"
  synapse_workspace_id = azurerm_synapse_workspace.this.id
  target_resource_id   = azurerm_eventhub_namespace.this.id
  subresource_name     = "namespace" # Note: not "eventhub"
}
```
## Approving Private Endpoints

Managed private endpoints require approval on the target resource:

```hcl
# Auto-approval doesn't work for managed private endpoints;
# you need to approve them manually or via automation.
resource "null_resource" "approve_storage_pe" {
  depends_on = [azurerm_synapse_managed_private_endpoint.storage]

  provisioner "local-exec" {
    command = <<-EOT
      az network private-endpoint-connection approve \
        --resource-group ${azurerm_resource_group.this.name} \
        --resource-name ${azurerm_storage_account.data.name} \
        --name ${azurerm_synapse_managed_private_endpoint.storage.name} \
        --type Microsoft.Storage/storageAccounts \
        --description "Approved for Synapse"
    EOT
  }
}
```
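One wrinkle with automation: the connection name on the target resource is assigned by Azure and may not match the managed private endpoint name exactly. It is often more robust to list the connections first with `az network private-endpoint-connection list --id <target-resource-id>` and approve whatever is pending. A sketch that filters that command's JSON output (field names follow the standard private endpoint connection schema; verify against your CLI version):

```python
import json

def pending_connections(az_list_output: str) -> list:
    """Return the IDs of private endpoint connections still awaiting approval.

    `az_list_output` is the JSON printed by
    `az network private-endpoint-connection list --id <target-resource-id>`.
    """
    connections = json.loads(az_list_output)
    return [
        c["id"]
        for c in connections
        if c["properties"]["privateLinkServiceConnectionState"]["status"] == "Pending"
    ]

# Example payload shaped like the CLI output, trimmed to the relevant fields
sample = json.dumps([
    {"id": "/subscriptions/xxx/.../pe-storage-blob",
     "properties": {"privateLinkServiceConnectionState": {"status": "Pending"}}},
    {"id": "/subscriptions/xxx/.../pe-keyvault",
     "properties": {"privateLinkServiceConnectionState": {"status": "Approved"}}},
])
print(pending_connections(sample))  # only the pending connection's ID
```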
## Handling External Dependencies

### Python Packages

Without internet access, `pip install` fails. Options:
- Pre-built Spark pool image with required packages
- Private PyPI mirror with managed private endpoint
- Upload packages to linked storage and install from there

```python
# Install from storage
spark.sparkContext.addPyFile("abfss://[email protected]/mypackage.whl")
```
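`addPyFile` is enough for simple pure-Python modules; for a wheel with dependencies you typically copy it to the driver's local disk first (e.g. with `mssparkutils.fs.cp` in a Synapse notebook) and install it with pip. A minimal helper for the pip step, where `--no-index` stops pip from trying to reach PyPI, which DEP would block anyway:

```python
import sys

def offline_pip_install_cmd(local_wheel: str) -> list:
    """Build a pip command that installs a locally staged wheel without
    contacting any package index (index access is blocked under DEP)."""
    return [sys.executable, "-m", "pip", "install", "--no-index", local_wheel]

# e.g. subprocess.check_call(offline_pip_install_cmd("/tmp/mypackage.whl"))
```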
### Workspace Packages

Upload packages to the workspace:

```hcl
resource "azurerm_synapse_workspace_package" "pandas" {
  name                 = "pandas-1.5.0-py3-none-any.whl"
  synapse_workspace_id = azurerm_synapse_workspace.this.id
  link                 = azurerm_storage_blob.pandas_wheel.url
}
```
## Logging to Log Analytics

For Spark pool diagnostics with DEP, you need:
- Log Analytics in the same region
- Managed private endpoint to Log Analytics

```hcl
resource "azurerm_synapse_managed_private_endpoint" "log_analytics" {
  name                 = "pe-log-analytics"
  synapse_workspace_id = azurerm_synapse_workspace.this.id
  target_resource_id   = azurerm_log_analytics_workspace.this.id
  subresource_name     = "api" # For data collection
}
```
Note: This requires Azure Monitor Private Link Scope (AMPLS) configuration.
## VPN and ExpressRoute Access
DEP blocks traffic to on-premises too. For hybrid scenarios:
- NAT VM approach: Route through a VM in your VNET
- Azure Firewall: Allow specific destinations
- Private endpoints on-prem: Complex but possible
Most organisations accept that Spark pools can't reach on-prem with DEP enabled.
## Debugging Connection Issues

When Spark jobs fail with timeout errors:

```python
# In a notebook, test connectivity
import socket

def test_connection(host, port):
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(5)
        result = sock.connect_ex((host, port))
        sock.close()
        return result == 0
    except OSError:
        return False

# Test your endpoints
endpoints = [
    ("mystorageaccount.blob.core.windows.net", 443),
    ("mystorageaccount.dfs.core.windows.net", 443),
    ("mykeyvault.vault.azure.net", 443),
]

for host, port in endpoints:
    status = "OK" if test_connection(host, port) else "BLOCKED"
    print(f"{host}:{port} - {status}")
```
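A successful TCP connect is necessary but not sufficient: with DEP you also want the hostname to resolve to a private IP (the managed private endpoint), not a public Azure address. The standard library's `ipaddress` module makes that check easy; the storage hostname below is a placeholder:

```python
import ipaddress
import socket

def resolves_privately(host):
    """True if `host` resolves to a private/loopback address - what you should
    see once a managed private endpoint is approved and DNS has propagated."""
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        return False
    return ipaddress.ip_address(ip).is_private

# e.g. resolves_privately("mystorageaccount.blob.core.windows.net")
# should become True once the managed private endpoint is approved
```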
## Comparison: DEP vs Standard Managed VNET
| Feature | DEP Enabled | Standard Managed VNET |
|---|---|---|
| Outbound internet | Blocked | Allowed |
| Azure services | Private endpoint required | Direct access |
| pip install | Blocked | Works |
| On-premises | Blocked | Works |
| Security | Highest | High |
| Complexity | Higher | Lower |
Need help securing your Synapse environment? Get in touch - we help organisations implement secure data platforms.