Data Loss Prevention (DLP): Protecting Sensitive Data in SaaS Apps

Data Loss Prevention is not a single product. It is a capability built from policies, detection logic, controls, and workflows. Its goal is to keep sensitive data from leaving the organization without authorization, whether through malicious exfiltration, accidental sharing, or misconfigured access controls.

In the SaaS era, data moves through dozens of cloud applications — Slack, Google Drive, Salesforce, GitHub, Notion, Zendesk. A traditional DLP solution designed for on-premises file servers provides little value here. Modern DLP must be cloud-aware, API-integrated, and capable of inspecting content in motion through SaaS applications.

DLP Categories

Network DLP — Monitors and blocks data leaving the network via web proxies, email gateways, and cloud access security brokers (CASB). Can inspect SSL/TLS traffic if certificates are deployed on managed devices.

Endpoint DLP — Agent running on employee devices that monitors file operations, clipboard activity, USB transfers, and application data access. Microsoft Purview, CrowdStrike DLP, and Symantec DLP all have endpoint agents.

Cloud/SaaS DLP — API-integrated inspection of content stored in or transiting through cloud applications. Connects to Google Workspace, Microsoft 365, Box, Dropbox, Salesforce, and GitHub through their admin APIs.

Data-at-Rest DLP — Scans cloud storage (S3, GCS, SharePoint) for sensitive data that should not be there — PII in public buckets, credentials in code repositories, health data in uncontrolled folders.

Data Classification

DLP policies operate on data classification. Before you can prevent data from leaving inappropriately, you must know what data you have and how sensitive it is.

Common classification tiers:

Tier	Definition	Example
Public	Intentionally public, no restriction	Marketing materials, product documentation
Internal	For employees only, low risk if leaked	Internal announcements, meeting notes
Confidential	Business-sensitive, should not leave org	Customer contracts, financial forecasts
Restricted	Regulatory or high-risk PII	SSNs, PHI, payment card data

Apply classification via:

Manual labeling — Users classify documents (Microsoft Purview sensitivity labels, Google Workspace labels)
Auto-classification — Content inspection applies labels based on detected data types
Structural classification — All data in certain systems is classified by definition (CRM = customer data = confidential)
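
As a rough illustration of auto-classification, the sketch below maps detected data types to the tiers in the table above and labels a document with the most restrictive tier found. The `Tier` enum, the `TYPE_TO_TIER` mapping, and the `classify` function are hypothetical names chosen for this example, and the mapping itself is an assumption that each organization would define in its own policy.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Classification tiers, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical mapping from detected data type to tier (an assumption for
# illustration; real mappings come from the organization's data policy).
TYPE_TO_TIER = {
    'ssn': Tier.RESTRICTED,
    'credit_card': Tier.RESTRICTED,
    'customer_contract': Tier.CONFIDENTIAL,
    'email': Tier.INTERNAL,
}


def classify(detected_types, default=Tier.INTERNAL):
    """Return the most restrictive tier among the detected data types.

    Unknown types, and documents with no detections, fall back to `default`.
    """
    tiers = [TYPE_TO_TIER.get(t, default) for t in detected_types]
    return max(tiers, default=default)
```

Taking the maximum tier means a single Restricted finding is enough to label the whole document Restricted, which errs on the side of caution.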

Content Inspection: Patterns and Classifiers

DLP effectiveness depends on accurate detection of sensitive data. The two primary detection approaches:

Regex Patterns

Regex patterns match structured data formats. They are fast but prone to false positives.

Common regex patterns:

import re

PATTERNS = {
    # US Social Security Number
    'ssn': r'\b(?!000|666|9\d{2})\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b',

    # Credit card numbers (Luhn-validated separately)
    'credit_card': r'\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13}|6(?:011|5[0-9]{2})[0-9]{12})\b',

    # US Driver's License (generic — varies by state)
    'drivers_license': r'\b[A-Z]{1,2}\d{6,8}\b',

    # AWS access key ID (AKIA = long-term key, ASIA = temporary credentials)
    'aws_key': r'(?<![A-Z0-9])(?:AKIA|ASIA)[A-Z0-9]{16}(?![A-Z0-9])',

    # Private key header
    'private_key': r'-----BEGIN (?:RSA |EC |DSA |OPENSSH |ENCRYPTED )?PRIVATE KEY-----',

    # Email address
    'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',

    # US phone number (digit lookarounds instead of \b, which fails before "+" or "(")
    'phone': r'(?<!\d)(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}(?!\d)',

    # International Bank Account Number
    'iban': r'\b[A-Z]{2}\d{2}[A-Z0-9]{4}\d{7}[A-Z0-9]{0,16}\b',
}

def scan_text(text: str) -> list[dict]:
    findings = []
    for data_type, pattern in PATTERNS.items():
        matches = re.finditer(pattern, text)
        for match in matches:
            findings.append({
                'type': data_type,
                'match': match.group(),
                'position': match.span()
            })
    return findings

Reducing false positives:

Apply context filters: "123-45-6789" matches the SSN pattern, but in the sentence "Version 123-45-6789" it is not an SSN.
Use proximity analysis: an SSN near words like "social security," "SSN," or "taxpayer" is higher confidence.
Apply Luhn validation for credit card numbers.
Require minimum match count for lower-confidence patterns before alerting.
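
Two of the techniques above, Luhn validation and proximity analysis, can be sketched in a few lines. This is a minimal illustration rather than a production detector: the function names, the keyword list, and the 40-character window are assumptions chosen for the example, and the keyword check is a plain substring match.

```python
import re

SSN_RE = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')
# Assumed keyword list; tune it to the vocabulary of your own documents.
SSN_KEYWORDS = ('social security', 'ssn', 'taxpayer')


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def ssn_confidence(text: str, window: int = 40) -> list[tuple[str, str]]:
    """Rate each SSN-shaped match 'high' if a keyword appears nearby, else 'low'."""
    results = []
    lowered = text.lower()
    for m in SSN_RE.finditer(text):
        # Look at `window` characters on either side of the match.
        start = max(0, m.start() - window)
        context = lowered[start:m.end() + window]
        level = 'high' if any(k in context for k in SSN_KEYWORDS) else 'low'
        results.append((m.group(), level))
    return results
```

A scanner could drop credit card matches that fail `luhn_valid` and only alert on SSN matches rated 'high', leaving 'low' matches for batch review.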

ML-Based Classifiers

Cloud DLP services use trained machine learning models that understand context, not just patterns.

Amazon Macie — Automatically discovers and classifies sensitive data in S3 using machine learning and pattern matching:

import boto3

macie = boto3.client('macie2', region_name='us-east-1')

# Create a classification job
response = macie.create_classification_job(
    name='production-data-scan',
    jobType='SCHEDULED',
    scheduleFrequency={'weeklySchedule': {'dayOfWeek': 'MONDAY'}},
    s3JobDefinition={
        'bucketDefinitions': [
            {
                'accountId': '123456789012',
                'buckets': ['production-data-bucket', 'backup-bucket']
            }
        ]
    },
    samplingPercentage=100
)

Google Cloud DLP (now part of Sensitive Data Protection) — Supports 150+ built-in information types including PII, credentials, and country-specific identifiers:

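from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()

# Info types to look for; findings rated below LIKELY are dropped.
inspect_config = dlp_v2.InspectConfig(
    info_types=[
        dlp_v2.InfoType(name='US_SOCIAL_SECURITY_NUMBER'),
        dlp_v2.InfoType(name='CREDIT_CARD_NUMBER'),
        dlp_v2.InfoType(name='EMAIL_ADDRESS'),
        dlp_v2.InfoType(name='PERSON_NAME'),
    ],
    min_likelihood=dlp_v2.Likelihood.LIKELY,
    include_quote=False  # Do not include the actual sensitive value in findings
)

# Inspect a text snippet. 'my-project' is a placeholder project ID.
response = dlp.inspect_content(
    request={
        'parent': 'projects/my-project/locations/global',
        'inspect_config': inspect_config,
        'item': {'value': 'Contact jane@example.com, SSN 123-45-6789'},
    }
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood.name)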

Microsoft Purview — Integrates DLP across Microsoft 365, Teams, Exchange, SharePoint, and OneDrive with pre-built sensitive information types for 100+ countries.

DLP in CI/CD: Preventing Secrets in Code

One of the highest-value DLP use cases for engineering teams is preventing secrets, credentials, and PII from being committed to source code repositories.

Gitleaks as a pre-commit hook:

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.4
    hooks:
      - id: gitleaks
        entry: gitleaks protect --staged --redact --exit-code 1

GitHub Advanced Security secret scanning automatically scans repositories and alerts on 200+ token types, including cloud provider credentials and API service keys.

For historical exposure, scan your entire git history:

gitleaks detect --source . --log-level debug --report-path gitleaks-report.json
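
A history scan of an older repository can produce a long report, so it helps to summarize it before triage. The sketch below assumes the report written by the command above is a JSON array of findings with `RuleID` and `File` fields, as in gitleaks v8 reports; verify this against the version you run. The `load_report` and `summarize` helpers are hypothetical names for this example.

```python
import json
from collections import Counter


def load_report(path: str) -> list[dict]:
    """Load a gitleaks JSON report (assumed to be a JSON array of findings)."""
    with open(path, encoding='utf-8') as fh:
        return json.load(fh)


def summarize(findings: list[dict]) -> dict:
    """Count findings per rule and per file so the noisiest sources surface first."""
    by_rule = Counter(f.get('RuleID', 'unknown') for f in findings)
    by_file = Counter(f.get('File', 'unknown') for f in findings)
    return {
        'total': len(findings),
        'by_rule': by_rule.most_common(),
        'by_file': by_file.most_common(),
    }
```

Calling `summarize(load_report('gitleaks-report.json'))` gives a quick view of which rules and files account for most of the exposure, which is a reasonable order in which to rotate credentials.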

Alert Strategies and Response

Raw DLP findings without a clear response workflow become noise that analysts ignore.

Tiered response model:

Severity	Trigger	Response
Critical	SSN/PHI/PAN sent externally	Auto-block + immediate investigation
High	PII shared to unmanaged device	Auto-quarantine + manager notification
Medium	Internal credentials in Slack	Alert to security + automated reminder to user
Low	Email address in public document	Log only, batch review weekly
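
The table above can be encoded as an ordered set of rules, evaluated from most to least severe. The sketch below is one possible encoding, not a reference implementation: the data type and destination labels, the action names, and the `route` function are all hypothetical, and treating every unmatched finding as Low is an assumption made to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    severity: str
    actions: tuple[str, ...]


# Hypothetical labels that a detection stage would attach to a finding.
RESTRICTED_TYPES = {'ssn', 'phi', 'pan'}
PII_TYPES = RESTRICTED_TYPES | {'email', 'phone', 'person_name'}


def route(data_type: str, destination: str) -> Response:
    """Map a finding to a severity and response actions; first matching rule wins."""
    if data_type in RESTRICTED_TYPES and destination == 'external':
        return Response('critical', ('block', 'open_investigation'))
    if data_type in PII_TYPES and destination == 'unmanaged_device':
        return Response('high', ('quarantine', 'notify_manager'))
    if data_type == 'credential' and destination == 'slack':
        return Response('medium', ('alert_security', 'remind_user'))
    # Everything else is logged for the weekly batch review (an assumption).
    return Response('low', ('log',))
```

Keeping the rules in severity order means a finding that matches several rows always receives the strictest response.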

User education, not just blocking:

Aggressive DLP that blocks frequently and without explanation creates shadow IT: employees work around it by using personal devices or external services. Effective DLP:

Explains WHY a transfer was blocked in user-friendly language
Provides an approved alternative (e.g., "Use the secure file sharing portal instead")
Allows a business justification override workflow for edge cases
Uses just-in-time education to explain the policy

False positive management:

DLP policies require continuous tuning. Track false positive rates per policy. Any policy generating more than 30% false positives should be re-tuned before analysts stop trusting it.
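
Tracking this does not need much tooling. As a minimal sketch, assuming analysts record each triaged alert as a (policy name, was it a false positive) pair, the per-policy rate and the 30% threshold check could look like the following. The function names and the input shape are assumptions for this example.

```python
from collections import defaultdict

FP_THRESHOLD = 0.30  # re-tune policies whose false positive rate exceeds this


def false_positive_rates(triaged):
    """Compute the false positive rate per policy.

    `triaged` is an iterable of (policy_name, is_false_positive) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # policy -> [false positives, total]
    for policy, is_fp in triaged:
        counts[policy][1] += 1
        if is_fp:
            counts[policy][0] += 1
    return {policy: fp / total for policy, (fp, total) in counts.items()}


def policies_to_retune(triaged, threshold=FP_THRESHOLD):
    """Return the names of policies whose false positive rate exceeds `threshold`."""
    rates = false_positive_rates(triaged)
    return sorted(policy for policy, rate in rates.items() if rate > threshold)
```

For example, a policy with 2 false positives out of 3 triaged alerts (about 67%) would be flagged, while one with 1 out of 4 (25%) would not.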

Modern DLP is not a firewall for data — it is a combination of visibility, classification, policy enforcement, and user education. The goal is not to block everything; it is to make sensitive data handling deliberate and auditable.