Data Loss Prevention (DLP): Protecting Sensitive Data in SaaS Apps

How to implement DLP in SaaS applications — categories, cloud DLP tools, regex patterns for PII detection, alert strategies, and common pitfalls.

March 9, 2026 · 6 min read · ShipSafer Team

Data Loss Prevention is not a single product; it is a capability built from policies, detection logic, controls, and workflows. Its goal is to keep sensitive data from leaving the organization without authorization, whether through malicious exfiltration, accidental sharing, or misconfigured access controls.

In the SaaS era, data moves through dozens of cloud applications — Slack, Google Drive, Salesforce, GitHub, Notion, Zendesk. A traditional DLP solution designed for on-premises file servers provides little value here. Modern DLP must be cloud-aware, API-integrated, and capable of inspecting content in motion through SaaS applications.

DLP Categories

Network DLP: Monitors and blocks data leaving the network via web proxies, email gateways, and cloud access security brokers (CASBs). It can inspect TLS-encrypted traffic only when the proxy performs TLS interception and its root CA certificate is trusted on managed devices.

Endpoint DLP: An agent running on employee devices that monitors file operations, clipboard activity, USB transfers, and application data access. Microsoft Purview, CrowdStrike DLP, and Symantec DLP all have endpoint agents.

Cloud/SaaS DLP: API-integrated inspection of content stored in or transiting through cloud applications. Connects to Google Workspace, Microsoft 365, Box, Dropbox, Salesforce, and GitHub via their admin APIs.

Data-at-Rest DLP: Scans cloud storage (S3, GCS, SharePoint) for sensitive data that should not be there, such as PII in public buckets, credentials in code repositories, and health data in uncontrolled folders.

Data Classification

DLP policies operate on data classification. Before you can prevent data from leaving inappropriately, you must know what data you have and how sensitive it is.

Common classification tiers:

| Tier | Definition | Example |
| --- | --- | --- |
| Public | Intentionally public, no restriction | Marketing materials, product documentation |
| Internal | For employees only, low risk if leaked | Internal announcements, meeting notes |
| Confidential | Business-sensitive, should not leave org | Customer contracts, financial forecasts |
| Restricted | Regulatory or high-risk PII | SSNs, PHI, payment card data |

Apply classification via:

  • Manual labeling — Users classify documents (Microsoft Purview sensitivity labels, Google Workspace labels)
  • Auto-classification — Content inspection applies labels based on detected data types
  • Structural classification — All data in certain systems is classified by definition (CRM = customer data = confidential)
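
The three labeling routes above can be combined into a single decision. The following is a minimal sketch, assuming the four tiers from the table; the `TYPE_TO_TIER` and `SYSTEM_FLOOR` mappings, the `classify` function, and the choice of Internal as the default for unlabeled content are all hypothetical and would need to reflect your own policy.

```python
from enum import IntEnum


class Tier(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical mapping from detected data types to a tier (auto-classification).
TYPE_TO_TIER = {
    "email": Tier.INTERNAL,
    "phone": Tier.INTERNAL,
    "iban": Tier.CONFIDENTIAL,
    "ssn": Tier.RESTRICTED,
    "credit_card": Tier.RESTRICTED,
}

# Hypothetical structural rule: everything in these systems gets at least this tier.
SYSTEM_FLOOR = {"crm": Tier.CONFIDENTIAL}

# Assumption: content with no signal at all is treated as Internal, not Public.
DEFAULT_TIER = Tier.INTERNAL


def classify(detected_types, source_system=None, manual_label=None):
    """Return the most restrictive tier implied by any available signal."""
    signals = [TYPE_TO_TIER[t] for t in detected_types if t in TYPE_TO_TIER]
    if source_system in SYSTEM_FLOOR:
        signals.append(SYSTEM_FLOOR[source_system])
    if manual_label is not None:
        signals.append(manual_label)
    return max(signals) if signals else DEFAULT_TIER
```

Taking the maximum means a manual label can raise a document's tier but cannot lower it below what content inspection or the source system implies, which is one conservative way to handle under-labeling.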

Content Inspection: Patterns and Classifiers

DLP effectiveness depends on accurate detection of sensitive data. The two primary detection approaches:

Regex Patterns

Regex patterns match structured data formats. They are fast but prone to false positives.

Common regex patterns:

import re

PATTERNS = {
    # US Social Security Number
    'ssn': r'\b(?!000|666|9\d{2})\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b',

    # Credit card numbers (Luhn-validated separately)
    'credit_card': r'\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13}|6(?:011|5[0-9]{2})[0-9]{12})\b',

    # US Driver's License (generic — varies by state)
    'drivers_license': r'\b[A-Z]{1,2}\d{6,8}\b',

    # AWS Access Key ID (AKIA = long-term keys, ASIA = temporary STS credentials)
    'aws_key': r'(?<![A-Z0-9])(?:AKIA|ASIA)[A-Z0-9]{16}(?![A-Z0-9])',

    # Private key header
    'private_key': r'-----BEGIN (?:RSA |EC )?PRIVATE KEY-----',

    # Email address
    'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',

    # US Phone number
    'phone': r'(?<!\w)(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}(?!\w)',

    # International Bank Account Number
    'iban': r'\b[A-Z]{2}\d{2}[A-Z0-9]{4}\d{7}[A-Z0-9]{0,16}\b',
}

def scan_text(text: str) -> list[dict]:
    findings = []
    for data_type, pattern in PATTERNS.items():
        matches = re.finditer(pattern, text)
        for match in matches:
            findings.append({
                'type': data_type,
                'match': match.group(),
                'position': match.span()
            })
    return findings
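
Note that `scan_text` stores the raw matched value in each finding, which means DLP logs can themselves become a store of sensitive data. A minimal sketch of masking findings before they are logged or forwarded follows; `mask_value` and `redact_findings` are hypothetical helper names, and keeping the last four characters is an assumption rather than a requirement.

```python
def mask_value(value: str, keep_last: int = 4) -> str:
    """Replace all but the last `keep_last` characters with asterisks."""
    if keep_last <= 0 or len(value) <= keep_last:
        # Too short to reveal anything safely: mask the whole value.
        return "*" * len(value)
    return "*" * (len(value) - keep_last) + value[-keep_last:]


def redact_findings(findings: list[dict]) -> list[dict]:
    """Return copies of the findings with the raw match masked."""
    return [{**f, "match": mask_value(f["match"])} for f in findings]
```

For example, `redact_findings(scan_text(document_text))` would keep the type and position of each finding while hiding most of the matched value.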

Reducing false positives:

  • Apply context filters: "123-45-6789" matches the SSN pattern, but in the sentence "Version 123-45-6789" it is almost certainly not an SSN.
  • Use proximity analysis: an SSN near words like "social security," "SSN," or "taxpayer" is higher confidence.
  • Apply Luhn validation for credit card numbers.
  • Require minimum match count for lower-confidence patterns before alerting.
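
Two of these techniques, Luhn validation and proximity analysis, can be sketched in a few lines. This is an illustrative example, not a production filter: the function names, the keyword list, and the 40-character window are assumptions chosen for the sketch.

```python
import re


def luhn_valid(number: str) -> bool:
    """Check a candidate card number with the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# Keywords that raise confidence that a nearby match is really an SSN.
SSN_KEYWORDS = re.compile(r"social security|ssn|taxpayer", re.IGNORECASE)


def has_ssn_context(text: str, span: tuple[int, int], window: int = 40) -> bool:
    """Return True if an SSN keyword appears within `window` characters of the match."""
    start, end = span
    nearby = text[max(0, start - window): end + window]
    return bool(SSN_KEYWORDS.search(nearby))
```

A scanner could then drop credit card matches that fail `luhn_valid` and lower the severity of SSN matches for which `has_ssn_context` returns False.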

ML-Based Classifiers

Cloud DLP services use trained machine learning models that understand context, not just patterns.

Amazon Macie automatically discovers and classifies sensitive data in S3 using machine learning and pattern matching:

import boto3

macie = boto3.client('macie2', region_name='us-east-1')

# Create a classification job
response = macie.create_classification_job(
    name='production-data-scan',
    jobType='SCHEDULED',
    scheduleFrequency={'weeklySchedule': {'dayOfWeek': 'MONDAY'}},
    s3JobDefinition={
        'bucketDefinitions': [
            {
                'accountId': '123456789012',
                'buckets': ['production-data-bucket', 'backup-bucket']
            }
        ]
    },
    samplingPercentage=100
)

Google Cloud DLP (now part of Sensitive Data Protection) supports 150+ built-in information types (infoTypes), including PII, credentials, and country-specific identifiers:

from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()

inspect_config = dlp_v2.InspectConfig(
    info_types=[
        dlp_v2.InfoType(name='US_SOCIAL_SECURITY_NUMBER'),
        dlp_v2.InfoType(name='CREDIT_CARD_NUMBER'),
        dlp_v2.InfoType(name='EMAIL_ADDRESS'),
        dlp_v2.InfoType(name='PERSON_NAME'),
    ],
    min_likelihood=dlp_v2.Likelihood.LIKELY,
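    include_quote=False  # Do not include the actual sensitive value in findings
)

# Run the inspection. 'your-project-id' is a placeholder for your own project.
response = dlp.inspect_content(
    request={
        'parent': 'projects/your-project-id/locations/global',
        'inspect_config': inspect_config,
        'item': {'value': 'Contact Jane Doe at jane.doe@example.com'},
    }
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood.name)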

Microsoft Purview — Integrates DLP across Microsoft 365, Teams, Exchange, SharePoint, and OneDrive with pre-built sensitive information types for 100+ countries.

DLP in CI/CD: Preventing Secrets in Code

One of the highest-value DLP use cases for engineering teams is preventing secrets, credentials, and PII from being committed to source code repositories.

Gitleaks as a pre-commit hook:

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.4
    hooks:
      - id: gitleaks
        entry: gitleaks protect --staged --redact --exit-code 1

GitHub Advanced Security secret scanning automatically scans repositories and alerts on 200+ token types from cloud providers, API services, and credentials.

For historical exposure, scan your entire git history:

gitleaks detect --source . --log-level debug --report-path gitleaks-report.json
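
A history scan on an older repository can produce a long report, so it helps to group findings before triaging them. The sketch below assumes the Gitleaks v8 JSON report is a list of objects with a `RuleID` field; `load_report` and `summarize_report` are hypothetical helper names.

```python
import json
from collections import Counter


def load_report(path: str) -> list[dict]:
    """Load a Gitleaks JSON report (assumed to be a list of finding objects)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def summarize_report(findings: list[dict]) -> Counter:
    """Count findings per rule so the noisiest or riskiest rules stand out."""
    return Counter(f.get("RuleID", "unknown") for f in findings)
```

Calling `summarize_report(load_report("gitleaks-report.json")).most_common()` would list rule IDs by number of findings, which gives a starting order for credential rotation.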

Alert Strategies and Response

Raw DLP findings without a clear response workflow become noise that analysts ignore.

Tiered response model:

| Severity | Trigger | Response |
| --- | --- | --- |
| Critical | SSN/PHI/PAN sent externally | Auto-block + immediate investigation |
| High | PII shared to unmanaged device | Auto-quarantine + manager notification |
| Medium | Internal credentials in Slack | Alert to security + automated reminder to user |
| Low | Email address in public document | Log only, batch review weekly |
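
One way to make the tiers actionable is to encode them as a triage function that every finding passes through. This is a minimal sketch under stated assumptions: the `Finding` fields, the destination labels, and the rule that anything unmatched falls through to Low are all hypothetical and should be adapted to your own policy.

```python
from dataclasses import dataclass

RESTRICTED_TYPES = {"ssn", "phi", "pan"}
PII_TYPES = RESTRICTED_TYPES | {"pii"}

# Responses taken from the tiered response model.
RESPONSES = {
    "critical": "Auto-block + immediate investigation",
    "high": "Auto-quarantine + manager notification",
    "medium": "Alert to security + automated reminder to user",
    "low": "Log only, batch review weekly",
}


@dataclass
class Finding:
    data_type: str    # e.g. "ssn", "phi", "pan", "pii", "credential", "email"
    destination: str  # e.g. "external", "unmanaged_device", "internal_chat", "public_doc"


def triage(finding: Finding) -> str:
    """Map a finding to a severity tier, checking the most severe rules first."""
    if finding.data_type in RESTRICTED_TYPES and finding.destination == "external":
        return "critical"
    if finding.data_type in PII_TYPES and finding.destination == "unmanaged_device":
        return "high"
    if finding.data_type == "credential" and finding.destination == "internal_chat":
        return "medium"
    # Assumption: anything that matches no rule is logged for batch review.
    return "low"
```

`RESPONSES[triage(finding)]` then gives the action to take, and keeping the rules in one ordered function makes it easy to audit why a given finding landed in its tier.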

User education, not just blocking:

Aggressive DLP that blocks frequently and without explanation creates shadow IT — employees work around DLP by using personal devices or external services. Effective DLP:

  • Explains WHY a transfer was blocked in user-friendly language
  • Provides an approved alternative (e.g., "Use the secure file sharing portal instead")
  • Allows a business justification override workflow for edge cases
  • Uses just-in-time education to explain the policy

False positive management:

DLP policies require continuous tuning. Track false positive rates per policy. Any policy generating more than 30% false positives should be re-tuned before analysts stop trusting it.
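
Tracking this does not need special tooling. The sketch below assumes analysts record each reviewed alert as a `(policy_name, is_false_positive)` pair; the function names and that input shape are hypothetical, and the 30% threshold comes from the guideline above.

```python
from collections import defaultdict


def false_positive_rates(reviews):
    """Compute the false positive rate per policy.

    `reviews` is an iterable of (policy_name, is_false_positive) tuples.
    """
    totals = defaultdict(int)
    false_positives = defaultdict(int)
    for policy, is_fp in reviews:
        totals[policy] += 1
        if is_fp:
            false_positives[policy] += 1
    return {policy: false_positives[policy] / totals[policy] for policy in totals}


def needs_retuning(rates, threshold=0.30):
    """Return the policies whose false positive rate exceeds the threshold."""
    return sorted(policy for policy, rate in rates.items() if rate > threshold)
```

Running this over a week or month of reviewed alerts gives a short list of policies to re-tune first.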

Modern DLP is not a firewall for data — it is a combination of visibility, classification, policy enforcement, and user education. The goal is not to block everything; it is to make sensitive data handling deliberate and auditable.
