LLM Cost Security: Preventing Prompt Flooding and API Abuse
How prompt flooding works as a financial denial-of-service attack, and how to implement rate limiting, token budgets, cost alerting, and abuse detection to protect your LLM application.
LLM API costs scale linearly with usage — typically measured in tokens processed, not requests made. This creates a vulnerability unique to AI applications: a malicious actor can trigger enormous financial costs without exploiting any code vulnerability. They simply send large inputs to your AI feature until your API bill reaches thousands of dollars.
This is financial denial-of-service, and it's underappreciated as a security threat. Traditional DoS protection defends availability; LLM cost abuse doesn't take your service down — it drains your budget while your service continues running, sometimes for hours before anyone notices.
The Financial DoS Attack Surface
Attack Vectors
1. Context window flooding
Modern LLMs have large context windows — GPT-4o supports 128,000 tokens, Claude 3.5 Sonnet supports 200,000 tokens. An attacker who fills the context window pays almost nothing (a request to your free tier or low-cost UI), while you pay the provider for processing those tokens: 200,000 tokens at $0.003/1K is $0.60 per request.
Scale this: a single attacker making 100 requests/minute for an hour processes 200K × 6,000 = 1.2 billion tokens. At $0.003/1K, that's $3,600 in one hour.
```python
# What an attack looks like
import requests

def flood_context(base_url: str):
    filler = "a " * 100000  # ~100K tokens of garbage
    payload = {
        "messages": [{"role": "user", "content": filler + "What is 2+2?"}]
    }
    for _ in range(1000):
        requests.post(f"{base_url}/chat", json=payload)
```
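The arithmetic above is worth making concrete. A back-of-envelope sketch (the function name and pricing are illustrative — substitute your provider's actual per-token rates):

```python
def flood_cost_usd(tokens_per_request: int, requests_per_minute: int,
                   duration_minutes: int, price_per_1k_tokens: float) -> float:
    """Estimated provider bill for a sustained flooding attack."""
    total_tokens = tokens_per_request * requests_per_minute * duration_minutes
    return total_tokens / 1000 * price_per_1k_tokens

# The scenario above: 200K-token requests, 100 req/min, for one hour, at $0.003/1K
flood_cost_usd(200_000, 100, 60, 0.003)  # ≈ $3,600
```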
2. Recursive or expansion attacks
```
"Write a 10-page story. For every paragraph in the story, provide
a 10-page expansion. For every paragraph in those expansions,
provide a further 10-page expansion."
```
Without output length limits, this generates a theoretically unlimited token count.
3. Expensive model abuse
If your application allows model selection and you have tiered pricing (cheaper model for basic queries, expensive model for complex ones), attackers route all requests to the most expensive model.
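A straightforward guard is to never trust a client-supplied model name and map each plan to an explicit allowlist. A minimal sketch (plan names and model choices are hypothetical):

```python
# Hypothetical server-side allowlist: each plan maps to the models it may use.
ALLOWED_MODELS = {
    "free": {"gpt-4o-mini"},
    "pro": {"gpt-4o-mini", "gpt-4o"},
}

def resolve_model(requested_model: str, plan: str) -> str:
    """Return the requested model only if the plan permits it."""
    allowed = ALLOWED_MODELS.get(plan, ALLOWED_MODELS["free"])
    if requested_model not in allowed:
        # Fall back to the cheapest permitted model instead of erroring
        return "gpt-4o-mini"
    return requested_model
```

Falling back silently (rather than returning an error) denies the attacker feedback about which model actually served the request.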
4. Embedding generation abuse
Generating embeddings is cheaper per token than text generation, but easier to abuse at volume — embedding APIs typically have higher rate limits, and the request pattern is simpler to automate.
5. Fine-tuned model abuse
If you expose a fine-tuned model via API, attackers can run inference at your cost rather than paying for their own.
Rate Limiting Architecture
Rate limiting for LLM applications requires tracking multiple dimensions that don't apply to traditional APIs.
Token-Based Rate Limiting
Requests vary enormously in cost based on token count. A rate limit based only on request count (N requests per minute) is insufficient — one request with 128K tokens costs as much as 128 requests with 1K tokens each.
```python
import redis
import tiktoken
from datetime import datetime, timezone

encoder = tiktoken.encoding_for_model("gpt-4o")

class RateLimitError(Exception):
    pass

class TokenBudgetRateLimiter:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
        # Per-user token budgets (configurable per plan)
        self.budgets = {
            "free": {"tokens_per_hour": 10_000, "tokens_per_day": 50_000, "requests_per_minute": 5},
            "pro": {"tokens_per_hour": 100_000, "tokens_per_day": 500_000, "requests_per_minute": 30},
            "enterprise": {"tokens_per_hour": 1_000_000, "tokens_per_day": 10_000_000, "requests_per_minute": 200},
        }

    def count_tokens(self, messages: list[dict]) -> int:
        total = 0
        for message in messages:
            total += len(encoder.encode(message.get("content", "")))
            total += 4  # per-message overhead
        return total

    def check_and_consume(self, user_id: str, plan: str, messages: list[dict]) -> None:
        input_tokens = self.count_tokens(messages)
        budget = self.budgets.get(plan, self.budgets["free"])
        now = datetime.now(timezone.utc)
        hour_key = f"tokens:h:{user_id}:{now.strftime('%Y%m%d%H')}"
        day_key = f"tokens:d:{user_id}:{now.strftime('%Y%m%d')}"
        rpm_key = f"rpm:{user_id}:{now.strftime('%Y%m%d%H%M')}"

        pipe = self.redis.pipeline()
        pipe.incrby(hour_key, input_tokens)
        pipe.expire(hour_key, 3600)
        pipe.incrby(day_key, input_tokens)
        pipe.expire(day_key, 86400)
        pipe.incr(rpm_key)
        pipe.expire(rpm_key, 60)
        hourly_tokens, _, daily_tokens, _, minute_requests, _ = pipe.execute()

        if minute_requests > budget["requests_per_minute"]:
            raise RateLimitError(f"Request rate limit exceeded: {budget['requests_per_minute']} rpm")
        if hourly_tokens > budget["tokens_per_hour"]:
            raise RateLimitError("Hourly token budget exceeded")
        if daily_tokens > budget["tokens_per_day"]:
            raise RateLimitError("Daily token budget exceeded")
```
Input Token Validation
Validate input size before making any API call:
```python
MAX_INPUT_TOKENS = {
    "free": 2_000,
    "pro": 16_000,
    "enterprise": 128_000,
}

def validate_input_length(messages: list[dict], user_plan: str) -> list[dict]:
    """Reject input that exceeds the plan's token limit."""
    max_tokens = MAX_INPUT_TOKENS.get(user_plan, 2_000)
    total_tokens = count_tokens(messages)  # token counter from the rate limiter above
    if total_tokens > max_tokens:
        raise ValueError(
            f"Input too long: {total_tokens} tokens (plan limit: {max_tokens}). "
            f"Please shorten your input."
        )
    return messages
```
Output Token Limits
Always set max_tokens on every API call. Set it based on the expected output for your use case, not the maximum possible:
```python
import openai

MAX_OUTPUT_TOKENS = {
    "summarize": 500,
    "chat_response": 1024,
    "code_generation": 2048,
    "document_analysis": 4096,
}

def create_completion(
    messages: list[dict],
    use_case: str,
    model: str = "gpt-4o",
) -> str:
    max_tokens = MAX_OUTPUT_TOKENS.get(use_case, 1024)
    response = openai.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,  # Never omit this
        timeout=30,  # Prevent runaway requests
    )
    return response.choices[0].message.content
```
Cost Monitoring and Alerting
Real-Time Cost Tracking
```python
import redis
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal

@dataclass
class ModelCostConfig:
    input_per_1k: Decimal
    output_per_1k: Decimal

MODEL_COSTS = {
    "gpt-4o": ModelCostConfig(Decimal("0.0025"), Decimal("0.010")),
    "gpt-4o-mini": ModelCostConfig(Decimal("0.000150"), Decimal("0.000600")),
    "claude-3-5-sonnet-20241022": ModelCostConfig(Decimal("0.003"), Decimal("0.015")),
    "claude-3-haiku-20240307": ModelCostConfig(Decimal("0.00025"), Decimal("0.00125")),
    "text-embedding-3-large": ModelCostConfig(Decimal("0.00013"), Decimal("0")),
}

class CostTracker:
    def __init__(self, redis_client: redis.Redis, alert_service):
        self.redis = redis_client
        self.alerts = alert_service

    def record_usage(
        self,
        user_id: str,
        model: str,
        input_tokens: int,
        output_tokens: int,
    ) -> Decimal:
        costs = MODEL_COSTS.get(model)
        if not costs:
            return Decimal("0")
        cost = (
            Decimal(input_tokens) / 1000 * costs.input_per_1k +
            Decimal(output_tokens) / 1000 * costs.output_per_1k
        )
        # Update running totals, stored as integer ten-thousandths of a
        # dollar to avoid float precision issues
        now = datetime.now(timezone.utc)
        cost_units = int(cost * 10000)
        pipe = self.redis.pipeline()
        pipe.incrby(f"cost:user:{user_id}:hour:{now.strftime('%Y%m%d%H')}", cost_units)
        pipe.incrby(f"cost:user:{user_id}:day:{now.strftime('%Y%m%d')}", cost_units)
        pipe.incrby(f"cost:global:hour:{now.strftime('%Y%m%d%H')}", cost_units)
        pipe.incrby(f"cost:global:day:{now.strftime('%Y%m%d')}", cost_units)
        pipe.execute()

        # Alert on anomalous per-user costs
        raw = self.redis.get(f"cost:user:{user_id}:hour:{now.strftime('%Y%m%d%H')}")
        hourly_user_cost = Decimal(int(raw or 0)) / 10000
        if hourly_user_cost > Decimal("10.00"):
            self.alerts.send(
                severity="warning",
                message=f"High LLM cost: user {user_id} spent ${hourly_user_cost:.2f} this hour",
            )
        return cost
```
Cost Anomaly Detection
```python
import statistics
from datetime import datetime, timedelta, timezone

import redis

class CostAnomalyDetector:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    def detect_anomaly(self, user_id: str) -> dict:
        now = datetime.now(timezone.utc)
        # Get last 7 days of hourly costs (stored as ten-thousandths of a dollar)
        hourly_costs = []
        for i in range(168):  # 7 days * 24 hours
            hour = (now - timedelta(hours=i)).strftime('%Y%m%d%H')
            cost = int(self.redis.get(f"cost:user:{user_id}:hour:{hour}") or 0)
            hourly_costs.append(cost)
        if not hourly_costs:
            return {"anomaly": False}

        mean = statistics.mean(hourly_costs)
        stdev = statistics.stdev(hourly_costs) if len(hourly_costs) > 1 else 0
        current_hour_cost = hourly_costs[0]
        # Alert if the current hour is more than 3 standard deviations
        # above the mean and at least 5x the mean
        z_score = (current_hour_cost - mean) / stdev if stdev > 0 else 0
        is_anomaly = z_score > 3 and current_hour_cost > mean * 5
        return {
            "anomaly": is_anomaly,
            "current_hour_cost_usd": current_hour_cost / 10000,
            "mean_hourly_cost_usd": mean / 10000,
            "z_score": z_score,
        }
```
Abuse Detection Patterns
Detecting Flooding Behavior
```python
import math
import time
from collections import Counter

import redis

class AbuseDetector:
    SUSPICIOUS_PATTERNS = [
        "a " * 100,   # Repetitive filler
        "." * 200,    # Dot flooding
        "\n" * 100,   # Newline flooding
    ]

    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    def detect_filler_content(self, text: str) -> dict:
        """Detect suspiciously repetitive content designed to inflate token count."""
        if len(text) < 100:
            return {"flagged": False}
        # Check character entropy (low entropy = repetitive content)
        char_counts = Counter(text)
        total = len(text)
        entropy = -sum((c / total) * math.log2(c / total) for c in char_counts.values())
        # Normal English text has entropy around 4-5 bits per character;
        # repetitive filler has entropy near 0
        is_low_entropy = entropy < 2.0
        # Check for known filler patterns
        has_filler_pattern = any(
            pattern in text for pattern in self.SUSPICIOUS_PATTERNS
        )
        return {
            "flagged": is_low_entropy or has_filler_pattern,
            "entropy": entropy,
            "reason": "low entropy content" if is_low_entropy else "filler pattern detected",
        }

    def detect_rapid_fire_requests(self, user_id: str, window_seconds: int = 10) -> bool:
        """Detect bursts of requests faster than human typing speed."""
        # Bucket time into fixed windows so one counter covers the whole window
        bucket = int(time.time()) // window_seconds
        key = f"burst:{user_id}:{bucket}"
        count = self.redis.incr(key)
        self.redis.expire(key, window_seconds)
        # 10 requests in 10 seconds is faster than human interaction
        return count > 10
```
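The entropy heuristic is easy to sanity-check in isolation. A small sketch of the same calculation the detector uses, applied to filler and to ordinary English:

```python
import math
from collections import Counter

def char_entropy(text: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

char_entropy("a " * 500)   # exactly 1.0: two equally likely characters
char_entropy("The quick brown fox jumps over the lazy dog.")  # well above the 2.0 threshold
```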
CAPTCHA and Proof-of-Work for High-Cost Operations
For operations that trigger expensive inference, add friction before the API call:
```python
import secrets
from hashlib import sha256

def generate_proof_of_work_challenge() -> dict:
    """Generate a simple proof-of-work that costs compute on the client side."""
    nonce = secrets.token_hex(16)
    difficulty = 4  # Number of leading zeros required
    return {
        "nonce": nonce,
        "difficulty": difficulty,
        "challenge": f"Find x such that sha256('{nonce}' + x) starts with {'0' * difficulty}",
    }

def verify_proof_of_work(nonce: str, solution: str, difficulty: int) -> bool:
    hash_result = sha256(f"{nonce}{solution}".encode()).hexdigest()
    return hash_result.startswith('0' * difficulty)
```
Proof-of-work is not appropriate for all UX contexts, but for API endpoints that are particularly expensive (long-context analysis, multi-step agent tasks), it meaningfully raises the cost of abuse.
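For completeness, the client-side search that answers such a challenge might look like the brute-force sketch below; at difficulty 4 it takes on the order of tens of thousands of hashes, which is negligible for one legitimate request but adds up across thousands of abusive ones:

```python
from hashlib import sha256
from itertools import count

def solve_proof_of_work(nonce: str, difficulty: int) -> str:
    """Brute-force a solution the server-side verifier will accept."""
    target = "0" * difficulty
    for i in count():
        candidate = str(i)
        if sha256(f"{nonce}{candidate}".encode()).hexdigest().startswith(target):
            return candidate
```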
Provider-Level Controls
OpenAI Usage Limits
Configure hard usage limits in the OpenAI dashboard:
```python
def set_usage_limits(project_id: str, monthly_budget_usd: float):
    """Placeholder: usage limits are configured per project in the
    OpenAI dashboard rather than in application code.

    - Hard limit: the project is blocked when the limit is reached
    - Soft limit: an email alert is sent when the limit is reached
    """
    print(f"Set monthly limit of ${monthly_budget_usd} for project {project_id}")
    print("Configure via: platform.openai.com/account/limits")
```
Separate Projects per Environment
Use separate API keys and projects per environment with separate spending limits:
- Production: high limit with alerting
- Staging: lower limit
- Development: strict per-developer limits
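One way to enforce the separation is to make key selection explicit in configuration, so production credentials can never leak into development code paths. A sketch (the environment-variable names and limit values are hypothetical):

```python
import os

# Hypothetical per-environment configuration: one API key and one
# spending ceiling per environment.
ENV_CONFIG = {
    "production": {"key_var": "OPENAI_KEY_PROD", "monthly_limit_usd": 5000},
    "staging": {"key_var": "OPENAI_KEY_STAGING", "monthly_limit_usd": 200},
    "development": {"key_var": "OPENAI_KEY_DEV", "monthly_limit_usd": 25},
}

def get_api_key(environment: str) -> str:
    """Load the API key for one environment; fail loudly if it is missing."""
    config = ENV_CONFIG[environment]
    key = os.environ.get(config["key_var"])
    if key is None:
        raise RuntimeError(f"Missing API key for {environment}: set {config['key_var']}")
    return key
```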
Economic Model for Free Tiers
If you offer a free tier with AI features, model the economics explicitly:
```python
FREE_TIER_ECONOMICS = {
    "max_requests_per_day": 20,
    "max_tokens_per_request": 2000,
    "model": "gpt-4o-mini",  # Use the cheapest model for the free tier
    "cost_per_heavy_user_per_month": 0.50,  # Estimated
    "acceptable_cost_per_user": 1.00,
}

def is_free_tier_sustainable(free_user_count: int, actual_avg_cost_per_user: float) -> dict:
    total_monthly_cost = free_user_count * actual_avg_cost_per_user
    projected_annual = total_monthly_cost * 12
    over_budget = actual_avg_cost_per_user > FREE_TIER_ECONOMICS["acceptable_cost_per_user"]
    return {
        "monthly_cost": total_monthly_cost,
        "annual_projection": projected_annual,
        "cost_per_user": actual_avg_cost_per_user,
        "within_budget": not over_budget,
        "recommendation": (
            "increase free tier restrictions" if over_budget
            else "current limits acceptable"
        ),
    }
```
Incident Response for Cost Abuse
When a cost spike is detected:
- Immediate: Query which user(s) are driving the spike
- Containment: Temporarily reduce rate limits for the implicated accounts
- Investigation: Review request logs for the attack pattern
- Remediation: Adjust rate limits, add validation, or block accounts
- Provider: Contact OpenAI/Anthropic if charges were fraudulent (e.g., from a stolen credit card buying API credits)
```python
class CostIncidentResponder:
    HIGH_COST_THRESHOLD = 25.00  # USD per user per hour; tune for your product

    def __init__(self, rate_limiter, audit_log, alerts):
        # Collaborators from the components defined earlier
        self.rate_limiter = rate_limiter
        self.audit_log = audit_log
        self.alerts = alerts

    def handle_spike(self, spike_data: dict):
        top_users = self.identify_top_cost_users(spike_data["hour"])
        for user_id, cost in top_users:
            if cost > self.HIGH_COST_THRESHOLD:
                # Temporarily restrict
                self.rate_limiter.set_emergency_limits(user_id, factor=0.1)
                self.audit_log.record("cost_spike_restriction", {
                    "user_id": user_id,
                    "cost": cost,
                    "action": "emergency_rate_limit",
                })
                # Flag for human review
                self.alerts.send_to_security_team(
                    f"Possible cost abuse: {user_id} spent ${cost:.2f} in one hour"
                )
```
LLM cost abuse is a real financial risk that requires the same systematic treatment as security vulnerabilities. Rate limiting, monitoring, and alerting are not optional features — they are prerequisites for deploying AI features at scale without incurring unexpected financial exposure.