LLM Cost Security: Preventing Prompt Flooding and API Abuse
How prompt flooding works as a financial denial-of-service attack, and how to implement rate limiting, token budgets, cost alerting, and abuse detection to protect your LLM application.
LLM API costs scale linearly with usage — typically measured in tokens processed, not requests made. This creates a vulnerability unique to AI applications: a malicious actor can trigger enormous financial costs without exploiting any code vulnerability. They simply send large inputs to your AI feature until your API bill reaches thousands of dollars.
This is financial denial-of-service, and it's underappreciated as a security threat. Traditional DoS protection defends availability; LLM cost abuse doesn't take your service down — it drains your budget while your service continues running, sometimes for hours before anyone notices.
The Financial DoS Attack Surface
Attack Vectors
1. Context window flooding
Modern LLMs have large context windows — GPT-4o supports 128,000 tokens, Claude 3.5 Sonnet supports 200,000 tokens. An attacker who fills the context window pays almost nothing (a request to your free tier or low-cost UI), while you pay the provider for processing those tokens: 200,000 tokens at $0.003/1K is $0.60 per request.
Scale this: a single attacker making 100 requests/minute for an hour processes 200K × 6,000 = 1.2 billion tokens. At $0.003/1K, that's $3,600 in one hour.
```python
# What an attack looks like
import requests

def flood_context(base_url: str):
    filler = "a " * 100000  # ~100K tokens of garbage
    payload = {
        "messages": [{"role": "user", "content": filler + "What is 2+2?"}]
    }
    for _ in range(1000):
        requests.post(f"{base_url}/chat", json=payload)
```
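The arithmetic above is worth making concrete. A back-of-envelope sketch (the function name and pricing are illustrative — substitute your provider's actual per-token rates):

```python
def flood_cost_usd(tokens_per_request: int, requests_per_minute: int,
                   duration_minutes: int, price_per_1k_tokens: float) -> float:
    """Estimated provider bill for a sustained flooding attack."""
    total_tokens = tokens_per_request * requests_per_minute * duration_minutes
    return total_tokens / 1000 * price_per_1k_tokens

# The scenario above: 200K-token requests, 100 req/min, for one hour, at $0.003/1K
flood_cost_usd(200_000, 100, 60, 0.003)  # ≈ $3,600
```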
2. Recursive or expansion attacks
```
"Write a 10-page story. For every paragraph in the story, provide
a 10-page expansion. For every paragraph in those expansions,
provide a further 10-page expansion."
```
Without output length limits, this generates a theoretically unlimited token count.
3. Expensive model abuse
If your application allows model selection and you have tiered pricing (cheaper model for basic queries, expensive model for complex ones), attackers route all requests to the most expensive model.
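A straightforward guard is to never trust a client-supplied model name and map each plan to an explicit allowlist. A minimal sketch (plan names and model choices are hypothetical):

```python
# Hypothetical server-side allowlist: each plan maps to the models it may use.
ALLOWED_MODELS = {
    "free": {"gpt-4o-mini"},
    "pro": {"gpt-4o-mini", "gpt-4o"},
}

def resolve_model(requested_model: str, plan: str) -> str:
    """Return the requested model only if the plan permits it."""
    allowed = ALLOWED_MODELS.get(plan, ALLOWED_MODELS["free"])
    if requested_model not in allowed:
        # Fall back to the cheapest permitted model instead of erroring
        return "gpt-4o-mini"
    return requested_model
```

Falling back silently (rather than returning an error) denies the attacker feedback about which model actually served the request.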
4. Embedding generation abuse
Generating embeddings is cheaper per token than text generation, but easier to abuse at volume — embedding APIs typically have higher rate limits, and the request pattern is simpler to automate.
5. Fine-tuned model abuse
If you expose a fine-tuned model via API, attackers can run inference at your cost rather than paying for their own.
Rate Limiting Architecture
Rate limiting for LLM applications requires tracking multiple dimensions that don't apply to traditional APIs.
Token-Based Rate Limiting
Requests vary enormously in cost based on token count. A rate limit based only on request count (N requests per minute) is insufficient — one request with 128K tokens costs as much as 128 requests with 1K tokens each.
```python
import redis
import tiktoken
from datetime import datetime, timezone

encoder = tiktoken.encoding_for_model("gpt-4o")

class RateLimitError(Exception):
    pass

class TokenBudgetRateLimiter:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
        # Per-user token budgets (configurable per plan)
        self.budgets = {
            "free": {"tokens_per_hour": 10_000, "tokens_per_day": 50_000, "requests_per_minute": 5},
            "pro": {"tokens_per_hour": 100_000, "tokens_per_day": 500_000, "requests_per_minute": 30},
            "enterprise": {"tokens_per_hour": 1_000_000, "tokens_per_day": 10_000_000, "requests_per_minute": 200},
        }

    def count_tokens(self, messages: list[dict]) -> int:
        total = 0
        for message in messages:
            total += len(encoder.encode(message.get("content", "")))
            total += 4  # per-message overhead
        return total

    def check_and_consume(self, user_id: str, plan: str, messages: list[dict]) -> None:
        input_tokens = self.count_tokens(messages)
        budget = self.budgets.get(plan, self.budgets["free"])
        now = datetime.now(timezone.utc)
        hour_key = f"tokens:h:{user_id}:{now.strftime('%Y%m%d%H')}"
        day_key = f"tokens:d:{user_id}:{now.strftime('%Y%m%d')}"
        rpm_key = f"rpm:{user_id}:{now.strftime('%Y%m%d%H%M')}"

        pipe = self.redis.pipeline()
        pipe.incrby(hour_key, input_tokens)
        pipe.expire(hour_key, 3600)
        pipe.incrby(day_key, input_tokens)
        pipe.expire(day_key, 86400)
        pipe.incr(rpm_key)
        pipe.expire(rpm_key, 60)
        hourly_tokens, _, daily_tokens, _, minute_requests, _ = pipe.execute()

        if minute_requests > budget["requests_per_minute"]:
            raise RateLimitError(f"Request rate limit exceeded: {budget['requests_per_minute']} rpm")
        if hourly_tokens > budget["tokens_per_hour"]:
            raise RateLimitError("Hourly token budget exceeded")
        if daily_tokens > budget["tokens_per_day"]:
            raise RateLimitError("Daily token budget exceeded")
```
Input Token Validation
Validate input size before making any API call:
```python
MAX_INPUT_TOKENS = {
    "free": 2_000,
    "pro": 16_000,
    "enterprise": 128_000,
}

def validate_input_length(messages: list[dict], user_plan: str) -> list[dict]:
    """Reject input that exceeds the plan's token limit."""
    max_tokens = MAX_INPUT_TOKENS.get(user_plan, 2_000)
    total_tokens = count_tokens(messages)  # token counter from the rate limiter above
    if total_tokens > max_tokens:
        raise ValueError(
            f"Input too long: {total_tokens} tokens (plan limit: {max_tokens}). "
            f"Please shorten your input."
        )
    return messages
```
Output Token Limits
Always set max_tokens on every API call. Set it based on the expected output for your use case, not the maximum possible:
```python
import openai

MAX_OUTPUT_TOKENS = {
    "summarize": 500,
    "chat_response": 1024,
    "code_generation": 2048,
    "document_analysis": 4096,
}

def create_completion(
    messages: list[dict],
    use_case: str,
    model: str = "gpt-4o",
) -> str:
    max_tokens = MAX_OUTPUT_TOKENS.get(use_case, 1024)
    response = openai.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,  # Never omit this
        timeout=30,  # Prevent runaway requests
    )
    return response.choices[0].message.content
```
Cost Monitoring and Alerting
Real-Time Cost Tracking
```python
import redis
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal

@dataclass
class ModelCostConfig:
    input_per_1k: Decimal
    output_per_1k: Decimal

MODEL_COSTS = {
    "gpt-4o": ModelCostConfig(Decimal("0.0025"), Decimal("0.010")),
    "gpt-4o-mini": ModelCostConfig(Decimal("0.000150"), Decimal("0.000600")),
    "claude-3-5-sonnet-20241022": ModelCostConfig(Decimal("0.003"), Decimal("0.015")),
    "claude-3-haiku-20240307": ModelCostConfig(Decimal("0.00025"), Decimal("0.00125")),
    "text-embedding-3-large": ModelCostConfig(Decimal("0.00013"), Decimal("0")),
}

class CostTracker:
    def __init__(self, redis_client: redis.Redis, alert_service):
        self.redis = redis_client
        self.alerts = alert_service

    def record_usage(
        self,
        user_id: str,
        model: str,
        input_tokens: int,
        output_tokens: int,
    ) -> Decimal:
        costs = MODEL_COSTS.get(model)
        if not costs:
            return Decimal("0")
        cost = (
            Decimal(input_tokens) / 1000 * costs.input_per_1k +
            Decimal(output_tokens) / 1000 * costs.output_per_1k
        )
        # Update running totals, stored as integer ten-thousandths of a
        # dollar to avoid float precision issues
        now = datetime.now(timezone.utc)
        cost_units = int(cost * 10000)
        pipe = self.redis.pipeline()
        pipe.incrby(f"cost:user:{user_id}:hour:{now.strftime('%Y%m%d%H')}", cost_units)
        pipe.incrby(f"cost:user:{user_id}:day:{now.strftime('%Y%m%d')}", cost_units)
        pipe.incrby(f"cost:global:hour:{now.strftime('%Y%m%d%H')}", cost_units)
        pipe.incrby(f"cost:global:day:{now.strftime('%Y%m%d')}", cost_units)
        pipe.execute()

        # Alert on anomalous per-user costs
        raw = self.redis.get(f"cost:user:{user_id}:hour:{now.strftime('%Y%m%d%H')}")
        hourly_user_cost = Decimal(int(raw or 0)) / 10000
        if hourly_user_cost > Decimal("10.00"):
            self.alerts.send(
                severity="warning",
                message=f"High LLM cost: user {user_id} spent ${hourly_user_cost:.2f} this hour",
            )
        return cost
```
Cost Anomaly Detection
```python
import statistics
from datetime import datetime, timedelta, timezone

import redis

class CostAnomalyDetector:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    def detect_anomaly(self, user_id: str) -> dict:
        now = datetime.now(timezone.utc)
        # Get last 7 days of hourly costs (stored as ten-thousandths of a dollar)
        hourly_costs = []
        for i in range(168):  # 7 days * 24 hours
            hour = (now - timedelta(hours=i)).strftime('%Y%m%d%H')
            cost = int(self.redis.get(f"cost:user:{user_id}:hour:{hour}") or 0)
            hourly_costs.append(cost)
        if not hourly_costs:
            return {"anomaly": False}

        mean = statistics.mean(hourly_costs)
        stdev = statistics.stdev(hourly_costs) if len(hourly_costs) > 1 else 0
        current_hour_cost = hourly_costs[0]
        # Alert if the current hour is more than 3 standard deviations
        # above the mean and at least 5x the mean
        z_score = (current_hour_cost - mean) / stdev if stdev > 0 else 0
        is_anomaly = z_score > 3 and current_hour_cost > mean * 5
        return {
            "anomaly": is_anomaly,
            "current_hour_cost_usd": current_hour_cost / 10000,
            "mean_hourly_cost_usd": mean / 10000,
            "z_score": z_score,
        }
```
Abuse Detection Patterns
Detecting Flooding Behavior
```python
import math
import time
from collections import Counter

import redis

class AbuseDetector:
    SUSPICIOUS_PATTERNS = [
        "a " * 100,   # Repetitive filler
        "." * 200,    # Dot flooding
        "\n" * 100,   # Newline flooding
    ]

    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    def detect_filler_content(self, text: str) -> dict:
        """Detect suspiciously repetitive content designed to inflate token count."""
        if len(text) < 100:
            return {"flagged": False}
        # Check character entropy (low entropy = repetitive content)
        char_counts = Counter(text)
        total = len(text)
        entropy = -sum((c / total) * math.log2(c / total) for c in char_counts.values())
        # Normal English text has entropy around 4-5 bits per character;
        # repetitive filler has entropy near 0
        is_low_entropy = entropy < 2.0
        # Check for known filler patterns
        has_filler_pattern = any(
            pattern in text for pattern in self.SUSPICIOUS_PATTERNS
        )
        return {
            "flagged": is_low_entropy or has_filler_pattern,
            "entropy": entropy,
            "reason": "low entropy content" if is_low_entropy else "filler pattern detected",
        }

    def detect_rapid_fire_requests(self, user_id: str, window_seconds: int = 10) -> bool:
        """Detect bursts of requests faster than human typing speed."""
        # Bucket time into fixed windows so one counter covers the whole window
        bucket = int(time.time()) // window_seconds
        key = f"burst:{user_id}:{bucket}"
        count = self.redis.incr(key)
        self.redis.expire(key, window_seconds)
        # 10 requests in 10 seconds is faster than human interaction
        return count > 10
```
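The entropy heuristic is easy to sanity-check in isolation. A small sketch of the same calculation the detector uses, applied to filler and to ordinary English:

```python
import math
from collections import Counter

def char_entropy(text: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

char_entropy("a " * 500)   # exactly 1.0: two equally likely characters
char_entropy("The quick brown fox jumps over the lazy dog.")  # well above the 2.0 threshold
```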
CAPTCHA and Proof-of-Work for High-Cost Operations
For operations that trigger expensive inference, add friction before the API call:
```python
import secrets
from hashlib import sha256

def generate_proof_of_work_challenge() -> dict:
    """Generate a simple proof-of-work that costs compute on the client side."""
    nonce = secrets.token_hex(16)
    difficulty = 4  # Number of leading zeros required
    return {
        "nonce": nonce,
        "difficulty": difficulty,
        "challenge": f"Find x such that sha256('{nonce}' + x) starts with {'0' * difficulty}",
    }

def verify_proof_of_work(nonce: str, solution: str, difficulty: int) -> bool:
    hash_result = sha256(f"{nonce}{solution}".encode()).hexdigest()
    return hash_result.startswith('0' * difficulty)
```
Proof-of-work is not appropriate for all UX contexts, but for API endpoints that are particularly expensive (long-context analysis, multi-step agent tasks), it meaningfully raises the cost of abuse.
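For completeness, the client-side search that answers such a challenge might look like the brute-force sketch below; at difficulty 4 it takes on the order of tens of thousands of hashes, which is negligible for one legitimate request but adds up across thousands of abusive ones:

```python
from hashlib import sha256
from itertools import count

def solve_proof_of_work(nonce: str, difficulty: int) -> str:
    """Brute-force a solution the server-side verifier will accept."""
    target = "0" * difficulty
    for i in count():
        candidate = str(i)
        if sha256(f"{nonce}{candidate}".encode()).hexdigest().startswith(target):
            return candidate
```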
Provider-Level Controls
OpenAI Usage Limits
Configure hard usage limits in the OpenAI dashboard:
```python
def set_usage_limits(project_id: str, monthly_budget_usd: float):
    """Placeholder: usage limits are configured per project in the
    OpenAI dashboard rather than in application code.

    - Hard limit: the project is blocked when the limit is reached
    - Soft limit: an email alert is sent when the limit is reached
    """
    print(f"Set monthly limit of ${monthly_budget_usd} for project {project_id}")
    print("Configure via: platform.openai.com/account/limits")
```
Separate Projects per Environment
Use separate API keys and projects per environment with separate spending limits:
- Production: high limit with alerting
- Staging: lower limit
- Development: strict per-developer limits
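One way to enforce the separation is to make key selection explicit in configuration, so production credentials can never leak into development code paths. A sketch (the environment-variable names and limit values are hypothetical):

```python
import os

# Hypothetical per-environment configuration: one API key and one
# spending ceiling per environment.
ENV_CONFIG = {
    "production": {"key_var": "OPENAI_KEY_PROD", "monthly_limit_usd": 5000},
    "staging": {"key_var": "OPENAI_KEY_STAGING", "monthly_limit_usd": 200},
    "development": {"key_var": "OPENAI_KEY_DEV", "monthly_limit_usd": 25},
}

def get_api_key(environment: str) -> str:
    """Load the API key for one environment; fail loudly if it is missing."""
    config = ENV_CONFIG[environment]
    key = os.environ.get(config["key_var"])
    if key is None:
        raise RuntimeError(f"Missing API key for {environment}: set {config['key_var']}")
    return key
```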
Economic Model for Free Tiers
If you offer a free tier with AI features, model the economics explicitly:
```python
FREE_TIER_ECONOMICS = {
    "max_requests_per_day": 20,
    "max_tokens_per_request": 2000,
    "model": "gpt-4o-mini",  # Use the cheapest model for the free tier
    "cost_per_heavy_user_per_month": 0.50,  # Estimated
    "acceptable_cost_per_user": 1.00,
}

def is_free_tier_sustainable(free_user_count: int, actual_avg_cost_per_user: float) -> dict:
    total_monthly_cost = free_user_count * actual_avg_cost_per_user
    projected_annual = total_monthly_cost * 12
    over_budget = actual_avg_cost_per_user > FREE_TIER_ECONOMICS["acceptable_cost_per_user"]
    return {
        "monthly_cost": total_monthly_cost,
        "annual_projection": projected_annual,
        "cost_per_user": actual_avg_cost_per_user,
        "within_budget": not over_budget,
        "recommendation": (
            "increase free tier restrictions" if over_budget
            else "current limits acceptable"
        ),
    }
```
Incident Response for Cost Abuse
When a cost spike is detected:
- Immediate: Query which user(s) are driving the spike
- Containment: Temporarily reduce rate limits for the implicated accounts
- Investigation: Review request logs for the attack pattern
- Remediation: Adjust rate limits, add validation, or block accounts
- Provider: Contact OpenAI/Anthropic if charges were fraudulent (e.g., from a stolen credit card buying API credits)
```python
class CostIncidentResponder:
    HIGH_COST_THRESHOLD = 25.00  # USD per user per hour; tune for your product

    def __init__(self, rate_limiter, audit_log, alerts):
        # Collaborators from the components defined earlier
        self.rate_limiter = rate_limiter
        self.audit_log = audit_log
        self.alerts = alerts

    def handle_spike(self, spike_data: dict):
        top_users = self.identify_top_cost_users(spike_data["hour"])
        for user_id, cost in top_users:
            if cost > self.HIGH_COST_THRESHOLD:
                # Temporarily restrict
                self.rate_limiter.set_emergency_limits(user_id, factor=0.1)
                self.audit_log.record("cost_spike_restriction", {
                    "user_id": user_id,
                    "cost": cost,
                    "action": "emergency_rate_limit",
                })
                # Flag for human review
                self.alerts.send_to_security_team(
                    f"Possible cost abuse: {user_id} spent ${cost:.2f} in one hour"
                )
```
LLM cost abuse is a real financial risk that requires the same systematic treatment as security vulnerabilities. Rate limiting, monitoring, and alerting are not optional features — they are prerequisites for deploying AI features at scale without incurring unexpected financial exposure.