ENGINEERING · 4 MIN READ

The anatomy of a 3-second DDoS that wasn't.

Tuomas Nieminen · SRE Lead
March 5, 2026

On February 27th at 14:03 UTC, our edge routers at FRA1 reported a sudden burst of inbound traffic that peaked at 28 Gbps and, on top of baseline load, saturated one of our 40G uplinks for just over three seconds. On-call paged. Mitigation kicked in. Everything looked like a mid-sized DDoS in progress.

It wasn't a DDoS. It was a customer's cron job.

This is the writeup, the mistakes, and what we changed afterward.

Timeline

14:03:12 UTC  Edge router detects abnormal inbound traffic (28 Gbps)
14:03:14 UTC  Automated DDoS mitigation engages; on-call paged
14:03:16 UTC  Traffic subsides to baseline
14:03:58 UTC  On-call engineer begins investigation
14:07:30 UTC  Traffic pattern correlated to a single customer IP
14:14:00 UTC  Customer identified; webhook receiver endpoint implicated
14:22:00 UTC  Customer contacted
15:10:00 UTC  Root cause confirmed: retry storm in third-party integration
15:45:00 UTC  Customer's webhook queue drained; issue fully resolved

Total customer-visible impact: roughly 3 seconds of elevated packet loss on the affected uplink.

What actually happened

One of our VPS customers runs a service that receives webhooks from a large payment processor. The processor had been struggling with a regional backend issue for about two hours before the spike: its outbound webhook deliveries were failing with slow, intermittent 5xx responses and backing up in a retry queue.

When the regional backend recovered, the processor's retry policy flushed that entire backlog at once. Our customer's endpoint, a single VPS, received approximately 187,000 HTTP POSTs in under four seconds, a sustained rate of roughly 47,000 requests per second.
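
We don't have the processor's code, so what follows is a hypothetical Go sketch of the two drain policies; every name, count, and rate in it is illustrative, not theirs.

  package main

  import (
      "fmt"
      "time"
  )

  // deliver stands in for one outbound webhook POST.
  func deliver(event int) { fmt.Println("POST webhook", event) }

  func main() {
      // A recovered backlog. The real one held ~187,000 events; 1,000
      // keeps the sketch quick to run.
      backlog := make([]int, 1000)
      for i := range backlog {
          backlog[i] = i
      }

      // What the processor effectively did: fire everything at once
      // the moment its backend recovered.
      //   for _, e := range backlog { go deliver(e) }

      // A gentler policy: pace the drain so the receiver sees a
      // bounded rate. 2 ms per delivery = 500 req/s, chosen purely
      // for illustration.
      tick := time.NewTicker(2 * time.Millisecond)
      defer tick.Stop()
      for _, e := range backlog {
          <-tick.C
          deliver(e)
      }
  }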

Most of those connections overflowed the TCP accept queue and were dropped or reset. From our edge's perspective, the traffic pattern (sudden, high volume, concentrated on one IP) was indistinguishable from a volumetric attack.
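
On Linux, this failure mode leaves a fingerprint: the kernel increments the ListenOverflows and ListenDrops counters in /proc/net/netstat every time a full accept queue forces a drop. A minimal Go sketch of reading them (Linux-only, and a sketch rather than our actual tooling):

  package main

  import (
      "bufio"
      "fmt"
      "os"
      "strings"
  )

  func main() {
      f, err := os.Open("/proc/net/netstat")
      if err != nil {
          panic(err)
      }
      defer f.Close()

      // TcpExt appears as a header line of counter names followed by
      // a line of values in the same column order.
      sc := bufio.NewScanner(f)
      var header []string
      for sc.Scan() {
          fields := strings.Fields(sc.Text())
          if len(fields) == 0 || fields[0] != "TcpExt:" {
              continue
          }
          if header == nil {
              header = fields
              continue
          }
          for i, name := range header {
              if name == "ListenOverflows" || name == "ListenDrops" {
                  fmt.Printf("%s = %s\n", name, fields[i])
              }
          }
          break
      }
  }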

Why the DDoS classifier fired

Our DDoS mitigation combines flow-rate analysis with destination-IP variance. The decision rule, simplified (sketched in code after the list):

  1. If >10 Gbps of inbound traffic is concentrated on fewer than 3 destination IPs within a 2-second window, treat it as a potential attack.
  2. If the source-IP entropy is low (few sources, high volume per source), treat it as a potential attack.
  3. If either matches, engage mitigation.
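
A simplified Go sketch of that rule: the 10 Gbps and destination-IP thresholds come from the list above, while the entropy cutoff and everything structural are illustrative rather than our production config.

  package main

  import (
      "fmt"
      "math"
  )

  // Window summarizes inbound traffic over the 2-second decision window.
  type Window struct {
      GbpsIn   float64        // inbound rate
      DstIPs   int            // distinct destination IPs
      SrcBytes map[string]int // bytes seen per source IP
  }

  // srcEntropy returns the Shannon entropy of the source-IP byte
  // distribution: few heavy sources means low entropy.
  func srcEntropy(src map[string]int) float64 {
      total := 0
      for _, b := range src {
          total += b
      }
      if total == 0 {
          return 0
      }
      h := 0.0
      for _, b := range src {
          if b == 0 {
              continue
          }
          p := float64(b) / float64(total)
          h -= p * math.Log2(p)
      }
      return h
  }

  func shouldMitigate(w Window) bool {
      concentrated := w.GbpsIn > 10 && w.DstIPs < 3 // rule 1
      lowEntropy := srcEntropy(w.SrcBytes) < 2.0    // rule 2; cutoff illustrative
      return concentrated || lowEntropy             // rule 3: either suffices
  }

  func main() {
      // The Feb 27 burst: one destination, a few heavy sources in a
      // tight range (TEST-NET addresses for illustration).
      w := Window{
          GbpsIn:   28,
          DstIPs:   1,
          SrcBytes: map[string]int{"203.0.113.10": 1 << 30, "203.0.113.11": 1 << 29},
      }
      fmt.Println("mitigate:", shouldMitigate(w)) // true
  }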

In this incident, the destination was a single IP, and the source-IP entropy looked "low" because the payment processor delivers from a single ASN announcing a tight IP range. Two of the three rules matched. Mitigation engaged. Correctly, given the information available at the time.

Where we got lucky

A real DDoS typically sustains for minutes or hours. This one was over in seconds. Our mitigation scheme is aggressive on engagement and conservative on disengagement, so for the 45 minutes after the burst, some of the payment processor's legitimate traffic was diverted through our scrubbing centre before reaching the customer. That part is on us, and it is what we changed.

The incident was a false positive on engagement, but a real failure on disengagement timing.

What we changed

Three things:

1. Shorter disengagement windows for brief bursts. If a spike resolves within 30 seconds and no secondary bursts follow within 3 minutes, we now drop back to baseline in 90 seconds instead of 45 minutes; the timer logic is sketched after this list. This dramatically reduces the downside of a false positive.

2. Better source classification. We added known-good ASN lists for major cloud providers, payment processors, and CDN origins. Traffic from these sources now requires a higher confidence score before mitigation engages. A real attack rarely originates from Visa.

3. Customer-side alerting. The customer had no idea their webhook endpoint was being hit until we told them. We now offer (opt-in) inbound traffic alerting for VPS customers — if a single IP takes >5× baseline inbound traffic, the customer gets an email and a dashboard notification.
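
The timer logic behind change 1 is simple enough to sketch in Go. The durations are the ones quoted above; the structure is illustrative:

  package main

  import (
      "fmt"
      "time"
  )

  const (
      briefSpike    = 30 * time.Second // spike must resolve within this
      quietWindow   = 3 * time.Minute  // watch for secondary bursts this long
      fastDisengage = 90 * time.Second // new fast path for brief, isolated spikes
      slowDisengage = 45 * time.Minute // old default, still used otherwise
  )

  // disengageAfter decides how long mitigation stays engaged once
  // traffic returns to baseline. secondaryBurst reports whether
  // another burst was seen inside the quietWindow.
  func disengageAfter(spike time.Duration, secondaryBurst bool) time.Duration {
      if spike <= briefSpike && !secondaryBurst {
          return fastDisengage
      }
      return slowDisengage
  }

  func main() {
      // The Feb 27 burst: ~3 seconds, no follow-up.
      fmt.Println(disengageAfter(3*time.Second, false)) // 1m30s
  }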

What we didn't change

Our default mitigation rules are still aggressive. A 28 Gbps burst on a single IP is almost always an attack. For the one time in forty when it isn't, we'd rather engage and apologize than miss a real one.

Takeaway for customers

If your endpoint receives webhooks from a third-party service, assume the worst case is not "the webhook is slow" — it's "the webhook backlog flushes all at once when the upstream recovers." Two things help:

  • Request a delivery-rate limit from the webhook source if they offer one (most modern processors do).
  • Queue writes to a message broker (SQS, NATS, Redis Streams) before processing; see the sketch after this list. Your receiver absorbs the burst; your workers drain it at a sustainable rate.
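
Here is a minimal Go sketch of the second bullet, using an in-process buffered channel as a stand-in for a real broker. The point is the shape: the handler only enqueues and acks, and a separate worker drains at its own pace.

  package main

  import (
      "io"
      "log"
      "net/http"
      "time"
  )

  // process stands in for the real work done per webhook.
  func process(body []byte) {
      _ = body
      time.Sleep(10 * time.Millisecond)
  }

  func main() {
      // In production this would be SQS, NATS, or Redis Streams; a
      // buffered channel shows the shape.
      queue := make(chan []byte, 100000)

      // Worker: drains at a sustainable rate regardless of how fast
      // webhooks arrive.
      go func() {
          for body := range queue {
              process(body)
          }
      }()

      http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
          body, err := io.ReadAll(io.LimitReader(r.Body, 1<<20))
          if err != nil {
              http.Error(w, "read error", http.StatusBadRequest)
              return
          }
          select {
          case queue <- body:
              // Enqueue is cheap, so the accept queue never backs up.
              w.WriteHeader(http.StatusAccepted)
          default:
              // Queue full: shed load explicitly instead of letting
              // the kernel reset connections.
              http.Error(w, "queue full, retry later", http.StatusTooManyRequests)
          }
      })

      log.Fatal(http.ListenAndServe(":8080", nil))
  }

Returning 429 when the queue is full turns an opaque TCP reset into an explicit signal that well-behaved senders will back off from.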

Both of these are good practice anyway. In this case, either one would have turned a 28 Gbps incident into a non-event.

— Tuomas