One challenge I tackled in late 2013 was scaling a job processing pipeline under unpredictable spikes.
What worked:
- Introduced multi-priority queues using durable brokers
- Batched writes to avoid write amplification
- Monitored latency percentiles (not just averages)
We also implemented exponential backoff for retries—concepts later popularized in Google SRE practices.
Lesson: async design isn't just about throughput—it’s about predictability and graceful degradation.