In early 2014, I began a focused push to make all critical services idempotent by design. Up until then, some retry mechanisms were "best effort."
But distributed systems don't forgive ambiguity. If you can't guarantee side-effect safety on retries, you don't have resilience.
What helped:
- Using client-provided or auto-generated request IDs
- Keeping state transitions explicit and tracked
- Pairing with distributed tracing to validate call integrity
You don’t need idempotency everywhere—just everywhere that matters under failure.