Lessons from Taming Clock Skew in Distributed Jobs

Author: Shiv Bade
Tags: clock skew, distributed computing, scheduling, retries
Published: November 10, 2013
We ran into a strange issue: scheduled jobs running "twice" or not at all across nodes. The culprit? Clock skew.
In distributed systems, time is an illusion, but you still need some notion of coordination. What helped:
- Using ntpd and tighter clock sync configurations
- Avoiding “run at exactly X” semantics
- Designing retries with logical idempotency
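
To make the last two points concrete, here is a minimal sketch of one way to do it, not the exact code we ran. Instead of firing "at exactly X", each node maps its local time to a coarse logical slot and tries to claim that slot before running. A second attempt, whether from a skewed node or a retry, then becomes a no-op. The names `SlotStore`, `slot_key`, and `run_once_per_slot` are made up for illustration, and the in-memory store is a stand-in for whatever shared store with an atomic set-if-absent you actually have.

```python
import threading


class SlotStore:
    """In-memory stand-in for a shared store with atomic set-if-absent.

    Assumption: in a real cluster this would be a database row, a lock
    service, or similar. It is hypothetical here.
    """

    def __init__(self):
        self._claimed = {}
        self._lock = threading.Lock()

    def claim(self, key, owner):
        """Return True only for the first caller to claim `key`."""
        with self._lock:
            if key in self._claimed:
                return False
            self._claimed[key] = owner
            return True


def slot_key(job_name, now_epoch, interval_s):
    """Map a wall-clock reading to a logical slot, e.g. 'report:7200'.

    Nodes whose clocks differ by a few seconds still land in the same
    slot unless they straddle a slot boundary.
    """
    slot_start = int(now_epoch // interval_s) * interval_s
    return f"{job_name}:{slot_start}"


def run_once_per_slot(store, job_name, node, now_epoch, interval_s, job):
    """Run `job` only if this node is first to claim the current slot."""
    key = slot_key(job_name, now_epoch, interval_s)
    if not store.claim(key, node):
        return False  # another node (or an earlier retry) already ran it
    job()
    return True


if __name__ == "__main__":
    store = SlotStore()
    runs = []
    # Two nodes whose clocks disagree by 2.5 seconds, same hourly slot.
    run_once_per_slot(store, "report", "node-a", 7201.0, 3600,
                      lambda: runs.append("node-a"))
    run_once_per_slot(store, "report", "node-b", 7203.5, 3600,
                      lambda: runs.append("node-b"))
    print(runs)
```

Because each slot can be claimed only once, a node that straddles a slot boundary cannot cause a duplicate run for the same slot. This sketch does not handle a job that fails after claiming its slot. A real version would need a way to release or expire the claim so a retry can pick it up.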
These bugs are rare, but they teach you humility. Distributed systems are always one leap second away from surprising you.