IoT pipelines that don't break on the third Tuesday

The first Tuesday is fine. The second Tuesday is fine. The third Tuesday is when you find out which assumption was load-bearing.

Most IoT pipelines look great on a demo dataset. Real telemetry breaks them in ways that are surprisingly hard to reproduce in a lab — clock drift, message ordering, partial outages of a single gateway, the firmware update that ships at 02:00 local time.

What we instrument before anything else

Three lines of telemetry that pay for themselves the first time something goes wrong (sketched in code after the list):

  • A heartbeat per device, regardless of business payload
  • A monotonic sequence number, so you can detect gaps instead of inferring them from arrival times
  • A receive-side timestamp distinct from the device-side timestamp
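
A minimal sketch of what an envelope carrying all three might look like. The field names (device_ts, received_ts, and so on) are ours, not any standard's:

```python
import time

# Hypothetical telemetry envelope carrying all three signals.
def make_envelope(device_id: str, sequence: int, payload: dict | None) -> dict:
    return {
        "device_id": device_id,
        "sequence": sequence,       # monotonic per device: a gap is a fact, not a guess
        "device_ts": time.time(),   # stamped by the device clock, which may drift
        "payload": payload,         # None is fine: the heartbeat still goes out
    }

# Receive side: stamp arrival separately. Never overwrite device_ts.
def on_receive(envelope: dict) -> dict:
    envelope["received_ts"] = time.time()
    return envelope
```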

The third one matters most. The day a sensor’s clock drifts by ninety minutes, you want to know whether the lateness is on the wire or in the device.
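Once both timestamps exist, telling the two apart is tractable. A sketch of one heuristic, assuming the envelope above; the function name and thresholds are ours: a skew that is large but stable across messages points at the device clock, while a skew that suddenly jumps points at the wire.

```python
from statistics import median

def classify_lateness(device_ts: float, received_ts: float,
                      recent_skews: list[float],
                      jump_threshold_s: float = 5.0,
                      drift_threshold_s: float = 60.0) -> str:
    """Rough heuristic, not a protocol. A steady large offset looks like
    clock drift; a sudden change in offset looks like transport delay."""
    skew = received_ts - device_ts
    verdict = "ok"
    if recent_skews and abs(skew - median(recent_skews)) > jump_threshold_s:
        verdict = "wire"      # offset jumped: queueing or network delay
    elif abs(skew) > drift_threshold_s:
        verdict = "device"    # offset steady but large: the clock is off
    recent_skews.append(skew)
    return verdict
```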

“We thought it was the network. It was always the firmware.”

Pipelines that “don’t break” are usually pipelines that fail visibly and recover automatically. The brittle ones are the ones that quietly drop two percent of messages and call it a rounding error.

Three patterns that survive contact with reality

We default to these unless there’s a specific reason not to (sketches follow the list):

  • Idempotent writes keyed on (device_id, sequence) — the fix for nearly every duplicate-event panic.
  • A dead-letter store with the original payload intact — the only way to learn from a failure you can’t reproduce.
  • Slow alerts before fast alerts — paging on a five-minute trend is calmer than paging on every spike.
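
A sketch of the first two patterns in one ingest path, using SQLite purely for illustration; the table layout and names are ours:

```python
import json
import sqlite3

conn = sqlite3.connect("telemetry.db")  # hypothetical store, for illustration
conn.executescript("""
CREATE TABLE IF NOT EXISTS events (
    device_id TEXT    NOT NULL,
    sequence  INTEGER NOT NULL,
    body      TEXT    NOT NULL,
    PRIMARY KEY (device_id, sequence)  -- redelivery becomes a no-op, not a double count
);
CREATE TABLE IF NOT EXISTS dead_letters (
    raw   TEXT NOT NULL,               -- the original payload, byte for byte
    error TEXT NOT NULL
);
""")

def ingest(raw: str) -> None:
    try:
        msg = json.loads(raw)
        # INSERT OR IGNORE is what makes the write idempotent on (device_id, sequence).
        conn.execute(
            "INSERT OR IGNORE INTO events (device_id, sequence, body) VALUES (?, ?, ?)",
            (msg["device_id"], msg["sequence"], raw),
        )
        conn.commit()
    except (json.JSONDecodeError, KeyError, TypeError, sqlite3.Error) as exc:
        # Dead-letter the payload untouched: the failure you can't reproduce lives here.
        conn.execute("INSERT INTO dead_letters (raw, error) VALUES (?, ?)",
                     (raw, repr(exc)))
        conn.commit()
```

The dead-letter rows keep the raw bytes rather than the parsed form, so a deserialization bug is still inspectable after the fact.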
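
And for the third, a sketch of a trend-based alert. The five-minute window and two-percent threshold echo the numbers above but are otherwise placeholders:

```python
import time
from collections import deque

class TrendAlert:
    """Pages only when the error rate stays high across the whole window,
    never on a single bad sample. Window and threshold are placeholders."""
    def __init__(self, window_s: float = 300.0, threshold: float = 0.02):
        self.window_s = window_s
        self.threshold = threshold
        self.samples = deque()  # (timestamp, error_rate)

    def observe(self, error_rate: float, now: float | None = None) -> bool:
        now = time.time() if now is None else now
        self.samples.append((now, error_rate))
        # Drop anything older than the window.
        while self.samples and self.samples[0][0] < now - self.window_s:
            self.samples.popleft()
        # Don't page until the window has actually filled up.
        window_covered = now - self.samples[0][0] >= self.window_s * 0.9
        return window_covered and all(r > self.threshold for _, r in self.samples)
```

Feed it one error-rate sample per scrape; a lone spike ages out of the window before it can page anyone.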

None of this is novel. All of it is the kind of thing you remember to add after the third Tuesday, but it costs almost nothing to put in on day one.


Filed under Engineering
