The out-of-order data landing problem

Applying window functions over data is non-trivial if the data arrives out of order (with respect to the dimension the window function is applied across). For clarity, let's take time as the window dimension and use timeseries data as our example. If timeseries data arrives for Tuesday through Thursday of a week, and data for Monday of that week arrives at a later time, the data has arrived out of order.

Photo by Ricardo Gomez Angel on Unsplash

Because a window function's output depends on the surrounding rows along the time dimension, newly landed out-of-order data alters results that were already computed. All affected data needs to be reprocessed.
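To make the effect concrete, here is a minimal sketch in plain Python. It is not taken from the article: the two-day rolling sum, the helper names `rolling_sum` and `affected_days`, and the sample values are all assumptions chosen for illustration. It shows that a late Monday changes Tuesday's result, and how the set of days to reprocess could be derived from the window size.

```python
from datetime import date, timedelta


def rolling_sum(values_by_day, window_days=2):
    """Sum each day's value with the values of the preceding window_days - 1 days."""
    result = {}
    for day in sorted(values_by_day):
        total = 0
        for offset in range(window_days):
            # Days with no data yet contribute 0 (an assumption of this sketch).
            total += values_by_day.get(day - timedelta(days=offset), 0)
        result[day] = total
    return result


def affected_days(late_day, known_days, window_days=2):
    """Days whose window contains late_day, so whose output must be recomputed."""
    return sorted(
        d for d in known_days
        if timedelta(0) <= d - late_day < timedelta(days=window_days)
    )


# Hypothetical values: Tuesday to Thursday land first.
mon, tue, wed, thu = (date(2021, 1, 4) + timedelta(days=i) for i in range(4))
data = {tue: 10, wed: 20, thu: 30}
before = rolling_sum(data)

# Monday lands late.
data[mon] = 5
after = rolling_sum(data)

changed = [d for d in before if before[d] != after[d]]
to_reprocess = affected_days(mon, data)
```

With a two-day window, only Tuesday's previously computed value changes, while Monday itself is new; a wider window would widen the range of days that need reprocessing accordingly.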

You could…

Hands-on Tutorials

The problem

I was recently presented with the challenge of joining two timeseries datasets on their timestamps without requiring the corresponding data from each dataset to arrive at the same time. For example, data for one day last month from one dataset may have landed on S3 a week ago, while the corresponding data for that day from the other dataset may have landed yesterday. This is an incremental join problem.
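The shape of the problem can be sketched in a few lines of plain Python. This is only an illustration and not the approach the article goes on to describe: the `IncrementalJoiner` class, the in-memory dictionaries standing in for stored data, and the assumption of one row per timestamp per dataset are all hypothetical.

```python
class IncrementalJoiner:
    """Joins two datasets on timestamp as batches land at different times."""

    def __init__(self):
        # Rows seen so far from each dataset, keyed by timestamp.
        self.left = {}
        self.right = {}

    def add(self, side, batch):
        """Store a newly landed batch and return the rows it completes."""
        if side not in ("left", "right"):
            raise ValueError("side must be 'left' or 'right'")
        own, other = (
            (self.left, self.right) if side == "left" else (self.right, self.left)
        )
        joined = []
        for ts, value in batch.items():
            own[ts] = value
            if ts in other:
                # The matching row from the other dataset has already landed.
                pair = (value, other[ts]) if side == "left" else (other[ts], value)
                joined.append((ts, *pair))
        return sorted(joined)


joiner = IncrementalJoiner()
first = joiner.add("left", {"2021-01-10": "a", "2021-01-11": "b"})
second = joiner.add("right", {"2021-01-10": "x"})
third = joiner.add("right", {"2021-01-11": "y"})
```

Each call only emits rows that the new batch makes joinable, so no batch has to wait for its counterpart and nothing already joined is recomputed.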

Photo by Shane Rounce on Unsplash

Potential solutions

  • It may be possible to get around this problem by holding off on joining the data until it is queried; however, I wanted to pre-process the…

Hamish Lamotte

Data scientist and data applications architect.
