 
              Trill 102: Temporal Joins
 
              WRITTEN BY
/en-us/opensource/blog/author/james-terwilliger
This post is the second in a sequence intended to introduce developers to the Trill streaming query engine, its programming model, and its capabilities. We introduced in the previous post the concept of snapshot semantics for temporal query processing. Here, we go deeper into the mechanics of snapshot semantics by showing its impact on one of the most basic relational operations in query processing, the Join.
The join operation and how it is handled is one of the key differences between a standard relational query engine and one that operates on temporal data. What’s more, how an engine accommodates a notion of time in its query language speaks volumes about how that system thinks about time management. So here, we take the time to demonstrate Trill’s take on joining data.
A join is, of course, a binary operation, taking two inputs and producing a single output stream. Snapshot semantics tells us that we want to be looking at join results at each point in time. To see this operation in action graphically, consider the following simple pair of streams with events we want to join:
 
               
              Points, points everywhere
Of course, a lot of naturally temporal data does not appear as intervals but rather as points. A sensor reading, for instance, comes into the system with a reading taken at a single time, say 10:15am. Whether that reading should be valid until the next reading, or for a certain period, or some other heuristically-driven policy is a matter for the query designer. But before any lifetime alterations are made in the query, the input data to a join would be just points, looking like this:
 
              A common join operation over point events such as these might look something like, “join the left and right data where the points are within 10 minutes of each other.” Indeed, for engines whose query and lifetime operations are not separate, a join must be specified in that form, with an explicit temporal condition.
But as with all things Trill, we need to think of data as intervals. We will need to adjust the lifetime of these points to bring about the join result that we want. To do so, we need to answer two questions:
- Which points on the left and right should match each other?
- What lifetime do we expect the result to have?
To answer both simultaneously, we need to reformulate the initial question. “Join left and right data within 10 minutes” alone only addresses the first point.
Here is one way to reformulate the question: “Return all data from the left, matched with any data from the right within 10 minutes.” The asymmetric nature of this new question gives us the lifetime information that we want – we’re using the data on the left as a pivot, so we want to preserve whatever the lifetime information was in the left data.
To effect the correct result for this question, we need to take the data on the right and adjust its lifetime to include the plus/minus 10 minute time window. Because we use the left data as the pivot, we leave it alone. The resulting lifespans, along with the join query result, look like this:
 
               
              Time is of the essence
There is yet another way to formulate the question, which also happens to be the default way that temporal joins over points works in many systems: “Return data from the left and right that occur within 10 minutes of each other, as quickly as possible with whatever timestamp is convenient.” That sentence is a mouthful, but its intent is clear: What’s important is finding whatever data matches, and the resulting lifetimes are irrelevant so long as latency is minimal.
The problem with latency in the left-pivot and right-pivot cases is that the width of the lifetime extensions may impact latency. Intuitively, the wider one expands a lifetime in a join input, the more latency as the system must wait longer for potential matches. There is a way to change the event lifetimes to get the same results but with smaller overall time windows, as follows:
 
              In this diagram, instead of extending one side’s events by 10 minutes in both directions, we extend events on both sides by 10 minutes in one direction. We get the same two results thanks to the overlap in time intervals. The lifespans of the result events are… complicated, roughly belonging to whichever of the joined events happened second.
What’s the takeaway here? One big advantage of having the join operation use snapshot semantics is that there is a wide variety of possibilities for returning results that allow the query author control over the result. The temporal join offered by other streaming systems is possible, but so are other variants that may be more semantically correct if desired.
Like siblings on a car trip: I’m only going if you’re not
With this understanding of join in hand, we can take our newfound knowledge and apply it to other binary operations as well. What if we wanted to take an anti-semijoin of our two data streams instead?
 
               
               
              Are we there yet?
Between this post and the previous one, we have covered a fair amount of Trill’s basic query surface. We still need to show a little more about how aggregations work, as well as Trill’s support for automata and regular expressions. But before that, one big piece remaining is describing how to get data into and out of Trill, so our next tutorial post will cover ingress policies and temporal order. Until then, as always, please feel free to look at our code, skim through our usage samples, and let us know if there’s something you’d like to see more detail on.
