Friday, January 16, 2015

Why You Should Blend Instead of Join

Even to an experienced SQL query writer, the mechanics of blending data, rather than joining it, can be difficult to understand upon initial exposure. Data blending is a fairly advanced concept; this article explains the concept of data blending at a basic level using Excel as an example.

Data blending in Tableau is the operation of combining multiple data sources into the same view by finding common fields between them to join on.

How is this different from a regular join?

Let’s first describe the equivalent of a data blend using an example with Excel spreadsheets.

Suppose you have two different Excel spreadsheets, the first sheet contains data on sales, the second sheet contains quota for sales. If you wanted to do a side by side comparison of sales vs. sales quota by country, you would likely first aggregate the sales data to be summed by country, then perform a vlookup to the sales quota sheet on each country to grab the quota for that country and place the result next to the sales for that country in the aggregated table.

This concept is referred to as a post-aggregate join, as the vlookup to the sales quota spreadsheet did not occur until after the sales data was already aggregated by country. Therefore, the “join” to the second data source was done post-aggregation. This process of joining data sources post-aggregation is referred to as data blending in Tableau.

Had we performed a vlookup on the sales data before aggregating by country, we would likely have multiple sales records for each country with the quota next to each record. In order to get the desired comparison, we would have to aggregate the data by summing the sales column by country, and grabbing the max or min sales quota by country. This process is understandably more work, as two different aggregations have to occur, and much more data ends up being processed at once because the vlookup occurred before the aggregation.

Creating this in Tableau would mean having your sales data in one data source, sales quota data in another data source, creating an aggregated view of the sum of sales by country, and simply dragging quota onto the visualization next to the sales column. Tableau will automatically recognize the common field of Country between the two data sources, and use that to do a post-aggregate join. Tableau does not join the two data sources until after the data is aggregated. Given that the desired visualization aggregates the primary data source, the post-aggregate join is much better for performance than joining at the row-level first and then performing the aggregate calculations.

2 comments:

  1. Good write-up! The last paragraph really helps explain what blending is doing.

    ReplyDelete