Today we’re going to explore the different kinds of joins that can be used in SQL, as well as when each of those join types would be useful to us. If you’re new to the concept of joins in general we’ll quickly go over why joins are useful in the first place, but feel free to skip ahead if you’re already familiar with the concept. We’ll dive into the SQL join examples after we’ve had a brief general introduction to the basics.
I’m assuming for this post that you have some experience with database concepts like tables, rows, columns, and simple select statements. If you don’t, I suggest checking out Helen Anderson’s excellent Relational Database Fundamentals article, as well as the rest of her SQL 101 posts.
What is a SQL join query, and when would I use one?
An SQL join query will be needed any time you want to relate the rows from two or more tables together. Let’s work through an example so that we can begin thinking in practical terms.
We’ll look at an imaginary scenario involving medications and drug distributors. Let’s pretend that we’re responsible for producing a sales report summarizing the market performance of each drug over a certain time period.
First I’ll outline the structure and content of each table in this scenario.
A 3-digit code identifies products in our example. The price is a base cost that can be offset by rebates.
Individual sales and their quantities over time.
Rebates can be redeemed later and act as a partial refund to the purchaser. A rebate must be tied back to a sale through the drug code and date of sale.
There are several questions we might have to answer as an analyst at our fictional company.
- How much revenue do we have, broken out by drug?
- How much revenue is each distributor bringing in?
- What are the lowest and highest prices paid, after rebates have been applied?
We will need to relate our tables together if we are going to answer any of these questions. Let’s take a look at how we could answer the first question using joins.
To calculate revenue, we’ll need to aggregate the sales data. Looking at the sales table, however, you can see that we have the quantity sold in each transaction, but not the price at which it was sold. The products table contains the price, so we’ll need to relate the two tables together. How would you do this without a computer? You would go row by row, taking the Code column from sales and matching it against the Code column in the products table.
That’s basically what a SQL database does while executing a join. Let’s write the SQL that tells our database how to do the same.
SQL joins consist of a what and a how: What tables are we going to relate together, and how do I do that? The table written after the JOIN keyword is what we’re relating to the sales table, and the condition after the ON keyword tells the database what must be true in order for any given pair of rows to be related.
Let’s take a look at the resulting table after running this query.
Now that we’ve related the sales and products tables together, we have all of the necessary pieces to answer the original question, but we still need to go a step further. We need to aggregate by drug code, and sum the price across those drug codes.
This query brings us to the answer for our first question.
What are the different types of join?
There are several different kinds of join, but fortunately for us, only a few of them make up the vast majority of use cases. Let’s get a high level overview of the different join types before diving into examples for each one. I’ll list them roughly in order of how frequently you might find a use for each.
|Type||Equivalent To||What Does It Do?|
|JOIN||INNER JOIN||Result contains rows where there is a match between both tables. Rows without a match are omitted.|
|LEFT JOIN||LEFT OUTER JOIN||Result contains all rows from the left side of the join. NULL values appear where there was no match on the right side.|
|RIGHT JOIN||RIGHT OUTER JOIN||Result contains all rows from the right side of the join. NULL values appear where there was no match on the left side.|
|FULL JOIN||FULL OUTER JOIN||Result contains all rows. NULL values appear where there was no match.|
|CROSS JOIN||Result contains all combinations of the left and right side tables (Cartesian product).|
INNER JOIN or simply
JOIN is what we started with in the example at the beginning of this post. The
INNER part is optional and does not change the meaning of the query in any way. An inner join is going to return only those rows that satisfy the condition in the
ON clause. If we refer to the join part of the original example query, we can see that this means our result data will contain rows only where the product code in the sales table has a corresponding entry in the products table.
In our original example, every row from sales has a corresponding row in the products table. What would happen if we had a sale record with a product code that’s not in our products table?
In that case, the extra row from the sales table would not appear at all in our results. This type of join is useful when you would like to omit any rows where there is no match between tables. This behavior is the opposite of
OUTER joins (i.e.
FULL join types).
LEFT OUTER JOIN or simply
LEFT JOIN guarantees that every row from the left hand table will appear in the output. The
OUTER part is optional and does not change the meaning of the query in any way.
In this case, if we have a record in the sales table whose product code does not exist in the products table, our output will still contain that row. Because there was no matching product, what would the product columns contain? Let’s look at that row of the output specifically.
NULL value is special in SQL and denotes the absence of a value. The
OUTER family of join types will use
NULL to populate columns for rows where a match was not found for the left or right side.
We can use the scenario introduced at the beginning as a practical example of when a
LEFT JOIN would be needed. In our first example, we have a rebates table that should offset the prices paid in the sales table. We left out the rebates table when we wrote our initial query for analyzing revenue.
While every sale had a corresponding product, the same is not true of rebates. In fact, the rebates table has only 3 rows. If we were to use an inner join, our query would incorrectly output only 3 rows in total. Let’s revise the initial revenue query to include rebates while making use of a
LEFT JOIN in the process.
The coalesce function returns, in order from left to right, the first non-null value. This is necessary because our
LEFT JOIN is going to populate the rebate dollars column with
NULL for any sale record that has no corresponding rebate. In those cases, zero will be the first non-null value that the coalesce function encounters.
RIGHT OUTER JOIN or simply
RIGHT JOIN is similar to the
LEFT JOIN in that it ensures every row (from the right hand side in this case) is present in the output.
Looking at our previous example of an additional sales record without a matching product, in this case the extra sale would be omitted, because we are essentially starting from the right hand side (i.e. the products table). If instead of an extra sale we had an extra product without a corresponding sale, our output would appear as follows:
FULL JOIN, or equivalently
FULL OUTER JOIN, is the combination of the
LEFT JOIN and
RIGHT JOIN. In this situation, an output row can be missing either the left or right hand side of the join.
Let’s combine the previous
LEFT JOIN and
RIGHT JOIN examples so that we have both an extra sales record and an extra product. It’s important to note here that the new sale has no corresponding product, and likewise, the new product has no corresponding sale. The output would appear as follows in this case:
CROSS JOIN forms a Cartesian product, or in other words, all possible combinations of row pairs between two tables.
Note here that it doesn’t make sense to have an
ON clause, because we’re explicitly asking the database to generate all combinations, regardless of whether the rows match in any way. Taking the sales and products tables from our earlier examples and applying a
CROSS JOIN would generate 18 rows, or 6 per product, because every product is matched to every sale.
You may sometimes see a
CROSS JOIN written as an inner join with an
ON condition that always evaluates to true.
FROM clause also supports a syntax that will produce cross joins.
You might be wondering when we would need to produce all combinations like this. Referring to our original example, let’s assume that we need to organize our revenue report not only by drug code, but by quarter as well. We also want a row for every possible product and quarter, even if there were no sales for that particular combination.
Let’s assume we have a table containing the quarters that we’re interested in.
Because we may not have a sale for all possible product / quarter combinations, it makes sense to use a
CROSS JOIN to generate these combinations as a first step. We can then join to the products and sales tables with a
LEFT JOIN, while using coalesce to handle cases where a given product / quarter did not have a sale.