In this post I’ll be covering the syntax of the SQL
GROUP BY and walking through its various use cases with practical examples. I’m assuming you are familiar with basic SQL concepts such as tables, rows, columns, and simple select statements.
If you don’t, I suggest checking out Helen Anderson’s excellent Relational Database Fundamentals article, as well as the rest of her SQL 101 posts.
What is GROUP BY in SQL?
GROUP BY clause takes a list of column names and determines unique groupings within the input rows according to those column names. The output of the
GROUP BY will contain only rows having distinct values for the columns that were listed. The effect is similar to using the
DISTINCT keyword in the
SELECT part of your query.
Let’s take a look at an example.
This log table contains (fictional) requests to this site and includes the IP address, browser, referrer, requested page, and time of visit for each visitor.
We’ll get the following result after running our first
GROUP BY query.
Not incredibly useful by itself, and in fact, is exactly what we’d get if we instead used
DISTINCT as in the following example.
SQL Aggregate Functions
The SQL aggregate functions are what distinguish
GROUP BY from
DISTINCT. The count, sum, min, and max functions are all examples of common aggregate functions.
How could we answer some of the following questions with our knowledge of
GROUP BY and aggregate functions?
- How many visits have there been to each page?
- How many page visits by browser and referrer?
- Can we break down page visits by month?
Let’s use the
count function to answer the first question.
The count function has determined how many rows are present in each group.
What happens if a column does not appear in the GROUP BY clause or in an aggregate function?
Let’s take a second to answer this before moving on.
You might be wondering what would happen if we selected another column, such as IP address, in the query we just executed. Let’s try that out in PostgreSQL and see what happens.
ERROR: column "log.ip" must appear in the GROUP BY clause or be used in an aggregate function
The reason for this is fairly straightforward.
GROUP BY can either show you unique values, like Chrome and Firefox in our example, or it can perform some kind of aggregation. If we ask for a column that is neither unique (i.e. part of the
GROUP BY list of columns) nor part of an aggregation, we’re essentially forcing the database to make an arbitrary decision about which row to pick in a group.
Most SQL databases will refuse to select columns that are neither grouped nor aggregated. In earlier versions of MySQL, this was possible, but the result is arbitrary if the values vary within the group. The only_full_group_by configuration setting in MySQL 5.6 and prior defaults to false, meaning such column references are allowed. In MySQL 5.7 and later any such query will be rejected by default.
What can you do if you need those extra columns in your output?
First of all, this makes the most sense with a min or max aggregate, where there’s potentially (but not always) a single record per group with that value. The solution involves turning your initial
GROUP BY query into a subquery that’s used to join to the original table.
Let’s look at an example where we’re interested in the earliest visit to the site by browser.
If you’re a little rusty on joins or they’re new to you, take a look at SQL Join Types Explained.
GROUP BY Multiple Columns
So far we’ve only had one column in our
GROUP BY clauses. In our first example we determined how many visits each page received. Can we extend that by adding referrer? Fortunately, having multiple columns in the group list is pretty easy.
That will give us the following result.
GROUP BY Month
Up until now every reference in our group list has been to a table column. What about using computed values as part of the grouping? In this example we want to break down page visits by month. We have the
visited column, but we’ll need to transform that value into a month.
The extract function is our friend here.
PostgreSQL and MySQL allow aliases to be used in the
GROUP BY, but SQL Server does not. In SQL Server you would need to repeat the extract function call, once in the select, and again in the group clause.
SQL HAVING vs WHERE
HAVING clause is used to filter groups out of results. The
WHERE clause operates on input rows, and thus performs its filtering prior to any grouping. You’ll usually see aggregate functions such as min, max, sum, or count in the having clause. Let’s build on the prior example to add a
Here we’re interested only in those page/month groups that received more than 1 visit.