In this post I’ll be covering the syntax of the SQL GROUP BY and walking through its various use cases with practical examples. I’m assuming you are familiar with basic SQL concepts such as tables, rows, columns, and simple select statements.

If you don’t, I suggest checking out Helen Anderson’s excellent Relational Database Fundamentals article, as well as the rest of her SQL 101 posts.

What is GROUP BY in SQL?

The GROUP BY clause takes a list of column names and determines unique groupings within the input rows according to those column names. The output of the GROUP BY will contain only rows having distinct values for the columns that were listed. The effect is similar to using the DISTINCT keyword in the SELECT part of your query.

Let’s take a look at an example.

SELECT browser FROM log GROUP BY browser;
IP Browser Referrer Page Visited
49.100.122.19 Chrome Google /2019/08/10/joins.html 2019-08-29 18:06:53
50.148.126.14 Chrome Google /2019/08/01/group-by.html 2019-08-27 10:05:50
51.128.136.18 Chrome Google /2019/08/10/joins.html 2019-08-25 11:50:32
32.120.111.12 Firefox Google /2019/08/01/group-by.html 2019-08-25 09:30:16
12.178.106.10 Firefox Bing /2019/07/01/recurser.html 2019-07-26 01:20:15
This log table contains (fictional) requests to this site and includes the IP address, browser, referrer, requested page, and time of visit for each visitor.

We’ll get the following result after running our first GROUP BY query.

Browser
Chrome
Firefox

Not incredibly useful by itself, and in fact, is exactly what we’d get if we instead used DISTINCT as in the following example.

SELECT DISTINCT browser FROM log;

SQL Aggregate Functions

The SQL aggregate functions are what distinguish GROUP BY from DISTINCT. The count, sum, min, and max functions are all examples of common aggregate functions.

How could we answer some of the following questions with our knowledge of GROUP BY and aggregate functions?

  • How many visits have there been to each page?
  • How many page visits by browser and referrer?
  • Can we break down page visits by month?

Let’s use the count function to answer the first question.

SELECT browser, count(*) FROM log GROUP BY browser;
Browser Count
Chrome 4
Firefox 2

The count function has determined how many rows are present in each group.

What happens if a column does not appear in the GROUP BY clause or in an aggregate function?

Let’s take a second to answer this before moving on.

You might be wondering what would happen if we selected another column, such as IP address, in the query we just executed. Let’s try that out in PostgreSQL and see what happens.

ERROR: column "log.ip" must appear in the GROUP BY clause or be used in an aggregate function

The reason for this is fairly straightforward. GROUP BY can either show you unique values, like Chrome and Firefox in our example, or it can perform some kind of aggregation. If we ask for a column that is neither unique (i.e. part of the GROUP BY list of columns) nor part of an aggregation, we’re essentially forcing the database to make an arbitrary decision about which row to pick in a group.

Most SQL databases will refuse to select columns that are neither grouped nor aggregated. In earlier versions of MySQL, this was possible, but the result is arbitrary if the values vary within the group. The only_full_group_by configuration setting in MySQL 5.6 and prior defaults to false, meaning such column references are allowed. In MySQL 5.7 and later any such query will be rejected by default.

What can you do if you need those extra columns in your output?

First of all, this makes the most sense with a min or max aggregate, where there’s potentially (but not always) a single record per group with that value. The solution involves turning your initial GROUP BY query into a subquery that’s used to join to the original table.

Let’s look at an example where we’re interested in the earliest visit to the site by browser.

SELECT * FROM log JOIN (
SELECT browser, min(visited) AS min_visit FROM log GROUP BY browser
) AS first_visit ON
log.browser = first_visit.browser AND
log.visited = first_visit.min_visit;

If you’re a little rusty on joins or they’re new to you, take a look at SQL Join Types Explained.

GROUP BY Multiple Columns

So far we’ve only had one column in our GROUP BY clauses. In our first example we determined how many visits each page received. Can we extend that by adding referrer? Fortunately, having multiple columns in the group list is pretty easy.

SELECT browser, referrer, count(*) FROM log GROUP BY browser, referrer;

That will give us the following result.

Browser Referrer Count
Firefox Google 1
Chrome Google 4
Firefox Bing 1

GROUP BY Month

Up until now every reference in our group list has been to a table column. What about using computed values as part of the grouping? In this example we want to break down page visits by month. We have the visited column, but we’ll need to transform that value into a month.

The extract function is our friend here.

SELECT page, extract(month from visited) as month_visited, count(*) FROM log 
GROUP BY page, month_visited;
Page Month Visited Count
/2019/08/10/joins.html 8 2
/2019/08/01/group-by.html 8 2
/2019/07/01/recurser.html 7 1

PostgreSQL and MySQL allow aliases to be used in the GROUP BY, but SQL Server does not. In SQL Server you would need to repeat the extract function call, once in the select, and again in the group clause.

SQL HAVING vs WHERE

The HAVING clause is used to filter groups out of results. The WHERE clause operates on input rows, and thus performs its filtering prior to any grouping. You’ll usually see aggregate functions such as min, max, sum, or count in the having clause. Let’s build on the prior example to add a HAVING clause.

SELECT page, extract(month from visited) as month_visited, count(*) FROM log 
GROUP BY page, month_visited
HAVING count(*) > 1;

Here we’re interested only in those page/month groups that received more than 1 visit.

Page Month Visited Count
/2019/08/10/joins.html 8 2
/2019/08/01/group-by.html 8 2