SQL tutorial
SQL Group By
Learn more about SQL, a standard language for interacting with databases and storing, manipulating, and retrieving data from databases.
Go hands-on with SQL in our free interactive SQL tutorial.
While WHERE
is one of the most useful and frequently used clauses in SQL as it can be applied to filter data, the SQL GROUP BY
clause is the basis for performing aggregation, unlocking the power of data analytics in SQL.
The SQL GROUP BY
clause modifies the behavior of a SELECT
statement and aggregates data over the column or columns specified. This results in the data being returned in a summarized form, and having equal to or less than the number of rows from the source.
In a simple query, the GROUP BY
clause follows the table name, with the column values to aggregate over afterward:
SELECT <expression> FROM <table> GROUP BY <columns>;
In a more complex query with other clauses, GROUP BY
follows the table name and WHERE
, but precedes ORDER BY
and LIMIT
:
SELECT <expression> FROM <table> WHERE <predicate>
GROUP BY <columns>
ORDER BY <column> LIMIT <row count>;
The GROUP BY
clause is almost always used with aggregate functions. Since the data are being returned on a different row-level than that of the source, these functions are needed to perform summary calculations on another column or columns that do not form the basis for aggregation.
One of the most commonly executed queries by an analyst is the “count star” query, which uses the COUNT
function to answer the question – how many rows are in this table?
Let’s take a look at this type of query for the employees
table:
SELECT COUNT(*) FROM employees;
This returns the result which is the number of rows – 300,024. However, we still have not used the GROUP BY
clause in the above.
Let’s say instead, we wanted to answer a specific business question about the ~300K employees: for example, what is the number of employees by gender?
To answer this question and count the number of rows appearing for each value occurring in the gender column, we add the GROUP BY
clause and the column name, and SQL will return the aggregated result:
SELECT gender, COUNT(*) FROM employees GROUP BY gender;
gender | COUNT(*) |
M | 179973 |
F | 120051 |
We can see that there are less females in the company’s employ with ~120K vs. the nearly ~180K male employees. Though only two rows were returned here, we also see that the sum of the two counts adds up to the total number of rows in the table, as 179,973 + 120,051 = 300,024.
GROUP BY
can also be used to perform aggregations over multiple columns, and calculations for all combinations of the distinct values appearing in each. In this case the column names following the GROUP BY
clause are separated by commas. For example, we could also calculate the number of employees by gender and by last name:
SELECT last_name, gender, COUNT(*) FROM employees GROUP BY last_name, gender;
Note that for queries using aggregation, column names that appear in the SELECT
statement must also appear following the GROUP BY
clause. Though the converse is not true, omitting the columns which are aggregated over from being displayed can make it difficult to ascertain which rows of the result correspond to which values of the aggregated columns.
As a general best practice for interpretability, when writing queries using the GROUP BY
clause, all columns which appear after it should also after SELECT
and vice-versa.
Learn SQL Today
Get hands-on experience writing code with interactive tutorials in our free online learning platform.
- Free and fun
- Designed for beginners
- No downloads or setup required