WHERE is one of the most useful and frequently used clauses in SQL as it can be applied to filter data, the SQL
GROUP BY clause is the basis for performing aggregation, unlocking the power of data analytics in SQL.
GROUP BY clause modifies the behavior of a
SELECT statement and aggregates data over the column or columns specified. This results in the data being returned in a summarized form, and having equal to or less than the number of rows from the source.
In a simple query, the
GROUP BY clause follows the table name, with the column values to aggregate over afterward:
SELECT <expression> FROM <table> GROUP BY <columns>;
In a more complex query with other clauses,
GROUP BY follows the table name and
WHERE, but precedes
ORDER BY and
SELECT <expression> FROM <table> WHERE <predicate> GROUP BY <columns> ORDER BY <column> LIMIT <row count>;
GROUP BY clause is almost always used with aggregate functions. Since the data are being returned on a different row-level than that of the source, these functions are needed to perform summary calculations on another column or columns that do not form the basis for aggregation.
One of the most commonly executed queries by an analyst is the “count star” query, which uses the
COUNT function to answer the question – how many rows are in this table?
Let’s take a look at this type of query for the
SELECT COUNT(*) FROM employees;
This returns the result which is the number of rows – 300,024. However, we still have not used the
GROUP BY clause in the above.
Let’s say instead, we wanted to answer a specific business question about the ~300K employees: for example, what is the number of employees by gender?
To answer this question and count the number of rows appearing for each value occurring in the gender column, we add the
GROUP BY clause and the column name, and SQL will return the aggregated result:
SELECT gender, COUNT(*) FROM employees GROUP BY gender;
We can see that there are less females in the company’s employ with ~120K vs. the nearly ~180K male employees. Though only two rows were returned here, we also see that the sum of the two counts adds up to the total number of rows in the table, as 179,973 + 120,051 = 300,024.
GROUP BY can also be used to perform aggregations over multiple columns, and calculations for all combinations of the distinct values appearing in each. In this case the column names following the
GROUP BY clause are separated by commas. For example, we could also calculate the number of employees by gender and by last name:
SELECT last_name, gender, COUNT(*) FROM employees GROUP BY last_name, gender;
Note that for queries using aggregation, column names that appear in the
SELECT statement must also appear following the
GROUP BY clause. Though the converse is not true, omitting the columns which are aggregated over from being displayed can make it difficult to ascertain which rows of the result correspond to which values of the aggregated columns.
As a general best practice for interpretability, when writing queries using the
GROUP BY clause, all columns which appear after it should also after
SELECT and vice-versa.
Learn SQL Today
Get hands-on experience writing code with interactive tutorials in our free online learning platform.
- Free and fun
- Designed for beginners
- No downloads or setup required