How to find the mode in SQL, aka MAX(COUNT(*)) not working

In statistics, the mode is the function that returns the element that appear most often in a series. This is what people try to achieve when they run MAX(COUNT(*))… and they find out that it doesn’t work. Let’s see how to obtain the mode in SQL.

Let’s use an example, as usual. Suppose we need to find which department in a company has most employees.

SELECT department_id, COUNT(*) AS how_many
    FROM employee
    GROUP BY department_id
    ORDER BY how_many DESC
    LIMIT 1;

What we’re doing here is to get the unique least of departments, get the number of employees for each of them, sort the rows by the (ascending) number of employees, and only take the first of them.

WITH TIES

However, several departments could have the same number of employees, but the above query can only return one row. If we want to get all the department with the highest number of employees, we can use this query:

SELECT department_id, COUNT(*) AS how_many
    FROM employee
    GROUP BY department_id
    ORDER BY how_many DESC
    FETCH FIRST 1 ROWS WITH TIES;

The FETCH FIRST n ROWS WITH TIES syntax means that we surely want the first n rows, but the following rows have the same values for the ORDER BY columns (in this case, the same COUNT(*)) they need to be returned as well.

Support for WITH TIES

However, not all DBMSs support this syntax. It is supported by:

  • PostgreSQL
  • MariaDB
  • SQL Server
  • Oracle

It is not supported by:

  • MySQL
  • SQLite
  • Db2
  • Redshift
  • Snowflake

Alternative methods

On DBMSs that don’t support WITH TIES, you can look at alternative methods, which include subqueries and window functions.

Get all the modes using a subquery

SELECT department_id, COUNT(*) AS how_many
    FROM employee
    GROUP BY department_id
    HAVING how_many = (
        SELECT COUNT(*) AS how_many
            FROM employee
            GROUP BY department_id
            ORDER BY how_many DESC
            LIMIT 1
    );

Here we:

  • Use the first query in this article to get the highest number of employees in a department (subquery in HAVING).
  • Count the number of employees per department again.
  • From this statistics, we only take the rows where the number of employees is equal to the highest number of employees.

Get all the modes with the window function method

While the above method works, the syntax is a bit too verbose and it can be slow. We can use a window function instead.

SELECT department_id, how_many AS how_many
    FROM (
        SELECT
                department_id,
                COUNT(*) AS how_many,
                RANK() OVER (ORDER BY COUNT(*) DESC) AS department_rank
            FROM employee
            GROUP BY department_id
    ) s
    WHERE department_rank = 1;

What we do is:

  • Group employees by department, count the employees for each department.
  • Assign a rank to each row, starting from the highest count (ORDER BY COUNT(*) DESC).
    • Note that rows where COUNT(*) is the same, the rank will also be the same.
  • From the results of this subquery, select the rows with the lower rank (highest count).

Why MAX(COUNT(*)) won’t work

SELECT MAX(COUNT(*)) FROM table GROUP BY column won’t work on any DBMS.

The reason is that aggregate functions (like MAX() and COUNT()) accept any number of rows as input and return one value as output. Input rows are each group’s rows if GROUP BY is used, or all table rows otherwise.

In this case, COUNT() will get each group’s rows and return a value for the group. You expect MAX() to so the same, but it’s not what you’re writing: you’re passing it the result of an aggregation. This operation is not possible.

Performance

All the examples require an index on department_id, or an index whose first column is department_id. Otherwise at least one full table scan (sequential read) will be performed, and for big tables this operation is slow.

To return multiple modes, the WITH TIES method is the fastest. It’s roughly as fast as the first query, that returns at most one row.

The window function method will be fast.

The subquery method can be slow, because the GROUP BY is run twice. Note that LIMIT 1 won’t make a GROUP BY any faster, because the database can’t return a correct result before reading all the relevant rows.

Reference

Related articles:

Wikipedia: