SQL Server: How to count occurrences of a substring in a string

Sometimes you want to count all occurrences of a substring into a bigger string. For example, you may want to count all occurrences of a name in a text. This can be done, but it’s not straight-forward because SQL Server doesn’t have a specific function to do it.

Counting the occurrences of a character

This is the simplest case: the substring we want to count has a length of exactly one character. We can do it this way:

SELECT
    LEN(summary) - LEN(REPLACE(summary, 'x', ''))
        AS occurrences
    FROM article

Here’s what this expression does:

  • Gets the number of characters in the superstring;
  • Deletes the x’s from the superstring and counts the characters again;
  • Returns the difference or, in other words, the number of x’s.

Counting the occurrences of a substring of any length

If we don’t know the length of the substring, the expression to use is a bit more complex:

SELECT
    (LEN(summary) - LEN(REPLACE(summary, 'France',''))) / LEN('France')
        AS occurrences
    FROM article

Here’s what it does:

  • Gets the number of characters in the superstring;
  • Deletes the occurrences of ‘France’ from the superstring and counts the characters again;
  • Returns the length difference divided by the length of ‘France’.

The length of ‘France’ is 6 characters. So, for example, if the length difference is 6 there is one occurrence, and if the length difference is 12 there are two occurrences. If there are zero occurrences the expression will still work, because zero divided by any number returns zero.

Case-sensitivity

The above expressions are case-sensitive. We can make them case-insensitive by using the LOWER() function.

Single character, case insensitive:

SELECT
    LEN(summary) - LEN(REPLACE(LOWER(summary), LOWER('x'), ''))
        AS occurrences
    FROM article

Any number of characters, case insensitive:

SELECT
    (LEN(summary) - LEN(REPLACE(LOWER(summary), LOWER('France'(,''))) / LEN('France')
        AS occurrences
    FROM article

In both cases, LOWER() is applied before REPLACE(), to make it work when. the cases don’t match. It’s not used with LEN() because the case obviously doesn’t affect a string length.

Performance

In the examples above, the expressions should be relatively fast (though, of course, it can be a problem when applied to a huge number of rows).

Nothing prevents us to use those expressions in a WHERE clause or in an ORDER BY clause, but this may heavily affect performance. The reason is that, if there is an index on the summary column, it cannot be used because the query optimizer cannot make assumptions on the results of the expressions.

If it is really necessary, this problem can be solved by building a computer column and an index on it, but this is out of the scope of this article.

Reference

More articles:

SQL Server documentation: