Find texts that contain HTML tags

How to select text values that contain HTML tags. How to disallow inserting texts that contain HTML tags.

Selecting texts that contain tags

Example: A table should contain plain texts, without formatting. But the web application that writes into the table used to have a bug, so occasionally HTML text is written into the table. Now that the bug has been fixed, we need to find rows containing HTML so that someone can fix them.

Finding any HTML tag:

SELECT id
    FROM post
    WHERE body LIKE '%<_%/>%'
    OR body LIKE '%<_%>%<_%/>%';

Finding a specific tag (in this case <table>):

SELECT id
    FROM post
    WHERE body LIKE '%<table%>%';

Disallowing tags from being inserted into a column

Example: Now that our table is free from HTML tags, we want to be sure that they cannot be inserted in the future.

If we can, we should just disallow the < or > characters. No HTML tag can be formed without those characters.

CREATE TABLE post (
    id INT NOT NULL,
    body TEXT CHECK (body NOT LIKE '%<%'),
    ...
    PRIMARY KEY (id)
);

But probably we can’t do that, because our texts should be allowed to use those characters, for example to show math expressions. In this case we’ll have to use a looser filter:

CREATE TABLE post (
    id INT NOT NULL,
    body TEXT CHECK (body NOT LIKE '%<_%>%'),
    ...
    PRIMARY KEY (id)
);