Problem
I wrote this Data Explorer query (reproduced below) some time ago that helped identify posts containing potential spelling/punctuation errors and earned me a Copy Editor badge on Anime.SE.
SELECT DISTINCT PostTypes.Name AS [Post Type],
Posts.Score AS [Post Score],
Posts.Id AS [Post Link],
Posts.LastActivityDate AS [Last Activity]
FROM Posts
INNER JOIN PostTypes
ON Posts.PostTypeId = PostTypes.Id
INNER JOIN Users
ON Posts.OwnerUserId = Users.Id
INNER JOIN PostHistory
ON Posts.Id = PostHistory.PostId
WHERE Posts.Score > 0 AND -- Low quality posts are probably not worth salvaging
Posts.ClosedDate IS NULL AND
Users.Reputation < 4000 AND -- Posts by trusted users are exempt from editing (that is the theory anyway)
(
(
Posts.PostTypeId IN (1, 3) AND
Posts.Id <> ALL (
SELECT DISTINCT PostId
FROM PostHistory
WHERE PostId = PostHistory.PostId AND
PostHistoryTypeId IN (14, 15)
GROUP BY PostId
HAVING COUNT(PostHistoryTypeId) % 2 = 1
)
) OR -- Locked posts can't be edited
(
Posts.PostTypeId IN (2, 3) AND
Posts.ParentId <> ALL (
SELECT DISTINCT PostId
FROM PostHistory
WHERE PostId = Posts.ParentId AND
PostHistoryTypeId IN (14, 15)
GROUP BY PostId
HAVING COUNT(PostHistoryTypeId) % 2 = 1
)
) OR
Posts.PostTypeId IN (4, 5)
) AND
(
(
Posts.Title NOT LIKE '%?' AND
(
LOWER(Posts.Title) LIKE 'is %' OR
LOWER(Posts.Title) LIKE 'are %' OR
LOWER(Posts.Title) LIKE 'do[en ]%' OR
LOWER(Posts.Title) LIKE '%wh[aeoy]%' OR
LOWER(Posts.Title) LIKE '%how%'
)
) OR
(
Posts.Title LIKE '[abcdefghijklmnopqrstuvwxyz]%' OR
Posts.Title LIKE '%[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]([ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrtuvwxyz]%' OR
Posts.Title LIKE '%[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrtuvwxyz])[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]%' OR
Posts.Title LIKE '%[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz],[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]%' OR
Posts.Title LIKE '% i[ '']%'
) OR
(
Posts.Body LIKE '[abcdefghijklmnopqrstuvwxyz]%' OR
Posts.Body LIKE '%[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]([ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrtuvwxyz]%' OR
Posts.Body LIKE '%[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrtuvwxyz])[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]%' OR
Posts.Body LIKE '%[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz],[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]%' OR
Posts.Body LIKE '% i[ '']%'
)
)
ORDER BY Posts.LastActivityDate DESC
Is there any way to speed this up and, more importantly, simplify the query expression? Especially the part where I list the letters of the alphabet in both upper and lower cases, because I think in SQL, [A-Za-z]
doesn’t give what I want.
Solution
Formatting
I really like my SQL queries to be formatted a certain way. For example, if I’m retrieving multiple columns I typically do something like
SELECT
col1
,col2
,col3
,col4
FROM mytable
I thought it was super ugly at first, but having each on its own line is nice for readability (I don’t like having them on the same line as the SELECT
either, same reason). Putting the comma at the front of the line seems kinda weird, but I do like that I don’t have to try and remember to add commas at the EOL if I grab more columns, which is an error I used to run into all of the time.
I like to format my JOIN
s a certain way as well, for example
SELECT *
FROM Table1 AS Alias1
JOIN Table2 AS Alias2
ON Alias1.Column1 = Alias2.Column3
I find that this makes it much easier to visually separate each table, or CTE, or sub-query, that is being joined. I also find that it makes the conditions by which they’re joined much easier to distinguish as well.
Lastly, I find that when combining multiple conditions it is a lot nicer to have the boolean logic (AND
/OR
) to be at the beginning of the line; then I just look at a single column to see how that row is combined with everything else.
SELECT * FROM myTable
WHERE (COND1
AND COND2
AND COND3)
OR COND4;
Use constants
Right now you have a bunch of magic numbers all over the place, e.g.
Posts.Score > 0
Users.Reputation < 4000
Posts.PostTypeId IN (1, 3) AND
PostHistoryTypeId IN (14, 15)
I personally really like to have constants for values where it makes sense. In this case, I’d probably rewrite those lines to look like this
DECLARE @MinPostScore int = 0
DECLARE @MaxReputation int = 4000
DECLARE @PostTypes TABLE (Value int)
INSERT INTO @PostTypes VALUES (1)
INSERT INTO @PostTypes VALUES (3)
DECLARE @PostHistoryTypes TABLE (Value int)
INSERT INTO @PostHistoryTypes VALUES (14)
INSERT INTO @PostHistoryTypes VALUES (15)
...
Posts.Score > @MinPostScore
Users.Reputation < @MaxReputation
Posts.PostTypeId IN (SELECT Value FROM @PostTypes) AND
PostHistoryTypeId IN (SELECT Value FROM @PostHistoryTypes)
You may also find it helpful to declare values used repeatedly (i.e. '[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]'
) as variables and insert/concat as needed.
Comments per clause in WHERE
A number of these clauses don’t make a lot of sense to a lay reader (mostly because I don’t know much about SEDE) – variables instead of magic numbers would help, but maybe a comment describing the query (or replacing it with a function, if performance/SEDE allows it) would help.
Character classes
According to msdn you can use character ranges with LIKE
. This would make the last few lines look like this:
Posts.Title LIKE '[a-z]%' OR
Posts.Title LIKE '%[A-Za-z]([A-Za-z]%' OR
Posts.Title LIKE '%[A-Za-z])[A-Za-z]%' OR
Posts.Title LIKE '%[A-Za-z],[A-Za-z]%' OR
Posts.Title LIKE '% i[ '']%'
You can then simplify this even more, like so
Posts.Title LIKE '[a-z]%' OR
Posts.Title LIKE '%[A-Za-z][(),][A-Za-z]%' OR
Posts.Title LIKE '% i[ '']%'
Unfortunately the columns you’re comparing use case-sensitive collation (played around using this to figure that out) so you can’t just leave out one version. SEDE is run on T-SQL, so it won’t support the iLIKE
operator either.
Short circuiting
Unfortunately, short circuiting isn’t guaranteed in T SQL, so we can’t guarantee any optimizations here, however if you carefully examine your conditions you might find ways where you can take advantage of short-circuiting if it were to occur. For example, you might choose to place the “most likely” case first, or use De Morgan’s Laws to switch up the order such that each case is mutually exclusive. YMMV.