Posts that probably need editing

Posted on

Problem

I wrote this Data Explorer query (reproduced below) some time ago that helped identify posts containing potential spelling/punctuation errors and earned me a Copy Editor badge on Anime.SE.

SELECT DISTINCT PostTypes.Name AS [Post Type],
                Posts.Score AS [Post Score],
                Posts.Id AS [Post Link],
                Posts.LastActivityDate AS [Last Activity]
FROM Posts
INNER JOIN PostTypes
ON Posts.PostTypeId = PostTypes.Id
INNER JOIN Users
ON Posts.OwnerUserId = Users.Id
INNER JOIN PostHistory
ON Posts.Id = PostHistory.PostId
WHERE Posts.Score > 0 AND -- Low quality posts are probably not worth salvaging
      Posts.ClosedDate IS NULL AND
      Users.Reputation < 4000 AND -- Posts by trusted users are exempt from editing (that is the theory anyway)
      (
          (
              Posts.PostTypeId IN (1, 3) AND
              Posts.Id <> ALL (
                  SELECT DISTINCT PostId
                  FROM PostHistory
                  WHERE PostId = PostHistory.PostId AND
                        PostHistoryTypeId IN (14, 15)
                  GROUP BY PostId
                  HAVING COUNT(PostHistoryTypeId) % 2 = 1
              )
          ) OR -- Locked posts can't be edited
          (
              Posts.PostTypeId IN (2, 3) AND
              Posts.ParentId <> ALL (
                  SELECT DISTINCT PostId
                  FROM PostHistory
                  WHERE PostId = Posts.ParentId AND
                        PostHistoryTypeId IN (14, 15)
                  GROUP BY PostId
                  HAVING COUNT(PostHistoryTypeId) % 2 = 1
              )
          ) OR
          Posts.PostTypeId IN (4, 5)
      ) AND
      (
          (
              Posts.Title NOT LIKE '%?' AND
              (
                  LOWER(Posts.Title) LIKE 'is %' OR
                  LOWER(Posts.Title) LIKE 'are %' OR
                  LOWER(Posts.Title) LIKE 'do[en ]%' OR
                  LOWER(Posts.Title) LIKE '%wh[aeoy]%' OR
                  LOWER(Posts.Title) LIKE '%how%'
              )
          ) OR
          (
              Posts.Title LIKE '[abcdefghijklmnopqrstuvwxyz]%' OR
              Posts.Title LIKE '%[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]([ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrtuvwxyz]%' OR
              Posts.Title LIKE '%[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrtuvwxyz])[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]%' OR
              Posts.Title LIKE '%[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz],[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]%' OR
              Posts.Title LIKE '% i[ '']%'
          ) OR
          (
              Posts.Body LIKE '[abcdefghijklmnopqrstuvwxyz]%' OR
              Posts.Body LIKE '%[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]([ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrtuvwxyz]%' OR
              Posts.Body LIKE '%[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrtuvwxyz])[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]%' OR
              Posts.Body LIKE '%[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz],[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]%' OR
              Posts.Body LIKE '% i[ '']%'
          )
      )
ORDER BY Posts.LastActivityDate DESC

Is there any way to speed this up and, more importantly, simplify the query expression? Especially the part where I list the letters of the alphabet in both upper and lower cases, because I think in SQL, [A-Za-z] doesn’t give what I want.

Solution

Formatting

I really like my SQL queries to be formatted a certain way. For example, if I’m retrieving multiple columns I typically do something like

SELECT
    col1
    ,col2
    ,col3
    ,col4
FROM mytable

I thought it was super ugly at first, but having each on its own line is nice for readability (I don’t like having them on the same line as the SELECT either, same reason). Putting the comma at the front of the line seems kinda weird, but I do like that I don’t have to try and remember to add commas at the EOL if I grab more columns, which is an error I used to run into all of the time.

I like to format my JOINs a certain way as well, for example

SELECT * 
FROM Table1 AS Alias1
     JOIN Table2 AS Alias2
         ON Alias1.Column1 = Alias2.Column3

I find that this makes it much easier to visually separate each table, or CTE, or sub-query, that is being joined. I also find that it makes the conditions by which they’re joined much easier to distinguish as well.

Lastly, I find that when combining multiple conditions it is a lot nicer to have the boolean logic (AND/OR) to be at the beginning of the line; then I just look at a single column to see how that row is combined with everything else.

SELECT * FROM myTable
WHERE (COND1
       AND COND2
       AND COND3)
      OR COND4;

Use constants

Right now you have a bunch of magic numbers all over the place, e.g.

Posts.Score > 0
Users.Reputation < 4000
Posts.PostTypeId IN (1, 3) AND
PostHistoryTypeId IN (14, 15)

I personally really like to have constants for values where it makes sense. In this case, I’d probably rewrite those lines to look like this

DECLARE @MinPostScore int = 0
DECLARE @MaxReputation int = 4000
DECLARE @PostTypes TABLE (Value int)
INSERT INTO @PostTypes VALUES (1)
INSERT INTO @PostTypes VALUES (3)
DECLARE @PostHistoryTypes TABLE (Value int)
INSERT INTO @PostHistoryTypes VALUES (14)
INSERT INTO @PostHistoryTypes VALUES (15)

...

Posts.Score > @MinPostScore
Users.Reputation < @MaxReputation
Posts.PostTypeId IN (SELECT Value FROM @PostTypes) AND
PostHistoryTypeId IN (SELECT Value FROM @PostHistoryTypes)

You may also find it helpful to declare values used repeatedly (i.e. '[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]') as variables and insert/concat as needed.


Comments per clause in WHERE

A number of these clauses don’t make a lot of sense to a lay reader (mostly because I don’t know much about SEDE) – variables instead of magic numbers would help, but maybe a comment describing the query (or replacing it with a function, if performance/SEDE allows it) would help.


Character classes

According to msdn you can use character ranges with LIKE. This would make the last few lines look like this:

Posts.Title LIKE '[a-z]%' OR
Posts.Title LIKE '%[A-Za-z]([A-Za-z]%' OR
Posts.Title LIKE '%[A-Za-z])[A-Za-z]%' OR
Posts.Title LIKE '%[A-Za-z],[A-Za-z]%' OR
Posts.Title LIKE '% i[ '']%'

You can then simplify this even more, like so

Posts.Title LIKE '[a-z]%' OR
Posts.Title LIKE '%[A-Za-z][(),][A-Za-z]%' OR
Posts.Title LIKE '% i[ '']%'

Unfortunately the columns you’re comparing use case-sensitive collation (played around using this to figure that out) so you can’t just leave out one version. SEDE is run on T-SQL, so it won’t support the iLIKE operator either.


Short circuiting

Unfortunately, short circuiting isn’t guaranteed in T SQL, so we can’t guarantee any optimizations here, however if you carefully examine your conditions you might find ways where you can take advantage of short-circuiting if it were to occur. For example, you might choose to place the “most likely” case first, or use De Morgan’s Laws to switch up the order such that each case is mutually exclusive. YMMV.

Leave a Reply

Your email address will not be published. Required fields are marked *