SQL Antipatterns: Avoiding the Pitfalls of Database Programming (Pragmatic Programmers)

Category: Programming
Author: Bill Karwin
4.6
All Stack Overflow 30
This Month Stack Overflow 18

Comments

by anonymous   2019-07-21

To generate a distinct list of "numbers" from all of your strings:

SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(t.id_numbers,',',d.i+1),',',-1) AS n
  FROM mytable t
  JOIN ( SELECT 0 AS i
         UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
         UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
         UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
       ) d
    ON d.i <= ( CHAR_LENGTH(t.id_numbers) 
              - CHAR_LENGTH(REPLACE(t.id_numbers,',','')) )
 GROUP BY n
 HAVING n <> ''

To put that back into a comma separated list (ARRGGHH! WHY do we want to that?), we can use the query above as an inline view...

SELECT GROUP_CONCAT(q.n ORDER BY q.n+0) AS id_numbers
  FROM ( SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(t.id_numbers,',',d.i+1),',',-1) AS n
           FROM mytable t
           JOIN ( SELECT 0 AS i 
                  UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
                  UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
                  UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
                 ) d
              ON d.i <= ( CHAR_LENGTH(t.id_numbers) 
                        - CHAR_LENGTH(REPLACE(t.id_numbers,',','')) )
           GROUP BY n
           HAVING n <> ''
       ) q

This is essentially the same answer that was given to the question that Mark identified as a possible duplicate.

Rather than explain how all of that that works, I'm going to suggest that you AVOID storing comma separated lists of a values as a string.

I recommend you take a look at Chapter 2 of Bill Karwin's excellent book "SQL Antipatterns: Avoiding the Pitfalls of Database Programming".

http://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557

Those articles you saw that said this can't be done in MySQL, those articles are essentially right... you can't do this in SQL because you do not want to do this in SQL. Doing it does violence to the relational database model.

by anonymous   2019-07-21

I'm wondering why you need two joins to the lightboxes_tbl table. It seems like the second reference to that table (alias D) is unnecessary. Seems like you could just use an OR.

As a demonstration, replicating the predicates in your query:

LEFT JOIN lightboxes_tbl C
       ON B.lightbox_id = C.id
      AND ( C.author = 'scott@co.com'
          OR C.authorized_viewers = 'scott@co.com'
          )

But given that authorized_user contains a comma separated list (ACCKKK!!!), I suspect you really want to look for an exact match to an item in the comma separated list. The LIKE comparison that currently have is equivalent to an equals comparison (to the entire contents of authorized_viewers column). You could add '%' wildcard characters to search for the value as part of the string...

But that approach is will also match strings containing e.g. ebscott@co.com, which may not be what you really want.

You could use the FIND_IN_SET function to find an exact match within the comma separated list...

LEFT JOIN lightboxes_tbl C
       ON B.lightbox_id = C.id
      AND ( C.author = 'scott@co.com'
          OR FIND_IN_SET('scott@co.com',C.authorized_viewers)
          )

Storing comma separated lists is a SQL anti-pattern. I recommend Bill Karwin's book: SQL Anti-Patterns: Avoiding the Pitfalls of Database Programming

http://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557

by anonymous   2019-07-21

I first want to mention.... storing comma separated lists is a SQL AntiPattern. I recommend Chapter 2 of Bill Karwin's excellent book "SQL Antipatterns: Avoiding the Pitfalls of Database Programming. Avaiable at Amazon and other fine booksellers [https://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557]


To answer the question you asked... with well formed lists, we can use the MySQL FIND_IN_SET function. As a demonstration

    FIND_IN_SET('8','4,8,34') 

evaluates to TRUE.

Reference: FIND_IN_SET http://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_find-in-set


If we want to avoid using the MySQL specific FIND_IN_SET function, we could use a LIKE comparison. But to get that to work, we need to add a comma character at the beginning and end of the subcat list, and then search for a comma delimited value, e.g. ',8,' in the list. For example

 CONCAT(',',subcat,',') LIKE '%,8,%'  
by anonymous   2019-07-21

Q: "What is the best recommended way how to deal with this kind of stored data?"

A: The best recommendation is to avoid storing data as comma separated lists. (And no, this does not mean we should use semicolons in place of commas as delimiters in the list.)

For an introductory discussion of this topic, I recommend a review of Chapter 2 in Bill Karwin's book: "SQL AntiPatterns: Avoiding the Pitfalls of Database Programming"

Which is conveniently available here

https://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557

and from other fine booksellers. Setting that recommendation aside for a moment...


To retrieve the first element from a semicolon delimited list, we can use the SUBSTRING_INDEX function.

As a demonstration:

SELECT SUBSTRING_INDEX('abc;def;ghi',';',1)

returns

'abc'

The MySQL SUBSTRING_INDEX function is documented here: https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_substring-index


I recognize that this might be considered a "link only answer". A good answer to this question is going to be much longer, giving examples to demonstrate the pitfall of storing comma separated lists.

If the database will only ever view the comma separated list as a blob of data, in its entirety, without a need to examine the contents of the list, then I would consider storing it, similar to the way we would store a .jpg image in the database.

I would store and retrieve a .jpg image as a BLOB, just a block of bytes, in its entirety. Save the whole thing, and retrieve the whole thing. I'm not ever going to have the database manipulate the contents of the image. I'm not going to ever ask the database to examine the image to discern information about what is "in" the jpg image. I'm not going to ask the database to derive any meaningful information out of it... How many people are in a photo, what are the names of people in a photo, add a person to the photo, and so on.

I will only condone storing a comma separated (or semicolon separated) separated list if we are intending it to be an object, an opaque block of bytes, like we handle a jpg image.

by anonymous   2019-07-21

Personally I would not worry about Clean Code in database terms. What you need to worry about is data integrity and performance and security.

I believe as you get more familiar with the database model, it will become easier to understand those long procs. Some things I do though to help troubleshoot later are: If the proc has dynamic sql always write it with the debug variable mode that you can use to grab the SQL code that is built. This will save hours of troubleshooting time.

If the proc does a lot of work make sure to put it in transactions with Try Catch blocks so that everything gets rolled back if there is a problem. You can also take advantage of the fact that rollbacks don't affect table variables and store debugging and error info you will need in an exception table after the rollback. That will make it much easier on prod to see what happens when it fails.

Try to be consistent in your spacing and structure. There are various tools you can use to format. And consistent aliasing helps too. Alias every field referenced because 6 months later you may not remember which one it came from easily. Personally I find it much easier to understand long procs if all the inner joins are before the left joins. This lets me easily see what tables must have data and if the results are wrong it is often one of these that is to blame because it should be a left join.

Be consistent in your parameter data types. Make them match the database datatype of the field you will reference with that parameter.

Store everything in source control and ruthlessly delete all commented out code. If a parameter is not being used, get rid of it. If something is hardcoded, you may want to convert it to a parameter particularly if it is being used more than once.

If you have some business logic which might not be easy to figure out later, put in a comment explaining it.

Try to build new procs in sections, so that you can easily test the results as you go. So for instance suppose you are doing a financial report of all returned orders. You might first write a cte or temp table or table variable to get all the returns. Then you might join to that to get the financial records for just those returns and aggregate them. Then finally you would join all that to other details that you need in the report. So you can check by running just the first part and seeing that there are 2098 returns in the period. After you have aggregated the financial information, you should have one record for each of those 2098 returns. Then at then end you can see of you still have all the needed records. This helps to make sure you didn't lose any along the way by joining unless the data is such that you should have and you understand why you did. You can also see if you are getting extra records from joins to one to many tables which may or may not be appropriate (which is why understanding the data model is crucial).

If you are going to write complex sql, you need to read some books on performance tuning (learning which constructs like correlated subqueries and cursors and what to use instead) to avoid using and how to make sure your where clauses are sargable. There are several books available for SQL Server performance tuning. You also need to read the book on SQL Antipatterns which will be of much more help than the clean code books: http://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557/ref=sr_1_1?ie=UTF8&qid=1406570810&sr=8-1&keywords=sql+antipatterns

by anonymous   2019-01-13

Don't store comma separated lists of values. (This doesn't mean to use semicolons in place of commas.)

SQL Antipatterns: Avoiding the Pitfalls of Database Programming -- Bill Karwin, Chapter 2

Available at Amazon and other fine booksellers.

https://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557


For a finite number of values, we can split those out into separate rows

(Replace inline view t with your table)

SELECT t.id
     , i.n
     , NULLIF(SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT( t.room_type ,';;'),';',i.n),';',-1),'') AS room_type_n
     , NULLIF(SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT( t.room_size ,';;'),';',i.n),';',-1),'') AS room_size_n
  FROM ( SELECT 1 AS id, 'a;b;c' AS room_type, '5;7;12'  AS room_size
         UNION ALL SELECT 2,'b;c;d;f','8;2;4'
         UNION ALL SELECT 3,'b;c;d;f;g;h','6;4;5;6;7;2;9'
       ) t
 CROSS
  JOIN ( SELECT 1 AS n
         UNION ALL SELECT 2
         UNION ALL SELECT 3
         UNION ALL SELECT 4
         UNION ALL SELECT 5
         UNION ALL SELECT 6
         UNION ALL SELECT 7
         UNION ALL SELECT 8
         UNION ALL SELECT 9
       ) i 
HAVING NOT ( room_type_n <=> NULL )
 ORDER BY t.id, i.n

Returns:

 id    n  room_type_n  room_size_n
---  ---  -----------  -----------
  1    1  a            5
  1    2  b            7
  1    3  c            12
  2    1  b            8
  2    2  c            2
  2    3  d            4
  2    4  f            (NULL)
  3    1  b            6
  3    2  c            4
  3    3  d            5
  3    4  f            6
  3    5  g            7
  3    6  h            2

Inline view i should be extended to the finite, maximum number of values we expect to extract: UNION ALL SELECT 10 UNION ALL SELECT 11 ....

by anonymous   2018-05-09
Please allow this question to serve as an example of a pitfall. Why we do not store comma separated lists is discussed in Chapter 2 Jaywalking in Bill Karwin's book "SQL Antipatterns", available at Amazon and other fine booksellers. https://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557
by anonymous   2017-12-18
@MattiaFrusca: We typically reserve `SET` datatype for special cases of strictly predetermined values. And we don't modify the definition of the column when a user request can have any new value that they want to store. Running an ALTER TABLE can kill concurrency; and we typically don't grant ALTER TABLE priv to the application (its more problems than it solves.) Also, storing comma separated lists is an anti-pattern, see Chapter 2 in Bill Karwin's book https://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557(available from amazon and other fine booksellers.)
by anonymous   2017-11-27
The first pattern is the normative relational pattern. Consider a future use case, for example, find all recipe with "oil" as an ingredient. Or if you needed to enforce a foreign key constraint... you would find the second pattern to be unsuitable. Bill Karwin has written an excellent book which has a chapter that covers the topic of "storing lists" in a column. "SQL Antipatterns: Avoiding the Pitfalls of Database Programming" available from Amazon and other fine booksellers. https://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557/ (See Chapter 2. Jaywalking)
by anonymous   2017-08-20

According to this answer by @axiac, better solution in terms of compatibility and performance is shown below.

It is also explained in SQL Antipatterns book, Chapter 15: Ambiguous Groups.

To improve performance, combined index is also added for (group_id, price, id).

SOLUTION

SELECT a.id, a.name, a.group_id, a.price
FROM items a
LEFT JOIN items b 
ON a.group_id = b.group_id 
AND (a.price > b.price OR (a.price = b.price and a.id > b.id))
WHERE b.price is NULL;

See explanation on how it works for more details.

By accident as a side-effect this query works in my case where I needed to include ALL records with group_id equals to NULL AND one item from each group with the lowest price.

RESULT

+----+--------+----------+-------+
| id | name   | group_id | price |
+----+--------+----------+-------+
|  1 | Item A |     NULL | 10.00 | 
|  2 | Item B |     NULL | 20.00 | 
|  3 | Item C |     NULL | 30.00 | 
|  4 | Item D |        1 | 40.00 | 
|  5 | Item E |        2 | 50.00 | 
+----+--------+----------+-------+

EXPLAIN

+----+-------------+-------+------+-------------------------------+--------------------+---------+----------------------------+------+--------------------------+
| id | select_type | table | type | possible_keys                 | key                | key_len | ref                        | rows | Extra                    |
+----+-------------+-------+------+-------------------------------+--------------------+---------+----------------------------+------+--------------------------+
|  1 | SIMPLE      | a     | ALL  | NULL                          | NULL               | NULL    | NULL                       |    7 |                          | 
|  1 | SIMPLE      | b     | ref  | PRIMARY,id,items_group_id_idx | items_group_id_idx | 5       | agi_development.a.group_id |    1 | Using where; Using index | 
+----+-------------+-------+------+-------------------------------+--------------------+---------+----------------------------+------+--------------------------+
by anonymous   2017-08-20

I assume you have some primary key field on each joined table, since UserID is not unique. I'll assume your primary key is called ID. We'll take the records with the lowest ID. This meets your "first" criteria.

SELECT  Users.UserID, Users.FirstName, Users.LastName, hp.HomePhone,
        wp.WorkPhone, fn.FaxNumber
FROM Users
LEFT JOIN HomePhone hp ON hp.UserID = Users.UserID
LEFT JOIN HomePhone hp2 ON hp2.UserID = Users.UserID AND hp2.ID < hp.ID
LEFT JOIN WorkPhone wp ON wp.UserID = Users.UserID
LEFT JOIN WorkPhone wp2 ON wp2.UserID = Users.UserID AND wp2.ID < wp.ID
LEFT JOIN FaxNumber fn ON fn.UserID = Users.UserID
LEFT JOIN FaxNumber fn2 ON fn2.UserID = Users.UserID AND fn2.ID < fn.ID
WHERE hp2.ID IS NULL AND wp2.ID IS NULL AND fn2.ID IS NULL

There is a whole chapter on this type of issue, called "Ambiguous Gruops", in the book SQL Antipatterns.

by anonymous   2017-08-20

This question has been asked and answered before, but I don't want to hunt for it; this question should be closed as a duplicate. But, to answer your question:

The commas in the string, the column value, are just characters. Those are part of the string. They aren't seen as "separators" between values in the SQL text. The way SQL sees it, the column contains a single value, not a "list" of values.

So, in your query, the IN (field) is equivalent to an equals comparison. It's equivalent to comparing to a string. For example:

... WHERE 2316 = '1908,2315,2316'

And those aren't equal, so the row isn't returned. The "surprisingly" finding of a match, in the case of:

... WHERE 1908 IN ('1908,2315,2316')

that's explained because that string is being evaluated in a numeric context. That is, the comparison returns true, because all of these also true:

... WHERE 1908 = '1908,2315,2316' + 0
... WHERE 1908 = '1908xyz' + 0
... WHERE 1908 = '1907qrs' + 1

(When evaluated in a numeric context, a string gets converted to numeric. It just happens that the string evaluates to a numeric value that equals the integer value it's being comparing to.)


You may be able to make use of the MySQL FIND_IN_SET function. For example:

... WHERE FIND_IN_SET(2316,'1908,2315,2316')

But, please seriously reconsider the design of storing comma separated list. I recommend Bill Karwin's "SQL Antipatterns" book...

http://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557