One of my favorite books to throw out for noobs is both a pattern and antipattern book for database folks. https://www.amazon.com/SQL-Antipatterns-Programming-Pragmati...
To generate a distinct list of "numbers" from all of your strings:
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(t.id_numbers,',',d.i+1),',',-1) AS n
FROM mytable t
JOIN ( SELECT 0 AS i
UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
) d
ON d.i <= ( CHAR_LENGTH(t.id_numbers)
- CHAR_LENGTH(REPLACE(t.id_numbers,',','')) )
GROUP BY n
HAVING n <> ''
To put that back into a comma separated list (ARRGGHH! WHY do we want to that?), we can use the query above as an inline view...
SELECT GROUP_CONCAT(q.n ORDER BY q.n+0) AS id_numbers
FROM ( SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(t.id_numbers,',',d.i+1),',',-1) AS n
FROM mytable t
JOIN ( SELECT 0 AS i
UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
) d
ON d.i <= ( CHAR_LENGTH(t.id_numbers)
- CHAR_LENGTH(REPLACE(t.id_numbers,',','')) )
GROUP BY n
HAVING n <> ''
) q
This is essentially the same answer that was given to the question that Mark identified as a possible duplicate.
Rather than explain how all of that that works, I'm going to suggest that you AVOID storing comma separated lists of a values as a string.
I recommend you take a look at Chapter 2 of Bill Karwin's excellent book "SQL Antipatterns: Avoiding the Pitfalls of Database Programming".
Those articles you saw that said this can't be done in MySQL, those articles are essentially right... you can't do this in SQL because you do not want to do this in SQL. Doing it does violence to the relational database model.
I'm wondering why you need two joins to the lightboxes_tbl table. It seems like the second reference to that table (alias D) is unnecessary. Seems like you could just use an OR.
As a demonstration, replicating the predicates in your query:
LEFT JOIN lightboxes_tbl C
ON B.lightbox_id = C.id
AND ( C.author = 'scott@co.com'
OR C.authorized_viewers = 'scott@co.com'
)
But given that authorized_user contains a comma separated list (ACCKKK!!!), I suspect you really want to look for an exact match to an item in the comma separated list. The LIKE comparison that currently have is equivalent to an equals comparison (to the entire contents of authorized_viewers column). You could add '%' wildcard characters to search for the value as part of the string...
But that approach is will also match strings containing e.g. ebscott@co.com, which may not be what you really want.
You could use the FIND_IN_SET function to find an exact match within the comma separated list...
LEFT JOIN lightboxes_tbl C
ON B.lightbox_id = C.id
AND ( C.author = 'scott@co.com'
OR FIND_IN_SET('scott@co.com',C.authorized_viewers)
)
Storing comma separated lists is a SQL anti-pattern. I recommend Bill Karwin's book: SQL Anti-Patterns: Avoiding the Pitfalls of Database Programming
I first want to mention.... storing comma separated lists is a SQL AntiPattern. I recommend Chapter 2 of Bill Karwin's excellent book "SQL Antipatterns: Avoiding the Pitfalls of Database Programming. Avaiable at Amazon and other fine booksellers [https://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557]
To answer the question you asked... with well formed lists, we can use the MySQL FIND_IN_SET function. As a demonstration
If we want to avoid using the MySQL specific FIND_IN_SET function, we could use a LIKE comparison. But to get that to work, we need to add a comma character at the beginning and end of the subcat list, and then search for a comma delimited value, e.g. ',8,' in the list. For example
Q: "What is the best recommended way how to deal with this kind of stored data?"
A: The best recommendation is to avoid storing data as comma separated lists. (And no, this does not mean we should use semicolons in place of commas as delimiters in the list.)
For an introductory discussion of this topic, I recommend a review of Chapter 2 in Bill Karwin's book: "SQL AntiPatterns: Avoiding the Pitfalls of Database Programming"
and from other fine booksellers. Setting that recommendation aside for a moment...
To retrieve the first element from a semicolon delimited list, we can use the SUBSTRING_INDEX function.
As a demonstration:
SELECT SUBSTRING_INDEX('abc;def;ghi',';',1)
returns
'abc'
The MySQL SUBSTRING_INDEX function is documented here: https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_substring-index
I recognize that this might be considered a "link only answer". A good answer to this question is going to be much longer, giving examples to demonstrate the pitfall of storing comma separated lists.
If the database will only ever view the comma separated list as a blob of data, in its entirety, without a need to examine the contents of the list, then I would consider storing it, similar to the way we would store a .jpg image in the database.
I would store and retrieve a .jpg image as a BLOB, just a block of bytes, in its entirety. Save the whole thing, and retrieve the whole thing. I'm not ever going to have the database manipulate the contents of the image. I'm not going to ever ask the database to examine the image to discern information about what is "in" the jpg image. I'm not going to ask the database to derive any meaningful information out of it... How many people are in a photo, what are the names of people in a photo, add a person to the photo, and so on.
I will only condone storing a comma separated (or semicolon separated) separated list if we are intending it to be an object, an opaque block of bytes, like we handle a jpg image.
Personally I would not worry about Clean Code in database terms. What you need to worry about is data integrity and performance and security.
I believe as you get more familiar with the database model, it will become easier to understand those long procs. Some things I do though to help troubleshoot later are:
If the proc has dynamic sql always write it with the debug variable mode that you can use to grab the SQL code that is built. This will save hours of troubleshooting time.
If the proc does a lot of work make sure to put it in transactions with Try Catch blocks so that everything gets rolled back if there is a problem. You can also take advantage of the fact that rollbacks don't affect table variables and store debugging and error info you will need in an exception table after the rollback. That will make it much easier on prod to see what happens when it fails.
Try to be consistent in your spacing and structure. There are various tools you can use to format. And consistent aliasing helps too. Alias every field referenced because 6 months later you may not remember which one it came from easily. Personally I find it much easier to understand long procs if all the inner joins are before the left joins. This lets me easily see what tables must have data and if the results are wrong it is often one of these that is to blame because it should be a left join.
Be consistent in your parameter data types. Make them match the database datatype of the field you will reference with that parameter.
Store everything in source control and ruthlessly delete all commented out code. If a parameter is not being used, get rid of it. If something is hardcoded, you may want to convert it to a parameter particularly if it is being used more than once.
If you have some business logic which might not be easy to figure out later, put in a comment explaining it.
Try to build new procs in sections, so that you can easily test the results as you go. So for instance suppose you are doing a financial report of all returned orders. You might first write a cte or temp table or table variable to get all the returns. Then you might join to that to get the financial records for just those returns and aggregate them. Then finally you would join all that to other details that you need in the report. So you can check by running just the first part and seeing that there are 2098 returns in the period. After you have aggregated the financial information, you should have one record for each of those 2098 returns. Then at then end you can see of you still have all the needed records. This helps to make sure you didn't lose any along the way by joining unless the data is such that you should have and you understand why you did. You can also see if you are getting extra records from joins to one to many tables which may or may not be appropriate (which is why understanding the data model is crucial).
If you are going to write complex sql, you need to read some books on performance tuning (learning which constructs like correlated subqueries and cursors and what to use instead) to avoid using and how to make sure your where clauses are sargable. There are several books available for SQL Server performance tuning. You also need to read the book on SQL Antipatterns which will be of much more help than the clean code books: http://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557/ref=sr_1_1?ie=UTF8&qid=1406570810&sr=8-1&keywords=sql+antipatterns
For a finite number of values, we can split those out into separate rows
(Replace inline view t with your table)
SELECT t.id
, i.n
, NULLIF(SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT( t.room_type ,';;'),';',i.n),';',-1),'') AS room_type_n
, NULLIF(SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT( t.room_size ,';;'),';',i.n),';',-1),'') AS room_size_n
FROM ( SELECT 1 AS id, 'a;b;c' AS room_type, '5;7;12' AS room_size
UNION ALL SELECT 2,'b;c;d;f','8;2;4'
UNION ALL SELECT 3,'b;c;d;f;g;h','6;4;5;6;7;2;9'
) t
CROSS
JOIN ( SELECT 1 AS n
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
UNION ALL SELECT 5
UNION ALL SELECT 6
UNION ALL SELECT 7
UNION ALL SELECT 8
UNION ALL SELECT 9
) i
HAVING NOT ( room_type_n <=> NULL )
ORDER BY t.id, i.n
Returns:
id n room_type_n room_size_n
--- --- ----------- -----------
1 1 a 5
1 2 b 7
1 3 c 12
2 1 b 8
2 2 c 2
2 3 d 4
2 4 f (NULL)
3 1 b 6
3 2 c 4
3 3 d 5
3 4 f 6
3 5 g 7
3 6 h 2
Inline view i should be extended to the finite, maximum number of values we expect to extract: UNION ALL SELECT 10 UNION ALL SELECT 11 ....
Please allow this question to serve as an example of a pitfall. Why we do not store comma separated lists is discussed in Chapter 2 Jaywalking in Bill Karwin's book "SQL Antipatterns", available at Amazon and other fine booksellers. https://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557
@MattiaFrusca: We typically reserve `SET` datatype for special cases of strictly predetermined values. And we don't modify the definition of the column when a user request can have any new value that they want to store. Running an ALTER TABLE can kill concurrency; and we typically don't grant ALTER TABLE priv to the application (its more problems than it solves.) Also, storing comma separated lists is an anti-pattern, see Chapter 2 in Bill Karwin's book https://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557(available from amazon and other fine booksellers.)
The first pattern is the normative relational pattern. Consider a future use case, for example, find all recipe with "oil" as an ingredient. Or if you needed to enforce a foreign key constraint... you would find the second pattern to be unsuitable. Bill Karwin has written an excellent book which has a chapter that covers the topic of "storing lists" in a column. "SQL Antipatterns: Avoiding the Pitfalls of Database Programming" available from Amazon and other fine booksellers. https://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557/ (See Chapter 2. Jaywalking)
To improve performance, combined index is also added for (group_id, price, id).
SOLUTION
SELECT a.id, a.name, a.group_id, a.price
FROM items a
LEFT JOIN items b
ON a.group_id = b.group_id
AND (a.price > b.price OR (a.price = b.price and a.id > b.id))
WHERE b.price is NULL;
See explanation on how it works for more details.
By accident as a side-effect this query works in my case where I needed to include ALL records with group_id equals to NULLANDone item from each group with the lowest price.
RESULT
+----+--------+----------+-------+
| id | name | group_id | price |
+----+--------+----------+-------+
| 1 | Item A | NULL | 10.00 |
| 2 | Item B | NULL | 20.00 |
| 3 | Item C | NULL | 30.00 |
| 4 | Item D | 1 | 40.00 |
| 5 | Item E | 2 | 50.00 |
+----+--------+----------+-------+
EXPLAIN
+----+-------------+-------+------+-------------------------------+--------------------+---------+----------------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+-------------------------------+--------------------+---------+----------------------------+------+--------------------------+
| 1 | SIMPLE | a | ALL | NULL | NULL | NULL | NULL | 7 | |
| 1 | SIMPLE | b | ref | PRIMARY,id,items_group_id_idx | items_group_id_idx | 5 | agi_development.a.group_id | 1 | Using where; Using index |
+----+-------------+-------+------+-------------------------------+--------------------+---------+----------------------------+------+--------------------------+
I assume you have some primary key field on each joined table, since UserID is not unique. I'll assume your primary key is called ID. We'll take the records with the lowest ID. This meets your "first" criteria.
SELECT Users.UserID, Users.FirstName, Users.LastName, hp.HomePhone,
wp.WorkPhone, fn.FaxNumber
FROM Users
LEFT JOIN HomePhone hp ON hp.UserID = Users.UserID
LEFT JOIN HomePhone hp2 ON hp2.UserID = Users.UserID AND hp2.ID < hp.ID
LEFT JOIN WorkPhone wp ON wp.UserID = Users.UserID
LEFT JOIN WorkPhone wp2 ON wp2.UserID = Users.UserID AND wp2.ID < wp.ID
LEFT JOIN FaxNumber fn ON fn.UserID = Users.UserID
LEFT JOIN FaxNumber fn2 ON fn2.UserID = Users.UserID AND fn2.ID < fn.ID
WHERE hp2.ID IS NULL AND wp2.ID IS NULL AND fn2.ID IS NULL
There is a whole chapter on this type of issue, called "Ambiguous Gruops", in the book SQL Antipatterns.
This question has been asked and answered before, but I don't want to hunt for it; this question should be closed as a duplicate. But, to answer your question:
The commas in the string, the column value, are just characters. Those are part of the string. They aren't seen as "separators" between values in the SQL text. The way SQL sees it, the column contains a single value, not a "list" of values.
So, in your query, the IN (field) is equivalent to an equals comparison. It's equivalent to comparing to a string. For example:
... WHERE 2316 = '1908,2315,2316'
And those aren't equal, so the row isn't returned. The "surprisingly" finding of a match, in the case of:
... WHERE 1908 IN ('1908,2315,2316')
that's explained because that string is being evaluated in a numeric context. That is, the comparison returns true, because all of these also true:
... WHERE 1908 = '1908,2315,2316' + 0
... WHERE 1908 = '1908xyz' + 0
... WHERE 1908 = '1907qrs' + 1
(When evaluated in a numeric context, a string gets converted to numeric. It just happens that the string evaluates to a numeric value that equals the integer value it's being comparing to.)
You may be able to make use of the MySQL FIND_IN_SET function. For example:
... WHERE FIND_IN_SET(2316,'1908,2315,2316')
But, please seriously reconsider the design of storing comma separated list. I recommend Bill Karwin's "SQL Antipatterns" book...
- Developing Time-Oriented Database Applications in SQL by Rich. Snodgrass: https://www.amazon.com/Developing-Time-Oriented-Database-App...
-----
Anything by Joe Celko:
- SQL for Smarties: https://www.amazon.com/SQL-Antipatterns-Programming-Pragmati...
To generate a distinct list of "numbers" from all of your strings:
To put that back into a comma separated list (ARRGGHH! WHY do we want to that?), we can use the query above as an inline view...
This is essentially the same answer that was given to the question that Mark identified as a possible duplicate.
Rather than explain how all of that that works, I'm going to suggest that you AVOID storing comma separated lists of a values as a string.
I recommend you take a look at Chapter 2 of Bill Karwin's excellent book "SQL Antipatterns: Avoiding the Pitfalls of Database Programming".
http://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557
Those articles you saw that said this can't be done in MySQL, those articles are essentially right... you can't do this in SQL because you do not want to do this in SQL. Doing it does violence to the relational database model.
I'm wondering why you need two joins to the
lightboxes_tbl
table. It seems like the second reference to that table (aliasD
) is unnecessary. Seems like you could just use anOR
.As a demonstration, replicating the predicates in your query:
But given that
authorized_user
contains a comma separated list (ACCKKK!!!), I suspect you really want to look for an exact match to an item in the comma separated list. TheLIKE
comparison that currently have is equivalent to an equals comparison (to the entire contents ofauthorized_viewers
column). You could add'%'
wildcard characters to search for the value as part of the string...But that approach is will also match strings containing e.g.
ebscott@co.com
, which may not be what you really want.You could use the
FIND_IN_SET
function to find an exact match within the comma separated list...Storing comma separated lists is a SQL anti-pattern. I recommend Bill Karwin's book: SQL Anti-Patterns: Avoiding the Pitfalls of Database Programming
http://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557
I first want to mention.... storing comma separated lists is a SQL AntiPattern. I recommend Chapter 2 of Bill Karwin's excellent book "SQL Antipatterns: Avoiding the Pitfalls of Database Programming. Avaiable at Amazon and other fine booksellers [https://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557]
To answer the question you asked... with well formed lists, we can use the MySQL
FIND_IN_SET
function. As a demonstrationevaluates to TRUE.
Reference:
FIND_IN_SET
http://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_find-in-setIf we want to avoid using the MySQL specific
FIND_IN_SET
function, we could use a LIKE comparison. But to get that to work, we need to add a comma character at the beginning and end of the subcat list, and then search for a comma delimited value, e.g.',8,'
in the list. For exampleQ: "What is the best recommended way how to deal with this kind of stored data?"
A: The best recommendation is to avoid storing data as comma separated lists. (And no, this does not mean we should use semicolons in place of commas as delimiters in the list.)
For an introductory discussion of this topic, I recommend a review of Chapter 2 in Bill Karwin's book: "SQL AntiPatterns: Avoiding the Pitfalls of Database Programming"
Which is conveniently available here
https://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557
and from other fine booksellers. Setting that recommendation aside for a moment...
To retrieve the first element from a semicolon delimited list, we can use the
SUBSTRING_INDEX
function.As a demonstration:
returns
The MySQL
SUBSTRING_INDEX
function is documented here: https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_substring-indexI recognize that this might be considered a "link only answer". A good answer to this question is going to be much longer, giving examples to demonstrate the pitfall of storing comma separated lists.
If the database will only ever view the comma separated list as a blob of data, in its entirety, without a need to examine the contents of the list, then I would consider storing it, similar to the way we would store a .jpg image in the database.
I would store and retrieve a .jpg image as a BLOB, just a block of bytes, in its entirety. Save the whole thing, and retrieve the whole thing. I'm not ever going to have the database manipulate the contents of the image. I'm not going to ever ask the database to examine the image to discern information about what is "in" the jpg image. I'm not going to ask the database to derive any meaningful information out of it... How many people are in a photo, what are the names of people in a photo, add a person to the photo, and so on.
I will only condone storing a comma separated (or semicolon separated) separated list if we are intending it to be an object, an opaque block of bytes, like we handle a jpg image.
Personally I would not worry about Clean Code in database terms. What you need to worry about is data integrity and performance and security.
I believe as you get more familiar with the database model, it will become easier to understand those long procs. Some things I do though to help troubleshoot later are: If the proc has dynamic sql always write it with the debug variable mode that you can use to grab the SQL code that is built. This will save hours of troubleshooting time.
If the proc does a lot of work make sure to put it in transactions with Try Catch blocks so that everything gets rolled back if there is a problem. You can also take advantage of the fact that rollbacks don't affect table variables and store debugging and error info you will need in an exception table after the rollback. That will make it much easier on prod to see what happens when it fails.
Try to be consistent in your spacing and structure. There are various tools you can use to format. And consistent aliasing helps too. Alias every field referenced because 6 months later you may not remember which one it came from easily. Personally I find it much easier to understand long procs if all the inner joins are before the left joins. This lets me easily see what tables must have data and if the results are wrong it is often one of these that is to blame because it should be a left join.
Be consistent in your parameter data types. Make them match the database datatype of the field you will reference with that parameter.
Store everything in source control and ruthlessly delete all commented out code. If a parameter is not being used, get rid of it. If something is hardcoded, you may want to convert it to a parameter particularly if it is being used more than once.
If you have some business logic which might not be easy to figure out later, put in a comment explaining it.
Try to build new procs in sections, so that you can easily test the results as you go. So for instance suppose you are doing a financial report of all returned orders. You might first write a cte or temp table or table variable to get all the returns. Then you might join to that to get the financial records for just those returns and aggregate them. Then finally you would join all that to other details that you need in the report. So you can check by running just the first part and seeing that there are 2098 returns in the period. After you have aggregated the financial information, you should have one record for each of those 2098 returns. Then at then end you can see of you still have all the needed records. This helps to make sure you didn't lose any along the way by joining unless the data is such that you should have and you understand why you did. You can also see if you are getting extra records from joins to one to many tables which may or may not be appropriate (which is why understanding the data model is crucial).
If you are going to write complex sql, you need to read some books on performance tuning (learning which constructs like correlated subqueries and cursors and what to use instead) to avoid using and how to make sure your where clauses are sargable. There are several books available for SQL Server performance tuning. You also need to read the book on SQL Antipatterns which will be of much more help than the clean code books: http://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557/ref=sr_1_1?ie=UTF8&qid=1406570810&sr=8-1&keywords=sql+antipatterns
Don't store comma separated lists of values. (This doesn't mean to use semicolons in place of commas.)
SQL Antipatterns: Avoiding the Pitfalls of Database Programming -- Bill Karwin, Chapter 2
Available at Amazon and other fine booksellers.
https://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557
For a finite number of values, we can split those out into separate rows
(Replace inline view
t
with your table)Returns:
Inline view
i
should be extended to the finite, maximum number of values we expect to extract:UNION ALL SELECT 10 UNION ALL SELECT 11 ...
.According to this answer by @axiac, better solution in terms of compatibility and performance is shown below.
It is also explained in SQL Antipatterns book, Chapter 15: Ambiguous Groups.
To improve performance, combined index is also added for
(group_id, price, id)
.See explanation on how it works for more details.
By accident as a side-effect this query works in my case where I needed to include ALL records with
group_id
equals toNULL
AND one item from each group with the lowest price.I assume you have some primary key field on each joined table, since UserID is not unique. I'll assume your primary key is called ID. We'll take the records with the lowest ID. This meets your "first" criteria.
There is a whole chapter on this type of issue, called "Ambiguous Gruops", in the book SQL Antipatterns.
This question has been asked and answered before, but I don't want to hunt for it; this question should be closed as a duplicate. But, to answer your question:
The commas in the string, the column value, are just characters. Those are part of the string. They aren't seen as "separators" between values in the SQL text. The way SQL sees it, the column contains a single value, not a "list" of values.
So, in your query, the
IN (field)
is equivalent to an equals comparison. It's equivalent to comparing to a string. For example:And those aren't equal, so the row isn't returned. The "surprisingly" finding of a match, in the case of:
that's explained because that string is being evaluated in a numeric context. That is, the comparison returns true, because all of these also true:
(When evaluated in a numeric context, a string gets converted to numeric. It just happens that the string evaluates to a numeric value that equals the integer value it's being comparing to.)
You may be able to make use of the MySQL
FIND_IN_SET
function. For example:But, please seriously reconsider the design of storing comma separated list. I recommend Bill Karwin's "SQL Antipatterns" book...
http://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557