When working with large databases, SQL structure effects query performance dramatically. I recently refactored a query that looked like this:
Refactored into:
The query went from around 45 seconds down to 3. Digging into the MySQL docs a little more I found this:
The optimizer rewrites the statement to a correlated subquery:
SELECT … FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.b = t1.a);If the inner and outer queries return M and N rows, respectively, the execution time becomes on the order of O(MxN), rather than O(M+N) as it would be for an uncorrelated subquery.
An implication is that an IN subquery can be much slower than a query written using an IN(value_list)operator that lists the same values that the subquery would return.
What is the likelihood that DB2, SQL Server, etc., behave the same way?
Douglas: It’s hard to say. Database engines optimize differently.
It’s highly likely that you would see a performance increase removing the sub-select as outlined in the MySQL doc or by adding a join as I did in my example.
It’s harder to say to what degree that performance increase would be. I haven’t had a chance to test out any other databases.
PostgreSQL 8.4 converts “IN” to “JOIN”, as well as “EXISTS” (just checked)
Checks like those aren’t adequate. The engine will do different things in different situations.
In is a log hit per row. Using it is the clear sign of an amateur who’s never read an SQL book.
The two statements are logically equivalent, MS SQL’s optimizer has no trouble seeing this (even back to SQL 2000) and the query plans are identical.
I totally disagree, I had a situation yesterday in MS SQL where rewriting a subquery to a join caused performance DECREASE. I just shudder to think that people actually use such a crippled database like mySQL.
I was talking about the above simple SQL statement, MS SQL will end up using the same physical ops(Nested Loop,Hash Match, or Merge Join).
I too have seen MS SQL optimizer fall down on more complex joins and have it pick a better plan with using IN, I don’t think I have ever seen it pick a better plan with a JOIN rather than an IN.
For me the rule with MS SQL server is, if you don’t need to output any columns from the inner joined table I prefer an IN clause.
With MySQL the rule is easy, NEVER use IN.
Have you got some benchmark comparing IN (subquery statement) and IN (value lists) ?
Im quite interested in this as I frequently used the latter and the list can have as many as 5000 values
I second “Ryan Bayona”‘s comment. It seems like the performance gain is not from eliminating using “IN” per se, but from converting the sub-query into an explicit JOIN.
I worked on the DB2 query compiler at IBM, and can tell you that it does indeed perform subquery-to-join elimination, which is what MySQL should be doing in this case.
Hi,
I found your blog while looking for subquery-specific benchmarks on Google.
Since I was one of the MySQL optimizer devs until recently, and now work
on the optimizer of MariaDB (MySQL’s primary fork), I can give some insight.
Indeed, MySQL doesn’t transform IN subqueries into JOINs. It transforms
IN into an equivalent EXISTS by “pushing” the expressions from the left
argument into the subquery by introducing a conjunction of equalities between
each outer expression, and its corresponding subquery expression. This usually
results in very efficient plans if the subquery tables have suitable indexes to
support these equalities. Then it is very cheap to re-execute the subquery.
If there are no such indexes, then re-executing the subquery many times becomes
very expensive, and the whole query slow. This is true for all versions of MySQL
up to 5.5 (even the latest development version).
For more info on MySQL’s subquery processing read this:
http://dev.mysql.com/doc/refman/5.5/en/in-subquery-optimization.html
However, all is not lost, the MariaDB branch of MySQL implements most of the
subquery optimizations found in the major commercial DBMSs. Our current version
MariaDB 5.3-beta essentially does automatically what you did manually above, but
even faster, by rewriting the IN predicate into a semi-join (which is faster than a join).
There are multiple ways to do this, and the optimizer chooses the best one. The
same is true for all other major DBMSs, that is why it is no surprise that manually
rewriting an IN subquery into a join may result in worse performance.
To get a flavor of the performance improvements that MariaDB 5.3 delivers
compared to MySQL, check our presentation at this years MySQL conference:
http://en.oreilly.com/mysql2011/public/schedule/detail/20238.
A compact list of all new optimizer and other features of MariaDB 5.3 can be found here:
https://kb.askmonty.org/en/what-is-mariadb-53
[…] I can get a nice performance boost (learned at https://spin.atomicobject.com/2011/03/25/mysql-in-query-performance) […]
[…] MySQL “IN” – Query Performance by Dustin Tinney […]
[…] on your database and your measured performance characteristics (take into account this experience and the optimizations available to your DBMS), one of those might make more sense for you. […]