MySQL "IN" – Query Performance

When working with large databases, SQL structure effects query performance dramatically. I recently refactored a query that looked like this:


Refactored into:


The query went from around 45 seconds down to 3. Digging into the MySQL docs a little more I found this:

The optimizer rewrites the statement to a correlated subquery:
SELECTFROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.b = t1.a);

If the inner and outer queries return M and N rows, respectively, the execution time becomes on the order of O(MxN), rather than O(M+N) as it would be for an uncorrelated subquery.

An implication is that an IN subquery can be much slower than a query written using an IN(value_list)operator that lists the same values that the subquery would return.

MySql doc on subquery restrictions

Conversation
  • Douglas says:

    What is the likelihood that DB2, SQL Server, etc., behave the same way?

  • Douglas: It’s hard to say. Database engines optimize differently.

    It’s highly likely that you would see a performance increase removing the sub-select as outlined in the MySQL doc or by adding a join as I did in my example.

    It’s harder to say to what degree that performance increase would be. I haven’t had a chance to test out any other databases.

  • Funny Falcon says:

    PostgreSQL 8.4 converts “IN” to “JOIN”, as well as “EXISTS” (just checked)

  • Checks like those aren’t adequate. The engine will do different things in different situations.

    In is a log hit per row. Using it is the clear sign of an amateur who’s never read an SQL book.

  • SigmundAusfaller says:

    The two statements are logically equivalent, MS SQL’s optimizer has no trouble seeing this (even back to SQL 2000) and the query plans are identical.

  • Aaron Kempf says:

    I totally disagree, I had a situation yesterday in MS SQL where rewriting a subquery to a join caused performance DECREASE. I just shudder to think that people actually use such a crippled database like mySQL.

  • SigmundAusfaller says:

    I was talking about the above simple SQL statement, MS SQL will end up using the same physical ops(Nested Loop,Hash Match, or Merge Join).

    I too have seen MS SQL optimizer fall down on more complex joins and have it pick a better plan with using IN, I don’t think I have ever seen it pick a better plan with a JOIN rather than an IN.

    For me the rule with MS SQL server is, if you don’t need to output any columns from the inner joined table I prefer an IN clause.

    With MySQL the rule is easy, NEVER use IN.

  • Ryan Bayona says:

    Have you got some benchmark comparing IN (subquery statement) and IN (value lists) ?

    Im quite interested in this as I frequently used the latter and the list can have as many as 5000 values

  • Ehmay Sinhorz says:

    I second “Ryan Bayona”‘s comment. It seems like the performance gain is not from eliminating using “IN” per se, but from converting the sub-query into an explicit JOIN.

  • Rob Montroy says:

    I worked on the DB2 query compiler at IBM, and can tell you that it does indeed perform subquery-to-join elimination, which is what MySQL should be doing in this case.

  • Timour says:

    Hi,

    I found your blog while looking for subquery-specific benchmarks on Google.
    Since I was one of the MySQL optimizer devs until recently, and now work
    on the optimizer of MariaDB (MySQL’s primary fork), I can give some insight.

    Indeed, MySQL doesn’t transform IN subqueries into JOINs. It transforms
    IN into an equivalent EXISTS by “pushing” the expressions from the left
    argument into the subquery by introducing a conjunction of equalities between
    each outer expression, and its corresponding subquery expression. This usually
    results in very efficient plans if the subquery tables have suitable indexes to
    support these equalities. Then it is very cheap to re-execute the subquery.
    If there are no such indexes, then re-executing the subquery many times becomes
    very expensive, and the whole query slow. This is true for all versions of MySQL
    up to 5.5 (even the latest development version).

    For more info on MySQL’s subquery processing read this:
    http://dev.mysql.com/doc/refman/5.5/en/in-subquery-optimization.html

    However, all is not lost, the MariaDB branch of MySQL implements most of the
    subquery optimizations found in the major commercial DBMSs. Our current version
    MariaDB 5.3-beta essentially does automatically what you did manually above, but
    even faster, by rewriting the IN predicate into a semi-join (which is faster than a join).
    There are multiple ways to do this, and the optimizer chooses the best one. The
    same is true for all other major DBMSs, that is why it is no surprise that manually
    rewriting an IN subquery into a join may result in worse performance.

    To get a flavor of the performance improvements that MariaDB 5.3 delivers
    compared to MySQL, check our presentation at this years MySQL conference:
    http://en.oreilly.com/mysql2011/public/schedule/detail/20238.

    A compact list of all new optimizer and other features of MariaDB 5.3 can be found here:
    https://kb.askmonty.org/en/what-is-mariadb-53

  • […] on your database and your measured performance characteristics (take into account this experience and the optimizations available to your DBMS), one of those might make more sense for you. […]

  • Comments are closed.