Database indexes are very important

The SQL query that I’m running 650 times is as follows:

SELECT h.protein_acc, h.taxon_id FROM homologene h, all_proteins a, synonyms s WHERE
s.synonym = '$_' AND a.alt = s.protein_acc AND h.protein_acc = a.protein_acc
UNION SELECT h2.protein_acc, h2.taxon_id FROM homologene h1, homologene h2, all_proteins a
WHERE a.alt = '$_' AND h1.protein_acc = a.protein_acc AND h1.hid = h2.hid;

I estimated that the script would take 10 hours to run (this SQL query isn’t the only thing it’s doing, of course). That’s far too long. Using the handy EXPLAIN function showed me where I was going wrong:

+-------+--------+-----------------+---------+---------+----------------------+--------+-------------+
| table | type   | possible_keys   | key     | key_len | ref                  | rows   | Extra       |
+-------+--------+-----------------+---------+---------+----------------------+--------+-------------+
| s     | ref    | PRIMARY,synonym | synonym |      27 | const                |      1 | Using where |
| a     | ref    | PRIMARY,alt     | alt     |      28 | s.protein_acc        |      3 | Using where |
| h     | ALL    | NULL            | NULL    |    NULL | NULL                 | 174762 | Using where |
| a     | ref    | PRIMARY,alt     | alt     |      28 | const                |      1 | Using where |
| h2    | ALL    | PRIMARY         | NULL    |    NULL | NULL                 | 174762 |             |
| h1    | eq_ref | PRIMARY         | PRIMARY |      31 | h2.hid,a.protein_acc |      1 | Using index |
+-------+--------+-----------------+---------+---------+----------------------+--------+-------------+

See those two rows that say “ALL”? That’s right, it’s doing a linear search through 174,762 records to find the record of interest. Twice.

What to do? Create a decent index. After running “ALTER TABLE homologene ADD INDEX (protein_acc);“, we now have:

+-------+------+---------------------+-------------+---------+---------------+------+-------------+
| table | type | possible_keys       | key         | key_len | ref           | rows | Extra       |
+-------+------+---------------------+-------------+---------+---------------+------+-------------+
| s     | ref  | PRIMARY,synonym     | synonym     |      27 | const         |    1 | Using where |
| a     | ref  | PRIMARY,alt         | alt         |      28 | s.protein_acc |    3 | Using where |
| h     | ref  | protein_acc         | protein_acc |      27 | a.protein_acc |    1 |             |
| a     | ref  | PRIMARY,alt         | alt         |      28 | const         |    1 | Using where |
| h1    | ref  | PRIMARY,protein_acc | protein_acc |      27 | a.protein_acc |    1 |             |
| h2    | ref  | PRIMARY             | PRIMARY     |       4 | h1.hid        | 1747 |             |
+-------+------+---------------------+-------------+---------+---------------+------+-------------+

The script went from taking 10 hours to about 15 minutes. That makes me very happy.

Edit: 15 minutes may have been too optimistic. If only there was some way of speeding up GenBank, then I’d be in business!

Leave a Reply

:mrgreen: :neutral: :twisted: :shock: :smile: :???: :cool: :evil: :grin: :oops: :razz: :roll: :wink: :cry: :eek: :lol: :mad: :sad: