When optimizing code, you profile it, find out where it’s spending most of its time, and figure out how to make it spend less time there. As a sysadmin, I tend to wear a lot of hats, but the one I wear the most by far is the DBA hat–I spend far more time hacking on MySQL optimizations than on anything else I do. So, if I can find a way to streamline the time I spend with MySQL, I can reap the biggest efficiency gains.
It turns out doing so isn’t so hard. Here are some tricks I’ve learned.
Now, the reason I spend so much time with MySQL is that it’s a prime cause of bottlenecks in our applications. Those bottlenecks tend to fall into three categories:
The queries cause tables to lock, halting further queries until the first completes.
The queries are improperly indexed, forcing MySQL to examine or sort the entire table when it doesn’t have to.
The queries act on more data than the system can adequately manage at once.
There are different solutions to each of the above problems. I’ll walk through a few of them, but before I do, let me start with the lowest-hanging fruit that is common to all three: upgrade to the latest version of MySQL, and then upgrade to Percona Server. Percona Server is a drop-in replacement for MySQL that has tighter optimization and better monitoring features. And when I say drop-in replacement, I really mean drop-in replacement: installing it is literally trivial. We’ve done so on both Ubuntu 10.04 Server and RHEL 5, and the process in both cases was to back up your data (just in case), remove the MySQL Server package, and then install the Percona Server package. That’s it. All of your client libraries will continue to work, you can still use handy performance tools like MySQL Tuner and MyTop, and you can even continue replicating to your non-Percona slaves (which you should also upgrade, of course). It grants you instant benefits for virtually no investment.
After that, we can’t optimize queries without first knowing which queries need optimizing. We at Agora use several tools for tracking them down. The main tools we use are:
Munin, which is trending software: it tracks statistics over time and plots them onto a graph. While we’re not 100% happy with Munin, its issues are easy enough to work around (should you be at the scale that requires working around them), and there’s nothing simpler for getting up and running quickly. It can query MySQL’s statistics counters and plot them, so you can see all manner of data on how your server is behaving: from memory usage, to breakdowns of query types, to number of slow queries, to disk access patterns, and so on.
The MySQL slow query log, which simply logs any queries that took longer to execute than some (easily configured) threshold.
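Enabling the slow query log takes only a couple of statements. A minimal sketch, assuming MySQL 5.1 or later (the one-second threshold and the log path here are just illustrative; add the equivalent settings to my.cnf to make them survive a restart):

```sql
-- Log any query that runs longer than one second.
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;
-- Where to write the log (path is illustrative; the server must be able to write there).
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';
```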
The EXPLAIN statement, which shows how MySQL plans to execute a given query, allowing you to identify how expensive your queries are, and how well-utilized your indices are. (EXPLAIN is very powerful, but also confusing. I recommend looking it up on Google. Here is an explanation in further detail.)
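As a quick sketch of what to look for (the `users` table and `email` column here are made up for illustration):

```sql
-- Prefix any SELECT with EXPLAIN to see the execution plan instead of the results.
EXPLAIN SELECT * FROM users WHERE email = 'foo@example.com';
-- Key columns to read: 'type' (ALL means a full table scan), 'key' (which
-- index, if any, was chosen), and 'Extra' ('Using filesort' and
-- 'Using temporary' are the usual red flags).
```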
The MySQL SHOW PROCESSLIST query, which can help identify locking queries while they’re running. One notable benefit of Percona is that it is much better about showing the current state of a query, and how many rows are being acted upon, than MySQL, which makes it easier to track down what the problematic part of a query is.
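In practice I use the FULL variant, since plain SHOW PROCESSLIST truncates the query text. A sketch (the thread Id is illustrative):

```sql
-- Show every connection, with untruncated query text in the Info column.
SHOW FULL PROCESSLIST;
-- Look for rows stuck in State 'Locked' (MyISAM lock waits) or with a
-- large Time value. If one query is clearly the culprit, you can kill it
-- by the Id shown in the first column:
KILL 12345;
```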
MySQL Tuner, a little Perl script that helps identify common problems on a MySQL server by running SHOW VARIABLES and comparing the output against several heuristics. While this isn’t as useful once you have a lot of experience tuning MySQL’s variables, it’s enormously useful if you haven’t.
Our usual process is to identify when we hit performance issues with Munin, check the slow query log at those times to identify problem queries, and then use EXPLAIN, SHOW PROCESSLIST, and MySQL Tuner (and/or looking directly at SHOW VARIABLES, if you know what you’re doing) to help identify why the queries are problematic.
While I’d love to break down every possible symptom I’ve seen and how it can be fixed, that could easily fill a book. (And, in fact, a quick search on Amazon reveals that it already does.) So, you’re just going to have to use your intuition (and Google) to track it down yourself. The most important quality is having an inquisitive nature. If you see that something’s wrong, instead of looking for a magic word that fixes it, try to understand what’s going on behind the visible symptoms, or try to understand why the magic word works.
Let me walk through common solutions to the three above types of problems to get you started.
Let’s suppose that you’re finding, via SHOW PROCESSLIST, that a number of processes are piling up due to a locked query. What would you do to fix that? I would first start by looking at the table’s engine (most easily seen via SHOW CREATE TABLE). If the table is MyISAM, try converting it to InnoDB. InnoDB has a lot more overhead, but in exchange it never locks a whole table at once; so while individual queries will require more time, they won’t preclude other queries from running. Let’s suppose that your tables are already InnoDB, or that for some reason they can’t be (for example, you’re relying on FULLTEXT indices, which InnoDB does not support). The next thing I’d check is that your various server tuning variables are adequately sized. This is a very complex topic with lots of things to fiddle with, but I recommend running MySQL Tuner as a starting point. Common problems are setting the MyISAM key buffer or the InnoDB buffer pool too small (forcing constant disk access) or so large that the machine starts swapping. If you’ve already checked that, then your problem may be the efficiency of your indices.
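The engine check and conversion look like this (the `scores` table name is hypothetical; note that on older MySQL versions the ALTER itself locks the table while it rebuilds, so run it during a quiet period):

```sql
-- See the current engine, indices, and full table definition.
SHOW CREATE TABLE scores\G

-- Rebuild the table under InnoDB to get row-level locking.
ALTER TABLE scores ENGINE=InnoDB;
```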
You can identify how your indices are being used by running EXPLAIN to see how the query is being executed. If it’s doing a full table scan, creating a temporary table, or performing a filesort, you should either create a new index or restructure the query. Additionally, be aware that indices add overhead: whenever the table is modified, each index needs to be updated as well. Oftentimes the overhead in doing so can cause locks, especially on expensive indices (such as indices on a large number of columns, on large text columns, or any FULLTEXT indices). You may need to evaluate how the table is being accessed and remove any unnecessary indices, reducing that write overhead. For example, we had a number of FULLTEXT indices on GameBattles that were causing a large number of table locks in order to keep them updated. We ended up moving to Sphinx for FULLTEXT indexing and removing the indices in MySQL.
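Both halves of that advice are one statement each. A sketch, where the `matches` table, its columns, and the index names are all hypothetical:

```sql
-- Add a compound index covering both the WHERE and the ORDER BY of a
-- problem query, so EXPLAIN stops reporting a full scan and a filesort.
CREATE INDEX idx_matches_team_date ON matches (team_id, played_at);

-- Drop an index nothing uses; every INSERT/UPDATE pays to maintain it.
DROP INDEX idx_matches_old ON matches;
```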
Finally, the greatest bane of MySQL is hitting disk. If you ever have to swap memory to disk, your database performance will drop like a rock. If Munin or MySQL Tuner indicate that your disk usage is too high, look at increasing your buffer sizes. If you’ve maxed your system’s physical RAM, then you should buy more. If you have adequate RAM but your disks are still too slow, buy faster disks. When in doubt, spring for SSDs.
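One quick way to gauge how often InnoDB is hitting disk, using only the server’s own status counters:

```sql
-- Innodb_buffer_pool_reads counts reads that missed the buffer pool and
-- went to disk; Innodb_buffer_pool_read_requests counts all logical reads.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
-- A miss ratio (reads / read_requests) much above about 1% is a rough
-- rule of thumb that the buffer pool is too small for your working set.
```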
While this is a pretty quick overview of how to approach debugging inefficiencies in your databases, I hope that you’ll find it helpful. If you’re interested in more detailed reviews of any particular point (as there’s an almost bottomless amount of practical wisdom to be acquired when dealing with MySQL), feel free to let me know via the comments.
GameBattles is one of our most popular sites, with an active user base in the millions. While we’ve dealt with sites at that scale before, this particular case has been no walk in the park. While we’re a Rails shop, the fact that the site is written in PHP actually has nothing to do with the difficulties we’ve encountered. The real problems are that the codebase is almost a decade old, has been developed by dozens of developers over that time, has grown organically with a clear roadmap defined only recently, and has almost never been refactored.
So, at various intervals over the last year, I (and others) have been digging into the site and trying to clean up what we can. The process we’ve been following seems interesting enough to share, but I’ve hesitated to talk about it at any length, as I don’t like talking about something without quantifiable statistics. But the fact of the matter is, programming is an art, not a science, and anyone who says different is selling something. And the real trick with refactoring is that it doesn’t meet any immediate business goals: its purpose is entirely human, as it exists to make a clean and sane working environment. The tangible business gains (security, performance, developer velocity, etc.) are all secondary, and so unless your managers “get it,” it can be a hard thing to argue for.
In our case, it’s no secret that GameBattles' stability was going downhill, and so we needed to do something. And the only way for us to be able to audit for problems was to make the codebase manageable.
Let me give you a sense of where we started: the code was in version control, but only in the technical sense of the word; in reality, SVN was being used more as an excessively complicated rsync—it was a means to transport code from one developer machine to many web servers. The repository was over two gigabytes in size, about a third of which was PHP code, and the remainder being static assets. (And this is in addition to static assets housed elsewhere, such as on our CDN.) There were at least four PHP frameworks and three CMSs. Some of the code was object-oriented PHP5, but plotting a class diagram would have required at least four dimensions. If you think you’ve seen spaghetti code, think again: walking into this mess was like walking into the alien hive.
Always, always start with the low-hanging fruit. Your task is both difficult and boring. Don’t make it any harder than it already has to be.
Similarly, start with broad strokes. Try to find where you’ll experience the biggest gains first, and don’t work on things with small returns until you’re reasonably sure there’s nothing bigger.
However, working on anything is better than working on nothing, so don’t dither. Be decisive.
Make lists. I, personally, keep three: alive, dead, and unsettled, each referring to files in the codebase. Alive files are those known to be (at least partially) necessary to the core functionality of the site. Dead files are those known to be not referenced by anything alive. Unsettled files are those you just don’t know enough about. Never fail to keep these lists up to date, even though you’ll be changing them almost constantly. Yes, it’s tedious, but it’s necessary.
Use common sense. Trust your feelings. If something seems important, it probably is. If something seems useless, it probably is. If the files haven’t been updated in over two years, they’re very likely dead.
Sometimes you’re lucky and you can nuke entire files. Sometimes you’re not and you have to perform surgery on files to remove the parts that don’t need to be there. Don’t be afraid to dive in, but remember: don’t get in over your head. If you need to make a lot of changes for a small gain, move on to somewhere else for now. You can always come back later.
If you’re not on a UNIX system, get on one. Your best friends in this process are _find_ and _grep_, and I can’t even imagine where you’d begin without them. Find is a tool that lets you scan for files on disk based off of a number of parameters—name, type, last updated time, etc. Grep, of course, lets you search the contents of files. You can chain them together, too: I have probably typed `find . -type f -name '*.php' | xargs grep 'search term'` more times than I can count—that snippet will search all PHP files in (or below) the current directory and tell you which ones contain the search term. (You can also just use `grep -R 'search term' .`, which does nearly the same thing, if you don’t have a lot of binary files that will take up a lot of search time.) This is particularly handy for seeing if a class or database table is referenced somewhere.
Looking at the database is a great heuristic for what’s important and what’s not. Recent versions of MySQL (which is what we use for GameBattles' backend) actually have a lot of useful metadata hiding in the information_schema.TABLES table—things like how large a table is (both in rows, and bytes), when it was last written to, and so on. If you find tables that haven’t been updated in a year, then they’re probably not used, and any files that reference them probably aren’t either. Additionally, if you find tables that are written to but never read from, they’re dead. Delete the references and then delete the table.
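A sketch of that metadata query (the `gamebattles` schema name is illustrative; note that UPDATE_TIME is reliably populated for MyISAM tables, but may be NULL for InnoDB on older MySQL versions):

```sql
-- List each table's row count, on-disk size, and last write time,
-- oldest-written first, so dead tables float to the top.
SELECT TABLE_NAME,
       TABLE_ROWS,
       ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024) AS size_mb,
       UPDATE_TIME
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'gamebattles'
ORDER BY UPDATE_TIME;
```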
Make backups of everything. Date them. Have a rollback plan in writing for when you delete something a little overzealously, since your brain is going to be far too fried when you’re in the thick of it to do anything other than follow instructions.
Once you’ve gotten rid of something, deploy it immediately. Not only will you find if you broke the world while what you did is still fresh in your mind, but it’s also remarkably cathartic to put some closure on what you just accomplished. And you will need the positive reinforcement.
I cannot emphasize the human factor of the above notes enough. The entire point of refactoring is maintaining your sanity. Don’t sacrifice it in the process.
While the labor is toilsome, the rewards are unparalleled. GameBattles’ codebase is currently an eighty-megabyte Git repository, less than four percent of the size of what we started with. The number of potential security holes has dropped like a rock. Our uptime has gone from less than two “nines” to almost four, and is continuing to improve. Our developers are moving faster than ever. Our users are happier. And, most importantly, I haven’t lost my mind. (Yet.)