Jayson Minard wrote a very good article on upgrading a production site: what can go wrong, and what we can learn from it.
Yesterday, I performed an upgrade to a third-party package used with Zend Developer Zone. Its automated schema update system silently performed actions on the database that had a large impact on ZDZ and related sites, causing an outage. There are good lessons from my post-mortem that I would like to share with the community.
The Start of the Problem
First, let us look at the actual list of actions that started the issue:
1. The upgrade does a schema check on first load
2. The upgrade then corrects the schema to be valid for the new release (performing table changes via DDL)
3. The upgrade then may modify large amounts of existing data, or delete large amounts of old data
These schema and data updates can cause serious problems when the database and tables are used concurrently by the online site. First, the DDL changes will lock the affected tables. With some storage engines (e.g. MyISAM in MySQL), the modifications or deletes will also take table locks; in other engines they can cause contention on locked rows, or cause things like rollback segments to overflow. The online site then waits behind the locks or contention, and each waiting request holds a web server thread until no threads are left to serve actual user requests. No more threads, no more site.
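The thread exhaustion described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a model of the actual ZDZ stack: the function name `simulate_outage`, the pool size, and the request count are all hypothetical. A plain `threading.Lock` stands in for the table lock taken by the upgrade's DDL, and a semaphore stands in for the web server's fixed pool of worker threads.

```python
import threading


def simulate_outage(pool_size, incoming_requests):
    """Simulate requests arriving while an upgrade holds a table lock.

    Returns (served, rejected): ids of requests that eventually ran,
    and ids of requests turned away because no worker thread was free.
    """
    table_lock = threading.Lock()                     # stand-in for the DDL table lock
    free_workers = threading.Semaphore(pool_size)     # stand-in for the web server thread pool
    served, rejected, threads = [], [], []

    def handle(request_id):
        try:
            # Every request touches the locked table, so it waits here
            # for as long as the upgrade holds the lock.
            with table_lock:
                served.append(request_id)
        finally:
            free_workers.release()                    # worker returns to the pool

    table_lock.acquire()                              # the upgrade starts its schema change
    for request_id in range(incoming_requests):
        # The dispatcher hands the request to a worker only if one is free.
        if free_workers.acquire(blocking=False):
            worker = threading.Thread(target=handle, args=(request_id,))
            worker.start()
            threads.append(worker)
        else:
            rejected.append(request_id)               # pool exhausted: the site looks down
    table_lock.release()                              # the upgrade finishes

    for worker in threads:
        worker.join()
    return sorted(served), rejected


if __name__ == "__main__":
    served, rejected = simulate_outage(pool_size=3, incoming_requests=8)
    print("served after the lock was released:", served)
    print("rejected while the pool was exhausted:", rejected)
```

With a pool of three workers, the first three requests each take a worker and block behind the lock, and every later request finds the pool empty. The upgrade never has to fail for the site to go down; it only has to hold the lock longer than the pool can absorb incoming traffic.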
Learn from the problem; via "Yes, I Crashed the Site!"