I came across an interesting problem at work a couple of days ago. One of our CakePHP-based applications has a scheduled task (cron job) that runs every minute and imports mail from a number of IMAP mailboxes into the database. All of a sudden the script broke, and in a very particular way. Not a single line of code was changed in the script itself, and there were no significant changes in the rest of the application that the script uses. But one day, for some reason, it started to import emails from only half of the mailboxes, completely ignoring the rest.
A lengthy midnight drunk debugging session helped me pin down the problem. Apparently, the list of mailboxes to check included a few old mailboxes that were no longer used for any email. The script was checking them, but they were always empty. The server administrator removed those mailboxes during a scheduled maintenance window. Now the script couldn’t connect to them anymore. No problem, there was error handling for this case. But what the error handling didn’t take into account was the combination of the time it takes to time out on a non-existing mailbox and the database connection timeout.
It so happened that these few old mailboxes were right in the middle of the list of all mailboxes. The script was importing mail from the first few mailboxes just fine. But then, while it was taking a long time to time out on the few non-existing mailboxes, the database connection got closed due to the inactivity timeout. This wasn’t handled properly. The script only checked that the database connection existed at the beginning, when it opened the connection, and never again later on. As a result, nothing could be imported from the rest of the mailboxes.
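To make the sequence of events concrete, here is a small simulation of the failure and of one possible fix. It is a sketch in Python rather than the actual CakePHP code, and everything in it is made up for illustration: the mailbox names, the timings, the 45-second idle limit, and the `FakeClock` and `IdleTimeoutConnection` classes, which stand in for real time and a real database connection. It also assumes that the failed inserts were silently swallowed.

```python
class FakeClock:
    """Simulated time, so the example runs instantly."""
    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class IdleTimeoutConnection:
    """Hypothetical stand-in for a database connection that the
    server drops after `idle_limit` seconds without a query."""
    def __init__(self, clock, idle_limit):
        self.clock = clock
        self.idle_limit = idle_limit
        self.last_used = clock.now

    def is_alive(self):
        return self.clock.now - self.last_used <= self.idle_limit

    def execute(self, statement):
        if not self.is_alive():
            raise ConnectionError("closed by server (idle timeout)")
        self.last_used = self.clock.now
        return statement


# (name, seconds spent on the IMAP side, mailbox still exists?)
MAILBOXES = [
    ("sales", 1, True),
    ("support", 1, True),
    ("old-a", 30, False),  # removed mailbox: slow timeout
    ("old-b", 30, False),  # removed mailbox: slow timeout
    ("billing", 1, True),
    ("info", 1, True),
]


def run(check_before_use, idle_limit=45):
    """Import from every mailbox; return the names that succeeded."""
    clock = FakeClock()

    def connect():
        return IdleTimeoutConnection(clock, idle_limit)

    conn = connect()  # the connection is opened once, up front
    imported = []
    for name, seconds, exists in MAILBOXES:
        clock.advance(seconds)  # fetching mail, or waiting for a timeout
        if not exists:
            continue  # the existing error handling: skip the mailbox
        if check_before_use and not conn.is_alive():
            conn = connect()  # the fix: re-check right before each use
        try:
            conn.execute("INSERT mail from " + name)
            imported.append(name)
        except ConnectionError:
            pass  # assumed: the failure was swallowed silently
    return imported


print(run(check_before_use=False))  # ['sales', 'support']
print(run(check_before_use=True))   # ['sales', 'support', 'billing', 'info']
```

Without the check, the two slow timeouts add up to more than the idle limit, so every insert after them fails. Checking the connection right before each use, and reconnecting if needed, lets the rest of the list go through.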
As always, once you know the problem, you also know how to fix it or work around it. But troubleshooting a problem like this is tricky. Seeing the queries work for one mailbox and fail for another is disorienting, to say the least. Having no recent changes to the application doesn’t help either. And it is far from obvious that the problem is not in the code itself, but in the resources it depends on.
So, here you go. If you ever have a weird problem like that, check that your resources (database connection, network connection, file handle) are still available to you.