Exporting messages from Gmail with fetchmail and procmail

One of the projects that I am involved in has a requirement to import all the historical emails from a number of Gmail accounts into another system.  It's not the most challenging of tasks, but since I spent a bit of time on it, I figured I should blog it here too, just in case a similar need arises in the future.

In my particular case, I need two different solutions: one for exporting all of the messages from all folders of all the Gmail accounts in question (Gmail for Work), and another for exporting only the messages from the "Sent Mail" folder that were sent on specific dates.

The solution that I derived is based on the classic tools for this purpose – fetchmail and procmail.  Fetchmail is awesome at fetching emails using all kinds of protocols.  Procmail is amazing at sorting, filtering, and otherwise processing the email messages.

So, here we go.  First of all, we need to tell fetchmail where to get the messages from.  I didn't want to create two separate configurations, one for each of my tasks, so I left only the options common to both in the configuration file; the rest I will be passing as command line arguments, depending on the scenario.

Note that I've been running these tests from a dedicated environment, where I only had the root user.  You don't have to run it as root – it'll work just fine as any other user.  Also, keep in mind that I used the "/root/fetchmail-test/" folder for my test runs.  You might need to adjust the paths if your setup is different.

Here’s my fetchmail.rc file, which I used to test a single mailbox.  A new “poll” section will go into this file later, for each mailbox that I’ll need to export.

poll imap.gmail.com proto imap:
username "someuser@gmail.com" is root here
password "somepass"
fetchall
keep
ssl

If you are not root, you might need to adjust the second line, replacing "root" with your username. Also, for testing purposes, you can use "fetchlimit 1" instead of "fetchall".
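
One more note on this file: fetchmail tends to be picky about the permissions of its run control file and will usually refuse to start if the file is readable by the group or others (it does contain a password, after all), so it's worth tightening them:

chmod 600 /root/fetchmail-test/fetchmail.rc

And, for reference, here is roughly what the file could look like once more mailboxes go in.  The second account below is purely hypothetical, and I'm leaning on fetchmail's "via" keyword so that each poll entry gets its own label even though they all point at imap.gmail.com – treat this as a sketch and adjust it to your accounts:

poll gmail-someuser via imap.gmail.com proto imap:
username "someuser@gmail.com" is root here
password "somepass"
fetchall
keep
ssl

poll gmail-otheruser via imap.gmail.com proto imap:
username "otheruser@gmail.com" is root here
password "otherpass"
fetchall
keep
ssl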

Now, we need two configuration files for procmail.  The first one is super simple – I'll use it to push all downloaded messages into a single giant mbox file.  Here's the procmail-all.rc:

VERBOSE=0
DEFAULT=/root/fetchmail-test/fetchmail.all.mbox

As you can see, it only defines the verbosity level and the default mailbox.  The second configuration file is a bit more complicated.  I'll use it for the sent items only.  Limiting the export to the Sent Mail folder will be done with fetchmail; what I want to do on top of that is disregard all messages which were not sent on specific dates.  Here is my procmail-sent.rc:

VERBOSE=0
DEFAULT=/dev/null
:0
* ^Date: .*28 Jul 2016.*|\
^Date: .*27 Jul 2016.*
/root/fetchmail-test/fetchmail.sent.mbox

Again, we have the verbosity level and the default mailbox to save messages to.  Since I want to disregard messages unless they match a certain condition, I specify /dev/null as the default.  Then I specify my condition, which is simply a couple of regular expressions matching the Date header.  Usually, the Date header is not very reliable, as different MUAs (Mail User Agents) use different formats, time zones, etc.  In this particular case the test results seemed consistent (maybe Gmail fixes the header), and I didn't have any other, more reliable criteria to use.

As you can see, I use a very basic condition for date matching. So, if the Date header matches either “28 Jul 2016” or “27 Jul 2016“, the message is saved in the mbox file, rather than being thrown into the default mailbox.
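
By the way, a quick way to sanity-check this recipe before a full run is to feed procmail a single raw message on standard input and see where it lands.  The .eml file below is a hypothetical message saved by hand, just for testing:

# If the Date header matches one of the patterns above, the message is
# appended to fetchmail.sent.mbox; otherwise it goes to /dev/null.
procmail /root/fetchmail-test/procmail-sent.rc < test-message.eml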

Now, all I need is a way to tie fetchmail and procmail together, as well as provide some additional options.  For that, I created two one-liner shell scripts, just so that I won't need to figure out the command line arguments again if I look at this whole thing six months later.

Here is the check-all.sh script (multi-line for readability):

#!/bin/bash
fetchmail -f fetchmail.rc \
-r "[Gmail]/All Mail" \
--mda "procmail /root/fetchmail-test/procmail-all.rc"

and here is the check-sent.sh script (multi-line for readability):

#!/bin/bash
fetchmail -f fetchmail.rc \
-r "[Gmail]/Sent Mail" \
--mda "procmail /root/fetchmail-test/procmail-sent.rc"

If you run either one of these scripts, you'll see output similar to this:

$ ./check-all.sh
fetchmail: WARNING: Running as root is discouraged.
410 messages for someuser@gmail.com at imap.gmail.com (folder [Gmail]/All Mail).
reading message someuser@gmail.com@gmail-imap.l.google.com:1 of 410 (446 header octets) (222 body octets) not flushed
reading message someuser@gmail.com@gmail-imap.l.google.com:2 of 410 (869 header octets) (230 body octets) not flushed
reading message someuser@gmail.com@gmail-imap.l.google.com:3 of 410 (865 header octets) (230 body octets) not flushed
...
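
Once a run finishes, a simple way to verify the result is to count the messages that ended up in the mbox files – in mbox format every message starts with a "From " separator line, so a quick grep does the job (paths as in the examples above):

# Count the messages collected by each of the two scenarios
grep -c '^From ' /root/fetchmail-test/fetchmail.all.mbox
grep -c '^From ' /root/fetchmail-test/fetchmail.sent.mbox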


Test your backups!

You can read all the books in the world and know all there is to know, but if you don’t follow the wisdom and practice the knowledge, then it’s all useless.  That’s my lesson from yesterday.

The Tao of Backup, which I linked to before, has a whole chapter on exactly this – testing your backups.

So, what happened?  Well, as I was preparing for the Fedora 24 installation, I wanted to back up some of my files, as the partition would be formatted.  I connected an external USB drive with plenty of space and ZIP-archived a few of the vital directories onto it.

That was a very simple backup procedure and I saw the resulting files on the volume.  What else should I do, right?  Wrong!  I should have tested the restore.  I didn’t.

Most of the directories that I backed up were small – /etc, /opt, /root.  But my /home directory was about 20 GB.  The external USB disk used the FAT-32 file system, which has a 4 GB file size limit.  So only the first 4 GB of my /home folder were backed up.  Funny enough, those files were mostly browser cache and image thumbnails – stuff that should be excluded from backups.  The two main folders that I wanted – Desktop and .ssh – were not part of the backup.  And I only realized that after the partition had been formatted.

So, yeah, I should have tested the backup.

P.S.: Luckily, I do have backups elsewhere, and most of my work is committed to GitHub/BitBucket anyway.

Tao of Backup


Tao of Backup is yet another way to tell people to back up their files.  Not only does it explain why it is important, but also how to do it properly.  My favorite chapter is the one on testing:

The novice asked the backup master: “Master, now that my backups have good coverage, are taken frequently, are archived, and are distributed to the four corners of the earth, I have supreme confidence in them. Have I achieved enlightenment? Surely now I comprehend the Tao Of Backup?” The master paused for one minute, then suddenly produced an axe and smashed the novice’s disk drive to pieces. Calmly he said: “To believe in one’s backups is one thing. To have to use them is another.”

The novice looked very worried.

Funny, but so true.

Backup your data! Unless you are Google

BBC reports that one of the Google data centers experienced data loss after a nearby power facility was struck by lightning four times in a row.  Reportedly, only about 0.000001% of the total disk space was permanently affected.

A thing called “backup” immediately comes to mind.  This was something I had to deal with in pretty much every company I worked for as a sysadmin.  Backup your data or lose it, right?

Well, maybe.  For most of those companies a dedicated storage or a couple of tape drives could easily solve the problem.  But Google is often special in one way or the other.

A quick Google search (hehe, yup) for how much data Google stores brings up this article from last year, linking to this estimation approach – there are no officially published numbers, so an estimate is all we can do: 10-15 exabytes.  (10-15 exabytes, Carl!)  And that's from last year.

Using this method, they determined that Google holds somewhere around 10-15 exabytes of data. If you are in the majority of the population that doesn’t know what an exabyte is, no worries. An exabyte equals 1 million terabytes, a figure that may be a bit easier to relate to.

Holy Moly, that's a lot of data!  To back this up, you'll need at least double the storage.  And some really lightning-fast (pun intended) technology.  Just to give you an idea, some of the fastest tape drives have a throughput of about 1 TB/hour and a native capacity of about 10 TB (have a look here, for example).  The backup process will take about … forever to complete.
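
Just for fun, here is a rough back-of-envelope calculation in shell, using the 10 exabyte estimate and the 1 TB/hour per-drive figure from above (very round numbers, obviously, and ignoring compression, retries, and the second copy):

# 10 EB is roughly 10,000,000 TB
echo "Terabytes to back up: $((10 * 1000 * 1000))"
# A single drive at ~1 TB/hour
echo "Single drive: $((10 * 1000 * 1000)) hours, or about $((10 * 1000 * 1000 / 24 / 365)) years"
# Even with 10,000 drives streaming in parallel
echo "10,000 drives in parallel: about $((10 * 1000 * 1000 / 10000 / 24)) days"

That works out to over a thousand drive-years of streaming, or more than a month even with ten thousand drives running flat out.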

So if tapes are out, then we are backing up onto another storage.  Having the storage in the same data center sort of defeats the purpose (see above regarding “lightning”).  Having a storage in another data center (or centers) means you’ll need some super fast networks.

You could probably do quite a bit of optimization with incremental and differential backups, but you’d still need quite a substantial infrastructure.

It's simpler, I guess, to just spread your data across many data centers with several copies all over the place, and hope for the best.

But that’s for Google.  For the rest of us, backup is still an option.  (Read some of these horror stories if you are not convinced yet.)

And since we are on the subject of backups, let me ask you this: how are you doing backups?  Are you still with tapes, or local NAS, or, maybe, something cloud-based?  Which software do you use? What’s your strategy?

For me, dealing with mostly small setups, Amazon S3 with HashBackup is sufficient.  I don't even need to rotate the backups anymore – I just do a full backup daily.

The Cost Of Loss

SingleHop – a cloud-based hosting company – created this infographic on the cost of loss for when your backups aren't up to par.  It should work well as a reminder, especially if printed out and hung on the wall in front of a sysadmin (but also somewhere where the management can occasionally see it too).

SingleHop BaaS Infographic