march: message archiving scripts


abstract
people get a lot of email. large mailboxes are inefficient. usage patterns suggest that old mail is less likely to be accessed than recent mail. an automated method of archiving mail would be pretty useful.

i've thrown together some simple scripts which operate on mbox mailboxes to expire messages from current folders after a set number of days. messages are placed in monthly archive folders in a separate directory hierarchy.

this "generational" approach works well, as it maintains a fixed-length cache of recent messages at all times. simply archiving whole mailboxes, say at the end of the month, keeps some messages too long and expires others too early. the price of the generational approach is some extra cpu and time on every run, but nothing is free.
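
as a rough illustration of the expiry test, here is a minimal shell sketch. it is not taken from the march scripts (which are perl). the name `is_expired`, its arguments and its output words are made up for this example. it assumes a message's age is measured from its Date: header, and it relies on GNU date's -d option.

```shell
# hypothetical helper, not part of march.
# usage: is_expired <Date: header value> <max age in days> [now, in epoch seconds]
# prints "expire", "keep", or "baddate".
is_expired() {
    msg_date=$1
    max_days=$2
    now=${3:-$(date +%s)}   # third argument lets a caller pin "now"

    # an empty or unparseable date cannot be aged (GNU date exits non-zero)
    if [ -z "$msg_date" ] || ! msg_epoch=$(date -d "$msg_date" +%s 2>/dev/null); then
        echo baddate
        return
    fi

    # whole days between the message date and now
    age_days=$(( (now - msg_epoch) / 86400 ))
    if [ "$age_days" -gt "$max_days" ]; then
        echo expire
    else
        echo keep
    fi
}
```

with a three week window, `is_expired "$date_header" 21` would say "keep" for anything newer than 21 days, which is what keeps the current folder a fixed-length cache.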

features
variable message expiry time. three weeks works well for me.

works with qmail's "home dir sticky bit" feature and procmail lockfiles to avoid mailbox access collisions while filing.

strips out the UW IMAP internal-data pseudo-messages ("DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA"). rewriting the mailbox renders the cached info stored in that message useless anyway.

uncompresses gzipped archive folders before appending new messages.
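
the last feature can be pictured with a short shell sketch. this is only an illustration under assumptions: `append_to_archive` is a made-up name, the compressed folder is assumed to sit next to the plain one with a `.gz` suffix, and the real perl scripts may go about it differently.

```shell
# hypothetical helper, not part of march.
# usage: some-command-printing-a-message | append_to_archive <archive folder path>
append_to_archive() {
    archive=$1

    # if only a gzipped copy exists, restore the plain mbox file first
    if [ ! -e "$archive" ] && [ -e "$archive.gz" ]; then
        gzip -d "$archive.gz" || return 1
    fi

    # append the message read from stdin to the (now plain) archive folder
    cat >> "$archive"
}
```

the folder is left uncompressed afterwards, so something like the gzip cron job further down would be what compresses it again.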

requirements
a recent version of perl with the IPC::Open2 and Date::Parse modules. to find out if your perl qualifies (no output means both modules loaded):
perl -e 'use IPC::Open2;use Date::Parse;'

formail. comes with procmail, which you should be using anyway :)

download
the most recent version is here.

installation
untar. copy the scripts into your path. edit the rc file to suit. optionally call from your crontab, like so:
# every tuesday/friday, archive all current mailboxen
55 4 * * 2,5 find $HOME/mail/current -type f -print0 | xargs -0 -n 1 march-box
the following may be handy to tame excess disk use. another step in the generational approach.
# every wednesday, compress uncompressed archives over two weeks old
45 23 * * 3 find $HOME/mail/archive -type f -name '[A-Za-z0-9]*[0-9]' -mtime +14 -print0 | xargs -0 gzip -9
note: more recent versions of gzip (1.3) complain if there are no arguments, which can happen if no files are found by "find". if your xargs takes the "-r" flag (GNU xargs) you can do this instead to silence the error message:
 
# every wednesday, compress uncompressed archives over two weeks old
45 23 * * 3 find $HOME/mail/archive -type f -name '[A-Za-z0-9]*[0-9]' -mtime +14 -print0 | xargs -0r gzip -9
caveats
does not yet handle mailboxes in subdirectories of the current mailbox folder. this will be remedied soon.

to the best of my knowledge and testing, the scripts do not lose messages. i've run them against 10,000-message boxes at a time, and every message is accounted for afterwards. as a precaution, the box is copied to a backup directory before any processing is done, and remains there until the next run of the program. possibly a waste of space, but i prefer the peace of mind. easily disabled if you like.

*however*, if you interrupt the archive process (ctrl-c for instance) and rerun it on the same mailbox, you will likely get duplicates of the messages you've archived up to that point. this is because the script replaces the entire mailbox when bailing out, even though it may have archived some of the messages. this is considered a bug to be fixed in a future release.

the mailbox locking routines require that procmail delivery recipes use local lockfiles. a good idea regardless.
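
a procmail recipe with a local lockfile (`:0:`) takes `<mailbox>.lock` by default while it delivers, so a script that wants to stay out of its way has to take the same file. the sketch below is only an illustration and not the code march uses. `with_box_lock` and `LOCK_TRIES` are made-up names. it creates the lockfile with the shell's noclobber option, where a real script would more likely call procmail's own `lockfile` utility.

```shell
# hypothetical wrapper, not part of march.
# usage: with_box_lock <mailbox path> <command> [args...]
# runs the command while holding <mailbox>.lock, then releases the lock.
with_box_lock() {
    box=$1
    shift
    lock="$box.lock"
    tries=0

    # with noclobber (set -C) the redirection fails if the lockfile exists,
    # so only one process at a time gets past this loop
    until ( set -C; : > "$lock" ) 2>/dev/null; do
        tries=$((tries + 1))
        # give up with a "temporary failure" style status after a few attempts
        if [ "$tries" -ge "${LOCK_TRIES:-5}" ]; then
            return 75
        fi
        sleep 1
    done

    "$@"
    status=$?
    rm -f "$lock"
    return "$status"
}
```

a call might look like `with_box_lock "$HOME/mail/current/inbox" march-box "$HOME/mail/current/inbox"`. without local lockfiles on the procmail side there is nothing for the two programs to agree on, which is why the recipes need them.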

since filing is based on the Date: header, it is somewhat at the whim of the sender where the message will finally end up. incorrectly set clocks will cause the message to be placed in the correspondingly incorrect folder. unparseable or nonexistent dates will cause the message to be placed in the "foldername-baddate" folder.
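
to make the filing rule concrete, here is a small shell sketch of how a Date: header could map to an archive folder name. it is a guess at the shape, not march's code. `archive_folder` is a made-up name, the `-YYYYMM` suffix is an assumed naming scheme (only the `-baddate` suffix comes from the text above), and GNU date's -d option is required.

```shell
# hypothetical helper, not part of march.
# usage: archive_folder <folder name> <Date: header value>
# prints "<folder>-YYYYMM" (assumed naming), or "<folder>-baddate".
archive_folder() {
    folder=$1
    msg_date=$2

    # -u files by UTC month; drop it to file by the local timezone instead.
    # GNU date exits non-zero when it cannot parse the string.
    if [ -n "$msg_date" ] && month=$(date -u -d "$msg_date" +%Y%m 2>/dev/null); then
        echo "$folder-$month"
    else
        echo "$folder-baddate"
    fi
}
```

a sender whose clock is set to the wrong year gets a perfectly parseable date, so the message lands in a valid but wrong monthly folder rather than in "-baddate".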

this would be a whole lot simpler with mh format boxes, but i don't like them.

news
24 may 04
version .3 released. the executables no longer need to be in your $PATH. march-box uses its own location to find march-msg.

3 may 01
version .2 released. rename executables. handle SIGTERM more gracefully. per-user config file. some code cleanup.

2 may 01
version .1 released.

todo
handle nested folders, mailboxes in subdirectories.

better error recovery.

code cleanup.

guard against duplicates when filing (formail msgid check).

auto-compress archive folders when done filing.


last modified: 24 may 04