Corpora
Ling-spam: A mixture of 481 spam messages and 2412 messages sent via the Linguist list, a moderated (hence, spam-free) list about the profession and science of linguistics. Attachments, HTML tags, and duplicate spam messages received on the same day are not included. 

PU1: A mixture of 481 spam messages and 618 legitimate messages received by a particular user, after replacing each token (i.e., word, number, punctuation mark, etc.) by a unique number throughout the corpus. Only the earliest five legitimate messages of each sender are retained. Attachments, HTML tags, and duplicate spam messages received on the same day are not included. 
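
The token-to-number replacement described for PU1 can be illustrated with a minimal sketch. The function name `encode_corpus`, the regular-expression tokenizer, and the choice to number tokens from 1 in order of first appearance are all assumptions made for illustration; they are not taken from the actual encoding procedure used to build PU1.

```python
import re

def encode_corpus(messages):
    """Replace each token with an integer that is the same throughout the corpus.

    Hypothetical helper: the tokenizer below (runs of word characters, or
    single punctuation marks) is an assumption, not the PU1 tokenizer.
    """
    token_ids = {}   # token -> integer ID, shared by all messages
    encoded = []
    for text in messages:
        tokens = re.findall(r"\w+|[^\w\s]", text)
        # Assign the next unused ID the first time a token is seen.
        ids = [token_ids.setdefault(tok, len(token_ids) + 1) for tok in tokens]
        encoded.append(" ".join(str(i) for i in ids))
    return encoded, token_ids

encoded, mapping = encode_corpus(["Buy now!", "Buy later."])
print(encoded)
```

Because the mapping is shared across the whole corpus, the same word receives the same number in every message, so word-frequency statistics are preserved while the original text is not readable.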

PU123A: Four corpora, based on private mailboxes, as in PU1 above. Unlike the earlier form of PU1, the corpora in this directory are only in "bare" form: tokens are separated by whitespace characters, but no stop-list or lemmatizer has been applied. Apart from this difference and the distribution of the messages in the 10 parts, the PU1 corpus in this directory is the same as the PU1 corpus above. Attachments, HTML tags, and duplicate spam messages received on the same day are not included. For more information, please read the detailed description.

Enron-spam: Preprocessed and raw forms of the Enron-Spam datasets. The "preprocessed" directory contains the messages in preprocessed format. Attachments, HTML tags, and duplicate spam messages received on the same day are not included. The "raw" directory contains the messages in their original form. Spam messages in non-Latin encodings, ham messages sent by the owners of the mailboxes to themselves (sender in the "To:", "Cc:", or "Bcc:" field), and a handful of virus-infected messages have been removed, but no other modification has been made. For more information, please read the detailed description.
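
The self-addressed filter mentioned for the raw Enron-spam messages (sender also appearing in "To:", "Cc:", or "Bcc:") could be approximated as follows. This is only a sketch using Python's standard `email` package; the function name `is_self_addressed` and the case-insensitive address comparison are assumptions, not the procedure actually used to build the corpus.

```python
from email import message_from_string
from email.utils import getaddresses

def is_self_addressed(raw_message):
    """Return True if a sender address also appears in To:, Cc:, or Bcc:.

    Hypothetical helper; comparing lowercased addresses is an assumption.
    """
    msg = message_from_string(raw_message)
    senders = {
        addr.lower()
        for _, addr in getaddresses(msg.get_all("From", []))
        if addr
    }
    recipients = set()
    for field in ("To", "Cc", "Bcc"):
        recipients.update(
            addr.lower()
            for _, addr in getaddresses(msg.get_all(field, []))
            if addr
        )
    # Any overlap means the sender wrote to themselves.
    return bool(senders & recipients)

raw = "From: Alice <alice@example.com>\nTo: bob@example.com\nCc: ALICE@example.com\n\nhi"
print(is_self_addressed(raw))
```

Messages for which the function returns True would be the candidates for removal under this reading of the description.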


Software

MailboxEncoder version 2.2 (online documentation): A program to create encoded benchmark corpora like the PU1 corpus above.

i-config would be happy to publicize on this page new corpora created with the MailboxEncoder program.