|
Ling-spam: A mixture of 481 spam messages and 2412 legitimate messages
sent via the Linguist list, a moderated (hence,
spam-free) list about the profession and science of linguistics. Attachments,
HTML tags, and duplicate spam messages received on the same day are not included.
PU1: A mixture of 481 spam messages and 618 legitimate messages
received by a particular user, after replacing each token (i.e., word, number,
punctuation mark, etc.) by a unique number throughout the corpus. Only the
earliest five legitimate messages of each sender are retained. Attachments, HTML
tags, and duplicate spam messages received on the same day are not included.
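
The token replacement described for PU1 can be illustrated with a minimal sketch. It assumes whitespace tokenization and assigns numbers in order of first appearance, starting from 1; the function name `encode_corpus` is hypothetical, and the actual numbering scheme used to build the corpus is not stated here.

```python
def encode_corpus(messages):
    """Replace each distinct token with a unique integer, consistently across all messages.

    Assumption: tokens are whitespace-separated and IDs start at 1.
    """
    token_ids = {}  # token -> integer ID, shared by the whole corpus
    encoded = []
    for message in messages:
        ids = []
        for token in message.split():
            if token not in token_ids:
                token_ids[token] = len(token_ids) + 1
            ids.append(token_ids[token])
        encoded.append(ids)
    return encoded, token_ids


encoded, token_ids = encode_corpus(["buy now !", "buy later"])
print(encoded)
```

Because the mapping is shared across the corpus, the same token always receives the same number, so word statistics remain usable while the original text is hidden.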
PU123A: Four corpora, based on private mailboxes, as in PU1 above. Unlike the earlier
form of PU1, the corpora in this directory are only in "bare" form: tokens are separated
by whitespace characters, but no stop-list or lemmatizer has been applied. Apart from this
difference and the distribution of the messages in the 10 parts, the PU1 corpus in this
directory is the same as the PU1 corpus above. Attachments, HTML
tags, and duplicate spam messages received on the same day are not included.
For more information, please read the detailed description.
Enron-spam: Preprocessed and raw forms of the Enron-Spam datasets. The "preprocessed" directory
contains the messages in preprocessed format. Attachments, HTML tags, and duplicate
spam messages received on the same day are not included. The "raw" directory contains the
messages in their original form. Spam messages in non-Latin encodings, ham messages sent by the owners
of the mailboxes to themselves (sender in the "To:", "Cc:", or "Bcc:"
field), and a handful of virus-infected messages have been removed,
but no other modification has been made. For more information, please read the detailed description.
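
As a rough illustration of the self-addressed filter mentioned for the "raw" directory, the sketch below flags a message whose sender address also appears in its "To:", "Cc:", or "Bcc:" field. It uses Python's standard `email` package; the function name `is_self_addressed` and the case-insensitive address comparison are assumptions, not the procedure actually used to build the datasets.

```python
from email import message_from_string
from email.utils import getaddresses


def is_self_addressed(raw_message):
    """Return True if a sender address also appears among the recipients.

    Assumption: addresses are compared case-insensitively.
    """
    msg = message_from_string(raw_message)
    senders = {
        addr.lower()
        for _, addr in getaddresses(msg.get_all("From", []))
        if addr
    }
    recipients = set()
    for header in ("To", "Cc", "Bcc"):
        recipients.update(
            addr.lower()
            for _, addr in getaddresses(msg.get_all(header, []))
            if addr
        )
    return bool(senders & recipients)


raw = "From: Alice <alice@example.com>\nTo: bob@example.com\nCc: ALICE@example.com\n\nhi"
print(is_self_addressed(raw))
```

A filter like this would be applied only to ham messages, since the description above removes self-addressed messages from the legitimate side of each mailbox.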
|