Train Bayes with an email archive

r31griffo
Posts: 19
Joined: 31 Mar 2017 05:09

Post by r31griffo »

G'day everyone,

I've recently implemented eFa (3.0.2.2) to filter inbound email for my Exchange 2013 server, which receives 50 to 75 emails a day. We don't receive much spam, but after a couple of CryptoLocker-style attachments were executed recently :roll: I've taken it upon myself to do something about it.

With so few emails it will take some time for the Bayes database to start working, and it may never be very effective given the small dataset it has to work with. To speed up the process, I've stumbled across an archive of spam messages that could help populate the dataset:
http://untroubled.org/spam/
I would also need to export known-good emails from a few of the mailboxes to balance good against bad.

1. I searched but came up empty. Has anyone else already managed to do this?
2. How effective is a fully populated Bayes database, and will the effort be worth the gain?
3. Any tips and pointers on what needs to be done would be greatly appreciated, e.g. where the emails need to be stored, what format they need to be in, and how to get Bayes to start learning once the files are in place.

If I can pull it off I'll post back the process and script(s) so others can replicate it.
jase72
Posts: 20
Joined: 21 Jul 2017 09:06

Post by jase72 »

Hi Griffo,

Having used EFA for a while and benefited from it greatly, I thought it was about time I helped someone out here. So, first post! Lucky you... I hope. I also hope you're getting notifications on replies, as your post is getting a little old. I sure took my time! (c;

Answer for question 2: everyone will tell you Bayes isn't worth much without training.
Answer for question 3: I leave the emails in Exchange, but to get them into a format SpamAssassin accepts (mbox) I use Thunderbird to download them via IMAP, because Thunderbird stores each folder's emails in an mbox file. Once it has synced I just grab the file and upload it to EFA.

Answer for question 1 is the wall of text that follows...

What I'd recommend you do to get all these old emails is the following (I'll assume you have a moderate understanding of Exchange here):
  1. Open the ECP and do an enterprise search (compliance management > in-place eDiscovery & hold). How you do this is up to you, but I'd suggest choosing "Enable de-duplication" when you do -- this will collapse all the found emails into one folder.
    See https://technet.microsoft.com/en-au/lib ... .150).aspx for more information.
  2. Confirm the results are the emails you want, obviously.
  3. Open up Outlook and fire up a profile for the administrator (or whichever account you ran the ECP search from).
  4. Add a secondary mailbox of Discovery Search Mailbox.
    https://msdn.microsoft.com/en-us/librar ... .149).aspx for more info.
  5. Copy out the emails into an empty folder that you have control of. You can export to PST and then import into your mailbox if you so desire.
  6. Review the emails in your new folder. Delete anything that shouldn't be there.
    Alternatively you can review each email and move them into a folder when you know they're the spam/malware you want (to never see again). Either way is fine, but make sure you end up with all the unwanted emails in a single folder and nothing else in it.
  7. Enable IMAP on the Exchange server. If IMAP is abhorrent to you then you can disable it after we're done. I leave it on, but there's no rule for it on our firewall.
  8. Fire up Thunderbird and sync it to the account that has this crap-ridden folder. When setting up the sync folders you can turn off all folders except the one you need.
  9. Close Thunderbird once the sync is finished.
  10. Browse to %appdata%\Thunderbird\Profiles\bunch-of-alphanumeric.default\ImapMail\Exchange.server.name.
  11. Find the file that's the name of the folder you've dumped the emails in. If it's in a subfolder then you need to browse down to that. The folder path will be the same as the path in Outlook and then the data file will have the same name as the folder the junk is in, no extension.
  12. Open WinSCP, log into the EFA box (SFTP or SCP, port 22) and drag the file to wherever you like. The default will be your home directory (/home/accountname).

    Finally...!
  13. SSH into EFA and drop to a shell. Then run:
    sudo sa-learn --spam --mbox --showdots /full/path/to/the/FILE
That's it. Watch the dots and the result at the end.
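
Before step 13 it can be worth confirming that the file copied out of the Thunderbird profile really parses as mbox and holds roughly the number of messages expected. The following is only a sketch using Python's standard-library `mailbox` module; the helper name `count_mbox_messages` is made up for illustration and is not part of eFa or SpamAssassin.

```python
import mailbox


def count_mbox_messages(path):
    """Return how many messages the mbox file at `path` contains.

    create=False makes a missing file raise an error instead of
    silently creating an empty mailbox.
    """
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()
```

Calling `count_mbox_messages("/full/path/to/the/FILE")` on the EFA box (or on the Windows side before uploading) should return a number close to what Outlook shows for the folder. A count of zero suggests the wrong file was copied.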

---------------------------------------------------------------

Theoretically you could get EFA to run its mail client (I've forgotten which one it is) to connect back to the Exchange server via IMAP and download the folder directly to the box. You could also set up a cron job that runs sa-learn against the downloaded file on a regular basis. If you automate it, you just have to make sure the folder never contains any false positives.
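
As a rough idea of what the cron side of that could look like, here is a minimal Python sketch that runs sa-learn on an mbox file and then moves the file aside so it is not learned again. The function name `learn_and_archive`, the `processed` directory and the timestamped file name are assumptions for illustration; only the `sa-learn --spam/--ham --mbox` invocation comes from the steps above. The `runner` argument exists so the logic can be tried without SpamAssassin installed.

```python
import shutil
import subprocess
import time
from pathlib import Path


def learn_and_archive(mbox_path, processed_dir, kind="spam", runner=subprocess.run):
    """Feed an mbox file to sa-learn, then move it into processed_dir.

    kind must be "spam" or "ham". Returns the command that was run,
    or None when the file is missing or empty (nothing to learn).
    """
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")

    mbox = Path(mbox_path)
    if not mbox.is_file() or mbox.stat().st_size == 0:
        return None

    cmd = ["sa-learn", f"--{kind}", "--mbox", str(mbox)]
    runner(cmd, check=True)  # raises if sa-learn exits non-zero

    # Keep the learned file (in case the database must be rebuilt),
    # under a timestamped name so earlier runs are not overwritten.
    dest = Path(processed_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    shutil.move(str(mbox), str(dest / f"{mbox.name}.{stamp}"))
    return cmd
```

A script calling this could then be scheduled from root's crontab. It still relies on someone having reviewed the folder first, as noted above.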

At the moment I've got anything with a header of "X-Spam-Status: Yes" or an SCL of 8 or more being BCC'ed to a "Spam Analysis" mailbox, which I have added as a secondary mailbox. I review the junk in there and delete anything that's legit, and when I'm happy with it I go to step 8 and proceed from there. Once done, I move the emails out to a processed folder (keeping them in case we ever need to rebuild the database), which means the next sa-learn run doesn't have to trawl through thousands of emails it has already processed. The Thunderbird file can get a bit bloated over time, so I sometimes just delete that as well and resync.
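
The selection rule described above (flagged by SpamAssassin, or SCL of 8 or more) can be expressed as a small check on a message's headers. This is only a sketch in Python: the function name `looks_like_spam` is made up, and the header name `X-MS-Exchange-Organization-SCL` is an assumption about where Exchange stamps the SCL value, since the post does not name the header.

```python
from email import message_from_string

# Assumed name of the header in which Exchange records the SCL value.
SCL_HEADER = "X-MS-Exchange-Organization-SCL"


def looks_like_spam(msg, scl_threshold=8):
    """Return True if msg carries "X-Spam-Status: Yes" or SCL >= threshold."""
    status = (msg.get("X-Spam-Status") or "").strip().lower()
    if status.startswith("yes"):
        return True

    scl = (msg.get(SCL_HEADER) or "").strip()
    try:
        return int(scl) >= scl_threshold
    except ValueError:
        # Header missing or not a number: treat as not flagged.
        return False
```

For example, `looks_like_spam(message_from_string(raw_text))` could be used to pre-sort an exported folder before the manual review. It does not replace that review.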

Hope that helps!
pdwalker
Posts: 1553
Joined: 18 Mar 2015 09:16

Post by pdwalker »

useful tip for spam training :clap:
benscha
Posts: 19
Joined: 23 Jan 2018 07:19

Post by benscha »

Hi guys, I'm new to the EFA project. Before this I used copfilter on an old IPCop system.

There I used the DMZS-sa-learn.pl script from https://www.dmzs.com/tools/files/spam/DMZS-sa-learn.pl to train on mails directly from a mailbox on our Exchange server over IMAP.

I only had to change a few paths, and the script also works on EFA.

Code:

#!/usr/bin/perl
#
# Process mail from imap server shared folder 'spam' & 'not-spam' through spamassassin sa-learn
# dmz@dmzs.com - March 19, 2004
# http://www.dmzs.com/tools/files/spam.phtml
# LGPL
#
# Things to try if it doesn't work
# 1) Set debug to 1 and see if you connect to the imap server and get messages (yes i could have made a command line flag, just didn't see the need once I got it working :)
# 2) Check the bayes_path settings in your local.cf for spamassassin (in debian it's /etc/spamassassin/local.cf).
#
# Also be sure to check that your spamassassin is truly using the bayes files (-D manual startup of spamd to debug there)
#
# 3) Remove any special characters from the imap password, they do not work

use strict;
use warnings;
use lib '/usr/share/perl5/vendor_perl/';
use Mail::IMAPClient;

# Set to 1 if you need more information
my $debug = 0;
my $salearn;

# START Edit these settings:
# Warning: settings will be overwritten from imap_run_now.sh script!
#my $directory = '';
#my $dirAmp = 'off';

# Create a folder ".INBOX" in your Exchange mailbox with the subfolders "spam" and "not-spam"
my $spamfolder = '.INBOX/spam';
my $nospamfolder = '.INBOX/not-spam';
my $spam = '';
my $nospam = '';

my $imap = Mail::IMAPClient->new( 

# Server => 'xxx.xxx.xxx.xxx:143',
Server => 'IPAddress:Port',
#User => 'DOMAIN/spam/spam', The User is named "spam" and the mailbox name is also "spam"
User => 'DOMAIN/spam/spam',
Password => '**********',
Debug => '0',
Ssl => '0',

Timeout => '5',
Buffer => '65536',

);


# END


#if($dirAmp eq 'off' || $dirAmp eq '') {
#	$spam = $directory.'/'.$spamfolder;
#	$nospam = $directory.'/'.$nospamfolder;
#} else {
#	$spam = '&'.$directory.'/'.$spamfolder;
#	$nospam = '&'.$directory.'/'.$nospamfolder;
#}


$spam = $spamfolder;
$nospam = $nospamfolder;

# To use "&" in an IMAP folder name, the "&" must be escaped with a "\". Since "\" is itself a special character, it must be escaped too.
# A folder with a UTF-7 encoded special character therefore has to be entered as "\\\&ANY-ffentliche Ordner" in the web GUI, so that sed does not replace the "&"
# with the matched expression (the back-reference feature of sed replacements).
# This leaves a leading "\" that must be stripped, otherwise the IMAP folder is not addressed correctly.

if(index($spam, "\\") == 0 && index($spam, "&") == 1) {
	$spam =  substr($spam,1);
}
if(index($nospam, "\\") == 0 && index($nospam, "&") == 1) {
	$nospam = substr($nospam, 1);
}  

# new() returns undef on failure and leaves the reason in $@
if (!defined($imap)) { die "IMAP Login Failed: $@\n"; }

# If debugging, print out the total counts for each mailbox
if ($debug) {
  print $spam, " spamfolder\n";
  print $nospam, " nospamfolder\n";
  my $spamcount = $imap->message_count($spam);
  print $spamcount, " Spam to process\n";

  my $nonspamcount = $imap->message_count($nospam);
  print $nonspamcount, " Notspam to process\n";
}

# Process the spam mailbox
$imap->select($spam);
my @msgs = $imap->search("ALL");
for (my $i=0;$i <= $#msgs; $i++)
{
  # I put it into a file for processing, doing it into a perl var & piping through sa-learn just didn't seem to work
  $imap->message_to_file("/tmp/salearn",$msgs[$i]);

  # execute sa-learn w/data
  if ($debug) { $salearn = `/usr/bin/sa-learn -D --spam -u spam --no-sync --progress /tmp/salearn`; } 
  else { $salearn = `/usr/bin/sa-learn --spam -u spam --no-sync --progress /tmp/salearn`; }
  print "-------\nSpam: ",$salearn,"\n-------\n" if $debug;

  # delete processed message
  $imap->delete_message($msgs[$i]);
  unlink("/tmp/salearn");
}
$imap->expunge();
$imap->close();

# Process the not-spam mailbox
$imap->select($nospam);
# reuse @msgs from the spam pass instead of declaring it a second time
@msgs = $imap->search("ALL");
for (my $i=0;$i <= $#msgs; $i++)
{
  $imap->message_to_file("/tmp/salearn",$msgs[$i]);
  # execute sa-learn w/data
  if ($debug) { $salearn = `/usr/bin/sa-learn -D --ham -u spam --no-sync --progress /tmp/salearn`; }
  else { $salearn = `/usr/bin/sa-learn --ham -u spam --no-sync --progress /tmp/salearn`; }
  print "-------\nNotSpam: ",$salearn,"\n-------\n" if $debug;

  # delete processed message
  $imap->delete_message($msgs[$i]);
  unlink("/tmp/salearn");
}
$imap->expunge();
$imap->close();

$imap->logout();

# integrate learned stuff
my $sarebuild = `/usr/bin/sa-learn --sync -u spam `;
print "-------\nRebuild: ",$sarebuild,"\n-------\n" if $debug;


Run the script from a cron job or start it manually.
Always happy for any hints and tips! :clap: | EFA 3.0.2.6
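
Whichever route is used, it helps to check that the learned counts are actually going up. `sa-learn --dump magic` prints the number of spam and ham messages learned so far, and SpamAssassin's defaults (`bayes_min_spam_num` and `bayes_min_ham_num`) require 200 of each before Bayes scores are applied. Below is a minimal Python sketch for reading those counts. The function names are made up, and the column layout it expects (value in the third column, label at the end of the line) is an assumption based on typical output, so check it against what your box prints.

```python
def parse_bayes_counts(dump_output):
    """Extract nspam/nham from the text printed by `sa-learn --dump magic`.

    Assumes lines shaped like:
      0.000          0       1234          0  non-token data: nspam
    """
    counts = {"nspam": 0, "nham": 0}
    for line in dump_output.splitlines():
        fields = line.split(None, 4)
        if len(fields) != 5:
            continue
        label = fields[4].strip()
        for key in counts:
            if label == f"non-token data: {key}":
                counts[key] = int(fields[2])
    return counts


def bayes_is_active(counts, minimum=200):
    """True once both counts reach the minimum (200 is the stock default)."""
    return counts["nspam"] >= minimum and counts["nham"] >= minimum
```

The text to parse could be captured with `subprocess.run(["sa-learn", "--dump", "magic"], capture_output=True, text=True).stdout`. With the script above, which learns as user "spam", the dump would presumably need the same `-u spam` option.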