spamassassin autolearn always no

Woger
Posts: 67
Joined: 15 Mar 2017 10:54

Post by Woger »

My bayes database is scoring, but every spam email I receive shows "spamassassin autolearn: no". I do quite a bit of manual learning, but it seems that autolearn is not working. I have added
bayes_auto_learn_threshold_spam 9.0
to /etc/MailScanner/spamassassin.conf and restarted MailScanner, but still no autolearn.

Does anybody know how to enable this?

Thanks,
Roger
shawniverson
Posts: 3644
Joined: 13 Jan 2014 23:30
Location: Indianapolis, Indiana USA

Post by shawniverson »

This?

bayes_auto_learn 1

Post by Woger »

I have that too, and I just saw an autolearn yes (not spam), so it seems to be working. But I can't find a single spam message where the autolearn also said yes :(

Post by shawniverson »

I wonder if the new bayes behavior is the culprit?

https://spamassassin.apache.org/full/3. ... shold.html

(look at the end)
pdwalker
Posts: 1553
Joined: 18 Mar 2015 09:16

Post by pdwalker »

How so?

Assuming that bayes_auto_learn is set to 1 in /etc/MailScanner/spamassassin.conf, the system should be autolearning according to the following settings:

bayes_auto_learn_threshold_nonspam n.nn (default: 0.1)

bayes_auto_learn_threshold_spam n.nn (default: 12.0, minimum 6, 3 from header, 3 from body score)

bayes_auto_learn_on_error (0 | 1) (default: 0)

This last entry says
"With bayes_auto_learn_on_error off, autolearning will be performed even if bayes classifier already agrees with the new classification (i.e. yielded BAYES_00 for what we are now trying to teach it as ham, or yielded BAYES_99 for spam). This is a traditional setting, the default was chosen to retain backwards compatibility."
which means the default value should always "learn" detected spam/ham according to the thresholds above.
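To make the quoted setting concrete, here is a minimal Python sketch of how bayes_auto_learn_on_error might gate learning once a message has already passed the score thresholds. The function name and arguments are hypothetical, this is not SpamAssassin's code, and the behaviour with the option turned on (learn only when the classifier did not already agree) is my reading of the rest of that documentation entry rather than something quoted above.

```python
def should_autolearn(on_error: bool, bayes_rule: str, target: str) -> bool:
    """Return True if autolearning would go ahead (hypothetical helper).

    on_error   -- value of bayes_auto_learn_on_error (False = default, 0)
    bayes_rule -- the BAYES_* rule that fired, e.g. "BAYES_99"
    target     -- what the autolearner wants to teach: "spam" or "ham"
    """
    # "Agrees" follows the quoted text: BAYES_00 for ham, BAYES_99 for spam.
    agrees = (target == "ham" and bayes_rule == "BAYES_00") or (
        target == "spam" and bayes_rule == "BAYES_99"
    )
    if not on_error:
        # Option off: learn even if the classifier already agrees.
        return True
    # Option on (assumed): learn only when the classifier did not agree.
    return not agrees


print(should_autolearn(False, "BAYES_99", "spam"))
print(should_autolearn(True, "BAYES_99", "spam"))
```

If this reading is right, leaving the option at its default of 0 never blocks learning on its own, so a missing autolearn has to come from the thresholds.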

[edit]

My bayes_auto_learn_threshold_spam is set to 9.0, and my spam that is 9.0 or greater shows "SpamAssassin Autolearn: N".

Maybe the bayes_auto_learn_on_error default is actually 1? I'm going to set this value to 0 and see if it makes a difference.

Post by pdwalker »

So I set it to 0, and it didn't make a difference - the header still showed "spam assassin autolearn = N"

I did a manual learn, and it reported "learned 1 message" which definitely means it didn't autolearn.

Now that I think about it, I wonder if there is a way to report whether a particular message has been "learned" or not by spamassassin?

*scratches chin* Looks like I have a new distraction for today.

Post by pdwalker »

It might be a bug in Mailscanner. Checking...

[edit] no, not a bug. still checking.

Post by pdwalker »

I ran this query on the mailscanner maillog table:

SELECT timestamp, sascore, spamreport
FROM mailscanner.maillog
WHERE spamreport LIKE '%autolearn=%'
  AND spamreport NOT LIKE '(blacklisted)'
ORDER BY timestamp DESC;
and I get the following interesting results (only a few examples shown):

'2017-09-01 00:18:04','-4.48','not spam, SpamAssassin (not cached, score=-4.478, required 4, autolearn=not spam, BAYES_00 -1.90, DKIM_SIGNED 0.10, DKIM_VALID -0.10, DKIM_VALID_AU -0.10, HTML_MESSAGE 0.00, MIME_QP_LONG_LINE 0.00, ML_SPAM_HEADER_NO -0.01, ML_SPF_PASS -0.68, MXPF_TEST 0.00, RCVD_IN_DNSWL_MED -2.30, RCVD_IN_SORBS_SPAM 0.50, SPF_HELO_PASS -0.00, T_SPF_PERMERROR 0.01)'
'2017-09-01 00:17:25','13.00','spam, SpamAssassin (not cached, score=13.003, required 4, autolearn=spam, BAYES_50 0.80, DATE_IN_PAST_12_24 1.05, DCC_CHECK 1.10, DEAR_SOMETHING 1.97, DIGEST_MULTIPLE 0.29, FREEMAIL_FORGED_FROMDOMAIN 0.20, FREEMAIL_FROM 0.00, FREEMAIL_REPLYTO_END_DIGIT 0.25, HEADER_FROM_DIFFERENT_DOMAINS 0.00, HPF_PASS -0.10, HTML_MESSAGE 0.00, KAM_LAZY_DOMAIN_SECURITY 1.00, MISSING_MID 0.50, ML_SPAMINFO_EXISTS 3.00, MXPF_TEST 0.00, PYZOR_CHECK 1.39, RCVD_IN_BL_SPAMCOP_NET 1.35, RCVD_IN_DNSWL_MED -2.30, SPF_HELO_PASS -0.00, URIBL_DBL_SPAM 2.50)'
The key thing to see in the spamassassin report is the "autolearn="
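If you want to pull that field out programmatically rather than eyeball it, something like the following Python sketch could work. It assumes the spamreport text always has the "label, SpamAssassin (field, field, ...)" shape seen in the rows above; the function name parse_spamreport is made up for illustration, and the sample string is a shortened copy of one of those rows.

```python
import re


def parse_spamreport(report: str) -> dict:
    """Split a MailScanner spamreport string into score, autolearn and rules.

    Raises ValueError if the report has no parenthesised section.
    """
    # Everything of interest sits between the first "(" and the last ")".
    inner = report[report.index("(") + 1 : report.rindex(")")]
    result = {"score": None, "required": None, "autolearn": None, "rules": {}}
    for field in (part.strip() for part in inner.split(",")):
        if field.startswith("score="):
            result["score"] = float(field[len("score="):])
        elif field.startswith("required "):
            result["required"] = float(field[len("required "):])
        elif field.startswith("autolearn="):
            result["autolearn"] = field[len("autolearn="):]
        else:
            # Rule entries look like "RULE_NAME 1.23"; skip anything else
            # (for example "not cached").
            match = re.fullmatch(r"([A-Z0-9_]+) (-?\d+(?:\.\d+)?)", field)
            if match:
                result["rules"][match.group(1)] = float(match.group(2))
    return result


sample = (
    "spam, SpamAssassin (not cached, score=13.003, required 4, "
    "autolearn=spam, BAYES_50 0.80, PYZOR_CHECK 1.39, RCVD_IN_DNSWL_MED -2.30)"
)
print(parse_spamreport(sample)["autolearn"])
```

A report with no autolearn= field simply comes back with autolearn set to None, which is the case being investigated here.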

When I look at all my records, not every entry has an autolearn, including some that, based on my autolearn threshold settings, I think should have been learned.

Filtered for the ones with an autolearn in the spamreport, I can see that it does autolearn some for both spam and not spam, so autolearn is definitely working in some circumstances. My log entries above show this.

The only thing I can think of is from the spamassassin documentation
Note: SpamAssassin requires at least 3 points from the header, and 3 points from the body to auto-learn as spam. Therefore, the minimum working value for this option is 6.
So I'm guessing that the messages that are not getting autolearned are skipped because the header or body is not scoring at least three points on the spam autolearn test.

Let's test a message that was marked as spam, but not autolearned
spam, SpamAssassin (not cached, score=12.714, required 4, BAYES_99 4.00, BAYES_999 2.00, HTML_FONT_FACE_BAD 0.98, HTML_FONT_LOW_CONTRAST 0.00, HTML_MESSAGE 0.00, KAM_BADIPHTTP 2.00, KAM_LAZY_DOMAIN_SECURITY 1.00, ML_SPAM_HEADER_NO -0.01, MXPF_TEST 0.00, RAZOR2_CF_RANGE_51_100 0.50, RAZOR2_CF_RANGE_E8_51_100 1.89, RAZOR2_CHECK 0.92, RCVD_IN_DNSWL_MED -2.30, SPF_HELO_PASS -0.00, T_KAM_HTML_FONT_INVALID 0.01, URIBL_SBL 1.62, URIBL_SBL_A 0.10)
A spamassassin -D -t on that message reveals:

Sep  1 14:04:41.927 [12309] dbg: learn: auto-learn: message score: 11.714, computed score for autolearn: 5.26
Sep  1 14:04:41.927 [12309] dbg: learn: auto-learn? ham=-10, spam=9, body-points=1.045, head-points=1.991, learned-points=6
Sep  1 14:04:41.927 [12309] dbg: learn: auto-learn? no: inside auto-learn thresholds, not considered ham or spam
and that appears to be the case.
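As a rough way to reason about that debug output, here is a simplified Python sketch of the decision as I understand it. It is not SpamAssassin's actual code: the function name is hypothetical, it ignores the learned-points value shown in the log, and it takes the separately computed autolearn score as input (5.26 in the log, not the 11.714 message score).

```python
def autolearn_decision(autolearn_score: float,
                       head_points: float,
                       body_points: float,
                       ham_threshold: float = 0.1,
                       spam_threshold: float = 12.0) -> str:
    """Return "ham", "spam" or "no" (simplified, hypothetical model).

    Defaults mirror the documented defaults for
    bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam.
    """
    if autolearn_score < ham_threshold:
        return "ham"
    if autolearn_score >= spam_threshold:
        # The documentation quoted above requires at least 3 points from
        # the header and 3 from the body before learning as spam.
        if head_points < 3 or body_points < 3:
            return "no"
        return "spam"
    # Between the two thresholds: neither ham nor spam.
    return "no"


# Values taken from the debug lines above (ham=-10, spam=9).
print(autolearn_decision(5.26, head_points=1.991, body_points=1.045,
                         ham_threshold=-10, spam_threshold=9))
```

With the logged values the score used for autolearning (5.26) is below the spam threshold of 9, so the message lands between the thresholds; with 1.991 head points and 1.045 body points it would also have failed the 3 + 3 requirement even at a higher score.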


tl;dr: spamassassin autolearn is working correctly as designed.

Post by pdwalker »

pdwalker wrote: 01 Sep 2017 04:46
Now that I think about it, I wonder if there is a way to report whether a particular message has been "learned" or not by spamassassin?
I didn't need to do this, but I felt compelled to answer my own question anyway.

If you edit /var/www/html/mailscanner/status.php and find the sql query at the beginning of the file, insert the following lines just before the "FROM" line:
,case
when spamreport like '%autolearn=spam%' then 'spam'
when spamreport like '%autolearn=not spam%' then 'not spam'
else '-'
end as autolearn
so

 mcpsascore,
 '' AS status
FROM
 maillog
becomes:

 mcpsascore,
 '' AS status
,case
  when spamreport like '%autolearn=spam%' then 'spam'
  when spamreport like '%autolearn=not spam%' then 'not spam'
  else '-'
end as autolearn
FROM
 maillog
and you'll end up with an extra column on the far right that answers the question of whether a message has been autolearned and, if so, as what. (It's not a multi-language-friendly fix; I was in a hurry.)
[Screenshot: autolearn.png]
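To sanity-check the CASE expression without touching a live system, you could try it against a throwaway table. The Python sketch below uses an in-memory SQLite database as a stand-in for the real mailscanner database, so the table has only the one column the expression needs; the helper name autolearn_column and the sample reports are made up for illustration.

```python
import sqlite3

# Same CASE expression as in the status.php change, against a toy table.
AUTOLEARN_QUERY = """
SELECT spamreport,
       CASE
         WHEN spamreport LIKE '%autolearn=spam%' THEN 'spam'
         WHEN spamreport LIKE '%autolearn=not spam%' THEN 'not spam'
         ELSE '-'
       END AS autolearn
FROM maillog
ORDER BY id
"""


def autolearn_column(reports):
    """Return the computed autolearn value for each report, in order."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE maillog (id INTEGER PRIMARY KEY, spamreport TEXT)"
        )
        conn.executemany(
            "INSERT INTO maillog (spamreport) VALUES (?)",
            [(report,) for report in reports],
        )
        return [row[1] for row in conn.execute(AUTOLEARN_QUERY)]
    finally:
        conn.close()


reports = [
    "spam, SpamAssassin (score=13.003, required 4, autolearn=spam, BAYES_50 0.80)",
    "not spam, SpamAssassin (score=-4.478, required 4, autolearn=not spam, BAYES_00 -1.90)",
    "spam, SpamAssassin (score=12.714, required 4, BAYES_99 4.00)",
]
print(autolearn_column(reports))
```

The two LIKE patterns cannot both match one report, because "autolearn=not spam" does not contain "autolearn=spam", so the order of the WHEN branches does not matter here.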

Note to self: Make the modifications necessary to also display whether it has been "learned" by the bayes classifier or not, in the same column.

Note to self: but don't do this today no matter how much you want to.

Post by shawniverson »

Big thanks for digging into this!!! :dance: :clap:

Post by pdwalker »

No worries.

It's moments like the above that make me think I have OCD. Might as well make use of it!

Actually, having a "learned" column would be useful, especially if I can combine the autolearn information with the sa_learn data.

I think I'll do that next.