eFa v5 bayes behaviour

gregecslo
Posts: 65
Joined: 09 Sep 2018 17:55

eFa v5 bayes behaviour

Post by gregecslo »

Hi.

I created a new eFa appliance on Rocky.

I configured it and all is good, except for bayes. It IS working and it IS learning, but when it classifies mail it is not nearly as decisive as it was in v3 and v4 of eFa.

I have:

Code:

dbg: bayes: corpus size: nspam = 1190, nham = 12441
dbg: bayes: DB expiry: tokens in DB: 979401, Expiry max size: 1500000, Oldest atime: 1725361640, Newest atime: 1725888528, Last expire: 0, Current time: 1725888537

So I have enough spam/ham and more than enough tokens.

What I find weird is this:

BAYES_50 and BAYES_40 have about 10,000 hits EACH.
BAYES_80 only 600
BAYES_95 even less: 341
BAYES_99: 284
BAYES_20 only 150
BAYES_60 only 87

I have no BAYES lower than 40 at all. I am training and also use autolearn.
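
For anyone who wants to check the same numbers on their own box, here is a minimal sketch that reads the two debug lines quoted above and derives the ham-to-spam ratio and the age span of the tokens. The function name `parse_bayes_debug` and the dictionary keys are made up for illustration; only the debug text itself comes from the post.

```python
import re

# The two debug lines quoted in the post above.
DEBUG_OUTPUT = """\
dbg: bayes: corpus size: nspam = 1190, nham = 12441
dbg: bayes: DB expiry: tokens in DB: 979401, Expiry max size: 1500000, Oldest atime: 1725361640, Newest atime: 1725888528, Last expire: 0, Current time: 1725888537
"""


def parse_bayes_debug(text):
    """Extract message counts, token count and atime range from the debug text.

    Hypothetical helper: the key names are illustrative, not SpamAssassin's.
    """
    corpus = re.search(r"nspam = (\d+), nham = (\d+)", text)
    tokens = re.search(r"tokens in DB: (\d+)", text)
    oldest = re.search(r"Oldest atime: (\d+)", text)
    newest = re.search(r"Newest atime: (\d+)", text)
    return {
        "nspam": int(corpus.group(1)),
        "nham": int(corpus.group(2)),
        "tokens": int(tokens.group(1)),
        "oldest_atime": int(oldest.group(1)),
        "newest_atime": int(newest.group(1)),
    }


stats = parse_bayes_debug(DEBUG_OUTPUT)

# How many ham messages were learned for every spam message.
ham_per_spam = stats["nham"] / stats["nspam"]

# Age span of the tokens in days (atime values are Unix timestamps).
span_days = (stats["newest_atime"] - stats["oldest_atime"]) / 86400

print(f"ham per spam: {ham_per_spam:.1f}")
print(f"token age span: {span_days:.1f} days")
```

With these numbers the corpus holds roughly ten ham messages for every spam message, and the token access times cover only about six days.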

I have also transferred the corpus trained on eFa v4, where it worked correctly.

Is SpamAssassin v4 really so much more conservative, or am I doing something wrong here?

Thanks!

Re: eFa v5 bayes behaviour

Post by gregecslo »

One more thing...
Some mails don't even have a BAYES rule in the score list. I confirmed this on 2 installs (one new, the other migrated from v4).

Code:

Score	Matching Rule	Description
1.95	DATE_IN_FUTURE_06_12	Date: is 6 to 12 hours after Received: date
1.10	DCC_CHECK	Detected as bulk mail by DCC (dcc-servers.net)
0.10	DKIM_SIGNED	Message has a DKIM or DK signature, not necessarily valid
-0.50	DKIM_VALID	Message has at least one valid DKIM or DK signature
-1.00	DKIM_VALID_AU	Message has a valid DKIM or DK signature from author's domain
-0.10	DKIM_VALID_EF	Message has a valid DKIM or DK signature from envelope-from domain
-0.00	DMARC_PASS	DMARC pass policy
0.25	FREEMAIL_ENVFROM_END_DIGIT	Envelope-from freemail username ends in digit
0.30	FREEMAIL_FROM	Sender email is commonly abused enduser mail provider
0.00	HTML_MESSAGE	HTML included in message
-0.00	RCVD_IN_DNSWL_NONE	Sender listed at https://www.dnswl.org/, no trust
-0.00	SPF_HELO_PASS	SPF: HELO matches SPF record
2.50	URIBL_DBL_PHISH	Contains a Phishing URL listed in the Spamhaus DBL blocklist

I don't know why this happens...
shawniverson
Posts: 3760
Joined: 13 Jan 2014 23:30
Location: Indianapolis, Indiana USA

Re: eFa v5 bayes behaviour

Post by shawniverson »

I see you posted to the SpamAssassin list. I don't have any specific explanation other than that the SA developers have made some major changes to the bayes scoring. I would try training the bayes with a balanced corpus of known ham and spam, turning off autolearn (it isn't very smart; any badly scored emails will automatically poison your bayes), and changing the scores of the bayes probability levels to handle the spammy emails appropriately.
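
One way to get a balanced corpus is to downsample the larger class before training. A rough sketch, assuming the messages are available as two lists; `balance_corpus` and the `ham-N` / `spam-N` identifiers are hypothetical stand-ins for real mail files, and the counts are the ones from the first post:

```python
import random


def balance_corpus(ham, spam, seed=0):
    """Downsample the larger class so both lists end up the same length."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    n = min(len(ham), len(spam))
    return rng.sample(ham, n), rng.sample(spam, n)


# Hypothetical message identifiers standing in for real mail files.
ham_msgs = [f"ham-{i}" for i in range(12441)]
spam_msgs = [f"spam-{i}" for i in range(1190)]

ham_bal, spam_bal = balance_corpus(ham_msgs, spam_msgs)
print(len(ham_bal), len(spam_bal))
```

The selected messages would then be fed to the usual training tool. Random downsampling throws away ham, so hand-picking known-good examples may well work better.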

Re: eFa v5 bayes behaviour

Post by gregecslo »

Yeah, guilty :)

It was really bothering me why the same exported training data behaved completely differently when imported into SA v4.

I got my answer, maybe it will help others.

And yeah, I will turn autolearn off for sure; it will do more harm than good anyway...

Re: eFa v5 bayes behaviour

Post by gregecslo »

I posted this on the mailing list as well...

I think we need to open a bug for this?

Hi again.

In SA v4 there is something wrong with bayes...

I received 3 identical mails (1 external sender, 3 internal recipients) and the scores look like this:

Two of them:

Code:

0.00    ARC_SIGNED      Message has a ARC signature
-0.10   ARC_VALID       Message has a valid ARC signature
-0.40   DCC_REPUT_00_12 DCC reputation between 0 and 12 % (mostly ham)
0.10    DKIM_INVALID    DKIM or DK signature exists, but is not valid
0.10    DKIM_SIGNED     Message has a DKIM or DK signature, not necessarily valid
-0.00   DMARC_PASS      DMARC pass policy
0.25    GMD_PDF_HORIZ   Contains pdf 100-240 (high) x 450-800 (wide)
0.50    GMD_PDF_SQUARE  Contains pdf 180-360 (high) x 180-360 (wide)
0.00    HTML_MESSAGE    HTML included in message
1.02    MISSING_HEADERS Missing To: header
1.50    PHISH_LNK_URI   Typical phishing tactic - pre filled mail in link
-0.00   RCVD_IN_DNSWL_NONE      Sender listed at https://www.dnswl.org/, no trust
0.00    RCVD_IN_VALIDITY_CERTIFIED_BLOCKED      ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information.
0.00    RCVD_IN_VALIDITY_RPBL_BLOCKED   ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information.
0.00    RCVD_IN_VALIDITY_SAFE_BLOCKED   ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information.
-0.00   SPF_HELO_PASS   SPF: HELO matches SPF record


And one of them:

Code:

0.00    ARC_SIGNED      Message has a ARC signature
-0.10   ARC_VALID       Message has a valid ARC signature
1.50    BAYES_60        Bayes spam probability is 60 to 80%
-0.40   DCC_REPUT_00_12 DCC reputation between 0 and 12 % (mostly ham)
0.10    DKIM_INVALID    DKIM or DK signature exists, but is not valid
0.10    DKIM_SIGNED     Message has a DKIM or DK signature, not necessarily valid
-0.00   DMARC_PASS      DMARC pass policy
0.25    GMD_PDF_HORIZ   Contains pdf 100-240 (high) x 450-800 (wide)
0.50    GMD_PDF_SQUARE  Contains pdf 180-360 (high) x 180-360 (wide)
0.00    HTML_MESSAGE    HTML included in message
1.02    MISSING_HEADERS Missing To: header
1.50    PHISH_LNK_URI   Typical phishing tactic - pre filled mail in link
-0.00   RCVD_IN_DNSWL_NONE      Sender listed at https://www.dnswl.org/, no trust
0.00    RCVD_IN_VALIDITY_CERTIFIED_BLOCKED      ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information.
0.00    RCVD_IN_VALIDITY_RPBL_BLOCKED   ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information.
0.00    RCVD_IN_VALIDITY_SAFE_BLOCKED   ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information.
-0.00   SPF_HELO_PASS   SPF: HELO matches SPF record


Why does one have "BAYES_60" and the other two not?

My thoughts so far:

1. This is not shortcircuit, as only bayes is different.
2. The mails are identical and the mail server load is... well, non-existent (1 minute load 0.08).
3. Maybe some new logic in bayes skips some messages?
4. A race condition (IDK, I'm not a coder).
5. Bayes behaves inconsistently on BOTH installs I have it on.
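
The "only bayes is different" check can be done mechanically instead of by eye. A small sketch, assuming each report line is "score, rule name, description" separated by whitespace as in the listings above; `rule_names` is a made-up helper and the two reports are shortened to a few lines each:

```python
def rule_names(report):
    """Return the set of rule names from a score report.

    Assumes each line is: score, rule name, description (whitespace separated).
    """
    names = set()
    for line in report.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            names.add(fields[1])
    return names


# Shortened excerpts of the two reports quoted above.
report_without = """\
0.00    ARC_SIGNED      Message has a ARC signature
-0.10   ARC_VALID       Message has a valid ARC signature
-0.40   DCC_REPUT_00_12 DCC reputation between 0 and 12 % (mostly ham)
"""

report_with = """\
0.00    ARC_SIGNED      Message has a ARC signature
-0.10   ARC_VALID       Message has a valid ARC signature
1.50    BAYES_60        Bayes spam probability is 60 to 80%
-0.40   DCC_REPUT_00_12 DCC reputation between 0 and 12 % (mostly ham)
"""

# Rules that appear in exactly one of the two reports.
only_in_one = rule_names(report_with) ^ rule_names(report_without)
print(sorted(only_in_one))
```

Running the same comparison on the full reports makes it easy to confirm that no other rule differs between the copies.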

Re: eFa v5 bayes behaviour

Post by gregecslo »

Or maybe it is because the bayes storage is SQL-based?

Re: eFa v5 bayes behaviour

Post by gregecslo »

Also this:

Rule      Description                          Score  Total  Ham    Ham %  Spam  Spam %
BAYES_40  Bayes spam probability is 20 to 40%  0.00   2,784  2,721  97.7   63    2.3
BAYES_50  Bayes spam probability is 40 to 60%  0.80   126    93     73.8   33    26.2
BAYES_60  Bayes spam probability is 60 to 80%  1.50   437    127    29.1   310   70.9
BAYES_80  Bayes spam probability is 80 to 95%  7.00   266    1      0.4    265   99.6
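
The two percentage columns in the table are just the ham and spam counts divided by the total hits for that rule. A quick sketch that recomputes them from the counts above (the `shares` helper is only for illustration):

```python
# (rule, total hits, ham hits, spam hits) copied from the table above.
ROWS = [
    ("BAYES_40", 2784, 2721, 63),
    ("BAYES_50", 126, 93, 33),
    ("BAYES_60", 437, 127, 310),
    ("BAYES_80", 266, 1, 265),
]


def shares(total, ham, spam):
    """Return (ham %, spam %) of the total hits, rounded to one decimal."""
    return round(100 * ham / total, 1), round(100 * spam / total, 1)


for rule, total, ham, spam in ROWS:
    ham_pct, spam_pct = shares(total, ham, spam)
    print(f"{rule}: {ham_pct}% ham, {spam_pct}% spam")
```

Read this way, BAYES_80 is almost always right when it fires (265 of 266 hits were spam), while BAYES_60 is wrong on a bit under a third of its hits.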

I only have BAYES_40 to BAYES_80 after clearing the bayes DB and manually RE-learning on 2500 HAM and 2500 SPAM messages.
So NO BAYES lower than 40 or higher than 80...

There is 100% something wrong here. Bayes is not a decision maker at all; for me it is useless. This indecisiveness, along with the fact that some mails aren't even BAYES scored, makes me think there is a bug, or did I implement it wrong?

Re: eFa v5 bayes behaviour

Post by gregecslo »

I had BAYES_20 scored as "0" lol, so my bad here. :)

But there is still no BAYES_95 or BAYES_05, so the lowest and highest levels are missing...
It really is different in 4.0.1 :)
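
A note on the rule names, since they are easy to misread: the number in the name is the upper end of the probability range, not the lower one (BAYES_40 is 20 to 40%, per the descriptions quoted earlier in the thread). A small lookup sketch follows. The ranges for BAYES_40 through BAYES_80 come from this thread; the remaining boundaries are what I believe the stock rule descriptions say, so treat them as an assumption and check your own rule files.

```python
# (exclusive upper bound of the probability, rule name).
# BAYES_40..BAYES_80 ranges are quoted in this thread; the others are
# assumed from the stock rule descriptions and should be verified locally.
BUCKETS = [
    (0.01, "BAYES_00"),
    (0.05, "BAYES_05"),
    (0.20, "BAYES_20"),
    (0.40, "BAYES_40"),
    (0.60, "BAYES_50"),
    (0.80, "BAYES_60"),
    (0.95, "BAYES_80"),
    (0.99, "BAYES_95"),
]


def bayes_rule(probability):
    """Map a spam probability in [0, 1] to the BAYES_* rule name for its range."""
    for upper, name in BUCKETS:
        if probability < upper:
            return name
    return "BAYES_99"


for p in (0.03, 0.10, 0.30, 0.50, 0.70, 0.90, 0.97, 0.995):
    print(f"{p:.3f} -> {bayes_rule(p)}")
```

So a mail with a 10% spam probability lands in BAYES_20, and a rule scored at 0 will not show up in the score list even though the classifier produced a result.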