eFa v5 bayes behaviour
Posted: 09 Sep 2024 13:35
by gregecslo
Hi.
I created a new eFa appliance on Rocky.
I configured it and all is good, except for Bayes. It IS working and it IS learning, but when it classifies mail it is not nearly as decisive as it was in eFa v3 and v4.
I have:
Code:
dbg: bayes: corpus size: nspam = 1190, nham = 12441
dbg: bayes: DB expiry: tokens in DB: 979401, Expiry max size: 1500000, Oldest atime: 1725361640, Newest atime: 1725888528, Last expire: 0, Current time: 1725888537
So I have enough spam/ham and really enough tokens...
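For anyone who wants to sanity-check the same figures on their own box, here is a minimal sketch that parses the two debug lines quoted above and derives the ham-to-spam ratio and the age span of the tokens. The function name `parse_bayes_debug` and the returned keys are made up for illustration; only the log text itself comes from the post.

```python
import re

# The two debug lines quoted in the post, copied verbatim.
log = """\
dbg: bayes: corpus size: nspam = 1190, nham = 12441
dbg: bayes: DB expiry: tokens in DB: 979401, Expiry max size: 1500000, Oldest atime: 1725361640, Newest atime: 1725888528, Last expire: 0, Current time: 1725888537
"""

def parse_bayes_debug(text):
    """Pull corpus counts and token atimes out of the bayes debug output.

    Hypothetical helper: the regexes only match the wording shown above.
    """
    corpus = re.search(r"nspam = (\d+), nham = (\d+)", text)
    atimes = re.search(r"Oldest atime: (\d+), Newest atime: (\d+)", text)
    nspam, nham = (int(g) for g in corpus.groups())
    oldest, newest = (int(g) for g in atimes.groups())
    return {
        "nspam": nspam,
        "nham": nham,
        "ham_per_spam": nham / nspam,
        # atimes are Unix timestamps, so the difference is in seconds.
        "atime_span_days": (newest - oldest) / 86400,
    }

stats = parse_bayes_debug(log)
print(f"ham per spam message: {stats['ham_per_spam']:.1f}")
print(f"token atime span: {stats['atime_span_days']:.1f} days")
```

With the numbers above this works out to roughly 10.5 ham messages per spam message, and the token access times span only about 6.1 days.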
What I find weird is this:
BAYES_50 and BAYES_40 have about 10,000 hits EACH.
BAYES_80 only 600
BAYES_95 even less: 341
BAYES_99: 284
BAYES_20 only 150
BAYES_60 only 87
I have no BAYES lower than 40 at all. I am training manually and also use autolearn.
I have also transferred the corpus trained on eFa v4, where it worked correctly.
Is SpamAssassin v4 really so much more conservative, or am I doing something wrong here?
Thanks!
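To put a number on "not so decisive", here is a small sketch that computes what share of all Bayes hits lands in the middle buckets. The counts are the ones listed in this post (the two 10,000 figures are approximate); the helper name `share` is just for illustration.

```python
# Hit counts as quoted in the post; BAYES_40 and BAYES_50 are approximate.
hits = {
    "BAYES_20": 150,
    "BAYES_40": 10000,
    "BAYES_50": 10000,
    "BAYES_60": 87,
    "BAYES_80": 600,
    "BAYES_95": 341,
    "BAYES_99": 284,
}

def share(counts, rules):
    """Fraction of all hits that fall on the given rules."""
    total = sum(counts.values())
    return sum(counts[r] for r in rules) / total

middle = share(hits, ["BAYES_40", "BAYES_50"])
confident_spam = share(hits, ["BAYES_95", "BAYES_99"])
print(f"BAYES_40 + BAYES_50: {middle:.1%}")
print(f"BAYES_95 + BAYES_99: {confident_spam:.1%}")
```

On these counts about 93% of scored mail sits in the 20 to 60% range, and under 3% reaches BAYES_95 or BAYES_99.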
Re: eFa v5 bayes behaviour
Posted: 09 Sep 2024 13:59
by gregecslo
One more thing...
Some mails don't even have a BAYES rule in the score list. I confirmed this on 2 installs (one new, the other migrated from v4).
Code:
Score Matching Rule Description
1.95 DATE_IN_FUTURE_06_12 Date: is 6 to 12 hours after Received: date
1.10 DCC_CHECK Detected as bulk mail by DCC (dcc-servers.net)
0.10 DKIM_SIGNED Message has a DKIM or DK signature, not necessarily valid
-0.50 DKIM_VALID Message has at least one valid DKIM or DK signature
-1.00 DKIM_VALID_AU Message has a valid DKIM or DK signature from author's domain
-0.10 DKIM_VALID_EF Message has a valid DKIM or DK signature from envelope-from domain
-0.00 DMARC_PASS DMARC pass policy
0.25 FREEMAIL_ENVFROM_END_DIGIT Envelope-from freemail username ends in digit
0.30 FREEMAIL_FROM Sender email is commonly abused enduser mail provider
0.00 HTML_MESSAGE HTML included in message
-0.00 RCVD_IN_DNSWL_NONE Sender listed at https://www.dnswl.org/, no trust
-0.00 SPF_HELO_PASS SPF: HELO matches SPF record
2.50 URIBL_DBL_PHISH Contains a Phishing URL listed in the Spamhaus DBL blocklist
I don't know why this happens...
Re: eFa v5 bayes behaviour
Posted: 13 Sep 2024 17:00
by shawniverson
I see you posted to the SpamAssassin list. I don't have any specific explanation, other than that the SA developers have made some major changes to the Bayes scoring. I would try three things: training Bayes with a balanced corpus of known ham and spam; turning off autolearn (it isn't very smart, and any badly scored email will automatically poison your Bayes); and changing the scores of the Bayes probability levels to handle the spammy emails appropriately.
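As a rough sketch of the last two suggestions, the snippet below prints the SpamAssassin config lines that disable autolearn and override Bayes rule scores. `bayes_auto_learn` and `score` are standard SpamAssassin directives; the helper `render_bayes_overrides` and every score value are hypothetical and chosen only for illustration. Which file the lines belong in depends on the install, so that part is left out.

```python
def render_bayes_overrides(scores, auto_learn=False):
    """Return SpamAssassin config lines for the given Bayes rule scores.

    Hypothetical helper; it only formats text and does not touch any files.
    """
    lines = [f"bayes_auto_learn {1 if auto_learn else 0}"]
    for rule, score in sorted(scores.items()):
        lines.append(f"score {rule} {score:.2f}")
    return "\n".join(lines)

# Illustrative values only, not a recommendation; tune against your own mail.
example_scores = {
    "BAYES_50": 0.80,
    "BAYES_60": 1.50,
    "BAYES_80": 3.00,
    "BAYES_95": 4.00,
    "BAYES_99": 5.00,
}
print(render_bayes_overrides(example_scores))
```

After changing the config, running `spamassassin --lint` is a cheap way to catch syntax mistakes before restarting anything.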
Re: eFa v5 bayes behaviour
Posted: 13 Sep 2024 18:14
by gregecslo
Yeah, guilty
It was really bothering me why the same exported training data behaved completely differently once imported into SA v4.
I got my answer, maybe it will help others.
And yeah, I will turn autolearn off for sure, it will do more harm than good anyway...
Re: eFa v5 bayes behaviour
Posted: 23 Sep 2024 13:24
by gregecslo
I posted this on the mailing list as well...
I think we need to open a bug for this?
Hi again.
In V4 there is something wrong with bayes...
I received 3 identical mails (1 external sender, 3 internal recipients) and scores are like this:
2 of them look like this:
Code:
0.00 ARC_SIGNED Message has a ARC signature
-0.10 ARC_VALID Message has a valid ARC signature
-0.40 DCC_REPUT_00_12 DCC reputation between 0 and 12 % (mostly ham)
0.10 DKIM_INVALID DKIM or DK signature exists, but is not valid
0.10 DKIM_SIGNED Message has a DKIM or DK signature, not necessarily valid
-0.00 DMARC_PASS DMARC pass policy
0.25 GMD_PDF_HORIZ Contains pdf 100-240 (high) x 450-800 (wide)
0.50 GMD_PDF_SQUARE Contains pdf 180-360 (high) x 180-360 (wide)
0.00 HTML_MESSAGE HTML included in message
1.02 MISSING_HEADERS Missing To: header
1.50 PHISH_LNK_URI Typical phishing tactic - pre filled mail in link
-0.00 RCVD_IN_DNSWL_NONE Sender listed at https://www.dnswl.org/, no trust
0.00 RCVD_IN_VALIDITY_CERTIFIED_BLOCKED ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information.
0.00 RCVD_IN_VALIDITY_RPBL_BLOCKED ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information.
0.00 RCVD_IN_VALIDITY_SAFE_BLOCKED ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information.
-0.00 SPF_HELO_PASS SPF: HELO matches SPF record
AND 1 looks like this:
Code:
0.00 ARC_SIGNED Message has a ARC signature
-0.10 ARC_VALID Message has a valid ARC signature
1.50 BAYES_60 Bayes spam probability is 60 to 80%
-0.40 DCC_REPUT_00_12 DCC reputation between 0 and 12 % (mostly ham)
0.10 DKIM_INVALID DKIM or DK signature exists, but is not valid
0.10 DKIM_SIGNED Message has a DKIM or DK signature, not necessarily valid
-0.00 DMARC_PASS DMARC pass policy
0.25 GMD_PDF_HORIZ Contains pdf 100-240 (high) x 450-800 (wide)
0.50 GMD_PDF_SQUARE Contains pdf 180-360 (high) x 180-360 (wide)
0.00 HTML_MESSAGE HTML included in message
1.02 MISSING_HEADERS Missing To: header
1.50 PHISH_LNK_URI Typical phishing tactic - pre filled mail in link
-0.00 RCVD_IN_DNSWL_NONE Sender listed at https://www.dnswl.org/, no trust
0.00 RCVD_IN_VALIDITY_CERTIFIED_BLOCKED ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information.
0.00 RCVD_IN_VALIDITY_RPBL_BLOCKED ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information.
0.00 RCVD_IN_VALIDITY_SAFE_BLOCKED ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information.
-0.00 SPF_HELO_PASS SPF: HELO matches SPF record
Why does one have "BAYES_60" and the other 2 not?
My thoughts so far:
1. This is not shortcircuit, as only Bayes is different.
2. The mails are identical and the mail server load is... well, non-existent (1 minute load 0.08)
3. Maybe some new logic in Bayes to skip some?
4. Race condition (IDK, I'm not a coder)
5. Bayes behaves inconsistently on BOTH installs I have it on
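To back up point 1, here is a minimal sketch that compares the two score lists by rule name. Only the score and rule columns from the listings above are used; the helper `rules_hit` is a made-up name, and it assumes each report line starts with "score RULE".

```python
def rules_hit(report):
    """Map rule name -> score from report lines that start with 'score RULE'."""
    hits = {}
    for line in report.strip().splitlines():
        score, rule = line.split()[:2]
        hits[rule] = float(score)
    return hits

# Score and rule name for every line the two listings have in common.
common = """\
0.00 ARC_SIGNED
-0.10 ARC_VALID
-0.40 DCC_REPUT_00_12
0.10 DKIM_INVALID
0.10 DKIM_SIGNED
-0.00 DMARC_PASS
0.25 GMD_PDF_HORIZ
0.50 GMD_PDF_SQUARE
0.00 HTML_MESSAGE
1.02 MISSING_HEADERS
1.50 PHISH_LNK_URI
-0.00 RCVD_IN_DNSWL_NONE
0.00 RCVD_IN_VALIDITY_CERTIFIED_BLOCKED
0.00 RCVD_IN_VALIDITY_RPBL_BLOCKED
0.00 RCVD_IN_VALIDITY_SAFE_BLOCKED
-0.00 SPF_HELO_PASS
"""

without_bayes = rules_hit(common)
with_bayes = rules_hit(common + "1.50 BAYES_60\n")

# Symmetric difference: rules present in one listing but not the other.
only_in_one = set(with_bayes) ^ set(without_bayes)
score_gap = sum(with_bayes.values()) - sum(without_bayes.values())
print(sorted(only_in_one), f"{score_gap:.2f}")
```

The only rule that differs is BAYES_60, so the third copy scored 1.50 points higher than the other two.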
Re: eFa v5 bayes behaviour
Posted: 24 Sep 2024 07:24
by gregecslo
Or maybe because bayes storage is SQL based?
Re: eFa v5 bayes behaviour
Posted: 24 Sep 2024 08:10
by gregecslo
Also this:
Rule      Description                          Score  Total  Ham    Ham %  Spam  Spam %
BAYES_40  Bayes spam probability is 20 to 40%  0.00   2,784  2,721  97.7   63    2.3
BAYES_50  Bayes spam probability is 40 to 60%  0.80   126    93     73.8   33    26.2
BAYES_60  Bayes spam probability is 60 to 80%  1.50   437    127    29.1   310   70.9
BAYES_80  Bayes spam probability is 80 to 95%  7.00   266    1      0.4    265   99.6
I only have BAYES_40 to BAYES_80 after clearing the Bayes DB and manually RE-learning on 2500 HAM and 2500 SPAM messages.
So NO BAYES lower than 40 or higher than 80...
There is 100% something wrong here. Bayes is not a decision maker at all; for me it is useless. This indecisiveness, along with the fact that some mails aren't even BAYES scored, makes me think there is a bug, or did I implement it wrong?
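For reference, the percentage columns in the table above can be recomputed from the raw counts. This is just a sketch over the four rows as posted; the helper name `spam_share` is made up.

```python
# (rule, total, ham, spam) rows copied from the table in this post.
rows = [
    ("BAYES_40", 2784, 2721, 63),
    ("BAYES_50", 126, 93, 33),
    ("BAYES_60", 437, 127, 310),
    ("BAYES_80", 266, 1, 265),
]

def spam_share(total, spam):
    """Percentage of the messages hitting a rule that were spam."""
    return 100 * spam / total

for rule, total, ham, spam in rows:
    # Ham and spam counts should add up to the total for each rule.
    assert ham + spam == total
    print(f"{rule}: {spam_share(total, spam):.1f}% spam of {total} hits")

all_hits = sum(total for _, total, _, _ in rows)
print(f"BAYES_40 share of all Bayes hits: {100 * 2784 / all_hits:.1f}%")
```

The recomputed spam percentages match the table (2.3, 26.2, 70.9, 99.6), and BAYES_40 alone accounts for about 77% of the hits.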
Re: eFa v5 bayes behaviour
Posted: 27 Sep 2024 06:33
by gregecslo
I had BAYES_20 scored as "0" lol, so my bad here.
But there is still no BAYES_95 or BAYES_05, so the lower and upper ends are missing...
It really is different in 4.0.1
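To make clear what "missing" means here, this sketch maps a Bayes probability to the rule name that should fire. The 20 to 95% ranges are the ones shown in the rule descriptions earlier in this thread; the ranges for BAYES_00, BAYES_05, BAYES_95 and BAYES_99 are taken from the stock rule descriptions as I understand them, so treat them as an assumption. The exact handling of boundary values is also an assumption, and `bayes_bucket` is a made-up helper, not SpamAssassin code.

```python
import bisect

# Upper bound of each bucket except the last, as a probability.
# 0.20 to 0.95 come from the rule descriptions in this thread;
# 0.01, 0.05 and 0.99 are assumed from the stock rule descriptions.
BOUNDS = [0.01, 0.05, 0.20, 0.40, 0.60, 0.80, 0.95, 0.99]
NAMES = [
    "BAYES_00", "BAYES_05", "BAYES_20", "BAYES_40", "BAYES_50",
    "BAYES_60", "BAYES_80", "BAYES_95", "BAYES_99",
]

def bayes_bucket(p):
    """Return the BAYES_xx rule whose range contains probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # bisect_right counts how many bounds are <= p, which is the bucket index.
    return NAMES[bisect.bisect_right(BOUNDS, p)]

for p in (0.03, 0.35, 0.50, 0.85, 0.97):
    print(p, bayes_bucket(p))
```

So seeing no BAYES_05 and no BAYES_95 means the classifier never produced a probability below 5% or above 95% for any message.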