eFa v5 bayes behaviour

Bugs in eFa 5
gregecslo
Posts: 71
Joined: 09 Sep 2018 17:55

Post by gregecslo »

Hi.

I created a new eFa appliance on Rocky.

I configured it and all is good, except for bayes. It IS working and it IS learning, but when it classifies mail it is not nearly as decisive as it was in eFa v3 and v4.

I have:

Code: Select all

dbg: bayes: corpus size: nspam = 1190, nham = 12441
dbg: bayes: DB expiry: tokens in DB: 979401, Expiry max size: 1500000, Oldest atime: 1725361640, Newest atime: 1725888528, Last expire: 0, Current time: 1725888537
So I have enough spam/ham and really enough tokens...
What I find weird is this:

BAYES_50 and BAYES_40 have about 10,000 hits EACH.
BAYES_80 only 600
BAYES_95 even less: 341
BAYES_99: 284
BAYES_20 only 150
BAYES_60 only 87

I have no BAYES lower than 40 at all. I am training and also use autolearn.

I have also transferred the corpus trained on eFa v4, where it worked correctly.

Is SpamAssassin v4 really so much more conservative, or am I doing something wrong here?

Thanks!
gregecslo
Posts: 71
Joined: 09 Sep 2018 17:55

Re: eFa v5 bayes behaviour

Post by gregecslo »

One more thing...
Some mails don't even have a BAYES rule in the score list. Confirmed on 2 installs (one new and the other migrated from v4).

Code: Select all

Score	Matching Rule	Description
1.95	DATE_IN_FUTURE_06_12	Date: is 6 to 12 hours after Received: date
1.10	DCC_CHECK	Detected as bulk mail by DCC (dcc-servers.net)
0.10	DKIM_SIGNED	Message has a DKIM or DK signature, not necessarily valid
-0.50	DKIM_VALID	Message has at least one valid DKIM or DK signature
-1.00	DKIM_VALID_AU	Message has a valid DKIM or DK signature from author's domain
-0.10	DKIM_VALID_EF	Message has a valid DKIM or DK signature from envelope-from domain
-0.00	DMARC_PASS	DMARC pass policy
0.25	FREEMAIL_ENVFROM_END_DIGIT	Envelope-from freemail username ends in digit
0.30	FREEMAIL_FROM	Sender email is commonly abused enduser mail provider
0.00	HTML_MESSAGE	HTML included in message
-0.00	RCVD_IN_DNSWL_NONE	Sender listed at https://www.dnswl.org/, no trust
-0.00	SPF_HELO_PASS	SPF: HELO matches SPF record
2.50	URIBL_DBL_PHISH	Contains a Phishing URL listed in the Spamhaus DBL blocklist
I don't know why this happens...
shawniverson
Posts: 3776
Joined: 13 Jan 2014 23:30
Location: Indianapolis, Indiana USA

Re: eFa v5 bayes behaviour

Post by shawniverson »

I see you posted to the SpamAssassin list. I don't have a specific explanation, other than that the SA developers have made some major changes to the bayes scoring. I would try training bayes with a balanced corpus of known ham and spam, turning off autolearn (it isn't very smart, and any badly scored emails will automatically poison your bayes), and changing the scores of the bayes probability levels to handle the spammy emails appropriately.
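One way to see how balanced the trained corpus is: read the `corpus size` line that SpamAssassin prints in debug mode and compare the two counts. The following is a minimal sketch in Python, assuming the debug line has the format quoted earlier in this thread; the helper names `corpus_counts` and `ham_to_spam_ratio` are made up for illustration and are not part of SpamAssassin.

```python
import re

# Matches the debug line format quoted in this thread, e.g.
# "dbg: bayes: corpus size: nspam = 1190, nham = 12441"
CORPUS_RE = re.compile(r"bayes: corpus size: nspam = (\d+), nham = (\d+)")


def corpus_counts(debug_text):
    """Return (nspam, nham) found in debug output, or None if the line is absent."""
    match = CORPUS_RE.search(debug_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def ham_to_spam_ratio(nspam, nham):
    """Number of learned ham messages per learned spam message."""
    return nham / nspam


# Debug line copied from the first post in this thread.
line = "dbg: bayes: corpus size: nspam = 1190, nham = 12441"
nspam, nham = corpus_counts(line)
print(f"nspam={nspam} nham={nham} ham:spam = {ham_to_spam_ratio(nspam, nham):.1f}:1")
```

With the numbers from the first post, ham outnumbers spam by roughly ten to one, which is the kind of imbalance the advice above is aimed at.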
gregecslo
Posts: 71
Joined: 09 Sep 2018 17:55

Re: eFa v5 bayes behaviour

Post by gregecslo »

Yeah, guilty :)

It was really bothering me why the same exported training data behaved completely differently when imported into SA v4.

I got my answer, maybe it will help others.

And yeah, I will turn autolearn off for sure; it does more harm than good anyway...
gregecslo
Posts: 71
Joined: 09 Sep 2018 17:55

Re: eFa v5 bayes behaviour

Post by gregecslo »

I posted this on the mailing list as well...

I think we need to open a bug for this?

Hi again.

In SpamAssassin v4 there is something wrong with bayes...

I received 3 identical mails (1 external sender, 3 internal recipients) and the scores look like this:

2 of them like this:

Code: Select all

0.00    ARC_SIGNED      Message has a ARC signature
-0.10   ARC_VALID       Message has a valid ARC signature
-0.40   DCC_REPUT_00_12 DCC reputation between 0 and 12 % (mostly ham)
0.10    DKIM_INVALID    DKIM or DK signature exists, but is not valid
0.10    DKIM_SIGNED     Message has a DKIM or DK signature, not necessarily valid
-0.00   DMARC_PASS      DMARC pass policy
0.25    GMD_PDF_HORIZ   Contains pdf 100-240 (high) x 450-800 (wide)
0.50    GMD_PDF_SQUARE  Contains pdf 180-360 (high) x 180-360 (wide)
0.00    HTML_MESSAGE    HTML included in message
1.02    MISSING_HEADERS Missing To: header
1.50    PHISH_LNK_URI   Typical phishing tactic - pre filled mail in link
-0.00   RCVD_IN_DNSWL_NONE      Sender listed at https://www.dnswl.org/, no trust
0.00    RCVD_IN_VALIDITY_CERTIFIED_BLOCKED      ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information.
0.00    RCVD_IN_VALIDITY_RPBL_BLOCKED   ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information.
0.00    RCVD_IN_VALIDITY_SAFE_BLOCKED   ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information.
-0.00   SPF_HELO_PASS   SPF: HELO matches SPF record


And 1 like this:

Code: Select all

0.00    ARC_SIGNED      Message has a ARC signature
-0.10   ARC_VALID       Message has a valid ARC signature
1.50    BAYES_60        Bayes spam probability is 60 to 80%
-0.40   DCC_REPUT_00_12 DCC reputation between 0 and 12 % (mostly ham)
0.10    DKIM_INVALID    DKIM or DK signature exists, but is not valid
0.10    DKIM_SIGNED     Message has a DKIM or DK signature, not necessarily valid
-0.00   DMARC_PASS      DMARC pass policy
0.25    GMD_PDF_HORIZ   Contains pdf 100-240 (high) x 450-800 (wide)
0.50    GMD_PDF_SQUARE  Contains pdf 180-360 (high) x 180-360 (wide)
0.00    HTML_MESSAGE    HTML included in message
1.02    MISSING_HEADERS Missing To: header
1.50    PHISH_LNK_URI   Typical phishing tactic - pre filled mail in link
-0.00   RCVD_IN_DNSWL_NONE      Sender listed at https://www.dnswl.org/, no trust
0.00    RCVD_IN_VALIDITY_CERTIFIED_BLOCKED      ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information.
0.00    RCVD_IN_VALIDITY_RPBL_BLOCKED   ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information.
0.00    RCVD_IN_VALIDITY_SAFE_BLOCKED   ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information.
-0.00   SPF_HELO_PASS   SPF: HELO matches SPF record


Why does one have "BAYES_60" while the other 2 do not?

My thoughts so far:

1. This is not shortcircuit, as only bayes is different.
2. The mails are identical and the mail server load is... well, non-existent (1 minute load 0.08)
3. Maybe some new logic in bayes to skip some?
4. Race condition (IDK, I'm not a coder)
5. Bayes behaves inconsistently on BOTH installs I have it on
gregecslo
Posts: 71
Joined: 09 Sep 2018 17:55

Re: eFa v5 bayes behaviour

Post by gregecslo »

Or maybe because bayes storage is SQL based?
gregecslo
Posts: 71
Joined: 09 Sep 2018 17:55

Re: eFa v5 bayes behaviour

Post by gregecslo »

Also this:

Rule	Description	Score	Total	Ham	Ham %	Spam	Spam %
BAYES_40	Bayes spam probability is 20 to 40%	0.00	2,784	2,721	97.7	63	2.3
BAYES_50	Bayes spam probability is 40 to 60%	0.80	126	93	73.8	33	26.2
BAYES_60	Bayes spam probability is 60 to 80%	1.50	437	127	29.1	310	70.9
BAYES_80	Bayes spam probability is 80 to 95%	7.00	266	1	0.4	265	99.6
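The two percentage columns in the table above are consistent with each rule's ham hits and spam hits divided by its total hits. A small sketch that recomputes them from the counts in the table (the `shares` helper is only illustrative):

```python
# (rule, total hits, ham hits, spam hits) copied from the table above
rows = [
    ("BAYES_40", 2784, 2721, 63),
    ("BAYES_50", 126, 93, 33),
    ("BAYES_60", 437, 127, 310),
    ("BAYES_80", 266, 1, 265),
]


def shares(total, ham, spam):
    """Return (ham %, spam %) of the total hits, rounded to one decimal place."""
    return round(100 * ham / total, 1), round(100 * spam / total, 1)


for rule, total, ham, spam in rows:
    ham_pct, spam_pct = shares(total, ham, spam)
    print(f"{rule}: ham {ham_pct}%  spam {spam_pct}%")
```

Read this way, BAYES_80 is almost entirely spam and BAYES_40 almost entirely ham, so the classifier separates the two classes but does not push scores toward the extreme buckets.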

I only have BAYES_40 to BAYES_80 after clearing the bayes DB and manually re-learning on 2500 HAM and 2500 SPAM messages.
So NO BAYES lower than 40 or higher than 80...

There is 100% something wrong here; bayes is not a decision maker at all, and for me it is useless. This indecisiveness, along with the fact that some mails aren't even BAYES scored, makes me think there is a bug, or did I implement it wrong?
gregecslo
Posts: 71
Joined: 09 Sep 2018 17:55

Re: eFa v5 bayes behaviour

Post by gregecslo »

I had BAYES_20 scored as "0" lol, so my bad here. :)

But there is still no BAYES_95 or BAYES_05, so the lower and upper ends are missing...
It really is different in 4.0.1 :)
gregecslo
Posts: 71
Joined: 09 Sep 2018 17:55

Re: eFa v5 bayes behaviour

Post by gregecslo »

Hi again.

I would really like to understand why identical mail (we received 2) is treated differently with regard to BAYES filtering, see below:

BAYES is MISSING

Code: Select all

To: name.surname@domain.com
Subject: - Obnovite svoj kadrovski sertifikat
From: UniCredit <support@chuon.co.jp>
MIME-Version: 1.0
Content-Type: multipart/mixed;boundary=a0dfgta989888ff6vbf6458e7ca6ab21
Message-Id: <20241105132854.5BBNHGT56D24B@wps101r.anshin-sv.jp>
Date: Tue, 5 Nov 2024 22:28:54 +0900 (JST)
X-Originating-IP: [153.123.7.151]
X-Envelope-From: support@chuon.co.jp

Score	Matching Rule	Description
1.10	DCC_CHECK	Detected as bulk mail by DCC (dcc-servers.net)
0.29	DIGEST_MULTIPLE	Message hits more than one network digest check
0.10	DKIM_INVALID	DKIM or DK signature exists, but is not valid
0.10	DKIM_SIGNED	Message has a DKIM or DK signature, not necessarily valid
1.20	DMARC_QUAR	DMARC quarantine policy
0.00	HTML_MESSAGE	HTML included in message
1.50	PHISH_LNK_URI	Typical phishing tactic - pre filled mail in link
1.39	PYZOR_CHECK	Listed in Pyzor (https://pyzor.readthedocs.io/en/latest/)
2.00	RCVD_IN_BL_SPAMCOP_NET	Received via a relay in bl.spamcop.net
-0.00	RCVD_IN_DNSWL_NONE	Sender listed at https://www.dnswl.org/, no trust
-0.00	SPF_HELO_PASS	SPF: HELO matches SPF record
0.01	T_TVD_MIME_NO_HEADERS	 
100.00	MARKED BY EXTERNAL SPAM FILTER
BAYES is PRESENT

Code: Select all

To: name2.surname2@domain.com
Subject: - Obnovite svoj kadrovski sertifikat
From: UniCredit <support@chuon.co.jp>
MIME-Version: 1.0
Content-Type: multipart/mixed;boundary=536001743556gh42718394b0vbg53c25
Message-Id: <20241105130443.MNJ49E02NDG56@wps101r.anshin-sv.jp>
Date: Tue, 5 Nov 2024 22:04:43 +0900 (JST)
X-Originating-IP: [153.123.7.151]
X-Envelope-From: support@chuon.co.jp

Score	Matching Rule	Description
4.00	BAYES_60	Bayes spam probability is 60 to 80%
-0.10	DCC_REPUT_13_19	DCC reputation between 13 and 19 %
0.10	DKIM_INVALID	DKIM or DK signature exists, but is not valid
0.10	DKIM_SIGNED	Message has a DKIM or DK signature, not necessarily valid
1.20	DMARC_QUAR	DMARC quarantine policy
0.00	HTML_MESSAGE	HTML included in message
1.50	PHISH_LNK_URI	Typical phishing tactic - pre filled mail in link
2.00	RCVD_IN_BL_SPAMCOP_NET	Received via a relay in bl.spamcop.net
-0.00	RCVD_IN_DNSWL_NONE	Sender listed at https://www.dnswl.org/, no trust
-0.00	SPF_HELO_PASS	SPF: HELO matches SPF record
0.01	T_TVD_MIME_NO_HEADERS	 
100.00	MARKED BY EXTERNAL SPAM FILTER
This is super weird. I also trained my filter really carefully, but still I only have bayes hits like this:

Code: Select all

BAYES_20	Bayes spam probability is 5 to 20%	-0.50	5,139
BAYES_40	Bayes spam probability is 20 to 40%	0.50	5,399
BAYES_50	Bayes spam probability is 40 to 60%	1.00	202
BAYES_60	Bayes spam probability is 60 to 80%	4.00	747
BAYES_80	Bayes spam probability is 80 to 95%	6.00	300
So I have NO BAYES_05 or BAYES_99(9) extreme values.

I really do believe that something is wrong here. I have 2 eFa installs, one migrated from v4 and the other set up from scratch, and they both display the same behavior.

If I manually debug bayes on the first mail, the one without bayes, the manual test gives it BAYES_60....

On V4 bayes was FANTASTIC.

Thanks for any input or feedback!
gregecslo
Posts: 71
Joined: 09 Sep 2018 17:55

Re: eFa v5 bayes behaviour

Post by gregecslo »

When running a spamassassin -D test I can see BAYES_XX is added for the SAME mail for which MailScanner did NOT add a BAYES_XX score.

Code: Select all

Nov  6 14:21:26.964 [1036654] dbg: bayes: tokenized body: 2737 tokens
Nov  6 14:21:26.965 [1036654] dbg: bayes: tokenized uri: 120 tokens
Nov  6 14:21:26.965 [1036654] dbg: bayes: tokenized invisible: 6 tokens
Nov  6 14:21:26.966 [1036654] dbg: bayes: tokenized header: 82 tokens
Nov  6 14:21:26.970 [1036654] dbg: bayes: tok_get_all: token count: 586
Nov  6 14:21:26.981 [1036654] dbg: bayes: score = 0.101741519786307
and

Code: Select all

Nov  6 14:21:26.983 [1036654] dbg: check: tagrun - tag BAYESTCHAMMY is now ready, value: 2
Nov  6 14:21:26.983 [1036654] dbg: check: tagrun - tag BAYESTCSPAMMY is now ready, value: 0
Nov  6 14:21:26.983 [1036654] dbg: check: tagrun - tag BAYESTCLEARNED is now ready, value: 2
Nov  6 14:21:26.983 [1036654] dbg: check: tagrun - tag BAYESTC is now ready, value: 586
Nov  6 14:21:26.983 [1036654] dbg: check: tagrun - tag HAMMYTOKENS is now ready, value: CODE(0x560baecc6878)
Nov  6 14:21:26.983 [1036654] dbg: check: tagrun - tag SPAMMYTOKENS is now ready, value: CODE(0x560baecc5750)
Nov  6 14:21:26.983 [1036654] dbg: check: tagrun - tag TOKENSUMMARY is now ready, value: CODE(0x560bac43f1e0)
Nov  6 14:21:26.984 [1036654] dbg: rules: ran eval rule BAYES_20 ======> got hit (1)

Code: Select all

Nov  6 14:21:27.943 [1036654] dbg: check: tests=ALL_TRUSTED,BAYES_20,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HTML_MESSAGE,T_KAM_HTML_FONT_INVALID
Nov  6 14:21:27.943 [1036654] dbg: check: subtests=__ANY_IMAGE_ATTACH,__ANY_TEXT_ATTACH,__ANY_TEXT_ATTACH_DOC,__BODY_TEXT_LINE(3),__COMMENT_EXISTS,__CT,__CTYPE_HAS_BOUNDARY,__CTYPE_MULTIPART_ALT,__CTYPE_MULTIPART_ANY,__DKIM_DEPENDABLE,__DKIM_EXISTS,__DOS_BODY_MON,__DOS_BODY_TUE,__DOS_BODY_WED,__DOS_HAS_ANY_URI,__DOS_LINK,__DOS_RCVD_WED,__DOS_REF_TODAY,__E_LIKE_LETTER(320),__FILL_THIS_FORM_FRAUD_PHISH1,__FROM_FULL_NAME,__FROM_WORDY,__GB_FAKE_RF,__GB_TO_ADDR,__HAS_ANY_EMAIL,__HAS_ANY_URI,__HAS_DATE,__HAS_DKIM_SIGHD,__HAS_FROM,__HAS_HREF(23),__HAS_HREF_ONECASE(23),__HAS_IN_REPLY_TO,__HAS_MESSAGE_ID,__HAS_MSGID,__HAS_RCVD,__HAS_SUBJECT,__HAS_THREAD_INDEX,__HAS_TNEF,__HAS_TO,__HAS_URI,__HAS_X_LOOP,__HAS_X_REF,__HIGHBITS,__HS_SUBJ_RE_FW,__HTML_LINK_IMAGE,__HTML_SINGLET(5),__HUSH_HUSH,__JM_REACTOR_DATE,__JPEG_ATTACH,__JPEG_ATTACH_2P,__KAM_HTML_FONT_INVALID,__LOCAL_PP_NONPPURL,__LOWER_E(230),__L_BODY_8BITS,__MIME_BASE64,__MIME_HTML,__MIME_VERSION,__MSGID_OK_HOST,__NONEMPTY_BODY,__NOT_SPOOFED,__PART_STOCK_CD_F,__PNG_ATTACH_2P,__RATWARE_0_TZ_DATE,__SANE_MSGID,__SINGLE_WORD_LINE(2),__SUBJ_NOT_SHORT,__SUBJ_RE,__TAG_EXISTS_BODY,__TAG_EXISTS_HEAD,__TAG_EXISTS_HTML,__TAG_EXISTS_META,__TAG_EXISTS_STYLE,__THREADED,__TOCC_EXISTS,__TO_EQ_FM_DOM_HTML_IMG,__TO_EQ_FROM_DOM,__TO_EQ_FROM_DOM_1,__TVD_MIME_ATT_TP,__TVD_OUTLOOK_IMG,__URI_MAILTO(4),__YOUR_PERSONAL (Total Subtest Hits: 683 / Deduplicated Total Hits: 81)
MailWatch spam report in the GUI:

Code: Select all

Score	Matching Rule	Description
-6.00	ALL_TRUSTED	Passed through trusted hosts only via SMTP
0.10	DKIM_SIGNED	Message has a DKIM or DK signature, not necessarily valid
-0.50	DKIM_VALID	Message has at least one valid DKIM or DK signature
-1.00	DKIM_VALID_AU	Message has a valid DKIM or DK signature from author's domain
-0.10	DKIM_VALID_EF	Message has a valid DKIM or DK signature from envelope-from domain
0.00	HTML_MESSAGE	HTML included in message
0.01	T_KAM_HTML_FONT_INVALID	Test for Invalidly Named or Formatted Colors in HTML
So why are MailScanner and MailWatch not seeing this in the spam report?

It should be BAYES_20 but it is not...
shawniverson
Posts: 3776
Joined: 13 Jan 2014 23:30
Location: Indianapolis, Indiana USA

Re: eFa v5 bayes behaviour

Post by shawniverson »

What happens if you execute spamassassin as the user that MailScanner runs as?

Code: Select all

sudo su - postfix -s /bin/sh -c "spamassassin -D -t < msg"
gregecslo
Posts: 71
Joined: 09 Sep 2018 17:55

Re: eFa v5 bayes behaviour

Post by gregecslo »

Lol

I ran it 3 times...

First run:
Nov 8 16:37:06.578 [2580347] dbg: bayes: cannot use bayes on this message; not enough usable tokens found

Second run:
-0.5 BAYES_20

Third run:
1.5 BAYES_40

sudo su - postfix -s /bin/sh -c "spamassassin -D -t < /var/spool/MailScanner/quarantine/20241106/nonspam/4Xk5Fg0rtZzMxTvg 2> /tmp/logsa.txt"


Every time I run it on the SAME file "/var/spool/MailScanner/quarantine/20241106/nonspam/4Xk5Fg0rtZzMxTvg" I get different results :)
shawniverson
Posts: 3776
Joined: 13 Jan 2014 23:30
Location: Indianapolis, Indiana USA

Re: eFa v5 bayes behaviour

Post by shawniverson »

Uhhhh.... :crazy: :crazy: :crazy: :crazy:

Do me a favor, can you temporarily disable SELinux and run some more tests?

Can you confirm that bayes autolearning is off in the context of the postfix user? (see below)

Do you have a /etc/MailScanner/spamassassin.conf that is a regular file and has a valid symlink to it at /etc/mail/spamassassin/mailscanner.cf ?

Do your settings look like this in the file?

Code: Select all

bayes_store_module              Mail::SpamAssassin::BayesStore::SQL
bayes_sql_dsn                   DBI:mysql:sa_bayes:localhost
bayes_sql_username              sa_user
bayes_sql_password              ***********************************************************
bayes_sql_override_username     postfix

bayes_auto_learn                   0
gregecslo
Posts: 71
Joined: 09 Sep 2018 17:55

Re: eFa v5 bayes behaviour

Post by gregecslo »

Hi.

Disabled selinux and rebooted.
Same result.

Ran test command 6 times:

1. Content analysis details: (9.8 points, 5.0 required) - NO BAYES
2. Content analysis details: (9.8 points, 5.0 required) - NO BAYES
3. Content analysis details: (9.8 points, 5.0 required) - BAYES_40
4. Content analysis details: (9.9 points, 5.0 required) - NO BAYES
5. Content analysis details: (9.9 points, 5.0 required) - NO BAYES
6. Content analysis details: (9.7 points, 5.0 required) - BAYES_20


Can you confirm that bayes autolearning is off in the context of the postfix user? (see below)
Bayes autolearn is OFF, 100%


Do you have a /etc/MailScanner/spamassassin.conf that is a regular file and has a valid symlink to it at /etc/mail/spamassassin/mailscanner.cf ?
YES


Do your settings look like this in the file?
YES


When NO bayes:

Code: Select all

Nov  9 07:26:54.807 [7726] dbg: config: fixed relative path: /var/lib/spamassassin/4.000001/updates_spamassassin_org/23_bayes.cf
Nov  9 07:26:54.807 [7726] dbg: config: using "/var/lib/spamassassin/4.000001/updates_spamassassin_org/23_bayes.cf" for included file
Nov  9 07:26:54.807 [7726] dbg: config: read file /var/lib/spamassassin/4.000001/updates_spamassassin_org/23_bayes.cf
Nov  9 07:26:54.807 [7726] dbg: config: parsing file /var/lib/spamassassin/4.000001/updates_spamassassin_org/23_bayes.cf
Nov  9 07:26:54.880 [7726] dbg: config: fixed relative path: /var/lib/spamassassin/4.000001/updates_spamassassin_org/60_bayes_stopwords.cf
Nov  9 07:26:54.880 [7726] dbg: config: using "/var/lib/spamassassin/4.000001/updates_spamassassin_org/60_bayes_stopwords.cf" for included file
Nov  9 07:26:54.880 [7726] dbg: config: read file /var/lib/spamassassin/4.000001/updates_spamassassin_org/60_bayes_stopwords.cf
Nov  9 07:26:54.880 [7726] dbg: config: parsing file /var/lib/spamassassin/4.000001/updates_spamassassin_org/60_bayes_stopwords.cf
Nov  9 07:26:55.889 [7726] dbg: bayes: stopwords for languages enabled: en
Nov  9 07:26:55.908 [7726] dbg: bayes: learner_new self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x5581a670b768), bayes_store_module=Mail::SpamAssassin::BayesStore::SQL
Nov  9 07:26:55.930 [7726] dbg: bayes: using username: postfix
Nov  9 07:26:55.930 [7726] dbg: bayes: learner_new: got store=Mail::SpamAssassin::BayesStore::SQL=HASH(0x5581a820bb28)
Nov  9 07:26:55.940 [7726] dbg: bayes: database connection established
Nov  9 07:26:55.940 [7726] dbg: bayes: found bayes db version 3
Nov  9 07:26:55.940 [7726] dbg: bayes: Using userid: 1
Nov  9 07:26:56.998 [7726] dbg: bayes: corpus size: nspam = 10193, nham = 19971
Nov  9 07:26:56.999 [7726] dbg: bayes: skipped token 'Your' because it's in stopword list for language 'en'
Nov  9 07:26:56.999 [7726] dbg: bayes: skipped token 'that' because it's in stopword list for language 'en'
Nov  9 07:26:56.999 [7726] dbg: bayes: skipped token 'your' because it's in stopword list for language 'en'
Nov  9 07:26:56.999 [7726] dbg: bayes: skipped token 'for' because it's in stopword list for language 'en'
Nov  9 07:26:56.999 [7726] dbg: bayes: skipped token 'the' because it's in stopword list for language 'en'
Nov  9 07:26:56.999 [7726] dbg: bayes: skipped token 'has' because it's in stopword list for language 'en'
Nov  9 07:26:56.999 [7726] dbg: bayes: skipped token 'and' because it's in stopword list for language 'en'
Nov  9 07:26:56.999 [7726] dbg: bayes: skipped token 'are' because it's in stopword list for language 'en'
Nov  9 07:26:57.000 [7726] dbg: bayes: skipped token 'This' because it's in stopword list for language 'en'
Nov  9 07:26:57.000 [7726] dbg: bayes: skipped token 'you' because it's in stopword list for language 'en'
Nov  9 07:26:57.000 [7726] dbg: bayes: skipped token 'more' because it's in stopword list for language 'en'
Nov  9 07:26:57.000 [7726] dbg: bayes: tokenized body: 103 tokens
Nov  9 07:26:57.000 [7726] dbg: bayes: skipped token 'email' because it's in stopword list for language 'en'
Nov  9 07:26:57.001 [7726] dbg: bayes: tokenized uri: 34 tokens
Nov  9 07:26:57.001 [7726] dbg: bayes: tokenized invisible: 0 tokens
Nov  9 07:26:57.005 [7726] dbg: bayes: tokenized header: 78 tokens
Nov  9 07:26:57.005 [7726] dbg: bayes: tok_get_all: token count: 172
Nov  9 07:26:57.009 [7726] dbg: bayes: cannot use bayes on this message; not enough usable tokens found
Nov  9 07:26:57.009 [7726] dbg: bayes: not scoring message, returning undef
Nov  9 07:26:57.408 [7726] dbg: auto-welcomelist: sql-based connected to DBI:mysql:sa_bayes:localhost
Nov  9 07:26:57.457 [7726] dbg: auto-welcomelist: sql-based finish: disconnected from DBI:mysql:sa_bayes:localhost
Nov  9 07:26:57.514 [7726] dbg: timing: total 3150 ms - init: 1707 (54.2%), b_tie_ro: 10 (0.3%), parse: 2.9 (0.1%), extract_message_metadata: 38 (1.2%), tests_pri_-10000: 13 (0.4%), compile_gen: 208 (6.6%), get_uri_detail_list: 7 (0.2%), tests_pri_-2000: 9 (0.3%), compile_eval: 36 (1.1%), tests_pri_-1000: 6 (0.2%), tests_pri_-950: 3.7 (0.1%), tests_pri_-900: 4.2 (0.1%), tests_pri_-100: 852 (27.0%), check_dcc: 9 (0.3%), check_spf: 22 (0.7%), poll_dns_idle: 0.06 (0.0%), dkim_load_modules: 34 (1.1%), check_dkim_signature: 12 (0.4%), check_dkim_adsp: 4.6 (0.1%), check_pyzor: 3.4 (0.1%), check_razor2: 6 (0.2%), tests_pri_-90: 22 (0.7%), check_bayes: 14 (0.4%), b_tokenize: 7 (0.2%), b_tok_get_all: 2.5 (0.1%), b_comp_prob: 0.86 (0.0%), b_finish: 0.00 (0.0%), tests_pri_0: 347 (11.0%), tests_pri_10: 6 (0.2%), tests_pri_500: 7 (0.2%), tests_pri_1000: 93 (2.9%), total_txrep: 84 (2.7%), check_txrep_msg_id: 6 (0.2%), update_txrep_msg_id: 18 (0.6%), check_txrep_email_ip: 14 (0.5%), update_txrep_email_ip: 1.45 (0.0%), check_txrep_domain: 1.10 (0.0%), update_txrep_domain: 1.33 (0.0%), check_txrep_helo: 1.23 (0.0%), update_txrep_helo: 1.87 (0.1%), check_txrep_ip: 1.03 (0.0%), update_txrep_ip: 2.3 (0.1%), tests_pri_2000: 35 (1.1%)

When it has bayes:

Code: Select all

Nov  9 07:28:38.900 [8120] dbg: config: fixed relative path: /var/lib/spamassassin/4.000001/updates_spamassassin_org/23_bayes.cf
Nov  9 07:28:38.900 [8120] dbg: config: using "/var/lib/spamassassin/4.000001/updates_spamassassin_org/23_bayes.cf" for included file
Nov  9 07:28:38.900 [8120] dbg: config: read file /var/lib/spamassassin/4.000001/updates_spamassassin_org/23_bayes.cf
Nov  9 07:28:38.900 [8120] dbg: config: parsing file /var/lib/spamassassin/4.000001/updates_spamassassin_org/23_bayes.cf
Nov  9 07:28:38.959 [8120] dbg: config: fixed relative path: /var/lib/spamassassin/4.000001/updates_spamassassin_org/60_bayes_stopwords.cf
Nov  9 07:28:38.959 [8120] dbg: config: using "/var/lib/spamassassin/4.000001/updates_spamassassin_org/60_bayes_stopwords.cf" for included file
Nov  9 07:28:38.959 [8120] dbg: config: read file /var/lib/spamassassin/4.000001/updates_spamassassin_org/60_bayes_stopwords.cf
Nov  9 07:28:38.960 [8120] dbg: config: parsing file /var/lib/spamassassin/4.000001/updates_spamassassin_org/60_bayes_stopwords.cf
Nov  9 07:28:39.825 [8120] dbg: bayes: stopwords for languages enabled: en
Nov  9 07:28:39.838 [8120] dbg: bayes: learner_new self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x55bc09e57378), bayes_store_module=Mail::SpamAssassin::BayesStore::SQL
Nov  9 07:28:39.857 [8120] dbg: bayes: using username: postfix
Nov  9 07:28:39.857 [8120] dbg: bayes: learner_new: got store=Mail::SpamAssassin::BayesStore::SQL=HASH(0x55bc0b235a10)
Nov  9 07:28:39.865 [8120] dbg: bayes: database connection established
Nov  9 07:28:39.866 [8120] dbg: bayes: found bayes db version 3
Nov  9 07:28:39.866 [8120] dbg: bayes: Using userid: 1
Nov  9 07:28:40.869 [8120] dbg: bayes: corpus size: nspam = 10193, nham = 19971
Nov  9 07:28:40.869 [8120] dbg: bayes: skipped token 'Your' because it's in stopword list for language 'en'
Nov  9 07:28:40.870 [8120] dbg: bayes: skipped token 'that' because it's in stopword list for language 'en'
Nov  9 07:28:40.870 [8120] dbg: bayes: skipped token 'your' because it's in stopword list for language 'en'
Nov  9 07:28:40.870 [8120] dbg: bayes: skipped token 'for' because it's in stopword list for language 'en'
Nov  9 07:28:40.870 [8120] dbg: bayes: skipped token 'the' because it's in stopword list for language 'en'
Nov  9 07:28:40.870 [8120] dbg: bayes: skipped token 'has' because it's in stopword list for language 'en'
Nov  9 07:28:40.870 [8120] dbg: bayes: skipped token 'and' because it's in stopword list for language 'en'
Nov  9 07:28:40.870 [8120] dbg: bayes: skipped token 'are' because it's in stopword list for language 'en'
Nov  9 07:28:40.870 [8120] dbg: bayes: skipped token 'This' because it's in stopword list for language 'en'
Nov  9 07:28:40.870 [8120] dbg: bayes: skipped token 'you' because it's in stopword list for language 'en'
Nov  9 07:28:40.871 [8120] dbg: bayes: skipped token 'more' because it's in stopword list for language 'en'
Nov  9 07:28:40.871 [8120] dbg: bayes: tokenized body: 103 tokens
Nov  9 07:28:40.871 [8120] dbg: bayes: skipped token 'email' because it's in stopword list for language 'en'
Nov  9 07:28:40.871 [8120] dbg: bayes: tokenized uri: 34 tokens
Nov  9 07:28:40.871 [8120] dbg: bayes: tokenized invisible: 0 tokens
Nov  9 07:28:40.873 [8120] dbg: bayes: tokenized header: 78 tokens
Nov  9 07:28:40.874 [8120] dbg: bayes: tok_get_all: token count: 172
Nov  9 07:28:40.877 [8120] dbg: bayes: score = 0.827619312007403
Nov  9 07:28:41.225 [8120] dbg: auto-welcomelist: sql-based connected to DBI:mysql:sa_bayes:localhost
Nov  9 07:28:41.242 [8120] dbg: auto-welcomelist: sql-based finish: disconnected from DBI:mysql:sa_bayes:localhost
Nov  9 07:28:41.261 [8120] dbg: timing: total 2736 ms - init: 1450 (53.0%), b_tie_ro: 9 (0.3%), parse: 2.6 (0.1%), extract_message_metadata: 33 (1.2%), tests_pri_-10000: 9 (0.3%), compile_gen: 181 (6.6%), get_uri_detail_list: 4.3 (0.2%), tests_pri_-2000: 5.0 (0.2%), compile_eval: 31 (1.1%), tests_pri_-1000: 4.2 (0.2%), tests_pri_-950: 3.3 (0.1%), tests_pri_-900: 3.3 (0.1%), tests_pri_-100: 818 (29.9%), check_dcc: 6 (0.2%), dkim_load_modules: 40 (1.5%), check_dkim_signature: 10 (0.4%), poll_dns_idle: 0.05 (0.0%), check_spf: 22 (0.8%), check_dkim_adsp: 2.9 (0.1%), check_pyzor: 3.8 (0.1%), check_razor2: 5 (0.2%), tests_pri_-90: 38 (1.4%), check_bayes: 14 (0.5%), b_tokenize: 5 (0.2%), b_tok_get_all: 2.5 (0.1%), b_comp_prob: 0.70 (0.0%), b_tok_touch_all: 0.31 (0.0%), b_finish: 2.4 (0.1%), tests_pri_0: 300 (11.0%), tests_pri_10: 6 (0.2%), tests_pri_500: 8 (0.3%), tests_pri_1000: 45 (1.6%), total_txrep: 38 (1.4%), check_txrep_msg_id: 1.70 (0.1%), update_txrep_msg_id: 1.41 (0.1%), check_txrep_email_ip: 0.53 (0.0%), update_txrep_email_ip: 6 (0.2%), check_txrep_domain: 0.97 (0.0%), update_txrep_domain: 1.28 (0.0%), check_txrep_helo: 0.91 (0.0%), update_txrep_helo: 1.16 (0.0%), check_txrep_ip: 0.87 (0.0%), update_txrep_ip: 1.12 (0.0%), tests_pri_2000: 4.3 (0.2%)
I mean... This is crazy: sometimes I get BAYES_20, sometimes BAYES_40 or even BAYES_80, and sometimes no bayes at all ON THE SAME MESSAGE.

Exactly :crazy: :crazy: :crazy:
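To compare repeated runs without reading each log by eye, the two debug lines that differ between the runs above can be extracted automatically. This is a rough sketch in Python, assuming the debug output contains either a `bayes: score = ...` line or the `not enough usable tokens` line exactly as quoted in this thread; `bayes_outcome` is a hypothetical helper, not something SpamAssassin provides.

```python
import re

# The two messages that differ between the runs quoted in this thread.
SCORE_RE = re.compile(r"bayes: score = (\S+)")
SKIP_MARK = "bayes: cannot use bayes on this message; not enough usable tokens found"


def bayes_outcome(debug_text):
    """Classify one run's debug output as scored, skipped, or unknown."""
    match = SCORE_RE.search(debug_text)
    if match:
        return ("scored", float(match.group(1)))
    if SKIP_MARK in debug_text:
        return ("skipped", None)
    return ("unknown", None)


# One line from each of the two logs posted above.
run_without_bayes = (
    "Nov  9 07:26:57.009 [7726] dbg: bayes: cannot use bayes on this message; "
    "not enough usable tokens found"
)
run_with_bayes = "Nov  9 07:28:40.877 [8120] dbg: bayes: score = 0.827619312007403"

outcomes = [bayes_outcome(run_without_bayes), bayes_outcome(run_with_bayes)]
print(outcomes)
# More than one distinct outcome for the same message means the runs disagree.
print("inconsistent" if len(set(outcomes)) > 1 else "consistent")
```

Feeding it the saved stderr of each run (for example the /tmp/logsa.txt file written by the command earlier in the thread) would give one outcome per run to compare.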
shawniverson
Posts: 3776
Joined: 13 Jan 2014 23:30
Location: Indianapolis, Indiana USA

Re: eFa v5 bayes behaviour

Post by shawniverson »

Given that bayes is behaving this unpredictably for you, it is essentially unreliable and useless. I am going to populate bayes and try to reproduce the same behaviour on my end to continue troubleshooting with you.

I reached out for help from SpamAssassin on IRC and they want us to:

1) Try bayes using a traditional db file instead of MySQL (this will require exporting your bayes database, changing the configuration away from MariaDB to use a file, and importing the database back). This will help determine if the problem is specific to the type of bayes db backend in use.

2) Enable db logging on MariaDB to see what queries are being made to the bayes database when MySQL is in use.
shawniverson
Posts: 3776
Joined: 13 Jan 2014 23:30
Location: Indianapolis, Indiana USA

Re: eFa v5 bayes behaviour

Post by shawniverson »

I am able to reproduce the same behavior.

Code: Select all

sudo su - postfix -s /bin/sh -c "spamassassin -D bayes -t < /var/spool/MailScanner/quarantine/20241110/spam/4XmZWK3SLgzcDDsV"
Run #1

Code: Select all

Nov 11 21:38:24.182 [2892651] dbg: bayes: tokenized header: 62 tokens
Nov 11 21:38:24.183 [2892651] dbg: bayes: tok_get_all: token count: 224
Nov 11 21:38:24.190 [2892651] dbg: bayes: token '(unknown)' => 0.995425742574258
Nov 11 21:38:24.191 [2892651] dbg: bayes: token '(unknown)' => 0.986543689320388
Nov 11 21:38:24.191 [2892651] dbg: bayes: score = 0.982978963026553

3.0 BAYES_95               BODY: Bayes spam probability is 95 to 99%
                            [score: 0.9830]
Run #2

Code: Select all

Nov 11 21:40:03.665 [2893380] dbg: bayes: tokenized header: 62 tokens
Nov 11 21:40:03.667 [2893380] dbg: bayes: tok_get_all: token count: 224
Nov 11 21:40:03.672 [2893380] dbg: bayes: token '(unknown)' => 0.999231281198003
Nov 11 21:40:03.673 [2893380] dbg: bayes: score = 0.88213767969444

2.0 BAYES_80               BODY: Bayes spam probability is 80 to 95%
                            [score: 0.8821]

Run #3

Code: Select all

Nov 11 21:40:15.160 [2893458] dbg: bayes: tokenized header: 62 tokens
Nov 11 21:40:15.161 [2893458] dbg: bayes: tok_get_all: token count: 224
Nov 11 21:40:15.165 [2893458] dbg: bayes: cannot use bayes on this message; not enough usable tokens found
Nov 11 21:40:15.166 [2893458] dbg: bayes: not scoring message, returning undef
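The rule that ends up in the report follows from the logged `bayes: score` value, so the three runs can be compared as buckets. The sketch below maps a score to a rule name using the ranges from the rule descriptions quoted in this thread (5 to 20%, 20 to 40%, and so on). The ranges for BAYES_00, BAYES_05 and BAYES_99, and the handling of exact boundary values, are assumptions based on the stock rule set rather than anything shown here, and BAYES_999 is not modeled.

```python
# (lower bound inclusive, upper bound exclusive, rule name)
# BAYES_20 through BAYES_95 follow the descriptions quoted in this thread;
# BAYES_00 and BAYES_05 are assumed from the stock rule set.
BUCKETS = [
    (0.00, 0.01, "BAYES_00"),
    (0.01, 0.05, "BAYES_05"),
    (0.05, 0.20, "BAYES_20"),
    (0.20, 0.40, "BAYES_40"),
    (0.40, 0.60, "BAYES_50"),
    (0.60, 0.80, "BAYES_60"),
    (0.80, 0.95, "BAYES_80"),
    (0.95, 0.99, "BAYES_95"),
]


def bayes_rule(score):
    """Return the BAYES_xx rule name for a score, or None if bayes was skipped."""
    if score is None:
        return None  # "not scoring message, returning undef"
    if score >= 0.99:
        return "BAYES_99"  # assumed top bucket; BAYES_999 is not modeled
    for low, high, name in BUCKETS:
        if low <= score < high:
            return name
    return None


# Scores from Run #1, Run #2 and Run #3 above (Run #3 produced no score).
for score in (0.982978963026553, 0.88213767969444, None):
    print(score, "->", bayes_rule(score))
```

The first two runs land in BAYES_95 and BAYES_80, matching the report lines shown, while the third produces no BAYES rule at all.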
shawniverson
Posts: 3776
Joined: 13 Jan 2014 23:30
Location: Indianapolis, Indiana USA

Re: eFa v5 bayes behaviour

Post by shawniverson »

Please test the following:

Change the following line in /etc/mail/spamassassin/mailscanner.cf:

From:

Code: Select all

bayes_store_module               Mail::SpamAssassin::BayesStore::SQL
To:

Code: Select all

bayes_store_module               Mail::SpamAssassin::BayesStore::MySQL
Report back results.
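After editing the file, it can help to confirm which store module the configuration actually names. A minimal sketch in Python, assuming the simple `key value` line format shown in this thread; `bayes_store_module` is a hypothetical helper written for this post, and treating the last matching line as the effective one is an assumption about how repeated settings are resolved.

```python
def bayes_store_module(config_text):
    """Return the value of the last bayes_store_module line, or None if unset."""
    module = None
    for raw_line in config_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # ignore comments
        parts = line.split()
        if len(parts) == 2 and parts[0] == "bayes_store_module":
            module = parts[1]  # assumption: a later line overrides an earlier one
    return module


# Sample text modeled on the settings quoted earlier in this thread.
sample = """
bayes_store_module               Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn                   DBI:mysql:sa_bayes:localhost
bayes_sql_username              sa_user
"""

module = bayes_store_module(sample)
print(module)
print("ok" if module == "Mail::SpamAssassin::BayesStore::MySQL" else "still on another backend")
```

On a real system the text would come from reading /etc/mail/spamassassin/mailscanner.cf instead of the sample string.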
gregecslo
Posts: 71
Joined: 09 Sep 2018 17:55

Re: eFa v5 bayes behaviour

Post by gregecslo »

Hi

Tested on one instance, and now the bayes result is consistent across 5 or 10 tests. I also get BAYES_00 hits, which I hadn't had before this change.
So far it looks very promising.

Thanks man!
_M_P
Posts: 8
Joined: 06 Aug 2024 07:03
Location: Italy

Re: eFa v5 bayes behaviour

Post by _M_P »

shawniverson wrote: 12 Nov 2024 05:14

Please test the following:

Change the following line in /etc/mail/spamassassin/mailscanner.cf:

From:

Code: Select all

bayes_store_module               Mail::SpamAssassin::BayesStore::SQL
To:

Code: Select all

bayes_store_module               Mail::SpamAssassin::BayesStore::MySQL
Report back results.
Nice: as usual, thank you.
I've made the change; is there anything else I should do with SpamAssassin?

Regards...