
eFa Filter sudden restarts

Posted: 26 Apr 2022 10:52
by wstemb
Since the day of installation I have seen a lot of "silent", unsolicited restarts of the server. There are many directories in /var/crash, about 3 to 5 new ones a day.
Most of the restarts are so short that they are almost invisible to users, but some of them delayed messages (the msmilter service was down and the server restarted repeatedly).
I inspected the system (as far as my knowledge of eFa and MailScanner allows), and the problem seems to be connected in some way to the unbound service: in some of the logs, just before the restart, I find errors like this:

mail postfix/smtp[65537]: 4KjyMK2nxczN2spt: to=<xxxxxxxxxxxxx>, relay=none, delay=391070, delays=391070/0.03/0/0, dsn=4.4.3, status=deferred (Host or domain name not found. Name service error for name=xxxxxxxxxxx type=MX: Host not found, try again)

I am not sure, but it seems to me that this error occurred when Postfix tried to resend deferred messages from the outbound queue.

No errors on the company firewall at the time of restart.

After the automatic restart, at least after the last two, there was an error message in the log and in the output of systemctl status unbound:

failed lookup, cannot probe to master k.root-servers.net

eFa is set up as a relay at the perimeter, controlling the mail entering and leaving the network, and most of the time it does its job.

MailWatch Version: 1.2.18
Operating System Version: CentOS Stream 8
Postfix Version: 3.5.9
MailScanner Version: 5.4.4
ClamAV Version: 0.103.5
SpamAssassin Version: 3.4.6
PHP Version: 7.2.24
MySQL Version: 10.3.28-MariaDB
GeoIP Database Version: GeoLite2 Country database 2018-06-07 22:38:29

unbound -V
Version 1.11.0

Configure line: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-pythonmodule --with-pyunbound PYTHON=/usr/libexec/platform-python --with-libevent --with-pthreads --with-ssl --disable-rpath --disable-static --enable-relro-now --enable-pie --enable-subnet --enable-ipsecmod --with-conf-file=/etc/unbound/unbound.conf --with-pidfile=/var/run/unbound/unbound.pid --enable-sha2 --disable-gost --enable-ecdsa --with-rootkey-file=/var/lib/unbound/root.key
Linked libs: libevent 2.1.8-stable (it uses epoll), OpenSSL 1.1.1k FIPS 25 Mar 2021
Linked modules: dns64 python ipsecmod subnetcache respip validator iterator

BSD licensed, see LICENSE in source package for details.
Report bugs to unbound-bugs@nlnetlabs.nl or https://github.com/NLnetLabs/unbound/issues

unbound has the initial configuration plus localforward.conf in conf.d, which defines the local (intranet) DNS queries.

Has anybody experienced something similar?
Walter

Re: eFa Filter sudden restarts

Posted: 26 Apr 2022 23:01
by shawniverson
It sounds like things are dying on your box before the system crashes. How do your resources look? Any indication in the logs as to why?

Re: eFa Filter sudden restarts

Posted: 27 Apr 2022 05:50
by pdwalker
It is really unusual for a centos box to restart unexpectedly.

Is the server having hardware problems? Is the system short of memory? Are there any errors before a restart in /var/log/messages?

I think you have to identify and resolve this problem first. The programs that make up eFa are incapable of causing server restarts.

Re: eFa Filter sudden restarts

Posted: 27 Apr 2022 14:38
by wstemb
Resources are not an issue. The eFa filter is running in an ESX VM on a blade server, and as a first step I gave it a lot of computing resources: 4 cores, 16GB RAM.

The number of restarts is very high, but most of them are invisible to the rest of the system, and the mail throughput is still not very high. On 20 Apr. we had a problem with the clamd@scan service because of a "Malformed Database" error. Until I deleted the database and rebuilt it using freshclam, the system was crashing / rebooting every 3-6 minutes.

I can't find anything in the messages log: minutes of nothing and then the boot process.

I am now working with journalctl and crash to find the parts of the logs that were not written to files.
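
For reference, a minimal shell sketch of that kind of inspection. It assumes the journal is persistent (the directory /var/log/journal exists; otherwise journald keeps only the current boot) and that kdump writes a vmcore-dmesg.txt into each /var/crash subdirectory, which is the usual layout on CentOS 8 but is not confirmed anywhere in this thread. The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, paths are the usual CentOS 8 defaults.

# Succeeds if journald keeps logs across reboots (persistent journal directory exists).
journal_is_persistent() {
    [ -d "${1:-/var/log/journal}" ]
}

# Prints the newest vmcore-dmesg.txt below the crash directory, if there is one.
latest_crash_dmesg() {
    ls -1t "${1:-/var/crash}"/*/vmcore-dmesg.txt 2>/dev/null | head -n 1
}

# Shows what the journal recorded right before the previous boot ended.
show_previous_boot_tail() {
    journalctl --list-boots               # one line per boot the journal knows about
    journalctl -b -1 -n 200 --no-pager    # last 200 lines of the previous boot
}
```

Running show_previous_boot_tail as root after a crash, and then reading the file printed by latest_crash_dmesg, should show the last messages before the restart, if any were captured.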

Re: eFa Filter sudden restarts

Posted: 27 Apr 2022 17:17
by freyuh
A stupid question, but have you done a file system check on your eFa?
Maybe the filesystem is faulty...

Re: eFa Filter sudden restarts

Posted: 28 Apr 2022 13:41
by wstemb
The machine and file systems are OK.

The restarts are in some way connected to the applications running on the system, but I still have to find the connection. One piece of evidence is from Apr. 20, when, because of the "Malformed Database" error, the clamd@scan service was stuck in "activating" instead of active, and we had a crash / restart every few minutes...

______________________

I changed two things on the system:
1. I suppressed the NDRs on the mail server. The Postfix outbound queue was always large because of NDRs that could not be delivered and were deferred. However, we also had a system restart with mailq = 2.
2. I changed a parameter in unbound.conf, just as a test (because of the error at unbound startup: failed lookup, cannot probe to master k.root-servers.net). If this turns out to be the cause, I will post here what I changed.

Now the uptime is 2 days and 3 hours, the mailq is 0, and there have been no restarts of services or of the machine (the start time of most services is close to the boot time of the machine).

Before, I had a lot of service restarts, many more than system restarts, probably due to this cron job:

CROND[592305]: (root) CMD (/usr/sbin/eFa-Monitor-cron >/dev/null 2>&1)

which tests the services every minute and restarts them if necessary.

But I cannot be sure that this is solved now (better to say worked around), so I am waiting with journalctl to see the next crash and the real cause. In a few days I will return to the previous configuration, step by step, and continue to watch journalctl.

I noticed something in the system startup:

Postfix is started at: "Active: active (running) since Tue 2022-04-26 11:32:35 CEST; 2 days ago"
Unbound is started at: "Active: active (running) since Tue 2022-04-26 11:33:09 CEST; 2 days ago"

Postfix starts about half a minute before unbound, so I think this is the reason for the errors in maillog about domains that could not be resolved when Postfix restarts (mailq resending).

mail postfix/smtp[65537]: 4KjyMK2nxczN2spt: to=<xxxxxxxxxxxxx>, relay=none, delay=391070, delays=391070/0.03/0/0, dsn=4.4.3, status=deferred (Host or domain name not found. Name service error for name=xxxxxxxxxxx type=MX: Host not found, try again)
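
If the start order really is the cause of these deferrals, one possible way to express it is a systemd drop-in that orders Postfix after unbound. This is only a sketch, not something eFa ships as far as this thread shows; the unit name postfix.service and the drop-in path mentioned below are assumptions, and the helper name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: prints a drop-in that makes postfix.service wait for unbound.service.
# After= only orders the two units; Wants= additionally pulls unbound in when Postfix starts.
postfix_after_unbound_dropin() {
    cat <<'EOF'
[Unit]
Wants=unbound.service
After=unbound.service
EOF
}
```

The output could be saved as /etc/systemd/system/postfix.service.d/after-unbound.conf, followed by systemctl daemon-reload. This only delays Postfix until systemd considers unbound started, which is not a guarantee that unbound can already resolve names.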

What is the purpose of this cron job, which runs every minute:

CROND[593919]: (root) CMD (/usr/sbin/checkreboot.sh)

It checks whether the file /reboot.system exists and, if so, reboots the system. But I can't find out who creates the reboot.system file, or when and why :-(
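
One way to narrow this down, sketched under the assumption that whatever creates the flag is a script or page on the same box that mentions the file name literally: search the likely directories for the string. The function name and the directory list in the usage note are illustrative guesses, not taken from eFa.

```shell
#!/bin/sh
# Hypothetical helper: lists every file below the given directories
# that mentions the reboot flag file by name.
find_flag_references() {
    grep -rlF -- 'reboot.system' "$@" 2>/dev/null
}
```

For example, find_flag_references /usr/sbin /etc/cron.d /var/www would print the scripts or pages that reference the flag, which are the candidates for creating it.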

Re: eFa Filter sudden restarts

Posted: 28 Apr 2022 15:25
by freyuh
Which virtual SCSI controller and network adapter do you use?

Re: eFa Filter sudden restarts

Posted: 28 Apr 2022 15:50
by wstemb
VM settings:

Virtual network adapter type: VMXNET3
SCSI controller: VMware Paravirtual

lspci extract:

03:00.0 Serial Attached SCSI controller: VMware PVSCSI SCSI Controller (rev 02)
0b:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)

Re: eFa Filter sudden restarts

Posted: 29 Apr 2022 06:57
by freyuh
OK, the same as we do.
Two eFas on two different ESXi.
One is rocky linux 8.5 and one CentOS 8.
And they are running both flawlessly ...

Re: eFa Filter sudden restarts

Posted: 29 Apr 2022 08:21
by pdwalker
That's very strange.

I've been running eFa on ESXi 6.5 for years without a single unplanned reboot.

Are there any logs in ESXi you can check? Or is there anything inside the VM itself that might clue you in?

This is not an EFA problem, but highly likely a problem with your vm.

Re: eFa Filter sudden restarts

Posted: 29 Apr 2022 11:18
by wstemb
pdwalker wrote: 29 Apr 2022 08:21 This is not an EFA problem, but highly likely a problem with your vm.
I am not so sure about this. I can't prove it, it is still only a "feeling". The existing logs from previous restarts are always empty just before the restart, so I have to wait for the next crash.

Why I am not sure it is an infrastructure problem:

1. We have hundreds of VMs on blade servers, and this is the only one with problems like this. No alarms in vSphere.

2. We have two exceptions to the "2-6 restarts a day" rule:
  • 77 crashes: On the day when the eFa services were in an abnormal state because clamd@scan had the "Malformed Database" error and was stuck in "activating" status, mail was not relayed in either direction and only entered the mailq, and I had restarts every 5-6 minutes.
    Once the problem was resolved, the service restarted properly, and once the outbound queue (>500 at the time of the problem) was empty, the restarts went back to the usual rate (2-6 a day).
  • 0 crashes: I now have an uptime of 3 days (this never happened before). Mailq = 0 every time I look at it, and all services (except MailScanner, which restarts every day) are as old as the uptime. I did not touch anything in the infrastructure or the OS; I only eliminated the NDRs on the MS Exchange server and changed one parameter in unbound.conf. I now have journalctl -f scrolling in an ssh terminal to see in real time what is happening.
Our email and mailbox situation was a little strange and was generating a very large Postfix outbound queue full of deferred NDRs. I will not explain the reasons publicly here, but I think this situation pushed the server into some race condition or over some "edge" that was probably not anticipated in the design.

Next week, if eFa stays stable, I will revert the configuration changes step by step to the initial state, to see whether the crashes reappear and to try to find the cause in journalctl.

I am a fan of eFa, not a critic. All this is just to find and solve the real cause, which is still hidden :-)
When I find the cause, I will report it here in detail.

Re: eFa Filter sudden restarts

Posted: 01 May 2022 14:21
by shawniverson
Please keep us posted. I would like to recreate the conditions that caused this crash to improve the system.

Re: eFa Filter sudden restarts

Posted: 01 May 2022 18:47
by wstemb
shawniverson wrote: 01 May 2022 14:21 Please keep us posted. I would like to recreate the conditions that caused this crash to improve the system.
Sure. I am here and I have been watching the logs for the last 5 days: no crashes. As a side effect of looking for the cause, I saw something in the logs during the restarts after crashes. We can discuss later whether it is normal (premature start of Postfix?) or can be avoided.

Tomorrow I will begin to revert the config changes one by one. I do not like things that get repaired "for no evident reason".

Re: eFa Filter sudden restarts

Posted: 03 May 2022 05:42
by pdwalker
Very weird, and a super annoying problem. Good luck isolating the cause, and I hope you find a reason.

Re: eFa Filter sudden restarts

Posted: 10 May 2022 08:27
by wstemb
No crashes since Apr. 26. The system and the eFa filter are stable and doing their work.
I tried reverting the MS Exchange NDR settings and had >180 mails in the Postfix outbound queue, with no problem.

In the next few days I will revert the unbound config changes to check further.

Last week unbound proved to be extremely unstable as a service during external routing / connectivity problems (ISP network, out of my control). During this routing / connectivity issue the unbound service always dropped to failed status at the first DNS request from the system.

Re: eFa Filter sudden restarts

Posted: 10 May 2022 10:00
by pdwalker
Are DNS requests being blocked by a firewall somewhere?

Re: eFa Filter sudden restarts

Posted: 18 May 2022 08:05
by wstemb
pdwalker wrote: 10 May 2022 10:00 Are DNS requests being blocked by a firewall somewhere?
No, the firewall and the firewall logs are under my control. The firewall is open for DNS and the logs are "green".
The unbound issue mentioned in my last post was caused by an error in the routing tables of the ISP's router network. After I notified them and they corrected it, everything is working again.
No new system crashes since April 26.

Re: eFa Filter sudden restarts

Posted: 06 Jun 2022 07:27
by wstemb
After days of uptime, the system crashed again, only once, on 26 May. There have been no new crashes since then, but I found (again) an error in the unbound log after the start:

[root@mail crash]# systemctl status unbound
● unbound.service - Unbound recursive Domain Name Server
   Loaded: loaded (/usr/lib/systemd/system/unbound.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/unbound.service.d
           └─override.conf
   Active: active (running) since Thu 2022-05-26 13:07:00 CEST; 1 weeks 3 days ago
  Process: 2371 ExecStartPre=/usr/sbin/unbound-anchor -a /var/lib/unbound/root.key -c /etc/unbound/icannbundle.pem -f /etc/resolv.conf -R (code=exited, status=0/SUCCESS)
  Process: 2357 ExecStartPre=/usr/sbin/unbound-checkconf (code=exited, status=0/SUCCESS)
 Main PID: 5434 (unbound)
    Tasks: 4 (limit: 101059)
   Memory: 70.8M
   CGroup: /system.slice/unbound.service
           └─5434 /usr/sbin/unbound -d

May 26 13:06:25 mail.uljanik.hr systemd[1]: Starting Unbound recursive Domain Name Server...
May 26 13:06:25 mail.uljanik.hr unbound-checkconf[2357]: unbound-checkconf: no errors in /etc/unbound/unbound.conf
May 26 13:07:00 mail.uljanik.hr systemd[1]: Started Unbound recursive Domain Name Server.
May 26 13:07:00 mail.uljanik.hr unbound[5434]: [5434:0] notice: init module 0: iterator
May 26 13:07:00 mail.uljanik.hr unbound[5434]: [5434:0] info: start of service (unbound 1.11.0).
May 26 13:07:00 mail.uljanik.hr unbound[5434]: [5434:0] error: .: failed lookup, cannot probe to master k.root-servers.net


I saw a similar error before, after some of the crashes. This time I was not near the server when the crash happened, and the last 6 minutes of the log just before the restart are missing, so I can't be sure.

Re: eFa Filter sudden restarts

Posted: 06 Jun 2022 12:16
by Aryfir
That's weird. Is K.ROOT-SERVERS.NET the only master your unbound cannot probe? What about A, B, C ... M.ROOT-SERVERS.NET?

Try to ping K.ROOT-SERVERS.NET on IPv4 193.0.14.129 or IPv6 2001:7fd::1, and see if you can reach that.

And also try to update your unbound root.hints
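
A small sketch of that check for all thirteen root servers instead of only K. It assumes dig (from bind-utils) is installed, and the function names are hypothetical. Because the server names are resolved through the system resolver, which on this box is unbound itself, a failure can also mean the name could not be resolved at all, so querying K by its address 193.0.14.129 remains the cleaner test.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Prints the 13 root server names, a.root-servers.net through m.root-servers.net.
root_servers() {
    for l in a b c d e f g h i j k l m; do
        echo "$l.root-servers.net"
    done
}

# Sends one non-recursive NS query for the root zone to each server
# and reports whether an answer came back within 2 seconds.
probe_root_servers() {
    root_servers | while read -r host; do
        if dig +time=2 +tries=1 +norecurse "@$host" . NS >/dev/null 2>&1; then
            echo "ok   $host"
        else
            echo "FAIL $host"
        fi
    done
}
```

Calling probe_root_servers prints one line per server. If only K fails, the problem is likely on the path to that one server; if all fail, outbound DNS in general is the suspect.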