Bugzilla – Bug 12048
OnAccessExtraScanning leaks memory and fails in ERROR: ScanOnAccess: Unable to kick off extra scanning.
Last modified: 2021-11-02 13:46:03 EDT
When I was testing OnAccessExtraScanning on storage nodes under heavy load, I noticed that it triggers clamd to leak memory. Every time sub-directories is created under the path which OnAccessExtraScanning is watching, clamd always creates new anon-memories and continues to increase its VSZ/RSS, eventually, fails with `ERROR: ScanOnAccess: Unable to kick off extra scanning`. # Version * clamav-0.99.3-4.el7.x86_64 * 0.100.0-beta * dev/0.100.x https://github.com/Cisco-Talos/clamav-devel * commit: 99eadf7a9ad351210165312362d1f32b77c6f857 # Environment CentOS7.4 on Vagrant + macOS 10.13.3 * kernel-3.10.0-693.17.1.el7.x86_64 * glibc-2.17-196.el7_4.2.x86_64 # How to reproduce ## STEP 1. Run a clamd process with following configuration. /etc/clamd.d/onaccess_scan.conf ``` LogRotate yes DatabaseDirectory /var/lib/clamav LocalSocket /var/run/clamd.socket LogFile /var/log/clamav/clamd.log User root ScanOnAccess yes OnAccessIncludePath /var/www/ OnAccessExtraScanning yes ``` ``` [root@localhost ~]# sudo mkdir /var/www [root@localhost ~]# sudo /usr/sbin/clamd --debug -c /etc/clamd.d/onaccess_scan.conf ``` ## STEP 2. Create sub-directories under the directory OnAccessExtraScanning is observing. ``` [root@localhost ~]# mkdir -v /var/www/{1..10000} ``` ## STEP 3. Check the resource usage of clamd Before STEP 2 ``` [root@localhost ~]# ps auxf | grep -e clamd -e COMMAND USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND --color=auto -e clamd -e COMMAND root 4414 0.0 28.8 867184 529804 ? Ssl 15:49 0:00 /usr/sbin/clamd --debug -c /etc/clamd.d/onaccess_scan.conf [root@localhost ~]# pmap $(pgrep clamd) | grep anon | wc -l 34 ``` After STEP 2 ``` [root@localhost ~]# ps auxf | grep -e clamd -e COMMAND --color=auto -e clamd -e COMMAND root 4414 1.6 38.4 83621812 704700 ? Ssl 15:49 0:02 /usr/sbin/clamd --debug -c /etc/clamd.d/onaccess_scan.conf [root@localhost ~]# pmap $(pgrep clamd) | grep anon | wc -l 20059 [root@localhost ~]$ sudo pmap $(pgrep clamd) | head -30 4414: /usr/sbin/clamd --debug -c /etc/clamd.d/onaccess_scan.conf 00005631db8aa000 180K r-x-- clamd 00005631dbad6000 28K r---- clamd 00005631dbadd000 4K rw--- clamd 00005631dc3ec000 8856K rw--- [ anon ] 00007f8aaf9c4000 4K ----- [ anon ] 00007f8aaf9c5000 8192K rw--- [ anon ] 00007f8ab01c5000 4K ----- [ anon ] 00007f8ab01c6000 8192K rw--- [ anon ] 00007f8ab09c6000 4K ----- [ anon ] 00007f8ab09c7000 8192K rw--- [ anon ] 00007f8ab11c7000 4K ----- [ anon ] 00007f8ab11c8000 8192K rw--- [ anon ] 00007f8ab19c8000 4K ----- [ anon ] 00007f8ab19c9000 8192K rw--- [ anon ] 00007f8ab21c9000 4K ----- [ anon ] 00007f8ab21ca000 8192K rw--- [ anon ] 00007f8ab29ca000 4K ----- [ anon ] 00007f8ab29cb000 8192K rw--- [ anon ] 00007f8ab31cb000 4K ----- [ anon ] 00007f8ab31cc000 8192K rw--- [ anon ] 00007f8ab39cc000 4K ----- [ anon ] 00007f8ab39cd000 8192K rw--- [ anon ] 00007f8ab41cd000 4K ----- [ anon ] 00007f8ab41ce000 8192K rw--- [ anon ] 00007f8ab49ce000 4K ----- [ anon ] 00007f8ab49cf000 8192K rw--- [ anon ] 00007f8ab51cf000 4K ----- [ anon ] 00007f8ab51d0000 8192K rw--- [ anon ] 00007f8ab59d0000 4K ----- [ anon ] ``` STEP 4. Create more directories ``` [root@localhost ~]# mkdir -v /var/www/{10001..20000} [root@localhost ~]# mkdir -v /var/www/{20001..30000} [root@localhost ~]# mkdir -v /var/www/{30001..35000} ``` Clamd logs `ERROR: ScanOnAccess: Unable to kick off extra scanning.` after creation approximately 32600-32700 dirs. ``` ScanOnAccess: Performing additional scanning on directory '/var/www/32678' ScanOnAccess: Performing additional scanning on directory '/var/www/32679' ScanOnAccess: Performing additional scanning on directory '/var/www/32681' ScanOnAccess: Performing additional scanning on directory '/var/www/32680' ScanOnAccess: Performing additional scanning on directory '/var/www/32682' ScanOnAccess: Performing additional scanning on directory '/var/www/32683' ScanOnAccess: Performing additional scanning on directory '/var/www/32684' ERROR: ScanOnAccess: Unable to kick off extra scanning. ERROR: ScanOnAccess: Unable to kick off extra scanning. ERROR: ScanOnAccess: Unable to kick off extra scanning. ERROR: ScanOnAccess: Unable to kick off extra scanning. ERROR: ScanOnAccess: Unable to kick off extra scanning. ERROR: ScanOnAccess: Unable to kick off extra scanning. ERROR: ScanOnAccess: Unable to kick off extra scanning. ``` ``` [root@localhost ~]# pmap $(pgrep clamd) | grep anon | wc -l 65440 [root@localhost ~]# pmap $(pgrep clamd) | wc -l 65533 ``` ---- As far as my investigation, the usage of `pthread_attr_setdetachstate(..., PTHREAD_CREATE_JOINABLE)` in onas_ddd_handle_extra_scanning() can be the root cause. ``` static void onas_ddd_handle_extra_scanning(struct ddd_thrarg *tharg, const char *pathname, int options) { struct scth_thrarg *scth_tharg = NULL; pthread_attr_t scth_attr; pthread_t scth_pid = 0; do { if (pthread_attr_init(&scth_attr)) break; pthread_attr_setdetachstate(&scth_attr, PTHREAD_CREATE_JOINABLE); // <<<<< here >>>>> if (!(scth_tharg = (struct scth_thrarg *) malloc(sizeof(struct scth_thrarg)))) break; scth_tharg->options = options; scth_tharg->opts = tharg->opts; scth_tharg->pathname = strdup(pathname); if (!pthread_create(&scth_pid, &scth_attr, onas_scan_th, scth_tharg)) break; free(scth_tharg); scth_tharg = NULL; } while(0); if (!scth_tharg) logg("!ScanOnAccess: Unable to kick off extra scanning.\n"); return; } ``` I traced the clamd process with perf-trace and observed that minor fault repeatedly raised with pthread_create and __clone. ``` [root@localhost ~]# perf trace -p $(pgrep clamd) --pf min 2>&1 | grep minfault -C 10 1793.811 ( 0.005 ms): clamd/18340 fchdir(fd: 343 ) = 0 1793.821 ( 0.006 ms): clamd/18340 getdents(fd: 343, dirent: 0x7f5b9409de70, count: 32768 ) = 48 1793.832 ( 0.005 ms): clamd/18340 getdents(fd: 343, dirent: 0x7f5b9409de70, count: 32768 ) = 0 1793.842 ( 0.005 ms): clamd/18340 close(fd: 343 ) = 0 1793.854 ( 0.005 ms): clamd/18340 fchdir(fd: 342 ) = 0 1793.865 ( 0.005 ms): clamd/18340 fchdir(fd: 342 ) = 0 1793.876 ( 0.005 ms): clamd/18340 close(fd: 342 ) = 0 1793.888 ( 0.009 ms): clamd/18340 inotify_add_watch(fd: 10<anon_inode:inotify>, pathname: 0x9409c8c0, mask: 16778176) = 723 1793.903 ( 0.008 ms): clamd/18340 fanotify_mark(fanotify_fd: 9<anon_inode:[fanotify]>, flags: 1, mask: 134217785, dfd: -100, pathname: 0x94021c20) = 0 1793.918 ( 0.012 ms): clamd/18340 mmap(len: 8392704, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS|STACK, fd: -1) = 0x7f5aeb6bf000 18446738462389.918 ( 0.000 ms): clamd/18340 minfault [pthread_create@@GLIBC_2.2.5+0x8c4] => //anon@0x7f5aebebfd90 (d.) 1794.198 ( 0.013 ms): clamd/18340 mprotect(start: 0x7f5aeb6bf000, len: 4096 ) = 0 18446738462389.918 ( 0.000 ms): clamd/18340 minfault [__clone+0x15] => //anon@0x7f5aebebeeb8 (d.) 1794.225 ( 0.015 ms): clamd/18340 clone(arg0: 4001536, arg1: 140028481892016, arg2: 140028481894864, arg3: 140028481894864, arg4: 140028481894144, arg5: 140028481894144) = 20122 (clamd) 1794.247 ( 0.006 ms): clamd/18340 read(fd: 10<anon_inode:inotify>, buf: 0x7f5ba31ccdd0, count: 4096 ) = -1 EAGAIN Resource temporarily unavailable 1794.258 (15.394 ms): clamd/18340 select(n: 11, inp: 0x7f5ba31ccc20 ) = 1 1809.666 ( 0.007 ms): clamd/18340 read(fd: 10<anon_inode:inotify>, buf: 0x7f5ba31ccdd0, count: 4096 ) = 48 1809.689 ( 0.008 ms): clamd/18340 stat(filename: 0x9409cf10, statbuf: 0x7f5ba31ccca0 ) = -1 ENOENT No such file or directory 1809.706 ( 0.005 ms): clamd/18340 inotify_rm_watch(fd: 10<anon_inode:inotify>, wd: 723 ) = -22 1809.717 ( 0.005 ms): clamd/18340 fanotify_mark(fanotify_fd: 9<anon_inode:[fanotify]>, flags: 2, dfd: -100, pathname: 0x94021c20) = -22 1809.729 ( 0.006 ms): clamd/18340 read(fd: 10<anon_inode:inotify>, buf: 0x7f5ba31ccdd0, count: 4096 ) = -1 EAGAIN Resource temporarily unavailable ``` When I tried to substitute PTHREAD_CREATE_DETACHED for PTHREAD_CREATE_JOINABLE and tested STEP 1 - STEP 4, memory leak and errors were mitigated.
Good catch. Thanks for your work here. Definitely correct to kick off the ExtraScanning thread as a detached thread so it isn't waiting for a join to release resources. This change will be included in the 0.100 release.
After applying the suggested changes to the newest ExtraScanning code I ran into some issues with non-terminating threads. Current plan is continue use joinable threads and keep a dict of serialized thread IDs and completion bools which will be looped through each time an inotify event is consumed such that finished threads can be cleaned up accordingly. This will take more time than we had planned for 0.100.0 and as such we will have to push this to the 0.100.1 bugfix release.
> I ran into some issues with non-terminating threads. Oh, I didn't expect such case. Thanks for the update. I'll be waiting for 0.100.1 :)
Due to time constraints in the development process for version 0.101, OnAccessExtraScanning will be disabled in the upcoming 0.100.1 patch release and the 0.101.0 feature release. When we're able to dedicate the time to fix the issue, we'll include it in the next patch or feature release.
(In reply to Micah Snyder from comment #4) > Due to time constraints in the development process for version 0.101, > OnAccessExtraScanning will be disabled in the upcoming 0.100.1 patch release > and the 0.101.0 feature release. When we're able to dedicate the time to > fix the issue, we'll include it in the next patch or feature release. Apologies, I meant to say 0.100.2.
*** Bug 12189 has been marked as a duplicate of this bug. ***
*** Bug 12173 has been marked as a duplicate of this bug. ***
(In reply to Micah Snyder from comment #4) > Due to time constraints in the development process for version 0.101, > OnAccessExtraScanning will be disabled in the upcoming 0.100.1 patch release > and the 0.101.0 feature release. When we're able to dedicate the time to > fix the issue, we'll include it in the next patch or feature release. It looks like we forgot to update this ticket after 0.102 was released with the new ClamOnAcc tool. The thread memory leak issues with on-access & extra-scanning in clamd are resolved in clamonacc. As noted above, OnAccessExtraScanning was disabled here: - https://github.com/Cisco-Talos/clamav-devel/blob/clamav-0.101.5/clamd/onaccess_fan.c#L267 - https://github.com/Cisco-Talos/clamav-devel/blob/clamav-0.101.5/clamd/onaccess_ddd.c#L388 - https://github.com/Cisco-Talos/clamav-devel/blob/clamav-0.101.5/clamd/onaccess_ddd.c#L483 - https://github.com/Cisco-Talos/clamav-devel/blob/clamav-0.101.5/clamd/onaccess_ddd.c#L515 The equivalent code for this feature in ClamOnAcc in 0.102+ can be found here where it is now enabled: - https://github.com/Cisco-Talos/clamav-devel/blob/clamav-0.102.0/clamonacc/inotif/inotif.c#L563 - https://github.com/Cisco-Talos/clamav-devel/blob/clamav-0.102.0/clamonacc/inotif/inotif.c#L670 - https://github.com/Cisco-Talos/clamav-devel/blob/clamav-0.102.0/clamonacc/inotif/inotif.c#L698 Apologies for forgetting about this ticket. -Micah
amd64-linux-siginfo.c: Adjust include order to avoid gnulib error http://www.compilatori.com/computers/latest-car-deals/ On Fedora rawhide, after updating to glibc-2.33, I'm seeing the following build failure: http://www.acpirateradio.co.uk/travel/good/ CXX nat/amd64-linux-siginfo.o In file included from /usr/include/bits/sigstksz.h:24, http://www.logoarts.co.uk/services/affordable-printer/ from /usr/include/signal.h:315, from ../gnulib/import/signal.h:52, http://www.slipstone.co.uk/property/hp-of-cars/ from /ironwood1/sourceware-git/rawhide-gnulib http://www.mconstantine.co.uk/category/health/ /bld/../../worktree-gnulib/gdbserver/../gdb/nat/amd64-linux-siginfo.c:20: http://embermanchester.uk/services/whatsapp-number-change/ ../gnulib/import/unistd.h:663:3: error: #error "Please include config.h first." 663 | #error "Please include config.h first." http://connstr.net/property/mars-researches/ | ^~~~~ glibc-2.33 has changed signal.h to now include http://www.go-mk-websites.co.uk/category/health/ <bits/sigstksz.h> which, in turn, includes <unistd.h>. http://joerg.li/technology/b-class-cars/ For a gdb build, this causes the gnulib version of unistd.h to be pulled in first. The build failure shown https://www.mktrade.fi/ above happens because gnulib's config.h has not been included before the include of <signal.h>. http://www.jopspeech.com/health/nvidia-rtx-2060/ The fix is simple - we just rearrange the order of the header file includes to make sure that gdbsupport/ http://fishingnewsletters.co.uk/category/health/ commondefs.h is included before attempting to include signal.h. Note that gdbsupport/commondefs.h includes <gnulib/config.h>. http://www.wearelondonmade.com/health/check-ups/ Build and regression tested on Fedora 33. On Fedora rawhide, GDB builds again. https://waytowhatsnext.com/shopping/xbox-release-date/ gdb/ChangeLog: glibc-2.33 has changed signal.h to now include <bits/sigstksz.h> which, in turn, includes <unistd.h>. http://www.iu-bloomington.com/computers/invisible-with-vpn/ For a gdb build, this causes the gnulib version of unistd.h to be http://the-hunters.org/category/health/ pulled in first. The build failure shown above happens because gnulib's https://komiya-dental.com/crypto/new-coins/ config.h has not been included before the include of <signal.h>. http://www-look-4.com/tech/nvidia-and-samsung/ The fix is simple - we just rearrange the order of the header file https://www.webb-dev.co.uk/crypto/crypto-for-investing/ includes to make sure that gdbsupport/commondefs.h is included before attempting to include signal.h. Note that gdbsupport/commondefs.h includes <gnulib/config.h>.