Bug 12048 - OnAccessExtraScanning leaks memory and fails in ERROR: ScanOnAccess: Unable to kick off extra scanning.
Status: RESOLVED FIXED
Product: ClamAV
Classification: ClamAV
Component: clamd
Version: 0.100.0-beta
Hardware: x86_64 GNU/Linux
Importance: P3 normal
Target Milestone: ---
Assigned To: Mickey Sola
Duplicates: 12173, 12189
Reported: 2018-03-06 21:15 EST by Hiroya Ito
Modified: 2021-11-02 13:46 EDT
CC: 5 users

Description Hiroya Ito 2018-03-06 21:15:15 EST
When I was testing OnAccessExtraScanning on storage nodes under heavy load, I noticed that it causes clamd to leak memory.

Every time a sub-directory is created under a path that OnAccessExtraScanning is watching, clamd creates new anonymous memory mappings and its VSZ/RSS keeps growing, until it eventually fails with `ERROR: ScanOnAccess: Unable to kick off extra scanning`.

# Version

 * clamav-0.99.3-4.el7.x86_64
 * 0.100.0-beta 
    * dev/0.100.x https://github.com/Cisco-Talos/clamav-devel 
    * commit: 99eadf7a9ad351210165312362d1f32b77c6f857

# Environment

CentOS7.4 on Vagrant + macOS 10.13.3

 * kernel-3.10.0-693.17.1.el7.x86_64
 * glibc-2.17-196.el7_4.2.x86_64
 
# How to reproduce

## STEP 1. Run a clamd process with the following configuration.

/etc/clamd.d/onaccess_scan.conf

```
LogRotate yes
DatabaseDirectory /var/lib/clamav
LocalSocket /var/run/clamd.socket
LogFile /var/log/clamav/clamd.log
User root
ScanOnAccess yes
OnAccessIncludePath /var/www/
OnAccessExtraScanning yes
```

```
[root@localhost ~]# sudo mkdir /var/www
[root@localhost ~]# sudo /usr/sbin/clamd --debug -c /etc/clamd.d/onaccess_scan.conf
```

## STEP 2. Create sub-directories under the directory that OnAccessExtraScanning is observing.

```
[root@localhost ~]# mkdir -v /var/www/{1..10000}
```

## STEP 3. Check the resource usage of clamd

Before STEP 2

```
[root@localhost ~]# ps auxf | grep -e clamd -e COMMAND 
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
 --color=auto -e clamd -e COMMAND
root      4414  0.0 28.8 867184 529804 ?       Ssl  15:49   0:00 /usr/sbin/clamd --debug -c /etc/clamd.d/onaccess_scan.conf

[root@localhost ~]# pmap $(pgrep clamd) | grep anon | wc -l
34
```

After STEP 2

```
[root@localhost ~]# ps auxf | grep -e clamd -e COMMAND
 --color=auto -e clamd -e COMMAND
root      4414  1.6 38.4 83621812 704700 ?     Ssl  15:49   0:02 /usr/sbin/clamd --debug -c /etc/clamd.d/onaccess_scan.conf

[root@localhost ~]# pmap $(pgrep clamd) | grep anon | wc -l
20059

[root@localhost ~]$ sudo pmap $(pgrep clamd) | head -30
4414:   /usr/sbin/clamd --debug -c /etc/clamd.d/onaccess_scan.conf
00005631db8aa000    180K r-x-- clamd
00005631dbad6000     28K r---- clamd
00005631dbadd000      4K rw--- clamd
00005631dc3ec000   8856K rw---   [ anon ]
00007f8aaf9c4000      4K -----   [ anon ]
00007f8aaf9c5000   8192K rw---   [ anon ]
00007f8ab01c5000      4K -----   [ anon ]
00007f8ab01c6000   8192K rw---   [ anon ]
00007f8ab09c6000      4K -----   [ anon ]
00007f8ab09c7000   8192K rw---   [ anon ]
00007f8ab11c7000      4K -----   [ anon ]
00007f8ab11c8000   8192K rw---   [ anon ]
00007f8ab19c8000      4K -----   [ anon ]
00007f8ab19c9000   8192K rw---   [ anon ]
00007f8ab21c9000      4K -----   [ anon ]
00007f8ab21ca000   8192K rw---   [ anon ]
00007f8ab29ca000      4K -----   [ anon ]
00007f8ab29cb000   8192K rw---   [ anon ]
00007f8ab31cb000      4K -----   [ anon ]
00007f8ab31cc000   8192K rw---   [ anon ]
00007f8ab39cc000      4K -----   [ anon ]
00007f8ab39cd000   8192K rw---   [ anon ]
00007f8ab41cd000      4K -----   [ anon ]
00007f8ab41ce000   8192K rw---   [ anon ]
00007f8ab49ce000      4K -----   [ anon ]
00007f8ab49cf000   8192K rw---   [ anon ]
00007f8ab51cf000      4K -----   [ anon ]
00007f8ab51d0000   8192K rw---   [ anon ]
00007f8ab59d0000      4K -----   [ anon ]
```

## STEP 4. Create more directories

```
[root@localhost ~]# mkdir -v /var/www/{10001..20000}
[root@localhost ~]# mkdir -v /var/www/{20001..30000}
[root@localhost ~]# mkdir -v /var/www/{30001..35000}

```

Clamd logs `ERROR: ScanOnAccess: Unable to kick off extra scanning.` after approximately 32,600-32,700 directories have been created.

```
ScanOnAccess: Performing additional scanning on directory '/var/www/32678'
ScanOnAccess: Performing additional scanning on directory '/var/www/32679'
ScanOnAccess: Performing additional scanning on directory '/var/www/32681'
ScanOnAccess: Performing additional scanning on directory '/var/www/32680'
ScanOnAccess: Performing additional scanning on directory '/var/www/32682'
ScanOnAccess: Performing additional scanning on directory '/var/www/32683'
ScanOnAccess: Performing additional scanning on directory '/var/www/32684'
ERROR: ScanOnAccess: Unable to kick off extra scanning.
ERROR: ScanOnAccess: Unable to kick off extra scanning.
ERROR: ScanOnAccess: Unable to kick off extra scanning.
ERROR: ScanOnAccess: Unable to kick off extra scanning.
ERROR: ScanOnAccess: Unable to kick off extra scanning.
ERROR: ScanOnAccess: Unable to kick off extra scanning.
ERROR: ScanOnAccess: Unable to kick off extra scanning.
```

```
[root@localhost ~]# pmap $(pgrep clamd) | grep anon | wc -l
65440

[root@localhost ~]# pmap $(pgrep clamd) | wc -l
65533
```

----

As far as I can tell from my investigation, the use of `pthread_attr_setdetachstate(..., PTHREAD_CREATE_JOINABLE)` in onas_ddd_handle_extra_scanning() is the likely root cause.

```
static void onas_ddd_handle_extra_scanning(struct ddd_thrarg *tharg, const char *pathname, int options) {

	struct scth_thrarg *scth_tharg = NULL;
	pthread_attr_t scth_attr;
	pthread_t scth_pid = 0;

	do {
		if (pthread_attr_init(&scth_attr)) break;
		pthread_attr_setdetachstate(&scth_attr, PTHREAD_CREATE_JOINABLE); // <<<<< here >>>>>

		if (!(scth_tharg = (struct scth_thrarg *) malloc(sizeof(struct scth_thrarg)))) break;

		scth_tharg->options = options;
		scth_tharg->opts = tharg->opts;
		scth_tharg->pathname = strdup(pathname);

		if (!pthread_create(&scth_pid, &scth_attr, onas_scan_th, scth_tharg)) break;

		free(scth_tharg);
		scth_tharg = NULL;
	} while(0);
	if (!scth_tharg) logg("!ScanOnAccess: Unable to kick off extra scanning.\n");

	return;
}
```

I traced the clamd process with perf trace and observed that minor faults were repeatedly raised in pthread_create and __clone.

```
[root@localhost ~]# perf trace -p $(pgrep clamd) --pf min 2>&1 | grep minfault -C 10
  1793.811 ( 0.005 ms): clamd/18340 fchdir(fd: 343                                                        ) = 0
  1793.821 ( 0.006 ms): clamd/18340 getdents(fd: 343, dirent: 0x7f5b9409de70, count: 32768                ) = 48
  1793.832 ( 0.005 ms): clamd/18340 getdents(fd: 343, dirent: 0x7f5b9409de70, count: 32768                ) = 0
  1793.842 ( 0.005 ms): clamd/18340 close(fd: 343                                                         ) = 0
  1793.854 ( 0.005 ms): clamd/18340 fchdir(fd: 342                                                        ) = 0
  1793.865 ( 0.005 ms): clamd/18340 fchdir(fd: 342                                                        ) = 0
  1793.876 ( 0.005 ms): clamd/18340 close(fd: 342                                                         ) = 0
  1793.888 ( 0.009 ms): clamd/18340 inotify_add_watch(fd: 10<anon_inode:inotify>, pathname: 0x9409c8c0, mask: 16778176) = 723
  1793.903 ( 0.008 ms): clamd/18340 fanotify_mark(fanotify_fd: 9<anon_inode:[fanotify]>, flags: 1, mask: 134217785, dfd: -100, pathname: 0x94021c20) = 0
  1793.918 ( 0.012 ms): clamd/18340 mmap(len: 8392704, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS|STACK, fd: -1) = 0x7f5aeb6bf000
18446738462389.918 ( 0.000 ms): clamd/18340 minfault [pthread_create@@GLIBC_2.2.5+0x8c4] => //anon@0x7f5aebebfd90 (d.)
  1794.198 ( 0.013 ms): clamd/18340 mprotect(start: 0x7f5aeb6bf000, len: 4096                             ) = 0
18446738462389.918 ( 0.000 ms): clamd/18340 minfault [__clone+0x15] => //anon@0x7f5aebebeeb8 (d.)
  1794.225 ( 0.015 ms): clamd/18340 clone(arg0: 4001536, arg1: 140028481892016, arg2: 140028481894864, arg3: 140028481894864, arg4: 140028481894144, arg5: 140028481894144) = 20122 (clamd)
  1794.247 ( 0.006 ms): clamd/18340 read(fd: 10<anon_inode:inotify>, buf: 0x7f5ba31ccdd0, count: 4096     ) = -1 EAGAIN Resource temporarily unavailable
  1794.258 (15.394 ms): clamd/18340 select(n: 11, inp: 0x7f5ba31ccc20                                     ) = 1
  1809.666 ( 0.007 ms): clamd/18340 read(fd: 10<anon_inode:inotify>, buf: 0x7f5ba31ccdd0, count: 4096     ) = 48
  1809.689 ( 0.008 ms): clamd/18340 stat(filename: 0x9409cf10, statbuf: 0x7f5ba31ccca0                    ) = -1 ENOENT No such file or directory
  1809.706 ( 0.005 ms): clamd/18340 inotify_rm_watch(fd: 10<anon_inode:inotify>, wd: 723                  ) = -22
  1809.717 ( 0.005 ms): clamd/18340 fanotify_mark(fanotify_fd: 9<anon_inode:[fanotify]>, flags: 2, dfd: -100, pathname: 0x94021c20) = -22
  1809.729 ( 0.006 ms): clamd/18340 read(fd: 10<anon_inode:inotify>, buf: 0x7f5ba31ccdd0, count: 4096     ) = -1 EAGAIN Resource temporarily unavailable
```

When I substituted PTHREAD_CREATE_DETACHED for PTHREAD_CREATE_JOINABLE and repeated STEP 1 - STEP 4, the memory leak and errors were mitigated.
Comment 1 Mickey Sola 2018-03-08 12:51:08 EST
Good catch. Thanks for your work here.

Definitely correct to kick off the ExtraScanning thread as a detached thread so it isn't waiting for a join to release resources.

This change will be included in the 0.100 release.
Comment 2 Mickey Sola 2018-03-09 14:40:51 EST
After applying the suggested changes to the newest ExtraScanning code I ran into some issues with non-terminating threads.

The current plan is to continue using joinable threads and keep a table of thread IDs and completion flags, which will be looped through each time an inotify event is consumed so that finished threads can be cleaned up accordingly.

This will take more time than we had planned for 0.100.0 and as such we will have to push this to the 0.100.1 bugfix release.
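
For illustration only, the reaping scheme described above might look something like the following sketch (hypothetical names such as `spawn_tracked` and `reap_finished`; not the actual ClamAV implementation):

```c
/* Hypothetical sketch of the bookkeeping described in this comment:
 * scan threads stay joinable, each records completion in a shared
 * table, and the event loop reaps (joins) finished threads on every
 * consumed inotify event.  Illustrative only, not ClamAV code. */
#include <pthread.h>

#define MAX_THREADS 64

struct thr_slot {
    pthread_t tid;
    int in_use;   /* slot holds a live or not-yet-reaped thread */
    int done;     /* set by the worker just before it returns */
};

static struct thr_slot slots[MAX_THREADS];
static pthread_mutex_t slots_lock = PTHREAD_MUTEX_INITIALIZER;

static void *scan_worker(void *arg)
{
    struct thr_slot *slot = arg;
    /* ... perform the extra scan ... */
    pthread_mutex_lock(&slots_lock);
    slot->done = 1;
    pthread_mutex_unlock(&slots_lock);
    return NULL;
}

/* Called on every consumed inotify event: join finished threads so
 * their stacks are released and their slots can be reused. */
int reap_finished(void)
{
    int reaped = 0;
    pthread_mutex_lock(&slots_lock);
    for (int i = 0; i < MAX_THREADS; i++) {
        if (slots[i].in_use && slots[i].done) {
            pthread_t tid = slots[i].tid;
            slots[i].in_use = 0;
            slots[i].done = 0;
            pthread_mutex_unlock(&slots_lock);
            pthread_join(tid, NULL);  /* releases thread resources */
            reaped++;
            pthread_mutex_lock(&slots_lock);
        }
    }
    pthread_mutex_unlock(&slots_lock);
    return reaped;
}

/* Returns the slot index used, or -1 if the table is full (caller
 * should reap first) or thread creation failed. */
int spawn_tracked(void)
{
    pthread_mutex_lock(&slots_lock);
    for (int i = 0; i < MAX_THREADS; i++) {
        if (!slots[i].in_use) {
            slots[i].in_use = 1;
            slots[i].done = 0;
            if (pthread_create(&slots[i].tid, NULL, scan_worker, &slots[i]) != 0) {
                slots[i].in_use = 0;
                pthread_mutex_unlock(&slots_lock);
                return -1;
            }
            pthread_mutex_unlock(&slots_lock);
            return i;
        }
    }
    pthread_mutex_unlock(&slots_lock);
    return -1;
}
```

The trade-off versus detached threads is that the event loop retains a handle on every worker, so it can bound concurrency and avoid the non-terminating-thread issues mentioned above, at the cost of the extra table walk per event.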
Comment 3 Hiroya Ito 2018-03-09 21:30:41 EST
> I ran into some issues with non-terminating threads.

Oh, I didn't expect such a case. Thanks for the update. I'll be waiting for 0.100.1 :)
Comment 4 Micah Snyder 2018-09-18 10:47:16 EDT
Due to time constraints in the development process for version 0.101, OnAccessExtraScanning will be disabled in the upcoming 0.100.1 patch release and the 0.101.0 feature release.  When we're able to dedicate the time to fix the issue, we'll include it in the next patch or feature release.
Comment 5 Micah Snyder 2018-09-18 12:33:33 EDT
(In reply to Micah Snyder from comment #4)
> Due to time constraints in the development process for version 0.101,
> OnAccessExtraScanning will be disabled in the upcoming 0.100.1 patch release
> and the 0.101.0 feature release.  When we're able to dedicate the time to
> fix the issue, we'll include it in the next patch or feature release.

Apologies, I meant to say 0.100.2.
Comment 6 Micah Snyder 2018-09-26 07:30:09 EDT
*** Bug 12189 has been marked as a duplicate of this bug. ***
Comment 7 Micah Snyder 2018-10-08 17:14:58 EDT
*** Bug 12173 has been marked as a duplicate of this bug. ***
Comment 8 Micah Snyder 2021-02-03 15:08:05 EST
(In reply to Micah Snyder from comment #4)
> Due to time constraints in the development process for version 0.101,
> OnAccessExtraScanning will be disabled in the upcoming 0.100.1 patch release
> and the 0.101.0 feature release.  When we're able to dedicate the time to
> fix the issue, we'll include it in the next patch or feature release.

It looks like we forgot to update this ticket after 0.102 was released with the new ClamOnAcc tool. The thread memory leak issues with on-access & extra-scanning in clamd are resolved in clamonacc.

As noted above, OnAccessExtraScanning was disabled here:

- https://github.com/Cisco-Talos/clamav-devel/blob/clamav-0.101.5/clamd/onaccess_fan.c#L267
- https://github.com/Cisco-Talos/clamav-devel/blob/clamav-0.101.5/clamd/onaccess_ddd.c#L388
- https://github.com/Cisco-Talos/clamav-devel/blob/clamav-0.101.5/clamd/onaccess_ddd.c#L483
- https://github.com/Cisco-Talos/clamav-devel/blob/clamav-0.101.5/clamd/onaccess_ddd.c#L515

The equivalent code for this feature in ClamOnAcc in 0.102+ can be found here where it is now enabled:
- https://github.com/Cisco-Talos/clamav-devel/blob/clamav-0.102.0/clamonacc/inotif/inotif.c#L563
- https://github.com/Cisco-Talos/clamav-devel/blob/clamav-0.102.0/clamonacc/inotif/inotif.c#L670
- https://github.com/Cisco-Talos/clamav-devel/blob/clamav-0.102.0/clamonacc/inotif/inotif.c#L698

Apologies for forgetting about this ticket.

-Micah