Bug 1020077

Summary: fstrim.service should not run fstrim -a
Product: [openSUSE] openSUSE Tumbleweed Reporter: Andreas Schneider <asn>
Component: BasesystemAssignee: Stanislav Brabec <sbrabec>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: aschnell, sbrabec, sweet_f_a
Version: Current   
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---

Description Andreas Schneider 2017-01-16 07:41:12 UTC
The fstrim.service of the util-linux-systemd package should not run 'fstrim -a'.

If you install openSUSE Tumbleweed is chooses btrfs for the root filesystem and creates subvolumes.

For example:

/dev/sda3 on /tmp type btrfs (rw,relatime,ssd,space_cache,subvolid=264,subvol=/@/tmp)
/dev/sda3 on /var/lib/mysql type btrfs (rw,relatime,ssd,space_cache,subvolid=272,subvol=/@/var/lib/mysql)
/dev/sda3 on /var/lib/pgsql type btrfs (rw,relatime,ssd,space_cache,subvolid=274,subvol=/@/var/lib/pgsql)
/dev/sda3 on /var/lib/machines type btrfs (rw,relatime,ssd,space_cache,subvolid=269,subvol=/@/var/lib/machines)
/dev/sda3 on /opt type btrfs (rw,relatime,ssd,space_cache,subvolid=262,subvol=/@/opt)
/dev/sda3 on /usr/local type btrfs (rw,relatime,ssd,space_cache,subvolid=265,subvol=/@/usr/local)
/dev/sda3 on /var/lib/named type btrfs (rw,relatime,ssd,space_cache,subvolid=273,subvol=/@/var/lib/named)
/dev/sda3 on /var/log type btrfs (rw,relatime,ssd,space_cache,subvolid=275,subvol=/@/var/log)
/dev/sda3 on /var/tmp type btrfs (rw,relatime,ssd,space_cache,subvolid=278,subvol=/@/var/tmp)
/dev/sda3 on /var/cache type btrfs (rw,relatime,ssd,space_cache,subvolid=266,subvol=/@/var/cache)
/dev/sda3 on /srv type btrfs (rw,relatime,ssd,space_cache,subvolid=263,subvol=/@/srv)
/dev/sda3 on /var/lib/mailman type btrfs (rw,relatime,ssd,space_cache,subvolid=270,subvol=/@/var/lib/mailman)
/dev/sda3 on /.snapshots type btrfs (rw,relatime,ssd,space_cache,subvolid=258,subvol=/@/.snapshots)
/dev/sda3 on /var/crash type btrfs (rw,relatime,ssd,space_cache,subvolid=267,subvol=/@/var/crash)
/dev/sda3 on /var/lib/mariadb type btrfs (rw,relatime,ssd,space_cache,subvolid=271,subvol=/@/var/lib/mariadb)
/dev/sda3 on /boot/grub2/x86_64-efi type btrfs (rw,relatime,ssd,space_cache,subvolid=261,subvol=/@/boot/grub2/x86_64-efi)
/dev/sda3 on /var/spool type btrfs (rw,relatime,ssd,space_cache,subvolid=277,subvol=/@/var/spool)
/dev/sda3 on /var/lib/libvirt/images type btrfs (rw,relatime,ssd,space_cache,subvolid=268,subvol=/@/var/lib/libvirt/images)

From the fstrim manpage:

    "fstrim is used on a mounted filesystem"

So 'fstrim -a' trims each subvolume of /dev/sda3. This means it trims /dev/sda3 in the exmaple more than 15 times!

Please implement a more intelligent solution for trimming. A script which works on the output of 'lsblk -o MOUNTPOINT,FSTYPE' output would be more intelligent.
Comment 1 Stanislav Brabec 2017-01-17 16:46:09 UTC
fstrim -a has a de-duplication heuristics.

https://git.kernel.org/cgit/utils/util-linux/util-linux.git/commit/?id=8dd51c183396c4bf6344d08b01fb84658e0a8eb4

https://git.kernel.org/cgit/utils/util-linux/util-linux.git/commit/?id=e05a3400e79e8cba6aca8725fd53d2b9753f0cf3

But adding a printf just before the the FITRIM ioctl() really shows that it calls FITRIM for each sub-volume.

We should not invent strange workarounds, but fix fstrim -a instead.
Comment 2 Stanislav Brabec 2017-01-17 17:20:51 UTC
I just consulted it with a btrfs developer. The patch is not significant for the function of TRIM. Call of ioctl(,FITRIM,) on any path and any sub-volume has the same effect: TRIM on the whole volume. This fact simplifies the possible fix.
Comment 3 Stanislav Brabec 2017-01-17 19:56:29 UTC
Debugging the code, I found:

For duplicates to be removed, uniq_fs_source_cmp() sets eq = 1 (equal = duplicate), but then want inside mnt_table_uniq_fs() remains 1 (wanted = unique).

But the code at least partially works, because it identified match and de-duplicated:

gvfsd-fuse on /run/user/10027/gvfs type fuse.gvfsd-fuse (rw,nosuid,nodev,relatime,user_id=10027,group_id=100)
gvfsd-fuse on /var/run/user/10027/gvfs type fuse.gvfsd-fuse (rw,nosuid,nodev,relatime,user_id=10027,group_id=100)

Now I have to find a reason why it happens.
Comment 4 Andreas Schneider 2017-01-17 21:44:51 UTC
Well, fixing fstrim is even better, thanks for looking into this.
Comment 5 Ruediger Meier 2017-01-17 22:22:13 UTC
Currently I have no time to look into this. But I remember a similar issue with multiple bind mounts where "fstrim -a" should de-duplicate mount points. This was discussed on ul mailing list one or two years ago.
 
Maybe forward this bug to util-linux.
Comment 6 Stanislav Brabec 2017-01-18 14:23:35 UTC
Ruediger Meier: I am working on it.

De-duplication of bind mounts is explicitly mentioned in the first change in the comment 1. And it even mentions btrfs, but it does not work for it for some reason.
Comment 7 Stanislav Brabec 2017-01-25 16:37:40 UTC
I found the root of the problem:

sys-utils/fstrim.c: uniq_fs_source_cmp() contains this code:

const char *aroot = mnt_fs_get_root(a),
	   *broot = mnt_fs_get_root(b);
if (!aroot || !broot)
	eq = 0;
else if (strcmp(aroot, broot) != 0)
	eq = 0;

In case of btrfs, it may look?
mnt_fs_streq_srcpath() founds a match.
mnt_fs_get_root() returns:
aroot = /usr/local
broot = /var/crash

These strings don't match => return "not match"

Whole this function was added to de-duplicate bind mounts. There it makes sense:

for gvfsd-fuse and gvfsd-fuse
aroot = /
broot = /
=> match

We have similar problems in mount/libmount in past. There we attempted to detect whether a particular btrfs sub-volume is already mounted. Here we will need a similar code.
Comment 8 Stanislav Brabec 2017-01-25 17:36:43 UTC
I am unsure whether root comparison makes sense at all in fstrim -a.

Patch assuming that it does not make sense was sent to the upstream for review:

[PATCH 0/1] fstrim: de-duplication of btrfs subvolumes
http://marc.info/?l=util-linux-ng&m=148536542717051&w=2

[PATCH 1/1] fstrim: de-duplicate btrfs sub-volumes
http://marc.info/?l=util-linux-ng&m=148536543717059&w=2
Comment 9 Stanislav Brabec 2017-01-26 17:21:00 UTC
The code author Karel Zak just confirmed my suspicion that the root comparison makes no sense.

Once the patch will appear in the upstream, I'll backport it.

http://marc.info/?l=util-linux-ng&m=148543544505102&w=2
Comment 10 Andreas Schneider 2017-01-26 17:22:51 UTC
Thanks for the update. Looking forward to a fix. fstrim runs again on Monday ;-)
Comment 12 Swamp Workflow Management 2017-02-23 11:10:08 UTC
SUSE-SU-2017:0553-1: An update that solves two vulnerabilities and has 11 fixes is now available.

Category: security (important)
Bug References: 1008965,1012504,1012632,1019332,1020077,1023041,947494,966891,978993,982331,983164,987176,988361
CVE References: CVE-2016-5011,CVE-2017-2616
Sources used:
SUSE Linux Enterprise Server for SAP 12 (src):    python-libmount-2.25-24.10.3, util-linux-2.25-24.10.1, util-linux-systemd-2.25-24.10.1
SUSE Linux Enterprise Server 12-LTSS (src):    python-libmount-2.25-24.10.3, util-linux-2.25-24.10.1, util-linux-systemd-2.25-24.10.1
Comment 13 Swamp Workflow Management 2017-02-23 11:13:05 UTC
SUSE-SU-2017:0554-1: An update that solves one vulnerability and has 6 fixes is now available.

Category: security (important)
Bug References: 1008965,1012504,1012632,1019332,1020077,1020985,1023041
CVE References: CVE-2017-2616
Sources used:
SUSE Linux Enterprise Workstation Extension 12-SP2 (src):    util-linux-2.28-44.3.1
SUSE Linux Enterprise Software Development Kit 12-SP2 (src):    util-linux-2.28-44.3.1
SUSE Linux Enterprise Server for Raspberry Pi 12-SP2 (src):    python-libmount-2.28-44.3.3, util-linux-2.28-44.3.1, util-linux-systemd-2.28-44.3.3
SUSE Linux Enterprise Server 12-SP2 (src):    python-libmount-2.28-44.3.3, util-linux-2.28-44.3.1, util-linux-systemd-2.28-44.3.3
SUSE Linux Enterprise Desktop 12-SP2 (src):    python-libmount-2.28-44.3.3, util-linux-2.28-44.3.1, util-linux-systemd-2.28-44.3.3
Comment 14 Swamp Workflow Management 2017-02-23 11:14:32 UTC
SUSE-SU-2017:0555-1: An update that solves one vulnerability and has 5 fixes is now available.

Category: security (important)
Bug References: 1008965,1012504,1012632,1019332,1020077,1023041
CVE References: CVE-2017-2616
Sources used:
SUSE Linux Enterprise Workstation Extension 12-SP1 (src):    util-linux-2.25-40.1
SUSE Linux Enterprise Software Development Kit 12-SP1 (src):    util-linux-2.25-40.1
SUSE Linux Enterprise Server 12-SP1 (src):    python-libmount-2.25-40.2, util-linux-2.25-40.1, util-linux-systemd-2.25-40.1
SUSE Linux Enterprise Desktop 12-SP1 (src):    python-libmount-2.25-40.2, util-linux-2.25-40.1, util-linux-systemd-2.25-40.1
Comment 15 Swamp Workflow Management 2017-03-02 14:15:17 UTC
openSUSE-SU-2017:0589-1: An update that solves one vulnerability and has 6 fixes is now available.

Category: security (important)
Bug References: 1008965,1012504,1012632,1019332,1020077,1020985,1023041
CVE References: CVE-2017-2616
Sources used:
openSUSE Leap 42.2 (src):    python-libmount-2.28-10.2, util-linux-2.28-10.1, util-linux-systemd-2.28-10.1
Comment 16 Swamp Workflow Management 2017-03-02 14:16:44 UTC
openSUSE-SU-2017:0590-1: An update that solves one vulnerability and has 5 fixes is now available.

Category: security (important)
Bug References: 1008965,1012504,1012632,1019332,1020077,1023041
CVE References: CVE-2017-2616
Sources used:
openSUSE Leap 42.1 (src):    python-libmount-2.25-21.1, util-linux-2.25-21.1, util-linux-systemd-2.25-21.1
Comment 17 Stanislav Brabec 2017-03-02 17:56:49 UTC
Now it should be fixed in Leap 42.1, Leap 42.2, SLE12, SLE12 SP1, SLE12 SP2 and Tumbleweed.
Comment 18 Andreas Schneider 2017-03-06 07:40:50 UTC
I'm sorry, but this is not solved!

magrathea:~ # rpm -q util-linux
util-linux-2.29.1-1.1.x86_64

magrathea:~ # fstrim -av
/home/asn/storage: 368.2 GiB (395347783680 bytes) trimmed
/home: 413.7 GiB (444230541312 bytes) trimmed
/usr/local: 46.9 GiB (50348015616 bytes) trimmed
/var/crash: 46.9 GiB (50341838848 bytes) trimmed
/var/lib/mysql: 46.9 GiB (50343002112 bytes) trimmed
/srv: 46.9 GiB (50343165952 bytes) trimmed
/boot/grub2/i386-pc: 46.9 GiB (50343690240 bytes) trimmed
^C
magrathea:~ # mount
[...]
/dev/sda3 on /var/lib/named type btrfs (rw,relatime,ssd,space_cache,subvolid=273,subvol=/@/var/lib/named)
/dev/sda3 on /var/tmp type btrfs (rw,relatime,ssd,space_cache,subvolid=278,subvol=/@/var/tmp)
/dev/sda3 on /.snapshots type btrfs (rw,relatime,ssd,space_cache,subvolid=258,subvol=/@/.snapshots)
/dev/sda3 on /var/spool type btrfs (rw,relatime,ssd,space_cache,subvolid=277,subvol=/@/var/spool)
/dev/sda3 on /var/log type btrfs (rw,relatime,ssd,space_cache,subvolid=275,subvol=/@/var/log)
/dev/sda3 on /boot/grub2/i386-pc type btrfs (rw,relatime,ssd,space_cache,subvolid=260,subvol=/@/boot/grub2/i386-pc)
/dev/sda3 on /srv type btrfs (rw,relatime,ssd,space_cache,subvolid=263,subvol=/@/srv)
/dev/sda3 on /var/lib/mysql type btrfs (rw,relatime,ssd,space_cache,subvolid=272,subvol=/@/var/lib/mysql)
/dev/sda3 on /var/crash type btrfs (rw,relatime,ssd,space_cache,subvolid=267,subvol=/@/var/crash)
/dev/sda3 on /usr/local type btrfs (rw,relatime,ssd,space_cache,subvolid=265,subvol=/@/usr/local)
/dev/mapper/cr_home on /home type xfs (rw,relatime,attr2,inode64,noquota)
/dev/sdd1 on /home/asn/storage type btrfs (rw,relatime,ssd,space_cache,subvolid=257,subvol=/@)
tmpfs on /run/user/1000 type tmpfs (rw,nosuid,nodev,relatime,size=3288472k,mode=700,uid=1000,gid=100)
fusectl on /sys/fs/fuse/connections type fusectl (rw,relatime)
gvfsd-fuse on /run/user/1000/gvfs type fuse.gvfsd-fuse (rw,nosuid,nodev,relatime,user_id=1000,group_id=100)


It still calls trim on each subvolume ...
Comment 19 Stanislav Brabec 2017-03-10 17:33:31 UTC
Please wait for release of util-linux-2.29.2. util-linux-2.29.1 does not have the fix yet. The update is on its way:

https://build.opensuse.org/package/show/Base:System/util-linux
(Feel free to test the package from there.)
Comment 20 Stanislav Brabec 2017-06-12 14:27:13 UTC
I have a similar setup, and FITRIM is called only twice in Tumbleweed with util-linux-2.29.2 (once for /, once for /home), so I assume that it is fixed.