Bugzilla – Bug 1017461
btrfs balance renders system unresponsive and eventually even kills WiFi when quota is enabled
Last modified: 2019-08-15 13:15:53 UTC
I installed openSUSE Leap 42.2 with btrfs as root. Now, performing a btrfs balance or a snapper cleanup takes "ages" while there is no or little disk activity, but btrfs or btrfs-transaction constantly hogs one CPU. There is plenty of unallocated space (28 out of 40 GiB). The system becomes very unresponsive and even loses its WiFi connection until the next reboot. Thus, btrfsmaintenance will nearly kill my system every week! After disabling btrfs quota, everything works fine. Thus, enabling the experimental btrfs quota feature for snapper was a really, really bad idea. IMHO this is a critical bug if it happens to other users as well.
FWIW, I'm facing the same issue on 42.2.
Me too :-) I don't lose the WiFi connection, but the system is extremely unresponsive. Every now and then, it doesn't react at all for several seconds. The whole ordeal usually takes about 30 minutes. Then, everything works fine again.
Created attachment 708135 [details] (kernel logs)

After 6 hours the job has finished.
Hi guys! Yes, this is a **very** serious problem. I have already reported that my system becomes unresponsive every time btrfs maintenance starts, and I am using Tumbleweed. I posted to the mailing list, but received just one answer:

https://lists.opensuse.org/opensuse-factory/2016-09/msg00130.html

Indeed, when I **disabled** quotas here, the freezes stopped. Thanks for that workaround! Actually, a btrfs developer (Chris Murphy) has already warned that the quota feature is not stable in btrfs and must not be used by default on production systems:

https://lists.opensuse.org/opensuse-factory/2016-09/msg00032.html

However, some openSUSE developers contradicted Chris, especially Richard Brown:

https://lists.opensuse.org/opensuse-factory/2016-09/msg00085.html

Hence, nobody took the advice and quotas were enabled by default in Leap 42.2. Maybe now, with this bug, which I can confirm is happening on **all** my machines with quotas enabled (HP Workstation, Dell laptop, and a MacBook), this decision can be revisited. Furthermore, disabling quota also fixes it on all my machines.
By the way, is it possible to change the bug title to "btrfs balance renders system unresponsive and eventually even kills WiFi when quota is enabled" ?
(In reply to Ronan Chagas from comment #4)
> [...]
> Maybe now with this bug, which I can confirm that is happening in **all** my
> machines with quotas enabled (HP Workstation, Dell laptop, and a Macbook),
> this problem can be revisited. Furthermore, disabling quota fixes it also in
> all my machines.

Hmmm, I fear quotas are enabled because of snapper(8). It seems to use them for some clean-up policies.
Guys, I also sent a message to opensuse-factory mailing list to spread the information about this bug: https://lists.opensuse.org/opensuse-factory/2017-01/msg00022.html I think this is very serious and we must revisit it as soon as possible.
I have been trying to recreate this issue (especially the trace in comment #3) but have not succeeded so far. Richard: Does btrfs check report your filesystem is healthy? Ronan: Are you getting these backtraces in the kernel log as well? btrfs balance is a relatively I/O intensive operation because it has to move around chunks. However, if the tree is balanced frequently, then each balance should not take as much time.
(In reply to Goldwyn Rodrigues from comment #9)
> I have been trying to recreate this issue (especially the trace in comment
> #3) but have not succeeded so far.

Well, I don't expect this to be reproducible within a few minutes. It happened here on my build server after an uptime of more than two weeks.

> Richard: Does btrfs check report your filesystem is healthy?

The check upon boot reports it as healthy. Since it is my rootfs, I cannot run the check directly.

> Ronan: Are you getting these backtraces in the kernel log as well?
>
> btrfs balance is a relatively I/O intensive operation because it has to move
> around chunks. However, if the tree is balanced frequently, then each
> balance should not take as much time.

The reporter here seems to observe the opposite. ;-)
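Since a mounted root filesystem cannot be checked directly, the usual way to answer the "does btrfs check report it healthy" question is from a rescue or live system. A hedged sketch of standard btrfs-progs usage; the device name below is a placeholder for the actual root partition:

```shell
# Boot a rescue/live environment first; the filesystem must NOT be mounted.
# /dev/sda2 is a placeholder for the actual btrfs root partition.
btrfs check --readonly /dev/sda2
```

The read-only mode only reports problems; it never writes to the device, which makes it safe to run before deciding on any repair.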
Yes, as the reporter, I actually do: as I said, I just installed Leap 42.2 on an SSD, and the same day, balancing my system went haywire while processing at least one chunk. So, it is easily reproducible for me by increasing the balance filter. There is no trace in the logs besides my killed WiFi, except one time there was a "NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [btrfs:4090]" in "btrfs_qgroup_trace_extent_nolock+...".
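For reference, "increasing the filter" refers to the usage filter that btrfsmaintenance passes to balance; raising it makes balance relocate more chunks. A hedged sketch with illustrative thresholds (run as root; the exact values here are not from this report):

```shell
# Relocate only data chunks that are at most N% full.
# Higher values touch more chunks and make the qgroup stall easier to trigger.
btrfs balance start -dusage=5 /
btrfs balance start -dusage=25 /
btrfs balance start -dusage=50 /
```

The btrfsmaintenance scripts step through a similar sequence of thresholds on their weekly run, which matches the weekly stalls described above.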
How many subvolumes does the affected file system have?
It is just a default Leap 42.2 installation.

jan@karl:~> mount | grep subvol | wc -l
20
Thanks. There's an issue with discovering backreferences to extents with large numbers of subvolumes that would look like this. It's not the case here.
(In reply to Jeff Mahoney from comment #12)
> How many subvolumes does the affected file system have?

As with Jan, a default 42.2 installation. The only difference is that I'm using snapper on the / and /home subvolumes.
Sorry, I should've been more clear: "Subvolumes" in this context includes all snapshots.
(In reply to Jeff Mahoney from comment #16)
> Sorry, I should've been more clear: "Subvolumes" in this context includes
> all snapshots.

In my case:

spankyham:~ # btrfs subvolume list -a / | wc -l
87

If you need more info, just ask. :-)
At least 40, maybe 80 after the default installation due to installing "missing" packages.
Thanks. It'd need to be much higher for it to matter for that particular problem.
> It happened here in my build server after an uptime of more than two weeks.

I seem to remember reading that build directories (along with VMs and DBs) are one of the situations in which disabling COW/snapshots is advisable?
(In reply to nicholas cunliffe from comment #20)
> i seem to remember reading that build directories (along with VMs DBs) are
> one of the situations in which disabling COW/snapshots is advisabe?

Let's wait for what the btrfs developers say; there is a lot of hearsay available on this topic. I expect btrfs to work with any workload. Sure, disabling COW could bring more performance, but it shouldn't be mandatory for every non-trivial load.
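For reference, the per-directory no-COW advice mentioned above is usually applied with chattr. A hedged sketch; the path is hypothetical, and the attribute only affects files created after it is set:

```shell
# Hypothetical directory holding VM images or databases on btrfs.
mkdir -p /srv/vm-images

# Disable copy-on-write for files subsequently created in this directory.
chattr +C /srv/vm-images

# Verify: the 'C' attribute should appear in the listing.
lsattr -d /srv/vm-images
```

Note that this is orthogonal to the qgroup problem in this report: nodatacow avoids rewrite fragmentation, but does not change quota accounting.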
> It happened here in my build server after an uptime of more than two weeks.

FYI, to me, it happens about once a day (e.g. right now). Maybe snapper is cleaning up old snapshots (as mentioned in comment #6).
This is primarily caused by the patches for the qgroup accounting correction (btrfs: qgroup: Fix qgroup data leaking by using subtree tracing), which call btrfs_qgroup_trace_subtree() twice, once for the src tree and once for the dest tree. This function is CPU intensive, which causes the system to stall. We will have to investigate other ways to perform this correctly.
(In reply to Goldwyn Rodrigues from comment #23)
> This is primarily caused with the patches for qgroup accounting (btrfs:
> qgroup: Fix qgroup data leaking by using subtree tracing) correction which
> calls btrfs_qgroup_trace_subtree() twice, one for the src tree and once for
> dest tree. This function is CPU intensive which causes the system to stall.

What do you suggest as a workaround until the root cause is fixed? Can I disable quotas? I'm not sure whether this will harm snapper.
(In reply to Richard Weinberger from comment #24)
> What do you suggest as workaround until the root cause is fixed?
> Can I disable quotas? I'm not sure whether this will harm snapper.

If you don't have a need for quotas, I'd suggest disabling them until we find a working solution to fix this. Thanks for understanding.
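A minimal sketch of the workaround being discussed, using standard btrfs-progs commands (run as root on the affected filesystem):

```shell
# Check whether quotas (qgroups) are enabled; this errors out if they are not.
btrfs qgroup show /

# Disable quota accounting until the qgroup fixes land.
btrfs quota disable /
```

As the later comments show, disabling quotas also has snapper-side consequences, so the snapper config may need adjusting afterwards.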
(In reply to Goldwyn Rodrigues from comment #25)
> If you don't have a need for quotas, I'd suggest to disable quotas until we
> find a working solution to fix this. Thanks for understanding.

This was not my question. The question was whether it will harm snapper. Both snapper and quotas are enabled by default on 42.2. _I_ don't need quotas, but my fear is that some openSUSE component (i.e. snapper) will fail badly when I disable quotas.
(In reply to Richard Weinberger from comment #26)
> This was not my question. The question was whether it will harm snapper.
> [...]

Hi Richard, I have been using Leap 42.2 without quotas for a very long time (it was a 42.1 that was upgraded). I have never seen any problems at all related to snapper. IIRC, the only feature you will miss in snapper is the ability to auto-clean snapshots. Please, someone correct me if I am wrong.

(In reply to Goldwyn Rodrigues from comment #9)
> Ronan: Are you getting these backtraces in the kernel log as well?

Hi Goldwyn, sorry, I was kind of offline the last couple of days. Yes, I am seeing those backtraces in the kernel log when quotas are enabled. After disabling them, they seem to be gone.
(In reply to Richard Weinberger from comment #26)
> This was not my question. The question was whether it will harm snapper.
> [...]

No, I don't think it will affect snapper or any other component.
> [...]
> Can I disable quotas? I'm not sure whether this will harm snapper.

It actually will if you used quotas before:

# snapper cleanup number
quota not working (preparing quota failed)

# snapper get-config | grep QGROUP
QGROUP | 1/0

This fixes it:

# snapper set-config QGROUP=

However, I do not know how to re-enable it! Maybe you need the original value of QGROUP.
(In reply to Jan Ritzerfeld from comment #29)
> It actually will if you used quotas before:
> [...]
> This fixes it:
> # snapper set-config QGROUP=

Yeah, same here. I didn't enable quotas in snapper, this seems to be a default setting... Well done.</sarcasm>
(In reply to Jan Ritzerfeld from comment #29)
> This fixes it:
> # snapper set-config QGROUP=

Well, no. It only worked here because snapper seems to cache some of its config; changes made directly in the config file take some time to apply. So, man snapper is correct, and the LIMIT variables must not contain ranges when quotas are off:

# snapper set-config QGROUP= NUMBER_LIMIT=10 NUMBER_LIMIT_IMPORTANT=10
Subvolume quotas are the mechanism btrfs uses to track extent ownership. Snapper uses them to make informed decisions about how much space will be freed if a given snapshot is removed.
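For completeness, a hedged sketch of re-enabling the snapper/quota integration later, based on the `snapper setup-quota` command that comes up further down in this thread (run as root):

```shell
# Re-enable quota accounting on the filesystem.
btrfs quota enable /

# Let snapper recreate its qgroup and record it in the config.
snapper setup-quota

# Verify that the QGROUP entry is set again.
snapper get-config | grep QGROUP
```

As Jan notes later in the thread, snapshots taken while quotas were off may still need their qgroup assigned by hand, so treat this as a starting point rather than a complete recipe.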
*kind ping* :-) Is there any patch I can test so far?
Signing up to follow this issue, as today my system is totally freezing up and it may explain why every Monday I have issues getting the system up.
I'm seeing similar issues, and they seem to have increased since I enabled snapper (including quota) on my /home some days ago (btrfs-transaction takes 100% CPU for a while and blocks any IO on /home). The issue seems to be correlated with resuming from suspend, but I'm not sure. I noticed some potentially interesting info in my logs:

BTRFS info (device sda3): qgroup scan completed (inconsistency flag cleared)

and:

kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 3 PID: 3211 at ../fs/btrfs/qgroup.c:2923 btrfs_qgroup_free_meta+0x87/0x90 [btrfs]()
kernel: Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf
kernel: ptp mei_me iTCO_wdt ansi_cprng iTCO_vendor_support regmap_i2c snd_timer parport_pc mei cfg80211 8250_fintek btrtl dell_laptop pcspkr pps_core btbcm aesni_intel sn
kernel: i2c_algo_bit
kernel: usbcore drm_kms_helper usb_common syscopyarea sysfillrect sdhci_pci sysimgblt fb_sys_fops drm i2c_hid video sdhci_acpi sdhci mmc_core button sg dm_multipath dm_mo
kernel: CPU: 3 PID: 3211 Comm: snapperd Tainted: G W 4.4.36-8-default #1
kernel: Hardware name: Dell Inc. Latitude E7250/0TVD2T, BIOS A15 12/26/2016
kernel: 0000000000000000 ffffffff81327b17 0000000000000000 ffffffffa056a168
kernel: ffffffff8107e841
kernel: ffff8803fe76b800 0000000000008000 ffff8803f7089c0c
kernel: 00000000000c0000
kernel: ffff8803f7089db8
kernel: ffffffffa0552fd7 ffff8803fe76b800
kernel: Call Trace:
kernel: [<ffffffff81019ea9>] dump_trace+0x59/0x320
kernel: [<ffffffff8101a26a>] show_stack_log_lvl+0xfa/0x180
kernel: [<ffffffff8101b011>] show_stack+0x21/0x40
kernel: [<ffffffff81327b17>] dump_stack+0x5c/0x85
kernel: [<ffffffff8107e841>] warn_slowpath_common+0x81/0xb0
kernel: [<ffffffffa0552fd7>] btrfs_qgroup_free_meta+0x87/0x90 [btrfs]
kernel: [<ffffffffa04d5270>] btrfs_delalloc_reserve_metadata+0x200/0x4a0 [btrfs]
kernel: [<ffffffffa04fbb2a>] __btrfs_buffered_write+0x17a/0x5b0 [btrfs]
kernel: [<ffffffffa04ff376>] btrfs_file_write_iter+0x176/0x540 [btrfs]
kernel: [<ffffffff81204f39>] __vfs_write+0xa9/0x100
kernel: [<ffffffff8120562d>] vfs_write+0x9d/0x190
kernel: [<ffffffff812062f2>] SyS_write+0x42/0xa0
kernel: [<ffffffff8160a8f2>] entry_SYSCALL_64_fastpath+0x16/0x71
kernel: DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x16/0x71
kernel:
kernel: Leftover inexact backtrace:
kernel: ---[ end trace d4465d6cbfeeee27 ]---

as well as:

kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 3 PID: 425 at ../fs/btrfs/qgroup.c:2923 btrfs_qgroup_free_meta+0x87/0x90 [btrfs]()
kernel: Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf
kernel: ptp mei_me iTCO_wdt ansi_cprng iTCO_vendor_support regmap_i2c snd_timer parport_pc mei cfg80211 8250_fintek btrtl dell_laptop pcspkr pps_core btbcm aesni_intel sn
kernel: i2c_algo_bit usbcore drm_kms_helper usb_common syscopyarea sysfillrect sdhci_pci sysimgblt fb_sys_fops drm i2c_hid video sdhci_acpi sdhci mmc_core button sg dm_mu
kernel: CPU: 3 PID: 425 Comm: systemd-journal Tainted: G W 4.4.36-8-default #1
Feb 04 13:25:17 latitude.par.novell.com kernel: Hardware name: Dell Inc. Latitude E7250/0TVD2T, BIOS A15 12/26/2016
kernel: 0000000000000000 ffffffff81327b17 0000000000000000 ffffffffa056a168
kernel: ffffffff8107e841 ffff8803fe76b800 000000000002c000 ffff8803fe76b800
kernel: 000000000002c000 ffff8803fd4a81d0 ffffffffa0552fd7 ffffffffffffffe4
kernel: Call Trace:
kernel: [<ffffffff81019ea9>] dump_trace+0x59/0x320
kernel: [<ffffffff8101a26a>] show_stack_log_lvl+0xfa/0x180
kernel: [<ffffffff8101b011>] show_stack+0x21/0x40
kernel: [<ffffffff81327b17>] dump_stack+0x5c/0x85
kernel: [<ffffffff8107e841>] warn_slowpath_common+0x81/0xb0
kernel: [<ffffffffa0552fd7>] btrfs_qgroup_free_meta+0x87/0x90 [btrfs]
kernel: [<ffffffffa04e9997>] start_transaction+0x3c7/0x4e0 [btrfs]
kernel: [<ffffffffa04f72c7>] btrfs_rename2+0x157/0x7b0 [btrfs]
kernel: [<ffffffff81211783>] vfs_rename+0x4b3/0x810
kernel: [<ffffffff8121675e>] SyS_rename+0x35e/0x3c0
kernel: [<ffffffff8160a8f2>] entry_SYSCALL_64_fastpath+0x16/0x71
kernel: DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x16/0x71
Just for the record I never suspend my system. It is always a complete boot cycle.
I have the same problem with freezes, with and without QGROUP=, normally on Sundays (coincidence?).

In top I see btrfs balance at 100% or btrfs-transacti at 100%. This blocks any input. When this happens while the screensaver is on, no login is possible. The problem lasts as long as btrfs runs, circa 1 h.

I think this is not just priority "high". This is a critical bug.
(In reply to Eric Schirra from comment #39)
> I have same problem with freeze
> With and without QGROUP=
> [...]
> In top i see btrfs balance with 100% or btrfs transacti with 100%

Just to confirm, did you disable the quotas in btrfs? You can check this by running the command:

btrfs qgroup show /

All my problems related to this bug went away after I disabled quotas. As I pointed out in my comment #4, btrfs devs warned some time ago that quota is an unstable feature and we should avoid using it. However, it seems that you will lose a YaST feature if you disable quotas (something related to auto-cleaning snapshots, IIRC).

> I think this is not only high.
> This is a critical bug.

I totally agree. If this bug is so hard to fix and depends on upstream, we should really start thinking about disabling quotas by default, at least in Leap.
Okay, I have now disabled quota with:

btrfs quota disable /

Now a manual "btrfs balance start /" no longer stalls my PC and its inputs. We will now see what happens when cron runs btrfsmaintenance (I think it will be on Sunday). Then I will post my experience.
Here I have found the same problem in Gentoo, with kernels 4.4.6 and 4.8.0: https://www.reddit.com/r/btrfs/comments/4qz1qd/problems_with_btrfs_quota/
So, it seems that after disabling quota the problem is gone. It did not only freeze the PC for some time; in my case, I damaged my filesystem because I did not wait long enough. In my opinion, quota should be disabled immediately! And this is a critical bug!
I've had a similar problem since installing Tumbleweed in November. However, whenever I run `sudo btrfs quota disable /` my system becomes unresponsive and I force a reboot after 10 or 15 minutes. What does that command do exactly? Does it just need time to run?
(In reply to Christopher Brodt from comment #44)
> I've had a similar problem since installing Tumbleweed in November. However,
> whenever I run `sudo btrfs quota disable /` my system becomes unresponsive
> and I force a reboot after 10 or 15 minutes. What does that command do
> exactly? Does it just need time to run?

Hi Christopher, this command executed here in seconds. Are you sure that no other btrfs maintenance command is running while you are trying to disable quotas? Furthermore, how many snapshots do you have?
I've got 22 snapshots. I'm not aware of any other maintenance commands running, but I did notice this when viewing the qgroups:

cbrodt@cbrodt-traitify2 ~: sudo btrfs qgroup show /
WARNING: rescan is running, qgroup data may be incorrect

That message is always there, so maybe that's blocking it?
(In reply to Christopher Brodt from comment #46)
> cbrodt@cbrodt-traitify2 ~: sudo btrfs qgroup show /
> WARNING: rescan is running, qgroup data may be incorrect
>
> That message is always there, so maybe that's blocking it?

Can you please post the output of `btrfs quota rescan -s /`?
Here you go:

rescan operation running (current key 11898896385)
(In reply to Christopher Brodt from comment #48)
> rescan operation running (current key 11898896385)

This explains what you are seeing, I think: you have a rescan operation running, and it must finish before quotas can be disabled. I have never seen this kind of problem myself (I have already disabled quotas on 6 machines). Maybe another user can tell you how to safely stop the rescan.
Interestingly, the rescan operation is at the exact same key. I've not suspended or rebooted in at least 12 hours. So it really seems like it's never going to finish?
(In reply to Christopher Brodt from comment #50)
> Interestingly, the rescan operation is in the exact same key. I've not
> suspended or rebooted in at least 12 hours. So it really seems like it's
> never going to finish?

Maybe; this is very strange. I have no idea what is going on. Can you try to reboot?
restart doesn't change anything; btrfs reports same rescan operation running
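The polling this thread has been doing by hand can be scripted by matching the status text of `btrfs quota rescan -s`. A minimal sketch; the helper name `rescan_running` is mine, not a btrfs-progs interface, and it only matches the exact wording shown above:

```shell
# Return 0 (true) if the given status line indicates a rescan in progress.
rescan_running() {
    case "$1" in
        *"rescan operation running"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Typical use (requires root and a btrfs mount):
#   status=$(btrfs quota rescan -s /)
#   rescan_running "$status" && echo "rescan still running"
```

Checking the reported key across two runs, as done manually above, is a reasonable way to tell a slow rescan apart from a stuck one.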
(In reply to Christopher Brodt from comment #52)
> restart doesn't change anything; btrfs reports same rescan operation running

How many snapshots do you have on this system? There is an algorithmic problem with qgroups that we're working to resolve (as the focus of this report): as the number of references to an extent rises, the runtime for accounting them goes up exponentially.
To ask the obvious: have you done a scrub? I think the question of a large number of snapshots has been asked before and came up negative. Could the same problem be caused by heavy fragmentation? The level of data collection is a tragedy: no info on rollbacks, SSD vs. HD, snapshots, <insert parameter here>... It appears to me a game of blind man's bluff.
It's a Dell XPS 13 9360 with an SSD. I have not run a scrub; what command should I use? I'm not really sure about your other concerns. What is your question concerning rollbacks? I did one months ago when I had an issue with a TW snapshot, but that's been resolved. The number of snapshots on my system is the same as I posted previously.
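For reference, standard btrfs-progs usage for the scrub that was suggested above (run as root on the mounted filesystem):

```shell
# Start a scrub; it runs in the background and verifies all checksums.
btrfs scrub start /

# Poll its progress and error counters while it runs.
btrfs scrub status /
```

A clean scrub rules out on-disk corruption as a cause, which is useful to know before digging further into the qgroup behavior.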
I have that problem too. I can't disable quotas because there is a rescan operation running (and it has been stuck at the same key for days). In my case, scrub said that there isn't any problem.
I've also experienced this bug on Tumbleweed. I formatted my system a few days ago and created GPT partitions to enable UEFI. After booting, I started seeing random slowdowns. I noticed the issue starting to happen frequently after I rolled back to an earlier snapshot. I'm not sure if this is related or not, but btrfs-transacti has been eating one of my cores for over 2-3 hours on and off.
openSUSE-SU-2017:0907-1: An update that solves 11 vulnerabilities and has 41 fixes is now available.

Category: security (important)

Bug References: 1007959,1007962,1008842,1011913,1012910,1013994,1015609,1017461,1017641,1018263,1018419,1019163,1019618,1020048,1022785,1023866,1024015,1025235,1025683,1026405,1026462,1026505,1026509,1026692,1026722,1027054,1027066,1027179,1027189,1027190,1027195,1027273,1027565,1027575,1028017,1028041,1028158,1028217,1028325,1028372,1028415,1028819,1028895,1029220,1029986,1030573,1030575,951844,968697,969755,982783,998106

CVE References: CVE-2016-10200,CVE-2016-2117,CVE-2016-9191,CVE-2017-2596,CVE-2017-2636,CVE-2017-6214,CVE-2017-6345,CVE-2017-6346,CVE-2017-6347,CVE-2017-6353,CVE-2017-7184

Sources used:
openSUSE Leap 42.2 (src): kernel-debug-4.4.57-18.3.1, kernel-default-4.4.57-18.3.1, kernel-docs-4.4.57-18.3.2, kernel-obs-build-4.4.57-18.3.1, kernel-obs-qa-4.4.57-18.3.1, kernel-source-4.4.57-18.3.1, kernel-syms-4.4.57-18.3.1, kernel-vanilla-4.4.57-18.3.1
So, this issue should now be fixed by the following upstream commit?

commit fb235dc06fac9eaa4408ade9c8b20d45d63c89b7
Author: Qu Wenruo <quwenruo@cn.fujitsu.com>
Date: Wed Feb 15 10:43:03 2017 +0800

    btrfs: qgroup: Move half of the qgroup accounting time out of commit trans

    Just as Filipe pointed out, the most time consuming parts of qgroup are
    btrfs_qgroup_account_extents() and btrfs_qgroup_prepare_account_extents().
    Which both call btrfs_find_all_roots() to get old_roots and new_roots ulist.

    What makes things worse is, we're calling that expensive
    btrfs_find_all_roots() at transaction committing time with
    TRANS_STATE_COMMIT_DOING, which will blocks all incoming transaction.

    Such behavior is necessary for @new_roots search as current
    btrfs_find_all_roots() can't do it correctly so we do call it just before
    switch commit roots.

    However for @old_roots search, it's not necessary as such search is based
    on commit_root, so it will always be correct and we can move it out of
    transaction committing.

    This patch moves the @old_roots search part out of commit_transaction(),
    so in theory we can half the time qgroup time consumption at
    commit_transaction().

    But please note that, this won't speedup qgroup overall, the total time
    consumption is still the same, just reduce the performance stall.

    Cc: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
    Reviewed-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
(In reply to Richard Weinberger from comment #59)
> So, this issue should now be fixed by the following upstream commit?
> [...]

1. The time and CPU power needed for a simple "btrfs balance" is still ridiculously high, and
2. the performance stall is still there, even if less frequent.

I updated the kernel and re-enabled quotas (not that easy). And even metadata balancing still
1. takes 15 minutes while completely hogging one CPU, on a laptop on battery (a recipe for disaster), and
2. frequently delays starting shell commands, causes severe WiFi packet loss, and locks up the system for several seconds.

Unfortunately, this issue is not fixed.
(In reply to Richard Weinberger from comment #59)
> So, this issue should now be fixed by the following upstream commit?

It's only part of the fix. The soft lockups are prevented by d8422ba334f (btrfs: backref: Fix soft lockup in __merge_refs function).

(In reply to Jan Ritzerfeld from comment #60)
> I updated the kernel and re-enabled quotas (not that easy).

Issuing `snapper setup-quota' not easy?

> And even meta data balancing still
> 1. takes 15 minutes while completely hogging 1 CPU using a laptop on battery
> (recipe for a disaster), and
> 2. frequently delays starting shell commands, causes severe WiFi packet
> loss, and locks up the system for several seconds.

Same here with Tumbleweed.
(In reply to Libor Pechacek from comment #61)
> It's only part of the fix. The soft lockups are prevented by d8422ba334f
> (btrfs: backref: Fix soft lockup in __merge_refs function).

Hmm, is this commit included in openSUSE-SU-2017:0907-1?

> Issuing `snapper setup-quota' not easy?

Sure, but that doesn't work, because a "snapper cleanup number" then says "quota not working (preparing quota failed)". I had to manually assign the correct qgroup to the snapshot subvolumes that had been taken without a qgroup; snapper only did this automatically for the first snapshot without a qgroup. Took me an hour to figure that out...
(In reply to Jan Ritzerfeld from comment #62)
> Hmm, is this commit included in openSUSE-SU-2017:0907-1?

AFAICT yes: http://kernel.suse.com/cgit/kernel/log/?h=rpm-4.4.57-18.3&ofs=50

Also feel free to inspect the package changelog (rpm -q --changelog kernel-default-4.4.57-18.3.1), which should contain a record named "btrfs: backref: Fix soft lockup in __merge_refs function" and a reference to this Bugzilla.

> Sure, but that doesn't work because a "snapper cleanup number" then says
> "quota not working (preparing quota failed)".

I see. I didn't know about these dark corners. Is that perhaps something for a bug report?
(In reply to Libor Pechacek from comment #63)
> AFAICT yes: http://kernel.suse.com/cgit/kernel/log/?h=rpm-4.4.57-18.3&ofs=50

Many thanks for your help!

> Also feel free to inspect the package changelog (rpm -q --changelog
> kernel-default-4.4.57-18.3.1), which should contain a record named "btrfs:
> backref: Fix soft lockup in __merge_refs function" and a reference to this
> Bugzilla.

That's what I thought I did. And yes, it is included. I didn't find it at first because the changelog is not ordered by date: the first entry is dated 2017-02-19 and the last 2009-03-04. However, the record you mentioned is dated 2017-03-27 and found at line 36776?!

> I see. I didn't know about these dark corners.

Me too! I already noticed that I was not able to re-enable them in comment #29. :)

> Is that perhaps something for a bug report?

Maybe https://github.com/openSUSE/snapper/issues/257? Because of this issue, at least the exception message "preparing quota failed" was added in https://github.com/openSUSE/snapper/issues/259.
The problem is still present. My 8-core server still becomes totally unusable for a very long time.
btrfs-cleaner completely hogs a single CPU in kernel space:

spankyham:~ # top
top - 20:11:09 up 1:23, 1 user, load average: 3,85, 4,44, 3,85
Tasks: 187 total, 2 running, 185 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0,0 us, 0,0 sy, 0,0 ni,100,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu1 : 0,0 us, 0,0 sy, 0,0 ni,100,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu2 : 0,0 us,100,0 sy, 0,0 ni, 0,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu3 : 0,0 us, 0,0 sy, 0,0 ni,100,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu4 : 0,0 us, 0,0 sy, 0,0 ni,100,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu5 : 0,0 us, 0,0 sy, 0,0 ni,100,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu6 : 0,0 us, 0,0 sy, 0,0 ni,100,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu7 : 0,0 us, 0,0 sy, 0,0 ni,100,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
KiB Mem: 16403976 total, 5765432 used, 10638544 free, 3340 buffers
KiB Swap: 0 total, 0 used, 0 free. 4155444 cached Mem

  PID USER  PR  NI  VIRT  RES  SHR S  %CPU  %MEM    TIME+ COMMAND
  409 root  20   0     0    0    0 R 99,67 0,000 38:52.67 btrfs-cleaner
    1 root  20   0 37436 5632 4028 S 0,000 0,034  0:01.92 systemd
    2 root  20   0     0    0    0 S 0,000 0,000  0:00.00 kthreadd
    3 root  20   0     0    0    0 S 0,000 0,000  0:00.00 ksoftirqd/0
    5 root   0 -20     0    0    0 S 0,000 0,000  0:00.00 kworker/0:0H
    7 root  20   0     0    0    0 S 0,000 0,000  0:00.67 rcu_sched
    8 root  20   0     0    0    0 S 0,000 0,000  0:00.00 rcu_bh
    9 root  rt   0     0    0    0 S 0,000 0,000  0:00.18 migration/0

spankyham:~ # cat /proc/409/stack
[<ffffffffa02de2f0>] __btrfs_find_all_roots+0xc0/0x130 [btrfs]
[<ffffffffa02de3d0>] btrfs_find_all_roots+0x50/0x70 [btrfs]
[<ffffffffa02e1fa0>] btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
[<ffffffffa02e2216>] btrfs_qgroup_trace_leaf_items+0x116/0x140 [btrfs]
[<ffffffffa02e23fc>] btrfs_qgroup_trace_subtree+0x1bc/0x340 [btrfs]
[<ffffffffa025ed03>] do_walk_down+0x363/0x540 [btrfs]
[<ffffffffa025dc6d>] walk_down_proc+0x2ad/0x2e0 [btrfs]
[<ffffffffa025ef99>] walk_down_tree+0xb9/0xf0 [btrfs]
[<ffffffffa02615b4>] btrfs_drop_snapshot+0x384/0x800 [btrfs]
[<ffffffffa02d372b>] btrfs_kill_all_delayed_nodes+0x4b/0x100 [btrfs]
[<ffffffffa0278af5>] btrfs_clean_one_deleted_snapshot+0xb5/0x110 [btrfs]
[<ffffffffa02708b8>] cleaner_kthread+0x1a8/0x230 [btrfs]
[<ffffffffa0270710>] cleaner_kthread+0x0/0x230 [btrfs]
[<ffffffff8109d3d8>] kthread+0xc8/0xe0
[<ffffffff8109d310>] kthread+0x0/0xe0
[<ffffffff8160b2cf>] ret_from_fork+0x3f/0x70
[<ffffffff8109d310>] kthread+0x0/0xe0
[<ffffffffffffffff>] 0xffffffffffffffff
The update made the situation *much* worse. Right now I'm facing the following situation: the system is idle, but many threads are blocked. Maybe a locking bug?

top - 20:23:11 up 1 day, 1:35, 1 user, load average: 70,91, 67,76, 62,45
Tasks: 291 total, 1 running, 289 sleeping, 0 stopped, 1 zombie
%Cpu0 : 0,0 us, 0,3 sy, 0,0 ni, 99,7 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu1 : 0,0 us, 0,0 sy, 0,0 ni,100,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu2 : 0,0 us, 0,0 sy, 0,0 ni,100,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu3 : 0,0 us, 0,0 sy, 0,0 ni,100,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu4 : 0,0 us, 0,0 sy, 0,0 ni,100,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu5 : 0,0 us, 0,0 sy, 0,0 ni,100,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu6 : 0,0 us, 0,0 sy, 0,0 ni,100,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu7 : 0,0 us, 0,0 sy, 0,0 ni,100,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st

spankyham:~ # ps -axe -o state | grep D | wc -l
71

I attached the current kernel stack traces of all blocked threads; maybe this helps you.
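For anyone wanting to collect the same data on their own machine, here is a small sketch that generalizes the `ps` one-liner above: it counts tasks in uninterruptible sleep (state D) and dumps their kernel stacks where readable. This is my own illustration, not taken from the report; reading /proc/&lt;pid&gt;/stack typically requires root, so the script degrades gracefully without it.

```shell
# Sketch: count D-state (blocked) tasks and dump their kernel stacks.
# Requires Linux procps; stack dumps usually need root and may be skipped.
blocked=$(ps -eo pid=,state= | awk '$2 == "D" { print $1 }')
count=$(printf '%s\n' "$blocked" | grep -c '^[0-9]' || true)
echo "blocked tasks: $count"
for pid in $blocked; do
    echo "=== pid $pid ==="
    cat "/proc/$pid/stack" 2>/dev/null || echo "(stack not readable, need root?)"
done
```

On a healthy idle system the count should be 0 or close to it; a persistently high number, as in the `71` above, points at threads stuck in the kernel.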
Created attachment 719833 [details] stack traces of blocked threads
As mentioned, fb235dc06 is not expected to be a complete fix. However looking at the stacks you may be encountering a regression. Could you run with fb235dc06 reverted?
(In reply to Edmund Nadolski from comment #69)
> As mentioned, fb235dc06 is not expected to be a complete fix. However
> looking at the stacks you may be encountering a regression. Could you run
> with fb235dc06 reverted?

Sure. Will take 2-3 days.
(In reply to Richard Weinberger from comment #70)
> (In reply to Edmund Nadolski from comment #69)
> > As mentioned, fb235dc06 is not expected to be a complete fix. However
> > looking at the stacks you may be encountering a regression. Could you run
> > with fb235dc06 reverted?
>
> Sure. Will take 2-3 days.

With that commit reverted I don't see the lockup anymore, although, as expected, btrfs-balance still consumes a lot of CPU. The system has an uptime of 36 h and a typical workload.
Thanks for running this. Could you pls. open a new bug to track the lockup/regression, and I will go ahead and revert the change.
Filed 1033885 to track the regression/revert.
openSUSE-SU-2017:1140-1: An update that solves 10 vulnerabilities and has 49 fixes is now available. Category: security (important) Bug References: 1010032,1012452,1012829,1013887,1014136,1017461,1019614,1021424,1021762,1022340,1023287,1027153,1027512,1027616,1027974,1028027,1028217,1028415,1028883,1029514,1029634,1030070,1030118,1030213,1031003,1031052,1031147,1031200,1031206,1031208,1031440,1031512,1031555,1031579,1031662,1031717,1031831,1032006,1032141,1032345,1032400,1032581,1032673,1032681,1032803,1033117,1033281,1033336,1033340,1033885,1034048,1034419,1034671,1034902,970083,986362,986365,988065,993832 CVE References: CVE-2016-4997,CVE-2016-4998,CVE-2017-2671,CVE-2017-7187,CVE-2017-7261,CVE-2017-7294,CVE-2017-7308,CVE-2017-7374,CVE-2017-7616,CVE-2017-7618 Sources used: openSUSE Leap 42.2 (src): kernel-debug-4.4.62-18.6.1, kernel-default-4.4.62-18.6.1, kernel-docs-4.4.62-18.6.2, kernel-obs-build-4.4.62-18.6.1, kernel-obs-qa-4.4.62-18.6.1, kernel-source-4.4.62-18.6.1, kernel-syms-4.4.62-18.6.1, kernel-vanilla-4.4.62-18.6.1
SUSE-SU-2017:1183-1: An update that solves 16 vulnerabilities and has 69 fixes is now available. Category: security (important) Bug References: 1007959,1007962,1008842,1010032,1011913,1012382,1012910,1013994,1014136,1015609,1017461,1017641,1018263,1018419,1019163,1019614,1019618,1020048,1021762,1022340,1022785,1023866,1024015,1025683,1026024,1026405,1026462,1026505,1026509,1026692,1026722,1027054,1027066,1027153,1027179,1027189,1027190,1027195,1027273,1027616,1028017,1028027,1028041,1028158,1028217,1028325,1028415,1028819,1028895,1029220,1029514,1029634,1029986,1030118,1030213,1031003,1031052,1031200,1031206,1031208,1031440,1031481,1031579,1031660,1031662,1031717,1031831,1032006,1032673,1032681,897662,951844,968697,969755,970083,977572,977860,978056,980892,981634,982783,987899,988281,991173,998106 CVE References: CVE-2016-10200,CVE-2016-2117,CVE-2016-9191,CVE-2017-2596,CVE-2017-2671,CVE-2017-6074,CVE-2017-6214,CVE-2017-6345,CVE-2017-6346,CVE-2017-6347,CVE-2017-6353,CVE-2017-7187,CVE-2017-7261,CVE-2017-7294,CVE-2017-7308,CVE-2017-7374 Sources used: SUSE Linux Enterprise Workstation Extension 12-SP2 (src): kernel-default-4.4.59-92.17.3 SUSE Linux Enterprise Software Development Kit 12-SP2 (src): kernel-docs-4.4.59-92.17.8, kernel-obs-build-4.4.59-92.17.3 SUSE Linux Enterprise Server for Raspberry Pi 12-SP2 (src): kernel-default-4.4.59-92.17.3, kernel-source-4.4.59-92.17.2, kernel-syms-4.4.59-92.17.2 SUSE Linux Enterprise Server 12-SP2 (src): kernel-default-4.4.59-92.17.3, kernel-source-4.4.59-92.17.2, kernel-syms-4.4.59-92.17.2 SUSE Linux Enterprise Live Patching 12 (src): kgraft-patch-SLE12-SP2_Update_7-1-2.3 SUSE Linux Enterprise High Availability 12-SP2 (src): kernel-default-4.4.59-92.17.3 SUSE Linux Enterprise Desktop 12-SP2 (src): kernel-default-4.4.59-92.17.3, kernel-source-4.4.59-92.17.2, kernel-syms-4.4.59-92.17.2 OpenStack Cloud Magnum Orchestration 7 (src): kernel-default-4.4.59-92.17.3
Test affected: [osd#926447#step/dns_srv/14](https://openqa.suse.de/tests/926447#step/dns_srv/14)
Hi, I voted 5 points for this issue. I am using Tumbleweed on my main development laptop. In the last 2-3 weeks I have encountered many hiccups where the desktop locks up for ~10 minutes. Currently running the latest version.
This is an autogenerated message for OBS integration: This bug (1017461) was mentioned in https://build.opensuse.org/request/show/504376 42.3 / kernel-source
Hi, I encountered the same issue (AMD Ryzen, M.2 PCIe SSD, Leap 42.2) with btrfs as the root fs. The number of snapshots is:

# btrfs subvolume list -a | wc -l
55

I cannot disable quotas because there is always a rescan operation running:

# btrfs quota rescan -a /
rescan operation running (current key 0)

(The "key" does not change.) "btrfs scrub" showed no errors. Any ideas how I can at least disable the btrfs quotas?
SUSE-SU-2017:1853-1: An update that solves 15 vulnerabilities and has 162 fixes is now available. Category: security (important) Bug References: 1003581,1004003,1011044,1012060,1012382,1012422,1012452,1012829,1012910,1012985,1013561,1013887,1015342,1015452,1017461,1018885,1020412,1021424,1022266,1022595,1023287,1025461,1026570,1027101,1027512,1027974,1028217,1028310,1028340,1028883,1029607,1030057,1030070,1031040,1031142,1031147,1031470,1031500,1031512,1031555,1031717,1031796,1032141,1032339,1032345,1032400,1032581,1032803,1033117,1033281,1033336,1033340,1033885,1034048,1034419,1034635,1034670,1034671,1034762,1034902,1034995,1035024,1035866,1035887,1035920,1035922,1036214,1036638,1036752,1036763,1037177,1037186,1037384,1037483,1037669,1037840,1037871,1037969,1038033,1038043,1038085,1038142,1038143,1038297,1038458,1038544,1038842,1038843,1038846,1038847,1038848,1038879,1038981,1038982,1039214,1039348,1039354,1039700,1039864,1039882,1039883,1039885,1039900,1040069,1040125,1040182,1040279,1040351,1040364,1040395,1040425,1040463,1040567,1040609,1040855,1040929,1040941,1041087,1041160,1041168,1041242,1041431,1041810,1042286,1042356,1042421,1042517,1042535,1042536,1042863,1042886,1043014,1043231,1043236,1043347,1043371,1043467,1043488,1043598,1043912,1043935,1043990,1044015,1044082,1044120,1044125,1044532,1044767,1044772,1044854,1044880,1044912,1045154,1045235,1045286,1045307,1045467,1045568,1046105,1046434,1046589,799133,863764,922871,939801,966170,966172,966191,966321,966339,971975,988065,989311,990058,990682,993832,995542 CVE References: CVE-2017-1000365,CVE-2017-1000380,CVE-2017-7346,CVE-2017-7487,CVE-2017-7616,CVE-2017-7618,CVE-2017-8890,CVE-2017-8924,CVE-2017-8925,CVE-2017-9074,CVE-2017-9075,CVE-2017-9076,CVE-2017-9077,CVE-2017-9150,CVE-2017-9242 Sources used: SUSE Linux Enterprise Workstation Extension 12-SP2 (src): kernel-default-4.4.74-92.29.1 SUSE Linux Enterprise Software Development Kit 12-SP2 (src): kernel-docs-4.4.74-92.29.3, kernel-obs-build-4.4.74-92.29.1 
SUSE Linux Enterprise Server for Raspberry Pi 12-SP2 (src): kernel-default-4.4.74-92.29.1, kernel-source-4.4.74-92.29.1, kernel-syms-4.4.74-92.29.1 SUSE Linux Enterprise Server 12-SP2 (src): kernel-default-4.4.74-92.29.1, kernel-source-4.4.74-92.29.1, kernel-syms-4.4.74-92.29.1 SUSE Linux Enterprise Live Patching 12 (src): kgraft-patch-SLE12-SP2_Update_10-1-4.1 SUSE Linux Enterprise High Availability 12-SP2 (src): kernel-default-4.4.74-92.29.1 SUSE Linux Enterprise Desktop 12-SP2 (src): kernel-default-4.4.74-92.29.1, kernel-source-4.4.74-92.29.1, kernel-syms-4.4.74-92.29.1 OpenStack Cloud Magnum Orchestration 7 (src): kernel-default-4.4.74-92.29.1
SUSE-SU-2017:1990-1: An update that solves 43 vulnerabilities and has 282 fixes is now available. Category: security (important) Bug References: 1000092,1003077,1003581,1004003,1007729,1007959,1007962,1008842,1009674,1009718,1010032,1010612,1010690,1011044,1011176,1011913,1012060,1012382,1012422,1012452,1012829,1012910,1012985,1013001,1013561,1013792,1013887,1013994,1014120,1014136,1015342,1015367,1015452,1015609,1016403,1017164,1017170,1017410,1017461,1017641,1018100,1018263,1018358,1018385,1018419,1018446,1018813,1018885,1018913,1019061,1019148,1019163,1019168,1019260,1019351,1019594,1019614,1019618,1019630,1019631,1019784,1019851,1020048,1020214,1020412,1020488,1020602,1020685,1020817,1020945,1020975,1021082,1021248,1021251,1021258,1021260,1021294,1021424,1021455,1021474,1021762,1022181,1022266,1022304,1022340,1022429,1022476,1022547,1022559,1022595,1022785,1022971,1023101,1023175,1023287,1023762,1023866,1023884,1023888,1024015,1024081,1024234,1024508,1024938,1025039,1025235,1025461,1025683,1026024,1026405,1026462,1026505,1026509,1026570,1026692,1026722,1027054,1027066,1027101,1027153,1027179,1027189,1027190,1027195,1027273,1027512,1027565,1027616,1027974,1028017,1028027,1028041,1028158,1028217,1028310,1028325,1028340,1028372,1028415,1028819,1028883,1028895,1029220,1029514,1029607,1029634,1029986,1030057,1030070,1030118,1030213,1030573,1031003,1031040,1031052,1031142,1031147,1031200,1031206,1031208,1031440,1031470,1031500,1031512,1031555,1031579,1031662,1031717,1031796,1031831,1032006,1032141,1032339,1032345,1032400,1032581,1032673,1032681,1032803,1033117,1033281,1033287,1033336,1033340,1033885,1034048,1034419,1034635,1034670,1034671,1034762,1034902,1034995,1035024,1035866,1035887,1035920,1035922,1036214,1036638,1036752,1036763,1037177,1037186,1037384,1037483,1037669,1037840,1037871,1037969,1038033,1038043,1038085,1038142,1038143,1038297,1038458,1038544,1038842,1038843,1038846,1038847,1038848,1038879,1038981,1038982,1039348,1039354,1039700,1039864,1039882,1039883
,1039885,1039900,1040069,1040125,1040182,1040279,1040351,1040364,1040395,1040425,1040463,1040567,1040609,1040855,1040929,1040941,1041087,1041160,1041168,1041242,1041431,1041810,1042200,1042286,1042356,1042421,1042517,1042535,1042536,1042863,1042886,1043014,1043231,1043236,1043347,1043371,1043467,1043488,1043598,1043912,1043935,1043990,1044015,1044082,1044120,1044125,1044532,1044767,1044772,1044854,1044880,1044912,1045154,1045235,1045286,1045307,1045340,1045467,1045568,1046105,1046434,1046589,799133,863764,870618,922871,951844,966170,966172,966191,966321,966339,968697,969479,969755,970083,971975,982783,985561,986362,986365,987192,987576,988065,989056,989311,990058,990682,991273,993832,995542,995968,998106 CVE References: CVE-2016-10200,CVE-2016-2117,CVE-2016-4997,CVE-2016-4998,CVE-2016-7117,CVE-2016-9191,CVE-2017-1000364,CVE-2017-1000365,CVE-2017-1000380,CVE-2017-2583,CVE-2017-2584,CVE-2017-2596,CVE-2017-2636,CVE-2017-2671,CVE-2017-5551,CVE-2017-5576,CVE-2017-5577,CVE-2017-5897,CVE-2017-5970,CVE-2017-5986,CVE-2017-6074,CVE-2017-6214,CVE-2017-6345,CVE-2017-6346,CVE-2017-6347,CVE-2017-6353,CVE-2017-7184,CVE-2017-7187,CVE-2017-7261,CVE-2017-7294,CVE-2017-7308,CVE-2017-7346,CVE-2017-7374,CVE-2017-7487,CVE-2017-7616,CVE-2017-7618,CVE-2017-8890,CVE-2017-9074,CVE-2017-9075,CVE-2017-9076,CVE-2017-9077,CVE-2017-9150,CVE-2017-9242 Sources used: SUSE Linux Enterprise Real Time Extension 12-SP2 (src): kernel-rt-4.4.74-7.10.1, kernel-rt_debug-4.4.74-7.10.1, kernel-source-rt-4.4.74-7.10.1, kernel-syms-rt-4.4.74-7.10.1
I am experiencing a similar problem on a freshly installed Leap 42.3.

The btrfs-transacti process makes the system completely unresponsive for about 10 to 15 min. It has happened 3 times since the install 3 days ago, that is, typically once a day. I am correlating this with the automatic software update, which apparently triggers snapper into action and then btrfs. I have changed the software check to happen only once a month to see if it eases the problem, but I would welcome any other workaround as this is being very disruptive.

The machine is a Dell Inspiron 5448 and has as disk a Samsung SSD 850 EVO 1TB.

I am happy to provide more system info or do some tests if it is of any help.
(In reply to Gerald Weber from comment #84)
> I am experiencing a similar problem on a freshly installed Leap 42.3.
>
> The btrfs-transacti process makes the system completely unresponsive for
> about 10 to 15 min. It happened already 3 times since the install 3 days
> ago, that is, typically once a day. I am correlating this with the automatic
> software update which apparently triggers snapper into action and then
> btrfs. I have changed the software check to happen only once a month to see
> if it eases the problem, but I would welcome any other workaround as this is
> being very disruptive.
>
> The machine is a Dell Inspiron 5448 and has as disk a Samsung SSD
> 850 EVO 1TB.
>
> I am happy to provide more system info or do some tests if it is of any help.

Hi Gerald,

The only workaround I know of so far is to disable quotas in btrfs. I don't know if that is acceptable to you, but on all my Tumbleweed machines the problem went away after this.
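For reference, the workaround boils down to the standard btrfs-progs quota commands. The sketch below is mine, not part of the original comments: it assumes / is a btrfs filesystem and that you run it as root, and it is guarded so it is a harmless no-op on other systems.

```shell
# Sketch of the "disable quotas" workaround (assumes / is btrfs; needs root
# to actually change anything; degrades to a no-op elsewhere).
result="skipped (btrfs-progs not installed)"
if command -v btrfs >/dev/null 2>&1; then
    # 'btrfs qgroup show' only succeeds while quotas/qgroups are enabled.
    if btrfs qgroup show / >/dev/null 2>&1; then
        if btrfs quota disable / 2>/dev/null; then
            result="quotas disabled"
        else
            result="quota disable failed (not root?)"
        fi
    else
        result="quotas not enabled (or / is not btrfs)"
    fi
fi
echo "workaround: $result"
```

Note that on openSUSE snapper's percent-of-capacity cleanup depends on qgroups, so after disabling them you rely on count/time-based snapshot cleanup instead.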
I also have this problem on 2 notebooks; it took a while to figure out that btrfs was the cause.

On my ThinkPad T460p the freezes are just embarrassing when I open the notebook for a presentation or to show something, not an advertisement for SUSE at all! I never realized what the problem was: system briefly unusable, WLAN unstable, ...

On my ThinkPad X121e the system becomes unusable for quite some time. I first thought that GNOME was crashing, but finally I was able to run top while the system was unusable; then I saw what was eating one CPU. That the whole system freezes... not good.

Having this problem, I think about reinstalling with ext4; I do not need the btrfs features on those notebooks anyway. But problems like this make me wonder whether I should use openSUSE Leap in general on devices where I want stability... btw, latest Leap 42.3.
Harald, thank you for sharing your observations. I changed back the version to Leap 42.2 though as IMHO the convention is to use the version field to mark the first version of a product in which the bug was seen. And it's not a multi-selection field.
https://openqa.suse.de/tests/1172771#step/force_cron_run/7 shows our attempt to reproduce the same issues within openQA tests on SLE15. It's important to keep in mind that this bug affects later versions of the distribution in a similar manner, as reported e.g. for openSUSE Leap 42.3, as well as SLE in the corresponding versions, e.g. SLE 12 SP3 and now SLE15.

enadolski@suse.com: Can you clarify what your plans are for this bug? It has been open for quite some time, and in the meantime there have been maintenance updates which relate to it (e.g. looking at comment 82).
A code change to address this issue is now in upstream Linux and has been ported to SLE15 and SLE12-SP2. Marking this as resolved, upstream.
correction: moving to 'fixed'
Since I have no permission at all to see what the solution is, I can only hope that it will work better than previous fixes. When will it be delivered to Leap 42.3?

I have meanwhile reinstalled my T460p with ext4, but my X121e is still on btrfs, with quota disabled, which seems to work better. If I get an update I can re-enable quota and see if something changed.
(In reply to Edmund Nadolski from comment #90)
> A code change to address this issue is now in upstream Linux and has been
> ported to SLE15 and SLE12-SP2. Marking this as resolved, upstream.

Which upstream commit fixes this issue? I'm facing the problem also on machines where I run the latest upstream kernel from git...
Upstream commits are:

01747e9 btrfs: clean up extraneous computations in add_delayed_refs
3ec4d32 btrfs: allow backref search checks for shared extents
9dd14fd btrfs: add cond_resched() calls when resolving backrefs
0014275 btrfs: backref, add tracepoints for prelim_ref insertion and merging
6c336b2 btrfs: add a node counter to each of the rbtrees
86d5f99 btrfs: convert prelimary reference tracking to use rbtrees
f695424 btrfs: remove ref_tree implementation from backref.c
bb739cf btrfs: btrfs_check_shared should manage its own transaction
e0c476b btrfs: backref, cleanup __ namespace abuse
4dae077 btrfs: backref, add unode_aux_to_inode_list helper
73980be btrfs: backref, constify some arguments
9a35b63 btrfs: constify tracepoint arguments
1cbb1f4 btrfs: struct-funcs, constify readers
(In reply to Edmund Nadolski from comment #94)
> Upstream commits are:
>
> 01747e9 btrfs: clean up extraneous computations in add_delayed_refs
> 3ec4d32 btrfs: allow backref search checks for shared extents
> 9dd14fd btrfs: add cond_resched() calls when resolving backrefs
> 0014275 btrfs: backref, add tracepoints for prelim_ref insertion and merging
> 6c336b2 btrfs: add a node counter to each of the rbtrees
> 86d5f99 btrfs: convert prelimary reference tracking to use rbtrees
> f695424 btrfs: remove ref_tree implementation from backref.c
> bb739cf btrfs: btrfs_check_shared should manage its own transaction
> e0c476b btrfs: backref, cleanup __ namespace abuse
> 4dae077 btrfs: backref, add unode_aux_to_inode_list helper
> 73980be btrfs: backref, constify some arguments
> 9a35b63 btrfs: constify tracepoint arguments
> 1cbb1f4 btrfs: struct-funcs, constify readers

Thanks for the list! The upstream kernel I used on the said machines didn't have these commits. Updating now...
I am running openSUSE Leap 42.3 with linux 4.13.1-1.gc0b7e1f-default and just noticed that my machine was (nearly) unresponsive for some minutes. iotop showed me:

Total DISK READ : 15.21 K/s | Total DISK WRITE : 115.88 M/s
Actual DISK READ: 64.64 K/s | Actual DISK WRITE: 9.48 M/s
  TID PRIO USER DISK READ DISK WRITE SWAPIN    IO> COMMAND
 7808 be/4 root  0.00 B/s 365.03 K/s 0.00 % 99.99 % [kworker/u8:2]
 2070 be/4 root  0.00 B/s   0.00 B/s 0.00 % 99.99 % [kworker/u8:16]
 2111 be/4 root  0.00 B/s   0.00 B/s 0.00 % 36.61 % [kworker/u8:57]
 7243 be/4 root  0.00 B/s  15.21 K/s 0.00 % 33.73 % [kworker/u8:0]
 2098 be/4 root 15.21 K/s  91.26 K/s 0.00 %  0.00 % [kworker/u8:44]
 2104 be/4 root  0.00 B/s 365.03 K/s 0.00 %  0.00 % [kworker/u8:50]
 2071 be/4 root  0.00 B/s 486.71 K/s 0.00 %  0.00 % [kworker/u8:17]
30994 be/4 root  0.00 B/s  15.21 K/s 0.00 %  0.00 % [kworker/u8:8]
 7244 be/4 root  0.00 B/s 365.03 K/s 0.00 %  0.00 % [kworker/u8:1]

So quite some kworker threads were putting a lot of I/O on my system. I assume at the same time one or more of the cron jobs "btrfs-scrub", "btrfs-balance", "btrfs-trim" were running. Was this expected to be fixed in linux 4.13.1-1.gc0b7e1f-default, do I need to update, or is this now a follow-up issue which needs to be solved in the cron job files?
I also use 42.3 (4.4.85-22-default kernel) and can confirm that the problem is there. Standard installation on SSD. Every Monday morning my system is (nearly) unresponsive for about half an hour with a btrfs process taking 100% CPU.

Since the bug status is "RESOLVED FIXED" but according to the last posts the problem persists even in the 4.13 kernel - what am I supposed to do to get rid of it? Is it safe to simply disable quotas? (I am not using these knowingly, but they are enabled by default and snapper uses them - so I am afraid to simply turn them off, and I would prefer to fix the system instead if there is a working solution coming.) So what is the state here?
The patches listed in comment #94 have been merged into upstream 4.14. Otherwise one of the kernels mentioned in the previous comment has them. It should be safe to disable quotas as far as btrfs itself is concerned.
(In reply to Edmund Nadolski from comment #98)
> The patches listed in comment #94 have been merged into upstream 4.14.
> Otherwise one of the kernels mentioned in the previous comment has them.
>
> It should be safe to disable quotas as far as btrfs itself is concerned.

I would not recommend disabling quotas (in case you mean btrfs qgroups) as IIUC they are implicitly used to prevent snapshots from filling up the hard disk, by cleaning them up if they reach *their* quota.

To me it seems the issue is not really resolved, even though I think the patches provided in the kernel by enadolski@suse.com should help. I guess one has to look at a more whole-system level. Would it make sense to lower the I/O priority of the background jobs?
I have disabled btrfs quotas on the X121e, which I have not reinstalled, and this system has worked without issues since then. So it seems that btrfs quotas are a serious problem for a system you boot only from time to time.

But now the question is: how do I clean the snapshots by hand? Or turn off the snapshots entirely; I do not need this on this machine. I mean, this is a notebook I mostly use to listen to music or to connect to an HDMI TV display to watch something; it has different requirements than some server or production workstation. btrfs with all these features is obviously not the most optimal default for such a system.
(In reply to Oliver Kurz from comment #99)
> To me it seems the issue is not really resolved even though I think the
> patches provided in the kernel by enadolski@suse.com should help. I guess
> one has to look at a more whole system level.

I am restoring the previous status as I am not clear on the justification to re-open -- considering that the indicated patches evidently were not even run, it is not shown that a problem still exists.

These patches have demonstrated a 50% improvement in btrfs backref performance, so if further symptoms are observed there may well be other causes (not necessarily even in the fs - as you mention, the whole system would need to be looked at). In that case the best way forward is to please open a new BZ including all relevant info so that it can be properly investigated (and without potential obfuscation from the previous issue).
A few things:

Qgroups can be safely disabled on openSUSE systems and snapshots will still be cleaned up. The functionality that handles cleanup based on percent of capacity occupied will not be available, but cleanup by time or count will work fine.

Balance and qgroups have some shortcomings. The biggest thing is that we shouldn't need to do qgroup accounting at all during balance, but the internals aren't set up to allow that. That's a project that needs work in the future. Ed's patches will have decreased the CPU overhead substantially, especially with lots of snapshots, but it's still not perfect.

Lastly, Ed, have you pushed these patches to the applicable branches? I don't see the patches there. Until they've landed, it's premature to call this issue resolved.
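To make "cleanup by time or count will work fine" concrete: these are the usual knobs in a snapper config such as /etc/snapper/configs/root. The key names come from snapper; the values below are purely illustrative, not recommendations from this thread.

```shell
# Illustrative snapper config fragment (sysconfig-style shell variables).
# Count-based cleanup: keep at most this many regular/important snapshots.
NUMBER_CLEANUP="yes"
NUMBER_LIMIT="10"
NUMBER_LIMIT_IMPORTANT="5"
# Time-based cleanup of timeline snapshots.
TIMELINE_CLEANUP="yes"
TIMELINE_LIMIT_HOURLY="10"
TIMELINE_LIMIT_DAILY="7"
TIMELINE_LIMIT_MONTHLY="0"
TIMELINE_LIMIT_YEARLY="0"
```

Only the percent-of-capacity cleanup logic depends on qgroup accounting; the settings above keep working with quotas disabled.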
(In reply to Edmund Nadolski from comment #101)
> […]
> I am restoring the previous status as I am not clear of the justification to
> re-open -- considering that the indicated patches evidently were not even
> run, it is not shown that a problem still exists.
>
> These patches have demonstrated a 50% improvement in btrfs backref
> performance, so if further symptoms are observed there may well be other
> causes (not necessarily even in the fs - as you mention the whole system
> would need to be looked at). In that case the best way forward is to please
> open a new BZ including all relevant info so that it can be properly
> investigated (and without potential obfuscation from the previous issue).

Errr, I am not sure what your intention is. I am pretty sure that I ran a kernel with the patches you mentioned, checking with `rpm -q --changelog kernel-default`. As I stated, I think your contributions improved the situation. Ok, I don't want to annoy you, so I created another bug for the "btrfs maintenance scripts review": https://bugzilla.opensuse.org/show_bug.cgi?id=1063638
(In reply to Harald Achitz from comment #100)
> I have disabled btrfs quotas on the x121e which I have not reinstalled and
> this system works since than without issues.
> So it seems that btrfs quotas, for system you boot only from time to time,
> is a serious problem.
> But now the question is, how do I clean the snapshots by hand? or turn of
> the snapshots, I do not need this on this machine, I mean, this is a
> notebook I mostly use to listen music from or connect to a hdmi tv display
> to watch something, it has different requirements than some server or
> production workstation. btrfs with all these features is obviously not the
> most optial default for such a system

Snapper is the snapshot management tool in openSUSE; it will help you remove the snapshots.
Recently, I encountered a btrfs snapshot remove/cleanup issue with a huge snapshot list: if you have over a hundred snapshots, removing a snapshot can freeze the system for a while with btrfs-transaction at 100% CPU, and it is quota-unrelated. According to the upstream explanation (http://www.spinics.net/lists/linux-btrfs/msg57956.html), this is expected: the work to create a snapshot only depends on the complexity of the directory structure within the subvolume, while the work to delete it depends on both that and how much the snapshot has changed from the parent subvolume.
I conducted the following steps to verify:

* On a low-performing older machine with rotating notebook disk
* Install a clean SLES 12 SP3 with default settings (btrfs, subvolumes, qgroups, etc.)
* Confirm the SUSE kernel has the mentioned patches included
* Wait for the btrfs cron jobs to kick in at the next */15 minute interval
* While observing the system processes with `top` and `ps` I could type in the gnome editor, move the mouse, etc., without problems
Oliver, is this a meaningful test? I mean, on a newly installed system there is nothing to do... this does not sound like something I would like to see as a test for my enterprise Linux.

What about waiting until the system is in some real-life notebook state: updates, sleep, updates (fat updates, like kernel, ...., so that huge snapshots exist), wait until there is something to do, then do some disk-intensive tasks, and then start the btrfs thing. Then see if everything is still smooth, as on your newly installed system.

Automate this, have it as a regression test, so that future btrfs patches will not re-trigger the problem. (Maybe I am naive, but this is what I would expect from something that calls itself enterprise Linux.)
(In reply to Harald Achitz from comment #109)
> Oliver, is this a meaningful test? I mean, on a new installed system where
> there is nothing do to... this sounds not like something I would like to see
> as a test for my enterprise Linux.

What I conducted was just a very simple bug verification test run, which does not mean there could not be more tests that would eventually lead to more information - and they already did; this is why we (still) have other bugs in the same domain, e.g. the three "see also" bugs. With my comment being the 110th in a row, I think we should give our great kernel developers and contributors the achievement of "VERIFIED FIXED" at least on this bug ;)

> automate this , have it as regression test, so that future patches btrfs
> fixes will not re trigger the test. (maybe I am naive, but this is what I
> would expect from something that calls itself enterprise linux)

No, you are not naive - this is what we do with automated tests on top of my very limited verification :) The original problem for exactly *this* bug was confirmed on a "freshly installed system", hence the verification in a comparable environment. But there are more and longer-running tests on openqa.opensuse.org as well.
SUSE-SU-2017:3267-1: An update that solves 5 vulnerabilities and has 56 fixes is now available. Category: security (important) Bug References: 1012382,1017461,1020645,1022595,1022600,1022914,1022967,1025461,1028971,1030061,1034048,1037890,1052593,1053919,1055493,1055567,1055755,1055896,1056427,1058135,1058410,1058624,1059051,1059465,1059863,1060197,1060985,1061017,1061046,1061064,1061067,1061172,1061451,1061831,1061872,1062520,1062962,1063460,1063475,1063501,1063509,1063520,1063667,1063695,1064206,1064388,1064701,964944,966170,966172,966186,966191,966316,966318,969474,969475,969476,969477,971975,974590,996376 CVE References: CVE-2017-12153,CVE-2017-13080,CVE-2017-14489,CVE-2017-15265,CVE-2017-15649 Sources used: SUSE Linux Enterprise Real Time Extension 12-SP2 (src): kernel-rt-4.4.95-21.1, kernel-rt_debug-4.4.95-21.1, kernel-source-rt-4.4.95-21.1, kernel-syms-rt-4.4.95-21.1
openSUSE-SU-2017:3358-1: An update that solves 16 vulnerabilities and has 67 fixes is now available. Category: security (important) Bug References: 1010201,1012382,1012829,1017461,1021424,1022595,1022914,1024412,1027301,1030061,1031717,1037890,1046107,1050060,1050231,1053919,1056003,1056365,1056427,1056979,1057199,1058135,1060333,1060682,1061756,1062941,1063026,1063516,1064701,1064926,1065180,1065600,1065639,1065692,1065717,1065866,1066045,1066192,1066213,1066223,1066285,1066382,1066470,1066471,1066472,1066573,1066606,1066629,1067105,1067132,1067494,1067888,1068671,1068978,1068980,1068982,1069270,1069496,1069702,1069793,1069942,1069996,1070006,1070145,1070535,1070767,1070771,1070805,1070825,1070964,1071231,1071693,1071694,1071695,1071833,963575,964944,966170,966172,974590,979928,989261,996376 CVE References: CVE-2017-1000405,CVE-2017-1000410,CVE-2017-11600,CVE-2017-12193,CVE-2017-15115,CVE-2017-16528,CVE-2017-16536,CVE-2017-16537,CVE-2017-16646,CVE-2017-16939,CVE-2017-16994,CVE-2017-17448,CVE-2017-17449,CVE-2017-17450,CVE-2017-7482,CVE-2017-8824 Sources used: openSUSE Leap 42.2 (src): kernel-debug-4.4.103-18.41.1, kernel-default-4.4.103-18.41.1, kernel-docs-4.4.103-18.41.1, kernel-obs-build-4.4.103-18.41.1, kernel-obs-qa-4.4.103-18.41.1, kernel-source-4.4.103-18.41.1, kernel-syms-4.4.103-18.41.1, kernel-vanilla-4.4.103-18.41.1
SUSE-SU-2017:3410-1: An update that solves 16 vulnerabilities and has 92 fixes is now available. Category: security (important) Bug References: 1010201,1012382,1012829,1017461,1020645,1021424,1022595,1022600,1022914,1024412,1025461,1027301,1028971,1030061,1031717,1034048,1037890,1046107,1050060,1050231,1053919,1055567,1056003,1056365,1056427,1056979,1057199,1058135,1059863,1060333,1060682,1060985,1061451,1061756,1062520,1062941,1062962,1063026,1063460,1063475,1063501,1063509,1063516,1063520,1063695,1064206,1064701,1064926,1065180,1065600,1065639,1065692,1065717,1065866,1066045,1066192,1066213,1066223,1066285,1066382,1066470,1066471,1066472,1066573,1066606,1066629,1067105,1067132,1067494,1067888,1068671,1068978,1068980,1068982,1069270,1069793,1069942,1069996,1070006,1070145,1070535,1070767,1070771,1070805,1070825,1070964,1071231,1071693,1071694,1071695,1071833,963575,964944,966170,966172,966186,966191,966316,966318,969474,969475,969476,969477,971975,974590,979928,989261,996376 CVE References: CVE-2017-1000410,CVE-2017-11600,CVE-2017-12193,CVE-2017-15115,CVE-2017-15265,CVE-2017-16528,CVE-2017-16536,CVE-2017-16537,CVE-2017-16645,CVE-2017-16646,CVE-2017-16994,CVE-2017-17448,CVE-2017-17449,CVE-2017-17450,CVE-2017-7482,CVE-2017-8824 Sources used: SUSE Linux Enterprise Workstation Extension 12-SP2 (src): kernel-default-4.4.103-92.53.1 SUSE Linux Enterprise Software Development Kit 12-SP2 (src): kernel-docs-4.4.103-92.53.1, kernel-obs-build-4.4.103-92.53.1 SUSE Linux Enterprise Server for Raspberry Pi 12-SP2 (src): kernel-default-4.4.103-92.53.1, kernel-source-4.4.103-92.53.1, kernel-syms-4.4.103-92.53.1 SUSE Linux Enterprise Server 12-SP2 (src): kernel-default-4.4.103-92.53.1, kernel-source-4.4.103-92.53.1, kernel-syms-4.4.103-92.53.1 SUSE Linux Enterprise Live Patching 12 (src): kgraft-patch-SLE12-SP2_Update_16-1-3.3.1 SUSE Linux Enterprise High Availability 12-SP2 (src): kernel-default-4.4.103-92.53.1 SUSE Linux Enterprise Desktop 12-SP2 (src): 
kernel-default-4.4.103-92.53.1, kernel-source-4.4.103-92.53.1, kernel-syms-4.4.103-92.53.1 OpenStack Cloud Magnum Orchestration 7 (src): kernel-default-4.4.103-92.53.1
I still see this issue every Monday when the btrfs balance timer kicks in on my Tumbleweed installation: first the btrfs balance (both -musage and -dusage) needs 100% CPU on one core for several minutes, each directly followed by a btrfs transaction also taking 100% CPU. But what's really locking up the system is the heavy I/O apparently. As it's a 2-weeks old default installation I have snapper and quotas enabled and haven't yet tried to disable the quotas. Anything I shall try or logs I shall provide? I also still have plenty of space left, so it shouldn't be related to the usual "disk full" problems. REOPEN this bug or create a separate one?
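As a stopgap while the weekly balance keeps hurting, the periodic jobs themselves can be tuned. The variable names below are from the btrfsmaintenance package's /etc/sysconfig/btrfsmaintenance; the values are illustrative choices of mine, not official recommendations from this thread.

```shell
# Illustrative /etc/sysconfig/btrfsmaintenance fragment (shell-variable
# syntax). Periods accept e.g. daily/weekly/monthly/none; "none" disables
# the job entirely.
BTRFS_BALANCE_MOUNTPOINTS="/"
BTRFS_BALANCE_PERIOD="monthly"   # default is weekly; "none" turns it off
BTRFS_BALANCE_DUSAGE="1 5 10"    # data-chunk usage thresholds to rebalance
BTRFS_BALANCE_MUSAGE="1 5 10"    # metadata-chunk usage thresholds
BTRFS_SCRUB_PERIOD="monthly"
BTRFS_TRIM_PERIOD="none"
```

Reducing or disabling the periodic balance only hides the symptom; the underlying qgroup accounting cost during balance is what the kernel patches address.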
Hm, I think we will get further by having a new bug where you can describe how the problem can be reproduced on your machine, and also clearly state which versions of things - especially the kernel - you run, so that people can see that the fixes for *this* bug are in.