Bug 1021307

Summary: journald frequently crashes with coredump of the journal daemon itself
Product: [openSUSE] openSUSE Distribution Reporter: Siegbert Sadowski <ssadowski>
Component: Basesystem Assignee: David Sterba <dsterba>
Status: RESOLVED WONTFIX QA Contact: E-mail List <qa-bugs>
Severity: Critical    
Priority: P5 - None CC: dsterba, fbui, forgotten_0Iyw_XgfeX, hare, jack, jeffm, mgorman, ssadowski, tiwai
Version: Leap 42.2   
Target Milestone: ---   
Hardware: x86-64   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: Journal logfile
journald log 2017-01-28 till the crash
Journal log after the crash reboot
'perf' output before the crash occurred
Perf output as of 11-02-2017
Datasheet of Samsung SSD
Output of iostat -x 5
Terminal output copies of 'top' and 'journalctl -ef' around crash time
Corrected terminal outputs 'top' and 'journalctl -ef'
/etc/fstab
Another system lockup, journal excerpt
Output of lsattr for journal files

Description Siegbert Sadowski 2017-01-22 11:12:57 UTC
User-Agent:       Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0
Build Identifier: 

Hello community,

I installed Leap 42.2 fresh on a new SSD. After configuration (with YaST) the system hangs 1-2 times per day. The journal log file always shows the same pattern:

Jan 19 22:02:35 serv2 systemd-journald[15033]: System journal (/var/log/journal/) is currently using 784.0M.
                                               Maximum allowed usage is set to 4.0G.
                                               Leaving at least 4.0G free (of currently available 44.7G of space).
                                               Enforced usage limit is thus 4.0G, of which 3.2G are still available.
Jan 19 22:02:35 serv2 systemd-journald[15033]: Journal started
Jan 19 22:00:01 serv2 cron[15019]: pam_unix(crond:session): session opened for user root by (uid=0)
Jan 19 22:00:01 serv2 systemd[1]: Started Session 17 of user root.
Jan 19 22:02:35 serv2 CRON[15019]: pam_unix(crond:session): session closed for user root
Jan 19 22:02:17 serv2 systemd[1]: systemd-journald.service: Watchdog timeout (limit 3min)!
Jan 19 22:02:35 serv2 systemd[1]: Starting Flush Journal to Persistent Storage...
Jan 19 22:02:35 serv2 systemd[1]: Started Flush Journal to Persistent Storage.
Jan 19 22:02:35 serv2 systemd-coredump[15025]: Detected coredump of the journal daemon itself, diverted to /var/lib/systemd/coredump/core.systemd-journal.0.0c14f

Then when I try to reboot, it takes 30-45 minutes(!) for the system to shut down.
There are no other error messages in the log. 
Can you help me?

Many Thanks in advance.

Reproducible: Couldn't Reproduce

Steps to Reproduce:
1. Monitor the system with 'journalctl -ef'
2. After a couple of hours, the log entry described above occurs
Comment 1 Siegbert Sadowski 2017-01-22 12:19:28 UTC
Created attachment 711149 [details]
Journal logfile
Comment 2 Takashi Iwai 2017-01-23 09:33:15 UTC
Sounds more like an issue in journald itself.  Reassigned to the systemd guys.
Feel free to reassign back once it turns out to really be a kernel issue.
Comment 3 Franck Bui 2017-01-27 14:48:50 UTC
Hi,

It seems that journald is so busy writing logs that it fails to send the keep-alive ping back to PID 1. As a consequence, systemd thought that journald was hung and killed it with SIGABRT.

Could you attach the whole content of 'journalctl -b', preferably after the first crash of journald?

Could you also send the same output but with debug logs enabled? To do so, please boot with the kernel options "debug printk.devkmsg=on".

BTW do you use BTRFS ?

Thanks.
Comment 4 Siegbert Sadowski 2017-01-28 22:23:41 UTC
Created attachment 711994 [details]
journald log 2017-01-28 till the crash
Comment 5 Siegbert Sadowski 2017-01-28 22:25:40 UTC
Created attachment 711995 [details]
Journal log after the crash reboot
Comment 6 Siegbert Sadowski 2017-01-28 22:40:51 UTC
I tried to upload the compressed journal log file covering the period up to the last crash. However, the file is 12 MB, which exceeds the upload limit of 10 MB.
Comment 7 Siegbert Sadowski 2017-01-28 22:48:00 UTC
Since the 12 MB compressed journal log file exceeds the 10 MB upload limit, the upload failed again. Therefore please delete Comment 4.

Answer to your question w.r.t. the filesystem: yes, the system partition is BTRFS
Comment 8 Siegbert Sadowski 2017-01-28 23:51:35 UTC
Hi,

many thanks for your quick response. As recommended, I restarted the system with the boot options "debug printk.devkmsg=on". The journal output became very verbose and generated a huge log file (88 MB) up to the last crash. Unfortunately, uploading the compressed file failed due to the 10 MB size limit, as the compressed file is 12 MB. Please delete my duplicate Comment 4.

However I noticed two things with the debug option turned on.

1. Although the log output is much larger, the system stayed stable for at least twice the usual time between crashes.

2. One commonality: the crashes often occur around 21:30, +/- 15 minutes. So far I haven't found a cron job in that time frame.

My thoughts: Franck Bui's analysis sounds very plausible to me. The short response time of my SSD system disk keeps journald busy while writing the log to 'persistent storage'.

BTW: I will be out of the house for the next 7 days and will be back on 06.02.2017.

Best regards,
Siegbert Sadowski
Comment 9 Franck Bui 2017-01-30 16:37:16 UTC
@Siegbert: I probably should have started with this, but is there any process/service doing heavy logging or heavy disk IO while this is happening?

Currently my suspicion is that syncing the journal to your SSD (formatted with BTRFS) takes a while (more than 3 min) for some reason.

@Mel, do you have an idea how this could be traced without interfering with the timings of journald ?

It might be possible by using perf or do you think strace should be enough for this case ?

Maybe there are some other tools.

Basically, journald mmaps the journal file to append new records and calls fsync() to write the data out to disk every 5 minutes or so.

Thanks.
Comment 10 Mel Gorman 2017-01-30 19:41:35 UTC
(In reply to Franck Bui from comment #9)
> @Mel, do you have an idea how this could be traced without interfering with
> the timings of journald ?
> 

sudo perf trace -a -e fdatasync,fsync,msync,sync,sync_file_range,syncfs

While that's running, it will track all sync-related syscalls globally. If you want to target journald, you'll need to specify the PID with -p. Hit Ctrl-C when you're done and there should be a breakdown of the sync calls.

A crude example of tracing like this:

# dd if=/dev/zero of=zerofile ibs=1048576 count=500;
  sudo perf trace -e fdatasync,fsync,msync,sync,sync_file_range,syncfs sync;
  rm zerofile
500+0 records in
1024000+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 0.772471 s, 679 MB/s
  1023.708 (1023.708 ms): sync/24234 sync(   ) = 0

That shows that the sync command after an async dd took 1023.708 ms to complete. You can monitor indefinitely by specifying no command, or record for a duration by running "sleep X", where X is the number of seconds you want to record data for.
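
As a quick sanity check on dd's reported throughput above (a trivial sketch; note that dd reports decimal megabytes, i.e. bytes / 10^6):

```shell
# Recompute dd's "679 MB/s" from the bytes and seconds it printed
awk 'BEGIN { printf "%.0f MB/s\n", 524288000 / 0.772471 / 1000000 }'
```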

> It might be possible by using perf or do you think strace should be enough
> for this case ?
> 

perf can do it with lower interference. If there is a high number of sync calls, you may need to increase the mmap space with -m to avoid lost events. The same can be achieved with strace, but you specified that the interference with journald be minimised, and strace is very heavy.
Comment 11 Franck Bui 2017-01-31 18:21:31 UTC
Thanks Mel for the hint.

@Siegbert, once you're back and ready to debug, could you install perf and then try:

> $ perf trace --duration=1000 -e fdatasync,fsync,msync,sync,sync_file_range,syncfs --pid=$(pgrep systemd-journal)

and then reproduce the issue.

Once reproduced, please post the output of perf.

Thanks.
Comment 12 Siegbert Sadowski 2017-02-10 21:23:15 UTC
Created attachment 713752 [details]
'perf' output before the crash occurred
Comment 13 Siegbert Sadowski 2017-02-10 21:30:05 UTC
Hello,

I'm back again since 06.02.2017. Since then I have tried to capture the output of the recommended perf tool. In the last days the systemd crash occurred very soon after system start, just before I was able to start the perf tool; after that no crash occurred for the whole evening. This evening I caught the crash event.

Unfortunately the perf tool does not seem to allow redirecting its output to a file. Therefore I copied the screen output to the attached file. Unexpectedly, the output of perf terminated with '-----', although at that point in time nobody was at the computer / keyboard.

In addition I append the output of a parallel running 'journalctl -ef'.

Thanks for your help.

Regards 
Siegbert.
Comment 14 Siegbert Sadowski 2017-02-11 20:13:22 UTC
Created attachment 713777 [details]
Perf output as of 11-02-2017
Comment 15 Franck Bui 2017-02-21 15:32:40 UTC
(In reply to Siegbert Sadowski from comment #14)
> Created attachment 713777 [details]
> Perf output as of 11-02-2017

From your attachment:

>  21978396.110 (303864.579 ms): fsync(fd: 20</var/log/journal/565b8fb65ab16a395a2a210a586a86d8/system.journal>) = 0

This means journald was blocked for more than 5 minutes in a single fsync() call.
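
For reference, the millisecond duration that perf trace prints in parentheses can be pulled out of such a line with a small awk sketch (field positions assume the exact output format shown above):

```shell
# Extract the duration from the perf trace line quoted above;
# the parenthesised duration is the second whitespace-separated field.
line=' 21978396.110 (303864.579 ms): fsync(fd: 20</var/log/journal/565b8fb65ab16a395a2a210a586a86d8/system.journal>) = 0'
echo "$line" | awk '{ gsub(/[()]/, "", $2); print $2 }'
```

303864.579 ms is just over 5 minutes, well past the 3-minute watchdog limit.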

@Mel, could you (or someone from the kernel team) explain why Siegbert is facing such latencies ?
Comment 16 Siegbert Sadowski 2017-02-21 20:17:47 UTC
Hello,

my two cents to add, maybe the most recent observations help. 

Since I always have 'journalctl -ef' running in a terminal window to see what's going on in the system, it became evident that the journald core dump was almost always preceded by logrotate. Elsewhere on the internet I found hints that, e.g. on Ubuntu, the compression of the log files caused similar trouble.

Many thanks for your help,

Siegbert.
Comment 17 Mel Gorman 2017-02-23 11:21:26 UTC
(In reply to Franck Bui from comment #15)
> (In reply to Siegbert Sadowski from comment #14)
> > Created attachment 713777 [details]
> > Perf output as of 11-02-2017
> 
> From your attachment:
> 
> >  21978396.110 (303864.579 ms): fsync(fd: 20</var/log/journal/565b8fb65ab16a395a2a210a586a86d8/system.journal>) = 0
> 
> This means journald was blocked more than 5 min when calling fsync().
> 
> @Mel, could you (or someone from the kernel team) explain why Siegbert is
> facing such latencies ?

I'm adding Jan Kara to the CC in case there are follow-up questions, as I'm travelling a lot over the next three weeks and will be completely unavailable for one of those weeks.

The specs of the SSD are not included, but it's reasonable to assume that it can write roughly 300MB/sec under ideal conditions. 5 minutes to sync data would imply that it's syncing approximately 90GB of data. Is it possible that it's syncing that much data? Check the size of the system.journal file being synced.

However, there are other factors that could delay it. If CFQ is used and there are a large number of parallel reads, then the writes and sync can be starved for quite some time. Are there heavy reads going on in parallel? While it's possible to get an in-depth view of what is going on, it may be unnecessary. A first hint can be gained by running 'iostat -x 5' in the background and looking at the stats around the time of the stall. Check whether there are a large number of reads/writes happening at the same time, and look at the r_await and w_await stats, which will indicate whether large amounts of IO are going on concurrently.
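
A sketch of the suggested background capture; adding -t (standard in sysstat's iostat) timestamps each sample so it can later be matched against the crash time. The log path here is arbitrary:

```shell
# Sample extended device statistics every 5 seconds, timestamped,
# into a log file that survives the terminal session:
nohup iostat -x -t 5 > /root/iostat-with-timestamps.log 2>&1 &
```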

While I mention CFQ, using the noop scheduler could also be a problem as all other pending reads and writes must complete before the sync.

If starvation is an issue and CFQ or noop is used then the obvious recommendation would be to try deadline. This may delay some reads when there is intensive writing but fsync times are likely to be reduced.
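
A hedged sketch of checking and switching the scheduler at runtime (assumes the SSD is sda; the echo needs root and does not persist across reboots):

```shell
# The active scheduler is the bracketed entry, e.g. "noop [cfq] deadline"
cat /sys/block/sda/queue/scheduler
# Switch to deadline for this boot only
echo deadline > /sys/block/sda/queue/scheduler
```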

It is possible to identify exactly what is doing the writing, but I suspect it'll be a distraction. Regardless of how long it takes journald to sync a file when the system is busy, it should not be segfaulting.
Comment 18 Jan Kara 2017-02-23 16:33:27 UTC
As Mel says gathering 'iostat -x 5' and seeing the numbers around stall time is a good idea. We'll get idea how much the disk is busy around that time. You can check which IO scheduler you are using in /sys/block/<device>/queue/scheduler however on 42.2 for non-rotating disk it should already be set to 'deadline'. But it doesn't hurt to verify it on your machine.

Finally, I'm also adding David to CC as a btrfs guy. Maybe he has an idea why fsync(2) can take 5 minutes for the root btrfs filesystem.
Comment 19 Franck Bui 2017-02-24 15:11:05 UTC
Thanks Mel for answering !

(In reply to Mel Gorman from comment #17)
> 
> The specs of the SSD are not included but it's reasonable to assume that it
> can write roughly 300MB/sec under ideal conditions. 5 minutes to sync data
> would imply that it's syncing approximately 90GB of data. Is it possible
> that it's syncing that much data? Check the size of the system.journal file
> being synced.

I can't see how this could realistically happen: since the journal file is mmap'ed, that would mean that the huge amount of data that needs to be synced lives in RAM, no?

Also normally the journal files are limited in size and the default values should limit the size of one journal file to less than 1G.
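
To help rule this out, the actual usage and the size of the active journal file can be checked directly (a sketch; the hashed machine-id directory name varies per system):

```shell
# What journald itself reports as total on-disk journal usage
journalctl --disk-usage
# Size of the currently active system journal file
ls -lh /var/log/journal/*/system.journal
```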

> IIt is possible to identify exactly what is doing the writing but I suspect
> it'll be a distraction. Regardless of how long it takes journald to sync a
> file when the system is busy, it should not be segfaulting.

Actually journald is not segfaulting: it is rather being terminated with SIGABRT by systemd, because the latter noticed that journald was stuck for too long (see the WatchdogSec= option in the systemd.service(5) man page for details).
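
As a hedged temporary mitigation only (it papers over the fsync stall rather than fixing it), the watchdog interval can be raised with a drop-in unit; the drop-in file name below is arbitrary, and this assumes the watchdog behaviour described above:

```shell
# Create a drop-in with a longer watchdog interval, then reload and restart
mkdir -p /etc/systemd/system/systemd-journald.service.d
cat > /etc/systemd/system/systemd-journald.service.d/watchdog.conf <<'EOF'
[Service]
WatchdogSec=10min
EOF
systemctl daemon-reload
systemctl restart systemd-journald
```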
Comment 20 Franck Bui 2017-02-24 15:13:24 UTC
@Siegbert, could you show the information asked by Mel and Jara (see comment #18)?
Comment 21 Franck Bui 2017-02-24 15:14:11 UTC
(In reply to Franck Bui from comment #20)
> @Siegbert, could you show the information asked by Mel and Jara (see comment
> #18)?

Sorry I meant "Mel and Jan".
Comment 22 Mel Gorman 2017-02-24 15:49:32 UTC
(In reply to Franck Bui from comment #19)
> Thanks Mel for answering !
> 
> (In reply to Mel Gorman from comment #17)
> > 
> > The specs of the SSD are not included but it's reasonable to assume that it
> > can write roughly 300MB/sec under ideal conditions. 5 minutes to sync data
> > would imply that it's syncing approximately 90GB of data. Is it possible
> > that it's syncing that much data? Check the size of the system.journal file
> > being synced.
> 
> I can't see how this could realistically happen: since the journal file is
> mmap'ed that would mean that the huge amount of data that need to be synced
> lives in RAM, no ?
> 

The dirty data is in RAM. To fsync it, the pages have to be cleaned by writing all the data to disk.

> Also normally the journal files are limited in size and the default values
> should limit the size of one journal file to less than 1G.
> 

Ok

> > It is possible to identify exactly what is doing the writing but I suspect
> > it'll be a distraction. Regardless of how long it takes journald to sync a
> > file when the system is busy, it should not be segfaulting.
> 
> Actually journald is not segfaulting: it's rather being terminated with
> SIGABRT by systemd because the later noticed that journald was stuck for too
> long (see WatchdogSec= option in systemd.service(5) man page for details).

Then that may need to be re-examined. There are a number of reasons why an fsync can be delayed, such as read or write starvation. It is somewhat unexpected that journald would get terminated because a large number of random readers/writers starved IO in general.
Comment 23 Siegbert Sadowski 2017-02-24 21:04:15 UTC
Created attachment 715475 [details]
Datasheet of Samsung SSD

The sda device is a Samsung SSD 850 PRO with 126GB capacity
Comment 24 Siegbert Sadowski 2017-02-24 21:42:55 UTC
Hello,

as described in the beginning, the coredump occurs 1-2 times per day. Yesterday it came before I was able to read my email. As recommended, I started 'iostat -x 5 > /root/2017-02-24_iostat-log.txt &'. Please find the compressed log file attached. For better understanding: sda is the Samsung SSD system disk, sdb+sdc form a RAID1 md127. The CPU is a Celeron G1820 @2.7GHz with 2 cores and 8GB main memory.

In addition I opened several terminal windows on a remote machine for monitoring. One of them was 'top', in order to see the processes at coredump time. To my surprise, disk I/O was very low, but one of the cores was completely busy with the process 'btrfs-transacti'. In file '2017-02-24_log.txt' you will find terminal window copies of this behaviour. The first 'top' was taken just before the coredump occurred in 'journalctl -ef'. The 2nd 'top' shows that 'btrfs-transacti' was still very busy about 15 minutes after the journald core dump.

In file '2017-02-24_directories.txt' you will find the 'ls -alt' of /var/log/journal. This looks reasonable to me, no excessive file sizes.

One other thing I noticed: yesterday and today the crash occurred at 19:45 when cron started. The crontab contains only the standard jobs, no extensions of my own.

I hope this helps.

Best regards,
Siegbert.
Comment 25 Siegbert Sadowski 2017-02-24 21:44:51 UTC
Created attachment 715481 [details]
Output of iostat -x 5
Comment 26 Siegbert Sadowski 2017-02-24 21:53:46 UTC
Created attachment 715482 [details]
Terminal output copies of 'top' and 'journalctl -ef' around crash time

The first 'top' shows the situation before the core dump, the 2nd about 15 minutes after. In between is the screenshot of the 'journalctl -ef' output at/around the crash time.
Comment 27 Siegbert Sadowski 2017-02-24 22:00:30 UTC
Created attachment 715483 [details]
Corrected terminal outputs 'top' and 'journalctl -ef'

I'm sorry,

but the title of the previous attachment 715482 should be 2017-02-24_directories.txt.
Comment 28 Mel Gorman 2017-02-25 15:29:26 UTC
(In reply to Siegbert Sadowski from comment #24)
> Hello,
> 
> as described in the beginning, the crashdump occurs 1-2 times / day.
> Yesterday it came before I was able to read my email. As recommended I
> started 'iostat -x 5 > /root/2017-02-24_iostat-log.txt &'. Please find
> attached the compressed log file. For better understanding: sda is the
> Samsung SSD systemdisk, sdb+sdc form a raid1 md127. The CPU is a Celeron
> G1820 @2.7GHz with 2 cores and 8GB main memory.
> 

Ok, thanks for that. Looking at it, there was some moderate activity for most of the log. After a bit over an hour, there is a major spike in read activity followed by a large spike in write activity, with short bursts of reads/writes after that. Nothing unusual about that, but crucially the average wait times for IO completion are very low. There is no indication from this data that the disk is saturated in any way.

The disk in question is also a disk with decent performance characteristics. It should be able to write 1G of data in under 2 seconds.

> In addition I opened on a remote machine several terminal windows for
> monitoring. One of the monitoring windows was 'top' in order see the
> processes at coredump time. To my surprise disk I/O was very low but one of
> the cores was completly busy with process 'btrfs-transacti'. In file
> '2017-02-24_log.txt' you will find terminal window copies of this behaviour.
> The first 'top' was just before the coredump occured in 'journalctl -ef'.
> The 2nd 'top' shows that 'btrfs-transacti' is still very busy after about 15
> minutes after the event of journald core dump.
> 

I think the issue here is that btrfs-transaction is completely locked up and stalls all IO while it figures out what to do. IIRC, that process going to 100% implies a fragmentation issue, but btrfs is far from my strong suit.

David, I'm afraid you'll need to take a look. The history is in the comments, but the short summary is that an fsync on a btrfs filesystem takes minutes to complete and systemd responds by killing journald as "not responding". During the stall there is no indication from iostat that the SSD is saturated, but btrfs-transaction is pegged at 100% CPU usage. It's Leap 42.2 with kernel version 4.4.49-16-default (according to iostat). Is this something you're already familiar with? If not, any suggestions on how to debug and fix this?
Comment 29 Mel Gorman 2017-02-25 15:31:20 UTC
Siegbert, I should have added that David will probably want to see the output of the mount command to see what mount options btrfs is using.
Comment 30 Siegbert Sadowski 2017-02-25 18:23:19 UTC
Created attachment 715500 [details]
/etc/fstab

Hello,

Mel, did you mean the fstab?
Comment 31 Mel Gorman 2017-02-25 21:17:23 UTC
(In reply to Siegbert Sadowski from comment #30)
> Created attachment 715500 [details]
> /etc/fstab
> 
> Hello,
> 
> Mel, did you mean the fstab?

It's helpful but I really did mean the mount command to see what default options are set for btrfs. I expect they are standard but I've learned not to make assumptions.
Comment 32 Siegbert Sadowski 2017-02-25 21:44:48 UTC
Hello Mel,

sorry for my misunderstanding. In the past I've used mount only for actively mounting a device. Here is the output of mount:
----
serv2:/home/siegbert # mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,size=4018380k,nr_inodes=1004595,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
/dev/sda3 on / type btrfs (rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/.snapshots/1/snapshot)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=28,pgrp=1,timeout=0,minproto=5,maxproto=5,direct)
mqueue on /dev/mqueue type mqueue (rw,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
/dev/sda3 on /var/log type btrfs (rw,relatime,ssd,space_cache,subvolid=275,subvol=/@/var/log)
/dev/sda3 on /var/cache type btrfs (rw,relatime,ssd,space_cache,subvolid=266,subvol=/@/var/cache)
/dev/sda3 on /var/lib/mysql type btrfs (rw,relatime,ssd,space_cache,subvolid=272,subvol=/@/var/lib/mysql)
/dev/sda2 on /home type xfs (rw,relatime,attr2,inode64,noquota)
/dev/sda3 on /usr/local type btrfs (rw,relatime,ssd,space_cache,subvolid=265,subvol=/@/usr/local)
/dev/sda3 on /srv type btrfs (rw,relatime,ssd,space_cache,subvolid=263,subvol=/@/srv)
/dev/sda3 on /var/crash type btrfs (rw,relatime,ssd,space_cache,subvolid=267,subvol=/@/var/crash)
/dev/sda3 on /var/lib/libvirt/images type btrfs (rw,relatime,ssd,space_cache,subvolid=268,subvol=/@/var/lib/libvirt/images)
/dev/sda3 on /.snapshots type btrfs (rw,relatime,ssd,space_cache,subvolid=258,subvol=/@/.snapshots)
/dev/sda3 on /var/lib/machines type btrfs (rw,relatime,ssd,space_cache,subvolid=269,subvol=/@/var/lib/machines)
/dev/sda3 on /tmp type btrfs (rw,relatime,ssd,space_cache,subvolid=264,subvol=/@/tmp)
/dev/sda3 on /boot/grub2/x86_64-efi type btrfs (rw,relatime,ssd,space_cache,subvolid=261,subvol=/@/boot/grub2/x86_64-efi)
/dev/sda3 on /var/spool type btrfs (rw,relatime,ssd,space_cache,subvolid=277,subvol=/@/var/spool)
/dev/sda3 on /var/lib/mariadb type btrfs (rw,relatime,ssd,space_cache,subvolid=271,subvol=/@/var/lib/mariadb)
/dev/sda3 on /var/tmp type btrfs (rw,relatime,ssd,space_cache,subvolid=278,subvol=/@/var/tmp)
/dev/sda3 on /boot/grub2/i386-pc type btrfs (rw,relatime,ssd,space_cache,subvolid=260,subvol=/@/boot/grub2/i386-pc)
/dev/sda3 on /opt type btrfs (rw,relatime,ssd,space_cache,subvolid=262,subvol=/@/opt)
/dev/sda3 on /var/lib/pgsql type btrfs (rw,relatime,ssd,space_cache,subvolid=274,subvol=/@/var/lib/pgsql)
/dev/sda3 on /var/opt type btrfs (rw,relatime,ssd,space_cache,subvolid=276,subvol=/@/var/opt)
/dev/sda3 on /var/lib/named type btrfs (rw,relatime,ssd,space_cache,subvolid=273,subvol=/@/var/lib/named)
/dev/sda3 on /var/lib/mailman type btrfs (rw,relatime,ssd,space_cache,subvolid=270,subvol=/@/var/lib/mailman)
/dev/md127 on /local type ext3 (rw,relatime,data=ordered)
proc on /var/lib/named/proc type proc (ro,nosuid,nodev,noexec,relatime)
proc on /var/lib/dhcp/proc type proc (ro,relatime)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=805184k,mode=700)
fusectl on /sys/fs/fuse/connections type fusectl (rw,relatime)
tracefs on /sys/kernel/debug/tracing type tracefs (rw,relatime)
tmpfs on /run/user/1000 type tmpfs (rw,nosuid,nodev,relatime,size=805184k,mode=700,uid=1000,gid=100)
serv2:/home/siegbert #
-----

BTW: as in the last days, the core dump occurred today shortly after the 19:45 cron started. I didn't make any changes to the standard Leap 42.2 cron.

Siegbert.
Comment 33 Franck Bui 2017-03-09 14:35:30 UTC
@David, I'm reassigning this bug to you as some clues suggest an issue in btrfs.
Comment 35 Siegbert Sadowski 2017-03-26 22:16:13 UTC
Hello,

since my BTRFS problem has now been open for more than 2 months, it looks like this is a more serious problem with no quick solution. Therefore I decided to change my system from BTRFS to the stable EXT4 next weekend (01.04.2017). Until then, if you need more logs for debugging, please let me know.

The journald coredump crashes still occur, reproducibly with the daily cron.

Maybe BTRFS needs some more development time to become mature enough for real usage.

Siegbert Sadowski
Comment 36 Jeff Mahoney 2017-03-26 22:48:24 UTC
The "issue" is that journald has a pathological workload for btrfs. Tiny writes plus fsync plus a relatively constant stream mean a few things happen: we're constantly CoWing for each tiny write due to the fsync, we're getting very fragmented, and we're getting slowed down in the transaction path. The fix we've deployed elsewhere for this exact problem is to turn off data CoW on the journal files.

This is a well-known interaction; the only issue is that it wasn't identified as such sooner.
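
The NOCOW fix Jeff describes can be sketched by hand as follows. Note that on btrfs the 'C' attribute only takes effect for files created after it is set on the directory, so existing journal files keep CoW until they are rotated away; treat the exact steps as an illustration rather than the deployed fix:

```shell
# Mark the journal directories NOCOW so newly created journal files inherit it
chattr +C /var/log/journal/
chattr +C /var/log/journal/*/
# Restart journald; files it creates from now on inherit the attribute
# (existing journal files keep CoW until rotated out)
systemctl restart systemd-journald
# Verify: a 'C' should show up in the attribute flags of new files
lsattr -R /var/log/journal
```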
Comment 37 Franck Bui 2017-03-27 10:42:09 UTC
systemd has been setting the FS_NOCOW file flag on its journal files for some time now. At least that's true for v228, the version this bug was reported against (Leap 42.2).

Siegbert, did you disable that somehow? Could you check that the flag is set on the journal files?
Comment 38 Siegbert Sadowski 2017-03-28 19:30:51 UTC
I didn't make any changes to system configuration files. I'm sorry, but I'm not that deep into the internals of systemd or journald. Can you help me find this FS_NOCOW flag? At least in the /etc/systemd configuration folder I couldn't find any hint.

Thanks,
Siegbert
Comment 39 Forgotten User 0Iyw_XgfeX 2017-03-30 08:12:11 UTC
I had a similar issue shortly after the kernel update to 4.4.49-16.1.
In the end my FS (btrfs) became unmountable. I failed to fix it with the btrfs utils and decided to reinstall the system on ext4. Surprisingly, the random crashes and freezes continued, this time without FS corruption though. Then I rolled back the kernel (to 4.4.27-2.1 from the installer) and since then there has not been a single glitch.

PS. Before the kernel rollback I checked RAM and HD; everything was OK.
Comment 40 Franck Bui 2017-03-30 09:34:44 UTC
(In reply to Siegbert Sadowski from comment #38)
> Can you help me to find this FS_NOCOW flag? 

 $ lsattr -R /var/log/journal

see the lsattr(1) man page
Comment 41 Forgotten User 0Iyw_XgfeX 2017-03-30 11:00:53 UTC
Created attachment 719297 [details]
Another system lockup, journal excerpt

Update: it just happened again. The system became unresponsive during a photorec session. After a reset, this is what I found in the journal.
Comment 42 Siegbert Sadowski 2017-03-30 18:20:50 UTC
Created attachment 719368 [details]
Output of lsattr for journal files

Thanks for your replies.
The 'C' attribute is set for all journal files.

Siegbert.
Comment 43 Franck Bui 2017-04-03 12:42:49 UTC
(In reply to Siegbert Sadowski from comment #42)
> The 'C' attribute is set for all journal files.

Then it seems that the kernel guys have some homework ;)
Comment 44 Siegbert Sadowski 2017-04-17 17:50:30 UTC
Hello all,

it looks like the recent SuSE updates (week of 10 Apr 2017) solved the problem. I have monitored the system for 1 week now and no new crashes occurred.

Thanks a lot to all of you for your help.

Siegbert.
Comment 45 Tomáš Chvátal 2018-04-17 14:00:10 UTC
This is automated batch bugzilla cleanup.

The openSUSE 42.2 changed to end-of-life (EOL [1]) status. As such
it is no longer maintained, which means that it will not receive any
further security or bug fix updates.
As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
openSUSE, or you can still observe it under openSUSE Leap 15.0, please
feel free to reopen this bug against that version (see the "Version"
component in the bug fields), or alternatively open
a new ticket.

Thank you for reporting this bug and we are sorry it could not be fixed
during the lifetime of the release.

[1] https://en.opensuse.org/Lifetime