Bug 931787

Summary: dracut stop booting after btrfs rootfilesystem went full
Product: [openSUSE] openSUSE Distribution Reporter: Diego Ercolani <diego.ercolani>
Component: Bootloader Assignee: Thomas Renninger <trenn>
Status: RESOLVED INVALID QA Contact: Jiri Srain <jsrain>
Severity: Major    
Priority: P5 - None CC: dsterba, trenn
Version: 13.2   
Target Milestone: ---   
Hardware: x86-64   
OS: openSUSE 13.2   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: a tar of all the files regarding this issue
rpm -Va showing some lack of libraries (not involved in boot process)

Description Diego Ercolani 2015-05-21 08:22:14 UTC
Created attachment 634966 [details]
a tar of all the files regarding this issue

I have a system that uses snapper to take periodic snapshots on a btrfs filesystem configured with subvolumes.
This btrfs filesystem sits on a raid1 partition (managed by mdadm, as it is able to raise errors via mail in case of degradation).

I ran into the classic problem with btrfs: as time goes by, "df -h" still shows free space while btrfs itself considers the volume full.

To manage this situation I added another partition to the btrfs filesystem and reclaimed the space with "btrfs balance start /". I then made a new initrd with "mkinitrd" and rebooted.
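
The mismatch between "df -h" and btrfs's own accounting can be inspected directly. The following is only a sketch: the function names are made up for illustration, the commands need root, and the device passed to grow_and_balance is whatever spare partition is available.

```shell
# Hypothetical helpers (the names are not from the report).

# Print the generic view and the btrfs view of the same mount point.
btrfs_space_report() {
    mnt=${1:-/}
    df -h "$mnt"                   # what generic tools report
    btrfs filesystem show          # per-device allocation
    btrfs filesystem df "$mnt"     # data/metadata chunks: total vs. used
}

# Add a device to the filesystem, then rebalance chunks across all devices.
grow_and_balance() {
    dev=$1
    mnt=${2:-/}
    btrfs device add "$dev" "$mnt" && btrfs balance start "$mnt"
}
```

A volume is effectively full for btrfs once all raw space is allocated to chunks, even if "df -h" still reports free space inside those chunks.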

The dracut process hangs (without dropping to an emergency shell) while mounting the rootfs. The console shows that dracut is trying to mount the correct UUID, but for some reason the mount does not work.

So I booted from a rescue disk, removed some snapper-generated subvolumes to "make space", and then also removed the secondary btrfs device (thinking that this would recover from the "boot hang") with "btrfs device delete <volume> <path>".

I regenerated the initrd with mkinitrd and rebooted, with the same result:
the boot process hangs trying to mount the rootfs.

So I set the dracut command line to drop to a shell at the pre-mount stage of boot:
rd.break=pre-mount

This correctly dropped to a shell and generated the diagnostic logs that I attached (rdsosreport-pre-mount.txt).
From the command line I issued the mount command:
mount /dev/disk/by-uuid/c591f436-cc33-42b1-a272-1fc85386e2cb /sysroot/
This worked without any problem. I then pressed Ctrl-D, and dracut dropped to the emergency shell again and generated the other log I attached (rdsosreport-pre-mount-2.txt).
After that I pressed Ctrl-D once more and the boot process started, showing the openSUSE 13.2 banner, but the problem was not solved. This time systemd complained that it could not mount (?) the root filesystem (which I had mounted from the dracut emergency console) and dropped me to an emergency login. There I found that the rootfs was correctly mounted, and I manually started the network and the ssh daemon to get remote access. I also attached the dump of the journal for this issue, where the following line shows the problem:

May 21 08:42:24 casaregno systemd[1]: Timed out waiting for device dev-disk-by\x2dlabel-rootfs.device.
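
The unit name in that message is a systemd-escaped device path: "-" separates path components and "\x2d" stands for a literal "-". A minimal sketch of decoding it by hand (unit_to_path is a hypothetical helper name, and only the \x2d escape is handled):

```shell
# Hypothetical helper: turn a systemd .device unit name back into a path.
# Simplification: only the \x2d escape (a literal '-') is decoded.
unit_to_path() {
    name=${1%.device}                      # strip the unit suffix
    printf '/%s\n' "$name" |
        sed -e 's|-|/|g' -e 's|\\x2d|-|g'  # '-' -> '/', then '\x2d' -> '-'
}
```

Calling unit_to_path 'dev-disk-by\x2dlabel-rootfs.device' prints /dev/disk/by-label/rootfs, so systemd was waiting for the symlink behind the LABEL=rootfs entries in /etc/fstab to appear.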

Here is the environment:
[/etc/fstab]
...
LABEL=rootfs /                    btrfs       compress        1 1
LABEL=rootfs /usr/local btrfs subvol=usr/local,compress 0 0
LABEL=rootfs /boot/grub2/i386-pc btrfs subvol=boot/grub2/i386-pc 0 0
#LABEL=rootfs /boot/grub2/x86_64-efi btrfs subvol=boot/grub2/x86_64-efi 0 0
#LABEL=rootfs /home btrfs subvol=home 0 0
LABEL=rootfs /opt btrfs subvol=opt 0 0
LABEL=rootfs /srv btrfs subvol=srv 0 0
LABEL=rootfs /tmp btrfs subvol=tmp 0 0
LABEL=rootfs /var/crash btrfs subvol=var/crash 0 0
#LABEL=rootfs /var/lib/mailman btrfs subvol=var/lib/mailman 0 0
#LABEL=rootfs /var/lib/named btrfs subvol=var/lib/named 0 0
#LABEL=rootfs /var/lib/pgsql btrfs subvol=var/lib/pgsql 0 0
LABEL=rootfs /var/log btrfs subvol=var/log 0 0
LABEL=rootfs /var/opt btrfs subvol=var/opt 0 0
LABEL=rootfs /var/spool btrfs subvol=var/spool 0 0
LABEL=rootfs /var/tmp btrfs subvol=var/tmp 0 0
LABEL=rootfs /.snapshots btrfs subvol=.snapshots 0 0
...


and the installed packages:
dracut-037-17.9.1.x86_64
kernel-desktop-devel-3.16.7-21.1.x86_64
kernel-devel-3.16.7-21.1.noarch
kernel-desktop-3.16.7-21.1.x86_64
btrfsmaintenance-0.1-1.1.noarch
btrfsprogs-3.16.2-4.1.x86_64
plymouth-dracut-0.9.0-1.1.x86_64
kernel-source-3.16.7-21.1.noarch
libbtrfs0-3.16.2-4.1.x86_64
kernel-macros-3.16.7-21.1.noarch
kernel-firmware-20141122git-5.1.noarch
Comment 1 Diego Ercolani 2015-05-21 08:28:56 UTC
This bug is probably tied to bug 905615.
Comment 2 Diego Ercolani 2015-05-21 15:58:08 UTC
I tried to replace the "new generated" initrd with the initrd that worked until 19th of May (before the filesystem got full) (as, as far as I know I have the same disk geometry and hardware like before)
I have the same issue. So something happened to the filesystem?!?!

I examined the filesystem from the rescue disk and everything seems fine: I can access every directory and every subvolume.

Since the system is completely down, before I reinstall everything (fortunately I have a backup), can someone point me to debugging steps that could also be helpful for other users?
Comment 3 David Sterba 2015-05-21 17:13:27 UTC
There are no apparent errors that would be related to failed mount of the root filesystem.

Logs in rdsosreport-pre-mount.txt.gz contain loading of btrfs module and first device scan, the mount attempt is not there.

rdsosreport-pre-mount-2.txt.gz shows a successful mount, then systemd drops to the emergency shell.

journal.gz seems from the POV of a filesystem.
Comment 4 David Sterba 2015-05-21 17:25:22 UTC
(In reply to David Sterba from comment #3)
> journal.gz seems from the POV of a filesystem.

... seems ok ...

There are some errors on usb device 11-1, but otherwise nothing suspicious.

[52.130683] casaregno kernel: BTRFS info (device md127): disk space caching is enabled
[79.076469] casaregno systemd[1]: Started Dracut Emergency Shell.

The timestamp delta is about 27 seconds. This looks like some timeout, but there are no further details.
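
The delta comes from the two journal timestamps quoted above. A quick way to compute it (ts_delta is a hypothetical helper name, not something from the logs):

```shell
# Hypothetical helper: difference in seconds between two journal timestamps.
ts_delta() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.3f\n", b - a }'
}
```

For the two lines above, ts_delta 52.130683 79.076469 prints 26.946.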

(In reply to Diego Ercolani from comment #2)
> I tried to replace the "new generated" initrd with the initrd that worked
> until 19th of May (before the filesystem got full) (as, as far as I know I
> have the same disk geometry and hardware like before)
> I have the same issue. So something happened to the filesystem?!?!

Hm, I'd try the same. It's still possible that the failed update left some package half-updated and this stops the boot.

You can verify the installed files against the rpm database with 'rpm -vVa' ("verbose verify of all packages") and look for "missing" files or wrong md5 checksums.
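
A minimal sketch of narrowing that output down, assuming the usual rpm -V line format where a line either starts with "missing" or with a flag string whose third character is "5" when the file digest differs (filter_rpm_verify is a hypothetical name):

```shell
# Hypothetical helper: keep only missing files and digest mismatches
# from "rpm -Va" output read on stdin.
filter_rpm_verify() {
    grep -E '^(missing|..5)'
}

# Usage (as root, can take a while):
#   rpm -Va 2>/dev/null | filter_rpm_verify
```

Lines that only differ in mtime or permissions are dropped, which makes half-updated packages easier to spot.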
Comment 5 Diego Ercolani 2015-05-21 19:05:46 UTC
Created attachment 635087 [details]
rpm -Va showing some lack of libraries (not involved in boot process)
Comment 6 Diego Ercolani 2015-05-21 19:07:21 UTC
For the usb device, yes, my motherboard is some kind of crap...
System Information
	Manufacturer: Gigabyte Technology Co., Ltd.
	Product Name: GA-990XA-UD3

I think it's the usb3 controller...
casaregno:~ # lsusb -s 11:1 -v

Bus 011 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               1.10
  bDeviceClass            9 Hub
  bDeviceSubClass         0 Unused
  bDeviceProtocol         0 Full speed (or root) hub
  bMaxPacketSize0        64
  idVendor           0x1d6b Linux Foundation
  idProduct          0x0001 1.1 root hub
  bcdDevice            3.16
  iManufacturer           3 (error)
  iProduct                2 OHCI PCI host controller
  iSerial                 1 0000:00:16.0
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength           25
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          0 
    bmAttributes         0xe0
      Self Powered
      Remote Wakeup
    MaxPower                0mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass         9 Hub
      bInterfaceSubClass      0 Unused
      bInterfaceProtocol      0 Full speed (or root) hub
      iInterface              0 
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0002  1x 2 bytes
        bInterval             255
Hub Descriptor:
  bLength               9
  bDescriptorType      41
  nNbrPorts             4
  wHubCharacteristic 0x0002
    No power switching (usb 1.0)
    Ganged overcurrent protection
  bPwrOn2PwrGood        2 * 2 milli seconds
  bHubContrCurrent      0 milli Ampere
  DeviceRemovable    0x00
  PortPwrCtrlMask    0xff
 Hub Port Status:
   Port 1: 0000.0101 power connect
   Port 2: 0000.0100 power
   Port 3: 0000.0100 power
   Port 4: 0000.0100 power
Device Status:     0x0001
  Self Powered

As for the test you suggested: I tried it, and there were indeed some missing libraries, as you can see in the rpmVa.log file I attached. I restored the missing libraries and regenerated the initrd, but nothing changed.
Comment 7 Diego Ercolani 2015-05-21 19:51:36 UTC
Solved!
This is what I did:
1. created a logical volume and formatted it as btrfs to receive the root filesystem
2. created the whole subvolume structure in the new btrfs volume
3. copied all the files with "cp -ax" from every subvolume
4. bind-mounted (mount -o bind) /dev /sys /*** into the new rootfs
5. mkinitrd
6. grub2-install /dev/sda; grub2-install /dev/sdb
7. grub2-mkconfig -o /boot/grub2/grub.cfg
8. reboot
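
The steps above can be sketched roughly as follows. This is not the exact command sequence that was run: OLD, NEW and NEWDEV are hypothetical names, the logical volume is assumed to exist already, the old filesystem is assumed to be mounted so that its subvolumes are visible as directories, and only /dev and /sys are bind-mounted because the rest of the list is elided in step 4. With DRY_RUN=1 the commands are printed instead of executed.

```shell
# Sketch only. OLD, NEW and NEWDEV are hypothetical, not from the report.
OLD=${OLD:-/mnt/old}                 # old btrfs root, mounted from a rescue system
NEW=${NEW:-/mnt/new}                 # mount point for the new filesystem
NEWDEV=${NEWDEV:-/dev/vg0/newroot}   # hypothetical, already created logical volume

# Subvolume paths as listed in the /etc/fstab shown in the description.
SUBVOLS="usr/local boot/grub2/i386-pc opt srv tmp var/crash var/log var/opt var/spool var/tmp .snapshots"

# With DRY_RUN set, print the command instead of running it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

migrate_root() {
    run mkfs.btrfs "$NEWDEV" || return 1                   # step 1
    run mkdir -p "$NEW" || return 1
    run mount "$NEWDEV" "$NEW" || return 1
    for sv in $SUBVOLS; do                                 # step 2
        run mkdir -p "$(dirname "$NEW/$sv")" || return 1
        run btrfs subvolume create "$NEW/$sv" || return 1
    done
    run cp -ax "$OLD/." "$NEW/" || return 1                # step 3: top level
    for sv in $SUBVOLS; do                                 # step 3: each subvolume
        run cp -ax "$OLD/$sv/." "$NEW/$sv/" || return 1
    done
    for fs in dev sys; do                                  # step 4 (list elided in the report)
        run mount --bind "/$fs" "$NEW/$fs" || return 1
    done
    run chroot "$NEW" mkinitrd || return 1                 # step 5
    run chroot "$NEW" grub2-install /dev/sda || return 1   # step 6
    run chroot "$NEW" grub2-install /dev/sdb || return 1
    run chroot "$NEW" grub2-mkconfig -o /boot/grub2/grub.cfg   # step 7
}
```

Running DRY_RUN=1 migrate_root first shows the full command list without touching any disk. The new filesystem also has to match what /etc/fstab and the bootloader look for (here LABEL=rootfs), which this sketch does not handle.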

Everything is working now on the new partition.

So my conclusion is: there is something weird in the original btrfs partition that the boot process cannot understand.

Since I resolved the issue on my own, do you think I can trash the old partition, or do you need me to try to understand what went wrong? (If so, please point out what to do.)
Comment 8 David Sterba 2015-05-26 09:36:00 UTC
Thanks for the offer, nothing from me. I don't think the filesystem is corrupted or damaged in another way, that would be indicated by some messages already, and you were able to manually mount it.
Comment 9 Thomas Renninger 2015-05-27 16:13:16 UTC
Ok. I am not a fs specialist...
In future I'd like to hand over subsystem-specific bugs (lvm, multipath, btrfs,...)
to submaintainers...
David: For btrfs that would be you ;)
       No worries, there are not many, but all dracut bugs counted up, it's
       a lot of work that needs a lot of specialized knowledge...


So what shall we do with this one?
Is it a WONTFIX, as everything works now and we will never find out what happened, or is there something we can/should do?
Comment 10 Thomas Renninger 2015-06-01 13:36:45 UTC
David: Do we still have to do something here?
Comment 11 David Sterba 2015-07-23 12:31:34 UTC
I have no idea where to look further.
Comment 12 Thomas Renninger 2015-11-11 15:21:03 UTC
See comment #8. David expects the fs was really broken, not only full...