Bug 925873

Summary: resume fails after hibernate from Cinnamon, succeeds after hibernate from command line
Product: [openSUSE] openSUSE Distribution
Reporter: Marc Schütz <schuetzm>
Component: Basesystem
Assignee: systemd maintainers <systemd-maintainers>
Status: RESOLVED WONTFIX
QA Contact: Kristyna Streitova <kstreitova>
Severity: Normal
Priority: P5 - None
CC: arvidjaar, bwiedemann, crrodriguez, fbui, forgotten_cAXlJ_FoSf, hare, kstreitova, maintenance, mchang, sbrabec, schuetzm, seife, systemd-maintainers, thomas.blume, tiwai, trenn, wbauer
Version: 13.2
Flags: werner: needinfo? (arvidjaar)
Target Milestone: ---
Hardware: Other
OS: Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Bug Depends on: 936265
Bug Blocks: 917221
Attachments: screenshot
screenshot with debug info
rdsosreport.txt
blurred screenshot
journal of boot with debugging after successful resume
journal of boot with debugging after failed resume
output of "/usr/bin/systemd-sleep-grub pre"
journal of boot after /usr/bin/systemd-sleep-grub pre
journal of failed boot after /usr/lib/systemd/systemd-sleep hibernate
journal of next boot after /usr/lib/systemd/systemd-sleep hibernate (normal boot)
screenshot during failed boot
initrd as requested

Description Marc Schütz 2015-04-03 10:51:54 UTC
Created attachment 629922 [details]
screenshot

After installing systemd 210-25.16.1 via update, resume after hibernate from the Cinnamon shutdown dialog hangs, while everything works flawlessly when I use "echo -n disk > /sys/power/state". (Previously I had used systemd 210-879.1 from tsaupe repo IIRC, because of another bug which has since been resolved.)

In my setup, I have encrypted /home and swap partitions. It correctly asks me for the password (in Plymouth), but then everything stops; there's only the Plymouth splash screen, which doesn't react to <Esc>, <Ctrl+Fn> or <Ctrl+Alt+Del>.

I managed to circumvent the splash screen by pressing Escape early; see the attached screenshot. The last thing printed is "Starting dracut pre-mount hook..."; after that, even the cursor disappears.
Comment 1 Bernhard Wiedemann 2015-04-03 18:01:25 UTC
maybe related to the changes from bug 919095 
that landed in systemd-210-25.16.1?


for better debugging you can try changing the bootloader options
from splash=silent to splash=verbose

and you could try to get more detailed logs via
https://wiki.freedesktop.org/www/Software/systemd/Debugging/
Comment 2 Marc Schütz 2015-04-04 12:27:51 UTC
Created attachment 630006 [details]
screenshot with debug info

Here's a new screenshot including debug info. Not very helpful AFAICS...
Comment 3 Thomas Blume 2015-04-08 09:35:53 UTC
(In reply to Marc Schütz from comment #2)
> Created attachment 630006 [details]
> screenshot with debug info
> 
> Here's a new screenshot including debug info. Not very helpful AFAICS...

Seems that /bin/dracut-pre-mount hangs.

Please add:

rd.break=pre-mount rd.debug debug

to the kernel command line and boot.
When the system reaches the emergency shell, get:

/run/initramfs/rdsosreport.txt

and attach it.
Afterwards, please execute:

/bin/dracut-pre-mount

manually from the initrd.
Does this work?
If so, just type: exit.
Does your machine boot then?
Comment 4 Marc Schütz 2015-04-08 18:07:55 UTC
Created attachment 630399 [details]
rdsosreport.txt
Comment 5 Marc Schütz 2015-04-08 18:10:36 UTC
(In reply to Thomas Blume from comment #3)

(see attachment)

> Afterwards, please execute:
> 
> /bin/dracut-pre-mount
> 
> manually from the initrd.
> Does this work?

It prints tons of debug messages to the screen and hangs, but I can kill it with Ctrl+C; I then get back to the emergency shell. Will have to experiment to find out where exactly it hangs.

> If so, just type: exit.
> Does your machine boot then?

Yes, it resumes correctly then, even when I don't try to run dracut-pre-mount.
Comment 6 Marc Schütz 2015-04-08 18:24:35 UTC
Hmm... I ran the commands from dracut-pre-mount manually one by one. This is the line that needs to be interrupted by Ctrl+C to get back to the shell:

getarg 'rd.break=pre-mount' 'rdbreak=pre-mount' && emergency_shell -n pre-mount "Break pre-mount"

I don't think that's the cause, because it shouldn't even be triggered without the rd.break arg. Except if "getarg" is the culprit?

Anyway, the next command is:

source_hook pre-mount

When I run this one, it starts resuming and finishes successfully.
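
To illustrate why the getarg line should be a no-op when rd.break is absent, here is a minimal sketch of the short-circuit logic. The has_arg function is a hypothetical, simplified stand-in and not dracut's actual getarg implementation; the command line used in the demo is made up.

```shell
# Hypothetical stand-in for dracut's getarg: succeeds only if one of the
# requested words appears verbatim in the given command-line string.
has_arg() {
    cmdline=$1
    shift
    for want in "$@"; do
        # $cmdline is deliberately unquoted so it splits on whitespace
        for word in $cmdline; do
            [ "$word" = "$want" ] && return 0
        done
    done
    return 1
}

# Made-up command line without rd.break (assumption, for illustration only)
demo_cmdline="root=/dev/sda2 resume=/dev/mapper/swap splash=silent"

if has_arg "$demo_cmdline" 'rd.break=pre-mount' 'rdbreak=pre-mount'; then
    echo "would open emergency shell"
else
    echo "emergency shell skipped"
fi
```

With neither spelling present, the left side of && fails and the right side is never evaluated, so a hang on that line only makes sense when the break argument really is on the command line.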
Comment 7 Thomas Blume 2015-04-09 07:35:54 UTC
(In reply to Marc Schütz from comment #6)
> Hmm... I ran the commands from dracut-pre-mount manually one by one. This is
> the line that needs to be interrupted by Ctrl+C to get back to the shell:
> 
> getarg 'rd.break=pre-mount' 'rdbreak=pre-mount' && emergency_shell -n
> pre-mount "Break pre-mount"
>
> I don't think that's the cause, because it shouldn't even be triggered
> without the rd.break arg. Except if "getarg" is the culprit?

The getarg function is from /lib/dracut-lib.sh and calls _dogetarg().
It must hang there somewhere.
When you are in the emergency shell, please run:

. /lib/dracut-lib.sh
_dogetarg 'rd.break=pre-mount' 'rdbreak=pre-mount'


and attach the output.

> Anyway, the next command is:
> 
> source_hook pre-mount
> 
> When I run this one, it starts resuming and finishes successfully.

ok
Comment 8 Marc Schütz 2015-04-09 09:13:30 UTC
Created attachment 630467 [details]
blurred screenshot
Comment 9 Marc Schütz 2015-04-09 09:27:44 UTC
(In reply to Thomas Blume from comment #7)
> When you are in the emergency shell, please run:
> 
> . /lib/dracut-lib.sh
> _dogetarg 'rd.break=pre-mount' 'rdbreak=pre-mount'
> 
> 
> and attach the output.
> 

Unfortunately the screenshot became blurred, but it didn't hang anyway. I will try to source /dracut-state.sh first, maybe this will trigger the bug...
Comment 10 Marc Schütz 2015-04-09 09:55:38 UTC
(In reply to Marc Schütz from comment #9)
> (In reply to Thomas Blume from comment #7)
> > When you are in the emergency shell, please run:
> > 
> > . /lib/dracut-lib.sh
> > _dogetarg 'rd.break=pre-mount' 'rdbreak=pre-mount'
> > 
> > 
> > and attach the output.
> > 
> 
> Unfortunately the screenshot became blurred, but it didn't hang anyway. I
> will try to source /dracut-state.sh first, maybe this will trigger the bug...

Nope, the hang happens in start_emergency_shell (or similar). The last command is "systemctl start dracut-emergency.service". When I press Ctrl+C, it says that "systemd-tty-ask" was killed. (I don't see any indication that it actually asks for anything, and I wouldn't know why - the passphrase for the encrypted swap partition had already been entered before going into the emergency shell.)

Anyway, I suspect that this is unrelated to the actual problem, it seems more a side effect of the emergency shell.
Comment 11 Thomas Blume 2015-04-09 11:31:16 UTC
(In reply to Marc Schütz from comment #10)
> 
> Nope, the hang happens in start_emergency_shell (or similar). The last
> command is "systemctl start dracut-emergency.service". When I press Ctrl+C,
> it says that "systemd-tty-ask" was killed. (I don't see any indication that
> it actually asks for anything, and I wouldn't know why - the passphrase for
> the encrypted swap partition had already been entered before going into the
> emergency shell.)
> 

Argh, sorry, I overlooked something in the rdsosreport. 
It shows that the kernel couldn't find the hibernation image:

-->--
[   16.883030] kaim kernel: PM: Starting manual resume from disk
[   16.884657] kaim kernel: PM: Hibernation image partition 254:0 present
[   16.884658] kaim kernel: PM: Looking for hibernation image.
[   16.886335] kaim kernel: PM: Image not found (code -22)
[   16.886335] kaim kernel: PM: Hibernation image not present or could not be loaded.
--<--

Only after this I can see dracut reporting the initqueue hooks as finished:

-->--
[   16.883146] kaim dracut-initqueue[365]: /lib/dracut-lib.sh@432(check_finished): . /lib/dracut/hooks/initqueue/finished/00resume.sh
[   16.894094] kaim dracut-initqueue[365]: //lib/dracut/hooks/initqueue/finished/00resume.sh@1(source): echo 254:0
[   16.903208] kaim dracut-initqueue[365]: /lib/dracut-lib.sh@430(check_finished): for f in '$hookdir/initqueue/finished/*.sh'
[   16.914483] kaim dracut-initqueue[365]: /lib/dracut-lib.sh@431(check_finished): '[' /lib/dracut/hooks/initqueue/finished/90-crypt.sh = '/lib/dracut/hooks/initqueue/finished/*.sh' ']'
[   16.927973] kaim dracut-initqueue[365]: /lib/dracut-lib.sh@432(check_finished): '[' -e /lib/dracut/hooks/initqueue/finished/90-crypt.sh ']'
[   16.939852] kaim dracut-initqueue[365]: /lib/dracut-lib.sh@432(check_finished): . /lib/dracut/hooks/initqueue/finished/90-crypt.sh
[   16.951343] kaim dracut-initqueue[365]: //lib/dracut/hooks/initqueue/finished/90-crypt.sh@1(source): '[' -e /dev/disk/by-id/dm-uuid-CRYPT-LUKS1-425fd9fafeb94526b10feaa733605791-swap ']'
[   16.964492] kaim dracut-initqueue[365]: /lib/dracut-lib.sh@430(check_finished): for f in '$hookdir/initqueue/finished/*.sh'
[   16.976379] kaim dracut-initqueue[365]: /lib/dracut-lib.sh@431(check_finished): '[' '/lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2f4c63085f-b79c-427d-bd4f-46ad072de4af.sh' = '/lib/dracut/hooks/initqueue/finished/*.sh' ']'
[   16.988222] kaim dracut-initqueue[365]: /lib/dracut-lib.sh@432(check_finished): '[' -e '/lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2f4c63085f-b79c-427d-bd4f-46ad072de4af.sh' ']'
[   16.999956] kaim dracut-initqueue[365]: /lib/dracut-lib.sh@432(check_finished): . '/lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2f4c63085f-b79c-427d-bd4f-46ad072de4af.sh'
[   17.011505] kaim dracut-initqueue[365]: //lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2f4c63085f-b79c-427d-bd'[' -e /dev/disk/by-uuid/4c63085f-b79c-427d-bd4f-46ad072de4af ']'
[   17.022788] kaim dracut-initqueue[365]: /lib/dracut-lib.sh@430(check_finished): for f in '$hookdir/initqueue/finished/*.sh'
[   17.032039] kaim dracut-initqueue[365]: /lib/dracut-lib.sh@431(check_finished): '[' '/lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fmapper\x2fswap.sh' = '/lib/dracut/hooks/initqueue/finished/*.sh' ']'
[   17.039594] kaim dracut-initqueue[365]: /lib/dracut-lib.sh@432(check_finished): '[' -e '/lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fmapper\x2fswap.sh' ']'
[   17.043625] kaim dracut-initqueue[365]: /lib/dracut-lib.sh@432(check_finished): . '/lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fmapper\x2fswap.sh'
[   17.043752] kaim dracut-initqueue[365]: //lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fmapper\x2fswap.sh@1(source): '[' -e /dev/mapper/swap ']'
[   17.043872] kaim dracut-initqueue[365]: /lib/dracut-lib.sh@434(check_finished): return 0
--<--

I guess the hibernation image is found afterwards.
That's why it works if you just type 'exit' from the emergency shell.

Obviously, the kernel shouldn't look for the hibernate image before dracut returns from check_finished.
We have a race condition here.
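
The ordering described above can be sketched as "wait for the device, then trigger the resume". The following is only an illustration under assumptions: wait_for_path is a hypothetical helper, not part of dracut, and the commented usage simply reuses the device name and the 254:0 number seen in this report.

```shell
# wait_for_path PATH [TRIES]: poll until PATH exists, sleeping one second
# between attempts; fails once TRIES attempts (default 10) are used up.
wait_for_path() {
    path=$1
    tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
        [ -e "$path" ] && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Possible use in an initrd (left commented out on purpose, because it
# would start a resume):
#   if wait_for_path /dev/mapper/swap 30; then
#       echo 254:0 > /sys/power/resume
#   fi
```

The point is only that the write to /sys/power/resume happens after the decrypted device is known to exist, rather than in parallel with the checks that wait for it.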
Comment 12 Marc Schütz 2015-04-09 11:37:24 UTC
(In reply to Thomas Blume from comment #11)
> Argh, sorry, I overlooked something in the rdsosreport. 
> It shows that the kernel couldn't find the hibernation image:
> 
> -->--
> [   16.883030] kaim kernel: PM: Starting manual resume from disk
> [   16.884657] kaim kernel: PM: Hibernation image partition 254:0 present
> [   16.884658] kaim kernel: PM: Looking for hibernation image.
> [   16.886335] kaim kernel: PM: Image not found (code -22)
> [   16.886335] kaim kernel: PM: Hibernation image not present or could not
> be loaded.
> --<--

I actually saw this, but thought it's ok because the image is encrypted...
Comment 13 Thomas Blume 2015-04-09 11:51:23 UTC
(In reply to Marc Schütz from comment #12)
> (In reply to Thomas Blume from comment #11)
> > Argh, sorry, I overlooked something in the rdsosreport. 
> > It shows that the kernel couldn't find the hibernation image:
> > 
> > -->--
> > [   16.883030] kaim kernel: PM: Starting manual resume from disk
> > [   16.884657] kaim kernel: PM: Hibernation image partition 254:0 present
> > [   16.884658] kaim kernel: PM: Looking for hibernation image.
> > [   16.886335] kaim kernel: PM: Image not found (code -22)
> > [   16.886335] kaim kernel: PM: Hibernation image not present or could not
> > be loaded.
> > --<--
> 
> I actually saw this, but thought it's ok because the image is encrypted...

Does this mean that the image is encrypted separately from the swap device?
If so, the initial password request from plymouth would only decrypt the swap device, but not the hibernation image on this device.
Later on, plymouth seems to be blocked or stopped, so it doesn't show the prompt for decrypting the image.
Comment 14 Marc Schütz 2015-04-09 13:46:50 UTC
(In reply to Thomas Blume from comment #13)
> (In reply to Marc Schütz from comment #12)
> > I actually saw this, but thought it's ok because the image is encrypted...
> 
> Does this mean that the image is encrypted separately from the swap device?
> If so, the initial password request from plymouth would only decrypt the
> swap device, but not the hibernation image on this device.
> Later on, plymouth seems to be blocked or stopped, so it doesn't show the
> prompt for decrypting the image.

No, it's hibernating directly to the (encrypted) swap device, sorry for the confusion. Your race condition theory is probably correct.

Now, there must be some other difference between "echo disk > /sys/power/state" and hibernating from the Cinnamon menu, because the former works. It's probably this difference that triggers the race condition.
Comment 15 Thomas Blume 2015-04-10 14:58:58 UTC
(In reply to Marc Schütz from comment #14)
> 
> Now, there must be some other difference between "echo disk >
> /sys/power/state" and hibernating from the Cinnamon menu, because the former
> works. It's probably this difference that triggers the race condition.


While trying to reproduce the issue, I've found something else in the rdsosreport log.
You pass the resume parameter as:

resume=/dev/mapper/swap

Does this really match your encrypted swap device?
If configured via YaST, the encrypted swap should rather look like this:

/dev/mapper/cr_swap

Can you please check?
Comment 16 Marc Schütz 2015-04-10 15:28:39 UTC
(In reply to Thomas Blume from comment #15)
> While trying to reproduce the issue, I've found something else in the
> rdsosreport log.
> You pass the resume parameter as:
> 
> resume=/dev/mapper/swap
> 
> Does this really match your encrypted swap device?
> If configured via YaST, the encrypted swap should rather look like this:
> 
> /dev/mapper/cr_swap
> 
> Can you please check?

Yes, it's correct, I set it up manually:

/etc/crypttab:
swap    /dev/disk/by-id/ata-WDC_WD10EZRX-00A8LB0_WD-WMC1U7189353-part1
home    /dev/disk/by-id/ata-WDC_WD10EZRX-00A8LB0_WD-WMC1U7189353-part6

# ls -l /dev/mapper/
total 0
crw------- 1 root root 10, 236 Apr  9 10:15 control
lrwxrwxrwx 1 root root       7 Apr  9 10:15 home -> ../dm-1
lrwxrwxrwx 1 root root       7 Apr  9 10:15 swap -> ../dm-0
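
For reference, the first field of each crypttab entry is the name that appears under /dev/mapper, which is why resume=/dev/mapper/swap matches this setup. A minimal sketch that lists those paths (crypttab_mapper_paths is a hypothetical helper name; it assumes the file ends with a newline):

```shell
# crypttab_mapper_paths [FILE]: print /dev/mapper/<name> for every entry,
# skipping blank lines and comments. FILE defaults to /etc/crypttab.
crypttab_mapper_paths() {
    while read -r name _rest; do
        case $name in
            ''|'#'*) continue ;;
        esac
        printf '/dev/mapper/%s\n' "$name"
    done < "${1:-/etc/crypttab}"
}
```

Run against the crypttab above, this would print /dev/mapper/swap and /dev/mapper/home.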
Comment 17 Thomas Blume 2015-04-29 08:30:16 UTC
I could probably reproduce the issue now in a virtual machine.
Resume after hibernate hung with encrypted swap.
When I reset the machine afterwards, it came up correctly.

The hang happened after the sysroot mount, when the system was trying to activate the swap (swapon).
Commenting out the swap entry in /etc/fstab fixed the hang on resume.
When doing a manual swapon after resume has been finished, I see this:

-->--
# swapon /dev/mapper/cr_swap 
swapon: /dev/mapper/cr_swap: software suspend data detected. Rewriting the swap signature.
--<--

Thinking about it, it is obviously not a good idea to modify a swap device during a resume operation.
The swap device shouldn't be changed at all while a resume is in progress.
The swap activation should be blocked until resume has finished.

Could you please check whether commenting out the swap entry in fstab also fixes it on your machine?
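
A guard of the kind suggested above could look at the swap signature before calling swapon. This is a rough sketch under stated assumptions: a 4096-byte page size, the signature field in the last 10 bytes of the first page, and the kernel marking a hibernation image there with "S1SUSPEND". Both function names are hypothetical.

```shell
# swap_sig FILE: print the first 9 bytes of the signature field,
# assuming a 4096-byte page (field starts at byte offset 4086).
swap_sig() {
    dd if="$1" bs=1 skip=4086 count=9 2>/dev/null
}

# holds_hibernation_image FILE: succeed if the signature looks like a
# kernel hibernation image rather than plain swap ("SWAPSPACE2").
holds_hibernation_image() {
    [ "$(swap_sig "$1")" = "S1SUSPEND" ]
}

# Possible use (commented out, needs root and a real device):
#   holds_hibernation_image /dev/mapper/cr_swap || swapon /dev/mapper/cr_swap
```

Skipping swapon while the image signature is still present would avoid the signature rewrite shown in the swapon output above.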
Comment 18 Thomas Blume 2015-04-29 10:28:16 UTC
I should probably mention that with swap disabled in fstab, you will need to do a manual swapon on your swap device in order to start hibernation at all.
Comment 19 Marc Schütz 2015-04-30 17:48:23 UTC
(In reply to Thomas Blume from comment #17)
> Could you please check whether commenting out the swap entry in fstab also
> fixes it on your machine?

Can you tell me exactly what I should change?

Just commenting out the swap entry from the running system and then suspending resulted in the hang again. Then I rebooted, and ran `mkinitrd`, but after suspend it just rebooted normally and didn't even try to resume.
Comment 20 Thomas Blume 2015-05-04 06:48:11 UTC
(In reply to Marc Schütz from comment #19)
> (In reply to Thomas Blume from comment #17)
> > Could you please check whether commenting out the swap entry in fstab also
> > fixes it on your machine?
> 
> Can you tell me exactly what I should change?
> 
> Just commenting out the swap entry from the running system and then
> suspending resulted in the hang again. Then I rebooted, and ran `mkinitrd`,
> but after suspend it just rebooted normally and didn't even try to resume.

This is bad. It means I haven't reproduced the issue you are seeing.
For me, it just started working after commenting out the swap in the running system.
You shouldn't need to run mkinitrd, because the swapon is normally done after the initrd.
Could you provide a debug boot log from the system booting up normally after a suspend?
Comment 21 Marc Schütz 2015-05-16 10:04:08 UTC
Created attachment 634497 [details]
journal of boot with debugging after successful resume

Sorry for the delay, here is the output of

    journalctl -b

of a boot with

    systemd.log_level=debug systemd.log_target=console rd.debug debug

after a suspend/resume with

    echo disk > /sys/power/state
Comment 22 Thomas Blume 2015-05-19 08:40:02 UTC
(In reply to Marc Schütz from comment #21)
> Created attachment 634497 [details]
> journal of boot with debugging after successful resume
> 
> Sorry for the delay, here is the output of
> 
>     journalctl -b
> 
> of a boot with
> 
>     systemd.log_level=debug systemd.log_target=console rd.debug debug
> 
> after a suspend/resume with
> 
>     echo disk > /sys/power/state

So, the difference between your (working) method and suspend from Cinnamon is that you are not using the systemd commands.
Can you please now suspend the machine via the commands:

/usr/bin/systemd-sleep-grub pre

and then:

/usr/lib/systemd/systemd-sleep hibernate

and check whether this works?
If not, please send me the output of the above commands as well as journalctl -b after reboot.
Comment 23 Marc Schütz 2015-05-20 10:46:09 UTC
(In reply to Thomas Blume from comment #22)
> So, the difference between your (working) method and suspend from Cinnamon
> is that you are not using the systemd commands.
> Can you please now suspend the machine via the commands:
> 
> /usr/bin/systemd-sleep-grub pre

INFO: running prepare-grub
  Skipping grub entry #2, because it has the noresume option
  Skipping grub entry #4, because it has the noresume option
  running kernel is grub menu entry 0 (vmlinuz-3.16.7-21-desktop)
  preparing boot-loader: selecting entry 0, kernel /boot/3.16.7-21-desktop
  grub-once:   running '/usr/sbin/grub2-once 0'
    time needed for sync: 0.3 seconds, time needed for grub: 0.1 seconds.

> 
> and then:
> 
> /usr/lib/systemd/systemd-sleep hibernate

I couldn't capture the output, but I think it complained about /etc/systemd/sleep.conf missing. Apart from that, it hibernated successfully ...

> 
> and check whether this works?

... but hung again on boot :-(

> If not, please send me the output of the above commands as well as
> journalctl -b after reboot.

(going to attach the journal output)
Comment 24 Marc Schütz 2015-05-20 10:47:38 UTC
Created attachment 634857 [details]
journal of boot with debugging after failed resume
Comment 25 Thomas Blume 2015-05-22 10:17:08 UTC
(In reply to Marc Schütz from comment #24)
> Created attachment 634857 [details]
> journal of boot with debugging after failed resume

I can see the suspend event here:

-->--
May 16 11:53:43 kaim.site kernel: PM: Syncing filesystems ... 
May 16 11:53:43 kaim.site kernel: done.
May 16 11:54:35 kaim.site kernel: Freezing user space processes ... (elapsed 0.002 seconds) done.
May 16 11:54:35 kaim.site kernel: PM: Marking nosave pages: [mem 0x0009d000-0x000fffff]
May 16 11:54:35 kaim.site kernel: PM: Marking nosave pages: [mem 0xbeff6000-0xbf4f6fff]
May 16 11:54:35 kaim.site kernel: PM: Marking nosave pages: [mem 0xbf781000-0xbfef2fff]
May 16 11:54:35 kaim.site kernel: PM: Marking nosave pages: [mem 0xbff00000-0x100000fff]
May 16 11:54:35 kaim.site kernel: PM: Marking nosave pages: [mem 0xb4000000-0xb7ffffff]
May 16 11:54:35 kaim.site kernel: PM: Basic memory bitmaps created
May 16 11:54:35 kaim.site kernel: PM: Preallocating image memory... done (allocated 257448 pages)
May 16 11:54:35 kaim.site kernel: PM: Allocated 1029792 kbytes in 0.32 seconds (3218.10 MB/s)
May 16 11:54:35 kaim.site kernel: Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done.
May 16 11:54:35 kaim.site kernel: Suspending console(s) (use no_console_suspend to debug)
May 16 11:54:35 kaim.site kernel: i8042 kbd 00:07: System wakeup enabled by ACPI
May 16 11:54:35 kaim.site kernel: serial 00:03: disabled
May 16 11:54:35 kaim.site kernel: PM: freeze of devices complete after 630.432 msecs
May 16 11:54:35 kaim.site kernel: PM: late freeze of devices complete after 0.580 msecs
May 16 11:54:35 kaim.site kernel: PM: noirq freeze of devices complete after 0.945 msecs
May 16 11:54:35 kaim.site kernel: ACPI: Preparing to enter system sleep state S4
May 16 11:54:35 kaim.site kernel: [Firmware Bug]: ACPI: BIOS _OSI(Linux) query ignored
May 16 11:54:35 kaim.site kernel: PM: Saving platform NVS memory
May 16 11:54:35 kaim.site kernel: Disabling non-boot CPUs ...
May 16 11:54:35 kaim.site kernel: kvm: disabling virtualization on CPU1
May 16 11:54:35 kaim.site kernel: smpboot: CPU 1 is now offline
May 16 11:54:35 kaim.site kernel: kvm: disabling virtualization on CPU2
May 16 11:54:35 kaim.site kernel: smpboot: CPU 2 is now offline
May 16 11:54:35 kaim.site kernel: kvm: disabling virtualization on CPU3
May 16 11:54:35 kaim.site kernel: smpboot: CPU 3 is now offline
May 16 11:54:35 kaim.site kernel: PM: Creating hibernation image:
May 16 11:54:35 kaim.site kernel: PM: Need to copy 263999 pages
May 16 11:54:35 kaim.site kernel: PM: Normal pages needed: 263999 + 1024, available pages: 1808971
May 16 11:54:35 kaim.site kernel: PM: Restoring platform NVS memory
May 16 11:54:35 kaim.site kernel: PCI-DMA: Resuming GART IOMMU
May 16 11:54:35 kaim.site kernel: PCI-DMA: Restoring GART aperture settings
May 16 11:54:35 kaim.site kernel: LVT offset 0 assigned for vector 0x400
May 16 11:54:35 kaim.site kernel: Enabling non-boot CPUs ...
[....]
May 16 11:54:35 kaim.site kernel: PM: restore of devices complete after 896.637 msecs
May 16 11:54:35 kaim.site kernel: PM: Image restored successfully.
May 16 11:54:35 kaim.site kernel: PM: Basic memory bitmaps freed
May 16 11:54:35 kaim.site kernel: Restarting tasks ... done.
--<--

It looks quite ok.
No hint to the root cause of the problem.
I've also tried on some test laptop but couldn't reproduce the issue.

We need to continue narrowing it down.
Please do the following tests:

1. run '/usr/bin/systemd-sleep-grub pre'
afterwards suspend via 'echo disk > /sys/power/state'

2. run '/usr/lib/systemd/systemd-sleep hibernate' without a previous /usr/bin/systemd-sleep-grub

Please report whether the tests fail or succeed and provide the journal debug log from the tests.
Comment 26 Marc Schütz 2015-06-04 09:23:23 UTC
Created attachment 636693 [details]
output of "/usr/bin/systemd-sleep-grub pre"
Comment 27 Marc Schütz 2015-06-04 09:24:11 UTC
Created attachment 636694 [details]
journal of boot after /usr/bin/systemd-sleep-grub pre
Comment 28 Marc Schütz 2015-06-04 09:24:58 UTC
Created attachment 636695 [details]
journal of failed boot after /usr/lib/systemd/systemd-sleep hibernate
Comment 29 Marc Schütz 2015-06-04 09:25:53 UTC
Created attachment 636696 [details]
journal of next boot after /usr/lib/systemd/systemd-sleep hibernate (normal boot)
Comment 30 Marc Schütz 2015-06-04 09:27:32 UTC
Created attachment 636697 [details]
screenshot during failed boot
Comment 31 Marc Schütz 2015-06-04 09:34:40 UTC
(In reply to Thomas Blume from comment #25)
> 1. run '/usr/bin/systemd-sleep-grub pre'
> afterwards suspend via 'echo disk > /sys/power/state'

This succeeded. I attached the output of systemd-sleep-grub-pre and the journal of the entire successful boot.

> 
> 2. run '/usr/lib/systemd/systemd-sleep hibernate' without a previous
> /usr/bin/systemd-sleep-grub

Resume failed. I could not capture the output of the command, but I attached the journal of the hanging boot (resp. the part before the suspend), the journal of the next normal boot, and a screenshot taken during the hanging resume.

The last two lines on the screenshot are interesting. I pressed <Esc> after I entered the encryption passphrase to see the messages, and those lines appeared a few seconds later. It looks like the system attempts to hibernate again?
Comment 32 Thomas Blume 2015-06-10 11:25:24 UTC
(In reply to Marc Schütz from comment #31)
> Resume failed. I could not capture the output of the command, but I attached
> the journal of the hanging boot (resp. the part before the suspend), the
> journal of the next normal boot, and a screenshot taken during the hanging
> resume.
> 
> The last two lines on the screenshot are interesting. I pressed <Esc> after
> I entered the encryption passphrase to see the messages, and those lines
> appeared a few seconds later. It looks like the system attempts to hibernate
> again?

Indeed, pretty odd.
Can you please send me the initrd from your machine?
Let's see whether I can find something therein.
Comment 33 Marc Schütz 2015-06-10 13:20:58 UTC
Created attachment 637329 [details]
initrd as requested
Comment 34 Thomas Blume 2015-06-11 07:51:56 UTC
(In reply to Marc Schütz from comment #33)
> Created attachment 637329 [details]
> initrd as requested

Unfortunately still no hint.
I need the initrd in the state before it hangs.
Please try the following:

1. Before a suspend attempt, add:

rd.break=pre-mount

to your boot command line (you can put it into /etc/default/grub and update the bootloader).
It will get you into the dracut emergency shell at the pre-mount stage.

2. from there, mount your system root:

mount /dev/disk/by-uuid/4c63085f-b79c-427d-bd4f-46ad072de4af /sysroot

3. tar your initrd:

/sysroot/bin/tar cvf /sysroot/tmp/initrd.tar --one-file-system /

4. save rdsosreport

cp /run/initramfs/rdsosreport.txt /sysroot/tmp


You can then umount /sysroot and try to continue the resume by typing: exit.
If this doesn't work, you need to reset your machine. 

The next boot will get you again into the emergency shell, but you can just leave it by typing: exit.
If the system is up again, you should remove the rd.break entry from /etc/default/grub.

Please attach initrd.tar and rdsosreport.txt
Comment 35 Thomas Blume 2015-06-19 09:29:40 UTC
There was a thread on the opensuse-factory mailing list about the same issue.
It turned out that the solution was to uninstall the package:

suspend

Can you please check whether you have this package installed?
If so, does uninstalling the package help?
Comment 36 Marc Schütz 2015-06-20 14:59:06 UTC
(In reply to Thomas Blume from comment #35)
> There was a thread on the opensuse-factory mailing list about the same issue.
> It turned out that the solution was to uninstall the package:
> 
> suspend
> 
> Can you please check whether you have this package installed?
> If so, does it help uninstalling the package?

Yes! I removed that package, and now it works fine!
Comment 37 Thomas Blume 2015-06-23 06:17:05 UTC
(In reply to Marc Schütz from comment #36)
> (In reply to Thomas Blume from comment #35)
> > There was a thread on the opensuse-factory mailing list about the same issue.
> > It turned out that the solution was to uninstall the package:
> > 
> > suspend
> > 
> > Can you please check whether you have this package installed?
> > If so, does it help uninstalling the package?
> 
> Yes! I removed that package, and now it works fine!

Ok, thanks for the feedback.
So, we have competing suspend functionality.
I guess therefore, we saw a second suspend attempt.

According to bug 905424 comment#6, a change in the pm-utils package defaults could fix it without uninstalling the suspend package.

Kristyna, can you take a look?
Comment 38 Stanislav Brabec 2015-06-23 13:27:45 UTC
OK, we can find a way to configure one suspend package so that it does not break another.

But there is a question: Do we really need 9 different ways to suspend?

systemd and systemctl hibernate/systemctl suspend/systemctl hybrid-sleep

pm-utils and pm-hibernate/pm-suspend/pm-hybrid

suspend and s2disk/s2ram/s2both

Now there are 3 packages and 3 different ways to suspend the machine, which creates a matrix of 9 possibilities. Do we really need to keep, maintain and test all of them?

How far are they compatible?
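
To see which of these three families is actually present on a given machine, a small check along the following lines may help. It is only a sketch: present_tools is a hypothetical helper, and it merely tests whether the command names listed above resolve on PATH, not which one the desktop will call.

```shell
# present_tools NAME...: print each NAME that resolves to a command on PATH
present_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 && echo "$tool"
    done
    return 0
}

# One hibernate entry point per package family named above
present_tools systemctl pm-hibernate s2disk
```

If more than one name is printed, several suspend implementations are installed side by side, which is the situation that caused the second suspend attempt in this report.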
Comment 39 Dr. Werner Fink 2015-06-23 13:49:16 UTC
(In reply to Stanislav Brabec from comment #38)

Good question: What are the pros and cons of the matrix entries? Is there hardware out there which works only with e.g. suspend or pm-utils?
Comment 40 Stanislav Brabec 2015-06-23 14:48:57 UTC
Dr. Werner Fink:


Here are special features of other packages:


suspend: suspend is capable of encrypted hibernation. This feature is limited to computers with an AT keyboard. USB keyboards are not supported for entering the key passphrase.


pm-utils: This package contains special quirks for hibernation of some (mostly obsolete) devices: /usr/lib/pm-utils/sleep.d, and we have pm-utils-ndiswrapper.

systemd also supports quirks in /usr/lib/systemd/system-sleep/. More than a year ago, all package maintainers were asked to convert pm-utils quirks to systemd quirks.

Somebody also wrote code to support pm-utils quirks inside systemd, but it seems to have been dropped.

The status of pm-utils quirk database: Unmaintained for ~5 years.

Many quirks were working around ancient kernel or hardware bugs. Most of these bugs are already fixed, but quirks are staying.

Many quirks depend on obsolete dropped binaries like radeontool, vbetool, vbe, getkernels-grub2, getkernels-grub etc.


Fedora has only systemctl, SLE12 has only systemctl as well.


Here is my proposal:

1. Drop "suspend" package now.

2. Drop all binaries from pm-utils now, and add a systemd quirk calling pm-utils quirks. Rename the package to systemd-pm-utils-quirks, and remove this package from all package selections. Over time, port needed quirks to systemd, and drop the package later.

3. Add Obsoletes: suspend pm-utils to systemd package.

4. Edit pm-utils-ndiswrapper accordingly (or drop).
Comment 42 Stanislav Brabec 2015-06-23 15:57:49 UTC
Thinking about it again, I would drop pm-utils completely as well.

We already dropped pm-utils quirks from systemd, so everybody who still depends on pm-utils quirks should already have broken hibernation, e.g. when closing the lid or using the desktop menu.

But nobody complains.
Comment 43 Wolfgang Bauer 2015-06-23 16:27:03 UTC
(In reply to Stanislav Brabec from comment #42)
> We already dropped pm-utils quirks from systemd, so everybody, who still
> depends on pm-utils quirks, should have already broken hibernation e. g.
> when closing the lid or using desktop menu.
> 
> But nobody complains.
That's not quite true. Since November, systemd calls pm-utils again if it is installed:
https://build.opensuse.org/request/show/262325

See also bug#904828.
Comment 44 Stanislav Brabec 2015-06-23 17:42:25 UTC
Comment 43: Thanks for the update.

The current patch depends on presence of pm-* binaries.

Anyway, most of the pm-utils quirks are totally obsolete and broken, and they only make suspend and resume take longer. All useful quirks were ported to systemd.

Both Fedora and SLES12 are able to suspend without pm-utils and suspend packages.

There is not a single bug report about a system that is unable to suspend without the pm-utils or suspend packages. But there are several bug reports against systems that are not able to suspend with pm-utils and suspend, but are able to suspend without them.


I think that there is no benefit in having three suspend utility implementations and two implementations of suspend quirks.
Comment 45 Dr. Werner Fink 2015-06-24 07:03:56 UTC
@ Takashi : What do you think?
            Should we drop pm-utils and suspend from openSUSE Factory?
Comment 46 Dr. Werner Fink 2015-06-24 07:12:33 UTC
@ Guido : As you have re-added the pm-utils support and removed your own pm-utils-hooks-compat.sh script, I'd like to ask you if there is a newer script which can handle the users scripts below /etc/pm/sleep.d ... or does this require the pm-utils package ... or at least a pm-utils-compat package?
Comment 47 Takashi Iwai 2015-06-24 07:19:22 UTC
I'm dealing with a similar bug 935086, and the result was the same: the user-suspend is broken, at least for S4.  So, I'm pretty much for changing the default sleep to kernel now.  Even for openSUSE 13.2.  The lack of a splash screen is a drawback, but I don't think people would complain too much, given the cost of a crash.
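
For reference, a minimal sketch of what switching a single machine to the in-kernel method could look like. It assumes that pm-utils reads shell-style settings from /etc/pm/config.d/ and that "kernel" is an accepted SLEEP_MODULE value; the file name is only an example and not something this report prescribes.

```shell
# /etc/pm/config.d/sleep-module   (file name is an arbitrary example)
# Assumption: pm-utils sources this file and honours SLEEP_MODULE.
# "kernel" selects the in-kernel method instead of uswsusp (s2disk/s2ram).
SLEEP_MODULE="kernel"
```

This only affects code paths that still go through pm-utils; a plain "systemctl hibernate" without pm-utils installed uses the kernel method anyway.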

OTOH, dropping the whole pm-utils has to be done carefully.  People may still use own hooks.  Also, some hooks like Z99grub look really useful.

That said, I'm for dropping pm-utils, but not now.  We need a careful evaluation of each hook and a port to systemd (and eventually upstreaming), then a warning in pm-utils that it'll be deprecated soon, and after some time, we can finally drop pm-utils.
Comment 48 Forgotten User cAXlJ_FoSf 2015-06-24 08:06:33 UTC
(In reply to Dr. Werner Fink from comment #46)
> @ Guido : As you have re-added the pm-utils support and removed your own
> pm-utils-hooks-compat.sh script, I'd like to ask you if there is a newer
> script which can handle the users scripts below /etc/pm/sleep.d ... or does
> this require the pm-utils package ... or at least a pm-utils-compat package?

The script was OK, the problem was that not going through pm-utils and just relying on the script would run certain actions twice, once via systemd natively and once via my script and in no particular order.

(In reply to Takashi Iwai from comment #47)
> OTOH, dropping the whole pm-utils has to be done carefully.  People may
> still use own hooks.  Also, some hooks like Z99grub look really useful.

There are also packages using the hooks, I know of storage-fixup.

> That said, I'm for dropping pm-utils, but not now.  We need the careful
> evaluation of each hook and port to systemd (and eventually upstreaming),
> put a warning to pm-utils that it'll be deprecated soon later, and after
> some time, we can drop pm-utils finally.

I fully agree with that. One big problem with migrating pm-utils package and user hooks to systemd hooks is that systemd executes everything in parallel whereas the pm-utils hooks had a defined order.
Comment 49 Dr. Werner Fink 2015-06-24 08:11:58 UTC
(In reply to Guido Berhoerster from comment #48)

>              One big problem with migrating pm-utils package and user hooks
> to systemd hooks is that systemd executes everything in parallel whereas the 
> pm-utils hooks had a defined order.

If a master script is used for calling the others in serial order this should be solvable.
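
A rough sketch of such a master script, to make the idea concrete. It relies on systemd-sleep calling hooks with two arguments (pre|post, then suspend|hibernate|hybrid-sleep) and on pm-utils hooks taking a single argument (suspend|hibernate on the way down, resume|thaw on the way up). The function name, the choice to handle only /etc/pm/sleep.d, skipping hybrid-sleep, and running the hooks in reverse order on wake-up are assumptions of this sketch, not a description of any shipped script.

```shell
#!/bin/sh
# Sketch of a single systemd-sleep hook that runs legacy pm-utils style
# hooks one after another, in a defined order.

# run_legacy_hooks PHASE ACTION [HOOK_DIR]
run_legacy_hooks() {
    phase=$1
    action=$2
    hook_dir=${3:-/etc/pm/sleep.d}   # assumption: only user hooks are handled

    # Translate systemd's (phase, action) pair into the pm-utils argument.
    case "$phase/$action" in
        pre/suspend)    arg=suspend;   sort_opt=   ;;
        pre/hibernate)  arg=hibernate; sort_opt=   ;;
        post/suspend)   arg=resume;    sort_opt=-r ;;
        post/hibernate) arg=thaw;      sort_opt=-r ;;
        *) return 0 ;;               # hybrid-sleep etc. are skipped here
    esac

    [ -d "$hook_dir" ] || return 0

    # Sorted order going down, reverse order coming up.
    ls "$hook_dir" | LC_ALL=C sort $sort_opt | while IFS= read -r name; do
        hook=$hook_dir/$name
        [ -f "$hook" ] && [ -x "$hook" ] || continue
        # stdin is redirected so a hook cannot eat the list of names.
        "$hook" "$arg" </dev/null || echo "hook $name failed" >&2
    done
}
```

Installed as an executable in /usr/lib/systemd/system-sleep/, the file would end with a call like run_legacy_hooks "$1" "$2". This serializes the legacy hooks among themselves, but it does not change where they run relative to native systemd units.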
Comment 50 Forgotten User cAXlJ_FoSf 2015-06-24 08:46:47 UTC
(In reply to Dr. Werner Fink from comment #49)
> (In reply to Guido Berhoerster from comment #48)
> 
> >              One big problem with migrating pm-utils package and user hooks
> > to systemd hooks is that systemd executes everything in parallel whereas the 
> > pm-utils hooks had a defined order.
> 
> If a master script is used for calling the others in serial order this
> should be solvable.

The docs are not very clear; it seems that hooks in /usr/lib/systemd/system-sleep/ are called immediately before sleeping, after all native systemd units have run, and on wake-up before any native services run.
Thus, if systemd takes over some of the functionality of pm-utils via native services, any user or package hooks executed via a compatibility script would be run in a different order wrt the native services than they did before. pm-utils had the convention that scripts starting with 00-49 could expect that the "usual services and userspace infrastructure is still running". I don't know how much of that is actually a problem in practice.
Comment 51 Wolfgang Bauer 2015-06-24 09:03:48 UTC
(In reply to Takashi Iwai from comment #47)
> The lack of
> splash screen is a drawback, but I don't think people would complain too
> much for the cost of crash.

Even with pm-utils and suspend installed and pm-utils using suspend (the current default SLEEP_MODULE="uswsusp"), there is no splash screen.

plymouth support has been removed from suspend before 13.2 was released.
https://build.opensuse.org/package/rdiff/Base:System/suspend?linkrev=base&rev=42

Btw, encryption support (that has been mentioned in comment#40) seems to have been removed as well:
https://build.opensuse.org/package/rdiff/Base:System/suspend?linkrev=base&rev=37

> OTOH, dropping the whole pm-utils has to be done carefully.  People may
> still use own hooks.  Also, some hooks like Z99grub look really useful.

The functionality of Z99grub is in /usr/bin/systemd-sleep-grub in 13.2 (for grub2 at least), which is called by the systemd units (e.g. systemd-hibernate.service) on hibernate and resume. Somehow this seems to be missing in Factory though... (by mistake?)
https://build.opensuse.org/package/view_file/Base:System/systemd-legacy/systemd-sleep-grub?expand=1

> That said, I'm for dropping pm-utils, but not now.  We need the careful
> evaluation of each hook and port to systemd (and eventually upstreaming),
> put a warning to pm-utils that it'll be deprecated soon later, and after
> some time, we can drop pm-utils finally.

Sounds sensible to me.
Comment 52 Dr. Werner Fink 2015-06-24 09:30:18 UTC
(In reply to Wolfgang Bauer from comment #51)

> The functionality of Z99grub is in /usr/bin/systemd-sleep-grub in 13.2 (for 
> grub2 at least), which is called by the systemd units (e.g. systemd-
> hibernate.service) on hibernate and resume. Somehow this seems to be missing 
> in Factory though... (by mistake?)
> https://rudin.suse.de:8894/package/view_file/Base:System/systemd-legacy/systemd-sleep-grub?expand=1

From changelog:

 Wed Feb 18 05:01:38 UTC 2015 - crrodriguez@opensuse.org
 [...]
 - systemd-sleep-grub: moved to the grub2 package where it belongs as a
   suspend/resume hook (SR#286533) (drops prepare-suspend-to-disk.patch)

The question arises whether this was ever accepted. It seems to have been declined and
revoked

  https://build.opensuse.org/request/show/286533

which I'm not calling a solution!
Comment 53 Wolfgang Bauer 2015-06-24 10:01:35 UTC
(In reply to Dr. Werner Fink from comment #52)
> From changelog:
> 
>  Wed Feb 18 05:01:38 UTC 2015 - crrodriguez@opensuse.org
>  [...]
>  - systemd-sleep-grub: moved to the grub2 package where it belongs as a
>    suspend/resume hook (SR#286533) (drops prepare-suspend-to-disk.patch)

Ah, ok.

But there's a reason why this was not implemented as a suspend/resume hook in the first place: to make sure it is called before/after all other hooks. See also bug 904828, comment 32.

> the question rises is this was ever accepted? It seems to be declined and
> revoked
> 
>   https://build.opensuse.org/request/show/286533
> 
> which I'm not calling a solution!

Apparently not.
Comment 54 Stanislav Brabec 2015-06-24 15:27:05 UTC
Takashi, Werner, Wolfgang: SLES12/SLED12 has no pm-utils and no pm-utils quirks. There is no regression report.

Also Fedora already dropped pm-utils, so it is very probable that most quirks there are obsolete.

Upstream of pm-utils died in 2010, openSUSE branch has only 26 commits since 2010. => No hardware newer than 5 years depends on pm-utils quirks.


So the real question is:

Does pm-utils still contain any required quirks?

If yes, these should be ported to systemd quirks.


Look at these quirks. Evaluation is possible only for generic system quirks. Hardware quirks cannot be easily evaluated. They refer to hardware none of us owns and call binaries that don't even exist in openSUSE.
Comment 55 Takashi Iwai 2015-06-24 15:39:06 UTC
(In reply to Stanislav Brabec from comment #54)
> Takashi_ Werner: Wolfgang: SLES12/SLED12 has no pm-utils and no pm-utils
> quirks. There is no regression report.

SLE12 isn't a good base for evaluation, unfortunately.
How many people tested SLE12 on a very old laptop?

> Also Fedora already dropped pm-utils, so it is very probable that most
> quirks there are obsolete.

Agreed.
 
> Upstream of pm-utils died in 2010, openSUSE branch has only 26 commits since
> 2010. => No hardware newer than 5 years depends on pm-utils quirks.
> 
> 
> So the real question is:
> 
> Does pm-utils still contain any requires quirks.
> 
> If yes, these should be ported to systemd quirks.

Right.
 
> Look at these quirks. Evaluation is possible only for generic system quirks.
> Hardware quirks cannot be easily evaluated. They refer to hardware nobody of
> us owns and call binaries that don't even exist in openSUSE.

Yes, but what's your point?  Just drop the package because we have no way to check?  Come on, there are just a dozen scripts.  We can review them ourselves, decide whether to take each one or not, and ask on the ML when in doubt.
Comment 56 Takashi Iwai 2015-06-24 15:58:15 UTC
So, let me list and check each sleep.d hook:

00logging:
This one is obviously superfluous, we have other logging methods.  Drop.

00powersave:
The power-save is mostly irrelevant with sleep nowadays.  Drop.

02rtcwake:
This is an own RTC alarm setup.  Do we have an alternative in systemd or else?

06autofs:
The autofs hook should be in systemd.  I thought we have some hook in NetworkManager, though.  Need to check.

30s2disk-check:
This is likely superfluous.  A sanity check would be nice to have in systemd, too, but not mandatory.  If not enough pages can be written, the hibernation aborts by itself.

45pcmcia:
This is superfluous.  The PCMCIA kernel subsystem itself supports PM.  Drop.

70rcnetwork:
Likely superfluous as 06autofs.  But need to check.

75modules:
This is an optional hook, and I guess some users have its own setup as a workaround.  Any systemd alternative?

90clock:
This is also an optional hook.  This might be interesting to have in systemd, too.

94cpufreq:
Drop.  It's a kernel's job, after all, and a regression should be reported to kernel.

95led:
This ACPI proc is obsoleted.  Drop.

98video-quirk-db-handler:
This one is the biggest one.  It's basically only for old non-KMS graphics.  If any, the hook and the existing video-quirks/* db can be in a package like pm-utils-legacy or such, as a possible rescue.

99info:
Drop, just an echo.

99video:
This is again for old systems, and can be possibly an opt-in together with 98video-quirk-db-handler.

99Zgrub:
The same functionality should have been already in grub2 and/or systemd.  Drop (after fixing).
Comment 57 Stanislav Brabec 2015-06-24 16:48:31 UTC
We have no process to drop obsolete quirks.

Most of the hw quirks work around a certain kernel bug. Most of these bugs are fixed, but nobody removed the quirk.


I am just studying pm-utils in depth.

It is even more complicated than I could have imagined.


pm-utils supports a very large matrix of combinations:


Suspend methods:

kernel: The fallback method using /sys/power/state

tuxonice: If tuxonice or suspend2 kernel features are detected, then it switches to tuxonice. (I guess it is obsolete, isn't it?)

suspend: (called uswsusp): If suspend binaries are detected, video and chvt quirks are disabled. suspend binaries are called instead. There is a converter which converts the list of proposed quirks to s2* command line arguments. It is expected that these quirks are executed by the suspend package binaries. This mode is the default.

=> If the suspend package is installed, pm-utils completely changes its behavior.
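
To illustrate the "kernel" fallback method from the list above: the kernel lists the sleep states it supports in /sys/power/state, and writing one of them to that file triggers the transition. The helper below is only a sketch; its name and the optional file argument (useful for testing) are made up here.

```shell
#!/bin/sh
# supports_sleep_state STATE [STATE_FILE]
# Succeeds if STATE (e.g. "mem" or "disk") is listed in the kernel's
# state file. STATE_FILE defaults to /sys/power/state.
supports_sleep_state() {
    state_file=${2:-/sys/power/state}
    [ -r "$state_file" ] || return 1
    for s in $(cat "$state_file"); do
        [ "$s" = "$1" ] && return 0
    done
    return 1
}
```

As root, "supports_sleep_state disk && echo disk > /sys/power/state" would then hibernate with no quirks and no userspace helper involved, which is the same path the reporter used successfully from the command line.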



Hardware detection:

DMI: Uses /sys/class/dmi/id/ (the fallback)
HAL: Uses hal-get-property --udi /org/freedesktop/Hal/devices/computer --key
dmidecode: Uses dmidecode -s.

I hope that all of them return the same string; otherwise there is yet another level of fragility.
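
A quick way to check that concern on a given machine is to compare what sysfs and dmidecode report for the same field. This is a sketch only: the helper names are invented, and it assumes that /sys/class/dmi/id/sys_vendor and "dmidecode -s system-manufacturer" describe the same DMI field.

```shell
#!/bin/sh
# Strip leading and trailing whitespace from a DMI string.
normalize_dmi() {
    printf '%s\n' "$1" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# dmi_sources_agree STRING_A STRING_B
# Succeeds if both strings are identical after trimming whitespace.
dmi_sources_agree() {
    [ "$(normalize_dmi "$1")" = "$(normalize_dmi "$2")" ]
}

# Example (dmidecode needs root):
#   dmi_sources_agree "$(cat /sys/class/dmi/id/sys_vendor)" \
#                     "$(dmidecode -s system-manufacturer)" \
#       || echo "DMI sources disagree" >&2
```

The comparison is deliberately case-sensitive, since a quirk database that matches on the exact vendor string would be case-sensitive too.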


I will go through particular modules, try to understand them, check in kernel or systemd.


I think that the behavior change is a nightmare for testing.

I'll check the suspend package, but I guess we can completely drop it, letting all quirks be called by pm-utils (or systemd).
Comment 58 Stanislav Brabec 2015-06-24 16:54:09 UTC
Takashi: I fully agree.

Porting of remaining quirks to systemd should be easy. I would create an extra legacy package.
Comment 59 Cristian Rodríguez 2015-06-24 19:04:48 UTC
(In reply to Stanislav Brabec from comment #58)
> Takashi: I fully agree.
> 
> Porting of remaining quirks to systemd should be easy. I would create an
> extra legacy package.

No, these quirks will be rejected by systemd upstream. Do not add them to the systemd package but to the relevant buggy components that need them; ideally, try fixing the actual bugs instead.
Comment 60 Cristian Rodríguez 2015-06-24 19:21:03 UTC
(In reply to Takashi Iwai from comment #56)

> 02rtcwake:
> This is an own RTC alarm setup.  Do we have an alternative in systemd or
> else?

http://joeyh.name/blog/entry/a_programmable_alarm_clock_using_systemd/ --> WakeSystem= timer setting..

> 06autofs:
> The autofs hook should be in systemd.  I thought we have some hook in
> NetworkManager, though.  Need to check.

Isn't this an autofs bug ? if it needs to be restarted after resume or does not respond.. maybe it misses an event ? a timer does not respond correctly ? is it using CLOCK_MONOTONIC where it should be using CLOCK_BOOTTIME ? is there a race condition ?


> 75modules:
> This is an optional hook, and I guess some users have its own setup as a
> workaround.  Any systemd alternative?

No, why would systemd have to unload kernel drivers before suspend ? this is to workaround buggy devices/drivers that do not come up correctly on system resume.. (usually devices that need reset-on-resume quirk afaik) Need to identify the buggy drivers ..
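
For context, such a workaround amounts to very little code when written as a systemd-sleep hook. The sketch below is illustrative only: the function name and the MODPROBE override are invented here, and the list of modules would have to come from whoever identified the buggy driver.

```shell
#!/bin/sh
# Sketch: unload the given modules before sleep and load them again after.
# MODPROBE can be overridden, e.g. for a dry run.
MODPROBE=${MODPROBE:-modprobe}

# reload_modules_hook PHASE MODULE...
#   PHASE is "pre" or "post", as passed by systemd-sleep in $1.
reload_modules_hook() {
    phase=$1
    shift
    for m in "$@"; do
        case "$phase" in
            pre)  $MODPROBE -r "$m" ;;   # remove before suspend/hibernate
            post) $MODPROBE "$m" ;;      # load again after resume
        esac
    done
}
```

A hook file would call it as reload_modules_hook "$1" some_module, where some_module is a placeholder for the driver that does not survive resume.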
 
> 90clock:
> This is also an optional hook.  This might be interesting to have in
> systemd, too.

The top comment says:

"#!/bin/sh
# Synchronize system time with hardware time.
# Modern kernels handle this correctly so we skip this hook by default.
"
Yeah and if the kernel does not.. then there is a bug ..
Comment 61 Takashi Iwai 2015-06-24 20:41:17 UTC
(In reply to Cristian Rodríguez from comment #60)
> (In reply to Takashi Iwai from comment #56)
> 
> > 02rtcwake:
> > This is an own RTC alarm setup.  Do we have an alternative in systemd or
> > else?
> 
> http://joeyh.name/blog/entry/a_programmable_alarm_clock_using_systemd/ -->
> WakeSystem= timer setting..

How to script this at hibernation time?
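
One possible answer, as a rough sketch only: write a timer unit with WakeSystem=true at the wanted time, plus a no-op service for it to start, and activate the timer before going to sleep. The unit names and the helper function are invented for this example, it needs a systemd new enough to know WakeSystem=, and whether the RTC alarm can wake the machine from hibernation (as opposed to suspend to RAM) depends on the hardware.

```shell
#!/bin/sh
# write_wake_timer UNIT_DIR WHEN
#   UNIT_DIR  directory to write the units to, e.g. /run/systemd/system
#   WHEN      a calendar time understood by OnCalendar=
write_wake_timer() {
    dir=$1
    when=$2

    # The timer needs something to activate; a no-op oneshot is enough.
    cat > "$dir/wake-alarm.service" <<EOF
[Unit]
Description=No-op unit started by the wake-up timer

[Service]
Type=oneshot
ExecStart=/bin/true
EOF

    cat > "$dir/wake-alarm.timer" <<EOF
[Unit]
Description=Wake the machine at $when

[Timer]
OnCalendar=$when
WakeSystem=true
EOF
}
```

Usage, as root, would be along the lines of: write_wake_timer /run/systemd/system "2015-06-26 07:00:00", then "systemctl daemon-reload" and "systemctl start wake-alarm.timer", then hibernate as usual.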
 
> > 06autofs:
> > The autofs hook should be in systemd.  I thought we have some hook in
> > NetworkManager, though.  Need to check.
> 
> Isn't this an autofs bug ? if it needs to be restarted after resume or does
> not respond.. maybe it misses an event ? a timer does not respond correctly
> ? is it using CLOCK_MONOTONIC where it should be using CLOCK_BOOTTIME ? is
> there a race condition ?

Well, we need to check what problem this actually solved.  I vaguely remember an autofs issue in 11.x times, but haven't tracked it since then.

> > 75modules:
> > This is an optional hook, and I guess some users have its own setup as a
> > workaround.  Any systemd alternative?
> 
> No, why would systemd have to unload kernel drivers before suspend ? this is
> to workaround buggy devices/drivers that do not come up correctly on system
> resume.. (usually devices that need reset-on-resume quirk afaik) Need to
> identify the buggy drivers ..

*We* as a distro need to provide a workaround until the kernel gets the fix.
Sure, you can blame kernel, but it's no excuse to ignore the breakage.  Show must go on.  The system must keep working like before.

(And why systemd?  Because it ate others' cookies :)

> > 90clock:
> > This is also an optional hook.  This might be interesting to have in
> > systemd, too.
> 
> The top comment says:
> 
> "#!/bin/sh
> # Synchronize system time with hardware time.
> # Modern kernels handle this correctly so we skip this hook by default.
> "
> Yeah and if the kernel does not.. then there is a bug ..

True.
Comment 62 Takashi Iwai 2015-06-24 20:45:36 UTC
(In reply to Cristian Rodríguez from comment #59)
> (In reply to Stanislav Brabec from comment #58)
> > Takashi: I fully agree.
> > 
> > Porting of remaining quirks to systemd should be easy. I would create an
> > extra legacy package.
> 
> No, this quirks will be rejected by systemd upstream..do not add them to the
> systemd package but to the relevant buggy components that need them, ideally
> try fixing the actual bugs instead.

As far as I understand, Stano will create another package as an add-on, so it won't bother systemd upstream.

And, look at video quirk db, for example.  These are actually workarounds for BIOS.  And they are for non-KMS.  So, they can't be fixed in kernel.
Comment 63 Dr. Werner Fink 2015-06-25 06:50:53 UTC
(In reply to Takashi Iwai from comment #61)
> > > 90clock:
> > > This is also an optional hook.  This might be interesting to have in
> > > systemd, too.
> > 
> > The top comment says:
> > 
> > "#!/bin/sh
> > # Synchronize system time with hardware time.
> > # Modern kernels handle this correctly so we skip this hook by default.
> > "
> > Yeah and if the kernel does not.. then there is a bug ..
> 
> True.

This is not true for all use cases. If the kernel is in the eleven minutes mode then YES it is true (man:adjtimex(1)).  This is the normal case for most systems running the ntp service. Nevertheless, if the hardware clock is out of sync and is too fast/slow, then this is not true even with ntp, as the kernel does not touch the hardware if the offset between its system clock and the BIOS hardware clock is equal to or more than 15 minutes -> /usr/src/linux/kernel/time/ntp.c in ntp_validate_timex().

Be aware that even modern hardware may have a hardware clock which is broken.
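
Whether a given system is in the eleven minutes mode can be read from the kernel's clock status word: the mode is active while the STA_UNSYNC bit (0x0040 in linux/timex.h) is clear. The helper below is a sketch with an invented name; the commented awk line assumes that the adjtimex tool prints a "status:" line in its --print output.

```shell
#!/bin/sh
STA_UNSYNC=64   # 0x0040, "clock unsynchronized" in <linux/timex.h>

# kernel_syncs_rtc STATUS
# Succeeds if the status word says the clock is synchronized, i.e. the
# kernel may write the system time back to the hardware clock on its own.
kernel_syncs_rtc() {
    [ $(( $1 & $STA_UNSYNC )) -eq 0 ]
}

# Example (assumes the adjtimex tool is installed):
#   status=$(adjtimex --print | awk '$1 == "status:" { print $2 }')
#   kernel_syncs_rtc "$status" && echo "eleven minutes mode active"
```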
Comment 64 Takashi Iwai 2015-06-25 07:37:10 UTC
Adding Seife to Cc, who should have a better clue about pm-utils and suspend kludges.
Comment 65 Stefan Seyfried 2015-06-25 08:39:29 UTC
not much more than "just use suspend-to-ram, everything else is too slow anyway" :-)
But I will try to reproduce and debug.
Comment 66 Stefan Seyfried 2015-06-25 09:55:08 UTC
(In reply to Thomas Blume from comment #34)

> 2. from there, mount your system root:
> 
> mount /dev/disk/by-uuid/4c63085f-b79c-427d-bd4f-46ad072de4af /sysroot

> You can then umount /sysroot and try to continue the resume by typing: exit.
> If this doesn't work, you need to reset your machine. 

This will possibly lead to data corruption. *NEVER EVER* touch any mounted file system between suspend and resume.

If anything, mount a USB stick or a networked file system. Or a partition that was not mounted in the suspended system (but I'd even avoid that).


(In reply to Stanislav Brabec from comment #40)
> Here are special features of other packages:

> suspend: suspend is capable to do encrypted hibernation. This feature is
> limited to computers with AT keyboard. USB keyboards are not supported for
> entering key passphrase.

...and this feature is not enabled in SUSE builds (it broke with newer libgcrypt IIRC)

But suspend (s2disk) can do compressed multithreaded suspend which really speeds up suspending and makes it possible to suspend bigger working sets to smaller swap partitions.

s2disk can also suspend to a file instead of to a swap partition; I'm not sure if in-kernel suspend can do that. I'm also not certain that the wrappers in openSUSE support that feature at all.

It also can show a nice splash screen during suspend / resume with progress bar etc., but this feature is also disabled in SUSE builds (it broke with splashy and did not work nice with plymouth IIRC).

> pm-utils: This package contains special quirks for hibernation of some
> (mostly obsolete) devices: /usr/lib/pm-utils/sleep.d, and we have
> pm-utils-ndiswrapper.

pm-utils should die a horrible death :-)

> systemd also has a support for quirks in /usr/lib/systemd/system-sleep/ More
> than year ago, all package maintainers were asked to convert pm-utils quirks
> to systemd quirks.
> 
> Somebody also written a code to support pm-utils quirks inside systemd, but
> it seems to be dropped.
> 
> The status of pm-utils quirk database: Unmaintained for ~5 years.
> 
> Many quirks were working around ancient kernel or hardware bugs. Most of
> these bugs are already fixed, but quirks are staying.

That's why it needs to die :)

> Fedora has only systemctl, SLE12 has only systemctl as well.
> 
> 
> Here is my proposal:
> 
> 1. Drop "suspend" package now.

This will lose support for compressed suspend, but maybe someone will finally put that feature into the kernel (where it would fit well).

So yes, dropping is probably a good idea.

> 2. Drop all binaries from pm-utils now, and add a systemd quirk calling
> pm-utils quirks. Rename the package to systemd-pm-utils-quirks, and remove
> this package from the all package selections. Over the time, port needed
> quirks to systemd, and drop the package later.

Needed quirks should be put into the kernel.

> 3. Add Obsoletes: suspend pm-utils to systemd package.
> 
> 4. Edit pm-utils-ndiswrapper accordingly (or drop)-

Drop :-)

(In reply to Takashi Iwai from comment #56)
> So, let me list and check each sleep.d hook:
> 
> 00logging:
> This one is obviously superfluous, we have other logging method.  Drop.

Yes.
All below is from my (sometimes flakey) memory, so to be used with a grain of salt :-)

> 00powersave:
> The power-save is mostly irrelevant with sleep nowadays.  Drop.

The problem this solves is that the kernel does / did not restore e.g. settings that were set before suspend (unplugging AC might have lowered display brightness or such). Calling "pm-powersave" would just do the same thing that the desktop applet would have triggered when unplugging AC. Or suspend happened on battery power but resume on AC, and the desktop applet would not notice the "plug in" event.

The solution (probably already implemented since a long time) is to have the desktop applets re-evaluate system state after resume and redo all the actions.

> 02rtcwake:
> This is an own RTC alarm setup.  Do we have an alternative in systemd or
> else?

This was a feature implemented by some novell person a few years ago for a SLED11 feature request by some preload project IIRC.

> 06autofs:
> The autofs hook should be in systemd.  I thought we have some hook in
> NetworkManager, though.  Need to check.

Yes, belongs to NetworkManager or whatever.

> 30s2disk-check:
> This is likely superfluous.  A sanity check would be nice to have in
> systemd, too, but not mandatory.  If no enough page can be written, the
> hibernation aborts by itself.

oh, this does so much more! :-) But to be honest: if the user is stupid enough, he can probably still do crazy things (see my comment on comment#34 of this very bug :-), and a standard installation should not need this.

It also did autoconfigure s2disk, but if we are dropping "suspend" package and support for userspace suspend, we don't need that anyway.

> 45pcmcia:
> This is superfluous.  The PCMCIA kernel subsystem itself supports PM.  Drop.

Well, and nobody has real PCMCIA anymore anyway.

> 70rcnetwork:
> Likely superfluous as 06autofs.  But need to check.

susi:~ # /etc/init.d/network stop-all-dhcp-clients
bash: /etc/init.d/network: No such file or directory

It cannot work :-)

> 75modules:
> This is an optional hook, and I guess some users have its own setup as a
> workaround.  Any systemd alternative?

Anyway, the kernel should be fixed instead.

> 90clock:
> This is also an optional hook.  This might be interesting to have in
> systemd, too.

no. comment in this hook:
# Modern kernels handle this correctly so we skip this hook by default.
Now this seems not to be true, because it is still running on my system, but we should be able to do without.

> 94cpufreq:
> Drop.  It's a kernel's job, after all, and a regression should be reported
> to kernel.

Almost everything in there is the kernel's job. Working around in userspace just makes those kernel developers lazy :-P

> 95led:
> This ACPI proc is obsoleted.  Drop.

Yes.

> 98video-quirk-db-handler:
> This one is the biggest one.  It's basically only for old non-KMS graphics. 
> If any, the hook and the existing video-quirks/* db can be in a package like
> pm-utils-legacy or such, as a possible rescue.

We can still try to reinstate something similar if we get reports of brokenness after we dropped it :-)

> 99info:
> Drop, just an echo.
> 
> 99video:
> This is again for old systems, and can be possibly an opt-in together with
> 98video-quirk-db-handler.

interesting:
reset_brightness()
{
        for bl in /sys/class/backlight/* ; do
                [ -f "$bl/brightness" ] || continue
                BR="$(cat $bl/brightness)"
                echo 0 > "$bl/brightness"
                echo "$BR" > "$bl/brightness"
        done
}

If this is still necessary => kernel bug IMO.

> 99Zgrub:
> The same functionality should have been already in grub2 and/or systemd. 
> Drop (after fixing).

Yes. The idea was to *not* give the user the possibility to select anything in GRUB during resume, to avoid stuff like comment#34 :-)

(In reply to Takashi Iwai from comment #61)
> (In reply to Cristian Rodríguez from comment #60)
> > > 06autofs:
> > > The autofs hook should be in systemd.  I thought we have some hook in
> > > NetworkManager, though.  Need to check.
> > 
> > Isn't this an autofs bug ? if it needs to be restarted after resume or does
> > not respond.. maybe it misses an event ? a timer does not respond correctly
> > ? is it using CLOCK_MONOTONIC where it should be using CLOCK_BOOTTIME ? is
> > there a race condition ?
> 
> Well, we need to check what problem actually this has solved.  I vaguely
> remember autofs issue in 11.x time, but haven't tracked since then.

I think it solves the "network changed during suspend, the autofs config (NIS/LDAP) changed, too, and autofs needs to know this" case.

> *We* as a distro need to provide a workaround until the kernel gets the fix.
> Sure, you can blame kernel, but it's no excuse to ignore the breakage.  Show
> must go on.  The system must keep working like before.
> 
> (And why systemd?  Because it ate others' cookies :)

yes, true.

(In reply to Dr. Werner Fink from comment #63)
> (In reply to Takashi Iwai from comment #61)
> > > > 90clock:
> This is not true for all use cases. If the kernel is in the eleven minutes
> mode then YES it is true (man:adjtimex(1)).  This is the normal case for
> most systems running the ntp service. Nevertheless if the hardware clock is
> out of sync and is to fast/slow then this is not true even with ntp as the
> kernel does not touch the hardware it the offset of its system clock and the
> BIOS hardware clock is equal of more then 15 minutes ->
> /usr/src/linux/kernel/time/ntp.c in ntp_validate_timex().

I think the suspend/resume case is different, but need to check the code to confirm.
Comment 67 Thomas Blume 2015-06-25 10:47:06 UTC
(In reply to Stefan Seyfried from comment #66)
> (In reply to Thomas Blume from comment #34)
> 
> > 2. from there, mount your system root:
> > 
> > mount /dev/disk/by-uuid/4c63085f-b79c-427d-bd4f-46ad072de4af /sysroot
> 
> > You can then umount /sysroot and try to continue the resume by typing: exit.
> > If this doesn't work, you need to reset your machine. 
> 
> This will possibly lead to data corruption. *NEVER EVER* touch any mounted
> file system between suspend and resume.
> 

Uh, but we do an fsck on the root filesystem during resume.
Would this also be bad?
Comment 68 Stefan Seyfried 2015-06-25 10:57:57 UTC
(In reply to Thomas Blume from comment #67)
> Uh, but we do an fsck on the root filesystem during resume.
> Would this be also bad?

Yes, that's absolutely deadly, and thus it has been fixed after 13.2 was released in dracut:

* Sa Nov 22 2014 arvidjaar@gmail.com
- add 0165-Order-root-fsck-after-pre-mount.patch
  ensure root fsck runs after dracut-pre-mount.service which calls
  resume (bnc#906592)
Comment 69 Stanislav Brabec 2015-06-25 13:36:06 UTC
> 90clock:

As far as I remember, this quirk causes only problems.

When my machine was offline, it was much better to disable this quirk (and also the same in the shutdown sequence).


It causes problems for two reasons:

1. The quirk itself expects the system clock to be better than hwclock. This is true with NTP, but not for many machines without NTP.

2. We have the /etc/adjtime feature, which works much better for machines without NTP, as it has predictable results.

If the suspend quirk saves a system time which has drifted over time, the user then probably wants to adjust the clock to the correct time. The adjtime calculation thinks that the whole drift was caused by hwclock inaccuracy, and updates the seconds-per-day compensation to an incorrect value. Next time, the clock will be shifted even more because of this.

Imagine that the system clock went 2 minutes off while going to suspend. The quirk wrote this incorrect time to hwclock. The next day the user discovers the shift, and adjusts the time. /etc/adjtime will be set to 2 minutes per day. 5 days later, the system is booted, and /etc/adjtime computes the need for a 10 minute correction.
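
The arithmetic of that example, spelled out as a tiny sketch (the function name is made up, and the real adjtime calculation is more involved than a plain multiplication):

```shell
#!/bin/sh
# adjtime_correction_seconds DRIFT_SECONDS_PER_DAY DAYS_ELAPSED
# Prints the correction that a learned drift rate yields after some days.
adjtime_correction_seconds() {
    echo $(( $1 * $2 ))
}
```

With a wrongly learned rate of 120 seconds per day and 5 days of downtime, adjtime_correction_seconds 120 5 prints 600, i.e. the 10 minutes above, applied to a hardware clock that may not have drifted at all.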
Comment 70 Dr. Werner Fink 2015-06-25 13:57:56 UTC
(In reply to Stanislav Brabec from comment #69)

Indeed this quirk should be only optional and should only be enabled for systems where the hardware clock is much worse than the system clock.  For such systems the /etc/adjtime does not help.

For correct working clocks (hardware and system) the /etc/adjtime should not be used if ntp is active as on such systems the hardware clock and system clock *are* in sync.
Comment 71 Dr. Werner Fink 2015-06-25 14:02:06 UTC
To go into more detail: I have seen a lot of bug reports in both directions, that is, do not touch the hardware clock (as the reporters insist that it works for them) as well as the other side, that in all cases the system clock should be written back at reboot/halt/sleep (and those reporters also insist that it works for them).  Hardware bugs can be evil ...
Comment 72 Stanislav Brabec 2015-06-25 14:21:50 UTC
> 06autofs:

According to the bug 916737 comment 15, the NetworkManager fix already exists.

> 90clock:

Werner: It needs some way of global configuration:

- --hctosys on resume is a work around. Good systems should always update system time.

- --systohc on suspend/shutdown makes sense for systems with bad hwclock. But adjtime must be disabled then.

- --systohc on suspend/shutdown also makes sense on systems with NTP. But regular --systohc (e. g. once or twice daily) just after NTP update makes even more sense.

- regular --hctosys and regular --adjust makes sense for systems with bad system clock. But systohc on suspend/shutdown must be disabled then.

I guess we should create global sysconfig for them. This quirk can be easily transformed to systemd quirk, so the result will not be dependent on pm-utils.
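
A minimal sketch of how such a quirk could look as a systemd-sleep hook driven by one global setting. The mode names (none, systohc, hctosys), the function name, and the idea of reading the mode from a sysconfig file are all assumptions for illustration; only "hwclock --systohc" and "hwclock --hctosys" are existing commands.

```shell
#!/bin/sh
# HWCLOCK can be overridden, e.g. for a dry run.
HWCLOCK=${HWCLOCK:-hwclock}

# clock_hook PHASE [MODE]
#   PHASE  "pre" or "post", as passed by systemd-sleep in $1
#   MODE   hypothetical global setting, e.g. sourced from a sysconfig file:
#            none     do nothing (default)
#            systohc  write system time to the RTC before sleeping
#            hctosys  read the RTC into system time after waking up
clock_hook() {
    phase=$1
    mode=${2:-none}
    case "$phase/$mode" in
        pre/systohc)  $HWCLOCK --systohc ;;
        post/hctosys) $HWCLOCK --hctosys ;;
    esac
}
```

Keeping systohc and hctosys as mutually exclusive modes mirrors the list above: a machine either trusts its system clock or its hardware clock, and the adjtime handling has to be configured to match.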
Comment 73 Stefan Seyfried 2015-06-25 15:23:45 UTC
(In reply to Stanislav Brabec from comment #72)
> > 06autofs:
> 
> According to the bug 916737 comment 15, the NetworkManager fix already
> exists.
> 
> > 90clock:

> - --hctosys on resume is a work around. Good systems should always update
> system time.

And the kernel already did that before. It is useless IMO.

> - --systohc on suspend/shutdown makes sense for systems with bad hwclock.
> But adjtime must be disabled then.
>
> - --systohc on suspend/shutdown also makes sense on systems with NTP. But
> regular --systohc (e. g. once or twice daily) just after NTP update makes
> even more sense.
> 
> - regular --hctosys and regular --adjust makes sense for systems with bad
> system clock. But systohc on suspend/shutdown must be disabled then.
> 
> I guess we should create global sysconfig for them. This quirk can be easily
> transformed to systemd quirk, so the result will not be dependent on
> pm-utils.

Let's do that once someone is reporting a problem. Do not fix things that are not broken :-)
Until then, just get rid of all the scripts, unless they are obviously still useful. I think they can all go away and we'll later fix what's broken.
Comment 74 Stanislav Brabec 2015-06-25 15:51:56 UTC
Stefan:

> Let's do that once someone is reporting a problem. Do not fix things that are
> not broken :-)

Things are broken. A year ago I was debugging a customer problem where these nice "clock improvement tricks" caused a drift of more than 20 years per day!

http://www.spinics.net/lists/util-linux-ng/msg09294.html
Comment 75 Wolfgang Bauer 2015-06-25 16:48:35 UTC
(In reply to Stanislav Brabec from comment #74)
> Things are broken.
Right.

At the moment suspend cannot even work in Factory/Tumbleweed AFAICT, because dracut doesn't add the necessary things to the initrd any more.
See http://lists.opensuse.org/opensuse-factory/2015-06/msg00258.html

And it's mostly suspend that causes problems on 13.2 too, not pm-utils, as far as I can tell.

This discussion is all good and nice (and necessary), but it won't help users with problems _now_.

And this bug report was for 13.2, where the changes that are being discussed now would be too disruptive anyway I suppose.

My proposal (again): change the defaults for pm-utils to _not_ use suspend, i.e.  set SLEEP_MODULE="kernel" in /usr/lib/pm-utils/defaults (even on 13.2 with an update) to "fix" the problems at hand.

The rest can still be discussed (and changed/fixed/moved/whatever) afterwards.

May I go forward with this? IOW, are you ok if I create a submit request with this change of defaults?
Comment 76 Stanislav Brabec 2015-06-25 17:16:30 UTC
Comment 75: As the current configuration does not work, and it will after the change, changing /usr/lib/pm-utils/defaults seems to be the simplest fix. It will ensure that the broken s2* tools are never called from other suspend utilities.
Comment 77 Takashi Iwai 2015-06-25 17:55:38 UTC
(In reply to Wolfgang Bauer from comment #75)
> My proposal (again): change the defaults for pm-utils to _not_ use suspend,
> i.e.  set SLEEP_MODULE="kernel" in /usr/lib/pm-utils/defaults (even on 13.2
> with an update) to "fix" the problems at hand.

Fully agreed.
Comment 78 Wolfgang Bauer 2015-06-25 18:08:44 UTC
(In reply to Stanislav Brabec from comment #76)
> Comment 75: As the current configuration does not work, and it will do after
> the change, change of /usr/lib/pm-utils/defaults seems to a simplest fix. It
> will cause that broken s2* will never be called from other suspend utilities.

(In reply to Takashi Iwai from comment #77)
> Full agreed.

Ok, I opened a submit request:
https://build.opensuse.org/request/show/313737

There was another one open already too, though.
I hope that doesn't cause any disturbance.

@Maintenance Team: may I submit the same default change for 13.2 too?
Comment 79 Stefan Seyfried 2015-06-25 19:05:15 UTC
(In reply to Stanislav Brabec from comment #74)
> Stefan:
> 
> > Let's do that once someone is reporting a problem. Do not fix things that are
> > not broken :-)
> 
> Things are broken. A year ago I was debugging a customer problem, where
> these nice "clock improvement tricks" caused drift more than 20 years per
> day!

Yes, it's broken with this special hook. I'm proposing to just drop all the pm-utils tricks, *without* porting them to systemd or similar.

If afterwards we find that one of them was actually doing something useful (I doubt it :-), and it being gone causes problems, then we can fix those potential problems once we see them.

All of the above is for factory.

For 13.2, I agree that changing the default seems sensible.
Comment 80 Wolfgang Bauer 2015-06-25 20:29:14 UTC
(In reply to Stefan Seyfried from comment #79)
> For 13.2, I agree that changing the default seems sensible.

I would think it would be sensible for Factory even more.
See my previous comments.

And (again): http://lists.opensuse.org/opensuse-factory/2015-06/msg00258.html
That's what dracut does on Factory, and suspend support comes afterwards, so isn't really there on systemd systems.

(on 13.2 the situation is different, because those lines are not there yet)
Comment 81 Dr. Werner Fink 2015-06-26 07:28:52 UTC
(In reply to Stanislav Brabec from comment #72)

> Werner: It needs some way of global configuration:
> 
> - --hctosys on resume is a work around. Good systems should always update 
> system time.

Normally the kernel does this at boot ... therefore in mkinitrd/dracut the
kernel's system clock has to be warped if local time is or has to be used for the
hardware clock *before* file system checks and mounting them.

> 
> - --systohc on suspend/shutdown makes sense for systems with bad hwclock. But 
> adjtime must be disabled then.

ACK

> - --systohc on suspend/shutdown also makes sense on systems with NTP. But 
> regular --systohc (e. g. once or twice daily) just after NTP update makes 
> even more sense.

Here the matrix has to be expanded: if the ntp service is running, the kernel may stay in eleven-minute mode (the bit with value 64, i.e. 0x0040, in the status returned by adjtimex is *not* set -> STA_UNSYNC[1]), and then there is no need to write the system clock to the hw clock.

> - regular --hctosys and regular --adjust makes sense for systems with bad 
> system clock. But systohc on suspend/shutdown must be disabled then.
> 
> I guess we should create global sysconfig for them. This quirk can be easily 
> transformed to systemd quirk, so the result will not be dependent on pm-utils.

ACK


[1]
/usr/include/bits/timex.h

/* Status codes (timex.status) */
#define STA_PLL         0x0001  /* enable PLL updates (rw) */
#define STA_PPSFREQ     0x0002  /* enable PPS freq discipline (rw) */
#define STA_PPSTIME     0x0004  /* enable PPS time discipline (rw) */
#define STA_FLL         0x0008  /* select frequency-lock mode (rw) */

#define STA_INS         0x0010  /* insert leap (rw) */
#define STA_DEL         0x0020  /* delete leap (rw) */
#define STA_UNSYNC      0x0040  /* clock unsynchronized (rw) */
#define STA_FREQHOLD    0x0080  /* hold frequency (rw) */

#define STA_PPSSIGNAL   0x0100  /* PPS signal present (ro) */
#define STA_PPSJITTER   0x0200  /* PPS signal jitter exceeded (ro) */
#define STA_PPSWANDER   0x0400  /* PPS signal wander exceeded (ro) */
#define STA_PPSERROR    0x0800  /* PPS signal calibration error (ro) */

#define STA_CLOCKERR    0x1000  /* clock hardware fault (ro) */
#define STA_NANO        0x2000  /* resolution (0 = us, 1 = ns) (ro) */
#define STA_MODE        0x4000  /* mode (0 = PLL, 1 = FLL) (ro) */
#define STA_CLK         0x8000  /* clock source (0 = A, 1 = B) (ro) */
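The check described above can be sketched as a small decoder for the status word. This is only an illustration: the function names are hypothetical, the status value is assumed to come from adjtimex (obtaining it is not shown), and only the STA_UNSYNC value is taken from the header quoted above.

```python
STA_UNSYNC = 0x0040  # clock unsynchronized, value from bits/timex.h


def kernel_clock_synchronized(status):
    """True if STA_UNSYNC is *not* set in the adjtimex status word."""
    return (status & STA_UNSYNC) == 0


def needs_manual_systohc(ntp_running, status):
    """Hypothetical decision: skip --systohc when ntp is running and
    the kernel reports a synchronized clock (eleven-minute mode)."""
    return not (ntp_running and kernel_clock_synchronized(status))


# STA_NANO | STA_PLL, STA_UNSYNC clear: kernel keeps the RTC updated.
print(needs_manual_systohc(True, 0x2001))
# STA_UNSYNC | STA_PLL: clock not synchronized, systohc still needed.
print(needs_manual_systohc(True, 0x0041))
```

Whether the kernel actually writes the RTC in this state also depends on the kernel configuration, which this sketch does not model.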
Comment 82 Stanislav Brabec 2015-06-26 16:18:28 UTC
Here is my simple proposal for systemd.spec for Factory:

+Obsoletes:      pm-utils <= 1.4.1
+Obsoletes:      suspend <= 1.0

-# PATCH-FIX-OPENSUSE forward to pm-utils -- until boo#904828 is addressed
-Patch25:        Forward-suspend-hibernate-calls-to-pm-utils.patch

-%patch25 -p1

https://build.opensuse.org/project/show/home:sbrabec:branches:systemd-drop-pm-utils


I wrote a long mail to opensuse-factory:
http://lists.opensuse.org/opensuse-factory/2015-06/msg00443.html


What needs to be done before doing that:
- Finish fix of NetworkManager bug 916737

What should be done:
- Make a global configuration for hwclock (splitting out new bug 936265).

What can be done:
- Port video quirk wrapper to systemd and create systemd-video-quirks-legacy
Comment 83 Stanislav Brabec 2015-06-29 19:34:43 UTC
Submitted the drop changes:
https://build.opensuse.org/request/show/314429
https://build.opensuse.org/request/show/314430
https://build.opensuse.org/request/show/314431

If everything runs OK, we could close all bugs opened against pm-utils.


If somebody requests it, I will create legacy packages porting some pm-utils quirks:

systemd-quirks-legacy

with sub-packages like

systemd-quirks-legacy-modules

and maybe later (if required)
systemd-quirks-legacy-video
systemd-quirks-legacy-rtc (that could be a part of systemd suspend.target, see bug 936265)
Comment 84 Bernhard Wiedemann 2015-06-30 08:00:13 UTC
This is an autogenerated message for OBS integration:
This bug (925873) was mentioned in
https://build.opensuse.org/request/show/314462 Factory / systemd
Comment 85 Benjamin Brunner 2015-07-06 13:24:17 UTC
@Wolfgang, I'm sorry for the late reply. The update for 13.2 should be ok. Could you open a maintenance request please? Thanks.
Comment 86 Wolfgang Bauer 2015-07-07 13:14:36 UTC
(In reply to Benjamin Brunner from comment #85)
> @Wolfgang, I'm sorry for the late reply. The update for 13.2 should be ok.
> Could you open a maintenancerequest please? Thanks.

Ok, done:
https://build.opensuse.org/request/show/315473
Comment 87 Swamp Workflow Management 2015-07-27 11:08:14 UTC
openSUSE-RU-2015:1291-1: An update that has one recommended fix can now be installed.

Category: recommended (moderate)
Bug References: 925873
CVE References: 
Sources used:
openSUSE 13.2 (src):    pm-utils-1.4.1-38.7.1
Comment 88 Franck Bui 2017-02-01 16:55:56 UTC
13.2 has reached EOL and is not supported anymore.

Feel free to open a new bug report if this still can be reproduced on
a newer/supported distro such as Leap or Tumbleweed.

Thanks.