Bug 934397

Summary: Resume from suspend to ram fails when HDD is connected
Product: [openSUSE] openSUSE Distribution
Reporter: Dario Savella <dario>
Component: Kernel
Assignee: E-mail List <kernel-maintainers>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Normal
Priority: P5 - None
CC: dario, hare, jslaby, tiwai
Version: 13.2
Flags: tiwai: needinfo? (dario)
Target Milestone: ---
Hardware: x86-64
OS: openSUSE 13.2
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: journal during successful resume

Description Dario Savella 2015-06-11 13:28:22 UTC
During resume from suspend, the display content reappears correctly, but after a short pause (2-3 secs) the kernel panics: the keyboard becomes unresponsive and its lights flash.
Reset/power cycle is the only way forward.

After a long series of tests, I discovered that if I disconnect the extra drive WDC WD30EZRX-00M (3TB), the desktop can resume correctly every time!

The additional drive is for storage only. It contains only a Samba share, and it was unmounted (though still powered and connected).

I persisted with the tests only because I recently noticed that both Ubuntu and Debian (in their most recent versions), which I installed on a third drive (also a WD Caviar Green), can resume from suspend without problems.

Only the default graphic driver is installed:
bigboy:~ # lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation 82G965 Integrated Graphics Controller (rev 02)

Even though I can't seem to get the correct debuginfo package installed to use crash, I was able to read the dmesg saved by kdump (excerpt):

[ 165.133021] ata4: SATA link down (SStatus 0 SControl 300)
[ 165.133038] ata3: SATA link down (SStatus 0 SControl 300)
[ 165.135012] ata6: SATA link down (SStatus 0 SControl 300)
[ 166.052486] ata7.00: configured for UDMA/33
[ 169.469022] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 169.470705] ata1.00: configured for UDMA/133
[ 170.182017] ata5: link is slow to respond, please be patient (ready=0)
[ 170.183010] ata2: link is slow to respond, please be patient (ready=0)
[ 174.824010] ata2: COMRESET failed (errno=-16)
[ 174.874021] ata5: COMRESET failed (errno=-16)
[ 175.129019] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 175.281767] ata2.00: configured for UDMA/133

Since then I have tried a different cable and a different SATA port, with no change: if that drive is connected, resume fails.
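
For comparing logs from different cables, ports or kernels, it can help to pull out just the ATA link-recovery complaints. The following is a minimal sketch, not a diagnostic tool: the function name ata_link_problems is made up for illustration, and it assumes the log has been saved to a plain-text file and that the message wording matches the excerpt above.

```shell
# Print only the slow-link and COMRESET-failure lines from a saved kernel log.
# Usage: ata_link_problems FILE
# FILE is a hypothetical path, e.g. the dmesg text saved by kdump.
ata_link_problems() {
    grep -E 'ata[0-9]+(\.[0-9]+)?: (link is slow to respond|COMRESET failed)' "$1"
}
```

Run against the excerpt above, this would keep the four lines for ata2 and ata5 and drop the normal "link up", "link down" and "configured for" messages.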

I would be happy to provide more info if necessary.

Thanks for your help
Comment 1 Takashi Iwai 2015-06-11 13:55:39 UTC
Sounds like a dup of bug 913105.  All reports involve WD hard disks.
If it's the same bug, the bug was introduced somewhere in 3.13.

To be sure, could you check whether a recent kernel still has the same problem?  For example, try the 4.0.x kernel in the OBS Kernel:stable repo.
Comment 2 Takashi Iwai 2015-06-11 14:00:37 UTC
Also, try the SLE12 kernel, found in the OBS Kernel:SLE12 repo, too.  It's based on 3.12.x, so if this is the same bug as bug 913105, this kernel may survive.
Comment 3 Dario Savella 2015-06-11 17:16:32 UTC
I tried with kernel 4.0.5-1.gf4cd21b-desktop and the problem is still there.
I do have a crash dump if it helps (I can only provide the files, not any skills to look at them).

The dmesg from the crash says:

[   73.957016] ata3: SATA link down (SStatus 0 SControl 300)
[   73.961017] ata4: SATA link down (SStatus 0 SControl 300)
[   73.961030] ata6: SATA link down (SStatus 0 SControl 300)
[   74.114023] usb 5-1: reset low-speed USB device number 2 using uhci_hcd
[   74.880484] ata7.00: configured for UDMA/33
[   75.748492] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[   78.244018] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[   78.245701] ata1.00: configured for UDMA/133
[   79.002184] ata5: link is slow to respond, please be patient (ready=0)
[   79.012012] ata2: link is slow to respond, please be patient (ready=0)
[   83.031024] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[   83.184936] ata5.00: configured for UDMA/133
[   83.704007] ata2: COMRESET failed (errno=-16)
[   85.539018] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[   85.663005] sr 6:0:0:0: **** DPM device timeout ****
[   85.663014]  ffff880124913c68 ffff880126430450 ffff88012490a5d0 ffffffff8168289e
[   85.663015] sd 0:0:0:0: **** DPM device timeout ****
[   85.663018]  ffff880124913fd8
[   85.663019]  ffff8801260fbc68
[   85.663020]  0000000000000010
[   85.663021]  ffff88012603e250
[   85.663022]  ffffffff814a6030
[   85.663023]  ffff8800cabea190
[   85.663024]  0000000000000000
[   85.663024]  000000000000f630
[   85.663025] 
[   85.663025] 
[   85.663027]  ffffffff81a8e160
[   85.663028]  ffff8801260fbfd8
[   85.663029]  ffff880124913c88
[   85.663030]  0000000000000010
[   85.663031]  ffffffff8167f1c7
[   85.663031]  ffffffff814a6030
[   85.663032]  0000000000000286
[   85.663033]  0000000000000000
[   85.663033] 
[   85.663034] 
[   85.663035] Call Trace:
[   85.663038]  ffffffff81a8e160 ffff8801260fbc88 ffffffff8167f1c7 0000000000000286
[   85.663039] Call Trace:
[   85.663051]  [<ffffffff8167f1c7>] schedule+0x37/0x90
[   85.663057]  [<ffffffff8167f1c7>] schedule+0x37/0x90
[   85.663061]  [<ffffffff81083f15>] async_synchronize_cookie_domain+0x55/0x130
[   85.663066]  [<ffffffff81083f15>] async_synchronize_cookie_domain+0x55/0x130
[   85.663070]  [<ffffffff814a5fc4>] scsi_bus_resume_common+0xa4/0xd0
[   85.663074]  [<ffffffff814a5fc4>] scsi_bus_resume_common+0xa4/0xd0
[   85.663078]  [<ffffffff8147a89a>] dpm_run_callback+0x4a/0x150
[   85.663081]  [<ffffffff8147a89a>] dpm_run_callback+0x4a/0x150
[   85.663084]  [<ffffffff8147ae7b>] device_resume+0x10b/0x240
[   85.663086]  [<ffffffff8147ae7b>] device_resume+0x10b/0x240
[   85.663088]  [<ffffffff8147afc9>] async_resume+0x19/0x40
[   85.663090]  [<ffffffff8147afc9>] async_resume+0x19/0x40
[   85.663092]  [<ffffffff81083d13>] async_run_entry_fn+0x43/0x150
[   85.663094]  [<ffffffff81083d13>] async_run_entry_fn+0x43/0x150
[   85.663098]  [<ffffffff8107bcf2>] process_one_work+0x142/0x420
[   85.663102]  [<ffffffff8107bcf2>] process_one_work+0x142/0x420
[   85.663104]  [<ffffffff8107c0e4>] worker_thread+0x114/0x460
[   85.663106]  [<ffffffff8107c0e4>] worker_thread+0x114/0x460
[   85.663108]  [<ffffffff81081261>] kthread+0xc1/0xe0
[   85.663111]  [<ffffffff81081261>] kthread+0xc1/0xe0
[   85.663114]  [<ffffffff816830d8>] ret_from_fork+0x58/0x90
[   85.663117]  [<ffffffff816830d8>] ret_from_fork+0x58/0x90
[   85.663118] Kernel panic - not syncing: sr 6:0:0:0: unrecoverable failure
Comment 4 Dario Savella 2015-06-12 06:53:21 UTC
I can confirm that kernel 3.12.43-15.g537dcf2-default from SLE12 does resume correctly when the drive is connected.
Comment 5 Takashi Iwai 2015-06-12 09:58:51 UTC
(In reply to Dario Savella from comment #3)
> I tried with kernel 4.0.5-1.gf4cd21b-desktop and the problem is still there.
> I do have a crash dump if it helps (I can only provide the files, not any
> skills to look at them).
> 
> The dmesg from the crash says:
> 
> [   73.957016] ata3: SATA link down (SStatus 0 SControl 300)
> [   73.961017] ata4: SATA link down (SStatus 0 SControl 300)
> [   73.961030] ata6: SATA link down (SStatus 0 SControl 300)
> [   74.114023] usb 5-1: reset low-speed USB device number 2 using uhci_hcd
> [   74.880484] ata7.00: configured for UDMA/33
> [   75.748492] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [   78.244018] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [   78.245701] ata1.00: configured for UDMA/133
> [   79.002184] ata5: link is slow to respond, please be patient (ready=0)
> [   79.012012] ata2: link is slow to respond, please be patient (ready=0)
> [   83.031024] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [   83.184936] ata5.00: configured for UDMA/133
> [   83.704007] ata2: COMRESET failed (errno=-16)
> [   85.539018] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [   85.663005] sr 6:0:0:0: **** DPM device timeout ****

So this looks like the cause.  Recent kernels have a watchdog for async resume workers, and if it expires, the kernel panics.  This explains why 3.12 worked; the watchdog was introduced in 3.13.

The timeout length is unfortunately fixed in Kconfig, set to 12 seconds by default.  And this seems too short.  We should extend this to at least a minute, I suppose.  Also, it'd be better if it were dynamically configurable.

The openSUSE-13.2 test kernel packages with the timeout extended to 60 seconds are being built in the OBS home:tiwai:bnc934397 repo.  Could you give it a try?  It will take some time until the build finishes.
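
To see which watchdog settings a given kernel was built with, one can look at its config file. This is a minimal sketch under assumptions: the option names CONFIG_DPM_WATCHDOG and CONFIG_DPM_WATCHDOG_TIMEOUT are taken from the upstream Kconfig and are not named in this report, the config is assumed to live at /boot/config-<release>, and the function name dpm_watchdog_config is made up for illustration.

```shell
# Print the DPM watchdog settings from a kernel config file.
# Usage: dpm_watchdog_config [FILE]
# Without FILE, the running kernel's config in /boot is assumed to exist.
dpm_watchdog_config() {
    # Matches "CONFIG_...=value" as well as "# CONFIG_... is not set".
    grep -E '^(# )?CONFIG_DPM_WATCHDOG(_TIMEOUT)?[ =]' "${1:-/boot/config-$(uname -r)}"
}
```

If nothing is printed, the option is probably absent from that kernel, which would be consistent with a kernel that predates the watchdog.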
Comment 6 Dario Savella 2015-06-12 10:40:55 UTC
I will give it a try.
While we wait for the build, could you help me to add the repository you mention ?
I can add repositories, but now with the notation you are using.
Comment 7 Dario Savella 2015-06-12 10:46:06 UTC
(In reply to Dario Savella from comment #6)
> I will give it a try.
> While we wait for the build, could you help me to add the repository you
> mention ?
> I can add repositories, but now with the notation you are using.

...but NOT with...
Comment 8 Takashi Iwai 2015-06-12 10:47:51 UTC
(In reply to Dario Savella from comment #6)
> I will give it a try.
> While we wait for the build, could you help me to add the repository you
> mention ?
> I can add repositories, but now with the notation you are using.

  osc ar obs://home:/tiwai:/bnc934397/standard test-kernel

The build seems already finished, but not published yet.  Meanwhile you can get binaries directly via
  osc getbinaries home:tiwai:bnc934397/kernel-desktop/standard/x86_64
Comment 9 Takashi Iwai 2015-06-12 10:49:07 UTC
(In reply to Takashi Iwai from comment #8)
> (In reply to Dario Savella from comment #6)
> > I will give it a try.
> > While we wait for the build, could you help me to add the repository you
> > mention ?
> > I can add repositories, but now with the notation you are using.
> 
>   osc ar obs://home:/tiwai:/bnc934397/standard test-kernel

Sorry, it's zypper, instead of osc, of course.
 
> The build seems already finished, but not published yet.  Meanwhile you can
> get binaries directly via
>   osc getbinaries home:tiwai:bnc934397/kernel-desktop/standard/x86_64

This one is with osc.  Using osc directly, you can download unpublished packages, too.
Comment 10 Takashi Iwai 2015-06-12 12:23:42 UTC
... and now the project is published; the packages can be downloaded from
    http://download.opensuse.org/repositories/home:/tiwai:/bnc934397/standard/
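
The colon notation for an OBS project maps onto the download URL by turning each ":" into ":/". A small sketch of that mapping, assuming the usual download.opensuse.org layout seen in the URL above; the function name obs_repo_url is made up for illustration.

```shell
# Turn an OBS project name and repository name into a download URL.
# Usage: obs_repo_url PROJECT REPOSITORY
obs_repo_url() {
    # Each ":" in the project name becomes ":/" in the URL path.
    printf 'http://download.opensuse.org/repositories/%s/%s/\n' \
        "$(printf '%s' "$1" | sed 's|:|:/|g')" "$2"
}

obs_repo_url home:tiwai:bnc934397 standard
```

For this project the result is the URL given above, so the repository could then be added as root with something like: zypper ar "$(obs_repo_url home:tiwai:bnc934397 standard)" test-kernel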
Comment 11 Dario Savella 2015-06-12 13:17:51 UTC
Good news I suppose. The new kernel resumes from suspend with the drive connected and/or mounted.
Just to be sure I installed the right thing:

bigboy:~ # cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-3.16.7-1.gf382e20-desktop root=UUID=b914b0a0-964b-4fa7-91c9-a0f8f0b57fcf quiet resume=/dev/sda1 splash=silent quiet showopts crashkernel=256M-:128M clocksource=tsc vga=792

Is there anything else you need from my end ?

What's the release cycle for these fixes?
I'm in no hurry, but I'd like to know when to stop worrying about resume.
Comment 12 Takashi Iwai 2015-06-12 13:22:21 UTC
(In reply to Dario Savella from comment #11)
> Good news I suppose. The new kernel resumes from suspend with the drive
> connected and/or mounted.
> Just to be sure I installed the right thing:
> 
> bigboy:~ # cat /proc/cmdline 
> BOOT_IMAGE=/boot/vmlinuz-3.16.7-1.gf382e20-desktop
> root=UUID=b914b0a0-964b-4fa7-91c9-a0f8f0b57fcf quiet resume=/dev/sda1
> splash=silent quiet showopts crashkernel=256M-:128M clocksource=tsc vga=792
> 
> Is there anything else you need from my end ?

Could you give the kernel messages after resume with the new kernel?
 
> What's the release cycle for these fixes?
> I'm in no hurry, but I'd like to know when to stop worrying about resume.

The change should be safe, so I can take it soon.  But the official update release for openSUSE 13.2 may take some time, as usual.
Comment 13 Dario Savella 2015-06-12 14:40:32 UTC
Created attachment 637702
journal during successful resume
Comment 14 Mark Scott 2015-06-13 15:22:08 UTC
Upstream informed, please see bug report https://bugzilla.kernel.org/show_bug.cgi?id=91921
Comment 15 Jiri Slaby 2015-06-15 09:22:34 UTC
Pushed:
   0e899eb6113c..b5e86cc44ede  stable^ -> stable
Comment 16 Takashi Iwai 2015-06-15 11:50:55 UTC
The fix has been merged to 13.2, stable and master branches.  Let's close.
Comment 17 Dario Savella 2015-06-15 17:23:32 UTC
Sure. Thanks a lot for your help.
Comment 18 Swamp Workflow Management 2015-08-14 09:14:05 UTC
openSUSE-SU-2015:1382-1: An update that solves 21 vulnerabilities and has 8 fixes is now available.

Category: security (important)
Bug References: 907092,907714,915517,916225,919007,919596,921769,922583,925567,925961,927786,928693,929624,930488,930599,931580,932348,932844,933934,934202,934397,934755,935530,935542,935705,935913,937226,938976,939394
CVE References: CVE-2014-9728,CVE-2014-9729,CVE-2014-9730,CVE-2014-9731,CVE-2015-1420,CVE-2015-1465,CVE-2015-2041,CVE-2015-2922,CVE-2015-3212,CVE-2015-3290,CVE-2015-3339,CVE-2015-3636,CVE-2015-4001,CVE-2015-4002,CVE-2015-4003,CVE-2015-4036,CVE-2015-4167,CVE-2015-4692,CVE-2015-4700,CVE-2015-5364,CVE-2015-5366
Sources used:
openSUSE 13.2 (src):    bbswitch-0.8-3.11.1, cloop-2.639-14.11.1, crash-7.0.8-11.1, hdjmod-1.28-18.12.1, ipset-6.23-11.1, kernel-debug-3.16.7-24.1, kernel-default-3.16.7-24.1, kernel-desktop-3.16.7-24.1, kernel-docs-3.16.7-24.2, kernel-ec2-3.16.7-24.1, kernel-obs-build-3.16.7-24.2, kernel-obs-qa-3.16.7-24.1, kernel-obs-qa-xen-3.16.7-24.1, kernel-pae-3.16.7-24.1, kernel-source-3.16.7-24.1, kernel-syms-3.16.7-24.1, kernel-vanilla-3.16.7-24.1, kernel-xen-3.16.7-24.1, pcfclock-0.44-260.11.1, vhba-kmp-20140629-2.11.1, xen-4.4.2_06-25.1, xtables-addons-2.6-11.1