Bug 913105 - S2RAM Resume Failure on OS 13.2 Nvidia GTX750 binary blob and Nouveau
Summary: S2RAM Resume Failure on OS 13.2 Nvidia GTX750 binary blob and Nouveau
Status: RESOLVED FIXED
Duplicates: 929020
Alias: None
Product: openSUSE Distribution
Classification: openSUSE
Component: Kernel
Version: 13.2
Hardware: x86-64 openSUSE 13.2
Importance: P5 - None : Major
Target Milestone: ---
Assignee: E-mail List
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-01-14 15:23 UTC by Mark Scott
Modified: 2015-07-11 04:31 UTC
CC List: 7 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
hare: needinfo? (markcscott2003)


Attachments
Hwinfo output (42.99 KB, text/plain)
2015-01-14 19:00 UTC, Mark Scott
DMESG ACPI Output For OS 3.12 K3.16 (7.99 KB, text/plain)
2015-01-21 11:26 UTC, Mark Scott

Description Mark Scott 2015-01-14 15:23:48 UTC
I recently experimented with kernel 3.18 on openSUSE 13.1, but that resulted in a failure to resume from S2RAM; see this forum post:

https://forums.opensuse.org/showthread.php/503846-Failure-to-resume-from-Suspend-to-RAM-Kernel-3-18-1-1-1-g5f2f35e-on-OS-13-1-x64

Oh well, I thought, time to move on to 13.2. However, this now exhibits the same behaviour as 13.1 + 3.18 kernel (13.1 + 3.11 kernel is OK): it will not resume from S2RAM and is hard locked, so only a reset will do. This was a clean install on a brand new SSD with /home copied over, so no cruft was left over.

I tried both the Nvidia blob from the repos and the 346.22 beta installed the hard way, but the problem persisted.

With the nouveau driver I experience the same issue as bug 

https://bugzilla.opensuse.org/show_bug.cgi?id=904483

With respect to the nvidia binary blob if I add the following to /usr/lib/pm-utils/defaults 

HIBERNATE_RESUME_POST_VIDEO="yes"
SLEEP_MODULE="kernel"

then I am reliably able to resume from suspend to RAM - so far only tested working on OS 13.2 with stock 3.16 kernel and nvidia blob 340.65 from the nvidia easy way repo.

For further info please see this OS forum thread:

https://forums.opensuse.org/showthread.php/504281-S2RAM-Resume-Failure-on-13-2

Thanks
Comment 1 Bernhard Wiedemann 2015-01-14 18:39:59 UTC
please attach output from
hwinfo --all

To me this sounds like a regression in the kernel
somewhere between 3.11 and 3.16
or video driver problems?
Comment 2 Mark Scott 2015-01-14 19:00:48 UTC
Created attachment 619583
Hwinfo output

Bug 913105 hwinfo
Comment 3 Mark Scott 2015-01-17 10:22:48 UTC
Hi Bernhard

Correction

Using HIBERNATE_RESUME_POST_VIDEO="yes" actually makes no difference. What seems to be happening is that I can resume from RAM maybe once or twice after a reboot, but then lockups occur on any further suspend/resume cycles. It is erratic and unpredictable as to when it does actually resume. More often than not I cannot resume from RAM.

Thanks
Comment 4 Takashi Iwai 2015-01-20 15:53:33 UTC
S3 is a tough problem.  Basically you need to either find out the culprit via git bisect, or try to trim down the buggy component by reducing the modules or configurations.

For the latter, set up the system with the reduced modules, i.e. without graphics (no nouveau, no nvidia, just a text console with the nomodeset vga=normal boot options).  Also, try dropping the pm-utils package in case it carries obsolete hooks.
In such a reduced system, try the S3 and see whether the kernel can resume (more or less) stably.
Comment 5 Mark Scott 2015-01-21 11:26:45 UTC
Created attachment 620319
DMESG ACPI Output For OS 3.12 K3.16

Hi Takashi and thank you for the reply. 

Please find attached the ACPI info from dmesg, just for info. I have a spare disk that I'll put a fresh OS 13.2 onto and use as a test rig as per your suggestions. In the meantime I've tried Fedora 21 with the same results. Looks like some hard work ahead for me, and a learning experience.
Comment 6 Mark Scott 2015-01-21 21:59:47 UTC
Good News ! 

I tested kernel 3.18.3-1.1.gc3e148f on both OS 13.2 and OS 13.1, and I can now reliably resume from suspend to RAM using the Nvidia binary blob (I could not before). This leads me to conclude that a kernel bugfix between K3.18.1 and K3.18.3 has resolved the issue. There is one caveat: the Nouveau driver does not resume from suspend to RAM with OS 13.1 + K3.18.1 or K3.18.3, nor with OS 13.2 + K3.16 or K3.18.3. I didn't test Nouveau with OS 13.1 + K3.11 as I always use the Nvidia blob.

If you need further info please let me know.

Thanks
Comment 7 Mark Scott 2015-01-22 00:14:39 UTC
Bad News and Good News !

The bad news is that my failure to resume from RAM is back now that I've connected everything together. The good news is I've found the cause: the pair of WD Black 2TB drives. If either or both of them are connected (see hwinfo output), resume from RAM fails. I've taken them out and tried lots of other drives in their place, and S3 resume works with everything but these drives; please note they are NTFS formatted. I've tried other drives formatted with both ext4 and NTFS in their place and they all work perfectly. These drives are definitely the root cause of my issue; now I need to find out why they worked under kernel 3.11 but not later. Any suggestions as to my next step would be appreciated.

The models are:

WD2003FZEX-00Z4SA0

and

WD2002FAEX-0

NB

BIOS is set to legacy boot and not UEFI.

The other drives I tried were all 1TB or less; I don't know if the 2TB size is important.
Comment 8 Takashi Iwai 2015-01-22 13:07:46 UTC
Good that you found out the cause!  Do you see any kernel messages relevant to these drives?

Also, with "no_console_suspend" boot option, you might see a bit more kernel messages at resume hang...
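
Before judging whether the option had any effect, it can help to confirm that it actually reached the kernel command line. The following is only a minimal sketch: the helper name has_boot_option is made up for illustration, and it simply checks a command-line string (for example the contents of /proc/cmdline) for the option, written either bare or in option=value form.

```python
def has_boot_option(cmdline: str, option: str) -> bool:
    """Return True if `option` is present on a kernel command line.

    Accepts both the bare form ("no_console_suspend") and the
    "option=value" form ("no_console_suspend=1").
    """
    for token in cmdline.split():
        # Compare only the part before the first '=' (if any).
        name = token.split("=", 1)[0]
        if name == option:
            return True
    return False


# Hypothetical command line, for illustration only.
example = "BOOT_IMAGE=/boot/vmlinuz root=/dev/sda2 no_console_suspend quiet"
print(has_boot_option(example, "no_console_suspend"))
```

On a running system the string would typically come from reading /proc/cmdline.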
Comment 9 Mark Scott 2015-01-22 14:29:43 UTC
Hi Takashi

A couple of things to note:

1) I've had a look at all the drives I have tested under Windows using CrystalDiskInfo, and curiously the pair of WD 2TB drives don't show the APM flag while all the others do.

2) I noticed both WD 2TB drives have Grub2 bootloaders in their MBRs - an overhang from previous configurations that I forgot to get rid of.

I've checked journalctl and there appear to be no kernel problems regarding these drives; if you want the logs I can post them.

I tried no_console_suspend=1 in the kernel boot parameters but it makes no difference. I assume I did that correctly?

Thanks
Comment 10 Takashi Iwai 2015-01-22 14:43:40 UTC
With no_console_suspend boot option, you should see some messages (shortly) at the beginning of the resume.  Didn't it happen?  Of course, it's possible that something goes wild before the graphics engine is reinitialized...
Comment 11 Mark Scott 2015-01-23 11:37:04 UTC
Hi Takashi

Yes, it does go BOOM: the graphics come up with about half a line and then freeze, the entire system locks hard, and only a physical power reset will do. In my opinion the best candidate for the cause is that APM is not supported by these two drives. This scenario used to be handled by K3.11, but a later commit has caused a regression. Unfortunately I have no other drives without APM support to test this theory. All other drives I have tested (about 6 in total, including other WD drives) do support APM and cause no problem when resuming from RAM.

I did find a thread over at Arch Linux about odd APM behaviour regarding WD drives that may be semi-relevant.

Just for info:

# smartctl -x /dev/sda | grep APM
APM feature is:   Unavailable

https://bbs.archlinux.org/viewtopic.php?id=159233
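
When several drives need to be compared, the same check can be scripted. This is a rough sketch that assumes smartctl prints a line of the form shown above ("APM feature is: ..."); the function name apm_status is hypothetical, and the sample text is just the one line quoted in this comment.

```python
import re
from typing import Optional

# Matches the "APM feature is:" line as quoted above and captures its value.
APM_LINE = re.compile(r"^APM feature is:[ \t]*(\S.*?)[ \t]*$", re.MULTILINE)


def apm_status(smartctl_output: str) -> Optional[str]:
    """Return the text after 'APM feature is:', or None if the line is absent."""
    match = APM_LINE.search(smartctl_output)
    return match.group(1) if match else None


sample = "APM feature is:   Unavailable\n"
print(apm_status(sample))  # prints: Unavailable
```

The input would normally be the captured output of smartctl -x for each device, which usually requires root.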

Thanks
Comment 12 Takashi Iwai 2015-01-23 12:59:37 UTC
(In reply to Mark Scott from comment #11)
> Hi Takashi
> 
> Yes it does go BOOM but the graphics does come up with about half a line and
> then freezes, the entire system locks hard and only a physical power reset
> will do. I'm of the opinion best candidate causing the issue is to do with
> the fact that APM is not supported by these two drives and this scenario
> used to be handled by K3.11 however a commit beyond this kernel has caused a
> regression, unfortunately I have no other drives that do not support APM to
> test this theory. All other drives I have tested (about 6 in total including
> other WD) do support APM and cause no problem when resuming from ram. 
> 
> I did find a thread over at Arch Linux into odd APM behaviour regarding WD
> drives that may be semi-relevant.
> 
> Just for info:
> 
> # smartctl -x /dev/sda | grep APM
> APM feature is:   Unavailable
> 
> https://bbs.archlinux.org/viewtopic.php?id=159233
> 
> Thanks

Maybe Hannes has a better clue...   Hannes, any known PM issue with such a disk?
Comment 13 Mark Scott 2015-01-23 15:28:25 UTC
Hi All,

I think in all probability this is a kernel bug regression so I'm of the mind to open a bug report at the Kernel Bug Tracker.

Thanks
Comment 14 Mark Scott 2015-01-23 16:07:21 UTC
Bug filed at Kernel Bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=91921
Comment 15 Mark Scott 2015-01-23 16:40:56 UTC
Uprated the bug to major as anything that causes a hard crash is major.
Comment 16 Mark Scott 2015-01-23 17:31:24 UTC
Hi,

I think I may have found someone else with a similar bug:

https://bugzilla.opensuse.org/show_bug.cgi?id=913885

They are using a WDC WD30EFRX, a 3TB WD Red, which also does not support APM. As it's 3TB they are using UEFI boot as opposed to my legacy boot.
Comment 17 Mark Scott 2015-01-24 10:14:05 UTC
Hi Takashi,

I noted you are working on this bug:

https://bugzilla.suse.com/show_bug.cgi?id=904483

which has also been passed to here:

https://bugs.freedesktop.org/show_bug.cgi?id=86115

When checking the hardware in this bug report I noted he has a WD drive which also does not support APM:

WDC WD20EARS-00MVWB

Do you think there is a possible connection ?

I noted your comment here:

"One is the broken graphics by S3, and another is the kernel panic due to the stall of disk PM"

Thanks
Comment 18 Mark Scott 2015-01-24 14:33:39 UTC
A little bit more info

Using the debug method:

sh -c "sync && echo 1 > /sys/power/pm_trace && pm-suspend"

The magic number returned is:

[    2.941809]   Magic number: 0:278:890
[    2.961872]   hash matches ../drivers/base/power/main.c:736

however this does not match any PCI device, so using:

cat /sys/power/pm_trace_dev_match

The resulting output is:

bsg
scsi_device
scsi_disk
sd
pci
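
For repeated test runs, the two pm_trace lines can be pulled out of a saved dmesg log automatically. The sketch below assumes the exact line shapes quoted above ("Magic number: a:b:c" and "hash matches file:line"); the function name parse_pm_trace and the returned dictionary layout are invented for illustration.

```python
import re
from typing import Dict, Optional, Tuple

MAGIC = re.compile(r"Magic number:\s*(\d+):(\d+):(\d+)")
HASH_FILE = re.compile(r"hash matches\s+(\S+):(\d+)")


def parse_pm_trace(dmesg_text: str) -> Dict[str, Optional[Tuple]]:
    """Extract the pm_trace magic number and the matched source location."""
    result: Dict[str, Optional[Tuple]] = {"magic": None, "source": None}
    magic = MAGIC.search(dmesg_text)
    if magic:
        result["magic"] = tuple(int(part) for part in magic.groups())
    source = HASH_FILE.search(dmesg_text)
    if source:
        # (file path, line number) of the last trace point reached.
        result["source"] = (source.group(1), int(source.group(2)))
    return result


sample = (
    "[    2.941809]   Magic number: 0:278:890\n"
    "[    2.961872]   hash matches ../drivers/base/power/main.c:736\n"
)
print(parse_pm_trace(sample))
```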

Thanks
Comment 19 Mark Scott 2015-01-25 21:33:35 UTC
Hi All,

I've managed to get my hands on a Western Digital WD10EADS-65M280, which is a 1TB Caviar Green that does not support APM. Testing the system with this drive attached, resume from suspend to RAM was successful, so lack of APM does appear to be a universal factor with WD drives. Stumped for my next move.

Thanks
Comment 20 Mark Scott 2015-01-25 21:36:50 UTC
Oops, correction:

I've managed to get my hands on a Western Digital WD10EADS-65M280, which is a 1TB Caviar Green that does not support APM. Testing the system with this drive attached, resume from suspend to RAM was successful, so lack of APM does not appear to be a universal factor with WD drives. Stumped for my next move.

the correction was "does NOT appear"
Comment 21 Mark Scott 2015-01-26 22:41:01 UTC
With reference to comment 18, pm_trace was performed on openSUSE 13.2 using kernel 3.16.7.

Thanks
Comment 22 Björn Voigt 2015-01-26 23:01:22 UTC
It may be worth testing other kernel versions too, like 3.18.3 (from the Kernel_stable repository) or the latest 3.12.x kernel (from www.kernel.org).

I never had any issues with suspend-to-ram and suspend-to-disk with non-UEFI booting and Kernel 3.12.x on my PC. Problems started with UEFI and with Kernels >= 3.13.
Comment 23 Björn Voigt 2015-01-27 08:30:01 UTC
Today I had a SATA drive exception after resume from suspend-to-disk. I am relatively sure that the exception occurred after the first I/O operation on this disk. (My Seamonkey profile uses the drive as cache and for some mail folders. I mount the EXT4 filesystem with option errors=remount-ro.)

Is this exception power-management related?

Setup:
- Kernel 3.12.36
- Mainboard & BIOS: DH55TC, BIOS TCIBX10H.86A.0048.2011.1206.1342 12/06/2011
- Second drive: ata4.00: ATA-9: WDC WD30EFRX-68EUZN0, 80.00A80, max UDMA/133
  (Western Digital Red 3 TB)

The exception (from /var/log/messages):
2015-01-27T09:07:15.986794+01:00 mybox sudo: pam_unix(sudo:session): session closed for user root
2015-01-27T09:07:17.087711+01:00 mybox kernel: [59711.988610] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
2015-01-27T09:07:17.087729+01:00 mybox kernel: [59711.988615] ata4.00: irq_stat 0x40000001
2015-01-27T09:07:17.087730+01:00 mybox kernel: [59711.988620] ata4.00: failed command: READ DMA EXT
2015-01-27T09:07:17.087732+01:00 mybox kernel: [59711.988629] ata4.00: cmd 25/00:08:b0:1d:ed/00:00:a0:00:00/e0 tag 7 dma 4096 in
2015-01-27T09:07:17.087733+01:00 mybox kernel: [59711.988629]          res 51/40:08:b0:1d:ed/00:00:a0:00:00/e0 Emask 0x9 (media error)
2015-01-27T09:07:17.087735+01:00 mybox kernel: [59711.988633] ata4.00: status: { DRDY ERR }
2015-01-27T09:07:17.087736+01:00 mybox kernel: [59711.988636] ata4.00: error: { UNC }
2015-01-27T09:07:17.090989+01:00 mybox kernel: [59711.993572] ata4.00: configured for UDMA/133
2015-01-27T09:07:17.090993+01:00 mybox kernel: [59711.993579] sd 3:0:0:0: [sdb] Unhandled sense code
2015-01-27T09:07:17.090994+01:00 mybox kernel: [59711.993580] sd 3:0:0:0: [sdb]  
2015-01-27T09:07:17.090994+01:00 mybox kernel: [59711.993581] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
2015-01-27T09:07:17.090995+01:00 mybox kernel: [59711.993582] sd 3:0:0:0: [sdb]  
2015-01-27T09:07:17.090995+01:00 mybox kernel: [59711.993584] Sense Key : Medium Error [current] [descriptor]
2015-01-27T09:07:17.090996+01:00 mybox kernel: [59711.993586] Descriptor sense data with sense descriptors (in hex):
2015-01-27T09:07:17.090996+01:00 mybox kernel: [59711.993590]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
2015-01-27T09:07:17.090997+01:00 mybox kernel: [59711.993595]         a0 ed 1d b0 
2015-01-27T09:07:17.090997+01:00 mybox kernel: [59711.993597] sd 3:0:0:0: [sdb]  
2015-01-27T09:07:17.090998+01:00 mybox kernel: [59711.993598] Add. Sense: Unrecovered read error - auto reallocate failed
2015-01-27T09:07:17.090998+01:00 mybox kernel: [59711.993599] sd 3:0:0:0: [sdb] CDB: 
2015-01-27T09:07:17.090999+01:00 mybox kernel: [59711.993600] Read(16): 88 00 00 00 00 00 a0 ed 1d b0 00 00 00 08 00 00
2015-01-27T09:07:17.090999+01:00 mybox kernel: [59711.993606] end_request: I/O error, dev sdb, sector 2699894192
2015-01-27T09:07:17.091000+01:00 mybox kernel: [59711.993619] ata4: EH complete
2015-01-27T09:07:17.111065+01:00 mybox kernel: [59712.011555] EXT4-fs error (device sdb6): ext4_find_entry:1303: inode #50857019: comm seamonkey-bin: reading directory lblock 0
2015-01-27T09:07:17.111081+01:00 mybox kernel: [59712.011560] Aborting journal on device sdb6-8.
2015-01-27T09:07:17.111083+01:00 mybox kernel: [59712.011737] EXT4-fs (sdb6): Remounting filesystem read-only
2015-01-27T09:09:34.637633+01:00 mybox kernel: [59849.546908] usb 2-1.4: usbfs: process 3898 (amarok) did not claim interface 0 before use
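
As a consistency check on the log above, the failing sector reported by end_request can be compared with the LBA encoded in the READ(16) command bytes. This is only an illustrative sketch with made-up helper names; it relies on the standard READ(16) layout, in which byte 0 is the opcode 0x88 and bytes 2 to 9 carry the 64-bit big-endian logical block address.

```python
import re
from typing import Optional, Tuple


def lba_from_read16_cdb(cdb_hex: str) -> int:
    """Decode the logical block address from a READ(16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    # Bytes 2..9 hold the LBA, most significant byte first.
    return int.from_bytes(cdb[2:10], "big")


def failed_sector(log_text: str) -> Optional[Tuple[str, int]]:
    """Return (device, sector) from an 'end_request: I/O error' line, if present."""
    match = re.search(r"I/O error, dev (\w+), sector (\d+)", log_text)
    return (match.group(1), int(match.group(2))) if match else None


cdb = "88 00 00 00 00 00 a0 ed 1d b0 00 00 00 08 00 00"
line = "end_request: I/O error, dev sdb, sector 2699894192"
print(lba_from_read16_cdb(cdb))  # prints: 2699894192
print(failed_sector(line))       # prints: ('sdb', 2699894192)
```

Both values agree, so the read that failed is the one the kernel logged as the I/O error on sdb.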
Comment 24 Mark Scott 2015-01-29 16:27:27 UTC
Hi All

I've done some further testing and obtained some error messages which I don't fully understand.

See kernel bug report for info:

https://bugzilla.kernel.org/show_bug.cgi?id=91921

They are related to the Intel Corporation 82801 PCI Bridge.

There was a discussion about problems of how VT-d boards are handled resuming from suspend to RAM with Linus T. chipping in himself.

http://www.gossamer-threads.com/lists/linux/kernel/1894740

I noted a common factor with the two other bug reports I referenced: they all involve VT-d boards.
Comment 25 Mark Scott 2015-01-31 14:42:42 UTC
Hi Takashi,

I've done some more Kernel version testing on this bug and I can reproduce exactly the same results as this bug report:

https://bugs.freedesktop.org/show_bug.cgi?id=86115

So the last known good kernel is 3.12.37 and the first known bad Kernel is 3.13.7

I think it's safe to assume that https://bugzilla.suse.com/show_bug.cgi?id=904483 is a duplicate of this bug.


Thanks
Comment 26 Mark Scott 2015-02-02 10:03:12 UTC
Hi Takashi,

I need some help with some investigation work into this bug as the things I need to progress this are currently above my level of understanding and skill set.

please see kernel bug report comments 13 & 14.

Thanks
Comment 27 Takashi Iwai 2015-02-02 14:51:00 UTC
(In reply to Mark Scott from comment #26)
> Hi Takashi,
> 
> I need some help with some investigation work into this bug as the things I
> need to progress this are currently above my level of understanding and
> skill set.
> 
> please see kernel bug report comments 13 & 14.

You need to build your kernel from scratch, and find out the regression point via git bisect.  Sorry, it's not what we can provide as a package.  You need to learn how to compile and install a kernel manually.  A brief instruction can be found in the bugzilla comment below, for example:
   https://apibugzilla.novell.com/show_bug.cgi?id=907368#c36
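
The idea behind git bisect is a binary search between a known good and a known bad point. The sketch below is a simplified illustration, not what git does internally: it assumes a plain ordered list rather than a commit graph, the function name first_bad is hypothetical, and the is_bad callback stands in for "build this kernel, boot it, and see whether resume from S3 hangs".

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Find the first bad entry in a list ordered from oldest to newest.

    Assumes versions[0] is known good and versions[-1] is known bad,
    and that every entry after the first bad one is also bad.
    """
    if len(versions) < 2:
        raise ValueError("need at least one good and one bad version")
    good, bad = 0, len(versions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(versions[mid]):
            bad = mid   # regression is at mid or earlier
        else:
            good = mid  # regression is after mid
    return versions[bad]


# Hypothetical commit labels; pretend the regression arrived in "c6".
commits = [f"c{i}" for i in range(10)]
print(first_bad(commits, lambda c: int(c[1:]) >= 6))  # prints: c6
```

Each test halves the remaining range, which is why even a range of thousands of commits needs only a dozen or so kernel builds.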
Comment 28 Mark Scott 2015-02-02 16:37:13 UTC
Hi Takashi,

Thanks for the link, I'll give it a go and wish me luck. I'm going to have one grumpy wife and a big coffee bill :)

Do you think I should open a separate bug report on this as we have two issues here ?

1) Resume busted by Nouveau

2) Resume busted by unknown WD disk problem.

Issue 1 has already been reported upstream so you might as well consider that part of the bug report closed as it's moved upstream.

Issue 2 - New bug report ?

Many Thanks
Comment 29 Hannes Reinecke 2015-05-27 09:29:34 UTC
Sorry, no idea here.
It's not totally unfeasible that the WD drives have firmware issues; it would be really helpful if we could get a dmesg output when the devices are locked.
But only if there are some ATA messages found; if the output is identical to the working one there's no point in trying.
Can you please attach the dmesg output / screenshot from the failing case?
Comment 30 Mark Scott 2015-06-04 09:44:00 UTC
Hi Hannes

Thanks for the reply. A git bisect is needed to nail it, but I've just not had the time. You cannot get dmesg output as the system hard locks before any relevant messages can be obtained. Yes, it's one of those, unfortunately.

See here for info:

https://bugzilla.kernel.org/show_bug.cgi?id=91921

It works fine under Windows and worked before kernel 3.13.7 so it's a Kernel issue.
Comment 31 Takashi Iwai 2015-06-12 13:37:44 UTC
Could you try the oS 13.2 test kernel in OBS home:tiwai:bnc934397 repo?
This worked for a similar bug report with WD HD.
Comment 32 Mark Scott 2015-06-13 09:45:59 UTC
Takashi, you are a genius! May you be blessed with good fortune for the next one hundred years; this has resolved the issue. Please can you let me know what the issue was so I can report it upstream to the kernel and close off the bug report I have there?

Many thanks

Dragon 32
Comment 33 Takashi Iwai 2015-06-15 11:52:29 UTC
The fix for WD disks has been merged to 13.2, stable and master branches (for bug 934397).
Comment 34 Stephan Barth 2015-07-11 04:31:52 UTC
*** Bug 929020 has been marked as a duplicate of this bug. ***