Bug 1034357

Summary: Kernel 4.10.9: Computer Intermittently Reboots
Product: [openSUSE] openSUSE Tumbleweed
Reporter: John Shand <jshand2013>
Component: Kernel
Assignee: E-mail List <kernel-maintainers>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Major
Priority: P5 - None
CC: bpetkov, jshand2013, Larry.Finger, tiwai
Version: Current
Flags: bpetkov: needinfo? (jshand2013)
Target Milestone: ---
Hardware: x86-64
OS: Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: DMESG with errors in certain areas
Hardware Information
journalctl -b information
iomem contents

Description John Shand 2017-04-17 03:47:20 UTC
Created attachment 721379 [details]
DMESG with errors in certain areas

i was doing git downloads, watching youtube, chatting on skype and a few other things when the computer suddenly rebooted by itself.  after the reboot i found the errors below.  i have very little understanding of the problem itself, but i have a few error codes.

mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: b600000000070f0f
mce: [Hardware Error]: TSC 0 ADDR fea00040
mce: [Hardware Error]: PROCESSOR 2:610f31 TIME 1492398286 SOCKET 0 APIC 0 microcode 6001119

i also checked the mcelog and got:

jshand@linux-zkok:~> sudo mcelog --client
Memory errors
SOCKET 0 CHANNEL 0 DIMM 0
DMI_NAME "Node0_Dimm1" DMI_LOCATION "Node0_Bank0"
corrected memory errors:
        0 total
        0 in 24h
uncorrected memory errors:
        0 total
        0 in 24h

dmesg info is added as a file

i hope this helps
Comment 1 John Shand 2017-04-17 03:49:10 UTC
Created attachment 721382 [details]
Hardware Information
Comment 2 John Shand 2017-04-17 03:51:30 UTC
let me know if there is any other information you need
Comment 3 John Shand 2017-04-17 05:01:15 UTC
Created attachment 721384 [details]
journalctl -b information
Comment 4 John Shand 2017-04-17 05:10:54 UTC
Hardware event. This is not a software error.
mcelog[1020]: MCE 0
mcelog[1020]: CPU 0 BANK 4
mcelog[1020]: ADDR fea00040
mcelog[1020]: TIME 1492398286 Mon Apr 17 15:04:46 2017
mcelog[1020]:   MC4 Error: Watchdog timeout due to lack of progress.
mcelog[1020]:   cache level: generic, mem/io: generic, mem-tx: generic error, part-proc: generic participation (request timed out)

mcelog[1020]: STATUS b600000000070f0f MCGSTATUS 0
mcelog[1020]: MCGCAP 107 APICID 0 SOCKETID 0
mcelog[1020]: CPUID Vendor AMD Family 21 Model 3
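
As a rough illustration of how the STATUS value above breaks down, here is a minimal Python sketch. The flag positions are the architectural MCi_STATUS bits. Treating bits 20:16 as the extended error code is an assumption based on the AMD family 15h register layout, and `decode_status` is a hypothetical helper name, not something mcelog provides.

```python
# Architectural MCi_STATUS flag bits (x86 machine-check architecture).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting for this error was enabled
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds a valid address
    57: "PCC",    # processor context may be corrupt
}


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error codes."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "error_code": status & 0xFFFF,
        # Assumption: extended error code lives in bits 20:16 (AMD family 15h).
        "ext_error_code": (status >> 16) & 0x1F,
    }


# STATUS value taken from the mcelog output in this report.
decoded = decode_status(0xB600000000070F0F)
print(decoded["flags"])
print(hex(decoded["error_code"]), decoded["ext_error_code"])
```

The UC and PCC flags being set fits an uncorrected, fatal error, and ADDRV means the ADDR value (fea00040) is meaningful.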
Comment 5 Takashi Iwai 2017-04-19 12:47:12 UTC
Sounds like a memory problem.  Boris, is there any known issue recently, or does it show rather a real hardware error?
Comment 6 Borislav Petkov 2017-04-19 17:26:22 UTC
Can you disable the wlan card in your BIOS (if possible) and use
the machine *without* the wifi - i.e., use eth0 and see if you can
reproduce.

Can you even reproduce reliably?

Also, please send /proc/iomem.

Thanks.
Comment 7 John Shand 2017-04-21 11:13:47 UTC
not too sure if you were asking me or Takashi.
Comment 8 Borislav Petkov 2017-04-21 11:33:23 UTC
You, of course. Takashi doesn't have your box. :-)

Also, you should see the NEEDINFO?

Flags: bpetkov: needinfo jshand2013@gmail.com

Thanks.
Comment 9 John Shand 2017-04-21 22:09:12 UTC
iomem contents:

00000000-00000000 : reserved
00000000-00000000 : System RAM
00000000-00000000 : reserved
00000000-00000000 : PCI Bus 0000:00
00000000-00000000 : PCI Bus 0000:00
  00000000-00000000 : Video ROM
00000000-00000000 : reserved
  00000000-00000000 : System ROM
00000000-00000000 : System RAM
00000000-00000000 : reserved
00000000-00000000 : ACPI Tables
00000000-00000000 : ACPI Non-volatile Storage
00000000-00000000 : reserved
00000000-00000000 : System RAM
00000000-00000000 : ACPI Non-volatile Storage
00000000-00000000 : System RAM
00000000-00000000 : reserved
00000000-00000000 : System RAM
00000000-00000000 : RAM buffer
00000000-00000000 : pnp 00:01
00000000-00000000 : PCI Bus 0000:00
  00000000-00000000 : 0000:00:01.0
  00000000-00000000 : PCI Bus 0000:01
    00000000-00000000 : 0000:01:00.0
      00000000-00000000 : r8169
    00000000-00000000 : 0000:01:00.0
      00000000-00000000 : r8169
  00000000-00000000 : PCI MMCONFIG 0000 [bus 00-ff]
    00000000-00000000 : pnp 00:00
  00000000-00000000 : PCI Bus 0000:02
    00000000-00000000 : 0000:02:00.0
      00000000-00000000 : rtl_pci
  00000000-00000000 : 0000:00:01.0
  00000000-00000000 : 0000:00:14.2
    00000000-00000000 : ICH HD audio
  00000000-00000000 : 0000:00:01.1
    00000000-00000000 : ICH HD audio
  00000000-00000000 : 0000:00:16.2
    00000000-00000000 : ehci_hcd
  00000000-00000000 : 0000:00:16.0
    00000000-00000000 : ohci_hcd
  00000000-00000000 : 0000:00:14.5
    00000000-00000000 : ohci_hcd
  00000000-00000000 : 0000:00:13.2
    00000000-00000000 : ehci_hcd
  00000000-00000000 : 0000:00:13.0
    00000000-00000000 : ohci_hcd
  00000000-00000000 : 0000:00:12.2
    00000000-00000000 : ehci_hcd
  00000000-00000000 : 0000:00:12.0
    00000000-00000000 : ohci_hcd
  00000000-00000000 : 0000:00:11.0
    00000000-00000000 : ahci
  00000000-00000000 : amd_iommu
  00000000-00000000 : reserved
    00000000-00000000 : IOAPIC 0
  00000000-00000000 : reserved
    00000000-00000000 : pnp 00:03
  00000000-00000000 : reserved
    00000000-00000000 : HPET 0
      00000000-00000000 : PNP0103:00
  00000000-00000000 : pnp 00:03
  00000000-00000000 : reserved
    00000000-00000000 : pnp 00:03
  00000000-00000000 : Local APIC
    00000000-00000000 : pnp 00:03
  00000000-00000000 : reserved
    00000000-00000000 : pnp 00:03
00000000-00000000 : System RAM
  00000000-00000000 : Kernel code
  00000000-00000000 : Kernel data
  00000000-00000000 : Kernel bss
00000000-00000000 : RAM buffer
Comment 10 John Shand 2017-04-21 22:19:41 UTC
i have tried eth0 and it was fine.  the problem seems to be the wireless; however, i have been unable to reproduce it for some time
Comment 11 Borislav Petkov 2017-04-21 22:22:21 UTC
(In reply to John Shand from comment #9)
> iomem contents:
> 
> 00000000-00000000 : reserved
> 00000000-00000000 : System RAM
> 00000000-00000000 : reserved
> 00000000-00000000 : PCI Bus 0000:00
> 00000000-00000000 : PCI Bus 0000:00
>   00000000-00000000 : Video ROM

Whoops, forgot to say "do it as root". Otherwise you get all zeros, as
you see above.

That's sekuritee! :-)
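
A quick way to tell whether a pasted /proc/iomem dump was read without root is to check that every range is zero. The sketch below assumes the usual `start-end : name` line format; `looks_masked` is a hypothetical helper name, and the sample strings are excerpts from this report.

```python
import re

# Matches the "start-end : " prefix of each /proc/iomem line.
RANGE_RE = re.compile(r"^\s*([0-9a-f]+)-([0-9a-f]+) : ", re.MULTILINE)


def looks_masked(iomem_text):
    """Return True if every address range in the dump is 0-0."""
    ranges = RANGE_RE.findall(iomem_text)
    return bool(ranges) and all(
        int(start, 16) == 0 and int(end, 16) == 0 for start, end in ranges
    )


# Excerpt as seen by a normal user (addresses hidden).
masked = "00000000-00000000 : reserved\n00000000-00000000 : System RAM\n"
# Excerpt as seen by root.
real = "fea00000-feafffff : PCI Bus 0000:02\n"

print(looks_masked(masked), looks_masked(real))
```

If the check returns True, reading the file again as root (for example via sudo) should give the real addresses.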
Comment 12 John Shand 2017-04-21 22:25:20 UTC
how do i go about getting that information for you?
Comment 13 John Shand 2017-04-21 22:27:52 UTC
Created attachment 722212 [details]
iomem contents

i hope this helps
Comment 14 Borislav Petkov 2017-04-21 22:39:35 UTC
Ok,

here's the explanation. The MCE you're seeing is trying to tell us this:

"NB WDT timeout due to lack of progress. The NB WDT monitors transaction
completions. A transaction that exceeds the programmed time limit
reports errors via the MCA. The cause of error may be another node or
device which failed to respond."

And the address reported is the physical address for which that
transaction failed to complete and hit the watchdog timeout: ADDR
fea00040

From /proc/iomem, the rtl8192ce wifi card occupies this range:

  fea00000-feafffff : PCI Bus 0000:02
    fea00000-fea03fff : 0000:02:00.0
      fea00000-fea03fff : rtl_pci

so it looks like your wifi card didn't complete a transaction to or from
memory in the programmed time limit. Or the time limit was too short and
I can imagine due to loaded network traffic, packets getting delayed and
... purely hypothetically, of course.
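
The address-to-device lookup described above can be sketched in a few lines of Python. This is only an illustration: `find_owners` is a hypothetical helper, and the sample data is just the three /proc/iomem lines quoted in this comment, not the full attachment.

```python
import re

# The three ranges quoted from /proc/iomem in this comment.
IOMEM_SAMPLE = """\
fea00000-feafffff : PCI Bus 0000:02
  fea00000-fea03fff : 0000:02:00.0
    fea00000-fea03fff : rtl_pci
"""

LINE_RE = re.compile(r"^\s*([0-9a-f]+)-([0-9a-f]+) : (.*)$")


def find_owners(iomem_text, addr):
    """Return the names of all ranges containing addr, outermost first."""
    owners = []
    for line in iomem_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        start = int(match.group(1), 16)
        end = int(match.group(2), 16)
        if start <= addr <= end:
            owners.append(match.group(3).strip())
    return owners


# ADDR value reported by the MCE.
owners = find_owners(IOMEM_SAMPLE, 0xFEA00040)
print(owners)
```

Because /proc/iomem lists nested ranges in order, the last entry returned is the innermost owner of the address.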

However, the driver or the card or the firmware:

[   20.005960] rtl8192ce:rtl92ce_read_eeprom_info [rtl8192ce]:<0-0> Chip Version ID: B_CHIP_92C
[   20.015986] rtl8192ce: Using firmware rtlwifi/rtl8192cfw.bin

should *actually* handle such timeouts much more gracefully or increase
the timeout or whatever.

So if you can't reproduce reliably and this almost never happens, I'd
say, forget it and enjoy your life. :)

If you can actually reproduce it pretty reliably, then I guess we should
talk to the network driver author to see what he has to say. Maybe Larry
has seen it already and knows what the problem is.

HTH.
Comment 15 John Shand 2017-04-21 23:40:00 UTC
can you forward this information to larry, as it may fix an intermittent issue?
Comment 16 Takashi Iwai 2017-04-22 06:56:30 UTC
Larry, any clue about this issue?
Comment 17 John Shand 2017-04-22 10:05:27 UTC
thanks mate
Comment 18 Borislav Petkov 2017-04-22 10:08:29 UTC
You keep avoiding answering the question: how often did this happen and can you reproduce it?
Comment 19 Larry Finger 2017-04-22 16:40:17 UTC
(In reply to Takashi Iwai from comment #16)
> Larry, any clue about this issue?

No. No such crashes have been reported to me. In addition, the basic PCI setup has not been changed since the first inclusion of driver rtl8192ce.

If this error is causing the reboot, then it is not always fatal. In the attached dmesg output, three such errors were logged 0.25 sec after the clock was started. That is far earlier than the PCI bus scan. The Realtek PCI driver was loaded at 20 sec. The next mce event is at 316 sec. Even then the machine is still running with the firewall logging a packet drop 83 seconds later.

Each processor on my fastest system is 5800 BogoMIPS, thus John's is faster at 7400, but I do not expect that to have any effect.

Larry
Comment 20 Borislav Petkov 2017-04-22 17:31:58 UTC
(In reply to Larry Finger from comment #19)
> If this error is causing the reboot, then it is not always fatal. In the
> attached dmesg output, three such errors were logged 0.25 sec after the
> clock was started. That is far earlier than the PCI bus scan.

No, that's a single error and it is from the previous boot. It is
a fatal error which has caused the machine to warm-reset. We log
that error *after* the machine reboots because it remains in the MCA
registers after a warm reset. I.e., we can't log it when it happens
because the machine resets immediately - that's how the hardware behaves
in the face of fatal errors.

One other thing you could do, John, is see if there's a BIOS update for
your machine. Something newer than what you have now:

[    0.000000] DMI: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A55M-DS2, BIOS F6 06/26/2013

Maybe newer BIOS has a fix, who knows.
Comment 21 John Shand 2017-04-23 08:28:41 UTC
(In reply to Borislav Petkov from comment #18)
> You keep avoiding answering the question: how often did this happen and can
> you reproduce it?

The funny thing about this issue is that a few months ago it used to happen at least 3 times a week, then it went away until the last update.  Since the last update it has only happened twice without any reason i can think of
Comment 22 Borislav Petkov 2017-04-23 11:07:31 UTC
(In reply to John Shand from comment #21)
> The funny thing about this issue is that a few months ago it used to happen
> at least 3 times a week, then it went away until the last update.

By "update" you mean kernel update? What exactly did get updated, making
the issue go away?

> Since the last update it has only happened twice without any reason i
> can think of

Ok, so it was a good thing I kept insisting on that question: this is an
important piece of information.

Btw, your board has a newer BETA BIOS:

http://www.gigabyte.com/Motherboard/GA-F2A55M-DS2-rev-10#support-dl

Description says it updates AGESA which is the CPU part of the BIOS
support. So while it doesn't say it fixes some wifi chip issues, it
would still be worth a try...
Comment 23 John Shand 2017-04-23 20:58:47 UTC
(In reply to Borislav Petkov from comment #22)
> (In reply to John Shand from comment #21)
> > The funny thing about this issue is that a few months ago it used to happen
> > at least 3 times a week, then it went away until the last update.
> 
> By "update" you mean kernel update? What exactly did get updated, making
> the issue go away?
> 
> > Since the last update it has only happened twice without any reason i
> > can think of
> 
> Ok, so it was a good thing I kept insisting on that question: this is an
> important piece of information.
> 
> Btw, your board has a newer BETA BIOS:
> 
> http://www.gigabyte.com/Motherboard/GA-F2A55M-DS2-rev-10#support-dl
> 
> Description says it updates AGESA which is the CPU part of the BIOS
> support. So while it doesn't say it fixes some wifi chip issues, it
> would still be worth a try...

yeah i did the kernel updates as per normal until kernel 4.10.9, then i had the issue mentioned.

yeah, i double checked my motherboard and it is a revision 1.2.  i am unsure how to update the BIOS myself.
Comment 24 Borislav Petkov 2017-04-23 21:17:17 UTC
(In reply to John Shand from comment #23)
> yeah i did the kernel updates as per normal until kernel 4.10.9, then
> i had the issue mentioned.

You said:

> The funny thing about this issue is that a few months ago it used to
> happen at least 3 times a week,

With which kernel did you get it 3 times a week?

> then it went away until the last update.

What exactly did you update/change to make the issue go away?
Change/update to what version?

> Since the last update it has only happened twice without any reason i
> can think of

And that is with 4.10.9, correct?

Basically, I'd like to find out what you did to cause the issue to
happen and what you did to make it go away.
Comment 25 John Shand 2017-04-24 05:49:31 UTC
(In reply to Borislav Petkov from comment #24)
> (In reply to John Shand from comment #23)
> > yeah i did the kernel updates as per normal until kernel 4.10.9, then
> > i had the issue mentioned.
> 
> You said:
> 
> > The funny thing about this issue is that a few months ago it used to
> > happen at least 3 times a week,

Yes i did.  i can't remember the kernel version.

> 
> With which kernel did you get it 3 times a week?

no.  more like a different one when updates were available.

> 
> > then it went away until the last update.
> 
> What exactly did you update/change to make the issue go away?
> Change/update to what version?
> 
> > Since the last update it has only happened twice without any reason i
> > can think of
> 
> And that is with 4.10.9, correct?

yes that's correct

> 
> Basically, I'd like to find out what you did to cause the issue to
> happen and what you did to make it go away.

All i did was update when stable updates were available, which went very smoothly.  kernel 4.10.8 was the last version that didn't have this issue that i'm aware of.
Comment 26 John Shand 2017-04-27 00:50:16 UTC
issue is still current with kernel 4.10.10 and has happened more often than with kernel 4.10.9
Comment 27 Borislav Petkov 2017-04-27 09:17:55 UTC
Well, the only thing I could think of is try updating your BIOS.

Then, I guess I'm all out of ideas and I'd look in Larry's direction.

Ok, maybe one practical idea: if the wifi card is one you can replace
and it was cheap, I'd go get a different one which doesn't have the
issue.

HTH.
Comment 28 John Shand 2017-05-03 01:32:58 UTC
the new kernel 4.10.12 seems to have fixed this problem.  i did not have to change any hardware
Comment 29 John Shand 2017-06-02 23:42:31 UTC
this issue seems to have been resolved