Bugzilla – Full Text Bug Listing
| Summary: | Kernel 4.10.9: Computer Intermittently Reboots | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE Tumbleweed | Reporter: | John Shand <jshand2013> |
| Component: | Kernel | Assignee: | E-mail List <kernel-maintainers> |
| Status: | RESOLVED FIXED | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Major | ||
| Priority: | P5 - None | CC: | bpetkov, jshand2013, Larry.Finger, tiwai |
| Version: | Current | Flags: | bpetkov: needinfo? (jshand2013) |
| Target Milestone: | --- | ||
| Hardware: | x86-64 | ||
| OS: | Other | ||
| Whiteboard: | |||
| Found By: | --- | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | DMESG with errors in certain areas, Hardware Information, journalctl -b information, iomem contents |
Created attachment 721382 [details]
Hardware Information
let me know if there is any other information you need

Created attachment 721384 [details]
journalctl -b information

Hardware event. This is not a software error.
mcelog[1020]: MCE 0
mcelog[1020]: CPU 0 BANK 4
mcelog[1020]: ADDR fea00040
mcelog[1020]: TIME 1492398286 Mon Apr 17 15:04:46 2017
mcelog[1020]: MC4 Error: Watchdog timeout due to lack of progress.
mcelog[1020]: cache level: generic, mem/io: generic, mem-tx: generic error, part-proc: generic participation (request timed out)
mcelog[1020]: STATUS b600000000070f0f MCGSTATUS 0
mcelog[1020]: MCGCAP 107 APICID 0 SOCKETID 0
mcelog[1020]: CPUID Vendor AMD Family 21 Model 3

Sounds like a memory problem. Boris, is there any known issue recently, or does it show rather a real hardware error?

Can you disable the wlan card in your BIOS (if possible) and use the machine *without* the wifi - i.e., use eth0 and see if you can reproduce. Can you even reproduce reliably? Also, please send /proc/iomem. Thanks.

not too sure if you were asking me or Takashi.

You, of course. Takashi doesn't have your box. :-) Also, you should see the NEEDINFO?

Flags: bpetkov: needinfo jshand2013@gmail.com

Thanks.

iomem contents:
00000000-00000000 : reserved
00000000-00000000 : System RAM
00000000-00000000 : reserved
00000000-00000000 : PCI Bus 0000:00
00000000-00000000 : PCI Bus 0000:00
00000000-00000000 : Video ROM
00000000-00000000 : reserved
00000000-00000000 : System ROM
00000000-00000000 : System RAM
00000000-00000000 : reserved
00000000-00000000 : ACPI Tables
00000000-00000000 : ACPI Non-volatile Storage
00000000-00000000 : reserved
00000000-00000000 : System RAM
00000000-00000000 : ACPI Non-volatile Storage
00000000-00000000 : System RAM
00000000-00000000 : reserved
00000000-00000000 : System RAM
00000000-00000000 : RAM buffer
00000000-00000000 : pnp 00:01
00000000-00000000 : PCI Bus 0000:00
00000000-00000000 : 0000:00:01.0
00000000-00000000 : PCI Bus 0000:01
00000000-00000000 : 0000:01:00.0
00000000-00000000 : r8169
00000000-00000000 : 0000:01:00.0
00000000-00000000 : r8169
00000000-00000000 : PCI MMCONFIG 0000 [bus 00-ff]
00000000-00000000 : pnp 00:00
00000000-00000000 : PCI Bus 0000:02
00000000-00000000 : 0000:02:00.0
00000000-00000000 : rtl_pci
00000000-00000000 : 0000:00:01.0
00000000-00000000 : 0000:00:14.2
00000000-00000000 : ICH HD audio
00000000-00000000 : 0000:00:01.1
00000000-00000000 : ICH HD audio
00000000-00000000 : 0000:00:16.2
00000000-00000000 : ehci_hcd
00000000-00000000 : 0000:00:16.0
00000000-00000000 : ohci_hcd
00000000-00000000 : 0000:00:14.5
00000000-00000000 : ohci_hcd
00000000-00000000 : 0000:00:13.2
00000000-00000000 : ehci_hcd
00000000-00000000 : 0000:00:13.0
00000000-00000000 : ohci_hcd
00000000-00000000 : 0000:00:12.2
00000000-00000000 : ehci_hcd
00000000-00000000 : 0000:00:12.0
00000000-00000000 : ohci_hcd
00000000-00000000 : 0000:00:11.0
00000000-00000000 : ahci
00000000-00000000 : amd_iommu
00000000-00000000 : reserved
00000000-00000000 : IOAPIC 0
00000000-00000000 : reserved
00000000-00000000 : pnp 00:03
00000000-00000000 : reserved
00000000-00000000 : HPET 0
00000000-00000000 : PNP0103:00
00000000-00000000 : pnp 00:03
00000000-00000000 : reserved
00000000-00000000 : pnp 00:03
00000000-00000000 : Local APIC
00000000-00000000 : pnp 00:03
00000000-00000000 : reserved
00000000-00000000 : pnp 00:03
00000000-00000000 : System RAM
00000000-00000000 : Kernel code
00000000-00000000 : Kernel data
00000000-00000000 : Kernel bss
00000000-00000000 : RAM buffer
i have tried eth0 and it was fine. the problem seems to be wireless, however i have been unable to reproduce it for some time

(In reply to John Shand from comment #9)
> iomem contents:
>
> 00000000-00000000 : reserved
> 00000000-00000000 : System RAM
> 00000000-00000000 : reserved
> 00000000-00000000 : PCI Bus 0000:00
> 00000000-00000000 : PCI Bus 0000:00
> 00000000-00000000 : Video ROM

Whoops, forgot to say "do it as root". Otherwise you get all zeros, as you see above. That's sekuritee! :-)

how do i go about getting that information for you?

Created attachment 722212 [details]
iomem contents
i hope this helps
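As a rough illustration of the "do it as root" point above, the following Python sketch parses /proc/iomem-style text and reports whether every range came back as zeros, which is what an unprivileged read produced in this thread. The helper names `parse_iomem` and `looks_masked` are made up for this example, and the sample text is copied from the first few lines of the all-zero listing; on a real system you would pass in the contents of /proc/iomem instead.

```python
import re

# One /proc/iomem line: "<start>-<end> : <name>", optionally indented.
IOMEM_LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*(.+?)\s*$")


def parse_iomem(text):
    """Return a list of (start, end, name) tuples from /proc/iomem-style text."""
    entries = []
    for line in text.splitlines():
        m = IOMEM_LINE.match(line)
        if m:
            entries.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3)))
    return entries


def looks_masked(entries):
    """True if there is at least one entry and every range is 0-0."""
    return bool(entries) and all(s == 0 and e == 0 for s, e, _ in entries)


# Sample copied from the non-root listing in this thread (hypothetical input).
SAMPLE = """\
00000000-00000000 : reserved
00000000-00000000 : System RAM
00000000-00000000 : PCI Bus 0000:00
"""

if looks_masked(parse_iomem(SAMPLE)):
    print("All ranges are zero; read /proc/iomem as root to get real addresses.")
```

To check a live system, replace `SAMPLE` with the result of `open("/proc/iomem").read()` and run the script once as a normal user and once with sudo.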
Ok,
here's the explanation. The MCE you're seeing is trying to tell us this:
"NB WDT timeout due to lack of progress. The NB WDT monitors transaction
completions. A transaction that exceeds the programmed time limit
reports errors via the MCA. The cause of error may be another node or
device which failed to respond."
And the address reported is the physical address for which that
transaction failed to complete and hit the watchdog timeout: ADDR
fea00040
From /proc/iomem, the rtl8192ce wifi card occupies this range:
fea00000-feafffff : PCI Bus 0000:02
fea00000-fea03fff : 0000:02:00.0
fea00000-fea03fff : rtl_pci
so it looks like your wifi card didn't complete a transaction to or from
memory in the programmed time limit. Or the time limit was too short and
I can imagine due to loaded network traffic, packets getting delayed and
... purely hypothetically, of course.
However, the driver or the card or the firmware:
[ 20.005960] rtl8192ce:rtl92ce_read_eeprom_info [rtl8192ce]:<0-0> Chip Version ID: B_CHIP_92C
[ 20.015986] rtl8192ce: Using firmware rtlwifi/rtl8192cfw.bin
should *actually* handle such timeouts much more gracefully or increase
the timeout or whatever.
So if you can't reproduce reliably and this almost never happens, I'd
say, forget it and enjoy your life. :)
If you can actually reproduce it pretty reliably, then I guess we should
talk to the network driver author to see what he has to say. Maybe Larry
has seen it already and knows what the problem is.
HTH.
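The lookup described in this comment (take the ADDR from the MCE and see which /proc/iomem range contains it) can be sketched in Python roughly as follows. The function name `find_owners` is invented for illustration, and the sample uses the three ranges quoted above; the indentation in the sample is an assumption about how the real file nests child resources.

```python
import re

# One /proc/iomem line: "<start>-<end> : <name>", optionally indented.
IOMEM_LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*(.+?)\s*$")


def find_owners(iomem_text, addr):
    """Return the names of all ranges containing addr, in file order.

    /proc/iomem lists a parent resource before its children, so file
    order is also outermost-to-innermost.
    """
    owners = []
    for line in iomem_text.splitlines():
        m = IOMEM_LINE.match(line)
        if not m:
            continue
        start, end = int(m.group(1), 16), int(m.group(2), 16)
        if start <= addr <= end:
            owners.append(m.group(3))
    return owners


# Ranges quoted in the comment above; indentation is assumed.
SAMPLE = """\
fea00000-feafffff : PCI Bus 0000:02
  fea00000-fea03fff : 0000:02:00.0
    fea00000-fea03fff : rtl_pci
"""

# ADDR fea00040 from the MCE record.
print(find_owners(SAMPLE, 0xFEA00040))
```

With the sample ranges this prints the bus, the PCI device and the `rtl_pci` driver region, matching the conclusion that the address belongs to the wifi card.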
can you forward this information to larry, as it may fix an intermittent issue?

Larry, any clue about this issue?

thanks mate

You keep avoiding answering the question: how often did this happen and can you reproduce it?

(In reply to Takashi Iwai from comment #16)
> Larry, any clue about this issue?

No. Such crashes have not been reported to me. In addition, the basic PCI setup has not been changed since the first inclusion of driver rtl8192ce.

If this error is causing the reboot, then it is not always fatal. In the attached dmesg output, three such errors were logged 0.25 sec after the clock was started. That is far earlier than the PCI bus scan. The Realtek PCI driver was loaded at 20 sec. The next mce event is at 316 sec. Even then the machine is still running with the firewall logging a packet drop 83 seconds later.

Each processor on my fastest system is 5800 BogoMIPS, thus John's is faster at 7400, but I do not expect that to have any effect.

Larry

(In reply to Larry Finger from comment #19)
> If this error is causing the reboot, then it is not always fatal. In the
> attached dmesg output, three such errors were logged 0.25 sec after the
> clock was started. That is far earlier than the PCI bus scan.

No, that's a single error and it is from the previous boot. It is a fatal error which has caused the machine to warm-reset. We log that error *after* the machine reboots because it remains in the MCA registers after a warm reset. I.e., we can't log it when it happens because the machine resets immediately - that's how the hardware behaves in the face of fatal errors.

One other thing you could do, John, is see if there's a BIOS update for your machine. Something newer than what you have now:

[ 0.000000] DMI: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A55M-DS2, BIOS F6 06/26/2013

Maybe newer BIOS has a fix, who knows.

(In reply to Borislav Petkov from comment #18)
> You keep avoiding answering the question: how often did this happen and can
> you reproduce it?

The funny thing about this issue is that a few months ago it used to happen at least 3 times a week, then it went away until the last update. Since the last update it has only happened twice without any reason i can think of

(In reply to John Shand from comment #21)
> The funny thing about this issue is that a few months ago it used to happen
> at least 3 times a week, then it went away until the last update.

By "update" you mean kernel update? What exactly did get updated, making the issue go away?

> Since the last update it has only happened twice without any reason i
> can think of

Ok, so it was a good thing I kept insisting on that question: this is an important piece of information.

Btw, your board has a newer BETA BIOS:

http://www.gigabyte.com/Motherboard/GA-F2A55M-DS2-rev-10#support-dl

Description says it updates AGESA which is the CPU part of the BIOS support. So while it doesn't say it fixes some wifi chip issues, it would still be worth to try...

(In reply to Borislav Petkov from comment #22)
> (In reply to John Shand from comment #21)
> > The funny thing about this issue is that a few months ago it used to happen
> > at least 3 times a week, then it went away until the last update.
>
> By "update" you mean kernel update? What exactly did get updated, making
> the issue go away?
>
> > Since the last update it has only happened twice without any reason i
> > can think of
>
> Ok, so it was a good thing I kept insisting on that question: this is an
> important piece of information.
>
> Btw, your board has a newer BETA BIOS:
>
> http://www.gigabyte.com/Motherboard/GA-F2A55M-DS2-rev-10#support-dl
>
> Description says it updates AGESA which is the CPU part of the BIOS
> support. So while it doesn't say it fixes some wifi chip issues, it
> would still be worth to try...

yeah i did the kernel updates as per normal until kernel 4.10.9, then i had the issue mentioned. yeah, i double checked with my motherboard and i have a revision 1.2. i am unsure how to update the BIOS myself.

(In reply to John Shand from comment #23)
> yeah i did the kernel updates as per normal until kernel 4.10.9, then
> i had the issue mentioned.

You said:

> The funny thing about this issue is that a few months ago it used to
> happen at least 3 times a week,

With which kernel did you get it 3 times a week?

> then it went away until the last update.

What exactly did you update/change to make the issue go away? Change/update to what version?

> Since the last update it has only happened twice without any reason i
> can think of

And that is with 4.10.9, correct?

Basically, I'd like to find out what you did to cause the issue to happen and what you did to make it go away.

(In reply to Borislav Petkov from comment #24)
> (In reply to John Shand from comment #23)
> > yeah i did the kernel updates as per normal until kernel 4.10.9, then
> > i had the issue mentioned.
>
> You said:
>
> > The funny thing about this issue is that a few months ago it used to
> > happen at least 3 times a week,

Yes i did. i can't remember the kernel version.

>
> With which kernel did you get it 3 times a week?

no. more like a different one when updates were available.

>
> > then it went away until the last update.
>
> What exactly did you update/change to make the issue go away?
> Change/update to what version?
>
> > Since the last update it has only happened twice without any reason i
> > can think of
>
> And that is with 4.10.9, correct?

yes that's correct

>
> Basically, I'd like to find out what you did to cause the issue to
> happen and what you did to make it go away.

All i did was update when stable updates were available, which went very smoothly. kernel 4.10.8 was the last version that didn't have this issue that i'm aware of.

issue is still current with kernel 4.10.10 and has happened more than with kernel 4.10.9

Well, the only thing I could think of is try updating your BIOS. Then, I guess I'm all out of ideas and I'd look in Larry's direction. Ok, maybe one practical idea: if the wifi card is one you can replace and it was cheap, I'd go get a different one which doesn't have the issue. HTH.

the new kernel 4.10.12 seems to have fixed this problem. i have had to change no hardware as a result

this issue seems to have been resolved

Created attachment 721379 [details]
DMESG with errors in certain areas

i was doing git downloads, watching youtube, chatting on skype and a few other things and i got these errors after the computer suddenly rebooted by itself. i have very little understanding of the problem itself, but i have a few error codes.

mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: b600000000070f0f
mce: [Hardware Error]: TSC 0 ADDR fea00040
mce: [Hardware Error]: PROCESSOR 2:610f31 TIME 1492398286 SOCKET 0 APIC 0 microcode 6001119

i also checked the mcelog and got:

jshand@linux-zkok:~> sudo mcelog --client
Memory errors
SOCKET 0 CHANNEL 0 DIMM 0
DMI_NAME "Node0_Dimm1"
DMI_LOCATION "Node0_Bank0"
corrected memory errors:
0 total
0 in 24h
uncorrected memory errors:
0 total
0 in 24h

dmesg info is added as a file

i hope this helps
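
For readers who want to see what the status word b600000000070f0f in the mce lines above encodes, here is a minimal Python sketch that pulls out the architectural flag bits of an x86 machine check status register (bits 63 down to 57) and the low 16-bit error code. It deliberately does not interpret the model-specific bits in between, and `decode_status` is a name made up for this example rather than part of mcelog or the kernel.

```python
# Architectural flag bits at the top of an x86 MCi_STATUS register.
FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # MCi_MISC holds valid extra information
    58: "ADDRV",  # MCi_ADDR holds a valid address
    57: "PCC",    # processor context corrupt
}


def decode_status(status):
    """Return (list of set flag names, low 16-bit MCA error code)."""
    set_flags = [
        name
        for bit, name in sorted(FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return set_flags, status & 0xFFFF


flags, code = decode_status(0xB600000000070F0F)
print(flags, hex(code))
```

For this value the set flags are VAL, UC, EN, ADDRV and PCC. UC and PCC together are consistent with the comment in this thread that the error was fatal and forced a reset, and ADDRV matches the fact that an ADDR (fea00040) was reported alongside it.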