Bugzilla – Bug 533556
Boot loops on Temperature above threshold messages
Last modified: 2009-10-30 23:57:42 UTC
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.2) Gecko/20090730 SUSE/3.5.2-2.1 Firefox/3.5.2
I have an Intel Core 2 Duo 6600 2.4 GHz processor. It has been running for over three years with no problems. After the ms 6 64-bit DVD was installed, the first boot resulted in thousands of these messages:
    <6>[ 12.899003] CPU0: Temperature/speed normal
    <2>[ 12.899970] CPU0: Temperature above threshold, cpu clock throttled (total events = 2324)
The boot was not able to complete. According to the motherboard BIOS, the CPU was running at 74/75C. This may be high, but this machine has been running openSUSE 9, 10.1, 10.3 and 11.1. Also, this did not happen in ms 1 and ms 5 32-bit. ms 2, 3 and 4 never got this far in the install.
Reproducible: Always
Steps to Reproduce:
1. Install ms 6 64-bit DVD
2. Boot
Actual Results: System loops with thousands of temperature messages.
Expected Results: Even if the temperature was high, it should be a warning and not a loop.
There is nothing to attach since all I get is:
    klogd 1.4.1, log source = ksyslog started.
    throttled (total events = 465)
    <6>[ 8.279502] CPU0: Temperature/speed normal
    <2>[ 8.281029] CPU0: Temperature above threshold, cpu clock throttled (total events = 466)
and then thousands more.
Can you please try kernel-default instead of -desktop?
I also experience this bug sometimes. The message flooding in the console/logs is so extreme that in fact I think it causes the temperature rise (and so the message shows again and again...) I had to press the poweroff button to end the booting process (not enough patience I guess). I'm also using kernel-desktop, in M7. It happens even with the "Failsafe" option in the Grub menu.
Created attachment 318956 [details] boot.msg
I have now tested M7. Although it allows the boot to complete, there are still messages in the boot.msg log, which is attached. I cannot find a default kernel in /boot or available in YaST. I tried to search for "kernel" and only "desktop" is available in YaST.
Today the CPU temp went to 74.5/75C and 11.2 just looped on boot with the same problem. So it appears the higher the temp the more messages/loops. And the only way to stop it is to press RESET. So there is still the same problem in M7 as mentioned by asd asd.
https://bugzilla.novell.com/show_bug.cgi?id=539261 is a duplicate of this bug
Hello, I've encountered a problem with the 11.2 M7 release. After some working time my machine (Acer Aspire 5315, Intel Celeron Processor 550) crashes completely. It goes off immediately with no warning or clean shutdown. This is reproducible, but the time it takes to crash differs a lot (between 5 minutes and 24 hours). The notebook works flawlessly except when this happens. With the 11.1 final release this problem doesn't occur. Here are the logs which I found interesting:
    Sep 19 12:43:01 ebcoM7 kernel: [ 1583.711112] CPU0: Temperature above threshold, cpu clock throttled (total events = 1)
    Sep 19 12:43:01 ebcoM7 kernel: [ 1583.711184] Disabling lock debugging due to kernel taint
    Sep 19 12:43:01 ebcoM7 kernel: [ 1583.712061] CPU0: Temperature/speed normal
    Sep 19 12:43:02 ebcoM7 kernel: [ 1584.083749] CPU0: Temperature above threshold, cpu clock throttled (total events = 2)
    Sep 19 12:43:02 ebcoM7 kernel: [ 1584.084732] CPU0: Temperature/speed normal
    Sep 19 12:43:02 ebcoM7 kernel: [ 1584.092081] CPU0: Temperature above threshold, cpu clock throttled (total events = 3)
    Sep 19 12:43:02 ebcoM7 kernel: [ 1584.093043] CPU0: Temperature/speed normal
    ....... I cut the excerpt in the middle!
    Sep 19 12:45:47 ebcoM7 kernel: [ 1749.644472] CPU0: Temperature above threshold, cpu clock throttled (total events = 19287)
    Sep 19 12:45:47 ebcoM7 kernel: [ 1749.650429] CPU0: Temperature/speed normal
    Sep 19 12:45:47 ebcoM7 kernel: [ 1749.650485] CPU0: Temperature above threshold, cpu clock throttled (total events = 19288)
    Sep 19 12:45:57 ebcoM7 kernel: [ 1759.123985] CPU0: Temperature/speed normal
    Sep 19 12:45:57 ebcoM7 kernel: [ 1759.124049] CPU0: Temperature above threshold, cpu clock throttled (total events = 19295)
If further details or logs are needed, or if someone can advise me how I can dig deeper into the problem, please let me know.
Best regards
Stefanie Dotzler
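To put a number on how fast these messages arrive, here is a minimal Python sketch that reads the kernel timestamps and event counters from log text like the excerpt above. The function name throttle_rate and the regular expression are illustrative assumptions based only on the message format quoted in this report; other kernels may word the message differently.

```python
import re

# Matches the throttle message format quoted in this report (an assumption:
# other kernel versions may use different wording).
THROTTLE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+CPU\d+: Temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<count>\d+)\)"
)


def throttle_rate(log_text):
    """Return (events, seconds, events_per_second) between the first and
    last throttle message in log_text, or None if that cannot be computed."""
    samples = [
        (float(m.group("ts")), int(m.group("count")))
        for m in THROTTLE_RE.finditer(log_text)
    ]
    if len(samples) < 2:
        return None
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    elapsed = t1 - t0
    if elapsed <= 0:
        return None
    events = c1 - c0
    return events, elapsed, events / elapsed


# First and last throttle lines from the excerpt above.
excerpt = (
    "Sep 19 12:43:01 ebcoM7 kernel: [ 1583.711112] CPU0: Temperature above "
    "threshold, cpu clock throttled (total events = 1)\n"
    "Sep 19 12:45:57 ebcoM7 kernel: [ 1759.124049] CPU0: Temperature above "
    "threshold, cpu clock throttled (total events = 19295)\n"
)

print(throttle_rate(excerpt))
```

For the two lines shown, this works out to 19294 events in roughly 175 seconds, on the order of a hundred events per second.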
This endless throttling is weird because on my Celeron processor there are only 8 throttling states. Also, using my mark one temperature probe (fingertips), the case seems to be relatively cool, while the internal sensor says it is hot enough to boil water.
I am wondering: since this prevents boot-up and sometimes installation, shouldn't this be a ship stopper / higher priority?
Guys, I have the temperature throttling problem here, too. To me it looks like something is broken with the sensor values. Also see bug #539261, which could be a duplicate. Increasing priority. Something seems to be very broken. Anything I can do to provide better debugging info?
FYI: I got through M8 install but boot has many many of these messages so loop still is there. It appears to try to continue booting. But at some point the loop takes over (I suspect the CPU gets warmer from the loop?).
I can confirm the bug appearing in M8. Also, I've found that when these messages appear in the middle of a session, one way to stop the sudden hard drive activity is to type "killall klogd".
sounds like many have this issue -> ship stopper
It looks like kernel 2.6.32-rc3 has the fix:
    commit 5001f861219a082e6a64ae61fccea2272bc6751a
    Merge: 663cc81 c7db7ba
    Author: Linus Torvalds <torvalds@linux-foundation.org>
    Date: Sun Oct 4 14:59:53 2009 -0700
        Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6
        * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6:
          ACPI: EC: Don't parse DSDT for EC early init on Compal
          ACPI: EC: Rewrite DMI checks
          ACPI: dock: fix "sibiling" typo
          ACPI: kill overly verbose "throttling states" log messages
          ACPI: Fix bound checks for copy_from_user in the acpi /proc code
          ACPI: fix bus scanning memory leaks
          ACPI: EC: Restart command even if no interrupts from EC
          sony-laptop: Don't unregister the SPIC driver if it wasn't registered
          sony-laptop: remove _INI call at init time
          sony-laptop: SPIC unset IRQF_SHARED, set IRQF_DISABLED
          sony-laptop: remove device_ctrl and the SPIC mini drivers
Is it fixed? Which specific git commit looks to resolve this issue?
ping. Can anyone of those having the problem please compile and use a vanilla master kernel? Joe, is your machine accessible?
Dale, did you test the -rc3 kernel to verify this or did the ''kill overly verbose "throttling states" log messages'' commit just look like a likely candidate? The commit you've posted is a merge commit so it's tough to see what you're trying to show us. That one corresponds to git commit 53412c5b, which is unrelated.
I didn't test the rc3 kernel. I just saw this in the changelog: ACPI: kill overly verbose "throttling states" log messages
Yeah, that commit is unrelated and only suppresses messages like this: ACPI: Processor [CPU0] (supports 8 throttling states)
Thomas, can you look at Joe's machine? I would like to know the impact of this bug - if it's specific to some machines or a lot.
Just to chime in again with a comment: If this bug is not fixed, then 11.2 is not usable or installable on my machine. I would like to provide more info, but I am not an expert as some are here. I can provide logs, etc. if I can find them, but I have not compiled kernels, etc. Wish I could provide more info.
This is absolutely unrelated to ACPI. The temperature exception is triggered through an MCE (Machine Check Exception). There has been some work done in this kernel area lately. For Joe's machine: I upgraded the BIOS from a 2006 version to a 2008 version and the bug seems to be fixed. Please tell us if not. (I tried to reproduce this on an already upgraded machine and could not run into it.) Therefore: everybody, please upgrade to the latest BIOS. It could be that BIOS vendors didn't care about MCEs because OSes didn't catch the event yet.
The BIOS update seems to have slightly improved the situation (machine doesn't hang so frequently when I try to boot), but I'm still getting the "Temperature above threshold" messages.
I can also reproduce this with my machine here now. Joe used a KDE app to show sensors output from the coretemp CPU temperature driver. Sometimes the temperature dropped to zero (or below?) for just one or a few reads. I tried to reproduce this with watch -n1 sensors, but couldn't see zero or negative readings. But this may be because it doesn't read out the temperature that often. This is the last reading I got while stressing the CPUs, before the machine froze with lots of these messages in syslog:
    CPU0: Temperature above threshold, cpu clock throttled (total events = 86)
    CPU0: Temperature/speed normal
    Every 1.0s: sensors    Fri Oct 16 13:23:03 2009
    coretemp-isa-0000
    Adapter: ISA adapter
    Core 0: +98.0C (high = +84.0C, crit = +100.0C)
    coretemp-isa-0001
    Adapter: ISA adapter
    Core 1: +96.0C (high = +84.0C, crit = +100.0C)
    coretemp-isa-0002
    Adapter: ISA adapter
    Core 2: +98.0C (high = +84.0C, crit = +100.0C)
    coretemp-isa-0003
    Adapter: ISA adapter
    Core 3: +97.0C (high = +84.0C, crit = +100.0C)
Adding Intel people who are more familiar with mce than me. Also attaching mcelog and dmesg.
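For anyone who wants to compare readings between machines, the sensors output above can be reduced to per-core numbers. This is a rough Python sketch that assumes the "Core N: +T (high = +H, crit = +C)" line layout shown in this comment; parse_cores and its field names are made up for illustration, and other lm-sensors versions may print the lines differently.

```python
import re

# Assumes the coretemp line layout quoted above; the degree sign is optional
# because it is missing in the pasted output.
CORE_RE = re.compile(
    r"Core (?P<core>\d+):\s*\+(?P<temp>\d+\.\d+)\s*°?C\s*"
    r"\(high = \+(?P<high>\d+\.\d+)\s*°?C, crit = \+(?P<crit>\d+\.\d+)\s*°?C\)"
)


def parse_cores(sensors_text):
    """Return one dict per core with its temperature, whether it exceeds the
    'high' limit, and the remaining headroom to the 'crit' limit."""
    cores = []
    for m in CORE_RE.finditer(sensors_text):
        temp = float(m.group("temp"))
        high = float(m.group("high"))
        crit = float(m.group("crit"))
        cores.append({
            "core": int(m.group("core")),
            "temp": temp,
            "above_high": temp > high,
            "headroom": crit - temp,
        })
    return cores


# The four core lines from the reading above.
reading = """\
Core 0: +98.0C (high = +84.0C, crit = +100.0C)
Core 1: +96.0C (high = +84.0C, crit = +100.0C)
Core 2: +98.0C (high = +84.0C, crit = +100.0C)
Core 3: +97.0C (high = +84.0C, crit = +100.0C)
"""

for core in parse_cores(reading):
    print(core)
```

In this reading every core is above its "high" limit and the hottest cores are within 2 degrees of "crit".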
Created attachment 322854 [details] /var/log/mcelog with the MCEs logged
I am not attaching dmesg; please refer to comment #3, where the messages are shown nicely. Setting needinfo to racing.guo@intel.com, who touched a lot of parts of the P4 thermal code in mcelog; hope you can help. Kent/Youquan: This is probably CPU specific, but a lot of people seem to be affected, so this is rather important for us. Compare with comment #14:
> sounds like many have this issue -> ship stopper
Update here: RC1 still has the problem. I was able to get the install to complete, but the log clearly has a LOT (thousands) of these messages. A definite show stopper for me (and, it looks like, for others). If this goes into GA, 11.2 looks unusable for me. Frankly, I am surprised such a serious problem made it to an RC. Some people mentioned a BIOS update may have helped. I think the day was simply cold. In my case I have had the latest BIOS on my Asus MB for some time now, since long before this problem started. Again, 11.1 and all previous releases had no problem, and early 11.2 also had no problem. My dmesg (from 11.1) is attached.
Created attachment 322951 [details] dmesg 11.1
Can you try the latest 2.6.32 kernel on such a platform? Please show me the output of "cat /proc/cpuinfo". Also, please point me to a link where I can download 11.2 M8.
I'm not aware of any mirrors still holding M8, what do you need it for? If important, I can publish the M8 DVD on a side path. But RC1 has the same problem and is available from http://download.opensuse.org/distribution/11.2-RC1/iso/
Here is my /proc/cpuinfo:
    dale@linux:~> cat /proc/cpuinfo
    processor : 0
    vendor_id : GenuineIntel
    cpu family : 6
    model : 13
    model name : Intel(R) Celeron(R) M processor 1.60GHz
    stepping : 8
    cpu MHz : 1596.061
    cache size : 1024 KB
    fdiv_bug : no
    hlt_bug : no
    f00f_bug : no
    coma_bug : no
    fpu : yes
    fpu_exception : yes
    cpuid level : 2
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov clflush dts acpi mmx fxsr sse sse2 ss tm pbe up bts
    bogomips : 3192.12
    clflush size : 64
    power management:
    dale@linux:~>
In case this is of any use (from 11.1):
    cat /proc/cpuinfo
    processor : 0
    vendor_id : GenuineIntel
    cpu family : 6
    model : 15
    model name : Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
    stepping : 6
    cpu MHz : 1596.000
    cache size : 4096 KB
    physical id : 0
    siblings : 2
    core id : 0
    cpu cores : 2
    apicid : 0
    initial apicid : 0
    fdiv_bug : no
    hlt_bug : no
    f00f_bug : no
    coma_bug : no
    fpu : yes
    fpu_exception : yes
    cpuid level : 10
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
    bogomips : 4808.20
    clflush size : 64
    power management:
    processor : 1
    vendor_id : GenuineIntel
    cpu family : 6
    model : 15
    model name : Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
    stepping : 6
    cpu MHz : 1596.000
    cache size : 4096 KB
    physical id : 0
    siblings : 2
    core id : 1
    cpu cores : 2
    apicid : 1
    initial apicid : 1
    fdiv_bug : no
    hlt_bug : no
    f00f_bug : no
    coma_bug : no
    fpu : yes
    fpu_exception : yes
    cpuid level : 10
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
    bogomips : 4808.23
    clflush size : 64
    power management:
I booted with nomce and I am not getting the above threshold messages anymore
I tried with 11.1 (SLE11) and I could also see the above-threshold messages, but not this kind of MCE storm. I expect the high temperature is real. You can monitor this nicely:
a) Put the cores under load by starting several of these processes:
    cat /dev/zero >/dev/null &
b) Monitor the temperature:
   - Install the sensors package and run: sensors-detect (always confirm)
   - Start the lm-sensors service: rclm-sensors start
   - Monitor the temperature with: watch -n1 sensors
The temperature should now rise slowly until the critical temperature is reached. You should see the same with older SUSE versions. The bug seems to be that the hysteresis changed: the MCE and the speed limitation/throttling and unthrottling seem to toggle quickly all the time, resulting in a hard machine freeze (no display; the machine can still be pinged, but ssh login does not work anymore and the keyboard is dead). I connected the machine to a serial console overnight; hopefully I can catch something.
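To illustrate why a missing or too-small hysteresis would produce this kind of rapid toggling, here is a small Python simulation. It is a conceptual sketch only, not the kernel's logic: count_toggles, the threshold and the temperature series are all made-up values chosen to show the effect.

```python
def count_toggles(temps, threshold, hysteresis=0.0):
    """Count hot/normal state changes over a series of temperature readings.

    The state becomes 'hot' when a reading reaches the threshold and returns
    to 'normal' only when a reading drops below threshold - hysteresis.
    """
    hot = False
    toggles = 0
    for t in temps:
        if not hot and t >= threshold:
            hot = True
            toggles += 1
        elif hot and t < threshold - hysteresis:
            hot = False
            toggles += 1
    return toggles


# Hypothetical readings hovering around a 100 C threshold.
readings = [99.0, 100.0, 99.5, 100.0, 99.0, 100.5, 99.5, 100.0, 97.0]

print(count_toggles(readings, threshold=100.0))                  # no hysteresis
print(count_toggles(readings, threshold=100.0, hysteresis=2.0))  # 2 C hysteresis
```

With these readings the state flips on nearly every sample when there is no hysteresis, while a 2 degree band reduces it to one entry and one exit. Each flip would correspond to one pair of "above threshold" / "normal" messages.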
I'd not say that this is a ship stopper, because this should only be seen on HW which does not have enough fan power (someone might want to examine/compare fan activity with 11.1, or check whether the message also appears there). nomce might result in a hard switch-off of the CPU if the temperature continues rising. A better workaround would be:
    thermal.psv=98 thermal.tzp=20
This would set a passive trip point to 98 C (check with sensors for the best value for your machine before MCEs kick in; this might vary) and let the OS (thermal.ko driver) check every 20 seconds. This is much more efficient than the throttling done, if cpufreq is supported. This may not help for the Celeron if neither cpufreq nor throttling is exported to the OS (check: cat /proc/acpi/processor/*/throttling).
What do I edit?
These are boot parameters. You can add them in /boot/grub/menu.lst to the kernel line of your boot entry:
    thermal.psv=98 thermal.tzp=20
One drawback: your BIOS must export a thermal device, i.e. ls /proc/acpi/thermal_zone/ must not be empty. At some time there will be an ACPI-independent connection between hwmon and cpufreq drivers for passive trip points...
But wait: I had a deeper look at arch/x86/kernel/cpu/mcheck/therm_throt.c and mce_intel_64.c. A lot changed, but this one could explain it. I am attaching the patch already and will give it a try tomorrow (machine is dead again). I will provide a kernel to test if it helps here.
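For those unsure where the boot parameters go, the following Python sketch shows the edit on the text of a GRUB menu.lst: the parameters are appended to each line that starts with "kernel". The function add_kernel_params and the sample boot entry are hypothetical. It is a simplification that touches every kernel line (including failsafe entries), so treat it as an illustration and keep a backup if you adapt it to a real file.

```python
def add_kernel_params(menu_text, params):
    """Append each 'key=value' in params to every 'kernel' line of a GRUB
    legacy menu.lst text, skipping keys that are already set on that line."""
    result = []
    for line in menu_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0] == "kernel":
            # Keys already present on this kernel line (text before '=').
            present = {tok.split("=", 1)[0] for tok in tokens[1:]}
            missing = [p for p in params if p.split("=", 1)[0] not in present]
            if missing:
                line = line.rstrip() + " " + " ".join(missing)
        result.append(line)
    return "\n".join(result) + "\n"


# Hypothetical boot entry, for illustration only.
menu = (
    "title Example entry\n"
    "    root (hd0,0)\n"
    "    kernel /boot/vmlinuz root=/dev/sda1 splash=silent\n"
    "    initrd /boot/initrd\n"
)

patched = add_kernel_params(menu, ["thermal.psv=98", "thermal.tzp=20"])
print(patched)
```

Running the function a second time on its own output changes nothing, because parameters whose key is already on the line are skipped.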
Created attachment 323137 [details] Better ack the irq first...
Looking at the git history, this really should be it. Originally the ack was placed at the beginning, which makes more sense (in mce_intel_64.c):
    1da177e4 (Linus Torvalds 2005-04-16 15:20:36 -0700 24) ack_APIC_irq();
    95833c83 (Andi Kleen 2006-01-11 22:44:36 +0100 26) exit_idle();
    1da177e4 (Linus Torvalds 2005-04-16 15:20:36 -0700 27) irq_enter();
Then it got moved to therm_throt.c, where the ack_APIC_irq() had already been put at the end (after irq_exit()), and finally the misleading comment got added with commit id a65c88dd2c83b569dbd13778da689861bdf977f2:
    - irq_exit();
    - ack_APIC_irq();
    ...
    + /* Ack only at the end to avoid potential reentry */
    + ack_APIC_irq();
Looks like someone else already ran into this and only tried to fix the message. Last commit of therm_throt.c:
    commit b417c9fd8690637f0c91479435ab3e2bf450c038
    Author: Ingo Molnar <mingo@elte.hu>
    Date: Tue Sep 22 15:50:24 2009 +0200
        x86: mce: Fix thermal throttling message storm
        If a system switches back and forth between hot and cold mode, the MCE code will print a stream of critical kernel messages.
But this only suppresses some messages and does not fix the root cause -> possible freeze after too many irqs. Nothing for sure for now...
Looks like the unresponsiveness of the machine came from the hundreds of syslog messages sent from irq context. These should get suppressed now (by another patch). The huge amount of IRQs is normal. You might already want to try a kernel from:
    ftp://ftp.suse.com/pub/projects/kernel/kotd/openSUSE-11.2
Wait some hours for building and syncing, then check for this changelog entry:
    Tue Oct 20 22:57:50 CEST 2009 - trenn@suse.de
    - patches.arch/x86_mce_therm_suppress_msgs.patch: X86: Suppress hundreds of Intel thermal MCE messages on high temps (bnc#533556).
with:
    rpm -qp --changelog kernel-default.rpm
Be aware that you need kernel-default and kernel-default-base.rpm.
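As a rough picture of what suppressing these messages means, the Python sketch below counts every thermal event but emits a log line at most once per interval. This is not the content of the patch named above, which is not shown in this report; the class ThrottleLogLimiter and the 300 second interval are assumptions for illustration.

```python
class ThrottleLogLimiter:
    """Count every event, but emit a log message at most once per interval."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # seconds between logged messages
        self.last_logged = None
        self.total_events = 0
        self.suppressed = 0

    def event(self, now):
        """Record one event at time 'now' (seconds).

        Returns the message to log, or None if the message is suppressed.
        """
        self.total_events += 1
        if (self.last_logged is not None
                and now - self.last_logged < self.min_interval):
            self.suppressed += 1
            return None
        self.last_logged = now
        return ("CPU0: Temperature above threshold, cpu clock throttled "
                "(total events = %d)" % self.total_events)


# Hypothetical storm: 1000 events, one per millisecond.
limiter = ThrottleLogLimiter(min_interval=300.0)
messages = [limiter.event(i * 0.001) for i in range(1000)]
logged = [m for m in messages if m is not None]
print(len(logged), "logged,", limiter.suppressed, "suppressed")
```

The event counter keeps running while messages are suppressed, so the next message that does get through still reports the true total.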
FYI: I was able to install RC2 without any of these problems.