Bug 259992

Summary: kernel acpi read wrong temperature - critical shutdown
Product: [openSUSE] openSUSE 10.2 Reporter: Christoph Resch <shanti>
Component: Kernel Assignee: Thomas Renninger <trenn>
Status: RESOLVED WONTFIX QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: forgotten_1GBkbCnI0A, info, jdelvare, shanti
Version: Final   
Target Milestone: ---   
Hardware: x86-64   
OS: SUSE Other   
Whiteboard:
Found By: Customer Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: detailed systeminfo
my acpidump
acpidump of MSI MegaPC 865 barebone
Request IO and mem resources for ACPI operation regions
Set correct name of the resource
DSDT from desktop ASUS mainboard
update: changed from x86_64 to pure x86
acpidump for my Shuttle ST20G5

Description Christoph Resch 2007-04-02 18:13:19 UTC
Hi

my system regularly shuts down on a "wrong" temperature alarm from ACPI:

> > <<<<>>>>
> > Mar  6 14:01:20 zion kernel: ACPI: Critical trip point
> > Mar  6 14:01:20 zion kernel: Critical temperature reached (80 C),
> > shutting down.
> > Mar  6 14:01:20 zion kernel: ACPI: Unable to turn cooling device
> > [ffff810037fdd290] 'on'
> > Mar  6 14:01:20 zion shutdown[15861]: shutting down for system halt
> > Mar  6 14:01:21 zion powersaved[3500]: WARNING
> > (checkTemperatureStateChanges:218) Temperature state changed to critical.
> > Mar  6 14:01:26 zion kernel: Critical temperature reached (33 C),
> > shutting down.
> > <<<<<>>>>>>

as you can see, the "critical" temperature returns to normal (a delta of 47°) within 7 seconds .. but there is no turning back for the system .. no warning level, no margin .. just a clean shutdown

i am not even sure what cooling device the kernel means .. i have sensors for CPU, fans and mainboard .. maybe it means my VGA device or my southbridge :-o ??

please take a look :-) 

tnx

RFC

-c-

--------------------------

this is my lspci:

00:00.0 Host bridge: ATI Technologies Inc RS480 Host Bridge (rev 01)
00:01.0 PCI bridge: ATI Technologies Inc RS480 PCI Bridge
00:06.0 PCI bridge: ATI Technologies Inc RS480 PCI Bridge
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
00:19.0 PCI bridge: ALi Corporation M5249 HTT to PCI Bridge
00:1c.0 USB Controller: ALi Corporation USB 1.1 Controller (rev 03)
00:1c.1 USB Controller: ALi Corporation USB 1.1 Controller (rev 03)
00:1c.2 USB Controller: ALi Corporation USB 1.1 Controller (rev 03)
00:1c.3 USB Controller: ALi Corporation USB 2.0 Controller (rev 01)
00:1d.0 Audio device: ALi Corporation High Definition Audio/AC'97 Host Controller
00:1e.0 ISA bridge: ALi Corporation PCI to LPC Controller (rev 31)
00:1e.1 Bridge: ALi Corporation M7101 Power Management Controller [PMU]
00:1f.0 IDE interface: ALi Corporation M5229 IDE (rev c7)
00:1f.1 RAID bus controller: ALi Corporation ULi 5287 SATA (rev 02)
01:05.0 VGA compatible controller: ATI Technologies Inc RS480 [Radeon Xpress 200G Series]
02:00.0 Ethernet controller: Broadcom Corporation NetLink BCM5789 Gigabit Ethernet PCI Express (rev 11)
03:15.0 FireWire (IEEE 1394): VIA Technologies, Inc. IEEE 1394 Host Controller (rev 80)
03:16.0 RAID bus controller: Triones Technologies, Inc. HPT302/302N (rev 02)

<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>


and my lsmod


<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>
Module                  Size  Used by
usblp                  32128  0
appletalk              74736  2
ax25                   99068  2
ipx                    65096  2
p8023                  19072  1 ipx
xt_tcpudp              20352  14
xt_pkttype             18816  5
ipt_LOG                23808  12
xt_limit               20224  12
fglrx                 790276  58
vmnet                  76720  9
vmmon                 166252  0
rfcomm                 75688  2
hidp                   50688  2
l2cap                  58624  10 rfcomm,hidp
it87                   42404  0
af_packet              57356  0
hwmon_vid              19584  1 it87
hwmon                  20360  1 it87
i2c_isa                23040  1 it87
bluetooth              90116  5 rfcomm,hidp,l2cap
snd_pcm_oss            71680  0
snd_mixer_oss          35840  1 snd_pcm_oss
snd_seq                82976  0
cpufreq_conservative    25608  0
cpufreq_ondemand       24592  1
cpufreq_userspace      24064  0
cpufreq_powersave      18688  0
powernow_k8            32416  1
freq_table             22912  1 powernow_k8
thermal                33552  1
processor              53864  2 powernow_k8,thermal
button                 24736  0
battery                28296  0
ac                     22792  0
ipt_REJECT             22528  3
xt_state               19200  12
iptable_mangle         19840  0
iptable_nat            24964  0
ip_nat                 37804  1 iptable_nat
iptable_filter         19968  1
ip6table_mangle        19456  0
ip6table_filter        19840  0
ip_conntrack           78372  3 xt_state,iptable_nat,ip_nat
nfnetlink              24648  2 ip_nat,ip_conntrack
ip_tables              39784  3 iptable_mangle,iptable_nat,iptable_filter
ip6_tables             33480  2 ip6table_mangle,ip6table_filter
x_tables               37384  9 xt_tcpudp,xt_pkttype,ipt_LOG,xt_limit,ipt_REJECT,xt_state,iptable_nat,ip_tables,ip6_tables
uhci_hcd               42520  0
apparmor               74264  0
aamatch_pcre           31232  1 apparmor
reiserfs              260096  1
loop                   34192  0
dm_mod                 81872  0
ohci1394               52040  0
ieee1394              130552  1 ohci1394
snd_usb_audio         108672  1
snd_usb_lib            36224  1 snd_usb_audio
snd_rawmidi            47104  1 snd_usb_lib
snd_seq_device         26516  2 snd_seq,snd_rawmidi
snd_hwdep              28552  1 snd_usb_audio
usbhid                 69792  0
ide_cd                 59680  0
cdrom                  54056  1 ide_cd
i2c_ali15x3            25348  0
tg3                   125572  0
ohci_hcd               38404  0
shpchp                 56492  0
snd_hda_intel          37660  2
snd_hda_codec         220160  1 snd_hda_intel
snd_pcm               115464  4 snd_pcm_oss,snd_usb_audio,snd_hda_intel,snd_hda_codec
snd_timer              44680  2 snd_seq,snd_pcm
snd                    89384  18 snd_pcm_oss,snd_mixer_oss,snd_seq,snd_usb_audio,snd_usb_lib,snd_rawmidi,snd_seq_device,snd_hwdep,snd_hda_intel,snd_hda_codec,snd_pcm,snd_timer
soundcore              28192  1 snd
snd_page_alloc         27792  2 snd_hda_intel,snd_pcm
i2c_ali1535            24452  0
i2c_core               41472  4 it87,i2c_isa,i2c_ali15x3,i2c_ali1535
pci_hotplug            52228  1 shpchp
floppy                 82408  0
parport_pc             58984  1
lp                     30664  0
parport                59660  2 parport_pc,lp
ext3                  167696  2
mbcache                27016  1 ext3
jbd                    90872  1 ext3
ehci_hcd               51080  0
usbcore               148320  7 usblp,uhci_hcd,snd_usb_audio,snd_usb_lib,usbhid,ohci_hcd,ehci_hcd
edd                    27912  0
fan                    22408  1
sg                     55080  0
hpt366                 36992  0 [permanent]
sata_uli               25860  4
libata                145312  1 sata_uli
alim15x3               29208  0 [permanent]
sd_mod                 39296  4
scsi_mod              173744  3 sg,libata,sd_mod
ide_disk               34304  2
ide_core              174720  4 ide_cd,hpt366,alim15x3,ide_disk
<<<<<<<<<<<<<<>>>>>>>>>>>>>>
Comment 1 Christoph Resch 2007-04-03 11:50:51 UTC
Created attachment 128495 [details]
detailed systeminfo

some more sysinfo
Comment 2 Christoph Resch 2007-04-07 18:09:13 UTC
interesting maybe: my BIOS shows an additional fan sensor for the CPU cooling device, but the linux module (rs480-mainboard) doesn't show this "fan3" value, which is the integrated smartfan of the CPU (normally about 800-900 rpm as shown in the BIOS) .. this could be the missing "cooling device" that is not reachable .. must be a missing feature of the RS480 mainboard driver

just FYI 

best regards

-c-
Comment 4 Christoph Resch 2007-05-02 21:49:45 UTC
WOOOOHAAA !

happens again

May  2 23:07:57 zion kernel: ACPI: Critical trip point
May  2 23:07:57 zion kernel: Critical temperature reached (110 C), shutting down.
May  2 23:07:57 zion shutdown[31489]: shutting down for system halt
May  2 23:07:57 zion powersaved[3611]: WARNING (checkTemperatureStateChanges:218) Temperature state changed to critical.
May  2 23:07:57 zion init: Switching to runlevel: 0
May  2 23:07:59 zion kernel: Critical temperature reached (41 C), shutting down.
May  2 23:08:01 zion kernel: md: stopping all md devices.

i think that the i2c_ali15x3,i2c_ali1535-modules are buggy .. they at least miss one active fan. At least this shows that the driver is incomplete ..

i caught this in a boot log:

<4>ali1535_smbus 0000:00:1e.1: ALI1535_smb region uninitialized - upgrade BIOS?
<4>ali1535_smbus 0000:00:1e.1: ALI1535 not detected, module not inserted.
<3>ali15x3_smbus 0000:00:1e.1: ALI15X3_smb region uninitialized - upgrade BIOS or use force_addr=0xaddr
<3>ali15x3_smbus 0000:00:1e.1: ALI15X3 not detected, module not inserted.

i think these might have a common cause

rfc

-c-

Comment 5 Thomas Renninger 2007-05-03 08:42:42 UTC
Yep, the temp looks bogus. It may be that something interferes with ACPI causing this.

> i think that the i2c_ali15x3,i2c_ali1535-modules are buggy .. they at least
> miss on active fan. At least this shows that the driver is incomplete ..
Those could likely interfere with ACPI, better don't use them.
There were some patches to unhide smbus to use it with such legacy modules, which might conflict with ACPI.

> but the linux-module (rs480-mainboard)
What's that? I can't find this module?

I'd like to close this one invalid if without above modules this does not happen any more.

Some more questions:
How often does/did it happen? frequently? These modules do not load automatically? What is the machine's vendor/model name?

Adding Pavel and Daniel, they might be interested in this.
Daniel already had a fix with this smbus (un)hiding, any ideas/comments?
Comment 6 Christoph Resch 2007-05-03 10:16:40 UTC
these modules load automatically .. if i remove them i have no hardware-sensors :-( .. i think that at least the instance that checks temperature should have more intelligence .. if temperature drops to noncritical within 5 seconds a shutdown should be overridden :-o 

:-) 
Comment 7 Christoph Resch 2007-05-03 12:56:12 UTC
i blacklisted the modules for now (i2c_ali15x3,i2c_ali1535) , hope it will bring some improvement .. yet 

>> but the linux-module (rs480-mainboard)
>What's that? I can't find this module?

i meant the relevant ali* modules, since they fit this mainboard .. my hardware is a Shuttle ST20G5 (IPX board)
http://global.shuttle.com/Product/Barebone/ST20G5.asp

the specs say it's a ULi 1573 mainboard

and yes, indeed it happens frequently (once a week) .. after rebooting and going straight to PC Health in the BIOS, the temperatures are quite ok (under 50°C) .. i have the latest available BIOS version for this board

please gimme 2 weeks testing period before closing this ticket :-/ 

tnx

Comment 8 Forgotten User 1GBkbCnI0A 2007-05-09 08:06:39 UTC
I am having a MSI MegaPC 865 barebone. I saw this behaviour earlier from time to time with 10.0, but thought, it had to do with temperature indeed - and bad cooling.
I updated the system to 10.2 a while ago, and everything went smoothly. Last Saturday I installed the latest kernel update and now the system won't run for longer than two minutes due to the emergency shutdown. I switched back to the former kernel and everything runs smoothly again.

What is interesting, is that my triggering temperature is WAY OFF! These are always different, but most of the time negative numbers like -14537.

When I am at home today, I will post examples from syslog.
What other information can I provide? dmidecode? hwinfo?

Ah yes, a MS Windows XP runs without problems.


The problem seems to have existed for a long time, as you can find long threads about such behaviour using Google (going back about 1 3/4 years) - without a solution.

Comment 9 Christoph Resch 2007-05-09 11:22:41 UTC
> I'd like to close this one invalid if without above modules this does not
> happen any more.

hmmm .. i would not, because after the latest kernel update RPM from SUSE these modules are back in play .. and no longer blacklisted .. so i suggest that modules that are so clearly incomplete shouldn't find their way back into the standard configuration after being blacklisted .. i will open a new bug for this, because it's a little off-topic

anyway, since the latest kernel patch my computer has happily been running for 24 hours (!!!!) without crashing or shutting down .. but gimme one more week for testing :-)
Comment 10 Thomas Renninger 2007-05-09 11:39:10 UTC
This is all very machine specific! Please don't mix up things (e.g. HPs also had crit shutdown problems, but that is totally unrelated!).

Please check for any i2c or sensor-related modules and blacklist them, move them away or whatever.
If you are sure you have identified an offending module, please post its name with some info on what you've done to test it.
Comment 11 Christoph Resch 2007-05-09 11:58:38 UTC
i now blacklist :

blacklist it87
blacklist i2c_ali15x3
blacklist i2c_ali1535

should i blacklist i2c_isa and i2c_core as well ? 
Comment 12 Thomas Renninger 2007-05-09 12:26:04 UTC
Probably a good idea. As said, I don't know much about these modules, but AFAIK they are mostly used for sensors (fan/thermal reads which should only be done by ACPI). If you lack any functionality, pls let us know.
Comment 13 Christoph Resch 2007-05-09 15:53:26 UTC
well i think that this message points out a blind spot in the driver . i may be wrong:

<4>ali1535_smbus 0000:00:1e.1: ALI1535_smb region uninitialized - upgrade BIOS?
<4>ali1535_smbus 0000:00:1e.1: ALI1535 not detected, module not inserted.
<3>ali15x3_smbus 0000:00:1e.1: ALI15X3_smb region uninitialized - upgrade BIOS
or use force_addr=0xaddr
<3>ali15x3_smbus 0000:00:1e.1: ALI15X3 not detected, module not inserted.

maybe on top of this missing part in the driver its not working as it should

regards
Comment 14 Christoph Resch 2007-05-26 09:56:41 UTC
well, now that i have disabled this module (it87), /var/log/messages is full of this (which makes me nervous)

ACPI: Unable to turn cooling device [ffff810037fde290] 'on'
ACPI: Unable to turn cooling device [ffff810037fde290] 'on'
ACPI: Unable to turn cooling device [ffff810037fde290] 'on'
ACPI: Unable to turn cooling device [ffff810037fde290] 'on'
ACPI: Unable to turn cooling device [ffff810037fde290] 'on'
ACPI: Unable to turn cooling device [ffff810037fde290] 'on'
ACPI: Unable to turn cooling device [ffff810037fde290] 'on'
ACPI: Unable to turn cooling device [ffff810037fde290] 'on'
ACPI: Unable to turn cooling device [ffff810037fde290] 'on'
ACPI: Unable to turn cooling device [ffff810037fde290] 'on'
..

of course the mainboard's sensors are not shown now; i would love to have them back.

i am not sure, but doesn't this message say that there is a part of my hardware the kernel doesn't recognise ... this must be my southbridge fan, i guess

in the BIOS i can see that it's running and that the temperature is fine :-(


Comment 15 Thomas Renninger 2007-05-29 12:23:04 UTC
Yes, looks scary, but could be harmless, if e.g. BIOS exports trip point table(s), but the fans are still handled by BIOS.

I more expect wrong/weird EC values (110 C and 80 C values should be simply wrongly reported), than real fan failure.

You may want to monitor this a bit:
watch -n1 cat /proc/acpi/fan/*/* /proc/acpi/thermal_zone/*/{temperature,trip_points}

(Also use your ears for monitoring :), if fan control is done by BIOS, you can at least guess from temperature and fan activity if things seem to work normal).
To increase temp, you can e.g. do: "cat /dev/zero >/dev/null &" (this should keep one core busy and quickly increase temperature or fan activity).

Do you still get the critical shutdowns with the smbus accessing module disabled?

You should also be able to trigger this bug more often if you set THERMAL_POLLING_FREQUENCY="1" in /etc/powersave/thermal and /etc/sysconfig/powersave/thermal and restart powersaved (rcpowersaved restart). You should then be able to verify the critical shutdown more easily (only if the smbus module is loaded?).

Jean Delvare has some ideas about getting legacy sensor modules and ACPI working together -> adding to CC.

Christoph: If you don't see critical shutdowns anymore without legacy smbus modules, I'd like to close this one (hmm, it's worth to still make sure not loading them per default for 10.2, maybe Daniel or Jean could give me a hint how to do that, the modules should also taint the kernel IMO).

Christoph: It would also be very interesting to see how this machine behaves with latest 10.3 Alpha (it's the 4th AFAIK). If you still have a free partition on your machine, it would be great to give it a test. Now we still can fix things, as soon as 10.3 is released it's getting difficult.

Christoph: Could you also attach acpidump output please.
Comment 16 Christoph Resch 2007-05-29 13:12:34 UTC
Created attachment 142653 [details]
my acpidump
Comment 17 Christoph Resch 2007-05-29 13:24:45 UTC
cat /proc/acpi/fan/FAN/state
status:                  on

cat /proc/acpi/thermal_zone/THRM/temperature
temperature:             46 C

cat /proc/acpi/thermal_zone/THRM/trip_points
critical (S5):           60 C
passive:                 50 C: tc1=4 tc2=3 tsp=60 devices=0xffff810037fe72b0
active[0]:               50 C: devices=0xffff810037fde290
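For anyone who wants to log these values rather than eyeball them, here is a minimal Python sketch that parses the text layout shown above. It assumes the /proc output always looks like this sample; the function names are made up for illustration.

```python
import re

# Sample copied from the trip_points output quoted in this comment.
SAMPLE = """\
critical (S5):           60 C
passive:                 50 C: tc1=4 tc2=3 tsp=60 devices=0xffff810037fe72b0
active[0]:               50 C: devices=0xffff810037fde290
"""

# "<name>: <number> C", where <name> may contain "(S5)" or "[0]".
LINE_RE = re.compile(r"^(?P<name>[^:]+):\s+(?P<temp>-?\d+) C")


def parse_trip_points(text):
    """Return {trip point name: temperature in degrees C}."""
    points = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            points[match.group("name").strip()] = int(match.group("temp"))
    return points


def parse_temperature(text):
    """Parse a single line such as 'temperature:             46 C'."""
    match = LINE_RE.match(text.strip())
    return int(match.group("temp")) if match else None
```

Comparing the parsed temperature against the parsed critical trip point on every poll would show whether a reading like 80 C or 110 C ever appears on its own between two normal values.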

this fan is the CPU fan, which is set to "SmartFan" in the BIOS .. i also have ksensors running. Since i cannot monitor the northbridge/system fan anymore (no more it87 and related modules) this is not optimal, i know

when this shutdown occurred there was definitely NO problem with temperature at all; my system is silent, smooth and cool.

ATM i am running `cat /dev/zero >/dev/null` on both cores (AMD X2), generating full utilization of the CPUs for 15 minutes, and my /proc/acpi/thermal_zone/THRM/temperature doesn't even rise by more than 2°; even the fan is still working rather silently.

My point was only whether there shouldn't be a workaround in the kernel or the module or the powerwatch to catch the mistaken panic and at least wait for the next result (like a UPS would handle a power failure) .. as you see, these results are meaningless after the next poll (a few seconds later) ..

now i will modprobe it87 again and let you know :-)
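
The "wait for the next result" idea can be sketched as a small filter that only reports a critical condition after several consecutive critical readings. This is only an illustration of the proposal, not how the kernel thermal driver or powersaved actually behaves; the class name, the count of three readings and the 60 C limit (taken from the trip point listed above) are assumptions.

```python
class CriticalDebouncer:
    """Report 'critical' only after `needed` consecutive critical readings."""

    def __init__(self, critical_c, needed=3):
        self.critical_c = critical_c
        self.needed = needed
        self.streak = 0  # consecutive readings at or above the limit

    def update(self, temp_c):
        """Feed one reading; return True when a shutdown would be justified."""
        if temp_c >= self.critical_c:
            self.streak += 1
        else:
            self.streak = 0  # a single sane reading resets the count
        return self.streak >= self.needed


def would_shut_down(readings, critical_c=60, needed=3):
    """Return True if any point in `readings` trips the debounced limit."""
    debouncer = CriticalDebouncer(critical_c, needed)
    return any(debouncer.update(t) for t in readings)
```

A lone spike such as 46, 80, 33 would be ignored, while a steadily rising series such as 58, 61, 63, 66 would still trigger. The price is that a genuine overheat is acted on a few polls later.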
Comment 18 Jean Delvare 2007-05-29 13:25:56 UTC
With regards to the synchronization mechanism between ACPI and non-ACPI drivers, some proposals have been made, but we don't have anything ready for production yet, and I have higher priority tasks at the moment.

The i2c-ali15x3 and i2c-ali1535 drivers aren't the cause of the reporter's problem. The messages in the logs clearly show that the BIOS did not properly initialize the SMBus device, so the Linux drivers can't use it. You can blacklist these drivers to prevent the messages in the logs, but this won't make any difference otherwise.

There's no point in blacklisting i2c-core nor i2c-isa either, as these are supporting drivers which do not access the hardware.

The driver which may be conflicting with ACPI on your system would rather be it87. This driver does _not_ autoload, so rather than blacklisting it, you should delete /etc/sysconfig/lm_sensors. If it makes the problem go away, then it would confirm that this was a driver conflict. Alternatively, you could blacklist the (ACPI) thermal and fan drivers. After all, the it87 driver gives you much more information than the ACPI drivers.
Comment 19 Christoph Resch 2007-05-29 13:58:14 UTC
/etc/sysconfig/lm_sensors also suggests to me that it might not be an issue with the it87 alone either ..
MODULE_0=k8temp #<- :-( ??
MODULE_1=it87

shanti@zion:~> lsmod |grep k8
powernow_k8            32416  1
freq_table             22912  1 powernow_k8
processor              53864  2 powernow_k8,thermal

shanti@zion:~> lsmod |grep i2
i2c_ali15x3            25348  0
i2c_isa                23040  1 it87
i2c_core               41472  3 i2c_ali15x3,it87,i2c_isa

i remember from some gentoo forums in 2005 (when i started with this PC) that at that time there had been issues with powernow_k8 and AMD64/SMP CPUs,

but after this was fixed (it came in kernel updates) i gave it no more thought

this shutdown issue also first hit me with openSUSE 10.2 .. i never had anything like this before (even with custom kernels)


since RPM says
"file /etc/sysconfig/lm_sensors is not owned by any package"

i will comment those lines, bring back it87 and proceed :-) 

PS:
kernel: ali15x3_smbus 0000:00:1e.1: ALI15X3 not detected, module not inserted.
so i guess i should use ali1563 instead , albeit my hardware is a 00:1e.1 Bridge: ALi Corporation M7101 Power Management Controller [PMU]  ??? 

tnx4support
Comment 20 Jean Delvare 2007-05-29 14:14:46 UTC
(In reply to comment #19)
> /etc/sysconfig/lm_sensors suggest me also that it might not be an issue width
> the it87 alone or neither ..
> MODULE_0=k8temp #<- :-( ??
> MODULE_1=it87

The k8temp driver is new in kernel 2.6.19 so you don't have it in openSUSE 10.2.

> since RPM says
> "file /etc/sysconfig/lm_sensors is not owned by any package"
> 
> i will comment those lines, bring back it87 and proceed :-) 

The file /etc/sysconfig/lm_sensors is generated by sensors-detect.

> PS:
> kernel: ali15x3_smbus 0000:00:1e.1: ALI15X3 not detected, module not inserted.
> so i guess i should use ali1563 instead , albeit my hardware is a 00:1e.1
> Bridge: ALi Corporation M7101 Power Management Controller [PMU]  ??? 

No. Your hardware is really supported by either the i2c-ali15x3 driver or the i2c-ali1535 driver (can't tell which, as ALi unfortunately used the same PCI device ID for both) and not the i2c-ali1563. But the BIOS did not properly map the SMBus function so it cannot be used. Anyway, you don't really care about the SMBus, so you can simply ignore it.
Comment 21 Jean Delvare 2007-05-29 15:41:01 UTC
I disassembled the DSDT and took a look. I am no ACPI expert but it is clear that ACPI is accessing a device at 0x295-0x296, which is the default address of the IT87xxF hardware monitoring function. This confirms that you should be using either thermal+fan, or it87, but not both.
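To illustrate why two independent users of the same ports can produce bogus readings: as far as I know, chips of this kind are accessed through an address/data port pair (write the register number to one port, then read the value from the other), which would match the two ports 0x295 and 0x296. If a second party changes the address register between those two steps, the first party reads the wrong register. The Python simulation below shows that race; the register numbers and values are invented and it models no real hardware.

```python
class IndexedChip:
    """Toy model of a chip with an address port and a data port."""

    def __init__(self, registers):
        self.registers = dict(registers)
        self.address = 0  # shared state: last register number written

    def write_address(self, reg):
        self.address = reg

    def read_data(self):
        return self.registers[self.address]


# Invented register map: 0x29 holds a temperature, 0x0d a fan count.
TEMP_REG, FAN_REG = 0x29, 0x0D
chip = IndexedChip({TEMP_REG: 46, FAN_REG: 110})


def read_register(chip, reg, interloper=None):
    """Two-step read; `interloper` runs between the two steps, if given."""
    chip.write_address(reg)
    if interloper is not None:
        interloper()  # e.g. another driver touching the same ports
    return chip.read_data()


undisturbed = read_register(chip, TEMP_REG)
disturbed = read_register(chip, TEMP_REG,
                          interloper=lambda: chip.write_address(FAN_REG))
```

Here the undisturbed read returns 46, while the disturbed one returns the fan register's 110 as if it were a temperature. Without a lock shared by both parties there is nothing to prevent this interleaving.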
Comment 22 Thomas Renninger 2007-05-29 17:43:44 UTC
Jean, IMO we have to set sensor modules to unsupported and must somehow make sure they do not get loaded (for ACPI archs), but still give people a chance to load them manually. How are those modules loaded?

Can you come up with a list of modules that access smbus/i2c bus and could interfere with ACPI, pls.

How do the modules get loaded?
I can't find any autoloading in the kernel (quick look, I might have overlooked something). Does it work like this: you run some userspace hwmon-test app, which writes /etc/sysconfig/lm_sensors with suggestions for which modules to load via /etc/init.d/hwmon (re-)start?

Jean, I did a quick check of the DSDT, this machine is hopeless to run ACPI and it87 module.

These functions all access the device:
 - SFAN, FON, FOFF, RTMP, STHY, STOS, SCFG

It also looks like the it87 addresses seem not to be used by default, but I would not trust this assumption.

Next thing is, that the above functions are all in _SI scope, but are not assigned to a specific ACPI device. That means writing an ACPI driver for it87 could get difficult and all this looks very machine/BIOS/vendor specific... (on the other hand, I am sure I have already seen the SFAN method; it looks like one needs an acer-acpi module including this, or it could be added to asus-acpi or whatever machine this is. What kind of machine/model is this?)

IMO it87 module must vanish anyway (or must not get loaded for ACPI machines) and we need something ACPI (probably also BIOS/vendor/model) specific.
Comment 23 Jean Delvare 2007-05-30 08:51:04 UTC
(In reply to comment #22)
> Jean, IMO we have to set sensor modules to unsupported and must somehow make
> sure they get not loaded (for ACPI archs), but still let people a chance to
> load them manually. How are those modules loaded?

The hardware monitoring drivers are already all unsupported.

> Can you come up with a list of modules that access smbus/i2c bus and could
> interfere with ACPI, pls.

Virtually all of them can. Note that the it87 driver doesn't even touch smbus/i2c in this case. So the list you want is all smbus/i2c master drivers _and_ all non-i2c-based hardware monitoring drivers, limited to devices which can be found on ACPI-enabled systems:

i2c-ali1535
i2c-ali1563
i2c-ali15x3
i2c-amd756
i2c-amd756-s4882
i2c-amd8111
i2c-i801
i2c-nforce2
i2c-piix4
i2c-sis5595
i2c-sis630
i2c-sis96x
i2c-viapro

abituguru
f71805f
hdaps
it87
k8temp
pc87360
pc87427
sis5595
smsc47b397
smsc47m1
via686a
vt1211
vt8231
w83627ehf
w83627hf

That's a pretty long list, isn't it?

Note: of these, the smsc47m1, vt8231 and via686a are probably less risky than the others because their access is stateless. So even if ACPI is accessing these devices, the drivers shouldn't get in the way. Shouldn't...

> How does the modules get loaded?
> I can't find any autoloading in kernel (quick look, I might have overseen
> something), does this work that you run some userspace hwmon-test app, that
> one writes /etc/sysconfig/lm_sensors with suggestions which modules to load
> via /etc/init.d/hwmon (re-)start?

The only hardware monitoring modules which autoload are k8temp, sis5595, via686a and vt8231, because they are PCI devices. All SMBus master drivers listed above autoload too, again because they are PCI devices. For the other hardware monitoring drivers, they are loaded by /etc/rc.d/lm_sensors based on the list found in /etc/sysconfig/lm_sensors. This configuration file is generated by /usr/sbin/sensors-detect.
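
To check which modules a given configuration file would cause to be loaded, something like the following Python sketch could be used. It assumes the `MODULE_<n>=<name>` layout quoted in comment 19; the sample text (including its comment line) and the function name are illustrative only.

```python
import re

# Sample in the style quoted in comment 19; contents are illustrative.
SAMPLE = """\
# example configuration
MODULE_0=k8temp
MODULE_1=it87
"""

# MODULE_<n>=<name>, with optional double quotes around the name.
MODULE_RE = re.compile(r'^\s*MODULE_(\d+)\s*=\s*"?([\w-]+)"?')


def configured_modules(text):
    """Return module names from MODULE_<n>=<name> lines, ordered by <n>."""
    found = []
    for line in text.splitlines():
        match = MODULE_RE.match(line)  # lines starting with '#' never match
        if match:
            found.append((int(match.group(1)), match.group(2)))
    return [name for _, name in sorted(found)]
```

Commenting out a line with `#` removes that module from the result, which mirrors what Christoph did by hand.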

One thing that could be done would be to add an ACPI check in sensors-detect, so that we can warn the user that a conflict could happen. However, only checking for the presence of ACPI would result in many false positives. So ideally we would need a way to determine if a given DSDT contains functions which access the I/O ports of the devices detected by sensors-detect. But I guess this would be pretty hard to automate.
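
A very rough approximation of such a check, as a Python sketch: scan a disassembled DSDT for `OperationRegion` declarations in `SystemIO` space and test them for overlap with the port range a sensor driver would claim. The ASL fragment below is hypothetical (not taken from any attachment here), and the range 0x290-0x297 is just an assumed example.

```python
import re

# Hypothetical fragment of a disassembled DSDT (ASL); not taken from any
# attachment in this report.
SAMPLE_DSL = """
OperationRegion (HWMN, SystemIO, 0x0295, 0x02)
OperationRegion (PMIO, SystemIO, 0x0800, 0x80)
OperationRegion (GNVS, SystemMemory, 0x7FF0E000, 0x0100)
"""

# Only matches regions whose offset and length are literal hex numbers.
REGION_RE = re.compile(
    r"OperationRegion\s*\(\s*(\w+)\s*,\s*SystemIO\s*,"
    r"\s*(0x[0-9A-Fa-f]+)\s*,\s*(0x[0-9A-Fa-f]+)\s*\)"
)


def io_regions(dsl_text):
    """Yield (name, first_port, last_port) for each SystemIO region."""
    for name, start, length in REGION_RE.findall(dsl_text):
        first = int(start, 16)
        yield name, first, first + int(length, 16) - 1


def conflicting_regions(dsl_text, first_port, last_port):
    """Names of SystemIO regions overlapping [first_port, last_port]."""
    return [name for name, lo, hi in io_regions(dsl_text)
            if lo <= last_port and first_port <= hi]
```

This misses regions whose offset is computed at run time and says nothing about whether the region is ever actually used, which is part of why a reliable automatic check is hard.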

Another (hopefully temporary) approach is to add DMI-based blacklists to individual SMBus and hardware monitoring drivers. If the list of motherboards with a conflict is small enough, it might work. If not, it might quickly become unmaintainable :(
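
The matching logic of such a blacklist is simple; the maintenance burden is in the table. A Python sketch of the idea follows (in a real driver this would be kernel C code). The field names and all entries are placeholders, not real DMI data from the machines in this report.

```python
# Hypothetical blacklist entries; the vendor/board strings are placeholders.
DMI_BLACKLIST = [
    {"board_vendor": "ExampleVendor", "board_name": "ExampleBoard-1"},
    {"board_vendor": "OtherVendor"},  # matches every board of this vendor
]


def dmi_matches(entry, dmi):
    """True if every field listed in `entry` equals the machine's DMI field."""
    return all(dmi.get(key) == value for key, value in entry.items())


def is_blacklisted(dmi, blacklist=DMI_BLACKLIST):
    """True if the machine's DMI data matches any blacklist entry."""
    return any(dmi_matches(entry, dmi) for entry in blacklist)
```

A driver would refuse to bind when `is_blacklisted()` is true for the running machine. Every newly reported conflicting board means another entry in the table.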

> Jean, I did a quick check of the DSDT, this machine is hopeless to run ACPI
> and it87 module.

I agree.

> This functions all access the device:
>  - SFAN, FON, FOFF, RTMP, STHY, STOS, SCFG

Are these functions called if neither the "fan" nor "thermal" drivers are loaded? My guess is that Christoph would rather use the it87 driver than the less featured ACPI drivers.

> It also looks like the it87 addresses seem not to be used by default, but I
> would not trust this assumption.

I don't understand what you mean here. Can you please explain?

> Next thing is, that the above functions are all in _SI scope, but are not
> assigned to a specific ACPI device. That means writting an ACPI driver for
> it87 could get difficult and all this looks very machine/BIOS/vendor
> specific... (on the other hands side I am sure I already have seen the SFAN
> method, looks like one need a acer-acpi module including this or it could be
> added to asus-acpi or whatever machine this is, what kind of machine/model
> is this?)

Please remember I am no ACPI expert. What is the "_SI scope"?

> IMO it87 module must vanish anyway (or must not get loaded for ACPI machines)
> and we need something ACPI (probably also BIOS/vendor/model) specific.

ACPI was supposed to be a standard, and now I seem to understand that we would need to write a dedicated driver for every motherboard vendor, or maybe even every motherboard model out there? *sigh* The it87 driver is supporting the ITE IT87xxF/FG chips on _all_ motherboard models.

But sure, just do it. The problem right now is that the ACPI people keep complaining that non-ACPI hardware monitoring drivers are causing trouble, but users keep using them because ACPI doesn't offer anything close to what lm-sensors has been providing for years.
Comment 24 Forgotten User 1GBkbCnI0A 2007-05-30 09:38:59 UTC
On my machine (MSI MegaPC 865 barebone) the module smsc47m1 seems to be the cause of the trouble, contrary to the assumption, that this module does not affect the acpi value reading.

The difference in loaded modules between the older and the newer kernels is lm90, hwmon and smsc47m1; the critical shutdowns do not happen when smsc47m1 is not loaded.
Comment 25 Jean Delvare 2007-05-30 10:13:25 UTC
Stefan, please attach your acpidump.
Comment 26 Forgotten User 1GBkbCnI0A 2007-05-30 17:31:37 UTC
Created attachment 142976 [details]
acpidump of MSI MegaPC 865 barebone

Here you are...
Comment 27 Jean Delvare 2007-05-31 10:05:07 UTC
Stefan, I checked your DSDT table and it seems that your ACPI implementation includes a fairly complete fan speed control mechanism. It is setting the fan speed using the SMSC LPC47M1xx PWM outputs based on the temperature readings from a chip on the SMBus at address 0x2d. This can't possibly be an LM90-compatible chip, as these live at 0x4c or 0x4d. You must have a 3rd hardware monitoring driver which you didn't list. Could be smsc47m192 (which is NOT the same as smsc47m1)? Or some LM85-compatible device.

Anyway, I double-checked the smsc47m1 driver, the device has a flat I/O space and the driver doesn't even write to it by default so I just can't see how it would interact with ACPI, especially not with temperatures as the smsc47m1 driver only deals with fans. There could be some unexpected interaction if you tried to control the fan by yourself (using fancontrol), but no invalid temperature reads as you have been seeing.

Given the ACPI implementation your system has, I recommend that you do not load non-ACPI hardware monitoring drivers nor SMBus master drivers. The ACPI stuff should work just fine for you.
Comment 28 Thomas Renninger 2007-05-31 10:27:38 UTC
> so I just can't see how it would interact with ACPI, especially not with
> temperatures as the smsc47m1 driver only deals with fans
There also exists a little microcontroller called EC (Embedded Controller) that makes it even easier for developers to program AML/ASL code.
This one has its own firmware and 256 bytes of registers, and could pre-process info like fan speed and temp, or even control fan speed depending on temps (as done on recent ThinkPads).
You cannot see which addresses/buses this one accesses (AFAIK it also accesses i2c and smbus) and it's very likely that the EC got confused by sensor module reads/writes.
E.g. we had EC confusion because of psmouse driver interference, as the EC also accesses the super I/O chipset (even though this was probably an EC firmware bug, I just want to show that things can be complex...).

Jean, do you think one can write an acpi-hwmon driver with such provided ACPI functions to read/write fan and temp?
If yes, I expect we need something similar to what is done in asus_acpi (there at least the same ACPI device (ATK..) always existed, but ACPI functions were named differently from machine to machine). What we would need then is something like whitelisting machines via e.g. DSDT table id, or something better, and then assigning ACPI funcs to the hwmon driver where we know what they do, e.g. like:

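/* pseudocode: choose the model-specific ACPI method once */
if (match_xy_model())
   acpi_fan_on_acpi_method = "\_SI/FON";
else if (match_yz_model())
   acpi_fan_on_acpi_method = "\scope_xy/XFON";
...

/* generic callback: evaluates whichever method was selected above */
acpi_fan_on() {
    evaluate_acpi_object(.., acpi_fan_on_acpi_method, ..);
}

struct driver_hwmon acpi_hwmon;
        acpi_hwmon.fan_on_callback = acpi_fan_on;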
Comment 29 Thomas Renninger 2007-05-31 10:29:25 UTC
Back to the bug, are there any offending modules loaded automatically now?
Do we need to do something?
I'd like to close or rename this one (enhancement -> implement hwmon acpi drivers :) ).
Comment 30 Jean Delvare 2007-05-31 12:20:31 UTC
(In reply to comment #28)
> Jean, do you think one can write an acpi-hwmon driver with such provided ACPI
> functions to read/write fan and temp?

Yes, but it will be difficult. There was an attempt to do this for some Asus motherboards:
http://lists.lm-sensors.org/pipermail/lm-sensors/2007-May/019715.html
My fear is that this will be heavily motherboard-dependent. We will have to check the DSDT of every motherboard out there to find out which AML methods do what. Not only will the function names vary from one board to the next, but also the calling convention, possible side effects and even availability of these functions. Even a simple BIOS update could break it. This is something we always tried to avoid with non-ACPI hardware monitoring drivers, because it doesn't scale at all over time and makes maintenance a nightmare.

But of course, if your plan is really to get rid of all SMBus master and hardware monitoring drivers as soon as ACPI is enabled, then you will have to do something like that. You simply can't kill a functionality thousands of people have been using for years without providing something to replace it. That something will lack some (most?) of the features (for example ACPI doesn't care about voltages at all, does it?) but at the very least it must exist. Otherwise, expect to receive hundreds of mails from angry users every month.
Comment 31 Jean Delvare 2007-05-31 12:25:20 UTC
(In reply to comment #29)
> Back to the bug, are there any offending modules loaded automatically now?
> Do we need to do something?

I already listed the potentially offending drivers which autoload, see comment #23. Basically these are all the pci drivers in the list.
Comment 32 Forgotten User 1GBkbCnI0A 2007-05-31 13:15:22 UTC
(In reply to comment #27)
> Stefan, I checked your DSDT table and it seems that your ACPI implementation
> includes a fairly complete fan speed control mechanism. It is setting the fan
> speed using the SMSC LPC47M1xx PWM outputs based on the temperatures readings
> from a chip on the SMBus at address 0x2d. This can't possibly be an
> LM90-compatible chip, as these live at 0x4c or 0x4d. You must have a 3rd
> hardware monitoring driver which you didn't list. Could be smsc47m192 (which is
> NOT the same as smsc47m1)? Or some LM85-compatible device.

It works pretty well ;)
With the update to 2.6.18.8-0.3 some modules have been loaded automatically.
I did not load any monitoring modules before.
With 0.1-kernel the diff of loaded modules was:
i2c_isa
lm90
hwmon
smsc47m1

> Anyway, I double-checked the smsc47m1 driver, the device has a flat I/O space
> and the driver doesn't even write to it by default so I just can't see how it
> would interact with ACPI, especially not with temperatures as the smsc47m1
> driver only deals with fans. There could be some unexpected interaction if you
> tried to control the fan by yourself (using fancontrol), but no invalid
> temperature reads as you have been seeing.

"As far as I remember" blacklisting smsc47m1 solved the problems with my machine. Maybe it was hwmon, but I tend to suspect smsc47m1.
 
> Given the ACPI implementation your system has, I recommend that you do not
> load non-ACPI hardware monitoring drivers nor SMBus master drivers. The ACPI 
> stuff should work just fine for you.

So I thought, but with the update kernel they got loaded.

My concern is that this module will not be blacklisted.

Comment 33 Jean Delvare 2007-05-31 14:08:02 UTC
(In reply to comment #32)
> With the update to 2.6.18.8-0.3 some modules have been loaded automatically.
> I did not load any monitoring modules before.
> With 0.1-kernel the diff of loaded modules was:
> i2c_isa
> lm90
> hwmon
> smsc47m1

I am fairly certain this isn't true. There is no way these modules could be loaded automatically, because these hardware monitoring devices cannot be easily detected. Please check your /etc/sysconfig/lm_sensors file.
Comment 34 Forgotten User 1GBkbCnI0A 2007-06-04 06:55:22 UTC
You got me here. Sorry for the confusion.

The funny thing is that with 2.6.18.8-0.1 the system runs smoothly, whereas with 0.3 it reads the strange temperatures.

OK for now, I see the problem with the conflicting/confusing accesses to the hardware. Thank you for the help and effort!
Comment 35 Thomas Renninger 2007-06-05 14:22:03 UTC
Created attachment 144232 [details]
Request IO and mem resources for ACPI operation regions

This patch requests IO and mem resources for ACPI operation regions.
Unfortunately it still clashes with some of its own resources (see line 5000).
002e-002f : acpi*
0072-0073 : acpi*
0090-0091 : acpi*
0295-0296 : acpi*
5000-50fe : acpi*
  5000-5003 : ACPI PM1a_EVT_BLK 
  5004-5005 : ACPI PM1a_CNT_BLK 
  5008-500b : ACPI PM_TMR 
  5010-5015 : ACPI CPU throttle 
  5020-5023 : ACPI GPE0_BLK 
  50b0-50b7 : ACPI GPE1_BLK 
adalid:~ # cat /proc/iomem  |grep -i acpi
The purpose of this patch is to still allow other drivers to request the io/mem resources and only throw a kern_err if a driver has done so.
That way we could get a picture of which Operation Regions DSDTs generally use and which machines might have problems with them.


At some later time the request_resource_soft might fall away again if possible (see next patch). Then the only way to access the regions is via ACPI.
Comment 36 Thomas Renninger 2007-06-05 14:49:56 UTC
There is no next patch; it's just the same but without the _soft, so further drivers trying to request such a resource will fail.

Here is some dmesg output from the patch in comment #35. The conflicts show specific IO regions that are used for sure, parsed from the FADT and also declared as SYSTEM_IO operation regions:

IO resource region conflicts with IO ACPI PM1a_EVT_BLK regions, conflict is ignored, system might run unstable.
IO resource region conflicts with IO ACPI PM1a_CNT_BLK regions, conflict is ignored, system might run unstable.
IO resource region conflicts with IO ACPI PM_TMR regions, conflict is ignored, system might run unstable.
IO resource region conflicts with IO ACPI GPE0_BLK regions, conflict is ignored, system might run unstable.
IO resource region conflicts with IO ACPI GPE1_BLK regions, conflict is ignored, system might run unstable.


This needs some more fiddling (it is just meant as an example patch that I'd like to send to the ACPI/hwmon lists if you think it's worth it), but it could be a start towards getting a picture of which drivers might conflict with ACPI.

Once cleaned up, a message like this should be printed when it87 is being loaded:
IO resource region conflicts with acpi IO region [0x295-0x296], conflict is ignored, system might run unstable.

If it comes out that no important drivers are affected, we could let the drivers fail to load.
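
The kind of check discussed here can be illustrated with a small user-space sketch in Python. It is not the patch itself: the port ranges are copied from the listing in comment #35, the function names are made up for this example, and the 0x290-0x297 range requested on behalf of it87 is an assumption.

```python
# ACPI-claimed I/O port ranges (inclusive), copied from the listing in comment #35.
ACPI_IO_REGIONS = [
    (0x002E, 0x002F),
    (0x0072, 0x0073),
    (0x0090, 0x0091),
    (0x0295, 0x0296),
    (0x5000, 0x50FE),
]


def overlaps(a, b):
    """Return True if two inclusive (start, end) ranges share at least one port."""
    return a[0] <= b[1] and b[0] <= a[1]


def find_conflicts(request, claimed=ACPI_IO_REGIONS):
    """List every ACPI-claimed range that the requested range overlaps."""
    return [region for region in claimed if overlaps(request, region)]


def format_warning(region):
    """Build a warning in the style of the message quoted above."""
    return ("IO resource region conflicts with acpi IO region "
            "[0x%x-0x%x], conflict is ignored, system might run unstable."
            % region)


# Hypothetical request: a driver asking for ports 0x290-0x297.
for region in find_conflicts((0x290, 0x297)):
    print(format_warning(region))
```

A "soft" policy would only print the warning, while a strict policy would refuse the request whenever `find_conflicts` returns a non-empty list.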

Jean, I grepped over my DSDTs, here is a list of common SystemIO addresses used/claimed by DSDTs: /suse/trenn/Export/SystemIO.txt

Any comments welcome...
Comment 37 Alexey Starikovskiy 2007-06-06 20:31:23 UTC
Created attachment 144583 [details]
Set correct name of the resource

Thomas,
I think it's convenient to know the name of the resource, not just acpi*.
Comment 38 Alexey Starikovskiy 2007-06-07 18:42:54 UTC
Created attachment 144838 [details]
DSDT from desktop ASUS mainboard

Thomas,
This DSDT seems to encapsulate some hwmon, please look at 
ASOC.
Comment 39 Tristan Hoffmann 2007-06-07 19:38:29 UTC
Hi,
I just want to add that I have also had this problem since openSUSE 10.2 on an HP nx6125 laptop. I now have openSUSE 10.3 Alpha 3 and it still occurs, but I've never loaded hardware monitoring modules manually.

/var/log/messages:
Jun  7 20:14:04 turion-laptop kernel: ACPI: Critical trip point
Jun  7 20:14:04 turion-laptop kernel: Critical temperature reached (7168 C), shutting down.
Jun  7 20:14:05 turion-laptop shutdown[5639]: shutting down for system halt
Jun  7 20:14:05 turion-laptop init: Switching to runlevel: 0 
Comment 40 Alexey Starikovskiy 2007-06-08 08:59:31 UTC
Do you have anything in /etc/sysconfig/lm_sensors?
Do you see problems if you remove all sensors out of this file and re-start hwmon?
Comment 41 Jean Delvare 2007-06-08 10:56:21 UTC
(In reply to comment #40)
> Do you have anything in /etc/sysconfig/lm_sensors?
> Do you see problems if you remove all sensors out of this file and re-start
> hwmon?

That's not the right order. You must first stop hwmon ("/etc/rc.d/lm_sensors stop"), and then delete /etc/sysconfig/lm_sensors, otherwise the hwmon drivers will not be removed.
Comment 42 Tristan Hoffmann 2007-06-08 11:48:53 UTC
Well on both openSUSE 10.3 and 10.2 there is no directory or file called "lm-sensors" in /etc/sysconfig.
Comment 43 Tristan Hoffmann 2007-06-08 11:51:58 UTC
sorry, I wanted to write "lm_sensors"
Comment 44 Jean Delvare 2007-06-08 12:06:58 UTC
(In reply to comment #42)
> Well on both openSUSE 10.3 and 10.2 there is no directory or file called
> "lm_sensors" in /etc/sysconfig.

This means that your problem is different from the original report. Please open a separate bug report.

Comment 45 Tristan Hoffmann 2007-06-08 14:06:10 UTC
Do you really think this is a different bug?
Maybe it's just not lm_sensors causing this?
Comment 46 Jean Delvare 2007-06-08 14:25:54 UTC
(In reply to comment #45)
> Do you really think this is a different bug?
> Maybe it's just not lm_sensors causing this?

Same symptoms but different cause => different bug.
Comment 47 Christoph Resch 2007-06-10 17:19:46 UTC
here again:

#dmesg |grep ACPI 

there are some quirks and an invalid value for PBLK within: 


 BIOS-e820: 0000000077ef0000 - 0000000077ef3000 (ACPI NVS)
 BIOS-e820: 0000000077ef3000 - 0000000077f00000 (ACPI data)
ACPI: RSDP (v000 XPC                                   ) @ 0x00000000000f7e50
ACPI: RSDT (v001 XPC    AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x0000000077ef3040
ACPI: FADT (v001 XPC    AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x0000000077ef30c0
ACPI: SSDT (v001 PTLTD  POWERNOW 0x00000001  LTP 0x00000001) @ 0x0000000077ef74c0
ACPI: HPET (v001 XPC    AWRDACPI 0x42302e31 AWRD 0x00000098) @ 0x0000000077ef76c0
ACPI: MCFG (v001 XPC    AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x0000000077ef7740
ACPI: MADT (v001 XPC    AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x0000000077ef7400
ACPI: DSDT (v001 XPC     ST20V10 0x00001000 MSFT 0x0100000e) @ 0x0000000000000000
ACPI: PM-Timer IO Port: 0x4008
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
ACPI: HPET id: 0x10b9a201 base: 0xfed00000
Using ACPI (MADT) for SMP configuration information
ACPI: Core revision 20060707
ACPI: bus type pci registered
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (0000:00)
ACPI: Assume root bridge [\_SB_.PCI0] bus is 0
PCI quirk: region 4000-403f claimed by ali7101 ACPI
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P2PB._PRT]
ACPI: PCI Interrupt Link [LNK1] (IRQs 1 3 4 5 6 7 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNK2] (IRQs 1 3 4 *5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNK3] (IRQs 1 3 4 5 6 7 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNK4] (IRQs 1 3 4 5 6 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNK5] (IRQs 1 3 4 5 6 7 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNK6] (IRQs 1 3 4 5 6 *7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNK7] (IRQs 1 3 4 5 6 7 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNK8] (IRQs 1 3 4 5 6 7 10 11 12 14 15) *9
ACPI: PCI Interrupt Link [LNK9] (IRQs 1 *3 4 5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.AGP_._PRT]
PCI: Using ACPI for IRQ routing
ACPI: (supports S0 S1 S4 S5)
ACPI: PCI Interrupt 0000:00:1f.1[A] -> GSI 19 (level, low) -> IRQ 185
ACPI: Fan [FAN] (on)
ACPI: PCI Interrupt 0000:00:1c.3[D] -> GSI 23 (level, low) -> IRQ 193
ACPI: PCI Interrupt 0000:00:1f.0[A] -> GSI 19 (level, low) -> IRQ 185
ACPI: PCI Interrupt 0000:00:1d.0[C] -> GSI 21 (level, low) -> IRQ 201
ACPI: PCI Interrupt 0000:00:1c.0[A] -> GSI 18 (level, low) -> IRQ 209
ACPI: PCI Interrupt 0000:00:1c.1[B] -> GSI 18 (level, low) -> IRQ 209
ACPI: PCI Interrupt 0000:00:1c.2[C] -> GSI 18 (level, low) -> IRQ 209
ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 18 (level, low) -> IRQ 209
ACPI: PCI Interrupt 0000:03:16.0[A] -> GSI 18 (level, low) -> IRQ 209
ACPI: PCI Interrupt 0000:03:15.0[A] -> GSI 17 (level, low) -> IRQ 217
ACPI: Power Button (FF) [PWRF]
ACPI: Power Button (CM) [PWRB]
ACPI: Invalid PBLK length [7]
ACPI: Unable to turn cooling device [ffff810037fde290] 'on'
ACPI: Thermal Zone [THRM] (50 C)
ACPI: Unable to turn cooling device [ffff810037fde290] 'on'
ACPI: Unable to turn cooling device [ffff810037fde290] 'on'
ACPI: Unable to turn cooling device [ffff810037fde290] 'on'
ACPI: Unable to turn cooling device [ffff810037fde290] 'on'
ACPI: Unable to turn cooling device [ffff810037fde290] 'on'
ACPI: Unable to turn cooling device [ffff810037fde290] 'on'
ACPI: Unable to turn cooling device [ffff810037fde290] 'on'
ACPI: Unable to turn cooling device [ffff810037fde290] 'on'
ACPI: Unable to turn cooling device [ffff810037fde290] 'on'
ACPI: PCI Interrupt 0000:01:05.0[A] -> GSI 17 (level, low) -> IRQ 217
ACPI: Unable to turn cooling device [ffff810037fde290] 'on'
Comment 48 Thomas Renninger 2007-06-11 07:53:09 UTC
> an invalid value for PBLK
This one is harmless

The HPs are known to have ACPI issues:
 a) buggy ACPI implementation, which hits HPs especially hard
    (no fan watchdog; once unsynchronized, fans stop working, ...)
 b) EC interferes with other device drivers (e.g. like the psmouse issue; more
    strange EC breakage has been reported). Very hard to track down, probably
    an EC firmware issue. This could be related to this bug; in fact I'd be very
    happy if those EC failures turned out to be caused by some (noticeable)
    device interference.

You certainly hit b) as the temperature value (in comment #39) is totally bogus.
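
For illustration, here is a minimal Python sketch of why a value like 7168 C stands out. ACPI thermal zones report temperature in tenths of kelvin; the function names and the plausibility bounds below are assumptions chosen for this example, not anything the kernel does.

```python
def decikelvin_to_celsius(raw):
    """Convert a reading in tenths of kelvin to degrees Celsius."""
    return (raw - 2732) / 10.0


def is_plausible(celsius, low=-40.0, high=150.0):
    """Crude sanity check; the bounds are arbitrary assumptions."""
    return low <= celsius <= high


print(decikelvin_to_celsius(3232))   # a raw value of 3232 is 50 degrees Celsius
print(is_plausible(7168))            # the value from comment #39 fails the check
```

Such a range check only catches physically impossible values. Spikes like the 80 C reading in the original report are within a plausible range and would pass it.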

Tristan: Some work is/was going on, best you try the latest kernel of the day which is 2.6.22-rcX based: 
ftp://ftp.suse.com/pub/projects/kernel/kotd/HEAD/x86_64/kernel-default.rpm
The only change that could help here IMO is the psmouse cleanup, not sure whether you already got it, it's in 2.6.22-rcX for sure.

about comment #47:
This one looks scary:
PCI quirk: region 4000-403f claimed by ali7101 ACPI
Comment 49 Tristan Hoffmann 2007-06-11 12:07:11 UTC
Thanks, I will try the new kernel with openSUSE Alpha 5 when it's released.
Comment 50 Pavel Machek 2007-06-15 20:48:49 UTC
Sorry for jumping late here.

In comment #27, there's something about blacklist for sensors-detect.

If sensors are important enough that we'd get hundreds of angry mails per month if we disable them, what about doing it right?

That means a whitelist of systems where we know sensors work. Then we could think about loading them by default etc.

A blacklist does not work, as more systems where ACPI conflicts with sensors are manufactured as the years pass. A whitelist works, given a big enough community.
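
The difference between the two policies shows up for a system nobody has looked at yet. The following Python sketch is only an illustration; the vendor and board names are placeholders, and a real list might be keyed on DMI strings or the DSDT table id instead.

```python
# Placeholder entries; not real hardware.
KNOWN_GOOD = {("ExampleVendor", "ExampleBoard-1")}
KNOWN_BAD = {("ExampleVendor", "ExampleBoard-2")}


def allow_by_blacklist(system, blacklist=KNOWN_BAD):
    """Blacklist policy: load sensor drivers unless the system is listed as bad."""
    return system not in blacklist


def allow_by_whitelist(system, whitelist=KNOWN_GOOD):
    """Whitelist policy: load sensor drivers only if the system is listed as good."""
    return system in whitelist


# A newly manufactured board that appears on neither list.
new_board = ("ExampleVendor", "ExampleBoard-3")
print(allow_by_blacklist(new_board))  # True: drivers load, possibly conflicting with ACPI
print(allow_by_whitelist(new_board))  # False: drivers stay off until someone verifies it
```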
Comment 51 Christoph Resch 2007-08-17 21:25:56 UTC
i haven't had those nasty shutdowns for a month or more now ... i switched to the x86 arch last week .. still i have debug lines like this in messages:

Aug 17 02:13:12 zion kernel: ACPI: Unable to turn cooling device [dffecd88] 'on'


and lots of them, but not all the time :-o

my current ACPI output:

shanti@zion:~> dmesg |grep ACPI
 BIOS-e820: 0000000077ef0000 - 0000000077ef3000 (ACPI NVS)
 BIOS-e820: 0000000077ef3000 - 0000000077f00000 (ACPI data)
ACPI: RSDP (v000 XPC                                   ) @ 0x000f7e50
ACPI: RSDT (v001 XPC    AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x77ef3040
ACPI: FADT (v001 XPC    AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x77ef30c0
ACPI: SSDT (v001 PTLTD  POWERNOW 0x00000001  LTP 0x00000001) @ 0x77ef74c0
ACPI: HPET (v001 XPC    AWRDACPI 0x42302e31 AWRD 0x00000098) @ 0x77ef76c0
ACPI: MCFG (v001 XPC    AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x77ef7740
ACPI: MADT (v001 XPC    AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x77ef7400
ACPI: DSDT (v001 XPC     ST20V10 0x00001000 MSFT 0x0100000e) @ 0x00000000
ACPI: PM-Timer IO Port: 0x4008
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
ACPI: HPET id: 0x10b9a201 base: 0xfed00000
Using ACPI (MADT) for SMP configuration information
ACPI: Core revision 20060707
ACPI: bus type pci registered
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (0000:00)
ACPI: Assume root bridge [\_SB_.PCI0] bus is 0
PCI quirk: region 4000-403f claimed by ali7101 ACPI
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P2PB._PRT]
ACPI: PCI Interrupt Link [LNK1] (IRQs 1 3 4 5 6 7 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNK2] (IRQs 1 3 4 *5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNK3] (IRQs 1 3 4 5 6 7 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNK4] (IRQs 1 3 4 5 6 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNK5] (IRQs 1 3 4 5 6 7 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNK6] (IRQs 1 3 4 5 6 *7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNK7] (IRQs 1 3 4 5 6 7 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNK8] (IRQs 1 3 4 5 6 7 10 11 12 14 15) *9
ACPI: PCI Interrupt Link [LNK9] (IRQs 1 *3 4 5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.AGP_._PRT]
pnp: PnP ACPI init
pnp: PnP ACPI: found 11 devices
PnPBIOS: Disabled by ACPI PNP
PCI: Using ACPI for IRQ routing
ACPI: (supports S0 S1 S4 S5)
ACPI: Invalid PBLK length [7]
ACPI: Thermal Zone [THRM] (46 C)
ACPI: PCI Interrupt 0000:00:1f.0[A] -> GSI 19 (level, low) -> IRQ 185
ACPI: PCI Interrupt 0000:00:1f.1[A] -> GSI 19 (level, low) -> IRQ 185
ACPI: PCI Interrupt 0000:03:16.0[A] -> GSI 18 (level, low) -> IRQ 193
ACPI: Fan [FAN] (on)
ACPI: PCI Interrupt 0000:00:1d.0[C] -> GSI 21 (level, low) -> IRQ 201
ACPI: PCI Interrupt 0000:00:1c.0[A] -> GSI 18 (level, low) -> IRQ 193
ACPI: PCI Interrupt 0000:00:1c.1[B] -> GSI 18 (level, low) -> IRQ 193
ACPI: PCI Interrupt 0000:00:1c.2[C] -> GSI 18 (level, low) -> IRQ 193
ACPI: PCI Interrupt 0000:00:1c.3[D] -> GSI 23 (level, low) -> IRQ 209
ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 18 (level, low) -> IRQ 193
ACPI: PCI Interrupt 0000:03:15.0[A] -> GSI 17 (level, low) -> IRQ 217
ACPI: Power Button (FF) [PWRF]
ACPI: Power Button (CM) [PWRB]
ACPI: PCI Interrupt 0000:01:05.0[A] -> GSI 17 (level, low) -> IRQ 217
Comment 52 Pavel Machek 2007-08-22 07:44:46 UTC
So the original problem is gone and this can be closed?
Comment 53 Christoph Resch 2007-08-22 20:26:44 UTC
not at all ;-( happened again today

meanwhile my whole architecture changed to x86, and for a long time i had not seen this

of course my opensuse is at the latest patch level

Aug 22 21:48:47 zion kernel: ACPI: Critical trip point
Aug 22 21:48:47 zion kernel: Critical temperature reached (65 C), shutting down.
Aug 22 21:48:47 zion shutdown[7106]: shutting down for system halt
Aug 22 21:48:47 zion powersaved[3587]: WARNING (checkTemperatureStateChanges:218) Temperature state changed to critical.
Aug 22 21:48:47 zion init: Switching to runlevel: 0
Aug 22 21:48:49 zion kernel: Critical temperature reached (47 C), shutting down.

again, my point is to change the code to give ACPI a second try after a few seconds .. as you can see, this is pure panic over a bogus value x-o , and the system responds inappropriately.



Comment 54 Christoph Resch 2007-08-22 21:15:51 UTC
fyi:

EVENT_TEMPERATURE_CRITICAL
EVENT_TEMPERATURE_HOT

are set to notify .

shanti@zion:~> cat /proc/acpi/thermal_zone/THRM/polling_frequency
polling frequency:       2 seconds

that should IMHO be the minimum delay before reacting ( by shutting down ) 

shanti@zion:~> cat /proc/acpi/thermal_zone/THRM/trip_points
critical (S5):           60 C
passive:                 50 C: tc1=4 tc2=3 tsp=60 devices=0xdffec338
active[0]:               50 C: devices=0xdffecd88

60° is set far too low .. anyway, amending it is not the point :-) since i hardly ever exceed 51°C

tnx
Comment 55 Christoph Resch 2007-08-22 21:19:10 UTC
Created attachment 159317 [details]
update: changed from x86_64 to pure x86

update: changed from x86_64 to pure x86
Comment 56 Christoph Resch 2007-08-22 21:20:33 UTC
Created attachment 159321 [details]
acpidump for my Shuttle ST20G5

UPDATE: acpidump for my Shuttle ST20G5 .. now on x86
Comment 57 Pavel Machek 2007-08-24 07:51:42 UTC
No, I do not think we want to break our thermal handling by adding "hmm, let's try again" because of a single broken system. Can you confirm that you are not running lm_sensors any more?

Len Brown has a "let's make ACPI thermal work as well as it does in Windows" goal; perhaps you should report it to bugzilla.kernel.org and see what happens?

You may also be able to unload the thermal module and/or tweak some BIOS settings.
Comment 58 Christoph Resch 2007-10-16 16:28:02 UTC
Oct 16 18:00:33 zion kernel: ACPI: Critical trip point
Oct 16 18:00:33 zion kernel: Critical temperature reached (110 C), shutting down.
Oct 16 18:00:33 zion shutdown[28204]: shutting down for system halt
Oct 16 18:00:33 zion powersaved[3546]: WARNING (checkTemperatureStateChanges:218) Temperature state changed to critical.
Oct 16 18:00:34 zion init: Switching to runlevel: 0
Oct 16 18:00:35 zion kernel: Critical temperature reached (44 C), shutting down.

happened again just 10 minutes ago :-( :-(

you misunderstand, i don't want you to break any code, but it seems that this procedure is being tricked by bogus values ..

even if you won't add a redundant check to the software (what is the problem with waiting 5 seconds?) .. there has to be some user interaction to decide ..

right before, i was working with a lot of documents open and 3 VMwares running .. then suddenly:

"Hello i shut down " .. zzzzz .. all within 15 seconds .. due to the high disk activity on shutdown, submitting a "sudo shutdown -c" was not possible in time ..

This is definitely a disaster !!
especially since my hardware is AMD/ATI, who are top supporters of openSUSE
one could not say that it is my fault that openSUSE is not working properly with my AMD/ATI/RS480 hardware

i cannot see any lm_sensors stuff:

shanti@zion:~> lsmod |grep lm
shanti@zion:~> ps fax |grep lm
 6282 pts/1    S+     0:00  |       \_ grep --colour=auto lm
shanti@zion:~>  
shanti@zion:~> /etc/init.d/lm_sensors status
it8712-isa-0290
Adapter: ISA adapter
VCore 1:   +1.28 V  (min =  +4.08 V, max =  +4.08 V)   ALARM
VCore 2:   +1.76 V  (min =  +4.08 V, max =  +4.08 V)   ALARM
+3.3V:     +3.18 V  (min =  +4.08 V, max =  +4.08 V)   ALARM
+5V:       +4.87 V  (min =  +6.85 V, max =  +6.85 V)   ALARM
+12V:     +11.71 V  (min = +16.32 V, max = +16.32 V)   ALARM
-12V:     -19.87 V  (min =  +3.93 V, max =  +3.93 V)   ALARM
-5V:       -2.76 V  (min =  +4.03 V, max =  +4.03 V)   ALARM
Stdby:     +4.78 V  (min =  +6.85 V, max =  +6.85 V)   ALARM
VBat:      +4.08 V
fan1:     3750 RPM  (min =    0 RPM, div = 8)
fan2:        0 RPM  (min =    0 RPM, div = 8)
fan3:        0 RPM  (min =    0 RPM, div = 8)
M/B Temp:    +45 C  (low  =    -1 C, high =    -1 C)   sensor = thermistor
CPU Temp:    -48 C  (low  =    -1 C, high =    -1 C)   sensor = thermistor
Temp3:       -55 C  (low  =    -1 C, high =    -1 C)   sensor = thermistor


the only BIOS tweak i would want is a new version ( of which there is none ) .. there is nothing else to do than to prevent opensuse from shutting down no matter how many users are doing whatever business :-o

it may be a kernel issue in it87 .. or in the it8712-isa-0290 module/driver or so ..


but the result is an overreaction by the operating system !
Comment 59 Jean Delvare 2007-10-16 17:48:13 UTC
Christoph, see comment #21. Do not use the it87 driver on this machine, it interacts with ACPI in an unpredictable way.
Comment 60 Christoph Resch 2007-10-16 18:25:56 UTC
0x295-0x296 ... could that be helpful with regard to log messages like these:

from /var/log/messages:
=========================
Oct 16 18:03:18 zion kernel: it87: Found IT8712F chip at 0x290, revision 6
Oct 16 18:03:18 zion kernel: it87-isa 9191-0290: Detected broken BIOS defaults, disabling PWM interface

some-dmesg:
=============
PCI quirk: region 4000-403f claimed by ali7101 ACPI
PCI: Setting latency timer of device 0000:00:06.0 to 64
pcie_portdrv_probe->Dev[5a38:1002] has invalid IRQ. Check vendor BIOS
ali1535_smbus 0000:00:1e.1: ALI1535_smb region uninitialized - upgrade BIOS?
ali1535_smbus 0000:00:1e.1: ALI1535 not detected, module not inserted.
ali15x3_smbus 0000:00:1e.1: ALI15X3_smb region uninitialized - upgrade BIOS or use force_addr=0xaddr
ali15x3_smbus 0000:00:1e.1: ALI15X3 not detected, module not inserted.

[[ lspci -> 00:06.0 PCI bridge: ATI Technologies Inc RS480 PCI Bridge ]]

Shuttle Inc. insists that their BIOS has no problems ( it is only revision 13 :-p ) and it seems that they are not willing to even try to fix such behavior ...

"BIOS" definitely shows up too often in my logs

I'd like to know: can i use the mentioned "force_addr=0x..." to somehow fix these problems ? ..

is thermal+fan superior to it87 ? because i cannot "rmmod thermal", and without it87 i cannot monitor my hardware and have no information about it


so how/where can this be fixed ?
Comment 61 Jean Delvare 2007-10-16 18:50:07 UTC
Please read comment #18 again. Using force_addr won't help you.

Why can't you "rmmod thermal"? It works for me. I'm not saying that it's a great idea though, as this also means that you lose the automatic thermal regulation that might be implemented by your BIOS. Needless to say, this may have very bad consequences.

Without the it87 driver you should still be able to get the system temperature with "acpi -t". I know that it's inferior to what the it87 driver offers, but for openSUSE 10.2 there's no other way. We're working on a solution but it is a complex issue and it will take some more time before it's ready.
Comment 62 Christoph Resch 2007-10-16 19:44:30 UTC
i would love to upgrade to 10.3 :-) but this bug https://bugzilla.novell.com/show_bug.cgi?id=332298 keeps me hoping , so i have to stay 10.2 for now :-) 
Comment 63 Christoph Resch 2007-10-16 19:46:34 UTC
i read about the troubles regarding ACPI .. it seems that on modern hardware we won't have any access to power-throttling functions anymore :-(
Comment 64 Jean Delvare 2007-10-16 20:22:49 UTC
(In reply to comment #62 from Christoph Resch)
> i would love to upgrade to 10.3 :-) but this bug
> https://bugzilla.novell.com/show_bug.cgi?id=332298 keeps me hoping , so i have
> to stay 10.2 for now :-) 

10.3 would not do any better with regard to ACPI and hwmon drivers conflicting anyway.
Comment 65 Pavel Machek 2007-10-17 08:35:01 UTC
Can we close this monsterbug?

If you are running lm_sensors, your machine is unsupported...

If you are NOT running lm_sensors, please open specific bug for your system.

If you claim we need to detect false readings, take it to Len. We should make sure there are no false readings in the first place. I have tried a "let's read again" approach to detecting false readings, and even that fails -- if you are unlucky, you'll get two bad readings in a row.
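
Both sides of this argument can be shown with a small Python sketch of the "read again" idea. It is an illustration only, not kernel code: the function name is made up, the 60 C critical trip point and the 47/65/47 sequence are taken from comments #53 and #54, and the sequence with two consecutive 110 C readings is hypothetical.

```python
def critical_events(readings, critical, required=2):
    """Return the indexes at which `required` consecutive readings reach `critical`."""
    events = []
    streak = 0
    for index, temp in enumerate(readings):
        if temp >= critical:
            streak += 1
        else:
            streak = 0
        # Fire once per run, at the moment the streak reaches the required length.
        if streak == required:
            events.append(index)
    return events


# A single spike, as in comment #53: reacting at once triggers, a re-read does not.
print(critical_events([47, 65, 47], critical=60, required=1))  # [1]
print(critical_events([47, 65, 47], critical=60, required=2))  # []

# Hypothetical unlucky case: two bad readings in a row defeat the re-read.
print(critical_events([47, 110, 110, 44], critical=60, required=2))  # [2]
```

The re-read filters an isolated spike, but it also delays the reaction to a genuine overheat by one polling interval and still fails when the bogus value repeats.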
Comment 66 Thomas Renninger 2007-10-17 09:16:20 UTC
Pavel is right...
I have a patch in the queue, worked out with Jean, to at least detect such a clash and print a warning when a driver is loaded that interferes with ACPI IO/mem regions. The goal is to let the driver fail to load in that case; let's see how many of these messages pop up when this one hits mainline. I doubt it's worth backporting that, let's see..
Comment 67 Christoph Resch 2007-10-17 11:31:34 UTC
so how can i tell suse10.2 to disable "thermal" in favour of it87 ????

rmmod thermal tells me it's in use .. i cannot remove it .. and IMHO the blacklisting of modules doesn't work out as expected

please help !

best regards

-c-
Comment 68 Thomas Renninger 2007-10-17 11:47:53 UTC
As said by Pavel in comment #65, you should not load the sensor module, but go for thermal.
> rmmod thermal tells me its in use
This should work generally, maybe you still accessed /proc/acpi/thermal somewhere? fuser -m /proc/acpi/thermal should tell you.

To get rid of the thermal module you need to:
remove thermal from ACPI_MODULES (or similar) variable in /etc/sysconfig/kernel and from /etc/sysconfig/kernel (INITRD_MODULES list), then invoke mkinitrd.
Comment 69 Christoph Resch 2007-10-17 11:57:47 UTC
# fuser -m /proc/acpi/thermal
tells me:
Cannot stat /proc/acpi/thermal: No such file or directory
Cannot stat /proc/acpi/thermal: No such file or directory
Cannot stat /proc/acpi/thermal: No such file or directory

i only have /proc/acpi/thermal_zone/THRM/

i already removed "thermal" from INITRD_MODULES and ran mkinitrd .. i will tell you after the next reboot ..

 thank you for support
Comment 70 Christoph Resch 2007-10-17 12:28:19 UTC
ok results: 

no more it87-module but thermal is still active :-( 

shanti@zion:~> lsmod |grep the
thermal                18568  1
processor              34664  2 powernow_k8,thermal
shanti@zion:~> lsmod |grep it87
shanti@zion:~>

.. i have no ACPI_MODULES section in /etc/sysconfig/kernel .. and no other "thermal" in there :-(

another suggestion, please .. for now i miss my temp sensors :-(

Comment 71 Thomas Renninger 2007-10-17 13:04:19 UTC
Sorry:
> remove thermal from ACPI_MODULES (or similar) variable in /etc/sysconfig/kernel
Should be:
remove thermal from ACPI_MODULES (or similar) variable in /etc/sysconfig/powersave/common
Rather than removing thermal, you have to explicitly enter the modules you want to have loaded:
"processor button", possibly also battery, ac or more if you have a laptop.
Comment 72 Christoph Resch 2007-10-21 17:30:57 UTC
ACPI_MODULES is totally empty .. thermal is also removed from INITRD_MODULES .. still it is loaded after the system comes up :-o
Comment 73 Thomas Renninger 2007-10-22 07:39:02 UTC
If you read the comment above the variable:
# If this variable is empty, the default is used. If you want to disable
# module loading, enter "NONE".
That means that if it's empty, the system tries to load all default modules (stated some lines above). Either add the drivers you would like to have loaded or set it to NONE.
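
The three cases described by that config comment can be spelled out in a short Python sketch. This is an illustration of the semantics only, not the powersave script: the function name is made up and the default module list is an assumed example, since the real defaults are the ones stated in the config file.

```python
# Assumed example defaults; the real list is documented in the config file itself.
EXAMPLE_DEFAULTS = ["processor", "thermal", "fan", "button"]


def modules_to_load(acpi_modules, defaults):
    """Resolve an ACPI_MODULES value: empty means defaults, NONE means nothing."""
    value = acpi_modules.strip()
    if value == "":
        return list(defaults)
    if value == "NONE":
        return []
    return value.split()


print(modules_to_load("", EXAMPLE_DEFAULTS))                  # defaults, thermal included
print(modules_to_load("NONE", EXAMPLE_DEFAULTS))              # nothing is loaded
print(modules_to_load("processor button", EXAMPLE_DEFAULTS))  # only the listed modules
```

Under these semantics an empty value keeps thermal in the set, which would explain why emptying the variable did not stop the module from loading.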