Bugzilla – Bug 498966
Sporadic hangs with SLERT 10 SP2 on a Dell Precision 670
Last modified: 2010-07-26 11:05:06 UTC
Created attachment 288694 [details] Output of `ps -eo pid,user,args,ni,rtprio --sort rtprio`
Created attachment 288695 [details] Analysis.txt from core
These are interesting excerpts from the bug report:

"On one system the frequency of shutdowns increased until the machine was unusable, with only 4 hours of runtime between shutdowns. We suspected the RAID setup and the Dell PERC5 driver, and so changed to the on-board SATA drive controller. The system ran without shutdowns for 1 month and we declared the problem solved, BUT then a shutdown with the same symptoms occurred 1 week later."

And: "This core was grabbed using sysrq keys while the server was otherwise unresponsive. (Note - the server WAS responding to pings, but all other actions failed.)"

So it sounds like a soft lockup, possibly due to the RAID card or driver or both. Not sure that nmi_watchdog will trigger a dump in this case. We've run Dell 670s for a while now and don't see any of these problems, but I do not believe we run that RAID card in them.
1. serial console / ssh console at prio 90
2. lower tasks to RT 49
3. elevate softirq/high to prio 81
4. tasks at RR policy
5. consider utilizing pthread_cond and pthread_mutex
Created attachment 289532 [details] Backtrace of all running tasks at time of dump
I have added a semaphore warning feature to the SLERT 10 SP2 kernel-of-the-day, which can be downloaded from here:

http://ftp.suse.com/pub/projects/kernel/kotd/SLERT10_BRANCH/i386/kernel-rt.rpm

This kernel is very close to the latest released kernel; there are a number of features in it that are under qualification, queued for the next update. The particular feature emits a warning to the console / syslog if an RT-priority task locks a semaphore that does not support priority inheritance. The warning will look something like:

"RT task <string>:<PID> with priority <PRIO> using non PI futex"

The feature is enabled as follows (as root):

echo "1" > /proc/sys/kernel/futex_rt_pi_warning

To confirm:

cat /proc/sys/kernel/futex_rt_pi_warning
returns: 1

The overhead is a single test and branch, and should not impact performance. If you do encounter the warning, it may indicate a possible priority-inversion situation. I will attach an example program demonstrating this in another post.
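For reference, the standard way to avoid tripping this warning is to give the mutex the priority-inheritance protocol, so the kernel uses PI futex operations and boosts a low-priority holder instead of letting an RT waiter starve it. A minimal sketch (not the example program mentioned above, which will be attached separately):

```c
#include <assert.h>
#include <pthread.h>

/* A default pthread_mutex_t uses an ordinary futex, which is what a
 * warning like the one described above would flag when locked by an
 * RT task. Requesting PTHREAD_PRIO_INHERIT switches it to the PI
 * futex operations. */
static pthread_mutex_t pi_lock;

int main(void)
{
    pthread_mutexattr_t attr;

    assert(pthread_mutexattr_init(&attr) == 0);
    assert(pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT) == 0);
    assert(pthread_mutex_init(&pi_lock, &attr) == 0);
    pthread_mutexattr_destroy(&attr);

    assert(pthread_mutex_lock(&pi_lock) == 0);
    /* ... critical section ... */
    assert(pthread_mutex_unlock(&pi_lock) == 0);
    pthread_mutex_destroy(&pi_lock);
    return 0;
}
```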
Kernel Update 5: v2.6.22.19-0.21 is available on the update channel. Please update to the latest released Kernel as soon as possible.
(In reply to comment #30) > Kernel Update 5: v2.6.22.19-0.21 is available on the update channel. > Please update to the latest released Kernel as soon as possible. Should they be running this kernel, or the RC candidate for update 6?
They should be running the Update 5 kernel for general testing when reproducing the core problem.
(In reply to comment #22)
> Created an attachment (id=289743) [details]
> Semaphore code from Horiba
>
> This is the code that Sven requested during the call last week.

I have reviewed this code and compiled it into a simple test application. At this point the only code that I see as possibly problematic (depending on implementation) is the sem_nw_call() function. Depending on implementation, this function could loop while acquiring a semaphore. If a FIFO process were to use sem_nw_call at a priority > 49 and, for example, were waiting for a signal to terminate the semaphore-polling loop, the system could theoretically hang.
We have lowered the RT priority of all processes to 49 and below. We made this change on the SLERT Prec 670 system as well as several RHEL systems. The SLERT system was destroyed in an experiment, but it ran for 10 days without issue at the lower RT priority. Prior to the change, lockups were seen every 4 days on average. While this isn't a certain indication of anything, it does suggest the change may have helped. We have had two RHEL systems with the RT priority lowered encounter lockups; however, those two incidents were not thoroughly investigated, as the appropriate resources were not available. As we are currently rebuilding the SLERT Prec 670 system, we will install the new kernel v2.6.22.19-0.21.
At the risk of creating confusion: the release candidate for Update 6 is here:

http://ftp.suse.com/pub/projects/kernel/kotd/SLERT10_SP2_BRANCH/

This has the above-mentioned additions to the code. The difference between SLERT10_BRANCH and SLERT10_SP2_BRANCH is that the latter is an official release branch, while the former is a development branch. It would therefore be better to direct the customer to SLERT10_SP2_BRANCH: the difference between the released Update 5 code base and the additions I have made since is much smaller there, and therefore easier to follow when comparing the source of the released Update 5 against the pre-release Update 6 tree.
Sven, that link is no longer available. Can you please advise which kernel should be installed on the system? As well, please advise any options that need to be set and what we should be looking for in which log files.
As per our meeting today, this system is not running the third-party CANBUS kernel module, nor does it have the PCI card installed. The system has locked up 3 times since being installed last week, and on all three occasions the third-party CANBUS-related hardware and software were not installed.
***** A system non-responsive to keyboard and mouse resumed functioning after ps and top were executed on an RT-priority serial console connection. *****

The test PC was moved to another cell to relieve the interruptions to the customer. The new cell configuration only required one serial port, which enabled our team to set up the RT-priority serial port console for monitoring. Both before and after this lockup, the system has been experiencing a new issue with the mouse cursor intermittently freezing when moved across the screen. The system is currently running and has not been restarted since the incident.

Once up and running, the following environment was configured:
1) once a minute, the date command output is written to /tmp/watchdog.log
2) tail -f /tmp/watchdog.log is running in an open terminal
3) a serial console is up and running on a separate PC, with bash set to RT priority. top displayed:
   4633 root -91 0 4060 1884 1428 S 0.0 0.0 0:00.08 bash

On June 18, the system was observed to be non-responsive to keyboard and mouse interaction.
- ping works
- ssh was unable to connect
- tail stopped updating at 21:52 on June 17; the file it was monitoring was also later seen to have time stamps missing from 21:52 to 07:40, when the system recovered
- the serial console is alive
- the system recovered from the locked state while top and ps commands were executed on the serial console
- the date watchdog file continued collecting dates after the lockup was overcome; however, there is a gap from 21:52:06 to 07:40:55

From the serial console, top was executed and showed:
- the process read_pt was taking 96.5% CPU and was in "D" status (D means "uninterruptible sleep")
- read_pt was issued from editProfile.pl
- this read_pt is reading our control software's _annunc_al_warn data point
- a zombie "sh <defunct>", PID 8444, was created by PID 8443 snpst_sched
I am pulling the cores now.
Please be aware those core files are not related to the incident described in comment 47.
Which exact kernel were these cores generated on? Thanks.
Ok, I'm looking at the June17 dump now. It looks like it was generated on our update4 kernel (vmlinux-2.6.22.19-0.19-rt_bigsmp). At the time of the dump, the load average was [65.50, 65.19, 64.67], so this machine was extremely overloaded. Since the 15 min load average was also at 64.67, it was overloaded for a long time. Since this is a single CPU system, it is not surprising that the machine appeared to be hanging. I will look into the dump some more, but there is probably a runaway user task that is forking too many processes.
Looking at the June 16 dump, the load was [83.14, 81.66, 69.17], so again the system was severely overloaded. I don't see a great many tasks in either core, so I'm not sure that this is a runaway situation. However, the system will not be responsive with these load numbers, especially if some of the tasks are running at RT priority. The June 15 core seems to be from a different kernel. Which kernel were you running on June 15? Thanks.
That is the rt_bigsmp kernel from a PTF per bug 441214 https://bugzilla.novell.com/show_bug.cgi?id=441214
(In reply to comment #54) > That is the rt_bigsmp kernel from a PTF per bug 441214 > > https://bugzilla.novell.com/show_bug.cgi?id=441214 This particular Kernel has been superseded by Update 4, and should be discarded.
Created attachment 299097 [details] Output of ps -t on June 17 core

PID: 10764  TASK: ef54a780  CPU: 0  COMMAND: "read_pt"
    RUN TIME: 10:58:23
  START TIME: 25641
   USER TIME: 2
 SYSTEM TIME: 37920530

This is by far the largest amount of system time used by any process. Also, I see heavy activity on IRQ 16. Can you please provide the output of cat /proc/interrupts?
crash> bt 10764
PID: 10764  TASK: ef54a780  CPU: 0  COMMAND: "read_pt"
 #0 [ce51bd58] __schedule at c031820e
 #1 [ce51bdb4] preempt_schedule_irq at c03189b5
 #2 [ce51bdc4] need_resched at c01041c3
 #3 [ce51be00] __set_page_address at c01668d0
 #4 [ce51be1c] kmap_high at c0166f0a
 #5 [ce51beb8] __handle_mm_fault at c016a457
 #6 [ce51bf54] do_page_fault at c031c079
 #7 [ce51bfb4] error_code (via page_fault) at c031a330

Apparently, this task was rescheduled pending allocation of new memory at the time of the crash dump.
What is the priority and scheduling policy of "read_pt"?
It should be noted that since we moved the PC, we removed a Digi PCI serial card. As well, please note we are having mouse and keyboard delays on this system at all times while our software is running.

# cat /proc/interrupts
           CPU0
  0:        959   IO-APIC-edge      timer
  1:          2   IO-APIC-edge      i8042
  3:          2   IO-APIC-edge
  4:       8694   IO-APIC-edge      serial
  6:          5   IO-APIC-edge      floppy
  7:          0   IO-APIC-edge      parport0
  8:         27   IO-APIC-edge      rtc
  9:          0   IO-APIC-fasteoi   acpi
 12:          4   IO-APIC-edge      i8042
 15:     478580   IO-APIC-edge      ide1
 16:    6315250   IO-APIC-fasteoi   ohci1394, uhci_hcd:usb1, uhci_hcd:usb5, eth1, nvidia
 17:     270869   IO-APIC-fasteoi   libata, aacraid, uhci_hcd:usb4
 18:   35289357   IO-APIC-fasteoi   eth0
 19:          0   IO-APIC-fasteoi   uhci_hcd:usb2
 20:          2   IO-APIC-fasteoi   ehci_hcd:usb3
NMI:          0   Non-maskable interrupts
LOC:   49444015   Local interrupts
RES:          0   Rescheduling interrupts
CAL:          0   function call interrupts
TLB:          0   TLB shootdowns
TRM:          0   Thermal event interrupts
SPU:          0   Spurious interrupts
ERR:          0
ps -eo ppid,pid,user,args,ni,rtprio --sort rtprio | grep -i read_pt
20795 24407 root  grep -i read_pt          0  -
 8369 24404 adacs read_pt -t e ENGINE_ON   -  1

# ps 8369
  PID TTY      STAT   TIME COMMAND
 8369 ?        SNL    3:15 /usr/bin/perl /opt/adacs/local/mc_uploader.pl -ia_info_msqid 0 -ia_stand e

Can you tell me the status of read_pt in the cores we sent you? As well, it would be helpful if you included the full command-line parameters of read_pt.
(In reply to comment #59) > It should be noted that since we moved the PC, we removed a digi pci serial > card. As well, please note we are having mouse and keyboard delays on this > system at all times while our Software is running. > > > 16: 6315250 IO-APIC-fasteoi ohci1394, uhci_hcd:usb1, uhci_hcd:usb5, > eth1, nvidia This nvidia driver, is this also a proprietary nvidia driver?
19: 0 IO-APIC-fasteoi uhci_hcd:usb2
20: 2 IO-APIC-fasteoi ehci_hcd:usb3

These USB ports are unused. It should be possible to connect the mouse and keyboard here (you might need to experiment to find which USB port corresponds to these IRQs). When you have the right one, keystrokes and mouse moves will slowly increment the IRQ counts. Once you have found it, let's try elevating the priority of the corresponding IRQ-19 and IRQ-20 threads. That could help address the mouse issues, and would isolate nvidia and eth1 on IRQ 16. Are there other USB devices on that IRQ-16 tree? Is eth1 configured? Is eth1 heavily used?
(In reply to comment #60)
> ps -eo ppid,pid,user,args,ni,rtprio --sort rtprio|grep -i read_pt
> 20795 24407 root grep -i read_pt 0 -
> 8369 24404 adacs read_pt -t e ENGINE_ON - 1
>
> # ps 8369
> PID TTY STAT TIME COMMAND
> 8369 ? SNL 3:15 /usr/bin/perl /opt/adacs/local/mc_uploader.pl
> -ia_info_msqid 0 -ia_stand e
>
> Can you tell me the status of read_pt in the cores we sent you? As well it
> would be helpful if you included the full command line parameters of read_pt.

10764  13947  0  ef54a780  RU  0.0  7096  1156  read_pt

The process is running, and appears to have been scheduled away, pending memory allocation. Per the back-trace, it's healthy, and preempted on the run queue.

I am unable to extract the full command line from the core at this moment. Do you get anything from something like:

ps -lf `pidof read_pt`

This would give better insight into the effective priority.
(In reply to comment #62) > > These usb ports are unused. It should be possible to connect mouse and keyboard > here (might need to experiment which USB port corresponds to these IRQs). BTW - this would be just a workaround until we get to the bottom of all this.
We know, in the case of our lockup this morning (the one we were able to recover from without taking a sysrq core dump), that read_pt is spawned from a script called mc_uploader.pl, not our original suspicion of editProfile.pl. mc_uploader on this cell executes read_pt 54 times every 10 minutes. read_pt is a binary that utilizes our dblib library; in this case it is finding the datatype of the 54 data points. mc_uploader.pl also reads the data point values through a Perl module known as accessdb every 10 seconds. Both accessdb and read_pt use the same underlying dblib library for interacting with the real-time database.

It would be interesting to know if, in all of the cores we have submitted, read_pt had been running for an extended period of time. Each instance of read_pt should take less than 1 second to execute. Many of our scripts use accessdb and read_pt for interacting with the real-time database. It is therefore possible that they will be running at different priorities depending on which parent process spawned the inquiry. In this sense it would be extremely helpful if your team could give us the parent process of read_pt from these cores. Is it correct that you have found in the June 17 core file that read_pt had been running for 10 hours?

For example, one instance of read_pt, which I know is not related to mc_uploader.pl:

# ps -lf `pidof read_pt` | grep -v bash
F S UID   PID   PPID C PRI NI ADDR SZ  WCHAN STIME TTY  TIME CMD
0 R adacs 21056 8650 0  99 19 -    979 -     16:24 ?    0:00 read_pt e SI_LastFrPoint.b

ps 8650
  PID TTY      STAT   TIME COMMAND
 8650 ?        SN     0:15 /usr/bin/perl /opt/adacs/local/readLastFrictionPoint.pl
We just experienced a lockup on the cell running SLERT Update 4. The on-site staff were able to run top from the serial console and observe that read_pt was consuming 96.4 and 96.1 percent of the CPU. When the technician hit 'c' to show the full command parameters, the system locked up completely. He then executed a sysrq core dump, which I will upload; I will provide the file name in a future update. Which commands, in order of priority, should we be executing from the serial console when we encounter these lockups?
(In reply to comment #65)
> In this sense it would
> be extremely helpful if your team could give us the parent process of read_pt
> from these cores.

Core June 17:

crash> ps -p 10764
PID: 0      TASK: c03ad3a0  CPU: 0  COMMAND: "swapper"
 PID: 1      TASK: dfce3770  CPU: 0  COMMAND: "init"
  PID: 13770  TASK: c979d9f0  CPU: 0  COMMAND: "database"
   PID: 13772  TASK: ed966740  CPU: 0  COMMAND: "prog_launcher"
    PID: 13947  TASK: ec979320  CPU: 0  COMMAND: "perl"
     PID: 10764  TASK: ef54a780  CPU: 0  COMMAND: "read_pt"
Is there any way to find out what perl script is being run by perl? You said earlier you couldn't get the full command parameters, is this something that can be overcome?
Created attachment 299117 [details] ps -t for June 16 core

For the June 16 core, it seems to be the complementary write task generating the load spike:

crash> ps -t 25725
PID: 25725  TASK: eaa35970  CPU: 0  COMMAND: "write_pt"
    RUN TIME: 00:31:26
  START TIME: 16771
   USER TIME: 6
 SYSTEM TIME: 1761111

crash> ps -p 25725
PID: 0      TASK: c03ad3a0  CPU: 0  COMMAND: "swapper"
 PID: 1      TASK: dfce3770  CPU: 0  COMMAND: "init"
  PID: 7155   TASK: dff1c1f0  CPU: 0  COMMAND: "database"
   PID: 7157   TASK: edde8700  CPU: 0  COMMAND: "prog_launcher"
    PID: 7314   TASK: c96c51e0  CPU: 0  COMMAND: "perl"
     PID: 25725  TASK: eaa35970  CPU: 0  COMMAND: "write_pt"
(In reply to comment #68)
> Is there any way to find out what perl script is being run by perl? You said
> earlier you couldn't get the full command parameters, is this something that
> can be overcome?

crash> ps -a 7314
PID: 7314  TASK: c96c51e0  CPU: 0  COMMAND: "perl"
ps: cannot access user stack address: bfb3421f

Stumped on this one for the moment.
(In reply to comment #66)
> We just experienced a lockup on the cell running SLERT Update 4. The on site
> staff were able to run top from the serial console and observe that read_pt was
> consuming 96.4 and 96.1 percent of the cpu. When the technician hit C to show
> the full command parameters the system locked up completed. He then executed a
> sysrq core dump which I will upload and provide the file name in a future
> update. Which commands in order of priority should we be executing from the
> serial console when we encounter these lockups?

You could try attaching gdb to the process to see what it's doing.
(In reply to comment #65)
> mc_uploader on this cell executes read_pt 54 times every 10 minutes. read_pt
> is a binary that utilizes our dblib library. In this case it is finding the
> datatype of the 54 data points. mc_uploader.pl also reads the data point
> values through a perl module known as accessdb every 10 seconds. Both accessdb
> and read_pt use the same underlying dblib library for interacting with the real
> time database.

Wild guess... Is there a loop that repeat-tries in case a database record is locked? Something like that could create enough of a DoS situation to block the task trying to release the lock on the record.
(In reply to comment #71)
> (In reply to comment #66)
>
> You could try to attach gdb to the process to see what its doing.

If you can attach to the read_pt process via strace and append the output of strace to a logfile, the I/O might rate-limit things enough to keep control of the system. It should be easy enough to trap the situation, given that the disk containing said logfile should become very busy.
(In reply to comment #70)
> (In reply to comment #68)
> > Is there any way to find out what perl script is being run by perl? You said
> > earlier you couldn't get the full command parameters, is this something that
> > can be overcome?
>
> crash> ps -a 7314
> PID: 7314  TASK: c96c51e0  CPU: 0  COMMAND: "perl"
> ps: cannot access user stack address: bfb3421f
>
> Stumped on this one for the moment.

crash> task 13947 | grep comm
  comm = "perl\000loader.pl\000",

So it looks like the perl script run by process 13947 was loader.pl.
For the June 17 dump, the following is all existing perl processes with command lines and run times:

crash> ps | grep perl
  13894  13772  0  c979c2b0  IN  0.1   5340   2760  perl
  13929  13772  0  c979c880  IN  0.0   4644   1744  perl
  13947  13772  0  ec979320  IN  1.2  48584  48584  perl
  14270      1  0  edb46d90  RU  1.1  44412  44412  perl
  14271      1  0  ed3f2c50  IN  0.0   4644   1800  perl
  14273      1  0  ec9798f0  IN  0.0   4644   1728  perl

crash> task 13894 | grep comm
  comm = "perl\000rSave.pl\000\000",
crash> task 13929 | grep comm
  comm = "perl\000t.pl\000her\000\000",
crash> task 13947 | grep comm
  comm = "perl\000loader.pl\000",
crash> task 14270 | grep comm
  comm = "perl\000rtSparc_ms",
crash> task 14271 | grep comm
  comm = "perl\000romSparc.p",
crash> task 14273 | grep comm
  comm = "perl\000astFrictio",

crash> ps -t 13894 13929 13947 14270 14271 14273
PID: 13894  TASK: c979c2b0  CPU: 0  COMMAND: "perl"
    RUN TIME: 18:01:37
  START TIME: 247
   USER TIME: 26
 SYSTEM TIME: 6

PID: 13929  TASK: c979c880  CPU: 0  COMMAND: "perl"
    RUN TIME: 18:01:36
  START TIME: 248
   USER TIME: 1618
 SYSTEM TIME: 6172

PID: 13947  TASK: ec979320  CPU: 0  COMMAND: "perl"
    RUN TIME: 18:01:36
  START TIME: 248
   USER TIME: 36333
 SYSTEM TIME: 52910

PID: 14270  TASK: edb46d90  CPU: 0  COMMAND: "perl"
    RUN TIME: 18:01:15
  START TIME: 269
   USER TIME: 202
 SYSTEM TIME: 418

PID: 14271  TASK: ed3f2c50  CPU: 0  COMMAND: "perl"
    RUN TIME: 18:01:15
  START TIME: 269
   USER TIME: 4
 SYSTEM TIME: 9

PID: 14273  TASK: ec9798f0  CPU: 0  COMMAND: "perl"
    RUN TIME: 18:01:15
  START TIME: 269
   USER TIME: 5
 SYSTEM TIME: 4
(In reply to comment #75) > For the June 17 dump, the following is all existing perl processes with command > lines and run times: And this particular set would be the one in the call tree for the runaway read_pt > > crash> ps | grep perl > 13947 13772 0 ec979320 IN 1.2 48584 48584 perl > > crash> task 13947 | grep comm > comm = "perl\000loader.pl\000", > > > PID: 13947 TASK: ec979320 CPU: 0 COMMAND: "perl" > RUN TIME: 18:01:36 > START TIME: 248 > USER TIME: 36333 > SYSTEM TIME: 52910
(In reply to comment #76)
> (In reply to comment #75)
> > For the June 17 dump, the following is all existing perl processes with command
> > lines and run times:
>
> > PID: 13947  TASK: ec979320  CPU: 0  COMMAND: "perl"
> >     RUN TIME: 18:01:36
> >   START TIME: 248
> >    USER TIME: 36333
> >  SYSTEM TIME: 52910

Which notably also has the highest system time of all the perl processes...
BTW:

# crash /boot/vmlinux-2.6.22.19-0.19-rt_bigsmp vmcore

Prior to that:

gzip -d /boot/vmlinux-2.6.22.19-0.19-rt_bigsmp

and install the debuginfo, which I am uploading now...
(In reply to comment #78) > and install the debuginfo, which I am uploading now... You can grab the debug info for the Update 4 Kernel here, if you want to be able to look at the core images on-site; this may accelerate the process by eliminating some uploads. ftp://ftp.novell.com/dev/lsg/Horiba/kernel-rt_bigsmp-debuginfo-2.6.22.19-0.19.i586.rpm
To correct comment 66, it was write_pt that was taking over the CPU, not read_pt:

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
21696 user  -2 -20 40540 2228 1328 R 96.1  0.1  3:02.32 write_pt
 8159 user  21  19 59976  58M  41M S  2.0  1.5 40:38.36 database

Both read_pt and write_pt use the same library and function in much the same way. Unfortunately, we were unable to take a sysrq+c core dump, as the progression of the lockup caused by pressing 'c' in top on the RT serial console put the system in such a state that the sysrq keys would not respond. Can the Novell team advise of a script or sequence of commands we should run from the RT serial console to collect the most meaningful information?
Based on the current data, I wrote a series of tests to stress the system with read_pt. Under normal operating conditions read_pt takes less than 1 second to execute and consumes less than 1% of the CPU.

The tests:

A perl script that loops continuously executing:
  read_pt -t e POINTNAME
-> Approximately 5 minutes of run time to lock the system

A perl script that loops 60 times, pauses 10 seconds using bash sleep, then repeats.
-> This test took approximately an hour to lock the system

We have noticed that read_pt and write_pt execute at different RTPRIO levels depending on the executing script. Any advice from Novell on how to investigate this issue further would be appreciated.
This sounds like a good approach. I will give it some more thought after wrapping up the ESD issue.
(In reply to comment #81)
> Based on the current data, I wrote a series of tests to stress the system with
> read_pt. Under normal operating conditions read_pt takes less than 1 second to
> execute and consumes less than 1% of the CPU.
>
> The tests:
>
> perl script that loops continuously executing:
> read_pt -t e POINTNAME
> ->Approximately 5 minutes of run time to lock the system

I would focus on this test, which takes less time to failure.

- Is it possible to further break down the test, e.g. eliminating perl and calling the lower-level routines directly?
- I highly recommend using strace at all process-dispatch points, to attempt to obtain a log file of the system call entries and exits that are effectively using so much CPU.
- Try attaching GDB to the lowest-level process, and attempting to trap, via breakpoint, whatever loop is taking place.

Ultimately, this can IMO only be debugged by eliminating as much of the periphery around the core implementation as possible, and identifying the core failure in a divide-and-conquer fashion. Without more detailed insight into the code, however, it's only guesswork on this end.
We have been able to recover from the lockup approximately 5 times in the last week by interacting with the RT serial connection. We had 3 other lockups where we had not set the bash session to RTPRIO 90, and we were not able to recover; the serial connection was non-responsive. We are using the ps command to investigate the state of the system during a lockup through the RT serial connection. As we execute ps, we find that read_pt's CPU usage drops to 0 and it completes execution. This drop happens so quickly that we are only able to observe it if we run two instances of ps on the same command line. The system seems to remain in the locked state for many hours until the ps command is executed on the RT serial connection.
We are currently streamlining a version of the database and the read/write_pt applications to run in a simple isolated way. We are able to reproduce the issue running just read_pt and the database processes with no extra configuration or processes. We are working to prepare a package of source code and binaries to enable the Novell team to reproduce the issue.
We have now been successful at recreating the lockup situation with a minimal subset of the ADACS application. I will attach a tar file containing some binaries and script files for reproduction.

Before starting, ensure that there is a serial console with a shell at RT priority. tar xzf lockup_bin.tz, cd ./bin, and execute stress.sh . Some priorities are higher than normal, so X will likely be unresponsive. The lockup will occur between 1 minute and 6 hours afterwards. The lockup is evidenced by %sys time at 99+ and usually one process at 99+ %CPU. The system sometimes responds to ping for a while, but will not respond to ssh or other network access.

The lockup seems to be caused by a tight loop in the kernel routine kmap_high() in mm/highmem.c. Please view the attachment lockup_trace.txt.gz. The trace was captured on the 2.6.22.10.PTF441214 kernel recompiled with latency tracing: from the serial console login session, chrt -p 99 $$, then running trace-it from Ingo Molnar's latency trace utilities, several thousand lines of tracing were captured. Any command requiring file system access will lock even that session.

Recently, a repeat test showed an interesting result on the 'top' running on the serial console: it indicated four of the shell scripts all running at 99.9% and the system time at 99.6%. This is on the single-core Xeon Dell 670 described above. For interest's sake, we were unable to create the lockup using the much later kernel 2.6.26.8-rt_bigsmp.
Created attachment 304775 [details] tar file containing test code
Created attachment 304776 [details] gzipped text file of latency trace, captured from serial terminal
Created attachment 304777 [details] dump of 'top' output on serial terminal (includes cursor codes)
Thanks. I will try to reproduce this asap.
(In reply to comment #88) > Created an attachment (id=304776) [details] > gzipped text file of latency trace, captured from serial terminal This looks as though it will be helpful. It would be excellent if the trace could be generated using the most recent version of the rt_trace Kernel from the distribution. Having the symbols to match the trace would make this significantly easier yet.
> > Before starting, ensure that there is a serial console with shell at RT
> priority. tar xzf lockup_bin.tz, cd ./bin, and execute stress.sh .

On the first try, I seem to be missing libequn.so. I haven't been able to locate that.
Sorry about that, I didn't notice that all our test machines had the usual deployment. There are actually a few things missing: (a) some libraries, (b) the user has to have, or be in a group with, the privilege for setting realtime priority, and (c) a writable path /usr/local/etc needs to be created (non-standard by a long, long way, I know). So I will attach a new tar that contains the libraries needed in a vanilla install. I did not make an installation script, so please put them somewhere from where they can be loaded as needed. I just made a ./lib path and put that into /etc/ld.so.conf.

On the testing front, we tried making one kernel change: we removed support for highmem, and despite only seeing 1G of physical memory, the system ran under very heavy load for roughly 27 hours. I am going to try again with highmem set to 4G. There are 2G physical on the machine; the SLERT kernel we have been deploying was originally deployed on a Dell T7400 (8-core Xeon, 16G RAM) and had highmem configured to 64G. Maybe some tuning changes were needed for the Dell 670 (single core, 2G RAM)?
Created attachment 305659 [details] binaries and libraries, slightly simplified stress.sh script
> On the testing front, we tried making one kernel change, removed support for
> highmem and despite only seeing 1G physical memory, the system ran under very
> heavy load for roughly 27 hours. I am going to try again with highmem set to
> 4G. There are 2G physical on the machine, the SLERT kernel we have been
> deploying was originally deployed on a Dell T7400, 8 core Xeon, 16G RAM and had
> the highmem configured to 64G. Maybe there needed to some tuning changes for
> the Dell 670, single core, 2G RAM?

I studied the trace in the attachment - it appears that heavy contention on the highmem allocation subsystem combined with preemption to produce some kind of ping-pong effect, where concurrent memory allocations fail. This is merely speculation until I can actually map the symbols. It would be very helpful if you were able to produce a similar trace from the stock SLERT rt_trace kernel. Meanwhile, I will pull the test package and attempt to reproduce locally.
These messages look like there is something else awry. There is disk space available. Anything else obvious that I am missing? Thanks, Sven

read_pt: fatal messaging error (28: No space left on device) in msg_call (local queue creation)
message text: database
read_pt: signal 6 received, cleaning up and exiting
message
read_pt: signal 6 received, cleaning up and exiting
ia_my_pid: 31594, ia_my_msqid: -1, ia_info_msqid: 0
read_pt: signal 6 received, cleaning up and exiting
./rr.sh: line 13: 31576 Aborted        read_pt -ia_info_msqid $qid -t e _DB_SEC >/dev/null
mtype: 0, msg_id: 0, req_msg_id: 0, sender_msqid: 0
msg_called: 0, want_reply: 0, sender_pid: 0, sender_errno: 0
message
./rr.sh: line 13: 31581 Aborted        read_pt -ia_info_msqid $qid -t e _DB_SEC >/dev/null
message text: database
mtype: 0, msg_id: 0, req_msg_id: 0, sender_msqid: 0
msg_called: 0, want_reply: 0, sender_pid: 0, sender_errno: 0
read_pt: signal 6 received, cleaning up and exiting
Nevertheless, I may have just reproduced the issue.
I think there may be separate issues here. I had a bit of a delay resolving some kdump issues on this end for the 32-bit machine I am reproducing on. Would you try to issue:

echo 0 > /proc/sys/kernel/rtmutex_lateral_steal

before starting the app. I have seen significant improvement running stress.sh after this setting, when not changing priority. I still see lockups when I start stress.sh with chrt -f -p 49, and am focusing on that now.
I have a build that has survived a 9:34-hour overnight run executed with chrt -r -p 49. I am packaging a derivative of this work as an RPM, and will test it a bit more, but I hope to have it available later today.
Please re-test using the SLERT 10 SP3 RT kernel located here:

ftp://ftp.novell.com/dev/customers/ford.com/498966/kernel-rt-2.6.22.19-0.39.1.x86_64.rpm

Installation instructions:
1. Reboot into the SLES kernel.
2. Uninstall the SLERT 10 32-bit (i386) kernel.
3. Install the SLERT 10 SP3 (x86_64) kernel: rpm -i --force --ignorearch
4. Reboot into the new kernel and test.