Bug 498966 - Sporadic hangs with SLERT 10 SP2 on a Dell Precision 670
Summary: Sporadic hangs with SLERT 10 SP2 on a Dell Precision 670
Status: RESOLVED FIXED
Alias: None
Product: SUSE Linux Enterprise Real Time 10 SP2 (SLERT 10 SP2)
Classification: SUSE Linux Enterprise Real Time Extension
Component: kernel
Version: Update4
Hardware: i386 Other
Priority: P1 - Urgent  Severity: Critical
Target Milestone: ---
Assignee: E-mail List
QA Contact: E-mail List
URL:
Whiteboard:
Keywords: DSLA_REQUIRED, DSLA_SOLUTION_PROVIDED
Depends on:
Blocks:
 
Reported: 2009-04-28 19:31 UTC by Mike Latimer
Modified: 2010-07-26 11:05 UTC (History)
8 users

See Also:
Found By: Customer
Services Priority: 100
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
Supportconfig output from problem server (1.55 MB, application/x-bzip-compressed-tar)
2009-04-28 19:32 UTC, Mike Latimer
Details
Output of `ps -eo pid,user,args,ni,rtprio --sort rtprio` (8.64 KB, text/plain)
2009-04-28 19:33 UTC, Mike Latimer
Details
Analysis.txt from core (44.07 KB, text/plain)
2009-04-28 19:37 UTC, Mike Latimer
Details
Backtrace of all running tasks at time of dump (92.33 KB, text/plain)
2009-05-01 22:12 UTC, Alex Tsariounov
Details
Output of ps -t on June 17 core (22.55 KB, application/octet-stream)
2009-06-18 19:18 UTC, Sven Dietrich
Details
ps -t for June 16 core (26.64 KB, text/plain)
2009-06-18 20:47 UTC, Sven Dietrich
Details
tar file containing test code (608.50 KB, application/x-gzip)
2009-07-13 17:37 UTC, Roy Kinsella
Details
gzipped text file of latency trace, captured from serial terminal (251.41 KB, application/x-gzip)
2009-07-13 17:38 UTC, Roy Kinsella
Details
dump of 'top' output on serial terminal (includes cursor codes) (4.19 KB, text/plain)
2009-07-13 17:40 UTC, Roy Kinsella
Details
binaries and libraries, slightly simplified stress.sh script (2.01 MB, application/x-gzip)
2009-07-16 19:04 UTC, Roy Kinsella
Details

Comment 2 Mike Latimer 2009-04-28 19:33:39 UTC
Created attachment 288694 [details]
Output of `ps -eo pid,user,args,ni,rtprio --sort rtprio`
Comment 4 Mike Latimer 2009-04-28 19:37:30 UTC
Created attachment 288695 [details]
Analysis.txt from core
Comment 7 Alex Tsariounov 2009-04-29 00:53:43 UTC
These are interesting excerpts from the bug report:

        On one system the frequency of shutdown increased to an unusable
        situation of 4 hours runtime.  We suspected the RAID setup and
        the Dell PERC5 driver and so changed to the on board SATA drive
        controller.  The system ran without shutdowns for 1 month and we
        declared the problem solved BUT then a shutdown of the same
        symptoms occurred 1 week later.

And:

        This core was grabbed using sysrq keys while the server was
        otherwise unresponsive. (Note - The server WAS responding to
        pings, but all other actions failed.)

So it sounds like a soft lockup, possibly due to the raid card or driver
or both.  Not sure that nmi_watchdog will trigger a dump in this case.

We've run Dell 670's for a while now and don't see any of these
problems, but I do not believe we run that raid card in them.
Comment 18 Sven Dietrich 2009-04-30 20:26:21 UTC
1. serial console / ssh console at prio 90
2. lower tasks to RT 49
3. elevate softirq/high to prio 81
4. tasks at RR policy
5. consider utilizing pthread_cond and pthread_mutex
Comment 21 Alex Tsariounov 2009-05-01 22:12:09 UTC
Created attachment 289532 [details]
Backtrace of all running tasks at time of dump
Comment 25 Sven Dietrich 2009-05-07 05:59:29 UTC
I have added a semaphore warning feature to the SLERT 10 SP2 Kernel-of-the-day, which can be downloaded from here:

http://ftp.suse.com/pub/projects/kernel/kotd/SLERT10_BRANCH/i386/kernel-rt.rpm

This kernel is very close to the latest released kernel; there are a number of features under qualification to be queued for the next update.

The particular feature asserts a warning to the console / syslog, if a RT-priority task locks a semaphore that does not support priority inheritance.

The warning will look something like: 

"RT task <string>:<PID> with priority <PRIO> using non PI futex"

The feature is enabled as follows:

(as root)

echo "1" > /proc/sys/kernel/futex_rt_pi_warning

(confirm)

cat /proc/sys/kernel/futex_rt_pi_warning 
returns: 1

The overhead is a single test and branch, and should not impact performance.

If you do encounter the warning, it may indicate a priority inversion situation.

I will attach an example program demonstrating this in another post.
Comment 30 Sven Dietrich 2009-05-13 00:25:20 UTC
Kernel Update 5: v2.6.22.19-0.21 is available on the update channel.
Please update to the latest released Kernel as soon as possible.
Comment 31 Mike Latimer 2009-05-13 15:34:31 UTC
(In reply to comment #30)
> Kernel Update 5: v2.6.22.19-0.21 is available on the update channel.
> Please update to the latest released Kernel as soon as possible.

Should they be running this kernel, or the release candidate for update 6?
Comment 32 Sven Dietrich 2009-05-13 17:14:08 UTC
They should be running the Update 5 kernel for general testing when reproducing the core problem.
Comment 33 Sven Dietrich 2009-05-13 17:21:55 UTC
(In reply to comment #22)
> Created an attachment (id=289743) [details]
> Semaphore code from Horiba
> 
> This is the code the Sven requested during the call last week.

I have reviewed this code, and compiled it into a simple test application. 

At this point, the only code that I see as possibly problematic (depending on implementation) is the sem_nw_call() function.

Depending on implementation, this function could loop on acquiring a semaphore.

If a FIFO process were to use sem_nw_call at a priority > 49, and for example were waiting for a signal to terminate the semaphore-polling loop, the system could theoretically hang.
Comment 34 Alan Walker 2009-05-13 18:35:40 UTC
We have lowered the RT priority of all processes to 49 and below.  We made this change on the SLERT Precision 670 system as well as on several RHEL systems.  The SLERT system was later destroyed in an experiment, but it ran for 10 days without issue at the lower RT priority.  Prior to the change, lockups were seen every 4 days on average.  While this isn't a certain indication of anything, it does suggest the change may have helped.  We have had two RHEL systems with the RT priority lowered encounter lockups; however, those two incidents were not investigated thoroughly, as the appropriate resources were not available.

As we are currently rebuilding the SLERT Precision 670 system, we will implement the new kernel v2.6.22.19-0.21.
Comment 39 Sven Dietrich 2009-05-15 22:38:46 UTC
At the risk of creating confusion:

The release candidate for update 6 is here:

http://ftp.suse.com/pub/projects/kernel/kotd/SLERT10_SP2_BRANCH/

This has the above-mentioned additions to the code.

The difference between SLERT10_BRANCH, and SLERT10_SP2_BRANCH, is that the latter is an official release branch, while the former is a development branch.

It would therefore be better to direct the customer to SLERT10_SP2_BRANCH, since the code-base difference between the released Update 5 and my subsequent additions is much smaller there, and therefore easier to follow when comparing the released Update 5 source against the pre-release Update 6 tree on SLERT10_SP2_BRANCH.
Comment 40 Alan Walker 2009-06-15 14:32:38 UTC
Sven, that link is no longer available.  Can you please advise which kernel should be installed on the system?  As well, please advise any options that need to be set and what we should be looking for in which log files.
Comment 46 Alan Walker 2009-06-16 19:50:27 UTC
As per our meeting today, this system is not running the third-party CANBUS kernel module, nor does it have the PCI card installed.  The system has locked up 3 times since being installed last week, and on all three occasions the third-party CANBUS-related hardware and software were not installed.
Comment 47 Alan Walker 2009-06-18 13:44:58 UTC
*****
System non-responsive to keyboard and mouse resumed functioning after ps and top were executed on an rt priority serial console connection. 
*****

The test PC was moved to another cell to relieve the interruptions to the customer.  The new cell configuration only required one serial port, which enabled our team to set up the RT-priority serial console for monitoring.  This system, prior to and after this lockup, has been experiencing a new issue with the mouse cursor intermittently freezing when moved across the screen.  The system is currently running and has not been restarted since the incident.

Once the system was up and running, the following environment was configured:
1) once a minute, the output of the date command is written to /tmp/watchdog.log
2) tail -f /tmp/watchdog.log is running in an open terminal
3) a serial console is up and running on a separate PC with bash set to RT priority.
   top displayed: 4633 root     -91   0  4060 1884 1428 S  0.0  0.0   0:00.08 bash

 
On June 18, the system was observed to be non-responsive to keyboard and mouse interaction.
- ping works
- ssh was unable to connect
- tail stopped updating at 21:52 on June 17; the file it was monitoring was later seen to be missing timestamps from 21:52 to 07:40, when the system recovered.
- The serial console is alive.
- The system recovered from the locked state while the top and ps commands were executed on the serial console.
- The date watchdog file continued collecting dates after the lockup was overcome; however, there is a gap from 21:52:06 to 07:40:55.
 
From the serial console, top was executed and showed:
- Process read_pt was taking 96.5% CPU and was in "D" status.
  (D means "uninterruptible sleep".)
- read_pt was issued from editProfile.pl.
- This read_pt is reading our control software's _annunc_al_warn data point.
- "zombie sh <defunct>" PID 8444 was created by PID 8443 snpst_sched
Comment 49 Sven Dietrich 2009-06-18 15:50:44 UTC
I am pulling the cores now.
Comment 50 Alan Walker 2009-06-18 17:27:07 UTC
Please be aware those core files are not related to the incident described in comment 47.
Comment 51 Alex Tsariounov 2009-06-18 17:32:34 UTC
Which exact kernel were these cores generated on?  Thanks.
Comment 52 Alex Tsariounov 2009-06-18 17:53:08 UTC
Ok, I'm looking at the June17 dump now.  It looks like it was generated on our update4 kernel (vmlinux-2.6.22.19-0.19-rt_bigsmp).  At the time of the dump, the load average was [65.50, 65.19, 64.67], so this machine was extremely overloaded.  Since the 15 min load average was also at 64.67, it was overloaded for a long time.  Since this is a single CPU system, it is not surprising that the machine appeared to be hanging.  I will look into the dump some more, but there is probably a runaway user task that is forking too many processes.
Comment 53 Alex Tsariounov 2009-06-18 18:43:39 UTC
Looking at the June 16 dump, the load was [83.14, 81.66, 69.17], so again the system was way overloaded.  I don't see a great many tasks in either core, so I'm not sure this is a runaway situation.  However, the system will not be responsive with these load numbers, especially if some of the tasks are running at RT priority.

The June15 core seems to be for a different kernel.  Which kernel were you running on June 15?  Thanks.
Comment 54 Alan Walker 2009-06-18 18:56:21 UTC
That is the rt_bigsmp kernel from a PTF per bug 441214

https://bugzilla.novell.com/show_bug.cgi?id=441214
Comment 55 Sven Dietrich 2009-06-18 19:05:42 UTC
(In reply to comment #54)
> That is the rt_bigsmp kernel from a PTF per bug 441214
> 
> https://bugzilla.novell.com/show_bug.cgi?id=441214

This particular Kernel has been superseded by Update 4, and should be discarded.
Comment 56 Sven Dietrich 2009-06-18 19:18:19 UTC
Created attachment 299097 [details]
Output of ps -t on June 17 core

PID: 10764  TASK: ef54a780  CPU: 0   COMMAND: "read_pt"
    RUN TIME: 10:58:23
  START TIME: 25641
   USER TIME: 2
 SYSTEM TIME: 37920530

This is by far the largest amount of system time used by any process.

Also, I see heavy activity on IRQ 16.

Can you please provide the output of cat /proc/interrupts?
Comment 57 Sven Dietrich 2009-06-18 19:21:09 UTC
crash> bt 10764
PID: 10764  TASK: ef54a780  CPU: 0   COMMAND: "read_pt"
 #0 [ce51bd58] __schedule at c031820e
 #1 [ce51bdb4] preempt_schedule_irq at c03189b5
 #2 [ce51bdc4] need_resched at c01041c3
 #3 [ce51be00] __set_page_address at c01668d0
 #4 [ce51be1c] kmap_high at c0166f0a
 #5 [ce51beb8] __handle_mm_fault at c016a457
 #6 [ce51bf54] do_page_fault at c031c079
 #7 [ce51bfb4] error_code (via page_fault) at c031a330

Apparently, this task is rescheduling pending allocation of new memory at the time of the crash dump.
Comment 58 Sven Dietrich 2009-06-18 19:34:21 UTC
What is the priority and scheduling policy of  "read_pt"?
Comment 59 Alan Walker 2009-06-18 19:35:37 UTC
It should be noted that since we moved the PC, we removed a Digi PCI serial
card.  As well, please note that we are experiencing mouse and keyboard delays
on this system at all times while our software is running.


# cat /proc/interrupts
           CPU0
  0:        959   IO-APIC-edge      timer
  1:          2   IO-APIC-edge      i8042
  3:          2   IO-APIC-edge
  4:       8694   IO-APIC-edge      serial
  6:          5   IO-APIC-edge      floppy
  7:          0   IO-APIC-edge      parport0
  8:         27   IO-APIC-edge      rtc
  9:          0   IO-APIC-fasteoi   acpi
 12:          4   IO-APIC-edge      i8042
 15:     478580   IO-APIC-edge      ide1
 16:    6315250   IO-APIC-fasteoi   ohci1394, uhci_hcd:usb1, uhci_hcd:usb5, eth1, nvidia
 17:     270869   IO-APIC-fasteoi   libata, aacraid, uhci_hcd:usb4
 18:   35289357   IO-APIC-fasteoi   eth0
 19:          0   IO-APIC-fasteoi   uhci_hcd:usb2
 20:          2   IO-APIC-fasteoi   ehci_hcd:usb3
NMI:          0   Non-maskable interrupts
LOC:   49444015   Local interrupts
RES:          0   Rescheduling interrupts
CAL:          0   function call interrupts
TLB:          0   TLB shootdowns
TRM:          0   Thermal event interrupts
SPU:          0   Spurious interrupts
ERR:          0
Comment 60 Alan Walker 2009-06-18 19:44:39 UTC
ps -eo ppid,pid,user,args,ni,rtprio --sort rtprio|grep -i read_pt
20795 24407 root     grep -i read_pt               0      -
 8369 24404 adacs    read_pt -t e ENGINE_ON        -      1

# ps 8369
  PID TTY      STAT   TIME COMMAND
 8369 ?        SNL    3:15 /usr/bin/perl /opt/adacs/local/mc_uploader.pl -ia_info_msqid 0 -ia_stand e


Can you tell me the status of read_pt in the cores we sent you?  As well it would be helpful if you included the full command line parameters of read_pt.
Comment 61 Sven Dietrich 2009-06-18 19:54:39 UTC
(In reply to comment #59)
> It should be noted that since we moved the PC, we removed a digi pci serial
> card.  As well, please note we are having mouse and keyboard delays on this
> system at all times while our Software is running.  
> 
> 

>  16:    6315250   IO-APIC-fasteoi   ohci1394, uhci_hcd:usb1, uhci_hcd:usb5,
> eth1, nvidia

This nvidia driver, is this also a proprietary nvidia driver?
Comment 62 Sven Dietrich 2009-06-18 19:58:50 UTC
 19:          0   IO-APIC-fasteoi   uhci_hcd:usb2
 20:          2   IO-APIC-fasteoi   ehci_hcd:usb3

These USB ports are unused. It should be possible to connect the mouse and keyboard here (you may need to experiment to find which USB port corresponds to these IRQs).
When you have the right one, keystrokes and mouse moves will slowly increment the IRQ counts. Once you have found it, let's try elevating the priority of the corresponding IRQ-19 and IRQ-20 threads.
That could help address the mouse issues, and it would isolate nvidia and eth1 on IRQ-16.
Are there other USB devices on that IRQ-16 tree?
Is eth1 configured? Is eth1 heavily used?
Comment 63 Sven Dietrich 2009-06-18 20:07:46 UTC
(In reply to comment #60)
> ps -eo ppid,pid,user,args,ni,rtprio --sort rtprio|grep -i read_pt
> 20795 24407 root     grep -i read_pt               0      -
>  8369 24404 adacs    read_pt -t e ENGINE_ON        -      1
> 
> # ps 8369
>   PID TTY      STAT   TIME COMMAND
>  8369 ?        SNL    3:15 /usr/bin/perl /opt/adacs/local/mc_uploader.pl
> -ia_info_msqid 0 -ia_stand e
> 
> 
> Can you tell me the status of read_pt in the cores we sent you?  As well it
> would be helpful if you included the full command line parameters of read_pt.

10764  13947   0  ef54a780  RU   0.0    7096   1156  read_pt

The process is running, and appears to have been scheduled away pending memory allocation. Per the backtrace, it's healthy and preempted on the run queue.
> 
> 
> Can you tell me the status of read_pt in the cores we sent you?  As well it
> would be helpful if you included the full command line parameters of read_pt.

I am unable to extract the full command line from the core at this moment.

Do you get anything from something like:

ps -lf `pidof read_pt`

This would give better insight into the effective priority.
Comment 64 Sven Dietrich 2009-06-18 20:08:48 UTC
(In reply to comment #62)

> 
> These usb ports are unused. It should be possible to connect mouse and keyboard
> here (might need to experiment which USB port corresponds to these IRQs).

BTW - this would be just a workaround until we get to the bottom of all this.
Comment 65 Alan Walker 2009-06-18 20:28:42 UTC
We know that in the case of our lockup this morning (the one we were able to recover from without taking a sysrq core dump), read_pt is spawned from a script called mc_uploader.pl, not our original suspect, editprofile.pl.  mc_uploader on this cell executes read_pt 54 times every 10 minutes.  read_pt is a binary that utilizes our dblib library; in this case it is finding the datatype of the 54 data points.  mc_uploader.pl also reads the data point values every 10 seconds through a perl module known as accessdb.  Both accessdb and read_pt use the same underlying dblib library for interacting with the real-time database.  It would be interesting to know whether, in all of the cores we have submitted, read_pt had been running for an extended period of time.  Each instance of read_pt should take less than 1 second to execute.  Many of our scripts use accessdb and read_pt for interacting with the real-time database, so it is possible that they will be running at different priorities depending on which parent process spawned the inquiry.  In this sense it would be extremely helpful if your team could give us the parent process of read_pt from these cores.  Is it correct that you have found in the June 17 core file that read_pt had been running for 10 hours?

For example, one instance of read_pt, which I know is not related to mc_uploader.pl:

# ps -lf `pidof read_pt`|grep -v bash
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY        TIME CMD
0 R adacs    21056  8650  0  99  19 -   979 -      16:24 ?          0:00 read_pt e SI_LastFrPoint.b

ps 8650
  PID TTY      STAT   TIME COMMAND
 8650 ?        SN     0:15 /usr/bin/perl /opt/adacs/local/readLastFrictionPoint.pl
Comment 66 Alan Walker 2009-06-18 20:36:20 UTC
We just experienced a lockup on the cell running SLERT Update 4.  The on-site staff were able to run top from the serial console and observe that read_pt was consuming 96.4 and 96.1 percent of the CPU.  When the technician hit C to show the full command parameters, the system locked up completely.  He then executed a sysrq core dump, which I will upload, providing the file name in a future update.  Which commands, in order of priority, should we be executing from the serial console when we encounter these lockups?
Comment 67 Sven Dietrich 2009-06-18 20:39:40 UTC
(In reply to comment #65)

> In this sense it would
> be extremely helpful if your team could give us the parent process of read_pt
> from these cores. 

Core June 17:

crash> ps -p 10764
PID: 0      TASK: c03ad3a0  CPU: 0   COMMAND: "swapper"
 PID: 1      TASK: dfce3770  CPU: 0   COMMAND: "init"
  PID: 13770  TASK: c979d9f0  CPU: 0   COMMAND: "database"
   PID: 13772  TASK: ed966740  CPU: 0   COMMAND: "prog_launcher"
    PID: 13947  TASK: ec979320  CPU: 0   COMMAND: "perl"
     PID: 10764  TASK: ef54a780  CPU: 0   COMMAND: "read_pt"
Comment 68 Alan Walker 2009-06-18 20:42:17 UTC
Is there any way to find out what perl script is being run by perl?  You said earlier you couldn't get the full command parameters, is this something that can be overcome?
Comment 69 Sven Dietrich 2009-06-18 20:47:59 UTC
Created attachment 299117 [details]
ps -t for June 16 core

For the June 16 core, it seems to be the complementary write task generating the load spike:
 
crash> ps -t 25725
PID: 25725  TASK: eaa35970  CPU: 0   COMMAND: "write_pt"
    RUN TIME: 00:31:26
  START TIME: 16771
   USER TIME: 6
 SYSTEM TIME: 1761111

crash> ps -p 25725
PID: 0      TASK: c03ad3a0  CPU: 0   COMMAND: "swapper"
 PID: 1      TASK: dfce3770  CPU: 0   COMMAND: "init"
  PID: 7155   TASK: dff1c1f0  CPU: 0   COMMAND: "database"
   PID: 7157   TASK: edde8700  CPU: 0   COMMAND: "prog_launcher"
    PID: 7314   TASK: c96c51e0  CPU: 0   COMMAND: "perl"
     PID: 25725  TASK: eaa35970  CPU: 0   COMMAND: "write_pt"
Comment 70 Sven Dietrich 2009-06-18 20:58:01 UTC
(In reply to comment #68)
> Is there any way to find out what perl script is being run by perl?  You said
> earlier you couldn't get the full command parameters, is this something that
> can be overcome?

crash> ps -a 7314
PID: 7314   TASK: c96c51e0  CPU: 0   COMMAND: "perl"
ps: cannot access user stack address: bfb3421f

Stumped on this one for the moment.
Comment 71 Sven Dietrich 2009-06-18 20:59:08 UTC
(In reply to comment #66)
> We just experienced a lockup on the cell running SLERT Update 4.  The on site
> staff were able to run top from the serial console and observe that read_pt was
> consuming 96.4 and 96.1 percent of the cpu.  When the technician hit C to show
> the full command parameters the system locked up completed.  He then executed a
> sysrq core dump which I will upload and provide the file name in a future
> update.  Which commands in order of priority should we be executing from the
> serial console when we encounter these lockups?

You could try to attach gdb to the process to see what its doing.
Comment 72 Sven Dietrich 2009-06-18 21:04:16 UTC
(In reply to comment #65)
> mc_uploader on this cell executes read_pt 54 times every 10 minutes.  read_pt
> is a binary that utilizes our dblib library.  In this case it is finding the
> datatype of the 54 data points.  mc_uploader.pl also reads the data point
> values through a perl module known as accessdb every 10 seconds.  Both accessdb
> and read_pt use the same underlying dblib library for interacting with the real
> time database. 

Wild guess...

Is there a loop that retries repeatedly when a database record is locked?

Something like that could create just enough of a DoS situation to block the task trying to release the lock on the record.
Comment 73 Sven Dietrich 2009-06-18 21:07:45 UTC
(In reply to comment #71)
> (In reply to comment #66)

> 
> You could try to attach gdb to the process to see what its doing.

If you can attach strace to the read_pt process and append the output of strace to a logfile, the I/O might rate-limit things enough to keep control of the system. It should be easy enough to trap the situation, given that the disk containing said logfile should become very busy.
Comment 74 Alex Tsariounov 2009-06-18 21:11:12 UTC
(In reply to comment #70)
> (In reply to comment #68)
> > Is there any way to find out what perl script is being run by perl?  You said
> > earlier you couldn't get the full command parameters, is this something that
> > can be overcome?
> 
> crash> ps -a 7314
> PID: 7314   TASK: c96c51e0  CPU: 0   COMMAND: "perl"
> ps: cannot access user stack address: bfb3421f
> 
> Stumped on this one more the moment.

crash> task 13947 | grep comm
  comm = "perl\000loader.pl\000", 

So it looks like the perl script run by process 13947 was loader.pl
Comment 75 Alex Tsariounov 2009-06-18 21:17:42 UTC
For the June 17 dump, the following is all existing perl processes with command lines and run times:

crash> ps | grep perl
  13894  13772   0  c979c2b0  IN   0.1    5340   2760  perl
  13929  13772   0  c979c880  IN   0.0    4644   1744  perl
  13947  13772   0  ec979320  IN   1.2   48584  48584  perl
  14270      1   0  edb46d90  RU   1.1   44412  44412  perl
  14271      1   0  ed3f2c50  IN   0.0    4644   1800  perl
  14273      1   0  ec9798f0  IN   0.0    4644   1728  perl

crash> task 13894 | grep comm
  comm = "perl\000rSave.pl\000\000", 
crash> task 13929 | grep comm
  comm = "perl\000t.pl\000her\000\000", 
crash> task 13947 | grep comm
  comm = "perl\000loader.pl\000", 
crash> task 14270 | grep comm
  comm = "perl\000rtSparc_ms", 
crash> task 14271 | grep comm
  comm = "perl\000romSparc.p", 
crash> task 14273 | grep comm
  comm = "perl\000astFrictio", 

crash> ps -t 13894 13929 13947 14270 14271 14273
PID: 13894  TASK: c979c2b0  CPU: 0   COMMAND: "perl"
    RUN TIME: 18:01:37
  START TIME: 247
   USER TIME: 26
 SYSTEM TIME: 6

PID: 13929  TASK: c979c880  CPU: 0   COMMAND: "perl"
    RUN TIME: 18:01:36
  START TIME: 248
   USER TIME: 1618
 SYSTEM TIME: 6172

PID: 13947  TASK: ec979320  CPU: 0   COMMAND: "perl"
    RUN TIME: 18:01:36
  START TIME: 248
   USER TIME: 36333
 SYSTEM TIME: 52910

PID: 14270  TASK: edb46d90  CPU: 0   COMMAND: "perl"
    RUN TIME: 18:01:15
  START TIME: 269
   USER TIME: 202
 SYSTEM TIME: 418

PID: 14271  TASK: ed3f2c50  CPU: 0   COMMAND: "perl"
    RUN TIME: 18:01:15
  START TIME: 269
   USER TIME: 4
 SYSTEM TIME: 9

PID: 14273  TASK: ec9798f0  CPU: 0   COMMAND: "perl"
    RUN TIME: 18:01:15
  START TIME: 269
   USER TIME: 5
 SYSTEM TIME: 4
Comment 76 Sven Dietrich 2009-06-18 21:22:05 UTC
(In reply to comment #75)
> For the June 17 dump, the following is all existing perl processes with command
> lines and run times:

And this particular set would be the one in the call tree for the runaway read_pt

> 
> crash> ps | grep perl
>   13947  13772   0  ec979320  IN   1.2   48584  48584  perl
> 
> crash> task 13947 | grep comm
>   comm = "perl\000loader.pl\000", 
> 
> 
> PID: 13947  TASK: ec979320  CPU: 0   COMMAND: "perl"
>     RUN TIME: 18:01:36
>   START TIME: 248
>    USER TIME: 36333
>  SYSTEM TIME: 52910
Comment 77 Sven Dietrich 2009-06-18 21:23:32 UTC
(In reply to comment #76)
> (In reply to comment #75)
> > For the June 17 dump, the following is all existing perl processes with command
> > lines and run times:
> 
> > 
> > PID: 13947  TASK: ec979320  CPU: 0   COMMAND: "perl"
> >     RUN TIME: 18:01:36
> >   START TIME: 248
> >    USER TIME: 36333
> >  SYSTEM TIME: 52910

Which notably also has the highest system time of all the perl processes...
Comment 78 Sven Dietrich 2009-06-18 21:27:29 UTC
BTW:

# crash /boot/vmlinux-2.6.22.19-0.19-rt_bigsmp  vmcore

Prior to that, gzip -d  /boot/vmlinux-2.6.22.19-0.19-rt_bigsmp

and install the debuginfo, which I am uploading now...
Comment 79 Sven Dietrich 2009-06-18 21:37:26 UTC
(In reply to comment #78)

> and install the debuginfo, which I am uploading now...

You can grab the debug info for the Update 4 Kernel here, if you want to be able to look at the core images on-site; this may accelerate the process by eliminating some uploads.

ftp://ftp.novell.com/dev/lsg/Horiba/kernel-rt_bigsmp-debuginfo-2.6.22.19-0.19.i586.rpm
Comment 80 Alan Walker 2009-06-19 13:04:33 UTC
To correct comment 66, it was write_pt that was taking over the CPU, not read_pt:

  PID  USER   PR   NI   VIRT   RES   SHR  S  %CPU  %MEM     TIME+  COMMAND
21696  user   -2  -20  40540  2228  1328  R  96.1   0.1   3:02.32  write_pt
 8159  user  021   19  59976   58M   41M  S   2.0   1.5  40:38.36  database

Both read_pt and write_pt use the same library and function in much the same way.

Unfortunately, we were unable to take a sysrq + c core dump, as the progression of the lockup caused by pressing 'c' in top on the RT serial console put the system in such a state that the sysrq keys would not respond.

Can the Novell team advise of a script or sequence of commands we should run from the rt serial console to collect the most meaningful information?
Comment 81 Alan Walker 2009-06-22 18:12:24 UTC
Based on the current data, I wrote a series of tests to stress the system with read_pt.  Under normal operating conditions read_pt takes less than 1 second to execute and consumes less than 1% of the CPU.  

The tests:

perl script that loops continuously executing:
read_pt -t e POINTNAME
->Approximately 5 minutes of run time to lock the system

perl script that loops 60 times, pauses 10 seconds using bash sleep, then repeats.
->This test took approximately an hour to lock the system

We have noticed that read_pt and write_pt execute at different RTPRIO levels depending on the executing script.

Any advice from Novell on how to investigate this issue further would be appreciated.
Comment 82 Sven Dietrich 2009-06-23 15:30:53 UTC
This sounds like a good approach. I will give it some more thought after wrapping up the ESD issue.
Comment 83 Sven Dietrich 2009-06-23 19:46:14 UTC
(In reply to comment #81)
> Based on the current data, I wrote a series of tests to stress the system with
> read_pt.  Under normal operating conditions read_pt takes less than 1 second to
> execute and consumes less than 1% of the CPU.  
> 
> The tests:
> 
> perl script that loops continuously executing:
> read_pt -t e POINTNAME
> ->Approximately 5 minutes of run time to lock the system
> 

I would focus on this test, which has the shorter time to failure.
- Is it possible to break the test down further, e.g. eliminating perl and calling the lower-level routines directly?
- I highly recommend using strace at all process-dispatch points, to try to obtain a logfile of the system call entries and exits that are effectively using so much CPU.
- Try attaching GDB to the lowest-level process and trapping, via a breakpoint, whatever loop is taking place.

Ultimately, this can IMO only be debugged by eliminating as much of the periphery around the core implementation as possible, and identifying the core failure in a divide-and-conquer fashion.

Without more detailed insight into the code, however, it's only guesswork on this end.
Comment 84 Alan Walker 2009-06-24 16:18:15 UTC
We have been able to recover from the lockup approximately 5 times in the last week by interacting with the RT serial connection.  We had 3 other lockups where we had not set the bash session to RTPRIO 90, and we were not able to recover; the serial connection was non-responsive.  We are using the ps command to investigate the state of the system during a lockup through the RT serial connection.  As we execute ps, we find that read_pt's CPU usage drops to 0 and it completes execution.  This drop happens so quickly that we are only able to observe it if we run two instances of ps in the same command line.  The system seems to remain in the locked state for many hours until the ps command is executed on the RT serial connection.
Comment 85 Alan Walker 2009-06-29 18:13:06 UTC
We are currently streamlining a version of the database and the read/write_pt applications to run in a simple isolated way.  We are able to reproduce the issue running just read_pt and the database processes with no extra configuration or processes.  We are working to prepare a package of source code and binaries to enable the Novell team to reproduce the issue.
Comment 86 Roy Kinsella 2009-07-13 17:34:59 UTC
We have now been successful at recreating the lockup situation with a minimum subset of the ADACS application.  I will attach a tar file containing some binaries and script files for reproduction.

Before starting, ensure that there is a serial console with a shell at RT priority.  tar xzf lockup_bin.tz, cd ./bin, and execute stress.sh.  Some priorities are higher than normal, so X will likely be unresponsive.  The lockup will occur between 1 minute and 6 hours afterwards.  Lockup is evidenced by %sys time at 99+ and usually one process at 99+ %CPU.  The system sometimes responds to ping for a while, but will not respond to ssh or other network access.

The lockup seems to be caused by a tight loop in the kmap_high() routine in kernel file mm/highmem.c. Please see the attachment lockup_trace.txt.gz .
The 2.6.22.10.PTF441214 kernel was recompiled with latency tracing enabled.  From the serial console login session (raised to RT priority with chrt -p 99 $$), running trace-it from Ingo Molnar's latency-tracing utilities captured several thousand lines of tracing.  Any command requiring file system access will lock even that session.

A recent repeat test showed an interesting result in the 'top' running on the serial console: it indicated four of the shell scripts all running at 99.9% CPU, with system time at 99.6%.  This is on the single-core Xeon Dell 670 described above.

For interest's sake, we were unable to reproduce the lockup using the much later 2.6.26.8-rt_bigsmp kernel.
Comment 87 Roy Kinsella 2009-07-13 17:37:29 UTC
Created attachment 304775 [details]
tar file containing test code
Comment 88 Roy Kinsella 2009-07-13 17:38:47 UTC
Created attachment 304776 [details]
gzipped text file of latency trace, captured from serial terminal
Comment 89 Roy Kinsella 2009-07-13 17:40:10 UTC
Created attachment 304777 [details]
dump of 'top' output on serial terminal (includes cursor codes)
Comment 90 Sven Dietrich 2009-07-15 00:09:27 UTC
Thanks. I will try to reproduce this asap.
Comment 91 Sven Dietrich 2009-07-16 00:19:14 UTC
(In reply to comment #88)
> Created an attachment (id=304776) [details]
> gzipped text file of latency trace, captured from serial terminal

This looks as though it will be helpful. 

It would be excellent if the trace could be generated using the most recent version of the rt_trace Kernel from the distribution.

Having the symbols to match the trace would make this significantly easier yet.
Comment 92 Sven Dietrich 2009-07-16 00:23:26 UTC
> 
> Before starting, ensure that there is a serial console with shell at RT
> priority.  tar xzf lockup_bin.tz, cd ./bin,  and execute stress.sh .  

First try, I seem to be missing libequn.so. I haven't been able to locate that.
Comment 93 Roy Kinsella 2009-07-16 18:58:19 UTC
Sorry about that; I didn't notice because all our test machines already had the usual deployment.

There are actually a few things missing: (a) some libraries, (b) the user has to have, or be in a group with, privilege to set realtime priority, and (c) a writable path /usr/local/etc needs to be created (very non-standard, I know).

So, I will attach a new tar that contains the libraries needed on a vanilla install.  I did not make an installation script, so please put them somewhere from which they can be loaded as needed.  I just made a ./lib path and added it to /etc/ld.so.conf .
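A quick way to verify the library setup before running the test binaries (a generic sketch; /bin/sh stands in for read_pt/write_pt, and the commented ld.so.conf steps mirror what is described above):

```shell
# Lines reading "not found" in ldd output identify missing shared
# libraries (this is how a missing libequn.so would show up).
# /bin/sh is a stand-in here for the read_pt / write_pt binaries.
ldd /bin/sh

# Registering a custom library directory (requires root), as above:
#   mkdir -p ./lib && cp lib*.so ./lib/
#   echo "$PWD/lib" >> /etc/ld.so.conf
#   ldconfig
```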

On the testing front, we tried making one kernel change: we removed support for highmem, and despite only seeing 1G of physical memory, the system ran under very heavy load for roughly 27 hours.  I am going to try again with highmem set to 4G.  There are 2G physical on the machine; the SLERT kernel we have been deploying was originally deployed on a Dell T7400 (8-core Xeon, 16G RAM) and had highmem configured to 64G.  Maybe some tuning changes were needed for the Dell 670 (single core, 2G RAM)?
Comment 94 Roy Kinsella 2009-07-16 19:04:06 UTC
Created attachment 305659 [details]
binaries and libraries, slightly simplified stress.sh script
Comment 95 Sven Dietrich 2009-07-16 23:54:48 UTC
 
> On the testing front, we tried making one kernel change, removed support for
> highmem and despite only seeing 1G physical memory, the system ran under very
> heavy load for roughly 27 hours.  I am going to try again with highmem set to
> 4G.  There are 2G physical on the machine, the SLERT kernel we have been
> deploying was originally deployed on a Dell T7400, 8 core Xeon, 16G RAM and had
> the highmem configured to 64G.  Maybe there needed to some tuning changes for
> the Dell 670, single core, 2G RAM?

I studied the trace in the attachment; it appears that heavy contention on the highmem allocation subsystem combines with preemption to produce some kind of ping-pong effect, where concurrent memory allocations fail.

This is merely speculation, until I can actually map the symbols.

It would be very helpful if you were able to produce a similar trace from the stock SLERT rt_trace Kernel.

Meanwhile, I will pull and attempt to reproduce locally.
Comment 96 Sven Dietrich 2009-07-17 00:31:35 UTC
These messages look like there is something else awry.

There is disk space available. Anything else obvious that I am missing?

Thanks

Sven

read_pt: fatal messaging error (28: No space left on device) in msg_call (local queue creation)
message text:
database
read_pt: signal 6 received, cleaning up and exiting
message
read_pt: signal 6 received, cleaning up and exiting
ia_my_pid: 31594, ia_my_msqid: -1, ia_info_msqid: 0
read_pt: signal 6 received, cleaning up and exiting
./rr.sh: line 13: 31576 Aborted                 read_pt -ia_info_msqid $qid -t e _DB_SEC >/dev/null
mtype: 0, msg_id: 0, req_msg_id: 0, sender_msqid: 0
msg_called: 0, want_reply: 0, sender_pid: 0, sender_errno: 0
message
./rr.sh: line 13: 31581 Aborted                 read_pt -ia_info_msqid $qid -t e _DB_SEC >/dev/null
message text:
database
mtype: 0, msg_id: 0, req_msg_id: 0, sender_msqid: 0
msg_called: 0, want_reply: 0, sender_pid: 0, sender_errno: 0
read_pt: signal 6 received, cleaning up and exiting
Comment 97 Sven Dietrich 2009-07-17 00:32:39 UTC
Nevertheless, I may have just reproduced the issue.
Comment 99 Sven Dietrich 2009-07-26 17:40:06 UTC
I think there may be separate issues here. 

I had a bit of a delay resolving some kdump issues on this end for the 32-bit machine I am reproducing on.

Would you try to issue:

echo 0 > /proc/sys/kernel/rtmutex_lateral_steal

before starting the app.

I have seen significant improvement running stress.sh with this setting, when not changing priority.
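The suggested setting can be applied with a small guard so it degrades gracefully on kernels without the knob (a sketch; the sysctl path is taken from the comment above, and it only exists on PREEMPT_RT kernels such as SLERT):

```shell
#!/bin/sh
# Disable lateral stealing in the rtmutex code before starting the app.
# The knob only exists on PREEMPT_RT kernels (e.g. SLERT); elsewhere we
# report that it is absent instead of failing.
f=/proc/sys/kernel/rtmutex_lateral_steal
if [ -w "$f" ]; then
    echo 0 > "$f"
    echo "rtmutex_lateral_steal is now $(cat "$f")"
else
    echo "rtmutex_lateral_steal not available on this kernel"
fi
```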

I still see lockups when I start stress.sh with chrt -f -p 49, and am focusing on that now.
Comment 100 Sven Dietrich 2009-07-27 16:38:26 UTC
I have a build that has survived a 9:34 hr overnight run executed with chrt -r -p 49.

I am packaging a derivative of this work as an RPM, and will test it a bit more, but hope to have it available later today.
Comment 112 Sven Dietrich 2010-03-09 20:13:46 UTC
Please re-test using the SLERT 10 SP3 RT Kernel located here:

ftp://ftp.novell.com/dev/customers/ford.com/498966/kernel-rt-2.6.22.19-0.39.1.x86_64.rpm


Installation instructions:

1. reboot into SLES Kernel.
2. uninstall the SLERT 10 32 bit (i386) Kernel.
3. install the SLERT 10 SP3 (x86_64) Kernel:
        rpm -i --force --ignorearch 
4. reboot into the new Kernel and test.