Bug 241787 - Kernel hangs when using a cifs on a multiprocessor opteron server using opensuse 10.2 32bit
Status: RESOLVED INVALID
Alias: None
Product: openSUSE 10.2
Classification: openSUSE
Component: Kernel
Version: Final
Hardware: x86-64 SUSE Other
Importance: P5 - None : Major with 16 votes
Target Milestone: ---
Assignee: Lars Müller
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-02-02 20:12 UTC by Rosario Lombardo
Modified: 2008-06-18 20:08 UTC
CC List: 6 users

See Also:
Found By: Other
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
Support file created by Yast (523.35 KB, text/plain)
2007-02-02 20:30 UTC, Rosario Lombardo
Script used on the server (1.36 KB, application/x-shellscript)
2007-03-01 20:10 UTC, Rosario Lombardo
what was left in /var/log/messages after pam_mount tried to mount cifs volumes (4.31 KB, text/plain)
2007-04-12 20:05 UTC, Ivo Blöchliger

Description Rosario Lombardo 2007-02-02 20:12:20 UTC
When working on a CIFS filesystem on a 4-processor Opteron machine with 8 GB RAM running openSUSE 10.2 32-bit (bigsmp kernel), the kernel hangs with the following oops:

Jan 31 22:03:50 fileserver sshd[5024]: Accepted keyboard-interactive/pam for rosario from 10.1.10.40 port 57808 ssh2
Jan 31 22:03:59 fileserver su: (to root) rosario on /dev/pts/0
Jan 31 22:04:16 fileserver syslog-ng[3183]: STATS: dropped 0
Jan 31 22:09:27 fileserver su: (to postgres) rosario on /dev/pts/0
Jan 31 22:11:56 fileserver kernel: BUG: unable to handle kernel paging request at virtual address 80b8e484
Jan 31 22:11:56 fileserver kernel:  printing eip:
Jan 31 22:11:56 fileserver kernel: c0161308
Jan 31 22:11:56 fileserver kernel: *pde = 00000000
Jan 31 22:11:56 fileserver kernel: Oops: 0002 [#1]
Jan 31 22:11:56 fileserver kernel: SMP 
Jan 31 22:11:56 fileserver kernel: last sysfs file: /firmware/edd/int13_dev81/extensions
Jan 31 22:11:56 fileserver kernel: Modules linked in: nls_utf8 cifs button battery ac apparmor aamatch_pcre loop dm_mod tg3 r8169 i2c_amd8111 i2c_amd756 ide_cd ohci_hcd cdrom usbcore i2c_core amd_rng parport_pc lp parport ext3 mbcache jbd edd fan sg aacraid amd74xx thermal processor sd_mod scsi_mod ide_disk ide_core
Jan 31 22:11:56 fileserver kernel: CPU:    3
Jan 31 22:11:56 fileserver kernel: EIP:    0060:[<c0161308>]    Tainted: G     U VLI
Jan 31 22:11:56 fileserver kernel: EFLAGS: 00010082   (2.6.18.2-34-bigsmp #1) 
Jan 31 22:11:56 fileserver kernel: EIP is at free_block+0x5c/0xed
Jan 31 22:11:56 fileserver kernel: eax: f25fa9e0   ebx: dfffdec0   ecx: f2a50740   edx: 80b8e480
Jan 31 22:11:56 fileserver kernel: esi: f2703000   edi: dfff9a80   ebp: dfc8abe0   esp: dff09ef4
Jan 31 22:11:56 fileserver kernel: ds: 007b   es: 007b   ss: 0068
Jan 31 22:11:56 fileserver kernel: Process events/3 (pid: 13, ti=dff08000 task=dff076f0 task.ti=dff08000)
Jan 31 22:11:56 fileserver kernel: Stack: dffcd614 00000005 00000003 dfc8abd4 00000005 dfc8abc0 dfffdec0 c0161411 
Jan 31 22:11:56 fileserver kernel:        00000000 dfff9a80 dfffdec0 dfff9a80 dfc8a7c0 00000286 c016286c 00000000 
Jan 31 22:11:56 fileserver kernel:        00000000 c602bc00 c602bc04 c012f679 ffffffff ffffffff ffffffff c0162819 
Jan 31 22:11:56 fileserver kernel: Call Trace:
Jan 31 22:11:56 fileserver kernel:  [<c0161411>] drain_array+0x78/0x97
Jan 31 22:11:56 fileserver kernel:  [<c016286c>] cache_reap+0x53/0x117
Jan 31 22:11:56 fileserver kernel:  [<c012f679>] run_workqueue+0x83/0xc5
Jan 31 22:11:56 fileserver kernel:  [<c0162819>] cache_reap+0x0/0x117
Jan 31 22:11:56 fileserver kernel:  [<c012ff94>] worker_thread+0xd9/0x10d
Jan 31 22:11:56 fileserver kernel:  [<c011b15f>] default_wake_function+0x0/0xc
Jan 31 22:11:56 fileserver kernel:  [<c01324d4>] kthread+0xec/0x11c
Jan 31 22:11:56 fileserver kernel:  [<c012febb>] worker_thread+0x0/0x10d
Jan 31 22:11:56 fileserver kernel:  [<c01323e8>] kthread+0x0/0x11c
Jan 31 22:11:56 fileserver kernel:  [<c0102005>] kernel_thread_helper+0x5/0xb
Jan 31 22:11:56 fileserver kernel: Code: 8b 02 f6 c4 40 74 03 8b 52 0c 8b 02 84 c0 78 08 0f 0b 60 02 86 81 2c c0 8b 4a 1c 8b 44 24 20 8b 11 8b 9c 87 10 02 00 00 8b 41 04 <89> 42 04 89 10 31 d2 2b 71 0c c7 01 00 01 10 00 c7 41 04 00 02 
Jan 31 22:11:56 fileserver kernel: EIP: [<c0161308>] free_block+0x5c/0xed SS:ESP 0068:dff09ef4
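
As a reading aid for the oops above, here is a minimal Python sketch that decodes the page-fault error code and checks the faulting address against the register dump. It assumes the standard x86 meaning of the low three error-code bits; the helper name `decode_page_fault` is hypothetical, and all values are copied from the log lines above.

```python
# Values copied from the oops in the description.
OOPS_ERROR_CODE = 0x0002      # from "Oops: 0002 [#1]"
FAULT_ADDRESS = 0x80B8E484    # from "unable to handle kernel paging request"
EDX = 0x80B8E480              # from the register dump


def decode_page_fault(code):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "cause": "protection violation" if code & 0x1 else "page not present",
        "access": "write" if code & 0x2 else "read",
        "mode": "user" if code & 0x4 else "kernel",
    }


info = decode_page_fault(OOPS_ERROR_CODE)
print(info)

# The marked bytes <89> 42 04 in the "Code:" line encode
# "mov %eax, 0x4(%edx)", i.e. a store to edx + 4.
print(hex(EDX + 4), EDX + 4 == FAULT_ADDRESS)
```

Read this way, the fault is a kernel-mode write through the pointer held in edx. That would be consistent with a corrupted list pointer inside the slab allocator, although the trace alone does not show what corrupted it.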
Comment 1 Rosario Lombardo 2007-02-02 20:30:15 UTC
Created attachment 117182 [details]
Support file created by Yast

Support file created by Yast on the machine affected by the problem
Comment 3 Rosario Lombardo 2007-02-03 16:59:09 UTC
The processors are 64-bit AMD Opterons, while the openSUSE 10.2 installation is the 32-bit version. In the next few days I'll test whether openSUSE 10.2 64-bit has the same problem and report back.
Comment 4 Rosario Lombardo 2007-02-10 07:29:56 UTC
The same problem with openSUSE 10.2 64bit
Comment 5 Bodo Wippermann 2007-02-14 19:34:58 UTC
This seems to be the same on a 32-bit Intel single-processor machine with openSUSE 10.2
(but here only when running a VMware machine while accessing the cifs share).

Feb 12 13:30:48 precision kernel: ------------[ cut here ]------------
Feb 12 13:30:48 precision kernel: kernel BUG at mm/slab.c:608!
Feb 12 13:30:48 precision kernel: invalid opcode: 0000 [#1]
Feb 12 13:30:48 precision kernel: SMP
Feb 12 13:30:48 precision kernel: last sysfs file: /devices/system/cpu/cpu0/cpufreq/scaling_governor
Feb 12 13:30:48 precision syslog-ng[2558]: Changing permissions on special file /dev/xconsole
Feb 12 13:30:48 precision syslog-ng[2558]: Changing permissions on special file /dev/tty10
Feb 12 13:30:48 precision kernel: Modules linked in: nls_utf8 cifs vmnet parport_pc parport vmmon snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device ipv6 af_packet cpufreq_conservative cpufreq_ondemand cpufreq_userspace cpufreq_powersave speedstep_centrino freq_table button battery ac apparmor aamatch_pcre loop dm_mod pcmcia usbhid yenta_socket nvidia rsrc_nonstatic tg3 pcmcia_core ipw2200 i2c_core uhci_hcd ehci_hcd ieee80211 ieee80211_crypt firmware_class snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm snd_timer snd soundcore snd_page_alloc usbcore intel_agp irda agpgart crc_ccitt reiserfs sr_mod cdrom edd fan sg ata_piix ahci libata thermal processor sd_mod scsi_mod
Feb 12 13:30:48 precision kernel: CPU:    0
Feb 12 13:30:48 precision kernel: EIP:    0060:[<c0161315>]    Tainted: P     U VLI
Feb 12 13:30:48 precision kernel: EFLAGS: 00010006   (2.6.18.2-34-default #1)
Feb 12 13:30:48 precision kernel: EIP is at free_block+0x41/0xed
Feb 12 13:30:48 precision kernel: eax: c001006c   ebx: f242bc94   ecx: 00000002   edx: c1800000
Feb 12 13:30:48 precision kernel: esi: 00000000   edi: f6ca9380   ebp: f242bc94   esp: c20f7f1c
Feb 12 13:30:48 precision kernel: ds: 007b   es: 007b   ss: 0068
Feb 12 13:30:48 precision kernel: Process events/0 (pid: 4, ti=c20f6000 task=dfc805a0 task.ti=c20f6000)
Feb 12 13:30:48 precision kernel: Stack: eda99394 00000002 00000000 f242bc94 00000002 f242bc80 f2927980 c0161439
Feb 12 13:30:48 precision kernel:        00000000 f6ca9380 f2927980 f6ca9380 dfca85c0 00000296 c0162861 00000000
Feb 12 13:30:48 precision kernel:        00000000 c200d980 c200d984 c012e639 00000000 c200c8a4 dfca85c0 c016280e
Feb 12 13:30:48 precision kernel: Call Trace:
Feb 12 13:30:48 precision kernel:  [<c0161439>] drain_array+0x78/0x97
Feb 12 13:30:48 precision kernel:  [<c0162861>] cache_reap+0x53/0x117
Feb 12 13:30:48 precision kernel:  [<c012e639>] run_workqueue+0x83/0xc5
Feb 12 13:30:48 precision kernel:  [<c016280e>] cache_reap+0x0/0x117
Feb 12 13:30:48 precision kernel:  [<c012ef54>] worker_thread+0xd9/0x10d
Feb 12 13:30:48 precision kernel:  [<c011a7e2>] default_wake_function+0x0/0xc
Feb 12 13:30:48 precision kernel:  [<c012ee7b>] worker_thread+0x0/0x10d
Feb 12 13:30:48 precision kernel:  [<c0131420>] kthread+0xc0/0xec
Feb 12 13:30:48 precision kernel:  [<c0131360>] kthread+0x0/0xec
Feb 12 13:30:48 precision kernel:  [<c0102005>] kernel_thread_helper+0x5/0xb
Feb 12 13:30:48 precision kernel: Code: 00 e9 bb 00 00 00 8b 75 00 8d 96 00 00 00 40 c1 ea 0c c1 e2 05 03 15 10 36 41 c0 8b 02 f6 c4 40 74 03 8b 52 0c 8b 02 84 c0 78 08 <0f> 0b 60 02 c4 75 2c c0 8b 4a 1c 8b 44 24 20 8b 11 8b 9c 87 90
Feb 12 13:30:48 precision kernel: EIP: [<c0161315>] free_block+0x41/0xed SS:ESP 0068:c20f7f1c
Feb 12 13:30:48 precision kernel: klogd 1.4.1, ---------- state change ----------
Comment 6 Lars Müller 2007-03-01 10:26:00 UTC
Rosario: Are you performing a particular action?  File copy (cp, rsync) or removal?

Steve: 10.2 uses cifs 1.45.  Have you seen such a defect before?
Comment 7 Shirish Pargaonkar 2007-03-01 18:08:38 UTC
I think Steve fixed this.  Please refer to kernel bugzilla bug 7093.
The same fix will apply here IMHO.
Comment 8 Rosario Lombardo 2007-03-01 20:07:57 UTC
Lars: No, the operations on the CIFS filesystem are performed by a script run by cron. But I've noticed that the system freezes more frequently if I SSH to the server and do some work there. I'm attaching the script used. Thanks in advance!
Comment 9 Rosario Lombardo 2007-03-01 20:10:03 UTC
Created attachment 121887 [details]
Script used on the server
Comment 10 Rosario Lombardo 2007-03-13 21:48:48 UTC
Hi! Any news?
Comment 11 Rosario Lombardo 2007-03-14 22:22:13 UTC
The Samba Team told me:

## https://bugzilla.samba.org/show_bug.cgi?id=4403
##
## ------- Comment #2 from sfrench@us.ibm.com  2007-03-14 17:18 MST -------
## Have the fixes for the hang fixed in cifs 1.48 (in either current mainline or
## the cifs-backport-for-old-kernels) been tried on this?
Comment 12 Ivo Blöchliger 2007-04-12 20:05:22 UTC
Created attachment 130826 [details]
what was left in /var/log/messages after pam_mount tried to mount cifs volumes

I had the same problem with cifs. In most cases it crashed immediately while trying to mount a cifs volume, regardless of whether this was done by pam_mount or on the command line with mount -t cifs //server/share....
Comment 13 jolz j 2007-05-03 06:31:27 UTC
This seems to be the same problem here. But the kernel also crashes when the user is not actually using CIFS and only has active mounts (the mounts via pam_mount do not seem to cause trouble).

In this case, the user did nothing related to CIFS (i.e. he was not working on a CIFS mount). The user just logged in (pam_mount mounted 5 CIFS shares without problems) and wanted to enter his kwallet password.

The "Call Trace" is quite similar, so I assume it's the same problem.

[snip]
Apr 26 13:16:46 swkpc zmd: Daemon (WARN): Not starting remote web server
Apr 26 13:22:18 swkpc kernel: BUG: unable to handle kernel paging request at virtual address 8094e484
Apr 26 13:22:18 swkpc kernel:  printing eip:
Apr 26 13:22:18 swkpc kernel: c0161330
Apr 26 13:22:18 swkpc kernel: *pde = 00000000
Apr 26 13:22:18 swkpc kernel: Oops: 0002 [#1]
Apr 26 13:22:18 swkpc kernel: SMP
Apr 26 13:22:18 swkpc kernel: last sysfs file: /devices/system/cpu/cpu0/cpufreq/ondemand/ignore_nice_load
Apr 26 13:22:18 swkpc kernel: Modules linked in: nls_utf8 cifs autofs4 nfsd ipv6 exportfs snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device nfs lockd nfs_acl sunrpc af_packet cpufreq_conservative cpufreq_ondemand cpufreq_userspace cpufreq_powersave powernow_k8 freq_table button battery ac apparmor aamatch_pcre loop dm_mod ide_cd cdrom pata_atiixp snd_atiixp snd_ac97_codec snd_ac97_bus atiixp tg3 snd_pcm snd_timer snd generic soundcore snd_page_alloc ati_agp ide_core agpgart ohci_hcd i2c_piix4 ehci_hcd i2c_core usbcore parport_pc lp parport reiserfs edd fan ata_piix sg sata_sil libata thermal processor sd_mod scsi_mod
Apr 26 13:22:18 swkpc kernel: CPU:    0
Apr 26 13:22:18 swkpc kernel: EIP:    0060:[<c0161330>]    Tainted: G    U VLI
Apr 26 13:22:18 swkpc kernel: EFLAGS: 00010082  (2.6.18.2-34-default #1)
Apr 26 13:22:18 swkpc kernel: EIP is at free_block+0x5c/0xed
Apr 26 13:22:18 swkpc kernel: eax: f5dda400  ebx: dfcc5ec0  ecx: f5dda7a0  edx: 8094e480
Apr 26 13:22:18 swkpc kernel: esi: f5e23000  edi: dfcc8580  ebp: dfffb8d4  esp: dfcb1f1c
Apr 26 13:22:18 swkpc kernel: ds: 007b  es: 007b  ss: 0068
Apr 26 13:22:18 swkpc kernel: Process events/0 (pid: 6, ti=dfcb0000 task=dfcdd0c0 task.ti=dfcb0000)
Apr 26 13:22:18 swkpc kernel: Stack: dfcc3e14 00000002 00000000 dfffb8d4 00000002 dfffb8c0 dfcc5ec0 c0161439
Apr 26 13:22:18 swkpc kernel:        00000000 dfcc8580 dfcc5ec0 dfcc8580 c18e7140 00000296 c0162861 00000000
Apr 26 13:22:18 swkpc kernel:        00000000 c1807e00 c1807e04 c012e639 00000000 c180719c ffffffff c016280e
Apr 26 13:22:18 swkpc kernel: Call Trace:
Apr 26 13:22:18 swkpc kernel:  [<c0161439>] drain_array+0x78/0x97
Apr 26 13:22:18 swkpc kernel:  [<c0162861>] cache_reap+0x53/0x117
Apr 26 13:22:18 swkpc kernel:  [<c012e639>] run_workqueue+0x83/0xc5
Apr 26 13:22:18 swkpc kernel:  [<c016280e>] cache_reap+0x0/0x117
Apr 26 13:22:18 swkpc kernel:  [<c012ef54>] worker_thread+0xd9/0x10d
Apr 26 13:22:18 swkpc kernel:  [<c011a7e2>] default_wake_function+0x0/0xc
Apr 26 13:22:18 swkpc kernel:  [<c012ee7b>] worker_thread+0x0/0x10d
Apr 26 13:22:18 swkpc kernel:  [<c0131420>] kthread+0xc0/0xec
Apr 26 13:22:18 swkpc kernel:  [<c0131360>] kthread+0x0/0xec
Apr 26 13:22:18 swkpc kernel:  [<c0102005>] kernel_thread_helper+0x5/0xb
Apr 26 13:22:18 swkpc kernel: Code: 8b 02 f6 c4 40 74 03 8b 52 0c 8b 02 84 c0 78 08 0f 0b 60 02 c4 75 2c c0 8b 4a 1c 8b 44 24 20 8b 11 8b 9c 87 90 00 00 00 8b 41 04 <89> 42 04 89 10 31 d2 2b 71 0c c7 01 00 01 10 00 c7 41 04 00 02
Apr 26 13:22:18 swkpc kernel: EIP: [<c0161330>] free_block+0x5c/0xed SS:ESP 0068:dfcb1f1c
[crashed -> reboot]
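
To check the "quite similar" observation mechanically, the following minimal Python sketch extracts the function names from call-trace lines and compares two traces while ignoring the addresses, which differ between kernel builds. The regular expression and the helper name `trace_symbols` are assumptions made for illustration; only the first three frames of each trace are reproduced here.

```python
import re

# Matches the "[<address>] symbol+0xoff/0xlen" part of a call-trace line.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def trace_symbols(log_text):
    """Return the function names of all frames found in log_text, in order."""
    return FRAME_RE.findall(log_text)


# Three frames each, copied from the description and from comment 13.
trace_description = """
Jan 31 22:11:56 fileserver kernel:  [<c0161411>] drain_array+0x78/0x97
Jan 31 22:11:56 fileserver kernel:  [<c016286c>] cache_reap+0x53/0x117
Jan 31 22:11:56 fileserver kernel:  [<c012f679>] run_workqueue+0x83/0xc5
"""
trace_comment_13 = """
Apr 26 13:22:18 swkpc kernel:  [<c0161439>] drain_array+0x78/0x97
Apr 26 13:22:18 swkpc kernel:  [<c0162861>] cache_reap+0x53/0x117
Apr 26 13:22:18 swkpc kernel:  [<c012e639>] run_workqueue+0x83/0xc5
"""

same = trace_symbols(trace_description) == trace_symbols(trace_comment_13)
print(trace_symbols(trace_description), same)
```

Matching symbol sequences only show that both crashes happened on the same slab-reaping path; they do not by themselves identify which subsystem corrupted the memory.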
Comment 14 Felix Flemming 2007-08-02 13:03:36 UTC
We are running into the same problem on a dual Opteron 64-bit machine and unfortunately have to reboot the machine quite frequently to get it back to normal operation.  It seems to appear when too many users try to access too many files at the same time on different cifs mounts of the same NAS server.  The kernel does not print any oops messages like those above, but /var/log/messages looks quite similar to comment #12:

Aug  2 08:40:13 yors0353 kernel:  CIFS VFS: server not responding
Aug  2 08:40:13 yors0353 kernel:  CIFS VFS: server not responding
Aug  2 08:40:13 yors0353 kernel:  CIFS VFS: No response to cmd 115 mid 56997
Aug  2 08:40:13 yors0353 kernel:  CIFS VFS: No response for cmd 114 mid 56995
Aug  2 08:40:13 yors0353 kernel:  CIFS VFS: Send error in SessSetup = -11
Aug  2 08:40:13 yors0353 kernel:  CIFS VFS: Send error in SessSetup = -11
Aug  2 08:40:13 yors0353 kernel:  CIFS VFS: Error 0xffffff90 on cifs_get_inode_info in lookup of \Cambambe\svwg\sv00wg05\mlcfd\t01\svwg\svwg_001.res
Aug  2 08:40:13 yors0353 kernel:  CIFS VFS: No response to cmd 115 mid 56998
Aug  2 08:40:13 yors0353 kernel:  CIFS VFS: Send error in SessSetup = -11
Aug  2 08:40:13 yors0353 kernel:  CIFS VFS: No response for cmd 114 mid 56996
Aug  2 08:40:13 yors0353 kernel:  CIFS VFS: Error 0xffffff90 on cifs_get_inode_info in lookup of \Ligga_G3\RunnerThicknessSimulateFin\ga00_50pctThicker\mlcfd\t10\runner\cfxpost_error.log
Aug  2 08:40:13 yors0353 kernel:  CIFS VFS: Send error in SessSetup = -11
Aug  2 08:40:43 yors0353 kernel:  CIFS VFS: server not responding
Aug  2 08:40:43 yors0353 kernel:  CIFS VFS: server not responding
Aug  2 08:40:43 yors0353 kernel:  CIFS VFS: No response for cmd 114 mid 57004
Aug  2 08:40:43 yors0353 kernel:  CIFS VFS: No response to cmd 115 mid 57007
Aug  2 08:40:43 yors0353 kernel:  CIFS VFS: Send error in SessSetup = -11
Aug  2 08:40:43 yors0353 kernel:  CIFS VFS: Error 0xfffffff5 on cifs_get_inode_info in lookup of \Cambambe\svwg\sv00wg05\mlcfd\t01\svwg\cfxpost_error.log
Aug  2 08:40:43 yors0353 kernel:  CIFS VFS: No response for cmd 114 mid 57005
Aug  2 08:40:43 yors0353 kernel:  CIFS VFS: No response for cmd 114 mid 57006
Aug  2 08:40:43 yors0353 kernel:  CIFS VFS: Error 0xffffff90 on cifs_get_inode_info in lookup of \Ligga_G3\RunnerThicknessSimulateFin\ga00_50pctThicker\mlcfd\t10\runner\runner_001.res
Aug  2 08:40:43 yors0353 kernel:  CIFS VFS: Send error in SessSetup = -11
Aug  2 08:40:43 yors0353 kernel:  CIFS VFS: Send error in SessSetup = -11


Is there any chance that the "fix" indicated by Rosario in comment #11 will be integrated in a future update of openSUSE 10.2?  This is a serious issue, since we are using the machine in a production environment.

Any help is appreciated and if there is anything I can supply you with, just let me know.
Felix 
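
The numeric codes in the log above can be turned into errno names with a short Python sketch. It assumes the values are negative Linux errno numbers, printed either directly (-11) or as unsigned 32-bit hex (0xffffff90). The two-entry table is hard-coded with the Linux values because errno numbering differs between operating systems; the helper names are hypothetical.

```python
# Linux errno values for the two codes that occur in the log above.
# Hard-coded on purpose: errno numbering differs between operating systems.
LINUX_ERRNO_NAMES = {11: "EAGAIN", 112: "EHOSTDOWN"}


def to_signed32(value):
    """Interpret an unsigned 32-bit value as a two's-complement integer."""
    return value - (1 << 32) if value & 0x80000000 else value


def describe(raw):
    """Map a raw return code (signed or 32-bit hex) to an errno name."""
    rc = to_signed32(raw & 0xFFFFFFFF)
    return LINUX_ERRNO_NAMES.get(-rc, "unknown")


for raw in (0xFFFFFF90, 0xFFFFFFF5, -11):
    print(hex(raw & 0xFFFFFFFF), describe(raw))
```

Under that assumption, 0xffffff90 is -112 (EHOSTDOWN) and both 0xfffffff5 and -11 are EAGAIN, which fits a server that has stopped answering rather than a local crash.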
Comment 15 Steve French 2007-08-14 16:53:01 UTC
The problems described in comments #1 and #13 appear different from the problem in comment #14.  There is no indication of a cifs problem in comments #1 and #13 (mount failed, then later something unrelated to cifs oopsed).

The problem in comment #14 seems to be a server hang (no response to SessionSetup; the server stopped responding as the client was trying to set up a session).  What is the server type in comment #14?  (We ran into one Sun server that required us to add a sleep between negotiate protocol and session setup, but this code has long been in cifs.)

Is there any indication whether this also occurs with the current cifs backport for old kernels?  (It should not make any difference, as this looks unrelated to cifs so far.)

http://pserver.samba.org/samba/ftp/cifs-cvs/cifs-1.50.tar.gz
Comment 16 Felix Flemming 2007-12-03 22:15:35 UTC
I finally got to test a few things related to comments #14 and #15.  The same problem occurred independently of the cifs module version (I tried 1.45, 1.50 and 1.50c) in combination with kernel 2.6.18.8-0.5-default (openSUSE 10.2).

At the same time, our storage server started to run out of space, with only 50 GB left, and we noticed that the server also timed out more often on the Windows client side.  After removing a lot of data, we had plenty of space available again and the problem did not reappear; the timeouts were gone as well.  This points towards a problem with our storage server hardware (a Windows-based RAID system) and not the cifs module, as suggested in comment #15 by Steve French.

Thank you for your support!
Felix
Comment 17 Rosario Lombardo 2008-06-06 10:10:30 UTC
This problem seems to appear only when working with big files (>2GB, I think).
Maybe this helps.
Comment 18 Lars Müller 2008-06-18 20:08:53 UTC
Closing as requested with comment #16.