Bugzilla – Bug 241787
Kernel hangs when using a cifs on a multiprocessor opteron server using opensuse 10.2 32bit
Last modified: 2008-06-18 20:08:53 UTC
When working on a CIFS filesystem on a 4-processor Opteron machine with 8 GB RAM (openSUSE 10.2 32-bit, bigsmp kernel), the kernel hangs with the following:

Jan 31 22:03:50 fileserver sshd[5024]: Accepted keyboard-interactive/pam for rosario from 10.1.10.40 port 57808 ssh2
Jan 31 22:03:59 fileserver su: (to root) rosario on /dev/pts/0
Jan 31 22:04:16 fileserver syslog-ng[3183]: STATS: dropped 0
Jan 31 22:09:27 fileserver su: (to postgres) rosario on /dev/pts/0
Jan 31 22:11:56 fileserver kernel: BUG: unable to handle kernel paging request at virtual address 80b8e484
Jan 31 22:11:56 fileserver kernel: printing eip:
Jan 31 22:11:56 fileserver kernel: c0161308
Jan 31 22:11:56 fileserver kernel: *pde = 00000000
Jan 31 22:11:56 fileserver kernel: Oops: 0002 [#1]
Jan 31 22:11:56 fileserver kernel: SMP
Jan 31 22:11:56 fileserver kernel: last sysfs file: /firmware/edd/int13_dev81/extensions
Jan 31 22:11:56 fileserver kernel: Modules linked in: nls_utf8 cifs button battery ac apparmor aamatch_pcre loop dm_mod tg3 r8169 i2c_amd8111 i2c_amd756 ide_cd ohci_hcd cdrom usbcore i2c_core amd_rng parport_pc lp parport ext3 mbcache jbd edd fan sg aacraid amd74xx thermal processor sd_mod scsi_mod ide_disk ide_core
Jan 31 22:11:56 fileserver kernel: CPU: 3
Jan 31 22:11:56 fileserver kernel: EIP: 0060:[<c0161308>] Tainted: G U VLI
Jan 31 22:11:56 fileserver kernel: EFLAGS: 00010082 (2.6.18.2-34-bigsmp #1)
Jan 31 22:11:56 fileserver kernel: EIP is at free_block+0x5c/0xed
Jan 31 22:11:56 fileserver kernel: eax: f25fa9e0 ebx: dfffdec0 ecx: f2a50740 edx: 80b8e480
Jan 31 22:11:56 fileserver kernel: esi: f2703000 edi: dfff9a80 ebp: dfc8abe0 esp: dff09ef4
Jan 31 22:11:56 fileserver kernel: ds: 007b es: 007b ss: 0068
Jan 31 22:11:56 fileserver kernel: Process events/3 (pid: 13, ti=dff08000 task=dff076f0 task.ti=dff08000)
Jan 31 22:11:56 fileserver kernel: Stack: dffcd614 00000005 00000003 dfc8abd4 00000005 dfc8abc0 dfffdec0 c0161411
Jan 31 22:11:56 fileserver kernel: 00000000 dfff9a80 dfffdec0 dfff9a80 dfc8a7c0 00000286 c016286c 00000000
Jan 31 22:11:56 fileserver kernel: 00000000 c602bc00 c602bc04 c012f679 ffffffff ffffffff ffffffff c0162819
Jan 31 22:11:56 fileserver kernel: Call Trace:
Jan 31 22:11:56 fileserver kernel: [<c0161411>] drain_array+0x78/0x97
Jan 31 22:11:56 fileserver kernel: [<c016286c>] cache_reap+0x53/0x117
Jan 31 22:11:56 fileserver kernel: [<c012f679>] run_workqueue+0x83/0xc5
Jan 31 22:11:56 fileserver kernel: [<c0162819>] cache_reap+0x0/0x117
Jan 31 22:11:56 fileserver kernel: [<c012ff94>] worker_thread+0xd9/0x10d
Jan 31 22:11:56 fileserver kernel: [<c011b15f>] default_wake_function+0x0/0xc
Jan 31 22:11:56 fileserver kernel: [<c01324d4>] kthread+0xec/0x11c
Jan 31 22:11:56 fileserver kernel: [<c012febb>] worker_thread+0x0/0x10d
Jan 31 22:11:56 fileserver kernel: [<c01323e8>] kthread+0x0/0x11c
Jan 31 22:11:56 fileserver kernel: [<c0102005>] kernel_thread_helper+0x5/0xb
Jan 31 22:11:56 fileserver kernel: Code: 8b 02 f6 c4 40 74 03 8b 52 0c 8b 02 84 c0 78 08 0f 0b 60 02 86 81 2c c0 8b 4a 1c 8b 44 24 20 8b 11 8b 9c 87 10 02 00 00 8b 41 04 <89> 42 04 89 10 31 d2 2b 71 0c c7 01 00 01 10 00 c7 41 04 00 02
Jan 31 22:11:56 fileserver kernel: EIP: [<c0161308>] free_block+0x5c/0xed SS:ESP 0068:dff09ef4
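As a reading aid (not part of the original report), the numbers in this oops can be cross-checked against each other. The sketch below uses only values copied from the log: the faulting address, the edx register, the three bytes starting at the `<89>` marker in the Code: line, and the "Oops: 0002" error code. The function and constant names are made up for illustration. It decodes the marked bytes as a 32-bit x86 store and interprets the error code using the standard x86 page-fault bits. The result is consistent with a kernel-mode write through a bad pointer held in edx, but it does not say where that pointer came from.

```python
# Values copied from the oops above (32-bit x86, all hexadecimal).
FAULT_ADDRESS = 0x80B8E484                # "... at virtual address 80b8e484"
EDX = 0x80B8E480                          # "edx: 80b8e480"
FAULTING_BYTES = bytes.fromhex("894204")  # "<89> 42 04" in the Code: line
OOPS_ERROR_CODE = 0x0002                  # "Oops: 0002"

# Register numbering used by the x86 ModRM byte for 32-bit operands.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_mov_store(code: bytes):
    """Decode 'mov %reg, disp8(%base)' (opcode 0x89, ModRM mod=01, no SIB).

    Returns (source_register, base_register, displacement).
    Hypothetical helper: it handles only this one encoding.
    """
    opcode, modrm, disp8 = code[0], code[1], code[2]
    if opcode != 0x89:
        raise ValueError("not a 'mov r32 -> r/m32' instruction")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:
        raise ValueError("only the disp8 form without a SIB byte is handled")
    disp = disp8 - 256 if disp8 >= 128 else disp8  # disp8 is signed
    return REG32[reg], REG32[rm], disp


def decode_page_fault_code(code: int):
    """Interpret the low three bits of an x86 page-fault error code."""
    return {
        "page_present": bool(code & 1),  # 0 = page not present
        "write": bool(code & 2),         # 1 = fault on a write
        "user_mode": bool(code & 4),     # 0 = fault happened in kernel mode
    }


if __name__ == "__main__":
    src, base, disp = decode_mov_store(FAULTING_BYTES)
    print(f"faulting instruction: mov %{src}, {disp:#x}(%{base})")
    print(f"edx + {disp} = {EDX + disp:#010x}, fault at {FAULT_ADDRESS:#010x}")
    print(decode_page_fault_code(OOPS_ERROR_CODE))
```

The same check can be repeated for any other report in this bug that shows a paging-request fault in free_block, by substituting the address and edx value from that report.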
Created attachment 117182: Support file created by YaST
Support file created by YaST on the machine affected by the problem.
The processors are 64-bit AMD Opterons, while the openSUSE 10.2 installation used is the 32-bit version. In the next few days I'll test whether openSUSE 10.2 64-bit has the same problem and report back.
The same problem occurs with openSUSE 10.2 64-bit.
This seems to be the same on a 32-bit Intel single-processor machine with openSUSE 10.2 (but here only when running a VMware machine while accessing the CIFS share):

Feb 12 13:30:48 precision kernel: ------------[ cut here ]------------
Feb 12 13:30:48 precision kernel: kernel BUG at mm/slab.c:608!
Feb 12 13:30:48 precision kernel: invalid opcode: 0000 [#1]
Feb 12 13:30:48 precision kernel: SMP
Feb 12 13:30:48 precision kernel: last sysfs file: /devices/system/cpu/cpu0/cpufreq/scaling_governor
Feb 12 13:30:48 precision syslog-ng[2558]: Changing permissions on special file /dev/xconsole
Feb 12 13:30:48 precision syslog-ng[2558]: Changing permissions on special file /dev/tty10
Feb 12 13:30:48 precision kernel: Modules linked in: nls_utf8 cifs vmnet parport_pc parport vmmon snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device ipv6 af_packet cpufreq_conservative cpufreq_ondemand cpufreq_userspace cpufreq_powersave speedstep_centrino freq_table button battery ac apparmor aamatch_pcre loop dm_mod pcmcia usbhid yenta_socket nvidia rsrc_nonstatic tg3 pcmcia_core ipw2200 i2c_core uhci_hcd ehci_hcd ieee80211 ieee80211_crypt firmware_class snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm snd_timer snd soundcore snd_page_alloc usbcore intel_agp irda agpgart crc_ccitt reiserfs sr_mod cdrom edd fan sg ata_piix ahci libata thermal processor sd_mod scsi_mod
Feb 12 13:30:48 precision kernel: CPU: 0
Feb 12 13:30:48 precision kernel: EIP: 0060:[<c0161315>] Tainted: P U VLI
Feb 12 13:30:48 precision kernel: EFLAGS: 00010006 (2.6.18.2-34-default #1)
Feb 12 13:30:48 precision kernel: EIP is at free_block+0x41/0xed
Feb 12 13:30:48 precision kernel: eax: c001006c ebx: f242bc94 ecx: 00000002 edx: c1800000
Feb 12 13:30:48 precision kernel: esi: 00000000 edi: f6ca9380 ebp: f242bc94 esp: c20f7f1c
Feb 12 13:30:48 precision kernel: ds: 007b es: 007b ss: 0068
Feb 12 13:30:48 precision kernel: Process events/0 (pid: 4, ti=c20f6000 task=dfc805a0 task.ti=c20f6000)
Feb 12 13:30:48 precision kernel: Stack: eda99394 00000002 00000000 f242bc94 00000002 f242bc80 f2927980 c0161439
Feb 12 13:30:48 precision kernel: 00000000 f6ca9380 f2927980 f6ca9380 dfca85c0 00000296 c0162861 00000000
Feb 12 13:30:48 precision kernel: 00000000 c200d980 c200d984 c012e639 00000000 c200c8a4 dfca85c0 c016280e
Feb 12 13:30:48 precision kernel: Call Trace:
Feb 12 13:30:48 precision kernel: [<c0161439>] drain_array+0x78/0x97
Feb 12 13:30:48 precision kernel: [<c0162861>] cache_reap+0x53/0x117
Feb 12 13:30:48 precision kernel: [<c012e639>] run_workqueue+0x83/0xc5
Feb 12 13:30:48 precision kernel: [<c016280e>] cache_reap+0x0/0x117
Feb 12 13:30:48 precision kernel: [<c012ef54>] worker_thread+0xd9/0x10d
Feb 12 13:30:48 precision kernel: [<c011a7e2>] default_wake_function+0x0/0xc
Feb 12 13:30:48 precision kernel: [<c012ee7b>] worker_thread+0x0/0x10d
Feb 12 13:30:48 precision kernel: [<c0131420>] kthread+0xc0/0xec
Feb 12 13:30:48 precision kernel: [<c0131360>] kthread+0x0/0xec
Feb 12 13:30:48 precision kernel: [<c0102005>] kernel_thread_helper+0x5/0xb
Feb 12 13:30:48 precision kernel: Code: 00 e9 bb 00 00 00 8b 75 00 8d 96 00 00 00 40 c1 ea 0c c1 e2 05 03 15 10 36 41 c0 8b 02 f6 c4 40 74 03 8b 52 0c 8b 02 84 c0 78 08 <0f> 0b 60 02 c4 75 2c c0 8b 4a 1c 8b 44 24 20 8b 11 8b 9c 87 90
Feb 12 13:30:48 precision kernel: EIP: [<c0161315>] free_block+0x41/0xed SS:ESP 0068:c20f7f1c
Feb 12 13:30:48 precision kernel: klogd 1.4.1, ---------- state change ----------
Rosario: Are you performing a particular action? File copy (cp, rsync) or removal? Steve: 10.2 uses cifs 1.45. Have you seen such a defect before?
I think Steve fixed this. Please refer to kernel bugzilla bug 7093. The same fix will apply here IMHO.
Lars: No, the operations on the CIFS filesystem are performed by a script run by cron. But I've noticed that the system freezes more frequently if I SSH to the server and do some stuff there. I'm attaching the script used. Thanks in advance!
Created attachment 121887: Script used on the server
Hi! Any news?
Samba Team told me:
## https://bugzilla.samba.org/show_bug.cgi?id=4403
##
## ------- Comment #2 from sfrench@us.ibm.com 2007-03-14 17:18 MST -------
## Have the fixes for the hang fixed in cifs 1.48 (in either current mainline or the cifs-backport-for-old-kernels) been tried on this?
Created attachment 130826: what was left in /var/log/messages after pam_mount tried to mount cifs volumes
I had the same problem with cifs. In most cases it crashed immediately while trying to mount a cifs volume, regardless of whether this was done by pam_mount or on the command line with mount -t cifs //server/share....
This seems to be the same problem here. However, the kernel also crashes when the user is not actively using CIFS but has active mounts (the mounts made via pam_mount do not seem to cause trouble). In this case, the user did nothing related to CIFS (i.e. he was not working on a CIFS mount). He had just logged in (pam_mount mounted 5 CIFS shares without problems) and wanted to enter his KWallet password. The "Call Trace" is quite similar, so I assume it's the same problem.

[snip]
Apr 26 13:16:46 swkpc zmd: Daemon (WARN): Not starting remote web server
Apr 26 13:22:18 swkpc kernel: BUG: unable to handle kernel paging request at virtual address 8094e484
Apr 26 13:22:18 swkpc kernel: printing eip:
Apr 26 13:22:18 swkpc kernel: c0161330
Apr 26 13:22:18 swkpc kernel: *pde = 00000000
Apr 26 13:22:18 swkpc kernel: Oops: 0002 [#1]
Apr 26 13:22:18 swkpc kernel: SMP
Apr 26 13:22:18 swkpc kernel: last sysfs file: /devices/system/cpu/cpu0/cpufreq/ondemand/ignore_nice_load
Apr 26 13:22:18 swkpc kernel: Modules linked in: nls_utf8 cifs autofs4 nfsd ipv6 exportfs snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device nfs lockd nfs_acl sunrpc af_packet cpufreq_conservative cpufreq_ondemand cpufreq_userspace cpufreq_powersave powernow_k8 freq_table button battery ac apparmor aamatch_pcre loop dm_mod ide_cd cdrom pata_atiixp snd_atiixp snd_ac97_codec snd_ac97_bus atiixp tg3 snd_pcm snd_timer snd generic soundcore snd_page_alloc ati_agp ide_core agpgart ohci_hcd i2c_piix4 ehci_hcd i2c_core usbcore parport_pc lp parport reiserfs edd fan ata_piix sg sata_sil libata thermal processor sd_mod scsi_mod
Apr 26 13:22:18 swkpc kernel: CPU: 0
Apr 26 13:22:18 swkpc kernel: EIP: 0060:[<c0161330>] Tainted: G U VLI
Apr 26 13:22:18 swkpc kernel: EFLAGS: 00010082 (2.6.18.2-34-default #1)
Apr 26 13:22:18 swkpc kernel: EIP is at free_block+0x5c/0xed
Apr 26 13:22:18 swkpc kernel: eax: f5dda400 ebx: dfcc5ec0 ecx: f5dda7a0 edx: 8094e480
Apr 26 13:22:18 swkpc kernel: esi: f5e23000 edi: dfcc8580 ebp: dfffb8d4 esp: dfcb1f1c
Apr 26 13:22:18 swkpc kernel: ds: 007b es: 007b ss: 0068
Apr 26 13:22:18 swkpc kernel: Process events/0 (pid: 6, ti=dfcb0000 task=dfcdd0c0 task.ti=dfcb0000)
Apr 26 13:22:18 swkpc kernel: Stack: dfcc3e14 00000002 00000000 dfffb8d4 00000002 dfffb8c0 dfcc5ec0 c0161439
Apr 26 13:22:18 swkpc kernel: 00000000 dfcc8580 dfcc5ec0 dfcc8580 c18e7140 00000296 c0162861 00000000
Apr 26 13:22:18 swkpc kernel: 00000000 c1807e00 c1807e04 c012e639 00000000 c180719c ffffffff c016280e
Apr 26 13:22:18 swkpc kernel: Call Trace:
Apr 26 13:22:18 swkpc kernel: [<c0161439>] drain_array+0x78/0x97
Apr 26 13:22:18 swkpc kernel: [<c0162861>] cache_reap+0x53/0x117
Apr 26 13:22:18 swkpc kernel: [<c012e639>] run_workqueue+0x83/0xc5
Apr 26 13:22:18 swkpc kernel: [<c016280e>] cache_reap+0x0/0x117
Apr 26 13:22:18 swkpc kernel: [<c012ef54>] worker_thread+0xd9/0x10d
Apr 26 13:22:18 swkpc kernel: [<c011a7e2>] default_wake_function+0x0/0xc
Apr 26 13:22:18 swkpc kernel: [<c012ee7b>] worker_thread+0x0/0x10d
Apr 26 13:22:18 swkpc kernel: [<c0131420>] kthread+0xc0/0xec
Apr 26 13:22:18 swkpc kernel: [<c0131360>] kthread+0x0/0xec
Apr 26 13:22:18 swkpc kernel: [<c0102005>] kernel_thread_helper+0x5/0xb
Apr 26 13:22:18 swkpc kernel: Code: 8b 02 f6 c4 40 74 03 8b 52 0c 8b 02 84 c0 78 08 0f 0b 60 02 c4 75 2c c0 8b 4a 1c 8b 44 24 20 8b 11 8b 9c 87 90 00 00 00 8b 41 04 <89> 42 04 89 10 31 d2 2b 71 0c c7 01 00 01 10 00 c7 41 04 00 02
Apr 26 13:22:18 swkpc kernel: EIP: [<c0161330>] free_block+0x5c/0xed SS:ESP 0068:dfcb1f1c
[crashed -> reboot]
We are running into the same problem on a dual Opteron 64-bit machine and unfortunately have to reboot the machine quite frequently to get it back to normal operation. It seems to appear when too many users try to access too many files at the same time on different cifs mounts of the same NAS server. The kernel does not spit out any messages like the ones above, but /var/log/messages looks quite similar to comment #12:

Aug 2 08:40:13 yors0353 kernel: CIFS VFS: server not responding
Aug 2 08:40:13 yors0353 kernel: CIFS VFS: server not responding
Aug 2 08:40:13 yors0353 kernel: CIFS VFS: No response to cmd 115 mid 56997
Aug 2 08:40:13 yors0353 kernel: CIFS VFS: No response for cmd 114 mid 56995
Aug 2 08:40:13 yors0353 kernel: CIFS VFS: Send error in SessSetup = -11
Aug 2 08:40:13 yors0353 kernel: CIFS VFS: Send error in SessSetup = -11
Aug 2 08:40:13 yors0353 kernel: CIFS VFS: Error 0xffffff90 on cifs_get_inode_info in lookup of \Cambambe\svwg\sv00wg05\mlcfd\t01\svwg\svwg_001.res
Aug 2 08:40:13 yors0353 kernel: CIFS VFS: No response to cmd 115 mid 56998
Aug 2 08:40:13 yors0353 kernel: CIFS VFS: Send error in SessSetup = -11
Aug 2 08:40:13 yors0353 kernel: CIFS VFS: No response for cmd 114 mid 56996
Aug 2 08:40:13 yors0353 kernel: CIFS VFS: Error 0xffffff90 on cifs_get_inode_info in lookup of \Ligga_G3\RunnerThicknessSimulateFin\ga00_50pctThicker\mlcfd\t10\runner\cfxpost_error.log
Aug 2 08:40:13 yors0353 kernel: CIFS VFS: Send error in SessSetup = -11
Aug 2 08:40:43 yors0353 kernel: CIFS VFS: server not responding
Aug 2 08:40:43 yors0353 kernel: CIFS VFS: server not responding
Aug 2 08:40:43 yors0353 kernel: CIFS VFS: No response for cmd 114 mid 57004
Aug 2 08:40:43 yors0353 kernel: CIFS VFS: No response to cmd 115 mid 57007
Aug 2 08:40:43 yors0353 kernel: CIFS VFS: Send error in SessSetup = -11
Aug 2 08:40:43 yors0353 kernel: CIFS VFS: Error 0xfffffff5 on cifs_get_inode_info in lookup of \Cambambe\svwg\sv00wg05\mlcfd\t01\svwg\cfxpost_error.log
Aug 2 08:40:43 yors0353 kernel: CIFS VFS: No response for cmd 114 mid 57005
Aug 2 08:40:43 yors0353 kernel: CIFS VFS: No response for cmd 114 mid 57006
Aug 2 08:40:43 yors0353 kernel: CIFS VFS: Error 0xffffff90 on cifs_get_inode_info in lookup of \Ligga_G3\RunnerThicknessSimulateFin\ga00_50pctThicker\mlcfd\t10\runner\runner_001.res
Aug 2 08:40:43 yors0353 kernel: CIFS VFS: Send error in SessSetup = -11
Aug 2 08:40:43 yors0353 kernel: CIFS VFS: Send error in SessSetup = -11

Is there any chance that the "fix" indicated by Rosario in comment #11 will be integrated in a future update of openSUSE 10.2? This is a serious issue, since we are using the machine in a production environment. Any help is appreciated, and if there is anything I can supply you with, just let me know.
Felix
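For anyone reading the CIFS messages above, the error values can be turned into errno names. The following is a minimal sketch, assuming that the hexadecimal values after "Error" are negative kernel return codes printed as unsigned 32-bit numbers and that "SessSetup = -11" is likewise a negative errno. The helper names are hypothetical, and the small table lists only the two Linux errno values that occur in this log.

```python
# Linux errno names for the two values that appear in the log above.
# (Hard-coded so the sketch gives the same answer on any platform.)
LINUX_ERRNO = {
    11: "EAGAIN",      # "Try again"
    112: "EHOSTDOWN",  # "Host is down"
}


def as_signed32(value: int) -> int:
    """Reinterpret an integer as a signed 32-bit two's-complement value."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def describe(value: int):
    """Return (signed return code, errno name) for a logged error value."""
    rc = as_signed32(value)
    return rc, LINUX_ERRNO.get(-rc, "unknown")


if __name__ == "__main__":
    for logged in (0xFFFFFF90, 0xFFFFFFF5, -11):
        rc, name = describe(logged)
        print(f"{logged & 0xFFFFFFFF:#010x} -> {rc} ({name})")
```

Under those assumptions, 0xffffff90 corresponds to -112 (EHOSTDOWN) and both 0xfffffff5 and -11 correspond to EAGAIN, which fits the "server not responding" messages logged at the same time.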
The problems described in comments #1 and #13 appear different from the problem in comment #14. There is no indication of a cifs problem in comments #1 and #13 (the mount failed, then later something unrelated to cifs oopsed). The problem in comment #14 seems to be a server hang (no response to SessionSetup; the server stopped responding as the client was trying to set up a session). What is the server type in comment #14? (We ran into one Sun server that required us to add a sleep to wait for a while between negotiate protocol and session setup, but this code has long been in cifs.) Is there any indication whether this also occurs with the current cifs backport for old kernels? (It should not make any difference, as this looks unrelated to cifs so far.) http://pserver.samba.org/samba/ftp/cifs-cvs/cifs-1.50.tar.gz
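To make the distinction drawn in this comment easier to apply to other logs, here is a rough sketch that sorts /var/log/messages lines into the two kinds of symptom seen in this bug: oopses in free_block versus CIFS timeouts and session setup failures. The category names, patterns and function names are illustrative assumptions based only on the log excerpts quoted in this report; they are not an official classification.

```python
import re
from collections import Counter

# Patterns derived from the log excerpts in this report (illustrative only).
PATTERNS = [
    ("slab_oops", re.compile(r"EIP is at free_block\+")),
    ("cifs_timeout",
     re.compile(r"CIFS VFS: (server not responding|No response (to|for) cmd)")),
    ("cifs_session_setup", re.compile(r"CIFS VFS: Send error in SessSetup")),
]


def classify(line: str) -> str:
    """Return the first matching category name, or 'other'."""
    for name, pattern in PATTERNS:
        if pattern.search(line):
            return name
    return "other"


def summarize(lines) -> Counter:
    """Count how many lines fall into each category."""
    return Counter(classify(line) for line in lines)


# A few lines copied from the comments above.
SAMPLE_LINES = [
    "Jan 31 22:11:56 fileserver kernel: EIP is at free_block+0x5c/0xed",
    "Aug 2 08:40:13 yors0353 kernel: CIFS VFS: server not responding",
    "Aug 2 08:40:13 yors0353 kernel: CIFS VFS: No response to cmd 115 mid 56997",
    "Aug 2 08:40:13 yors0353 kernel: CIFS VFS: Send error in SessSetup = -11",
    "Jan 31 22:04:16 fileserver syslog-ng[3183]: STATS: dropped 0",
]

if __name__ == "__main__":
    for category, count in sorted(summarize(SAMPLE_LINES).items()):
        print(f"{category}: {count}")
```

A log that contains only the cifs_* categories would match the server-side timeout pattern described for comment #14, while slab_oops entries without any CIFS errors would match the earlier reports.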
I finally got to test a few things related to comments #14 and #15. The same problem occurred independently of the cifs module version (I tried 1.45, 1.50 and 1.50c) in combination with kernel 2.6.18.8-0.5-default (openSUSE 10.2). At the same time, our storage server started to run out of space, with only 50 GB left, and we noticed that the server also timed out more often on the Windows client side. After removing a lot of data, we had plenty of space available and the problem did not reappear; the timeouts were gone as well. This points towards a problem with our storage server hardware (a Windows-based RAID system) and not the cifs module, as suggested in comment #15 by Steve French. Thank you for your support! Felix
This problem seems to appear only when working with big files (>2 GB, I think). Maybe this helps.
Closing as requested with comment #16.