|
Bugzilla – Full Text Bug Listing |
| Summary: | MALLOC_CHECKING reminder (turn off) hwinfo --framebuffer suddenly takes very long | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE 11.3 | Reporter: | Stefan Seyfried <seife> |
| Component: | Basesystem | Assignee: | Marcus Meissner <meissner> |
| Status: | RESOLVED FIXED | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Major | ||
| Priority: | P2 - High | CC: | snwint |
| Version: | Factory | ||
| Target Milestone: | --- | ||
| Hardware: | Other | ||
| OS: | Other | ||
| Whiteboard: | |||
| Found By: | Community User | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: |
output of "mdt --modes --show dump"
malloc test program test run logs updated test program |
||
Did you eventually switch to libx86 or similar CPU emulation? It's especially annoing since hwinfo --framebuffer is called during mkinitrd, apparently for every kernel image, which can lead to quite some waiting ;) Your cpu is probably in some deep sleep mode or so. :-) Pick up 'mdt' from BS and try 'mdt --modes'. Does it also take that long? If so, please attach 'mdt --modes --show dump'. Created attachment 349949 [details]
output of "mdt --modes --show dump"
Yes. Each mode seems to take ~500 ms to get.
Additionally, both hwinfo and mdt seem to use obscene amounts of memory:
strolchi:~ # /usr/bin/time mdt --modes > /dev/null
2.69user 17.20system 0:25.80elapsed 77%CPU (0avgtext+0avgdata 2355200maxresident)k
16inputs+0outputs (1major+3039779minor)pagefaults 0swaps
The log shows about 2ms per mode. So that is not the problem. Is 'mdt --mode 0x0101' also that slow? Is it also using 2GB of memory? Is that a 32 bit machine? Yes, 32bits, non-PAE, kernel-desktop strolchi:~ # /usr/bin/time mdt --mode 0x0101 0x0101[bb]: 640x480+640, 8 bpp, fb: 0x98000000, a0000.7/a0000.0: 64k 0.35user 0.83system 0:01.86elapsed 63%CPU (0avgtext+0avgdata 2355200maxresident)k 0inputs+0outputs (0major+147363minor)pagefaults 0swaps strolchi:~ # /usr/bin/time hwinfo --framebuffer > /dev/null 3.63user 16.65system 0:32.91elapsed 61%CPU (0avgtext+0avgdata 1585856maxresident)k 7600inputs+0outputs (5major+3039276minor)pagefaults 0swaps So mdt --mode 0x0101 also takes long (almost two seconds) and uses lots of ram. hwinfo --framebuffer in fact does not need *that much* of it (still enough ;) I somehow suspect it's a gcc45 thing. Ok, I can reproduce it. I'll check. Created attachment 350048 [details]
malloc test program
The kernel makes the difference. The attached test program gets oom-killed
in the second mmap() if I call it with e.g.'t3 2 200000000' on a 512MB machine.
For you it will probably first fill up swap, then get killed.
I don't know if it is intentional that you can't use more virtual memory
than physical memory. Looks like a bug to me.
After playing a bit with /proc/sys/vm/overcommit_memory it seems the problem is that now the kernel seems to explicitly initialize (!?) the memory with 0. Wait what? seife@strolchi:/tmp> ./t3 16 200000000
16 x 200000000
works for me :-) Swaps like hell, but does not kill anything.
I have 1.5G of swap because of hibernation:
seife@strolchi:/tmp> free
total used free shared buffers cached
Mem: 1030116 375620 654496 0 1672 149744
-/+ buffers/cache: 224204 805912
Swap: 1574360 327188 1247172
Yes, but it shouldn't swap; that's the point. It didn't do it before. I don't think this is a kernel issue. I think it's a glibc issue. When I boot the openSUSE 11.3 M3 kernel on my SLES11 SP1 install, I can't reproduce the issue. When I boot the openSUSE 11.2 kernel on my Factory install, I can reproduce the issue. Upping the priority since this will affect performance all over the place. You are right; when I replace malloc() with mmap() the problem is gone. Huh, I'm confused... SLE11SP1 and Factory use virtually the same glibc. I cannot reproduce the problem with SLE11 kernel and HEAD userspace; the testcase returns immediately even with 16 200000000. Can you post a strace of the problematic testcase run? Do we have any SUSE-internal machine exhibiting the issue? Created attachment 350606 [details]
test run logs
I also can't reproduce this in a factory chroot with 11.2 kernel. That's
why I assigned it to kernel in the first place.
Fact is, though, that running factory on a factory kernel shows a difference
between using malloc() and mmap(). The attached log (taken in a vmware vm)
shows the time between mmap() calls increasing because it starts swapping
if the memory is used up. If I take the swap away, the process gets oom-killed.
Hm, but what is happenning between the calls is then a question. Does unfiltered strace show anything extra? Doesn't the small extra mmap() size make a difference? Created attachment 350615 [details]
updated test program
In case I'm missing something obvious, here is the test program I used.
No there are no other syscalls between the large mmaps(). > Doesn't the small extra mmap() size make a difference?
No. Using the values from the malloc() log for mmap() shows no difference.
Ok, can you please also check your environment for funky stuff? export | grep MALLOC_. I still can't see where could it spend the extra time otherwise. :( Would it be possible to oprofile it? I had it running in the rescue system from our install dvd. I don't think there's anything unusual in the environment there. Petr, you're dead on: strolchi:~ # /usr/bin/time hwinfo --framebuffer >/dev/null 3.34user 16.89system 0:29.99elapsed 67%CPU (0avgtext+0avgdata 1586256maxresident)k 5480inputs+0outputs (10major+3039317minor)pagefaults 0swaps strolchi:~ # MALLOC_CHECK_= MALLOC_PERTURB_= /usr/bin/time hwinfo --framebuffer > /dev/null 1.10user 0.07system 0:01.67elapsed 70%CPU (0avgtext+0avgdata 20688maxresident)k 1160inputs+0outputs (0major+9544minor)pagefaults 0swaps strolchi:~ # env|grep MALLOC MALLOC_CHECK_=3 MALLOC_PERTURB_=69 strolchi:~ # The difference is MALLOC_PERTURB_ = E vs. 69. What is 'E' (on my 11.2 machine) supposed to mean? An alias for 0? Yes, anything non-numeric is alias for 0 aka "don't perturb" (perturb will fill allocated/freed space with non-zero bytes). What on your system is setting these environment variables? My system has nothing of this. We should get rid of them if this is some OpenSUSE default now. it is default enabled on FACTORY, which is not too bad, given that FACTORY is for adventurous people who like to find bugs ;) seife@susi:~> env|grep MALLOC MALLOC_CHECK_=3 MALLOC_PERTURB_=69 I did fix it for me by not doing it for root like that (added the "if"): susi:~ # cat /etc/profile.d/malloc-debug.sh # remove this file before shipment if test x"$UID" != x0; then export MALLOC_CHECK_=3 export MALLOC_PERTURB_=69 fi Okay, then I'm reassigning this to Coolo to decide whether we want to change or document the defaults in any way. Thank you all. no, we will get rid of this file before the final. But they are there (also for root) for a reason - to find bugs it also shows that hwinfo --framebuffer does way too much allocations ;) malloc checking is now off in rc1. |
Always running current FACTORY, this bug appeared ~4 weeks ago. strolchi:~ # time hwinfo --framebuffer 02: None 00.0: 11001 VESA Framebuffer [Created at bios.459] Unique ID: rdCR.bQVdqIybBKC Hardware Class: framebuffer Model: "ATI P10" Vendor: "ATI Technologies Inc." Device: "P10" SubVendor: "ATI MOBILITY RADEON 9600" SubDevice: Revision: "01.00" Memory Size: 31 MB + 960 kB Memory Range: 0x98000000-0x99feffff (rw) Mode 0x0382: 320x200 (+320), 8 bits [... lots of modes... ] Mode 0x0344: 1400x1050 (+5600), 24 bits Config Status: cfg=no, avail=yes, need=no, active=unknown real 0m42.932s user 0m3.566s sys 0m16.508s It is running with 100% CPU usage, in stage > bios.4.2: mode info most of the time. The run time varies between ~25 and ~50 seconds, depending on idleness of the machine. Hardware is a Compaq nc6000 with Pentium M 1.7GHz and 1GB of RAM.