Bugzilla – Bug 308698
grep slow in utf8 locale
Last modified: 2009-07-21 21:50:26 UTC
grep from 10.2:
  time distpin foobla >/dev/null
  real 0m6.311s
  user 0m5.964s
  sys 0m0.448s
grep from 10.3:
  time distpin foobla >/dev/null
  real 1m5.049s
  user 1m3.012s
  sys 0m1.224s
  export LC_ALL=C
  time distpin foobla >/dev/null
  real 0m6.024s
  user 0m5.692s
  sys 0m0.400s
grep really is horribly slow
Ludwig, you should know: "Critical" means crash, loss of data, corruption of data, severe memory leak.
the 2.5.3 release does not help either, but using at least a released version would probably still be better.
just found that I can confirm the regression.
openSuSE 10.3 RC1 without utf8
  > time nm *.o | LANG=C time grep -v " W " > link.nm
  0.12user 0.13system 0:02.16elapsed 12%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+189minor)pagefaults 0swaps
  real 0m2.179s
  user 0m1.840s
  sys 0m0.564s
openSuSE 10.3 RC1 with utf8
  > time nm *.o | time grep -v " W " > link.nm
  335.10user 0.08system 5:36.78elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+252minor)pagefaults 0swaps
  real 5m36.819s
  user 5m36.817s
  sys 0m0.528s
For comparison SuSE 9.0 with the same hardware:
  > time nm *.o | LANG=C time grep -v " W " > link.nm
  0.04user 0.12system 0:08.58elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (134major+22minor)pagefaults 0swaps
  real 0m8.611s
  user 0m8.240s
  sys 0m0.530s
Looking at the inner time output we are talking here about hundredths of seconds vs. hundreds of seconds, a factor of over 1000 at the given input size. grep is used by so many scripts within about every distro, that this problem could become a show-stopper.
according to micha: http://cvs.fedora.redhat.com/viewcvs/devel/grep/grep-2.5.1-egf-speedup.patch?rev=1.16&view=log could help here.
This patch doesn't apply as is on our grep, but debian also uses this patch, maybe theirs works. It's part of http://ftp.debian.org/debian/pool/main/g/grep/grep_2.5.3~dfsg-3.diff.gz The redhat dfa-optional patch might also be interesting, but needs thorough testing. It might or might not actually be more correct, see https://bugzilla.redhat.com/show_bug.cgi?id=121313#10 . Also see https://bugzilla.redhat.com/show_bug.cgi?id=69900 for the original discussion leading to the egf-speedup patch.
*** Bug 338995 has been marked as a duplicate of this bug. ***
the more data the slower grep gets, and the slowdown is worse than linear:
  yes | head -10000 | LC_ALL=de_DE.UTF-8 time grep . > /dev/null ... 10s
  yes | head -30000 | LC_ALL=de_DE.UTF-8 time grep . > /dev/null ... 45s
a quite simple grep on a 70MB log file is a *real* problem in a UTF-8 locale.
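Taking the two timings quoted above at face value, tripling the input multiplied the run time by 4.5. That is faster than linear, but two data points are not enough to tell polynomial from exponential growth. A small sketch of the arithmetic, assuming run time grows like n^k (the helper name `growth_exponent` is made up for illustration):

```shell
# Estimate k in "time ~ n^k" from two (size, seconds) measurements.
# Arguments: n1 n2 t1 t2
growth_exponent() {
    # LC_ALL=C keeps the decimal point a '.' regardless of the user's locale.
    LC_ALL=C awk -v n1="$1" -v n2="$2" -v t1="$3" -v t2="$4" \
        'BEGIN { printf "%.2f\n", log(t2 / t1) / log(n2 / n1) }'
}

# 10000 lines took about 10s, 30000 lines about 45s.
growth_exponent 10000 30000 10 45   # prints 1.37
```

An exponent between 1 and 2 fits "worse than linear"; more measurements at larger sizes would be needed to say anything stronger.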
ping..
*** Bug 353718 has been marked as a duplicate of this bug. ***
> a quite simple grep on a 70MB log file is a *real* problem in a UTF-8 locale.
yes. i have come across this today (opensuse 11 alpha1) after i wondered why grepping through larger amounts of data did nothing but burn cpu cycles. i needed a while to find out that this is a locale problem. this is not obvious to the end-user and people will have major hassle because of this. major annoyance, imho. please fix!
As a workaround, I installed grep from Suse 10.2 on my 10.3 system and it worked without problems, performance was back to normal. Why not just _down_grade grep to the version in Suse 10.2? It is better to have an 'old' version that works than a new version that has been broken for 5 months now.
well, _my_ workaround is unsetting the LC_CTYPE envvar - but i don't like workarounds ;) i personally don't have a real problem if i know the workaround, but i'm sure other people will have a problem with this if they don't know what the issue is. and this is why they will say: nahh. suse is crap. even such basic tools like grep have issues on that. i get back to ubuntu. or use cygwin on windoze. or whatever. (ok, apparently, other distros may have that issue, too) imho it's a quality characteristic if things "just work" without bells and whistles, and that's why i complain.
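For scripts that cannot wait for a fixed package, the locale workaround mentioned in the comments above can be scoped to a single command instead of the whole session. A minimal sketch, assuming the pattern and the data are plain ASCII (`in_c_locale` is a hypothetical helper name, not part of any package):

```shell
# Run one command with the C locale without touching the caller's environment.
in_c_locale() {
    env LC_ALL=C "$@"
}

# Same filter as in the nm example earlier in this report, on two sample lines.
printf 'a W b\nc T d\n' | in_c_locale grep -v ' W '   # prints: c T d
```

This is only safe when byte-wise matching is acceptable: in the C locale grep treats every byte as one character, so `.` and bracket expressions no longer match whole multi-byte characters.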
vote for it! ;)
apparently, it seems it's not only grep which is suffering from this problem:
  linux:/tmp # echo $LC_CTYPE
  de_DE.UTF-8
  linux:/tmp # time cat largefile |cut -d ":" -f 2- >/dev/null
  real 0m43.832s
  user 0m6.008s
  sys 0m0.160s
  linux:/tmp # unset LC_CTYPE
  linux:/tmp # time cat largefile |cut -d ":" -f 2- >/dev/null
  real 0m2.620s
  user 0m0.208s
  sys 0m0.132s
Maybe it doesn't cause a crash, but it's still very bad if you had been relying on grep not to suddenly be slower by an order of magnitude (20x), especially on a new machine that's 2x faster, but overall 10X slower with grep "Vista Edition" installed. 20 times slower is terrible.
e.g. grep = grep-2.5.2-28
grep251 = grep-2.5.1a-40 from a 10.2 box, which isn't fast either - in fact it's quite slow too in some cases.
  time grep251 -E "(duosdwrwr|asdsadd)" messages-20080218
  real 0m2.145s
  user 0m1.920s
  sys 0m0.224s
  time grep -E "(duosdwrwr|asdsadd)" messages-20080218
  real 0m44.081s
  user 0m43.819s
  sys 0m0.256s
  time grep251 -E "(duosdwrwr|asdsadd)" messages-20080218
  real 0m2.132s
  user 0m1.868s
  sys 0m0.264s
  time grep251 -E "asdadd|trerwer" messages-20080218
  real 0m49.606s
  user 0m48.527s
  sys 0m0.356s
Note that for the following dual pattern query grep251 is also very slow.
  time grep -E "asdadd|trerwer" messages-20080218
  real 0m50.177s
  user 0m49.939s
  sys 0m0.232s
  time grep251 -E "asdadd" messages-20080218
  real 0m0.759s
  user 0m0.492s
  sys 0m0.224s
But 251 is not too bad one pattern at a time, while grep-2.5.2-28 is bad.
  time grep251 -E "trerwer" messages-20080218
  real 0m0.686s
  user 0m0.460s
  sys 0m0.228s
  time grep -E "trerwer" messages-20080218
  real 0m43.013s
  user 0m42.719s
  sys 0m0.204s
  time grep -E "asdadd" messages-20080218
  real 0m42.810s
  user 0m42.515s
  sys 0m0.272s
  export LC_ALL="C"
  time grep251 -E "(duosdwrwr|asdsadd)" messages-20080218
  real 0m1.535s
  user 0m1.280s
  sys 0m0.256s
  time grep -E "(duosdwrwr|asdsadd)" messages-20080218
  real 0m1.568s
  user 0m1.312s
  sys 0m0.248s
  time grep -E "asdadd|trerwer" messages-20080218
  real 0m2.152s
  user 0m1.968s
  sys 0m0.180s
  time grep251 -E "asdadd|trerwer" messages-20080218
  real 0m2.154s
  user 0m1.840s
  sys 0m0.312s
Must we start turning locale off by default? Maybe getting locale right is not easy, but 40 seconds for a foo|bar search? why not just 2 x longer than a single pattern search?
the reason for that is that grep added mbrtowc in the hottest code path, and that is probably one of the slowest glibc functions to call. I've rewritten that code, and it is now about a factor of 10 faster (which still leaves a factor 2 regression). the debian patches remove that code path altogether, but they disable the CW (Commentz-Walter) algorithm, which gives an even bigger speed regression. so while I like getting rid of the code path, disabling the CW algorithm is a no-go. the correct solution would be to adapt the cw matcher to use wide chars, which would be roughly on par with the old speed and allow it to be used. the new suse patches that were added are just entirely broken ;(
so no fix for this in factory yet? i updated to latest today, but....
  linux:/tmp # time cat rpm-no-files.txt |grep -v "^/proc" |grep -v "^/sys" |grep -v "^/dev" >rpm-no-files.txt2
  real 2m9.914s
  user 1m46.443s
  sys 0m1.348s
  linux-trkh:~ # echo $LC_CTYPE
  de_DE.UTF-8
  linux:/tmp # unset LC_CTYPE
  linux:/tmp # time cat rpm-no-files.txt |grep -v "^/proc" |grep -v "^/sys" |grep -v "^/dev" >rpm-no-files.txt2
  real 0m1.938s
  user 0m0.860s
  sys 0m1.052s
and this is just a 20mb file....
*** Bug 381873 has been marked as a duplicate of this bug. ***
the new submission from Andreas Schwab labeled "Some speadups" does improve the situation:
  11.0 package:   real 0m17.966s  user 0m16.301s  sys 0m0.308s
  new submission: real 0m6.727s   user 0m5.276s   sys 0m0.304s
  10.2 package:   real 0m3.114s   user 0m2.708s   sys 0m0.224s
so there is room for improvement, but a factor of 3 is already quite good progress.
the numbers above do not match those in comment #1 anymore as I meanwhile have a different machine with a different architecture.
(In reply to comment #22 from Dirk Mueller)
> the new submission from Andreas Schwab labeled "Some speadups" does improve the situation:
is this new submission available for testing?
Forwarded to upstream some time ago. http://www.nabble.com/horrible-utf-8-performace-in-wc-td17094488.html
*** Bug 394665 has been marked as a duplicate of this bug. ***
Fixed.
For some expressions this is indeed much better. But for other, very simple regexps (that match many lines), it's still very slow:
  % time LANG=C grep . /var/log/messages > /dev/null
  real 0m0.010s
  user 0m0.004s
  sys 0m0.004s
  % time LANG=de_DE.UTF-8 grep . /var/log/messages > /dev/null
  real 0m2.099s
  user 0m2.072s
  sys 0m0.028s
Interestingly en_EN.UTF-8 is faster again:
  % time LANG=en_EN.UTF-8 grep . /var/log/messages > /dev/null
  real 0m0.010s
  user 0m0.008s
  sys 0m0.000s
So, something is still fishy. Should I create a new PR or just reopen this one?
That is because en_EN is not a known locale in the first place. Pick one that is, perhaps, valid, before you test?
In case of unknown locales, C/POSIX is used as a fallback and that seems to be the case in Michael’s benchmark.
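Since an unknown locale name silently falls back to C/POSIX, it can help to check the name before benchmarking. A rough sketch, assuming a `locale` command that supports `-a` and may print the codeset in the `utf8` spelling (as glibc does); `locale_is_installed` is a hypothetical helper name:

```shell
# Succeed if the given locale name appears in `locale -a`.
# Names are compared case-insensitively with '-' removed, so that
# "de_DE.UTF-8" also matches a "de_DE.utf8" entry.
locale_is_installed() {
    want=$(printf '%s\n' "$1" | LC_ALL=C tr 'A-Z' 'a-z' | sed 's/-//g')
    locale -a 2>/dev/null | LC_ALL=C tr 'A-Z' 'a-z' | sed 's/-//g' |
        LC_ALL=C grep -qxF -- "$want"
}

for name in de_DE.UTF-8 en_EN.UTF-8; do
    if locale_is_installed "$name"; then
        echo "$name: installed"
    else
        echo "$name: not installed, timings would really measure the C locale"
    fi
done
```

What the loop prints depends on the locales installed on the machine running it.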
Bah, indeed :) Okay, that makes my point just more valid, UTF-8 locales are still much slower.
Whoever needs grep searches with utf-8 support should possibly look at pcregrep. Maybe there is enough pain to replace grep with pcregrep, given it is compatible enough. I haven't checked yet. Anything else will involve reimplementing grep from scratch to get proper UTF-8 support. Most occurrences of char will have to be replaced by wchar_t and conversion has to be done on input of both the pattern and the search data, involving regular expression matching of its own.
> Whoever needs grep searches with utf-8 support should possibly look at pcregrep.
What would it offer over grep -P?
We got a winner...
  $ yes | head -n10000 | LC_ALL=de_DE.UTF-8 time grep . >/dev/null
  11.94user 0.00system 0:12.20elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+272minor)pagefaults 0swaps
  $ yes | head -n10000 | LC_ALL=de_DE.UTF-8 time grep -P . >/dev/null
  0.00user 0.00system 0:00.01elapsed 57%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+246minor)pagefaults 0swaps
No, pcregrep (or grep -P) is no option, as the regular expression syntax is that of perl, not of grep. Yes, pcre is faster with UTF-8, it always was, but that's irrelevant. Re comment #33: we have this conversation because grep of course _does_ support UTF-8 inputs just fine (as it does some other multi-byte locales). It is only slow. And yes, grep does implement some specialized versions of the matcher for these locales. No need for any reimplementation from scratch.
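To illustrate the syntax objection: the same pattern text can mean different things to plain grep and to grep -P, so existing scripts cannot simply be switched over. A small sketch, assuming a grep built with PCRE support (the helper names `count_bre` and `count_pcre` are invented for this example):

```shell
# Count matching lines for a pattern under grep's default (basic) syntax
# and under Perl syntax. "|| true" hides grep's exit status 1 on no match.
count_bre()  { printf '%s\n' "$1" | grep -c -- "$2" || true; }
count_pcre() { printf '%s\n' "$1" | grep -P -c -- "$2" || true; }

# In basic syntax, parentheses are literal characters.
count_bre  'f(x)' 'f(x)'    # prints 1
# In Perl syntax, parentheses form a group, so this searches for "fx".
count_pcre 'f(x)' 'f(x)'    # prints 0
```

A script that greps for a literal `f(x)` would thus silently stop matching if grep were replaced by a PCRE-based tool.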