Bug 308698 - grep slow in utf8 locale
Status: RESOLVED FIXED
Duplicates: 338995 353718 381873 394665
Alias: None
Product: openSUSE 10.3
Classification: openSUSE
Component: Basesystem
Version: Beta 2
Hardware: Other Other
Importance: P5 - None : Major with 17 votes
Target Milestone: ---
Assignee: Andreas Schwab
QA Contact: E-mail List
Reported: 2007-09-07 16:07 UTC by Dirk Mueller
Modified: 2009-07-21 21:50 UTC
CC List: 19 users

Found By: Development


Description Dirk Mueller 2007-09-07 16:07:58 UTC
grep from 10.2:

time distpin foobla >/dev/null

real    0m6.311s
user    0m5.964s
sys     0m0.448s


grep from 10.3:

time distpin foobla >/dev/null

real    1m5.049s
user    1m3.012s
sys     0m1.224s

export LC_ALL=C
time distpin foobla >/dev/null

real    0m6.024s
user    0m5.692s
sys     0m0.400s
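
The third timing suggests a narrower workaround than exporting LC_ALL for the whole session: set the C locale for the single grep call only. A minimal sketch, assuming a POSIX shell; the helper name `cgrep` is made up for illustration. In the C locale `.` and bracket expressions work on bytes rather than characters, so results can differ for non-ASCII patterns.

```shell
# Hypothetical helper (not part of grep): run one grep call in the C
# locale without changing the locale of the calling shell.
cgrep() {
    LC_ALL=C grep "$@"
}

# Usage: count the lines containing "foo" in a three-line sample.
printf 'foo\nbar\nfoobar\n' | cgrep -c foo    # prints 2
```

Scoping the assignment to one command keeps sorting, messages and other locale-dependent behaviour of the rest of the script unchanged.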
Comment 1 Ludwig Nussel 2007-09-12 09:49:20 UTC
grep really is horribly slow
Comment 2 Stephan Kulow 2007-09-14 08:25:19 UTC
Ludwig, you should know:

Critical
  Crash, loss of data, corruption of data, severe memory leak
Comment 3 Dirk Mueller 2007-09-14 12:10:31 UTC
the 2.5.3 release does not help either, but using at least a released version would probably still be better. 
Comment 4 Bernd Strieder 2007-09-27 13:47:59 UTC
just found that I can confirm the regression.

openSuSE 10.3 RC1 without utf8

> time nm *.o | LANG=C time grep -v " W " > link.nm
0.12user 0.13system 0:02.16elapsed 12%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+189minor)pagefaults 0swaps

real    0m2.179s
user    0m1.840s
sys     0m0.564s

openSuSE 10.3 RC1 with utf8

> time nm *.o | time grep -v " W " > link.nm
335.10user 0.08system 5:36.78elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+252minor)pagefaults 0swaps

real    5m36.819s
user    5m36.817s
sys     0m0.528s

For comparison, SuSE 9.0 on the same hardware:

> time nm *.o | LANG=C time grep -v " W " > link.nm
0.04user 0.12system 0:08.58elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (134major+22minor)pagefaults 0swaps

real    0m8.611s
user    0m8.240s
sys     0m0.530s

Looking at the inner time output we are talking here about hundredths of seconds vs.
hundreds of seconds, a factor of over 1000 at the given input size.

grep is used by so many scripts in just about every distro that this problem could
become a show-stopper.
Comment 5 Dirk Mueller 2007-11-12 14:14:23 UTC
according to micha:

http://cvs.fedora.redhat.com/viewcvs/devel/grep/grep-2.5.1-egf-speedup.patch?rev=1.16&view=log

could help here.
Comment 6 Michael Matz 2007-11-12 14:30:46 UTC
This patch doesn't apply as is on our grep, but debian also uses this patch,
maybe theirs works.  It's part of
  http://ftp.debian.org/debian/pool/main/g/grep/grep_2.5.3~dfsg-3.diff.gz 

The redhat dfa-optional patch might also be interesting, but needs
thorough testing.  It might or might not actually be more correct, see
https://bugzilla.redhat.com/show_bug.cgi?id=121313#10 .

Also see https://bugzilla.redhat.com/show_bug.cgi?id=69900 for the original
discussion leading to the egf-speedup patch.
Comment 7 Bjoern Jacke 2007-12-21 14:05:03 UTC
*** Bug 338995 has been marked as a duplicate of this bug. ***
Comment 8 Bjoern Jacke 2007-12-21 14:30:07 UTC
the more data the slower grep gets - exponentially:

yes | head -10000 | LC_ALL=de_DE.UTF-8 time grep . > /dev/null
... 10s
yes | head -30000 | LC_ALL=de_DE.UTF-8 time grep . > /dev/null
... 45s

a quite simple grep on a 70MB log file is a *real* problem in a UTF-8 locale.
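
The two timings can be used to estimate how the run time grows. A rough worked calculation, assuming the time follows a power law t = c * n^k (an assumption; two data points cannot distinguish this from other growth laws): tripling the input multiplies the time by 4.5, so k = log(4.5) / log(3). On these figures the growth looks superlinear but short of quadratic. The helper name is illustrative only.

```shell
# Estimate the exponent k in t = c * n^k from two (lines, seconds) samples.
# LC_ALL=C keeps awk's number formatting independent of the user's locale.
growth_exponent() {
    LC_ALL=C awk -v n1="$1" -v t1="$2" -v n2="$3" -v t2="$4" \
        'BEGIN { printf "%.2f\n", log(t2 / t1) / log(n2 / n1) }'
}

# Figures from the two runs: 10000 lines in 10 s, 30000 lines in 45 s.
growth_exponent 10000 10 30000 45    # prints 1.37
```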
Comment 9 Dirk Mueller 2008-01-03 14:10:39 UTC
ping..
Comment 10 Michal Marek 2008-01-15 09:10:05 UTC
*** Bug 353718 has been marked as a duplicate of this bug. ***
Comment 11 Forgotten User qMyteedNxa 2008-02-03 14:03:43 UTC
>a quite simple grep on a 70MB log file is a *real* problem in a UTF-8 locale.

yes. i came across this today (opensuse 11 alpha1) after i wondered why grepping through a larger amount of data did nothing but burn cpu cycles.

it took me a while to find out that this is a locale problem.

this is not obvious to the end user, and people will have major hassle because of it.

major annoyance, imho.

please fix!
Comment 12 Stefan Nordhausen 2008-02-03 16:44:28 UTC
As a workaround, I installed grep from Suse 10.2 on my 10.3 system and it worked without problems, performance was back to normal.

Why not just _down_grade grep to the version in Suse 10.2? It is better to have an 'old' version that works than a new version that has been broken for 5 months now.
Comment 13 Forgotten User qMyteedNxa 2008-02-03 21:24:37 UTC
well, _my_ workaround is unsetting the LC_CTYPE envvar - but i don't like workarounds ;)

i personally don't have a real problem if i know the workaround, but i'm sure other people will have a problem with this if they don't know what the issue is. and this is why they will say: nahh. suse is crap. even such basic tools as grep have issues there. i'll go back to ubuntu. or use cygwin on windoze. or whatever.  (ok, apparently, other distros may have that issue, too)

imho it's a quality characteristic if things "just work" without bells and whistles, and that's why i complain.
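
Whether unsetting LC_CTYPE helps depends on the other locale variables, because the effective character type follows the POSIX precedence LC_ALL, then LC_CTYPE, then LANG. A small sketch of that lookup; the function name is hypothetical, and the final fallback to C is an assumption that matches glibc (POSIX leaves the default implementation-defined).

```shell
# Print the locale name that determines LC_CTYPE for the current shell.
# An empty variable counts the same as an unset one.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'C\n'    # assumed default when nothing is set
    fi
}

effective_ctype
```

If LANG is itself a UTF-8 locale, unsetting LC_CTYPE alone leaves the slow path active; running `locale` with no arguments shows the values actually in effect.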
Comment 14 Dirk Mueller 2008-02-04 09:53:31 UTC
vote for it! ;)
Comment 15 Forgotten User qMyteedNxa 2008-02-04 23:45:40 UTC
apparently, it seems it`s not only grep which is suffering from this problem:

linux:/tmp # echo $LC_CTYPE
de_DE.UTF-8

linux:/tmp # time cat largefile |cut -d ":" -f 2- >/dev/null

real    0m43.832s
user    0m6.008s
sys     0m0.160s

linux:/tmp # unset LC_CTYPE

linux:/tmp # time cat largefile |cut -d ":" -f 2- >/dev/null

real    0m2.620s
user    0m0.208s
sys     0m0.132s
Comment 16 Lincoln Yeoh 2008-02-18 11:43:03 UTC
Maybe it doesn't cause a crash, but it's still very bad if you had been relying on grep not to suddenly be slower by an order of magnitude (20x), especially on a new machine that's 2x faster, but overall 10X slower with grep "Vista Edition" installed.

20 times slower is terrible.

e.g.

grep = grep-2.5.2-28
grep251 = grep-2.5.1a-40 from a 10.2 box, which isn't fast either - in fact it's quite slow too in some cases.

time grep251 -E "(duosdwrwr|asdsadd)" messages-20080218

real    0m2.145s
user    0m1.920s
sys     0m0.224s

time grep -E "(duosdwrwr|asdsadd)" messages-20080218

real    0m44.081s
user    0m43.819s
sys     0m0.256s

time grep251 -E "(duosdwrwr|asdsadd)" messages-20080218

real    0m2.132s
user    0m1.868s
sys     0m0.264s

time grep251 -E "asdadd|trerwer" messages-20080218

real    0m49.606s
user    0m48.527s
sys     0m0.356s

Note that for the following dual pattern query grep251 is also very slow.

time grep -E "asdadd|trerwer" messages-20080218

real    0m50.177s
user    0m49.939s
sys     0m0.232s

time grep251 -E "asdadd" messages-20080218

real    0m0.759s
user    0m0.492s
sys     0m0.224s

But 251 is not too bad one pattern at a time, while grep-2.5.2-28 is bad.

time grep251 -E "trerwer" messages-20080218

real    0m0.686s
user    0m0.460s
sys     0m0.228s

time grep -E "trerwer" messages-20080218

real    0m43.013s
user    0m42.719s
sys     0m0.204s

time grep -E "asdadd" messages-20080218

real    0m42.810s
user    0m42.515s
sys     0m0.272s

export LC_ALL="C"
time grep251 -E "(duosdwrwr|asdsadd)" messages-20080218

real    0m1.535s
user    0m1.280s
sys     0m0.256s

time grep -E "(duosdwrwr|asdsadd)" messages-20080218

real    0m1.568s
user    0m1.312s
sys     0m0.248s

time grep -E "asdadd|trerwer" messages-20080218

real    0m2.152s
user    0m1.968s
sys     0m0.180s

time grep251 -E "asdadd|trerwer" messages-20080218

real    0m2.154s
user    0m1.840s
sys     0m0.312s

Must we start turning locale off by default? 

Maybe getting locale right is not easy, but 40 seconds for a foo|bar search? Why not just 2x longer than a single-pattern search?

Comment 17 Dirk Mueller 2008-02-18 14:05:06 UTC
the reason for that is that grep added mbrtowc in the hottest code path, and that is probably one of the slowest glibc functions to call. 

I've rewritten that code, and it is now faster by roughly a factor of 10 (which still leaves a factor 2 regression). the debian patches remove that code path altogether, but they disable the CW (Commentz-Walter) algorithm, which gives an even bigger speed regression. so while I like getting rid of the code path, disabling the CW algorithm is a no-go. 

the correct solution would be to adapt the cw matcher to use wide chars, which would be roughly on par with the old speed and allow it to be used. 

the new suse patches that were added are just entirely broken ;(
Comment 18 Forgotten User qMyteedNxa 2008-03-29 18:46:49 UTC
so no fix for this in factory yet?

i updated to latest today, but....

linux:/tmp # time cat rpm-no-files.txt |grep -v "^/proc" |grep -v "^/sys" |grep -v "^/dev" >rpm-no-files.txt2

real    2m9.914s
user    1m46.443s
sys     0m1.348s

linux-trkh:~ # echo $LC_CTYPE
de_DE.UTF-8

linux:/tmp # unset LC_CTYPE

linux:/tmp # time cat rpm-no-files.txt |grep -v "^/proc" |grep -v "^/sys" |grep -v "^/dev" >rpm-no-files.txt2

real    0m1.938s
user    0m0.860s
sys     0m1.052s

and this is just a 20mb file....
Comment 19 Stephan Kulow 2008-05-20 13:29:49 UTC
*** Bug 381873 has been marked as a duplicate of this bug. ***
Comment 22 Dirk Mueller 2008-05-28 11:47:01 UTC
the new submission from Andreas Schwab labeled "Some speadups" does improve the situation:

11.0 package:

real    0m17.966s
user    0m16.301s
sys     0m0.308s

new submission:

real    0m6.727s
user    0m5.276s
sys     0m0.304s

10.2 package:

real    0m3.114s
user    0m2.708s
sys     0m0.224s


so there is room for improvement, but a factor of 3 is already quite good progress.

Comment 23 Dirk Mueller 2008-05-28 11:47:52 UTC
the numbers above no longer match those in the description, as I meanwhile have a different machine with a different architecture.
Comment 25 Harald Koenig 2008-05-28 16:51:54 UTC
(In reply to comment #22 from Dirk Mueller)
> the new submission from Andreas Schwab labeled "Some speadups" does improve the
> situation:

is this new submission available for testing ?
Comment 26 Jan Engelhardt 2008-05-29 16:16:57 UTC
Forwarded to upstream some time ago.

http://www.nabble.com/horrible-utf-8-performace-in-wc-td17094488.html
Comment 27 Andreas Schwab 2008-06-10 13:53:45 UTC
*** Bug 394665 has been marked as a duplicate of this bug. ***
Comment 28 Andreas Schwab 2008-11-05 10:29:02 UTC
Fixed.
Comment 29 Michael Matz 2008-11-05 13:37:20 UTC
For some expressions this is indeed much better.  But for other, very simple
regexps (ones that match many lines) it's still very slow:

% time LANG=C grep . /var/log/messages > /dev/null
real    0m0.010s
user    0m0.004s
sys     0m0.004s

% time LANG=de_DE.UTF-8 grep . /var/log/messages > /dev/null
real    0m2.099s
user    0m2.072s
sys     0m0.028s

Interestingly en_EN.UTF-8 is faster again:

% time LANG=en_EN.UTF-8 grep . /var/log/messages > /dev/null
real    0m0.010s
user    0m0.008s
sys     0m0.000s

So, something is still fishy.  Should I create a new PR or just reopen this
one?
Comment 30 Jan Engelhardt 2008-11-05 14:57:39 UTC
That is because en_EN is not a known locale in the first place. Pick one that is, perhaps, valid, before you test?
Comment 31 Mike Fabian 2008-11-05 15:14:43 UTC
In case of unknown locales, C/POSIX is used as a fallback
and that seems to be the case in Michael’s benchmark.
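
Because an unknown locale name falls back silently, a benchmark can end up measuring the C locale by accident. One way to guard against that, sketched under the assumption that `locale -a` is available and lists codesets in a normalised form such as `de_DE.utf8` (as glibc does); the helper name is hypothetical.

```shell
# Succeed only if the named locale appears in the output of `locale -a`.
# Both sides are lowercased and stripped of hyphens so that
# "de_DE.UTF-8" matches a listed "de_DE.utf8".
locale_installed() {
    want=$(printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/-//g')
    locale -a 2>/dev/null \
        | tr '[:upper:]' '[:lower:]' \
        | sed 's/-//g' \
        | grep -xF -e "$want" >/dev/null
}

# Report which of the two names from the benchmark are really installed.
for loc in de_DE.UTF-8 en_EN.UTF-8; do
    if locale_installed "$loc"; then
        echo "$loc: installed"
    else
        echo "$loc: not installed, timings would reflect the fallback locale"
    fi
done
```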
Comment 32 Michael Matz 2008-11-05 15:20:43 UTC
Bah, indeed :)  Okay, that makes my point just more valid, UTF-8 locales
are still much slower.
Comment 33 Bernd Strieder 2008-11-06 18:32:48 UTC
Whoever needs grep searches with utf-8 support should possibly look at pcregrep. Maybe there is enough pain to replace grep with pcregrep, given it is compatible enough. I haven't checked yet.

Anything else will involve reimplementing grep from scratch to get proper UTF-8 support. Most occurrences of char will have to be replaced by wchar_t and conversion has to be done on input of both the pattern and the search data, involving regular expression matching of its own.
Comment 34 Jan Engelhardt 2008-11-06 23:19:06 UTC
>Whoever needs grep searches with utf-8 support should possibly look at pcregrep.

What would it offer over grep -P?
Comment 35 Jan Engelhardt 2008-11-06 23:22:46 UTC
We got a winner...

$ yes | head -n10000 | LC_ALL=de_DE.UTF-8 time grep . >/dev/null
11.94user 0.00system 0:12.20elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+272minor)pagefaults 0swaps

$ yes | head -n10000 | LC_ALL=de_DE.UTF-8 time grep -P . >/dev/null
0.00user 0.00system 0:00.01elapsed 57%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+246minor)pagefaults 0swaps
Comment 36 Michael Matz 2008-11-07 12:29:27 UTC
No, pcregrep (or grep -P) is no option, as the regular expression syntax is
that of perl, not of grep.  Yes, pcre is faster with UTF-8, it always was,
but that's irrelevant.
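
A small illustration of why the two syntaxes are not interchangeable: in grep's default basic regular expressions an unescaped parenthesis is a literal character, whereas in Perl-compatible syntax it opens a group. The sample input and the `sample` helper are made up for this sketch, and the second command assumes a grep built with -P support.

```shell
# Two sample lines: one without parentheses, one with literal parentheses.
sample() { printf 'ab\na(b)\n'; }

# Basic syntax (the default): parentheses are literals.
sample | grep 'a(b)'        # prints a(b)

# Perl-compatible syntax: "(b)" is a group, so this matches "ab" instead.
sample | grep -P 'a(b)'     # prints ab
```

The same pattern selects different lines in each mode, so scripts written for grep's own syntax cannot simply be switched over to -P.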

Re comment #33: we have this conversation because grep of course _does_ support
UTF-8 inputs just fine (as well as those of some other multi-byte locales).  It is only slow.
And yes, grep does implement some specialized versions of the matcher for
these locales.  No need for any reimplementation from scratch.