Bug 911347

Summary: Journal default settings use quite a lot of disk storage
Product: [openSUSE] openSUSE Tumbleweed
Reporter: Tomáš Chvátal <tchvatal>
Component: Basesystem
Assignee: systemd maintainers <systemd-maintainers>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Enhancement
Priority: P5 - None
CC: chcao, fcrozat, gnyers, mail, raul.malea, stefan.bruens, tchvatal, thomas.blume
Version: 201412*
Target Milestone: ---
Hardware: Other
OS: Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments:
- Records of time taken to flush the runtime journal to disk on 120GB unit
- Effect of journal fragmentation on flushing time
- Flushing times with SystemMaxFileSize=34M
- Flushing times with Systemd patched to btrfs
- Flushing times with /var/log mounted with autodefrag option

Description Tomáš Chvátal 2014-12-29 18:28:54 UTC
As we by default create a 40 GB partition for /, we effectively allow the journal to log 4 GB of data in its storage before truncating.

We should set the default to something like 100 MB in /etc/journalctl.conf:

SystemMaxUse=100M

And to match what rsyslog does, we should probably also set the 12th terminal to print the syslog:

ForwardToConsole=yes
TTYPath=/dev/tty12
MaxLevelConsole=info
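For reference, a sketch of how the suggested settings would look as one complete stanza (the `[Journal]` section header is my addition; journald only reads these keys from that section):

```
[Journal]
SystemMaxUse=100M
ForwardToConsole=yes
TTYPath=/dev/tty12
MaxLevelConsole=info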
Comment 1 Tomáš Chvátal 2014-12-29 18:29:49 UTC
Of course the config file is /etc/systemd/journald.conf; my brain just got ahead of me...
Comment 2 Thomas Blume 2015-01-07 11:33:16 UTC
Werner, I've successfully tested the following settings in /etc/systemd/journald.conf:

SystemMaxUse=100M
RuntimeMaxUse=100M
ForwardToConsole=yes
TTYPath=/dev/tty12

Any objections to adding them as defaults (MaxLevelConsole=info is already the default)?
Comment 3 Dr. Werner Fink 2015-01-07 11:40:37 UTC
(In reply to Thomas Blume from comment #2)

AFAICR the default log console was /dev/tty10.
Also, I'd like to know whether we should use relative values that would amount to 100 MB on a 40 GB root partition ... better yet, we should check the size of /var, which could be a separate partition.
Comment 4 Bruno Pesavento 2015-01-08 15:10:05 UTC
FYI, my test laptop has been running with "SystemMaxUse=160M" since OS 13.2RC1, yielding some 10 days of disk logs with average "office" use on Gnome X86_64.
Gnome-session is known to be "chatty" to the journal, so a much longer timespan should be expected with KDE.
Based on this experience, anything between 100M and 200M should do a fair service to the average desktop user IMHO.
Wasted disk space is a nuisance, but the really annoying problem is upsetting the boot process when using the 120-160 GB disks popular with laptops 5-6 years ago.
Flushing the journal clogs those disks just after mounting /home, meanwhile blocking all other processes since flushing appears to be in the critical chain (I don't understand why...).
With "SystemMaxUse=160M", a 120GB, 40MB/s disk takes about 1s on average to flush, with occasional peaks up to 20s, which still seems acceptable.
With default settings, after a few weeks the boot process becomes so slow that often a further 3 minutes delay for "udev wait for complete device initialization" kicks in.
All that is hardly noticeable with an SSD on a similar notebook even with current defaults.
If you prefer relative limits, maybe defaulting to 0.5 or 1% of available /var is more sensible for average setups than the current 10%.
Comment 5 Thomas Blume 2015-01-09 16:28:07 UTC
Thanks for the feedback.
Using relative values for disk space is indeed more sensible.

The code checks for the filesystem size of the mount where the journal is stored.
So, this covers a separate /var partition.
Still, it is very hard to estimate an average size for /var.
If it is 1 GB, we would have only 10 MB for the journal if we set SystemMaxUse to 1%.
I guess we'd need an absolute lower limit for small systems and a relative upper limit for big systems.
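A minimal sketch of such a clamping rule (illustrative only: the 1% ratio and the 64 MiB floor / 512 MiB ceiling are made-up numbers, not proposed defaults):

```shell
#!/bin/sh
# Sketch of the proposed rule: a relative quota bounded by an absolute
# floor (small systems) and an absolute ceiling (big systems).
# All sizes in MiB; ratio and bounds are illustrative assumptions.
clamp_max_use() {
    fs_mib=$1                   # size of the filesystem holding the journal
    quota=$((fs_mib / 100))     # relative part: 1% of the filesystem
    floor=64
    ceiling=512
    if [ "$quota" -lt "$floor" ]; then
        echo "$floor"
    elif [ "$quota" -gt "$ceiling" ]; then
        echo "$ceiling"
    else
        echo "$quota"
    fi
}

clamp_max_use 1024      # 1 GiB /var    -> 64  (floor wins)
clamp_max_use 40960     # 40 GiB root   -> 409 (1%)
clamp_max_use 102400    # 100 GiB disk  -> 512 (ceiling wins)
```

This is the same shape as the DEFAULT_MAX_USE_LOWER/UPPER bounds quoted from the code in comment #7, just with a smaller ratio and a higher floor.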
Comment 6 Bruno Pesavento 2015-01-09 20:27:32 UTC
(In reply to Thomas Blume from comment #5)
> I guess we'd need an absolute lower limit for small systems and an relative
> upper limit for big systems.

Maybe there is a simpler solution. What really hurts slow disks is max file size.
So something like "SystemMaxUse=400M" (or 1% of 40GB) and "SystemMaxFileSize=26M" (or three 8.4 MB log chunks as seen on my 64bit systems) is likely to make everybody happy. Anyway that sounds more reasonable than the current 4GB and 480MB, respectively, with the suggested 40 GB root/.
If anybody really needs more, or less, he/she is not an "average" user IMHO and should know how to tweak system defaults.
I'll try that on my slow test disk and report back anything of interest.
Comment 7 Thomas Blume 2015-01-12 16:08:43 UTC
(In reply to Bruno Pesavento from comment #6)
> (In reply to Thomas Blume from comment #5)
> > I guess we'd need an absolute lower limit for small systems and an relative
> > upper limit for big systems.
> 
> Maybe there is a simpler solution. What really hurts slow disks is max file
> size.
> So something like "SystemMaxUse=400M" (or 1% of 40GB) and
> "SystemMaxFileSize=26M" (or three 8.4 MB log chunks as seen on my 64bit
> systems) is likely to make everybody happy. Anyway that sounds more
> reasonable than the current 4GB and 480MB, respectively, with the suggested
> 40 GB root/.
> If anybody really needs more, or less, he/she is not an "average" user IMHO
> and should know how to tweak system defaults.
> I'll try that on my slow test disk and report back anything of interest.

Thanks, it would be good if you could test this.
Btw, the code includes some general limits on the journal size:

-->--
/* This is the minimum journal file size */
#define JOURNAL_FILE_SIZE_MIN (4ULL*1024ULL*1024ULL)           /* 4 MiB */

/* These are the lower and upper bounds if we deduce the max_use value
 * from the file system size */
#define DEFAULT_MAX_USE_LOWER (1ULL*1024ULL*1024ULL)           /* 1 MiB */
#define DEFAULT_MAX_USE_UPPER (4ULL*1024ULL*1024ULL*1024ULL)   /* 4 GiB */

/* This is the upper bound if we deduce max_size from max_use */
#define DEFAULT_MAX_SIZE_UPPER (128ULL*1024ULL*1024ULL)        /* 128 MiB */
--<--

but it seems they are too generous for old machines.
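My reading of the surrounding journald code (an assumption on my part, not a quoted excerpt) is that when max_size is not set explicitly it is derived as one eighth of max_use and then capped by DEFAULT_MAX_SIZE_UPPER, which can be sketched as:

```shell
#!/bin/sh
# Sketch: derive the per-file size limit (max_size) from the total
# journal quota (max_use), capped at DEFAULT_MAX_SIZE_UPPER (128 MiB).
# The max_use/8 rule is my reading of the code, not verbatim from it.
derive_max_size() {
    max_use_mib=$1
    cap=128                       # DEFAULT_MAX_SIZE_UPPER in MiB
    size=$((max_use_mib / 8))     # one eighth of the total quota
    if [ "$size" -gt "$cap" ]; then
        size=$cap
    fi
    echo "$size"
}

derive_max_size 400     # SystemMaxUse=400M -> 50 MiB per file
derive_max_size 4096    # 4 GiB default     -> capped at 128 MiB
```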
Comment 8 Bruno Pesavento 2015-01-13 00:14:46 UTC
Created attachment 619347 [details]
Records of time taken to flush the runtime journal to disk on 120GB unit

With "SystemMaxUse=400M" my test disk takes from 53ms to 35s to flush the runtime journal to disk, with a system.journal file on disk growing from 0 to 41.9 MB in 8.4 MB chunks and then rotating to a new file, since (41.9 MB + 8.4 MB) > (400 MB / 8).
See the attached file for more details.
During this test, getting to the Gnome desktop took from 90 to 120 s, more than twice what OS 11.x needed on the same box.
I'll keep testing with larger files, but I think this to be the practical limit on this box. FYI, similar laptops were sold around 2008 with SLED preinstalled by HP.
Comment 9 Thomas Blume 2015-01-13 11:47:39 UTC
Thanks for the feedback.
I'm still searching for alternatives to generally reducing the journal size, because we might need a full journal log for debugging.

Seems that a similar issue was reported here:

https://bugzilla.redhat.com/show_bug.cgi?id=1006386

Can you please take a look at comment #95 and following in that bug and check whether journal fragmentation has an influence on your machine?
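One way to check this (my suggestion, not from the bug: filefrag is part of e2fsprogs, and the default path assumes a persistent journal under /var/log/journal) is to count the extents of each on-disk journal file; a heavily fragmented file shows hundreds or thousands of extents:

```shell
#!/bin/sh
# Print the extent count of every on-disk journal file, e.g.
# "/var/log/journal/<id>/system.journal: 2713 extents found".
# If the directory does not exist, nothing is printed.
check_journal_frag() {
    dir=${1:-/var/log/journal}
    find "$dir" -name '*.journal' 2>/dev/null | while read -r f; do
        filefrag "$f" 2>/dev/null || true   # skip files filefrag cannot map
    done
}

check_journal_frag
```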
Comment 10 Bruno Pesavento 2015-01-13 18:53:37 UTC
Created attachment 619455 [details]
Effect of journal fragmentation on flushing time
Comment 11 Bruno Pesavento 2015-01-13 18:56:45 UTC
(In reply to Thomas Blume from comment #9)
> Thanks for the feedback.
> I'm still searching for alternatives to generally reducing the journal size,
> because we might need a full journal log for debugging.

I still think that limiting SystemMaxFileSize= well under DEFAULT_MAX_SIZE_UPPER (I would set it to 32 MiB on my test system) prevents most collateral damage on slow systems and has no adverse effects, to my understanding.
Then the defaults will take care of small systems, trying to leave some free space on /var anyway.

> Seems that a similar issue was reported here:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1006386
> 
> Can you please take a look at comment #95 and following in that bug and check
> whether journal fragmentation has an influence on your machine?

Fragmentation is definitely an issue on my test system! (See attached file).
Thanks for pointing at the upstream bug: a little too technical to my eyes, but what I understand seems exactly what I'm witnessing here.
It seems to me that the btrfs default in OS 13.2 and my tiny (11GB) test root/ compound an original design problem.
Fragmentation is almost nonexistent in my "productive" laptop so far (Tumbleweed, SSD, EXT4, 30GB root/).
So, waiting for the designers to solve the root problem, I'm not complaining: the test disk is going to be wiped by the next RC anyway.
But even halving the times I'm witnessing, I think that many laptops more than 5 years old are going to hit troubles with current defaults.
Feel free to ask for more testing if needed.
Comment 12 Thomas Blume 2015-01-14 09:39:03 UTC
(In reply to Bruno Pesavento from comment #11)
>
> I still think that limiting SystemMaxFileSize= well under
> DEFAULT_MAX_SIZE_UPPER (I would set it to 32 MiB on my test system) prevents
> most collateral damage on slow systems and has no adverse effects, to my
> understanding.
> Then the defaults will take care of small systems, trying to leave some free
> space on /var anyway.

Hm, https://bugzilla.redhat.com/show_bug.cgi?id=1006386 comment #84 indicates that a small MaxFileSize will hurt journal efficiency.
However, it doesn't seem that we would lose logs because of it.
And 32 MB is 3 times more than suggested in the RH bug.
Maybe the negative effects won't be too bad.
Does SystemMaxFileSize=32M also help if you have SystemMaxUse unset?

> Fragmentation is almost nonexistent in my "productive" laptop so far
> (Tumbleweed, SSD, EXT4, 30GB root/).
> So, waiting for the designers to solve the root problem, I'm not
> complaining: the test disk is going to be wiped by the next RC anyway.
> But even halving the times I'm witnessing, I think that many laptops more
> than 5 years old are going to hit troubles with current defaults.

Sure, we should continue pursuing this, independently of the SystemMaxFileSize settings.
Can you confirm that you see the fragmentation and the long journal flushes only on btrfs?
If so, I should probably open a separate btrfs bug.

> Feel free to ask for more testing if needed.

Thanks, it would be good if you could also test the other fix, i.e. edit /etc/sysctl.d/99-sysctl.conf and put:

net.unix.max_dgram_qlen=1000

there.
Does this have any further influence on the journal flush time?

Tomas, could you also test all the above?
I really would need testing on more hardware here.
Comment 13 Bruno Pesavento 2015-01-14 11:05:25 UTC
(In reply to Thomas Blume from comment #12)
> Can you confirm that you see the fragmentation and the long journal flushes
> only on btrfs?
> If so, I should probably open a separate btrfs bug.

Confirming that I see almost no fragmentation on EXT4 (extents from 2 to 6 with files ranging 8.4 - 50.3 MB, no custom journald.conf).
Having still 26GB free on root/ and the SSD firmware possibly hiding some frags might have a role here, though.
 
> > Feel free to ask for more testing if needed.
> 
> Thanks, would be good if you could also test the other fix, e.g. edit
> /etc/sysctl.d/99-sysctl.conf and put:
> 
> net.unix.max_dgram_qlen=1000
> 
> there.
> Does this has any further influence on the journal flush time?

Doing the other tests tonight, stay tuned.
Comment 14 Frederic Crozat 2015-01-14 11:54:32 UTC
Please note there have been a lot of btrfs-specific changes in systemd git recently to try to lower fragmentation of the journal (or at least lower its impact).
Comment 15 Bruno Pesavento 2015-01-14 20:23:16 UTC
Created attachment 619596 [details]
Flushing times with SystemMaxFileSize=34M

Here are the results with SystemMaxFileSize=34M and no explicit limit to SystemMaxUse.
Flush time grows to 25s, then cycles back to 57ms. No side effect visible to the user.
Adding net.unix.max_dgram_qlen=1000 has no effect, apparently.
Comment 16 Thomas Blume 2015-01-15 10:15:51 UTC
(In reply to Bruno Pesavento from comment #15)
> Created attachment 619596 [details]
> Flushing times with SystemMaxFileSize=34M
> 
> Here are the results with SystemMaxFileSize=34M and no explicit limit to
> SystemMaxUse.
> Flush time grows to 25s, then cycles back to 57ms. No side effect visible to
> the user.
> Adding net.unix.max_dgram_qlen=1000 has no effect, apparently.

Hm, still 25s, not really good.
I'm currently looking at the upstream commits to improve journal flush performance on btrfs (see comment#14).
Let's check whether they can give some better results.
I will let you know when testbuilds are available.
Comment 17 Bruno Pesavento 2015-01-16 16:48:07 UTC
(In reply to Thomas Blume from comment #16)
> Hm, still 25s, not really good.
> I'm currently looking at the upstream commits to improve journal flush
> performance on btrfs (see comment#14).
Agree, this is just a workaround to keep old systems rolling, waiting for the designers to solve the root cause. Journal logs are highly sparse, a 33.6MB file compresses to just 4MB in tar.gz, as a rough measure of info content...
Comment 18 Thomas Blume 2015-01-19 15:42:58 UTC
(In reply to Bruno Pesavento from comment #17)
> (In reply to Thomas Blume from comment #16)
> Agree, this is just a workaround to keep old systems rolling, waiting for
> the designers to solve the root cause. Journal logs are highly sparse, a
> 33.6MB file compresses to just 4MB in tar.gz, as a rough measure of info
> content...

I would be completely fine with the workaround, if we could only limit it to systems that really need it. But I don't know how we should distinguish old from recent machines.
Setting SystemMaxFileSize unconditionally gives me some headaches.
Let's keep it as a last resort for now and first try some other things.

A systemd build including the btrfs patches is now available at:

https://build.opensuse.org/package/binaries/home:tsaupe:branches:Base:System:Legacy/systemd?repository=openSUSE_Factory

can you give it a try?
Comment 19 Bruno Pesavento 2015-01-23 17:34:10 UTC
Created attachment 620725 [details]
Flushing times with Systemd patched to btrfs

No visible effect from btrfs patches in Systemd. Flushing time still grows to 31s with a 35.7MB file, then back to ms...
Tested with systemd-210-10.1.x86_64.rpm on OS 13.2 updated as of 2015-01-23.
Comment 20 Thomas Blume 2015-01-28 15:59:21 UTC
(In reply to Bruno Pesavento from comment #17)
> (In reply to Thomas Blume from comment #16)
> > Hm, still 25s, not really good.
> > I'm currently looking at the upstream commits to improve journal flush
> > performance on btrfs (see comment#14).
> Agree, this is just a workaround to keep old systems rolling, waiting for
> the designers to solve the root cause. Journal logs are highly sparse, a
> 33.6MB file compresses to just 4MB in tar.gz, as a rough measure of info
> content...

Ok, still investigating more options.
A first and quick one is using the autodefrag mount option for /var/log.
For details, see: http://www.spinics.net/lists/linux-btrfs/msg41015.html.
Can you please check whether mounting /var/log with autodefrag has an impact?

Secondly, I've found another commit that could have an influence:
http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58

I will provide new test packages, but first please try the mount option.
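For reference, a sketch of the fstab entry this would amount to (the UUID is a placeholder and the assumption that /var/log is its own mount is mine; on a default install /var/log may instead be a subvolume of the root filesystem):

```
# /etc/fstab -- hypothetical entry; substitute the real device/UUID
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /var/log  btrfs  defaults,autodefrag  0  0
```

An already-mounted btrfs filesystem can usually be switched on the fly with `mount -o remount,autodefrag /var/log`.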
Comment 21 Bruno Pesavento 2015-01-29 15:05:07 UTC
Created attachment 621355 [details]
Flushing times with /var/log mounted with autodefrag option

Mounting /var/log with autodefrag is definitely a better workaround: flushing time is still less than 4 s with a 41.9 MB system journal.
I have now removed the 34 MB limit, to be able to test upcoming patches with larger files.
Comment 22 Thomas Blume 2015-01-29 15:39:43 UTC
(In reply to Bruno Pesavento from comment #21)
> Created attachment 621355 [details]
> Flushing times with /var/log mounted with autodefrag option
> 
> Mounting /var/log autodefrag is definitely a better workaround: flushing
> time is still less than 4 s with a 41.9 MB system journal.
> Now removed the 34MB limit, to be able to test upcoming patches with larger
> files.

Ok, thanks for the feedback.
We might still set an upper limit for the journal files, because the manpage states:

-->--
autodefrag
           
Disable/enable auto defragmentation. Auto defragmentation detects small random writes into files and queue them up for the defrag process. Works best for small files; Not well suited for large database workloads.
--<--

But this rather refers to files of gigabyte size.
Can you test with SystemMaxFileSize up to 1 GB and check whether you see any negative impact?
Comment 23 Bruno Pesavento 2015-01-29 16:11:20 UTC
(In reply to Thomas Blume from comment #22)
> (In reply to Bruno Pesavento from comment #21)
> > Created attachment 621355 [details]
> > Flushing times with /var/log mounted with autodefrag option
> > 
> > Mounting /var/log autodefrag is definitely a better workaround: flushing
> > time is still less than 4 s with a 41.9 MB system journal.
> > Now removed the 34MB limit, to be able to test upcoming patches with larger
> > files.
> 
> Ok, thanks for the feedback.
> We might still set an upper limit for the journal files, because the manpage
> states:
> 
> -->--
> autodefrag
>            
> Disable/enable auto defragmentation. Auto defragmentation detects small
> random writes into files and queue them up for the defrag process. Works
> best for small files; Not well suited for large database workloads.
> --<--
> 
> But this rather refers to files of gigabyte size.
> Can you test with SystemMaxFileSize up to 1Gb and check whether you see any
> negative impact?

I'll let the file grow unconstrained (unless I hit a wall...), but I think the default max is 128 MiB (see comment #7).
Comment 24 Thomas Blume 2015-01-30 07:49:08 UTC
(In reply to Bruno Pesavento from comment #23)
> 
> I'll let the file grow unconstrained (unless I hit a wall...), but think the
> default max is 128MiB (see comment #7).

Indeed, please forget my comment and test without a limit.
Comment 25 Swamp Workflow Management 2015-03-31 14:09:05 UTC
SUSE-RU-2015:0638-1: An update that has 21 recommended fixes can now be installed.

Category: recommended (moderate)
Bug References: 897799,897803,898233,901481,902240,902901,903009,903963,904517,904828,905550,906709,907318,907393,908476,910315,910643,911347,912030,916420,918118
CVE References: 
Sources used:
SUSE Linux Enterprise Software Development Kit 12 (src):    aaa_base-13.2+git20140911.61c1681-3.1, systemd-210-55.2
SUSE Linux Enterprise Server 12 (src):    aaa_base-13.2+git20140911.61c1681-3.1, systemd-210-55.2
SUSE Linux Enterprise Desktop 12 (src):    aaa_base-13.2+git20140911.61c1681-3.1, systemd-210-55.2
Comment 26 Tomáš Chvátal 2015-04-01 08:52:14 UTC
For now, at least on Factory, it behaves quite decently. My journal folder is ~100 MB, and journalctl commands are quite fast.
Comment 27 Thomas Blume 2015-10-19 10:40:55 UTC
*** Bug 838475 has been marked as a duplicate of this bug. ***
Comment 28 Swamp Workflow Management 2015-11-12 13:16:12 UTC
openSUSE-RU-2015:1969-1: An update that has 21 recommended fixes can now be installed.

Category: recommended (moderate)
Bug References: 897799,897803,898233,901481,902240,902901,903009,903963,904517,904828,905550,906709,907318,907393,908476,910315,910643,911347,912030,916420,918118
CVE References: 
Sources used:
openSUSE Leap 42.1 (src):    aaa_base-13.2+git20140911.61c1681-7.1
Comment 29 Swamp Workflow Management 2016-02-03 14:45:16 UTC
openSUSE-RU-2016:0320-1: An update that has 146 recommended fixes can now be installed.

Category: recommended (moderate)
Bug References: 737690,742774,750845,818044,838475,841544,849870,852015,852021,852232,853293,854884,856389,856392,856858,857204,858864,859072,859365,860574,860937,861316,861489,863217,864745,864904,865834,866732,866933,867128,867663,867664,867840,868019,868230,868439,868931,869142,869603,872929,873432,873444,874665,875502,876587,876694,877021,877674,878525,880438,880732,881125,881559,881942,882393,882714,883565,884271,884403,885232,885288,886211,886599,886852,888178,888215,888612,889297,889357,890977,892096,892162,892300,893797,895087,896664,897799,897801,897803,898233,898240,898432,900558,901481,902240,902901,903009,903963,904214,904517,904828,905550,906709,906900,907318,907393,908476,909358,910643,911347,912030,912334,913517,916420,918118,919095,920195,921831,921898,921920,926169,927250,927457,928265,931388,932284,933365,933512,933521,933533,934077,934901,937512,937900,938908,939571,940264,941576,944132,944799,945282,947212,948458,948555,948705,949574,949683,949739,950510,951265,951663,953241,954336,954781,955635,961576
CVE References: 
Sources used:
openSUSE 13.1 (src):    systemd-210-40.1, systemd-mini-210-40.1
Comment 30 Thomas Blume 2017-09-27 12:34:52 UTC
closing