Bug 388088

Summary: squid sometimes fails to rebind to its ports on log rotation
Product: [openSUSE] openSUSE 10.2 Reporter: Lincoln Yeoh <lyeoh>
Component: OtherAssignee: Klaus Singvogel <kssingvo>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None    
Version: Final   
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: squid.conf
cache.log when stuff went wrong
cache.log when stuff started ok
much earlier cache.log when the rotate, reconfigure went ok

Description Lincoln Yeoh 2008-05-08 07:30:41 UTC
Squid _sometimes_ fails to rebind to its ports on log rotation.

OS: opensuse 10.2
Squid version: opensuse squid-2.6.STABLE5-7
Note: we have 60 redirectors attached to squid.

/var/log/squid/cache.log
2008/05/06 03:32:48| commBind: Cannot bind socket FD 72 to 202.75.242.254:3128: (98) Address already in use
2008/05/06 03:32:48| commBind: Cannot bind socket FD 72 to 202.75.242.254:8080: (98) Address already in use

I suspect this happens when squid is somehow still busy with something (redirectors or some pending connection?), so the sockets are not unbound by squid before squid tries to reattach. 

This might be because opensuse's /etc/init.d/squid reload does squid -k rotate, sleep 2, squid -k reconfigure. And /etc/logrotate.d/squid has /etc/init.d/squid reload as postrotate for the various squid log files. 

What possibly happens is 2 seconds after squid -k rotate, the squid is still busy restarting the 60 redirectors, then squid -k reload is called, and it is not possible to get rid of the previous squid socket binds in time.

We will be trying a workaround of just doing squid -k rotate for log rotation. 

Log follows (note the HB: DEAD: <PID> messages are from the redirector processes when they die. squid is configured to bind to 127.0.0.1:3128 127.0.0.1:8080 202.75.242.254:3128 202.75.242.254:8080 ). 

2008/05/06 03:32:38| storeDirWriteCleanLogs: Starting...
HB: DEAD: 24962
HB: DEAD: 24964
HB: DEAD: 24960
HB: DEAD: 24976
HB: DEAD: 24977
HB: DEAD: 24981
HB: DEAD: 24959
HB: DEAD: 24967
HB: DEAD: 24970
HB: DEAD: 24975
HB: DEAD: 24982
HB: DEAD: 24941
HB: DEAD: 24916
HB: DEAD: 24979
HB: DEAD: 24917
HB: DEAD: 24958
HB: DEAD: 24921
2008/05/06 03:32:38|     65536 entries written so far.
HB: DEAD: 24926
HB: DEAD: 24929
HB: DEAD: 24966
HB: DEAD: 24908
HB: DEAD: 24947
HB: DEAD: 24915
HB: DEAD: 24920
HB: DEAD: 24923
HB: DEAD: 24925
HB: DEAD: 24928
HB: DEAD: 24930
HB: DEAD: 24961
HB: DEAD: 24969
HB: DEAD: 24938
HB: DEAD: 24957
HB: DEAD: 24942
HB: DEAD: 24974
HB: DEAD: 24948
2008/05/06 03:32:38|    131072 entries written so far.
2008/05/06 03:32:38|   Finished.  Wrote 138190 entries.
2008/05/06 03:32:38|   Took 0.2 seconds (599582.6 entries/sec).
2008/05/06 03:32:38| aioSync: flushing pending I/O operations
2008/05/06 03:32:38| aioSync: done
2008/05/06 03:32:38| logfileRotate: /var/log/squid/access.log
2008/05/06 03:32:38| Pinger socket opened on FD 9
2008/05/06 03:32:38| helperOpenServers: Starting 60 'redirector.pl' processes
2008/05/06 03:32:44| Reconfiguring Squid Cache (version 2.6.STABLE5)...
2008/05/06 03:32:44| FD 72 Closing HTTP connection
2008/05/06 03:32:44| FD 73 Closing HTTP connection
2008/05/06 03:32:44| FD 74 Closing HTTP connection
2008/05/06 03:32:44| FD 75 Closing HTTP connection
2008/05/06 03:32:44| Closing Pinger socket on FD 9
2008/05/06 03:32:44| Cache dir '/var/cache/squid' size remains unchanged at 4096000 KB
2008/05/06 03:32:44| Logformat for 'itt' is ' %ts.%03tu %6tr %>a %Ss/%03Hs %<st %rm %ru %Sh/%<A %mt "%{Referer}>h" "%{User-Age
nt}>h"'
2008/05/06 03:32:44| accessLogParseLogFormat: got definition ' %ts.%03tu %6tr %>a %Ss/%03Hs %<st %rm %ru %Sh/%<A %mt "%{Refere
r}>h" "%{User-Agent}>h"'
2008/05/06 03:32:44| Initialising SSL.
2008/05/06 03:32:44| Store logging disabled
2008/05/06 03:32:44| User-Agent logging is disabled.
2008/05/06 03:32:44| Referer logging is disabled.
2008/05/06 03:32:44| DNS Socket created at 0.0.0.0, port 42512, FD 7
2008/05/06 03:32:44| Adding nameserver 127.0.0.1 from squid.conf
2008/05/06 03:32:44| helperOpenServers: Starting 60 'redirector.pl' processes
HB: DEAD: 28224
HB: DEAD: 28225
HB: DEAD: 28227
HB: DEAD: 28229
HB: DEAD: 28230
HB: DEAD: 28231
HB: DEAD: 28232
HB: DEAD: 28233
HB: DEAD: 28236
HB: DEAD: 28238
HB: DEAD: 28240
HB: DEAD: 28241
HB: DEAD: 28242
HB: DEAD: 28243
HB: DEAD: 28248
HB: DEAD: 28244
HB: DEAD: 28260
2008/05/06 03:32:48| commBind: Cannot bind socket FD 72 to 202.75.242.254:3128: (98) Address already in use
HB: DEAD: 28262
2008/05/06 03:32:48| commBind: Cannot bind socket FD 72 to 202.75.242.254:8080: (98) Address already in use
HB: DEAD: 28263
HB: DEAD: 28291
HB: DEAD: 28269
HB: DEAD: 28270
2008/05/06 03:32:48| Accepting proxy HTTP connections at 127.0.0.1, port 3128, FD 72.
2008/05/06 03:32:48| Accepting proxy HTTP connections at 127.0.0.1, port 8080, FD 73.
Comment 1 Lincoln Yeoh 2008-05-08 07:34:18 UTC
This can cause significant confusion and problems since though squid still ends up running it is not listening on all the required ports. 

So it's up, but not quite up.
Comment 2 Lincoln Yeoh 2008-05-08 14:29:12 UTC
I also notice that:
/etc/logrotate.d/squid 
does not have the sharedscripts directive for all three logs.
Comment 3 Lincoln Yeoh 2008-05-09 03:31:35 UTC
By the way, won't doing "/etc/init.d/squid reload" for a log rotate break existing downloads?

Seems a really bad idea to restart services for log rotation, rather than just doing log rotation. 

Doing a reload during rotation also means that if someone messed up the squid.conf file by mistake, you end up having the users noticing during log rotation time, which is usually when you don't want to be called.
Comment 4 Klaus Singvogel 2008-05-09 10:05:06 UTC
If there is no reload signal, then squid continues writing into rotate logfiles, as the file handle is not changed.

Will have a look...
Comment 5 Klaus Singvogel 2008-05-09 10:31:05 UTC
Please attache (via attachement mechanism - and do NOT paste)
 /etc/squid/squid.conf
 /var/log/squid/cache.log

Your provided information and assumptions aren't enough for me for debugging.
Comment 6 Lincoln Yeoh 2008-05-10 15:51:55 UTC
We have changed /etc/logrotate.d/squid
To:
/var/log/squid/cache.log {
    compress
    dateext
    maxage 365
    rotate 99
    size=+1024k
    notifempty
    missingok
    create 640 squid root
    sharedscripts
    postrotate
     /usr/sbin/squid -k rotate
    endscript
}

/var/log/squid/access.log {
    compress
    dateext
    maxage 365
    rotate 99
    size=+4096k
    notifempty
    missingok
    create 640 squid root
    sharedscripts
    postrotate
     /usr/sbin/squid -k rotate
    endscript
}

/var/log/squid/store.log {
    compress
    dateext
    maxage 365
    rotate 99
    size=+4096k
    notifempty
    missingok
    create 640 squid root
    sharedscripts
    postrotate
     /usr/sbin/squid -k rotate
    endscript
}
--
This appears to rotate logs successfully for us while avoiding the squid rebind, and thus hopefully the problem. 

Not sure if this problem will occur for a /etc/init.d/squid restart 

Comment 7 Lincoln Yeoh 2008-05-10 15:53:22 UTC
Created attachment 214198 [details]
squid.conf
Comment 8 Lincoln Yeoh 2008-05-10 16:09:57 UTC
Created attachment 214199 [details]
cache.log when stuff went wrong

2008/05/04 03:50:19| commBind: Cannot Bind socket FD 72 to 202.75.242.254:3128: (98) Address already in use
2008/05/04 03:50:19| commBind: Cannot bind socket FD 72 to 202.75.242.254:8080: (98) Address already in use
Comment 9 Lincoln Yeoh 2008-05-10 16:19:01 UTC
Created attachment 214200 [details]
cache.log when stuff started ok

this is when stuff was ok.

Not sure if it's relevant but for some reason the Pinger socket was closed in both OK logs and not closed in the bad log.
Comment 10 Lincoln Yeoh 2008-05-10 16:21:02 UTC
Created attachment 214201 [details]
much earlier cache.log when the rotate, reconfigure went ok
Comment 11 Lincoln Yeoh 2008-05-10 16:24:20 UTC
I now think the Pinger bit is not relevant. This is because when squid didn't rebind properly, it stopped receiving user requests, and so that's why the pinger never got around to closing.
Comment 12 Klaus Singvogel 2008-05-14 10:42:31 UTC
Sorry, but at the moment I'm unable to reproduce it.

Currently I've got the feeling that the old squid isn't dying fast enough and therefore hanging a bit to long in the system, locking the port.

So, you've to do some more debugging for me.

Next time when this happens do (as root) in a command line terminal the following command. This might be time critical, so please don't wait.

- Identify the processes which are listening on the 3128 port:
  lsof -i | egrep '(ndl-aas|3128).*LISTEN'

==> This will report you the process name (1st column) and the process id (2nd column).

==> If there is only one process reported, you're too late. Try again.

- Check whether both processes have the same process ids (pid) or at least parent pids. ==> read the output.

- If not, identify the process ids and check when both processes were
started:
  ps  -Flp <<pid1>> <<pid2>>

Tell me the result.

==> still NEEDINFO
Comment 13 Lincoln Yeoh 2008-05-19 16:47:19 UTC
It's probably as what you say, perhaps it is a race condition and only happens under load and is related to the size of the logs being rotated (since suse 10.2 runs /etc/init.d/squid reload for each log file - 3 times).

As far as I know it doesn't happen that often on our live sites but it definitely has happened on more than two sites on different occasions- and given that squid does continue running (albeit only on 127.0.0.1), I do not think that somebody did something strange directly to the site.

The other explanation would be the 202.75.242.254 interface going down around the time of log rotation _and_ somehow coming back up again (since it was obviously not down). I think this is very unlikely to be the cause.

For what it's worth, I have also noticed that at some sites squid is restarting the redirectors as root instead of squid!

Anyway, as per comment #6, we are taking measures to avoid requiring squid to attempt to release and rebind to ports in the first place (which appears to occur with suse's /etc/init.d/squid reload ). So instead of doing suse 10.2 style of squid -k rotate; sleep 2; squid -k reconfigure potentially for _each_ squid log to be rotated, we will now just do squid -k rotate once for all 3 squid logs. 

This should hopefully avoid the following undesirables (estimate in brackets):
1) squid not rebinding on all required ports (very likely to work)
2) squid reloading its config every day when admins might not be around (100%)
3) squid restarting redirector processes as root (might help, but not a big issue)

However this is also likely to reduce the odds of us reproducing the problem and doing your suggested debugging.
Comment 14 Klaus Singvogel 2008-05-19 17:19:25 UTC
Ok, I've added "sharedscripts" to the logrotate configuration.

Please note, that we don't fix this for older SUSE Linux version as it is no security issue (policy). Please note either, that the logrotate config file is a config file and therefore not replaced on next security update at your site.

fixed versions submitted.
Comment 15 Lincoln Yeoh 2008-05-20 10:46:17 UTC
Just like to point out that as per #6 we didn't just add the sharedscripts directive we also changed:

/etc/init.d/squid reload 

to:

/usr/sbin/squid -k rotate

Thanks. We'll be using our version of the config file anyway.
Comment 16 Klaus Singvogel 2008-05-20 11:54:30 UTC
Sorry, but a long time ago there was an issue, because we didn't do that.
The bad news is, that I cannot find the reference anymore.