NFS for HA Lighttpd
We currently have lighttpd deployed in a high availability (HA) configuration. To give our users a consistent experience regardless of which web server they are connected to, we serve identical content to each web server from a NAS over NFS.
The diagram below should make our setup a little clearer:
Last week, we encountered some problems where requests for our website would hang. Lighttpd had nothing to say about this problem and restarting it did nothing. After about 10-20 minutes, the problem would resolve itself and everything would go back to business as usual.
Personally, this type of problem really frustrates me, so I made it a mission to figure out what was going wrong and fix it. (Not that I had much of a choice; this is a production setup.)
I had some initial hunches to check first:
- some sort of short-term DoS attack
- a traffic spike
- lighttpd and php (running in fcgi mode) running out of available file descriptors
Our monitoring software showed no signs of abnormal traffic during the times when the problem was occurring, which ruled out the first two possible causes.
The last possible cause is more difficult to track down. Lighttpd’s own documentation says that the error messages regarding a shortage of file descriptors may not be written to the error log and may only show up in test cases. Well, doesn’t that just take the cake?! I checked anyway, but found nothing of interest in lighttpd’s error logs.
Next step? Calculate the number of file descriptors currently being used by lighttpd and php under normal load and compare them to the maximums defined by lighttpd’s configuration file and the operating system.
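For anyone who wants to repeat that count, here is a rough Python sketch. It assumes a Linux-style /proc filesystem, where /proc/<pid>/fd holds one entry per open descriptor and /proc/<pid>/comm holds the command name. The helper names (`count_open_fds`, `fds_by_command`) are made up for this example, and it is not necessarily how the numbers in this post were gathered.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Return the number of open file descriptors for one process.

    Assumes the Linux /proc layout. Reading another user's process
    usually requires root.
    """
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def fds_by_command(name, proc_root="/proc"):
    """Map pid -> open descriptor count for processes whose command name matches."""
    counts = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "comm")) as handle:
                command = handle.read().strip()
            if command == name:
                counts[int(entry)] = count_open_fds(entry, proc_root)
        except OSError:
            # The process exited or we lack permission; skip it.
            continue
    return counts
```

Calling `fds_by_command("lighttpd")` and then the same function with your PHP binary's command name (often `php-cgi`, but that is an assumption about your install) gives per-process counts to compare against the configured limits.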
It turns out that lighttpd runs pretty light (pun intended) in terms of file descriptors. PHP, on the other hand, uses (at least) 7 file descriptors per child! Since we’re running 4 php processes that each have 128 children, I decided to increase the OS file descriptor limit from 1024 per process to 32768 per process. Also, just to be safe, I increased lighttpd’s server.max-fds configuration option to 16384.
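The arithmetic behind that decision can be written out as a short sketch. The figures come from the paragraph above; the function name is invented for illustration. Since 7 descriptors per child is a minimum, the total is a lower bound for all PHP children combined, whereas the OS limit is enforced per process.

```python
import resource


def estimated_php_fds(processes, children_per_process, fds_per_child):
    """Lower-bound estimate of descriptors held by all PHP children combined."""
    return processes * children_per_process * fds_per_child


# 4 php processes x 128 children x 7 descriptors (at least) per child
total = estimated_php_fds(4, 128, 7)
print("estimated PHP descriptors:", total)  # 3584

# Per-process limits for the current process (soft limit, hard limit).
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
print("RLIMIT_NOFILE soft/hard:", soft_limit, hard_limit)
```

The `resource` module is only available on Unix-like systems, and it reports the limits of the Python process itself, so run it as the same user and from the same environment that starts the web server to get a comparable reading.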
Unfortunately, this didn’t solve the problem. We had another incident the day after I made the above changes.
Not to be deterred, I went digging in the syslog (/var/log/messages on Red Hat variants) and found errors regarding lockd and our NFS mounts that corresponded to the times of the incidents. Aha! Now we’re on to something. I added the nolock option to the mount options in fstab and remounted the directories.
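The syslog search described above can be approximated with a small Python filter. This is only a sketch: it matches any line containing "lockd" and makes no assumption about the exact wording of the messages. The function names are hypothetical.

```python
def lockd_lines(lines):
    """Yield log lines that mention lockd (case-insensitive substring match)."""
    for line in lines:
        if "lockd" in line.lower():
            yield line.rstrip("\n")


def scan_log(path="/var/log/messages"):
    """Return all lockd-related lines from a syslog file.

    The default path is the Red Hat location mentioned in the post;
    reading it typically requires root.
    """
    with open(path, errors="replace") as handle:
        return list(lockd_lines(handle))
```

Comparing the timestamps on the lines returned by `scan_log()` with the times of the hangs is the step that points at NFS locking.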
This seems to have solved the problem. It’s been a couple days (with constantly increasing traffic) and we’ve had no more incidents. Bottom line? When using NFS in a HA configuration you should consider mounting the shares with the nolock option.
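To check whether every NFS entry on a web server already carries the option, something like the following sketch could be run against the contents of /etc/fstab. The function name and the sample entries are hypothetical; only the `nfs` filesystem type is examined, which is a simplification.

```python
def nfs_mounts_without_nolock(fstab_text):
    """Return mount points of fstab entries of type nfs that lack the nolock option."""
    missing = []
    for raw in fstab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        if fs_type == "nfs" and "nolock" not in options.split(","):
            missing.append(mount_point)
    return missing


# Hypothetical entries, for illustration only.
sample = """\
nas:/export/www  /var/www  nfs  rw,hard  0 0
nas:/export/img  /var/img  nfs  rw,nolock  0 0
"""
print(nfs_mounts_without_nolock(sample))  # ['/var/www']
```

On a real server, pass in `open("/etc/fstab").read()` instead of the sample text.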
The next step is to set up HA NAS. I’ll post again when we’ve accomplished this task.