Archive for the ‘NFS’ Category
HA Network Attached Storage (NAS)
Happy Halloween!
So, I know I promised this towards the beginning of the month, but it’s proven to be trickier than I initially expected (and I expected it to be tricky!)
I’ve been doing a lot of research and a lot of trial and error and I’m pretty sure I know what it takes to configure this correctly, but I’m waiting on some new servers before I set this up for real.
Since we mount our shares using NFS (see NFS for HA Lighttpd), there’s more involved than just replicating the volumes between two NAS servers and using a load balancer to distribute traffic between them.
Here is a great reference that really helped me get a good understanding of what’s involved in HA NFS: http://www.linux-ha.org/HaNFS (this site also has a lot of good HA info for Linux admins)
First Attempt:
First, I attempted to create an Active – Active, multi-master configuration between our existing NAS servers (two Linksys NSS6000s that I will call NAS1 and NAS2).
Unfortunately, the NSS6000 is not configured to do synchronization between NAS servers and since you have very little access to the OS, it’s not really a good option to try to hack it to enable synchronization.
To get around this, I set up Unison on a third server (SRV1) that NFS mounts the volumes from NAS1 and NAS2 and handles the synchronization. At this point, I realized that an Active – Active setup would not be possible using this configuration (as Unison must be scheduled using cron and can’t detect file/directory changes as they occur).
So I settled for an Active – Passive config: I set up SRV1 to synchronize the volumes between NAS1 and NAS2, set NAS1 to be the primary server, set NAS2 as the secondary server, and used our load balancer to handle failover.
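For anyone curious what the cron side of this might look like, here is a minimal sketch of a wrapper that SRV1 could run on a schedule. It is not the script I actually used: the mount points and lock file path are hypothetical, and the only Unison option it relies on is -batch (run without asking questions). The lock file is there so a slow sync can't overlap with the next cron run.

```python
import fcntl
import subprocess

# Hypothetical paths: where SRV1 NFS-mounts each NAS, and a lock file location.
NAS1_MOUNT = "/mnt/nas1"
NAS2_MOUNT = "/mnt/nas2"
LOCK_PATH = "/tmp/nas-sync.lock"


def build_unison_command(root_a, root_b):
    """Return a non-interactive Unison invocation for two local roots."""
    # -batch makes Unison ask no questions, which is required under cron.
    return ["unison", root_a, root_b, "-batch"]


def run_sync(root_a=NAS1_MOUNT, root_b=NAS2_MOUNT, lock_path=LOCK_PATH):
    """Run one sync pass, or return None if another pass is still running."""
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fails at once if already held.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return None
        # The lock is released when lock_file is closed at the end of the block.
        return subprocess.call(build_unison_command(root_a, root_b))
```

A small script that calls run_sync() could then be scheduled from cron at whatever interval the network can tolerate. Changes made between runs are not replicated until the next pass, which is exactly why this approach can't support Active – Active.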
Unison was handling the file synchronization just fine (though with considerable network traffic) and the multi-master relationship was solid, but we had a couple of instances where NAS1 locked up and failover to NAS2 was unsuccessful.
Dang! Back to the drawing board….
Bottom Line:
If you check out the link concerning HA NFS, I’m sure you’ll find a bunch of reasons why this config was not successful. HA NAS with Linux servers really should be using Heartbeat to handle the complexities that HA NAS involves (especially during failover).
However, rather than beating my head against the wall trying to make an HA config work with servers that weren’t designed for HA, I decided to purchase new NAS servers that were designed for HA NAS.
I’ll be demoing the new NAS servers next week to make sure they can achieve an Active – Passive configuration with failover (as the vendor literature says they can).
Cross your fingers!
I’m hoping this is the answer to this problem.
NFS for HA Lighttpd
Currently, we have lighttpd deployed in a high availability (HA) configuration. In order to provide our users with a consistent experience, regardless of which web server they are connected to, we use NAS and NFS to provide identical content to each web server.
The diagram below should make our setup a little clearer:
Last week, we encountered some problems where requests for our website would hang. Lighttpd had nothing to say about this problem and restarting it did nothing. After about 10-20 minutes, the problem would resolve itself and everything would go back to business as usual.
Personally, this type of problem really frustrates me, so I made it a mission to figure out what was going wrong and fix it (not that I had much of a choice; this is a production setup).
I had some initial hunches to check first:
- some sort of short-term DoS attack
- a traffic spike
- lighttpd and php (running in fcgi mode) running out of available file descriptors
Our monitoring software showed no signs of abnormal traffic behavior during the times when the problem was occurring. So that ruled out the first two possible causes.
The last possible cause is more difficult to track down. Lighttpd’s own documentation says that the error messages regarding a shortage of file descriptors may not be written to the error log and may only show up in test cases. Well doesn’t that just take the cake?! I checked anyway, but found nothing of interest in lighttpd’s error logs.
Next step? Calculate the number of file descriptors currently being used by lighttpd and php under normal load and compare them to the maximums defined by lighttpd’s configuration file and the Operating System.
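One way to do that comparison on Linux is to read /proc directly. The sketch below is not the exact method I used; it assumes a kernel that exposes /proc/<pid>/fd and /proc/<pid>/limits, and it needs to run as root (or as the process owner) to inspect lighttpd or PHP processes.

```python
import os


def count_open_fds(pid):
    """Count the file descriptors a process has open (Linux /proc only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def parse_max_open_files(limits_text):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns None when the limit is reported as 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max open files", soft limit, hard limit, units.
            soft = line.split()[3]
            return None if soft == "unlimited" else int(soft)
    raise ValueError("no 'Max open files' line found")


def fd_usage(pid):
    """Return (descriptors in use, soft limit) for one process."""
    with open(f"/proc/{pid}/limits") as limits_file:
        limit = parse_max_open_files(limits_file.read())
    return count_open_fds(pid), limit
```

Calling fd_usage() with the PID of a lighttpd or PHP process returns how many descriptors it holds right now alongside its soft limit, so sampling it a few times under normal load gives a rough baseline.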
It turns out that lighttpd runs pretty light (pun intended) in terms of file descriptors. PHP, on the other hand, uses (at least) 7 file descriptors per child! Since we’re running 4 php processes that each have 128 children, I decided to increase the OS file descriptor limit from 1024 per process to 32768 per process. Also, just to be safe, I increased lighttpd’s server.max-fds configuration option to 16384.
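To put rough numbers on that, here is the back-of-the-envelope arithmetic as a short sketch. Summing across every PHP child is a simplification on my part: the OS limit is per process and each child has its own descriptor table, so treat the total as a worst-case view of overall demand, not an exact model.

```python
# Figures quoted in the post.
PHP_PROCESSES = 4
CHILDREN_PER_PROCESS = 128
FDS_PER_CHILD = 7  # "at least" 7, so the result is a lower bound

OLD_LIMIT = 1024
NEW_LIMIT = 32768


def estimate_php_fds(processes, children, fds_per_child):
    """Lower-bound estimate of descriptors held by all PHP children combined."""
    return processes * children * fds_per_child


estimated = estimate_php_fds(PHP_PROCESSES, CHILDREN_PER_PROCESS, FDS_PER_CHILD)
print(f"estimated PHP descriptors: {estimated}")  # 4 * 128 * 7 = 3584
print(f"exceeds old limit of {OLD_LIMIT}: {estimated > OLD_LIMIT}")
print(f"fits under new limit of {NEW_LIMIT}: {estimated < NEW_LIMIT}")
```

At 3584 descriptors in aggregate, the old value of 1024 looked uncomfortably small, while 32768 leaves plenty of headroom.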
Unfortunately, raising the limits didn’t solve the problem. We had another incident the day after I made the above changes.
Not to be deterred, I went digging in the syslog (/var/log/messages on Red Hat variants) and found errors regarding lockd and our NFS mounts that corresponded to the times of the incidents. Aha! Now we’re on to something. I added the nolock option to the mount options in fstab and remounted the directories.
This seems to have solved the problem. It’s been a couple of days (with constantly increasing traffic) and we’ve had no more incidents. Bottom line? When using NFS in an HA configuration you should consider mounting the shares with the nolock option.
The next step is to set up HA NAS. I’ll post again when we’ve accomplished this task.
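If you have several web servers to check, a quick script can flag NFS entries in fstab that are still missing the option. This is only a sketch: the sample fstab content, server names, and paths are made up, and it only looks at entries of type nfs.

```python
def nfs_entries_missing_nolock(fstab_text):
    """Return mount points of NFS entries whose options lack 'nolock'."""
    missing = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type == "nfs" and "nolock" not in options.split(","):
            missing.append(mount_point)
    return missing


# Hypothetical fstab content; server names and paths are made up.
SAMPLE_FSTAB = """\
# device            mount point  type  options           dump pass
nas1:/export/www    /var/www     nfs   rw,hard,nolock    0 0
nas1:/export/media  /var/media   nfs   rw,hard           0 0
/dev/sda1           /            ext3  defaults          1 1
"""

print(nfs_entries_missing_nolock(SAMPLE_FSTAB))  # ['/var/media']
```

To check a real server, pass it the contents of /etc/fstab instead of the sample text. Keep in mind that with nolock, file locks only provide exclusion between applications on the same client, so this is a reasonable fit for content that is mostly read.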