Project Titanicarus: Part 5 – Building the Filers or “Welcome to the Pit of Despair”

This part of the project is the one I have the least experience with, and the one I've spent the most time on trying to find a solution that works the way I need.

To put it bluntly, I don't know if a solution exists that can do what I want with the level of simplicity I want. Almost every solution I have found has its own unique set of shortcomings, and almost all of them are performance or complexity related.

I have been through several levels of insanity trying to get a viable solution implemented, including a momentary period of complete lunacy in which I planned to write my own solution.

Let's look at what I am looking for in a backend filesystem:

  • Multi-chassis striping (for performance & redundancy)
  • Self healing in the event of failure without admin intervention
  • Able to scale up by adding more storage servers
  • It must perform well with lots of small files
  • It must be fast enough that web applications don’t lag
  • Replication over WAN to multiple datacentres
  • Capable of continuing to function when partitioned (WAN down)
  • POSIX style locking (not mandatory, but ideal)

To make this setup work, we need a highly robust, distributed filesystem for configs and application data.

I started out thinking I could do this with rsync and NFS, but the number of files I am going to have, especially on the mail stores, is just too great. Using rsync would have left the filers IO- and CPU-bound trying to keep hundreds of thousands of tiny files in sync.

DRBD was ruled out as soon as I decided that WAN/Internet replication was required. They do have a WAN proxy you can use, but it costs money and I'm too tight to pay for it.

I spent three solid days trying to make XtreemFS work well. In the end the FUSE client wound up being pretty terrible, and then I had a bunch of issues getting the Directory and MRC services to replicate properly. I think I will come back to XtreemFS later on; it has a lot of the attributes I am looking for, it's just not quite at the stage I need it to be right now.

Next I tried GlusterFS. The initial tests I did worked, and with further tuning the GlusterFS system wound up performing pretty well. The GlusterFS team are also working on multi-master WAN replication, and I believe this year they will have features that make every node writable globally, which would make it exactly what I am looking for. I even wrote up an amazing blog post about how great Gluster was and was ready to post it last week, until I went and did some performance tests comparing normal disk to Gluster's performance and realised I'd fallen in love with a fat chick of a file system. I'm certain that Gluster would be amazing if I had unlimited resources and bought it a gym membership, but the ultimate outcome is that it's been causing lag and all kinds of weird filesystem inconsistencies. It's just not fit enough for prime time.

So where the hell do I go from here? Naturally, I stay up all night, drink scotch and read. I read up on so many different solutions, but they all came with their own caveats or proprietary licensing that I found unacceptable. I have one last hope – Ceph – but you know what, I don't have the energy left in me to wake up again in the morning only to discover I have to self-administer an arm amputation to escape yet another filesystem I've thrown my all into.

I'm copping out. I've decided to defer any further clustered-filesystem ambitions until I (or someone who's generous with their time and willing to help) have the emotional energy to build a testbed I can run a clustering FS shootout on and work things out once and for all.

So what am I going to do? Well, the solution turns out to be incredibly easy and it uses some pretty cool tech too! The good folk at BitTorrent Labs have developed a product called Sync. Sync is, as the name suggests, a BitTorrent-based file sync tool, very similar to Dropbox or Google Drive, except the data only replicates to locations you choose. Data on the wire is fully encrypted, and you can even share in read-only mode, which is perfect for centrally managed configs – one of the things I'm planning to use it for. Sync also has clients for almost every platform you could hope to install it on – Windows, OSX and Linux all have options.

Setting up the file shares

Because I originally built the app servers to use network filesystems, I've only got small drives allocated to them. This is now changing, so we need to add some more storage: I'm going to give each app server a 100 GB disk for shared data. Make sure to spread the allocated disk space over as many physical disks as possible, both so a single disk failure doesn't take out every app server and so each app server has as much IO available as possible.

This link has a great howto on adding drives to Ubuntu Server; it's really easy to do, so I won't bore you with that info here. I'm mounting the new space as /data.
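For the impatient, this is roughly what I run on each app server once the new disk is attached. It's only a sketch – it assumes the new 100 GB disk shows up as /dev/sdb, so check lsblk or dmesg for the real device name before you format anything:

lsblk                                  # find the new disk (assumed to be /dev/sdb here)
mkfs.ext4 /dev/sdb                     # format it
mkdir /data
mount /dev/sdb /data                   # mount it as /data
echo "/dev/sdb /data ext4 defaults 0 2" >> /etc/fstab   # make the mount survive a reboot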

I want to share web, mail and config directories between all my app servers, so I’m going to create 3 directories that we will share.

mkdir /data/config
mkdir /data/mail
mkdir /data/www

The config directory will be owned by root (same as /etc/), the mail directory will be owned by “virtual” and the www directory will be owned by “www-data”.
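Setting that up is just a few chowns. The only assumption here is that your mail user and group are both called “virtual” – adjust that to whatever your mail stack actually uses:

chown root:root /data/config
chown virtual:virtual /data/mail
chown www-data:www-data /data/www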

Installing Sync

  1. Install the PPA
    sudo add-apt-repository ppa:tuxpoldo/btsync
  2. Update the APT repos
    sudo apt-get update
  3. Install btsync
    sudo apt-get install btsync
  4. The installer will ask you to set up a default instance. You really don't want or need one, so skip it if you can. The only reason to create one would be to use the web GUI, but that won't work when you define shares via config files, so for our purposes don't bother!

Configuring Shares

The package keeps all its configs in /etc/btsync. To create a share, all you've got to do is create a config file named like this:

instance_name.user.group.conf 

The daemon will start an instance with the credentials from the filename and the configuration from inside the file. It's pretty easy, and the config files tell you what to do, so let's look at a sample config.

root@app01:/etc/btsync# cat configs.root.root.conf 
//!/usr/sbin/btsync-daemon --config
//
// (c) 2013 YeaSoft Int'l - Leo Moll
//
// This btsync configuration file shows how to configure a btsync
// instance running under specific user credentials.
// Credentials can be embedded in the filename of the configuration
// file:
//
// <filename>.conf                - no credentials specified. The
//                                  instance will run as root:root
// <filename>.<user>.conf         - instance will run as <user> with
//                                  the primary group of <user>
// <filename>.<user>.<group>.conf - instance will run as <user>:<group>
//
// This example will launch an instance running under the credentials
// of the user "jdoe"
// The internal data of the btsync daemon will be written in
// /home/jdoe/.btsync
// Since the web gui is disabled, the user cannot configure anything.
// The instance offers one replicated directory located in
// /home/jdoe/syncdir
//
{
    "device_name": "app01 config",
    "listening_port" : 12345,
    "storage_path" : "/root/.btsync",
    "check_for_updates" : false,
    "use_upnp" : false,
    "download_limit" : 0,
    "upload_limit" : 0,
    "webui" :
    {
    },
    "shared_folders" :
    [
        {
            "secret" : "XXXX",
            "dir" : "/data/config",
            "use_relay_server" : true,
            "use_dht" : false,
            "search_lan" : true,
            "use_sync_trash" : true
        }
    ]
}

Make sure you generate one secret per share; if you use the same secret for multiple shares you'll wind up with scary things happening. When I say scary things, I mean one great big file share instead of three separate ones.
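The btsync binary can generate secrets for you. As far as I know the packaged btsync-daemon is the same binary under another name, but check its --help output if the flag isn't there. Run it once per share and paste each secret into the matching config:

btsync-daemon --generate-secret    # secret for the config share
btsync-daemon --generate-secret    # secret for the mail share
btsync-daemon --generate-secret    # secret for the www share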

Now that you've got all three of your config files installed, it's time to start the daemon.

service btsync start
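If you want to confirm it actually came up, something like this does the job. The log path is an assumption based on the storage_path in the sample config above – btsync keeps a sync.log inside its storage directory:

ps aux | grep [b]tsync             # expect one btsync process per .conf file in /etc/btsync
tail /root/.btsync/sync.log        # assumes the storage_path from the sample config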

To finish the install, repeat this process on each of your app servers. You can use the exact same config files as on your first app server; all you have to change is the “device_name” field in each config. The secrets must stay the same as on the first server for replication to work properly.
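One way to do that, assuming your second app server is called app02 and you can SSH to it as root (the hostnames are just examples – swap in your own):

scp /etc/btsync/*.conf app02:/etc/btsync/
ssh app02 "sed -i 's/app01/app02/g' /etc/btsync/*.conf && service btsync restart"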

Testing

To test that Sync is working, you could drop a file into /data/www on one server and watch it arrive on the other, but what fun would that be?

  1. Grab the sync client for your desktop machine and install it.
  2. Create three folders on your desktop: “mail”, “www” and “config”.
  3. Open the Sync client on your machine and use the “Add Sync Folder” dialog to drop the shared secret from your www share into the client.
  4. Repeat this for all three of the shares you’ve created on the servers.
  5. Once you hit ok, the sync client will go out into the ether and find the shares on your servers.
  6. Jump onto both your app servers and run “watch ls -lah /data/www /data/mail /data/config”
  7. Now copy a file from your desktop into each of the folders you’ve created. The files should pretty quickly start to sync over to the app servers and you’ll see them arriving in your ssh sessions.
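If you'd rather script the same check instead of using the desktop client, something like this works (run the first line on one app server and the watch on another):

for d in config mail www; do date > /data/$d/sync-test-$(hostname).txt; done
watch ls -lah /data/config /data/mail /data/www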

That's it!

You've now got a replicating filesystem and a pretty easy way to take data offsite for backup if you want to. It's not what I originally set out to build, but it is resilient and much faster than the stuff I've tested so far.

I do plan to return to this part of the project, as I'm sure there are solutions out there that will do what I want, but I didn't want the project held up any longer by the amount of time testing has taken. If anyone feels like helping me set up a clustering filesystem shootout, drop me a line. We will need to find some storage with good IO and five VMs to build the testbed on.

Next week we move on to more exciting things – MySQL clusters!
