I know I promised a Lustre update a while back, but no apologies. I have my own schedule and just now decided to do this, so be happy 😛
First things first: the Lustre vocabulary that one has to get familiar with. It’s one of the things that first put me off, as I was left scratching my head as to what the terms meant. With some practice explaining things to the boss, you’ll get it.
OSS/OST = Object Storage Server/Object Storage Target: In Lustre, the magic comes from separating the metadata (ownership, file size, access times, etc.) from the actual data. The OST is typically a disk array attached to an OSS, and for full failover they come in pairs (i.e. in our case two disk arrays were attached to two servers with full multipathing).
MDS/MDT = Metadata Server/Metadata Target: As with the OSS/OST, we have disk storage that holds the metadata (the MDT) attached to a server that serves it (the MDS).
MGS = Management Server: In most cases, like ours, the MGS can live on the MDS/MDT as well. All it is is a place to hold information on all the servers and targets, etc.
This is our setup: three disk arrays and four servers. All disk arrays are set up with RAID 5 and a hot spare. Two of the disk arrays are cross-connected to two servers (the OSSs) and the last disk array is cross-connected to the last two servers (MDS/MGS). This is all set up for failover: each server can take over for the other in case of a problem or for upgrades. Lustre currently supports multiple OSSs but only a single active MDS; this will change in the 2.x release.
Lustre currently is built on top of ext3 and as such only supports 8TB LUNs. Also, anything larger than 2TB is a problem with the old msdos partition table. To get around this, you have to label each LUN from the disk arrays as ‘gpt’. You can use parted to do so:
# parted /dev/sdX
GNU Parted 1.7.1
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel
New disk label type? gpt
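The same label can be applied without the interactive prompt, which helps when there are several LUNs to get through. This is a minimal sketch, assuming parted's script mode (-s) with its mklabel command. The label_gpt helper and the DRY_RUN variable are made up for illustration, and /dev/sdX and /dev/sdY are still placeholders. Since mklabel wipes the existing partition table, the sketch only prints the command unless DRY_RUN=0 is set:

```shell
#!/bin/sh
# label_gpt is a hypothetical helper, not part of parted.
# By default it only prints the command; set DRY_RUN=0 to really run it.
label_gpt() {
    dev="$1"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        parted -s "$dev" mklabel gpt
    else
        echo "parted -s $dev mklabel gpt"
    fi
}

# Placeholder device names; substitute the LUNs seen on your servers.
for lun in /dev/sdX /dev/sdY; do
    label_gpt "$lun"
done
```

Check the printed commands against the device list before switching the dry run off.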
These steps are to be done on all servers.
First thing is to set up networking, again for failover, by setting up bonding (i.e. two NICs acting as one). This will increase bandwidth while tolerating a possible network failure. Let’s take eth0 and eth1 to be the interfaces to bond. First set up modprobe.conf:
alias bond0 bonding
options bond0 mode=balance-alb miimon=100
Then set up their respective network scripts:
# nano /etc/sysconfig/network-scripts/ifcfg-eth0
# nano /etc/sysconfig/network-scripts/ifcfg-eth1
# nano /etc/sysconfig/network-scripts/ifcfg-bond0
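The contents of those three files are not listed here, so what follows is only a sketch of a typical Red Hat style bonding setup, not a copy of our actual files. The IP address, the netmask and the OUT_DIR scratch directory are placeholders. The script writes the files somewhere harmless so they can be reviewed before anything is copied into /etc/sysconfig/network-scripts:

```shell
#!/bin/sh
# Writes example bonding configs into a scratch directory for review.
# OUT_DIR, the IP address and the netmask are placeholders for illustration.
OUT_DIR="${OUT_DIR:-${TMPDIR:-/tmp}/network-scripts-example}"
mkdir -p "$OUT_DIR"

# Each physical NIC is enslaved to bond0 and carries no address of its own.
for nic in eth0 eth1; do
    cat > "$OUT_DIR/ifcfg-$nic" <<EOF
DEVICE=$nic
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
USERCTL=no
EOF
done

# The bond interface holds the IP address.
cat > "$OUT_DIR/ifcfg-bond0" <<EOF
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.1.10
NETMASK=255.255.255.0
USERCTL=no
EOF

ls "$OUT_DIR"
```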
Then reboot. Next is multipathing. This is handled by a daemon that monitors the fiber connections and will switch over should it discover a disconnect. Any drivers needed for the fiber cards have to be installed first, but in our case they were part of the kernel.
# yum install device-mapper-multipath
# chkconfig multipathd on
Then edit /etc/multipath.conf to include the following:
selector "round-robin 0"
getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
prio_callout "/sbin/mpath_prio_rdac /dev/%n"
Comment out from multipath.conf file:
Reboot. Now we install the Lustre software. From the Lustre website, download and install: kernel, e2fsprogs, lustre, lustre-modules, lustre-ldiskfs. Reboot for the new kernel to take effect, and verify via ‘uname -a’. Next is to set up LNET, the Lustre networking API that does even more magic. Edit modprobe.conf and add “options lnet networks=tcp0(bond0)”
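Since this edit has to be repeated on every server, it can be scripted. This is a rough sketch, assuming the usual LNET module option form, options lnet networks=tcp0(bond0). The add_lnet_option helper and the scratch file path are made up for illustration; point CONF at /etc/modprobe.conf only once the output looks right:

```shell
#!/bin/sh
# add_lnet_option is a hypothetical helper: it appends the LNET line to a
# modprobe config file only if the file has no "options lnet" line yet.
add_lnet_option() {
    conf="$1"
    iface="$2"
    touch "$conf"
    if ! grep -q '^options lnet ' "$conf"; then
        echo "options lnet networks=tcp0($iface)" >> "$conf"
    fi
}

# Try it on a scratch file first; the real file is /etc/modprobe.conf.
CONF="${CONF:-${TMPDIR:-/tmp}/modprobe.conf.example}"
add_lnet_option "$CONF" bond0
grep '^options lnet ' "$CONF"
```

Once the lnet module is loaded, ‘lctl list_nids’ should list the server’s address on the tcp network, which is a quick way to confirm the option was picked up.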
These steps are for the MDS only, and are to be done on a single node only; do not do them on both MDSs, as both can see the same LUN from the disk array. On the LUN to be used as an MDT, we identify it as a physical volume, then add it to a volume group, then create a logical volume on it (i.e. we are using LVM2). Now that’s done, we format it as a Lustre partition. It’s a bit more complicated, as the format command also passes variables to be stored there for Lustre.
# mkfs.lustre --fsname=lustre --mgs --mdt /dev/lustre-mdt-dg/lv1
# mount -t lustre /dev/lustre-mdt-dg/lv1 /mdt
Simply mounting a Lustre filesystem starts all that’s needed for it to work; no daemons etc. are used. You can use dmesg to see Lustre output messages.
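Going back to the LVM step that comes before the format: the physical volume, volume group and logical volume are described above without the commands. As a rough sketch, they would look something like the following. The volume group and logical volume names come from the /dev/lustre-mdt-dg/lv1 path in the format command, but the multipath device name and the choice to give the logical volume all the free space are assumptions, and print_lvm_steps is a made-up helper that only prints the commands:

```shell
#!/bin/sh
# Prints the LVM commands for one LUN; nothing is executed.
# print_lvm_steps and the multipath device name are hypothetical.
print_lvm_steps() {
    lun="$1"   # block device for the LUN, e.g. a multipath device
    vg="$2"    # volume group name
    lv="$3"    # logical volume name
    echo "pvcreate $lun"
    echo "vgcreate $vg $lun"
    echo "lvcreate -l 100%FREE -n $lv $vg"
}

# The MDT volume used by the format command is /dev/lustre-mdt-dg/lv1.
print_lvm_steps /dev/mapper/mpath0 lustre-mdt-dg lv1
```

The same three steps apply to each OST LUN, with the names swapped.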
These steps are for the OSS only, and again are to be done on a single node only. In this case, as we have more than one LUN, we will want to distribute them across the two OSSs. This is done by mounting two on OSS1 and the last two on OSS2, even though, since they are cross-connected, each OSS can see all the LUNs, or OSTs (see, it gets confusing). You can do all the logical volume and formatting work on a single node to keep things easy, but don’t forget to mount them on different nodes.
There is (was?) a naming convention used by the vendor who came and did this, but it’s all up to you. Looking back at the MDT setup, and the following, this is our naming convention for the logical volumes: /dev/lustre-ostX-dgX/lv1. So we have two OSTs, each with two LUNs, so we would get something like /dev/lustre-ost2-dg2/lv1. Anyway, we ended up with four of them and used the following to format them for Lustre:
# mkfs.lustre --fsname=lustre --ost --mgsnode=X.X.X.X@tcp0 /dev/lustre-ost1-dg1/lv1
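With four volumes to format, a small loop saves some typing. This sketch only prints the commands so they can be checked first. The four volume names are my guess at how the lustre-ostX-dgX pattern expands (ost1 and ost2, each with dg1 and dg2), print_ost_format_cmds is a made-up helper, and the MGS address stays a placeholder:

```shell
#!/bin/sh
# Prints one mkfs.lustre command per OST volume; nothing is formatted.
# MGS_NID stays a placeholder, and the four volume names are a guess at
# how the lustre-ostX-dgX pattern expands.
MGS_NID="${MGS_NID:-X.X.X.X@tcp0}"

print_ost_format_cmds() {
    for ost in 1 2; do
        for dg in 1 2; do
            echo "mkfs.lustre --fsname=lustre --ost --mgsnode=${MGS_NID} /dev/lustre-ost${ost}-dg${dg}/lv1"
        done
    done
}

print_ost_format_cmds
```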
So now that’s done, edit your fstab on each of the OSSs to mount two on the first one and the last two on the second. When failover is set up later, the running OSS will mount the OSTs that were on the failed server to keep that data available. This is where the LNET magic comes in: it keeps the client waiting, whereas with an NFS server problems show up immediately on the client, usually as stale NFS handles and the like.
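The fstab entries themselves could look something like what this sketch prints. Everything specific in it is an assumption: which volumes land on which OSS, the /mnt/... mount points, and the _netdev option (used so the mount waits for the network). fstab_lines_for_oss is a made-up helper:

```shell
#!/bin/sh
# Prints the fstab lines for one OSS. Which volumes go to which OSS, the
# /mnt/... mount points and the _netdev option are assumptions.
fstab_lines_for_oss() {
    oss="$1"   # 1 or 2
    for dg in 1 2; do
        echo "/dev/lustre-ost${oss}-dg${dg}/lv1 /mnt/ost${oss}-dg${dg} lustre defaults,_netdev 0 0"
    done
}

echo "# add to /etc/fstab on OSS1"
fstab_lines_for_oss 1
echo "# add to /etc/fstab on OSS2"
fstab_lines_for_oss 2
```

The mount points have to exist on each OSS before the first mount.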
You can set up a client without patching the kernel, but I found it was easier to use the patched kernel, as the other way requires a particular version of the kernel. So install the Lustre kernel and the lustre and lustre-modules packages, and reboot to use the new kernel. Then to mount the Lustre filesystem, use ‘mount -t lustre X.X.X.X@tcp0:/lustre /lustre’ where X.X.X.X is the IP of the MDS/MGS server.
That’s the basics. Failover hasn’t even been set up, because that’s harder and I’ll detail that later.