Monitoring Lustre via Ganglia and Collectl

This is just a rudimentary outline of how I set up Ganglia to graph the Lustre statistics gathered by collectl.

All the information used here comes from two sources: Roy Dragseth's page on the Rocks Clusters wiki and the Lustre tutorial page on collectl's website. First and foremost, have ganglia, ganglia-gmond, ganglia-gmond-python, and collectl installed.
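On CentOS/Red Hat with the right repositories enabled (an assumption; adjust for your distribution), that boils down to:

# yum install ganglia ganglia-gmond ganglia-gmond-python collectl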

Edit /etc/collectl.conf and change the DaemonCommands line to the following: -f /var/log/collectl -r00:00,7 -m -F60 -sl -P --export lexpr,f=/tmp/L. This gathers the Lustre information and saves it to /tmp/L for Ganglia to read later. The important bit here is that whether this is a Lustre server or a client determines the variable prefix collectl exports: lusmds for the MDS, lusost for the OSTs, and lusclt for the clients.
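For reference, the resulting line in /etc/collectl.conf would look something like this:

DaemonCommands = -f /var/log/collectl -r00:00,7 -m -F60 -sl -P --export lexpr,f=/tmp/L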

For an OST, collectl collects reads, writes, readkbs, and writekbs. For an MDS, it's gattrP, sattrP, sync, and unlink. Copy collectl.py and collectl.pyconf from the wiki page to their respective Ganglia locations, then edit and replace the relevant entries. The posted files are set up to monitor Lustre stats on a client, so on an OSS you would replace, for example, lusclt.reads with lusost.reads. Once those files match that server's role, start up the collectl daemon and restart gmond. Wait a few minutes and the new stats should be graphed on the server's Ganglia page.
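I won't reproduce Roy's files here, but the edit boils down to renaming metrics; a metric entry in a gmond pyconf looks roughly like this (the title string is my own, only the name matters):

metric {
    name  = "lusost.reads"
    title = "Lustre OST read operations"
}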

Lustre: LVM Metadata Snafu

It was supposed to be a simple memory upgrade for our Lustre nodes, but of course they had something else in mind. I have my suspicions as to how it happened, but the issue was that one of the OSTs wasn't mounting. Checking around, I found that LVM was showing all but one of the OSTs; its LVM metadata wasn't registering for some reason. Well, I've seen this before: just use the backup metadata to relabel the device. Not so fast…

The pvcreate command wasn't working because it thought there was an existing partition table, and reading up on LVM shows that when working with raw disk devices there cannot be a partition table. The pvcreate manpage does provide the answer: use dd to write zeros to the first sector, clearing the table. That worked, and I was able to relabel the device, but again, not so fast…
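As a sketch (the multipath device name here is hypothetical), zeroing the first sector looks like this:

# dd if=/dev/zero of=/dev/mapper/mpathX bs=512 count=1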

It wasn't just that one device was unlabeled; it turned out another device had its label swapped. So when I thought I had it fixed and tried mounting the Lustre OSTs, one of them was not reconnecting and the clients seemed oblivious that it was there. Checking /proc/fs/lustre/devices to see what IDs the mounted OSTs had told me what happened. I think that since the swapped OSTs were being mounted on their failover nodes, the clients were unsure how to reconnect, hence the issue. Once I swapped the LVM metadata for those two devices, the clients were able to reconnect and everything came back online. This made for a very long night.

Lustre: Recovering LVM metadata

Over the weekend, of course while I'm out of town on vacation, the Lustre server decided to take a crap. I checked my email that morning to find notices of Lustre clients being unable to connect. I checked the server to find it had reset itself (unsure exactly why, but there was a power failure at the time; the servers are fully redundant, so I'm not sure why one PSU failure would cause the system to reset). It was the reset that seemed to cause the problem: Lustre wasn't mounting correctly, and one of the OSTs was missing.

Part of the Lustre installation is to set up the OSTs as LVM targets, my guess being to make it easier to pass a target from system to system, since a simple scan will show the device. So why was one of the targets not showing up in the scan? Multipath was working and the multipath device was there, but pvscan was not listing it as a physical volume. Luckily CentOS (well, Red Hat) has great documentation, and I found this document to be of great help: http://www.centos.org/docs/5/html/Cluster_Logical_Volume_Manager/mdatarecover.html

Unlike in that document, the lvs command was not reporting any errors; it simply wasn't showing the missing target. A nice feature of LVM is that it keeps a backup of the metadata used when creating the LVM targets, which can be used to restore that information to the drive. I used vgcfgrestore to try to restore the data and got an error message saying a particular UUID was not found. Great, with that I could continue.

Using that UUID and the backed-up LVM metadata, I used pvcreate to recreate the physical volume, restoring the metadata to the drive. Now vgcfgrestore was able to find the device and restore it. Then I used lvchange to bring it back online and was able to mount the device and get Lustre working again.
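As a rough sketch of that procedure (the device path, volume group name, and archive file name are hypothetical; the UUID is the one from the vgcfgrestore error):

# pvcreate --uuid "<UUID from the error>" --restorefile /etc/lvm/archive/lustre-ost1-dg1_00001.vg /dev/mapper/mpathX
# vgcfgrestore lustre-ost1-dg1
# lvchange -ay /dev/lustre-ost1-dg1/lv1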

Setting up Lustre

I know I promised a Lustre update a while back, but no apologies. I have my own schedule and just now decided to do this, so be happy 😛

Basics

First things first is the Lustre vocabulary one has to get familiar with. It's one of the things that first put me off, as I was left scratching my head as to what the terms mean. With practice explaining things to the boss, you'll get it.

OSS/OST = Object Storage Server/Object Storage Target: In Lustre, the magic comes from separating the metadata (ownership, file size, access times, etc.) from the actual data. The OST is typically a disk array attached to an OSS, and for full failover they come in pairs (i.e., in our case two disk arrays were attached to two servers with full multipathing).

MDS/MDT = Metadata Server/Metadata Target: As with the OSS/OST, we have disk storage holding the metadata, attached to a server that serves that metadata.

MGS = Management Server: In most cases, like ours, the MGS can live on the MDS/MDT as well. All it is is a place to hold information on all the servers, targets, etc.

Hardware

This is our setup: three disk arrays and four servers. All disk arrays are set up with RAID 5 and a hot spare. Two of the disk arrays are cross-connected to two servers (the OSSs) and the last disk array is cross-connected to the last two servers (MDS/MGS). This is all set up for failover: each server can take over for the other in case of a problem or for upgrades. Lustre currently supports multiple OSSs but only a single MDS; this will change in the 2.x release.

Lustre is currently built on top of ext3 and as such only supports LUNs up to 8 TB. Also, ext3 has issues with partitions greater than 2 TB. To get around this, you have to label each LUN from the disk arrays as 'gpt'. You can use parted to do so:

# parted /dev/sdX
GNU Parted 1.7.1
Using /dev/sdX
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel
New disk label type? gpt

Software

These steps are to be done on all servers.

First up is networking, again for failover, by setting up bonding (i.e., two NICs acting as one). This increases bandwidth while allowing for a possible network failure. Let's take eth0 and eth1 to be the interfaces to bond. First set up modprobe.conf:

alias bond0 bonding 
options bond0 mode=balance-alb miimon=100 

Then setup their respective network scripts:

# nano /etc/sysconfig/network-scripts/ifcfg-eth0

DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
USERCTL=no

# nano /etc/sysconfig/network-scripts/ifcfg-eth1

DEVICE=eth1
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
USERCTL=no

# nano /etc/sysconfig/network-scripts/ifcfg-bond0

DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
NETMASK=X.X.X.X
IPADDR=X.X.X.X
GATEWAY=X.X.X.X
USERCTL=no

Then reboot. Next is multipathing. This is handled by a daemon that monitors the fibre connections and will switch over should it discover a disconnect. It should be stated that any drivers needed for the fibre cards must be installed, but in our case they were part of the kernel.

# yum install device-mapper-multipath
# chkconfig multipathd on

Then edit /etc/multipath.conf so the defaults section contains the following:

defaults {
        udev_dir                /dev
        polling_interval        10
        selector                "round-robin 0"
        path_grouping_policy    multibus
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            "/sbin/mpath_prio_rdac /dev/%n"
        path_checker            rdac
        rr_min_io               100
        max_fds                 8192
        rr_weight               priorities
        failback                immediate
        no_path_retry           fail
        user_friendly_names     yes
} 

Also comment out the default blacklist section in multipath.conf, which otherwise blacklists every device:

#blacklist {
#        devnode "*"
#}

Reboot. Now we install the Lustre software. From the Lustre website, download and install: kernel, e2fsprogs, lustre, lustre-modules, and lustre-ldiskfs. Reboot for the new kernel to take effect, and verify via 'uname -a'. Next is to set up LNET, the Lustre networking API that does even more magic. Edit modprobe.conf and add "options lnet networks=tcp(bond0)".

MDS/MDT Setup

These steps are for the MDS only, and should only be done on a single node; do not do them on both MDSs, as they both can see the same LUN from the disk array. On the LUN to be used as the MDT, we identify it as a physical volume, add it to a volume group, then create a logical volume on it (i.e., we are using LVM2); a sketch of those commands follows. Formatting it as a Lustre partition is a bit more involved, as the format command also stores Lustre configuration variables on the target.
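A minimal sketch of the LVM steps, assuming the multipath device is /dev/mapper/mpath0 (hypothetical) and using the volume group and logical volume names that appear below:

# pvcreate /dev/mapper/mpath0
# vgcreate lustre-mdt-dg /dev/mapper/mpath0
# lvcreate -l 100%FREE -n lv1 lustre-mdt-dg

With that done, format the logical volume for Lustre and mount it: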

# mkfs.lustre --fsname=lustre --mgs --mdt /dev/lustre-mdt-dg/lv1
# mount -t lustre /dev/lustre-mdt-dg/lv1 /mdt

Simply mounting a Lustre filesystem starts everything that's needed for it to work; no extra daemons are involved. You can use dmesg to see Lustre's output messages.

OSS/OST Setup

These steps are for the OSSs only, and again should only be done on a single node. In this case, as we have more than one LUN, we want to distribute them across the two OSSs. This is done by mounting two on OSS1 and the other two on OSS2, even though, since they are cross-connected, each OSS can see all the LUNs, or OSTs (see, it gets confusing). You can do all the logical-volume and formatting work on a single node to keep things easy, but don't forget to mount them on different nodes.

There is (was?) a naming convention used by the vendor who set this up, but it's all up to you. Looking back at the MDT setup, this is our naming convention for the logical volumes: /dev/lustre-ostX-dgX/lv1. We have two OST disk arrays, each presenting two LUNs, so we get names like /dev/lustre-ost2-dg2/lv1. We ended up with four of them and used the following to format them for Lustre:

# mkfs.lustre --fsname=lustre --ost --mgsnode=X.X.X.X@tcp0 /dev/lustre-ost1-dg1/lv1

Now that that's done, edit the fstab on each of the OSSs to mount two OSTs on the first and the other two on the second. When failover is set up later, the surviving OSS will mount the OSTs that were on the failed server to keep that data available. This is where the LNET magic comes in: it keeps the client waiting, whereas with an NFS server, problems are immediately visible on the client, usually as stale NFS handles and the like.
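As a sketch of what the fstab entries on one OSS might look like (the mount points are my own hypothetical choices, and which two logical volumes go on which OSS is up to you):

/dev/lustre-ost1-dg1/lv1    /mnt/ost1    lustre    defaults,_netdev    0 0
/dev/lustre-ost1-dg2/lv1    /mnt/ost2    lustre    defaults,_netdev    0 0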

Client Setup

You can set up a client without patching the kernel, but I found it easier to use the patched kernel, as the patchless route requires a particular kernel version. So install the Lustre kernel plus the lustre and lustre-modules packages, and reboot to use the new kernel. Then mount the Lustre filesystem with 'mount -t lustre X.X.X.X@tcp0:/lustre /lustre', where X.X.X.X is the IP of the MDS/MGS server.
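Spelled out as commands (the /lustre mount point is just what we used; substitute your own):

# mkdir -p /lustre
# mount -t lustre X.X.X.X@tcp0:/lustre /lustre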

That's the basics. Failover hasn't even been set up yet, because that's harder, and I'll detail it later.