Over the weekend, of course while I’m out of town on vacation, the Lustre server decided to take a crap. I checked my email that morning and found notices of Lustre clients unable to connect. I checked the server and found it had reset itself. I’m unsure exactly why, but there was a power failure at the time; the servers are fully redundant, so I’m not sure why one PSU failure would cause the system to reset. The reset is what seemed to cause the problem: Lustre wasn’t mounting correctly because one of the OSTs was missing.
Part of the Lustre installation is to set up the OSTs as LVM targets; my guess is that this makes it easier to pass a target from system to system, since a simple scan will show the device. So why was one of the targets not showing up in the scan? Multipath was working and the multipath device was there, but pvscan was not listing it as a physical volume. Luckily CentOS (well, Red Hat) has great documentation, and I found this document to be of great help: http://www.centos.org/docs/5/html/Cluster_Logical_Volume_Manager/mdatarecover.html
Unlike the case described in that document, the lvs command was not reporting any errors; it simply wasn’t showing the missing target. A nice feature of LVM is that it keeps a backup of the metadata used to create the LVM targets, and that backup can be used to restore the information to the drive. I used vgcfgrestore to try to restore the data, and this time got an error message saying a particular UUID was not found. Great, with that I can continue.
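To cross-check the UUID from the error against the backup, it helps to list the physical volumes recorded in the metadata file. The following is a minimal sketch, not what I ran at the time: `list_pv_ids` is a hypothetical helper, and it assumes the usual LVM text backup layout (by default under /etc/lvm/backup, with older copies in /etc/lvm/archive), where each PV entry in the `physical_volumes` section lists `id` before `device`.

```shell
# list_pv_ids FILE
# Hypothetical helper: print "pvN UUID device-hint" for every physical volume
# recorded in an LVM text metadata backup. Assumes "id" precedes "device"
# inside each pvN { ... } entry.
list_pv_ids() {
    awk '
        # Start of the physical_volumes section
        /physical_volumes[ \t]*[{]/ { in_pvs = 1; depth = 1; next }
        in_pvs {
            if ($1 ~ /^pv[0-9]+$/ && $2 == "{") { pv = $1 }
            if ($1 == "id" && pv != "")     { gsub(/"/, "", $3); id = $3 }
            if ($1 == "device" && pv != "") { gsub(/"/, "", $3); print pv, id, $3; pv = "" }
            # Track brace depth so we stop at the end of the section
            depth += gsub(/[{]/, "{") - gsub(/[}]/, "}")
            if (depth <= 0) in_pvs = 0
        }
    ' "$1"
}
```

Calling it as `list_pv_ids /etc/lvm/backup/<vgname>` (with your own volume group name) prints one line per PV. The device field is only a hint recorded by LVM, so the UUID is what should be matched against the vgcfgrestore error.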
Using that UUID and the backed-up LVM metadata, I ran pvcreate to recreate the physical volume, which writes the saved metadata back to the drive. After that, vgcfgrestore was able to find the device and restore it. I then used lvchange to bring it back online, and was able to mount the device and get Lustre working again.
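For reference, the sequence above can be sketched roughly as follows. This is a reconstruction under assumptions, not a transcript of my session: `run` and `recover_pv` are hypothetical helpers, the arguments are placeholders, and by default the script only prints the commands (set DRY_RUN=0 to actually execute them, as root, after double-checking the device).

```shell
# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# recover_pv VG UUID RESTOREFILE DEVICE LV
# Hypothetical wrapper around the three LVM steps described above.
recover_pv() {
    vg=$1; uuid=$2; restorefile=$3; device=$4; lv=$5
    # Recreate the PV label with its original UUID from the saved metadata
    run pvcreate --uuid "$uuid" --restorefile "$restorefile" "$device"
    # Restore the volume group metadata now that the PV can be found
    run vgcfgrestore "$vg"
    # Activate the logical volume so it can be mounted again
    run lvchange -ay "/dev/$vg/$lv"
}
```

A dry run would look like `recover_pv <vgname> <uuid> /etc/lvm/backup/<vgname> /dev/mapper/<mpath-device> <lvname>`. Printing first is deliberate: pvcreate against the wrong device would overwrite its label, so it is worth reading the commands before running them for real.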