Systems

Resizing filesystems

Both LVM and ext3 support resizing a filesystem after it has been created. A filesystem can be expanded online, without unmounting it, by first growing the logical volume and then the filesystem:

lvresize -L+10G /dev/sys/scratch
resize2fs /dev/sys/scratch

When shrinking a filesystem, it’s safest to copy its data elsewhere, then unmount the filesystem, shrink the volume, and recreate the filesystem:

rsync -av /scratch /data/.
umount /scratch
lvresize -L-10G /dev/sys/scratch
mkfs.ext3 /dev/sys/scratch
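
Because mkfs.ext3 erases the volume, it may be worth confirming that the copy is complete before running umount. The following is a minimal sketch, not part of the original procedure; verify_copy is a hypothetical helper name, and it assumes the rsync command above placed the copy in /data/scratch.

```shell
# Hypothetical helper: succeed only if the two directory trees have the same
# file names and contents. diff -r recurses into subdirectories; -q reports
# only whether files differ. Ownership and permissions are NOT compared.
verify_copy() {
    diff -rq "$1" "$2" > /dev/null
}

# Example use (assumes the copy landed in /data/scratch):
#   verify_copy /scratch /data/scratch && umount /scratch
```

If the check fails, rerunning the rsync step and checking again is safer than proceeding.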

Hardware Debugging

Stress test

New machines must pass Google’s stressapptest before joining the production cluster. The test is run from the uwhep-stress init script, which is configured during the kickstart/bootstrap process; when the test completes, the init script is removed. The test can be run again by re-enabling uwhep-stress or by running it manually.

IPMI

All newer machines come equipped with an onboard IPMI card that eavesdrops on the first network interface. Various administration tasks, including power cycling, serial-over-LAN, and system event log inspection, can be performed remotely using ipmitool. The IPMI cards listen on a private address; to compute this address, replace the first octet of the system’s primary IP address with ‘10’. The ipmi script found in /usr/local/ipmi automates this calculation, allowing easy access to machines by hostname. For example, the following command connects to the serial-over-LAN interface on g19n01’s IPMI card:

/usr/local/ipmi/ipmi g19n01 sol activate

And this power cycles the same machine:

/usr/local/ipmi/ipmi g19n01 power cycle
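
The address calculation described above can be sketched in a few lines of shell. This is only an illustration of the first-octet rule, not the contents of the actual ipmi script; ipmi_addr is a hypothetical function name, and 192.0.2.17 is a documentation-range placeholder address.

```shell
# Hypothetical helper: derive the IPMI address from a host's primary IPv4
# address by replacing the first octet with 10.
ipmi_addr() {
    # ${1#*.} strips everything up to and including the first dot.
    echo "10.${1#*.}"
}

ipmi_addr 192.0.2.17   # prints 10.0.2.17
```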

The client must have an interface with a 10.X.X.X address in order to reach the remote IPMI card. One such server is nod, though any IPMI-capable server has a special eth0:0 alias that can also be used:

sudo ifup eth0:0

Since the IPMI card ignores IPMI packets when the host system has an active matching IP address, eth0:0 should be disabled when not in use.

Crash dumps

When a system panics, an image of the system’s memory will be captured and written to disk before the system is automatically rebooted. After the system comes back online, the image can be examined using the crash utility:

sudo debuginfo-install --enablerepo=sl-debuginfo kernel-$(uname -r)
sudo crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /scratch/vmcores/*/vmcore
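
If more than one dump has accumulated under /scratch/vmcores, the wildcard above expands to several files. One way to select only the newest dump is sketched below; latest_vmcore is a hypothetical helper name, and the sketch assumes dump directory names contain no newlines.

```shell
# Hypothetical helper: print the most recently modified vmcore found one
# directory level below the given dump directory (default: /scratch/vmcores).
latest_vmcore() {
    # ls -t sorts by modification time, newest first; head keeps the first.
    ls -t "${1:-/scratch/vmcores}"/*/vmcore 2>/dev/null | head -n 1
}

# Example use:
#   sudo crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux "$(latest_vmcore)"
```

The function prints nothing when no dump is present, so the result should be checked for emptiness before it is passed to crash.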