HP WakeOnLAN bug versus Open Source – An “It shouldn’t be this hard!”-Odyssey.

This is the story of how I used Open Source to work around a bug in an HP computer’s BIOS.

Background: I administer a small remote office with about 10 computers which need to be backed up regularly. To do this without disrupting the users’ work, I have the backup server wake up all the machines during the night using WakeOnLAN (WoL) and, once it is done backing up all the data, shut them down via Windows RPC. Since WoL support has become fairly ubiquitous in recent years, this works very well for most of the machines. One of them, a brand new Hewlett-Packard Z210 CMT workstation that we recently acquired, just wouldn’t have it though.
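For readers who have not looked at WoL closely: the "magic packet" is just six 0xFF bytes followed by the target's MAC address repeated 16 times, usually sent as a UDP broadcast. The following is a minimal sketch of building one in plain sh, not the script my backup server actually uses. The function name `magic_packet` and the MAC address in the usage line are made up for illustration.

```shell
#!/bin/sh
# Write a 102-byte Wake-on-LAN magic packet for the given MAC to stdout.
# Accepts MACs written as aa:bb:cc:dd:ee:ff, aa-bb-cc-dd-ee-ff or aabbccddeeff.
# Note: plain sh has no local variables, so mac_hex, rest, byte and i are global.
magic_packet() {
    mac_hex=$(printf '%s' "$1" | tr -d ':-')

    # Reject anything that is not exactly 12 hex digits.
    case $mac_hex in
        ''|*[!0-9A-Fa-f]*) echo "invalid MAC: $1" >&2; return 1 ;;
    esac
    [ "${#mac_hex}" -eq 12 ] || { echo "invalid MAC: $1" >&2; return 1; }

    # Synchronisation stream: six bytes of 0xFF (octal 377).
    printf '\377\377\377\377\377\377'

    # Then the six MAC bytes, repeated 16 times.
    i=0
    while [ "$i" -lt 16 ]; do
        rest=$mac_hex
        while [ -n "$rest" ]; do
            byte=${rest%"${rest#??}"}   # first two hex digits
            rest=${rest#??}             # remainder
            # Convert the hex pair to an octal escape and emit the raw byte.
            printf "\\$(printf '%03o' "0x$byte")"
        done
        i=$((i + 1))
    done
}
```

Assuming socat is installed, something like `magic_packet 00:11:22:33:44:55 | socat - UDP4-DATAGRAM:255.255.255.255:9,broadcast` should put the packet on the local network. Port 9 is the conventional choice, but the network card only looks at the payload.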

The problem: In fact I already noticed the problem when I first installed the box, as my standard WoL test failed with the PC not finding its boot drive/Operating System. When that happened, I just hit the reset button and Windows came right up, so I stupidly dismissed the failed boot attempt as a fluke and forgot about it. Over the course of the next weeks however, it became apparent from the backup reports that this box was never getting backed up, and every attempt I made to wake it up remotely failed. When I asked the local staff about the machine, they mentioned that they often found it powered up in the morning with the message ‘No disk or operating system’.

Thinking back to my failed boot attempt during the box’s installation, I figured that it had trouble starting up when cold due to the disks not spinning up quickly enough or a similar issue. I have previously encountered machines suffering from this, so it was the first logical conclusion to jump to. I walked somebody in the office through how to increase the BIOS’ wait time when starting, change the POST test type from ‘quick’ to ‘full’ and adjust several other options, to no avail. The machine still would not come up for its nightly backups. As I was preparing to send the machine off for repair, one of the local staff mentioned that the machine would start normally for them, even when cold. A quick test confirmed this and therefore trounced my initial conclusion. Something else was wrong with this machine.

The cause: With the PC starting up normally when using the power button but never coming up for the backups, we started looking into the possibility that something was going wrong only when it was started via WakeOnLAN. Since I was still remote, I called up one of the people on site and had him shut down the box and watch the screen as I issued a WoL remotely and sure enough, the boot failed. The local guy mentioned that the machine was trying to do something with the network and a quick check in the DHCP logs indeed showed it grabbing an IP, so apparently it was trying to do a PXE boot instead of booting locally. We went through the BIOS options and found one that specifies the boot source when woken up via WoL. You can set it to ‘network’ or ‘local disk’ but unfortunately, it was already set to ‘local disk’. Why then was it trying to boot off the LAN? Elsewhere in the BIOS, we found an option to disable PXE/LAN booting altogether, so we did and ran another test. Still no luck. The machine was now going straight to the ‘No disk …’ message. Something was very wrong with the way the BIOS behaved here and since I had already upgraded it to the latest available release, it looked like I would never get this machine to wake up properly.

First workaround attempt: With the PC not coming up regardless of its BIOS settings, I figured I might try to work around the problem by simply letting it boot off the only source that it was trying to use, the network, and then having the bootloader it got there redirect it to the local disk (chainload the MBR). I had recently set up all the PXE server stuff needed for network booting at home, so it didn’t take me long to implement that on the office’s local server. The bootloader I used, PXELINUX, which is part of the SYSLINUX project, offers two ways to do this. One is the ‘LOCALBOOT’ command, which according to some Internet comments doesn’t always work, and the other is the ‘chain.c32’ module. I opted for the latter, created this configuration file and woke up the machine…

DEFAULT menu.c32
PROMPT 0
TIMEOUT 5
LABEL bootlocal
MENU LABEL Boot from first hard drive
COM32 chain.c32
APPEND hd0 0

It didn’t come up. I called up the office again to have somebody tell me what was on the screen and it was basically the same problem again. So I figured the ‘chain.c32’ method wasn’t working and gave the localboot one a try:

DEFAULT menu.c32
PROMPT 0
TIMEOUT 5
LABEL bootlocal
MENU LABEL Boot from first hard drive
LOCALBOOT 0

Still no luck, same error. It appears that the BIOS flat out does not recognize the disks when woken up via WoL. It does show them during the detection phase, but during boot nothing is able to reach them. This is also why the normal boots, which should continue to disk once PXE has failed, do not work.

The actual workaround: Since there was no way to get the system to boot from disk when woken up, I figured that I could possibly get it to work if I could only get it to reboot once and thus reset its BIOS to the normal, working state. So I downloaded the Debian netinstaller network boot parts and started hacking. First I unpacked its initrd.gz file (gunzip it, then extract with cpio) and then started looking around to find its startup script. The /etc/inittab file has this section in it:

# main rc script
::sysinit:/sbin/reopen-console /sbin/debian-installer-startup

# main setup program
::respawn:/sbin/reopen-console /sbin/debian-installer

Sure enough /sbin/reopen-console was the script that I was looking for. I left most of it intact but modified the first few lines to look like this:

#!/bin/sh

/sbin/reboot

With that done, I repackaged all of this back into an initrd.gz file and replaced the original one with my hacked version. To avoid having to disturb the local users again, I decided to test this image first, so I quickly set up a VirtualBox VM on their server, configured it for PXE boot and fired it up for a test. Since the VM behaved exactly as I had planned, I reconfigured the PXE config files for the failing machine to use the same image and then woke it up via WoL. By this time, nobody was in the office anyway, so I figured that the machine would either start or have to sit there powered on all night until people came in the next morning. Fortunately for me, the latter did not happen. About a minute after the WoL packet was sent, I had the machine pinging and could log in to shut it down properly.
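For anyone wanting to repeat the unpack and repack steps, here is a rough sketch of the usual gzip/cpio round trip. It assumes the initrd is a gzip-compressed cpio archive in the "newc" format, which is what Linux initramfs images normally are. The function names and the paths in the usage note are placeholders of mine, not the exact commands I typed that night.

```shell
#!/bin/sh
# Unpack a gzip-compressed cpio initrd ($1) into a directory ($2).
# Run as root if the archive contains device nodes, otherwise cpio
# cannot recreate them.
unpack_initrd() {
    initrd=$1 dest=$2
    mkdir -p "$dest" || return 1
    # -i extracts from stdin, -d creates leading directories as needed.
    gunzip -c "$initrd" | (cd "$dest" && cpio -id --quiet)
}

# Pack a directory tree ($1) back into a gzip-compressed initrd ($2).
# Keep the output file outside the tree, or find will pick it up.
repack_initrd() {
    src=$1 out=$2
    # -o creates an archive from the file names on stdin;
    # -H newc selects the format the kernel expects for an initramfs.
    (cd "$src" && find . | cpio -o -H newc --quiet | gzip -9) > "$out"
}
```

Usage would look something like `unpack_initrd initrd.gz ./tree`, then editing `./tree/sbin/reopen-console`, then `repack_initrd ./tree initrd-reboot.gz` and pointing the PXE config at the new file.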

Conclusion: Open Source software: 1 – broken HP BIOS: 0

Open Source software rocks! There is no way I could ever have gotten this to work using any other Operating System or platform. The fact that I could just download a PXE bootloader and a PXE-bootable Debian distribution, which I could then modify to reboot as soon as it started, is what saved me from having to wait for HP to fix this bug in their BIOS, which I’m still hoping they will eventually do. (Yes, I have e-mailed them about it, no response so far.)

With the PXE server set up anyway, I’m now planning on putting a properly bootable Linux image on it for the times when I am on site and need to debug a non-booting Windows again. That will save me from hunting down the local Knoppix disc or downloading and burning a new one every time.

