About two months ago, I bought a new server (a SuperMicro barebone SYS-5016I-MT with a Xeon 3400 and 16 GB of RAM) and decided to run Hyper-V on it as my virtual server environment.
The installation was smooth. Setting up Windows Server 2008 R2 as the Hyper-V host plus four client OSes took me only one working day to install and configure. I was very happy and satisfied.
But soon I felt frustrated and angry, because the whole environment kept crashing.
It seemed as if the Windows 7 client OS always dragged down the whole system, including the parent host.
There was no event log entry, not even a blue screen. How could that be? Maybe the system didn’t even get the chance to write to the system event log?
Finally, I ended up configuring system failure dump files on the client OSes and on the parent Hyper-V server, hoping the system would write a crash dump when the really bad thing happened.
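On Windows, these crash dump settings are set in the Startup and Recovery dialog (System Properties > Advanced), and Windows keeps them in the registry under HKLM\SYSTEM\CurrentControlSet\Control\CrashControl. If you want to double-check them from a script, here is a rough, read-only Python sketch, assuming you run it on the Windows host itself. The helper names are my own; the CrashDumpEnabled codes 0 to 3 are the ones used on Windows Server 2008 R2-era systems.

```python
import sys

# Registry key where Windows keeps the "Startup and Recovery" dump settings.
CRASH_CONTROL_KEY = r"SYSTEM\CurrentControlSet\Control\CrashControl"

# Meaning of the CrashDumpEnabled DWORD value.
DUMP_TYPES = {
    0: "None",
    1: "Complete memory dump",
    2: "Kernel memory dump",
    3: "Small memory dump (minidump)",
}


def describe_dump_type(code):
    """Translate a CrashDumpEnabled value into a readable label."""
    if code is None:
        return "Not set"
    return DUMP_TYPES.get(code, f"Unknown ({code})")


def read_crash_control():
    """Read (never write) the dump settings from the registry. Windows only."""
    import winreg  # standard library, but only available on Windows

    settings = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CRASH_CONTROL_KEY) as key:
        for name in ("CrashDumpEnabled", "MinidumpDir", "AutoReboot"):
            try:
                value, _value_type = winreg.QueryValueEx(key, name)
            except FileNotFoundError:
                value = None  # value is not present on this machine
            settings[name] = value
    return settings


if __name__ == "__main__" and sys.platform == "win32":
    cfg = read_crash_control()
    print("Dump type:      ", describe_dump_type(cfg["CrashDumpEnabled"]))
    # MinidumpDir is usually stored unexpanded, e.g. %SystemRoot%\Minidump
    print("Minidump folder:", cfg["MinidumpDir"])
    print("AutoReboot:     ", cfg["AutoReboot"])
```

For a setup like mine, you would want the dump type to read “Small memory dump (minidump)” so that files land in the Minidump folder.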

It soon crashed again. This time I did find some memory dump files, just as I had configured. I felt I was finally seeing some light.
The dump files were sitting on my Hyper-V host server (C:\Windows\Minidump\), not in the client OSes. Maybe the host crashed before the clients had a chance to write any dump files of their own.
Now the question was how to analyze these dump files. I had been told that this kind of dump analysis is the domain of hardware driver developers, so I suddenly felt like I was doing some hacker-level reverse engineering.
Fortunately, it turned out to be fairly easy, even for newborn driver hackers like us.
First, we need to download a free utility called WinDbg from Microsoft.
Here are the download links: 32-bit, 64-bit.
It runs like this:

Before you analyze the DMP files, you need to configure the symbol file path. WinDbg needs symbol files to translate the raw addresses in a DMP file into module and function names. To configure the symbol file path, see the picture below:

Make sure the symbol file path is srv*c:\Symbols*http://msdl.microsoft.com/download/symbols

WinDbg uses this path to resolve the symbols in the dump file. As you can guess, because the symbols are downloaded from Microsoft’s symbol server (and cached in c:\Symbols), you need to be online while analyzing the DMP file.
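To make the structure of that path a bit clearer: elements are separated by semicolons, and an element of the form srv*&lt;local cache&gt;*&lt;symbol server URL&gt; means “look in the local cache first, download missing symbols from the server, and keep a copy in the cache.” The Python sketch below just splits such a path into its parts. It is a simplified illustration I wrote, not WinDbg’s actual parser, and it ignores other element types the debugger supports.

```python
def parse_symbol_path(path):
    """Split a debugger symbol path into its elements (simplified sketch).

    Elements are separated by ';'. An element of the form
    srv*<local cache>*<symbol server URL> tells the debugger to look in the
    local cache first and to download missing symbols from the server,
    keeping a copy in the cache.
    """
    elements = []
    for element in path.split(";"):
        element = element.strip()
        if not element:
            continue  # ignore empty entries such as a trailing ';'
        parts = element.split("*")
        if parts[0].lower() == "srv" and len(parts) > 1:
            stores = parts[1:]
            elements.append({
                "kind": "symbol server",
                "caches": stores[:-1],  # local downstream store(s), searched first
                "server": stores[-1],   # remote store, used when the cache misses
            })
        else:
            # Anything else is treated as a plain directory of symbol files.
            elements.append({"kind": "local directory", "path": element})
    return elements


print(parse_symbol_path(r"srv*c:\Symbols*http://msdl.microsoft.com/download/symbols"))
```

For the path above, it reports c:\Symbols as the cache and the msdl.microsoft.com URL as the server, which is why only the first download of each symbol file needs a connection.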
Now it’s time to open our DMP files. Just click [Open Crash Dump…] in the File menu, browse to c:\windows\minidump, and pick a DMP file (I just picked the latest one).
The analysis result disappointed me a bit, because it didn’t tell me which module caused the crash. The only potentially meaningful part was this:
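If the Minidump folder holds many files, picking “the latest one” by hand gets tedious. Here is a small Python sketch that finds the most recently modified .dmp file; the function name is my own, and you may need to run it from an elevated prompt, since the Minidump folder is normally protected.

```python
from pathlib import Path


def latest_dump(folder=r"C:\Windows\Minidump"):
    """Return the most recently modified .dmp file in folder, or None."""
    root = Path(folder)
    if not root.is_dir():
        return None  # folder does not exist
    dumps = [p for p in root.iterdir()
             if p.is_file() and p.suffix.lower() == ".dmp"]
    if not dumps:
        return None
    # "Latest" here means the newest modification time.
    return max(dumps, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    print(latest_dump())
```

The printed path can then be opened through [Open Crash Dump…], or passed to WinDbg’s -z command-line option.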
CLOCK_WATCHDOG_TIMEOUT (101)
An expected clock interrupt was not received on a secondary processor in an
MP system within the allocated interval. This indicates that the specified
processor is hung and not processing interrupts.
Arguments:
Arg1: 0000000000000019, Clock interrupt time out interval in nominal clock ticks.
Arg2: 0000000000000000, 0.
Arg3: fffff88001c5d180, The PRCB address of the hung processor.
Arg4: 0000000000000002, 0.
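The argument values above are hexadecimal, so Arg1 = 0x19 is 25 nominal clock ticks. If you want to turn such values into plain numbers, here is a small Python sketch. The function name and regular expression are my own; it only assumes the “ArgN: value, description” line format shown above.

```python
import re

# Matches lines such as "Arg1: 0000000000000019, Clock interrupt time out ..."
ARG_LINE = re.compile(r"^Arg(\d+):\s*([0-9a-fA-F`]+),\s*(.*)$")


def parse_bugcheck_args(text):
    """Return {arg_number: (value_as_int, description)} from the Arguments block."""
    args = {}
    for line in text.splitlines():
        match = ARG_LINE.match(line.strip())
        if match:
            number, hex_value, description = match.groups()
            # Values are printed in hex; WinDbg may split 64-bit values with a backtick.
            args[int(number)] = (int(hex_value.replace("`", ""), 16), description)
    return args


sample = """\
Arg1: 0000000000000019, Clock interrupt time out interval in nominal clock ticks.
Arg2: 0000000000000000, 0.
Arg3: fffff88001c5d180, The PRCB address of the hung processor.
Arg4: 0000000000000002, 0.
"""

for number, (value, description) in sorted(parse_bugcheck_args(sample).items()):
    print(f"Arg{number}: {value:#x} ({value}) {description}")
```

For Arg1 this prints 0x19 (25): the expected clock interrupt from the hung processor did not arrive within 25 nominal clock ticks.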
After searching for quite a while, I found that this might be a known issue documented in this MSDN KB article. I decided not to try the solution the KB article recommended, because it sounded too tedious; I would only try it if I had no other options.
Then I found someone who had asked a similar question in the Microsoft support forum. I was so happy when he mentioned that the issue goes away if the CPU’s “hyper-threading” feature is disabled.
I didn’t even bother to read further, because I was tired of this issue and just wanted to try something quick. There was no production data on the server at that time anyway.
I disabled “hyper-threading” in the BIOS, booted up my 1U box, and waited for the next crash.
Day 1 passed, then days 2 and 3, and on and on. After 30 consecutive days, my four virtual servers were still humming along and handling some heavy workloads.
I know I should remember this trick the next time I configure Hyper-V, and I might even need to spend some time figuring out why it works.
But for now, I’m just enjoying the server’s hum and focusing on programming rather than hardware debugging.