TechHui

Hawaiʻi's Technology Community

I've just been reading an interesting article on Ars Technica, "6 steps to a successful virtualization deployment", and I was wondering if anyone here has anything else they'd add to the list?

http://bit.ly/13ZH5v

I know from my own experience that monitoring and documentation were the pitfalls we fell into. The sysadmin who originally built the VM box failed to put any monitoring on the host itself, assuming it was enough that the instances on it would be monitored. Sure, they were, but that just meant that when the VMware machine went down (another failure: we didn't have a hot spare), the monitoring team saw alarms on lots of "servers".
That's where documentation failed to come in.

The VMware box was for the most part an emergency box. We had a bunch of old servers (Solaris running on old UltraSPARCs, FreeBSD, Fedora, Red Hat, the works) and platforms from various ISPs we'd bought out. Management wasn't prepared to pay for replacements, so we were forever just making do. That the servers themselves were undocumented really didn't help, but we reverse engineered most of them, and when we saw warning signs that one was getting dodgy we'd replicate its functionality in a VM and move the IP address... and more often than not forget to update the only documentation we had to say that the server was now a VM rather than a specific box.

I had the joy of being on call when the VMware server took a dive, for reasons we never managed to identify (suspicion of a human screw-up was high, given someone was working in the cab next door).

All our overnight team saw was a load of alarms cropping up for a bunch of old servers simultaneously, and they panicked... and started looking for network failures. After 30 minutes of trying all sorts of routes to figure out where the network problem might be, and with an engineer en route to the data centre the servers were in according to the documentation, they phoned the on-call number to alert us to the problem and said they were going to call networks.

I'm not the most coherent at 3am, but after about 5-10 minutes of connecting up to work and mtr'ing to boxes I twigged what the problem was, having just set up one of those VMs earlier in the week. An engineer was quickly dispatched to the server room down the hall and the VM host kicked back into life, panic over.

The next day saw both failures (the missing host monitoring and the stale documentation) quickly rectified, and then I had to draw up some quick training docs to ensure the NOC knew we actually had a VMware server, what we were using it for, and how to interact with it.
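
For anyone wanting to avoid the same 30 minutes of chasing a phantom network fault, here is a rough sketch of the kind of check that would have helped: keep a guest-to-host inventory and, when several alarms fire at once, see whether the affected servers share a host. This isn't what we actually ran; the host and guest names, the `suspect_hosts` function and the 50% threshold are all made up for illustration, and the inventory would really come from whatever documentation or CMDB you keep up to date.

```python
from collections import defaultdict

# Hypothetical inventory: guest name -> the physical host it runs on.
# Every name here is invented for illustration.
GUEST_TO_HOST = {
    "mail-old": "vmhost1",
    "dns-legacy": "vmhost1",
    "radius-isp2": "vmhost1",
    "web-new": "rack4-web",
}


def suspect_hosts(alarming, inventory, threshold=0.5):
    """Return hosts where at least `threshold` of the guests are alarming.

    Hosts with fewer than two known guests are skipped, since a single
    alarm there says nothing about a shared cause.
    """
    guests_per_host = defaultdict(set)
    for guest, host in inventory.items():
        guests_per_host[host].add(guest)

    alarming = set(alarming)
    suspects = []
    for host, guests in sorted(guests_per_host.items()):
        down = guests & alarming
        if len(guests) >= 2 and len(down) / len(guests) >= threshold:
            suspects.append(host)
    return suspects


if __name__ == "__main__":
    alarms = ["mail-old", "dns-legacy", "radius-isp2"]
    print(suspect_hosts(alarms, GUEST_TO_HOST))  # prints ['vmhost1']
```

It only works if the inventory is kept current, which was exactly our problem, and it is no substitute for monitoring the host directly.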

What lessons have you guys learnt, if any, from VM rollouts or infrastructure changes?

