Virtual Machines Running or Registered on Multiple ESX Servers

Does your vCenter flicker when browsing clusters? There could be a problem…

Over the last 3 years I’ve seen this happen twice, and both times it was not good. Both cases were caused by an HA event that was interrupted, which left multiple VMs registered on more than one host. Fortunately, the VM stays running and the fix does not cause an outage, but it is intimidating having to “KILL” VM processes.

HA being HA might have caused it; the KB below lists more causes and the solution.

HA is a good thing to have enabled, but if your NOC is monitoring your VMs and sees an alert that VMs are powering off, they will log into the VIC and start powering them back on. No downtime, right? That’s one of the main causes of this problem, and VMware admins need to educate their NOC admins on letting HA do its job and power the VMs back on.

Now, the devil’s advocate in me says that sounds good, but how does a NOC know it’s a VM, or a bunch of VMs? And don’t we tell them to just treat them like any other server? The devil’s advocate has a good question and I will ask for help answering it. Can I get feedback on how to avoid this issue when an event happens that might cause “VM jumpers”?

Here’s a must-know process for every VMware admin on how to fix this problem…

VMware KB Link: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1005051

Here’s how VMware describes it:

  • After one of the following, a Virtual Machine appears as being registered on two ESX Servers:
    • A VMotion fails to complete correctly or times out in VirtualCenter
    • A DRS issue where virtual machines are VMotioned automatically in quick succession
    • When a machine is powered on during VMware HA failover.
    • The Service Console on an ESX host is low on memory starving the vpxa process
  • In VirtualCenter, you see the virtual machine as appearing on one ESX Server for a few seconds, then it seems to be on the other.
    The virtual machine may appear to jump back and forth among different ESX hosts.
To correct this misconfiguration:
  1. Click Inventory in the navigation bar. Expand the inventory as needed and click the appropriate managed host.
  2. Click the Virtual Machines tab.
  3. Note the virtual machine that disappears every few seconds.
  4. Log in as root with SSH to both affected ESX hosts.
  5. Run the vmware-cmd -l command to display the names of the virtual machines registered on this host.

    Run the vm-support -x command to show which virtual machines are currently running on the ESX host.

    Compare results from these commands to determine which ESX host has the virtual machine registered, but is not running it. When you have determined this, you need to unregister the virtual machine from the ESX host on which it is registered but not running.

  6. Run the following command to unregister the virtual machine from the ESX host:

    vmware-cmd -s unregister </vmfs/volumes/datastore/folder/machine>.vmx

     

  7. If the virtual machine has a process (PID) associated with it, ESX may not allow you to unregister it and the command fails with the error:

    VMControl error -999: Unknown error: SoapError: ServerFaultCode(0): (The attempted operation cannot be performed in the current state (Powered On).)

    If you see this error and are unable to unregister the virtual machine:

    • Kill the process for the virtual machine in the Service Console with the following two commands:
      • ps -auwwwxx | grep -i <Virtual Machine Name>
      • kill -9 <PID of the process returned from the above command>

    • Unregister the virtual machine from the ESX host again with the command:

      vmware-cmd -s unregister </vmfs/volumes/datastore/folder/machine>.vmx

    • Run the following command to stop the hostd process:

      service mgmt-vmware stop
    • Use a text editor to open the /etc/vmware/hostd/vmInventory.xml file.
    • Locate the machine you want to remove.
    • Remove all of the information between the <ConfigEntry> tags for the affected virtual machine.
    • Run the following command to start the hostd process:

      service mgmt-vmware start

  8. Log in to all of your ESX Servers directly using VI Client.

    You see the virtual machine on both ESX hosts with a Powered-on status. One host however does not display any details of VMware Tools, IP address, etc in the Summary tab.

     

  9. Click the virtual machine on the host that does not display any details in the Summary tab.
  10. Right-click the virtual machine, and click Power Off.
Note: VMware recommends restarting the mgmt-vmware and vmware-vpxa processes on any hosts on which you have changed registered machines from the command line. For more information, see Restarting the Management agents on an ESX Server (1003490).
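
The comparison in step 5 can be sketched in Python. This is only a rough sketch: it assumes you have already reduced the output of vmware-cmd -l and your list of running VMs to plain lists of .vmx paths for each host, and it does not parse vm-support -x output itself. The function name, host names and paths below are hypothetical.

```python
def registered_but_not_running(registered, running):
    """Return the .vmx paths that are registered on a host but not running there.

    registered: paths listed by `vmware-cmd -l` on the host.
    running:    paths of the VMs found to be running on the same host
                (assumed to be extracted by hand; no output parsing here).
    """
    registered_set = {p.strip() for p in registered if p.strip()}
    running_set = {p.strip() for p in running if p.strip()}
    return sorted(registered_set - running_set)


# Hypothetical sample data: db01 is registered on both hosts
# but only runs on esx01.
hosts = {
    "esx01": {
        "registered": ["/vmfs/volumes/ds1/web01/web01.vmx",
                       "/vmfs/volumes/ds1/db01/db01.vmx"],
        "running": ["/vmfs/volumes/ds1/web01/web01.vmx",
                    "/vmfs/volumes/ds1/db01/db01.vmx"],
    },
    "esx02": {
        "registered": ["/vmfs/volumes/ds1/db01/db01.vmx",
                       "/vmfs/volumes/ds1/app01/app01.vmx"],
        "running": ["/vmfs/volumes/ds1/app01/app01.vmx"],
    },
}

stale = {host: registered_but_not_running(d["registered"], d["running"])
         for host, d in hosts.items()}

for host, paths in stale.items():
    for path in paths:
        print(f"{host}: unregister candidate {path}")
```

Anything it reports is only a candidate for step 6; confirm in the VI Client before unregistering or killing anything.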

Originally posted 2009-08-10 18:56:49. Republished by Blog Post Promoter

VMware VM Guest SWAP File Best Practice

VM guest swap file diagram


Answering the question that always seems to get asked, “what is the best practice for the VM SWAP file?”

I have put this method through bench testing and found it to have the best performance. The step-by-step VM SWAP file configuration is:
  1. Set up a data store in each ESX cluster for guest SWAP files, see diagram to the right. (Do not use local ESX disks! Local disks save more expensive SAN storage but add latency for DRS and vMotioning of VMs.)
  2. For each ESX host in the cluster, configure the SWAP to point to the SWAP data store.
  3. Go through all the settings on each VM in the cluster and make sure the VM SWAP files are controlled by the host, not stored in the VM folder.
  4. Here’s a step that some will disagree on, but my testing has found it to produce the best VM performance: log into the Windows VM and set the SWAP file to be maintained by Windows. Some admins like to use a 2.5 to 3.5 ratio to memory, but this is not a physical server, so let Windows adjust the SWAP file size. From what I have found, it ends up at a 1:1 ratio.
  5. Configure 1 to 3 the same for Linux, but on 4 just let Linux do its own thing. Linux uses a SWAP volume which is stored in the VM’s folder.
My test results produced: faster VM booting, better VM performance, faster vMotioning and DRS, and faster VM evacuation when an ESX host is entering maintenance mode.
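
To size the SWAP data store from step 1, one rough approach is to add up each VM’s configured memory minus its memory reservation, since that is the size of the swap file ESX creates for a powered-on VM. Here is a minimal sketch of that arithmetic. The VM names and sizes are hypothetical, and the 20% headroom is my own assumption, not a VMware figure.

```python
def swap_datastore_mb(vms, headroom_pct=20):
    """Estimate the space needed on the swap data store, in MB.

    vms: mapping of VM name -> (configured_memory_mb, reservation_mb).
    headroom_pct: extra space for growth (an assumed figure).
    """
    total = sum(max(memory - reservation, 0)
                for memory, reservation in vms.values())
    return total * (100 + headroom_pct) // 100


# Hypothetical cluster inventory.
vms = {
    "web01": (4096, 0),     # no reservation: full 4096 MB swap file
    "db01": (8192, 2048),   # 2048 MB reserved: 6144 MB swap file
    "app01": (2048, 2048),  # fully reserved: no swap space needed
}

needed_mb = swap_datastore_mb(vms)
print(f"Swap data store should hold at least {needed_mb} MB")
```

Run it again whenever VMs are added or reservations change, because the total moves with both.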

Originally posted 2009-02-08 07:49:24.

VMWorld 2009 was Really a Storage Conference

I’ve mentioned this in a few of my blog posts over the years and here it is again – “Storage has been a major oversight by most companies deploying VMware Virtual Infrastructure”.  News Flash! This was confirmed at the recent 2009 VMWorld … 

What I’ve heard and read about VMWorld 2009 is that it was really a storage conference. Every storage vendor and his brother were there displaying and promoting storage products to help build better VI environments. And, from many of the emails I get, most visitors were really just interested in finding out what they can do to fix their main storage issue: poor performance.

A year ago I had a talk with my VMware support engineer and explained to him that I thought a huge market was coming for solutions that improve storage performance. I also explained how most VMware deployments are similar to boiling a frog. If a frog is thrown into hot water it will jump out; however, if the water temperature is turned up slowly, the frog won’t realize it until it is too late. Likewise, if VMware sales representatives told every new customer they would need a new storage array within a year, nobody except those already planning to buy new storage would virtualize. However, if nothing is said and VMs are built one at a time over a 6 to 12 month period on existing shared storage, nobody will notice performance degradation until the day the main business application database crashes because it shares the same SAN as VMware. By then VMware won’t get the blame; storage will. I suspect many companies using VMware are at this point with their environments today.

You have two options for your problem: 1) Buy more storage or 2) Re-carve your existing storage. My gut tells me most SAN admins reading this post will argue with option 2 because it would take a lot of work and they most likely don’t believe it will help. For those who disagree because they know better, I suggest getting re-educated on carving SAN for VMware.

No disrespect intended, but it’s a hard one to digest. Think about this: you’re used to carving one 100GB LUN for one server with many users and dedicated HBA ports, right? Now consider for a moment that for VMware you are carving 8 to 16 LUNs of 300GB, 400GB or 500GB each for 8 to 16 ESX hosts with 160 to 240 virtual servers, all accessing the LUNs through the same HBA port or path, and all at the same time. If I was lucky enough to get your attention, then I won’t insult you by trying to explain how each SAN is different, but I will recommend calling your VMware and SAN support and speaking only with someone who works on storage for VMware.
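
To put rough numbers on that, here is a back-of-the-envelope sketch of how many virtual servers end up sharing each LUN. It assumes the VMs are spread evenly across the LUNs, which real clusters rarely manage, and the function name is just for illustration.

```python
def vms_per_lun(total_vms, lun_count):
    """Average VMs sharing one LUN, rounded up, assuming an even spread."""
    return -(-total_vms // lun_count)  # ceiling division


# The ranges from the paragraph above: 160 to 240 VMs on 8 to 16 LUNs.
best_case = vms_per_lun(160, 16)   # fewest VMs, most LUNs
worst_case = vms_per_lun(240, 8)   # most VMs, fewest LUNs

print(f"Between {best_case} and {worst_case} VMs share each LUN")
```

Compare that with the single server a traditional 100GB LUN was carved for.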

Furthermore, for both options mentioned, many storage vendors can do a form of what is known as wide-striping (HDS term), though it requires a special license that will cost you. HDS, 3PAR, EMC and HP can all have 100s of drives in a single disk pool (RAID/parity group), that is, with the properly licensed features. NetApp will have a similar feature in ONTAP 8.4, from what I’ve been told.

I hope this has been helpful for someone trying to understand why the frog keeps dying. So, the next time you have a VM that starts croaking, I’d have a look at storage option 2.

Originally posted 2009-09-15 15:48:23.