Hyper-V Replica is Not a Silver Bullet

Don’t worry, this is not one of those articles intended to disparage Hyper-V Replica (HVR). HVR is a fantastic technology and it works as advertised. The problem is that it is being treated like the technological penicillin of this decade. All too often, people are prescribing it for diseases that it won’t cure.

There are two major problems with this over-prescription. The first is the oft-asked question of “Should I set up a cluster with Live Migration or use Hyper-V Replica?” Of course, this is an apples-to-oranges question because a cluster is for high availability and HVR is for disaster recovery. Most of the people who ask this question are actually aware of the distinction. What they often really mean is, “I can afford to build a cluster or I can afford to build HVR. Which one should I choose?” All too often (usually as a knee-jerk response), the given answer is HVR. In reality, it should probably be a cluster more often than not.

To Cluster or To Replica, That is the Question

The missing element in this question is backup. Hopefully, that’s because backup is just a given. No matter what you do, you absolutely must have a proper backup solution. As soon as you consider that, the question immediately tips toward a cluster. Here’s why:

                              Cluster   Hyper-V Replica   Backup
Can Replace Cluster           N/A       No                No
Can Replace Hyper-V Replica   No        N/A               Yes
Can Replace Backup            No        No                N/A

As you can see, only one item in this list can stand in for any other: backup can be used in place of HVR. Since you absolutely must have backup if you care about the data at all, that automatically means the least useful component is HVR. That’s not to say that there’s anything wrong with HVR and I’m certainly not claiming that backup can do everything that HVR can do. It’s simply that if you can only choose two of the above three technologies, and given that forgoing backup is not an option, then you’re going to need to get your pencil pretty sharp to explain why clustering is not the logical choice. The most common “reason” I see given is that clusters are “hard”. If you’re avoiding clusters solely because they’re “hard”, I recommend you seek an alternate career.

HVR is Not Cheap

The second major problem with HVR over-prescription is that some people act like everything related to HVR is completely free and can just be tossed together from random bits you find discarded by the street. I’ve actually seen HVR described as “poor man’s disaster recovery”. This is definitely an example of “Let Them Eat Cake” syndrome in which one person’s definition of “poor” is unlike most anyone else’s definition. You’re not going to get HVR for cheap.

The Hardware

One place that people try to oversell HVR is in the hardware. They tell you that you don’t have to use the same horsepower for hardware to host the replicas as you do for the primary live site. In general, that’s probably true. In the event of a disaster, you probably have a lower staff load and you’ll probably trim some non-essential virtual machines. The problem is that this isn’t usually fully explored. Some people will make it sound like you can just re-use any old tech you find lying around somewhere. No matter how light you make the replica site’s load, you’re not going to cover a twenty-core, 128 GB system with someone’s cast-off Pentium III. Also, if your organization is small enough that you’re really having to make a decision between HVR and a cluster, it’s also likely that you haven’t got that many virtual machines to trim off as non-essential. Realistically, your replica site is probably not going to look starkly different from your live site.

The Licensing

Oh yeah, that. I actually had to call up an actual expert on Microsoft licensing to get that answer, because the people who really want you to buy in to HVR really don’t want to talk about licensing. So here it is, the straight answer: a replica virtual machine must be licensed exactly as if it were an active virtual machine. Is this an oversight or something that’s likely to change? Possibly, I suppose, but I really doubt it. The thing is, if you buy Software Assurance for all the guests in your primary site, that will include the licenses needed for the replica site.

I don’t know Microsoft’s plans for certain, so this is an educated guess. Microsoft has been trying really, really hard for years to sell Software Assurance. That’s because it’s a three-year commitment, but it doesn’t really start to pay off until the fourth year (at least, that’s what I came to when I ran the numbers in 2010), so once you’re on it, it’s kind of silly to get off. All that means that it easily becomes a fairly guaranteed revenue stream for them. So, you can pretty much write it in stone that if HVR has any chance to sell Software Assurance, HVR licensing is not going to change. Of course, now that it seems that Microsoft is going to start accelerating the rate of operating system releases, SA might start to make more sense. Even if they don’t, that fourth-year payoff is very real and it gets even better as time goes on. This is all kind of an aside, though. The point is, VMs that are HVR targets must have their own licenses; whether you buy them outright or purchase SA to cover them is up to you.

Did this post catch you by surprise? Have you already purchased your primary site licenses, thinking that the replicas didn’t need their own? Well, now you’re really going to be upset: you can’t retroactively purchase SA. You have to get all new licenses. This is the primary reason I really dislike it when people just blindly recommend HVR without any knowledge of the situation.

The Bottom Line

So, in a realistic replica-instead-of-cluster situation, you have to buy hardware and licensing for 60 to 80% of your live site, and, with it being that close, maybe you went all out and made it 100%. So, you’ve pretty much spent enough money to have all those nifty high-availability features like Live Migration and Cluster-Aware Updating, but you don’t have any of them. That is probably not a wise use of capital for most organizations.

Know Your Situation

Disaster recovery planning includes two very well-known concepts: recovery time objective (RTO) and recovery point objective (RPO). An RTO effectively says, “I want to be back online within this amount of time after a failure.” An RPO effectively says, “I want to lose no more transactions/data than what occurred within this amount of time prior to the failure.” HVR allows you to set both RPO and RTO to very low numbers. It can provide an RPO as low as five minutes. RTO is a little shakier; the decision to fail over to replica should not be taken lightly, should only be made by an executive, and should take into account such metrics as the expected time that the primary site will be out. This is because failing back takes time and is also an involved process. So, RTO is variable, but HVR can allow you to set it to a few minutes, if desired.

That’s all well and good. But, do you actually require a 5-minute RPO and a low RTO? Remember that “desire” and “require” are not synonyms. I worked in one medium-size (approximately 400 employees) organization that estimated it could go without computers for an entire week and even lose data for the week prior to the failure. A normal daily backup routine easily met that RPO with quite a bit of wiggle room. Since they had full replacement insurance, new equipment could be shipped and delivered overnight, also easily outmatching the RTO. If they’d had to choose between replica and clustering, HVR for that company would have been a “sort-of-nice-to-have” and would definitely have lost the debate.
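To make the “require” test concrete, here is a minimal sketch of how you might check a candidate solution against stated objectives. The type and function names are hypothetical, and the two-day recovery figure for the backup is an assumption chosen for illustration; the one-week RPO and RTO come from the organization described above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Requirement:
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable downtime


@dataclass(frozen=True)
class Solution:
    name: str
    worst_case_data_loss: timedelta  # e.g. the interval between backups
    worst_case_recovery: timedelta   # e.g. time to replace hardware and restore


def meets(requirement: Requirement, solution: Solution) -> bool:
    """Return True if the solution satisfies both the RPO and the RTO."""
    return (solution.worst_case_data_loss <= requirement.rpo
            and solution.worst_case_recovery <= requirement.rto)


# The organization above could tolerate a week offline and a week of lost data.
requirement = Requirement(rpo=timedelta(weeks=1), rto=timedelta(weeks=1))

# Assumed figures: nightly backups, two days to receive hardware and restore.
daily_backup = Solution("daily backup", timedelta(days=1), timedelta(days=2))

print(meets(requirement, daily_backup))
```

If the same check passes for the backup you already must have, the extra RPO and RTO headroom that HVR offers is a “desire”, not a “require”.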

Here’s another consideration people don’t often make. There are two ways to lose your data center in a way that might cause you to want to fail over to replica: actual physical loss and data corruption. If data corruption occurs, there’s a greater-than-zero chance that HVR will transmit it to the replica site, which immediately invalidates all those replicas (and is one reason why replica does not replace backup). If physical loss occurs, there’s another question: do you actually need to fail over? This seems like an obvious, “Yes.” For some people, it won’t be. If your primary site is a factory that makes widgets, and it’s the only site that makes widgets, and all the existing widgets and the equipment that makes those widgets are destroyed, what good are your computers? Bookkeeping and customer contacts? Sure, but do you really need to spin up an entire replica site for that? And do you really need a 5-minute RPO and a short RTO? Maybe you do. Maybe you don’t.

Of course, as has been pointed out, there are situations in which replica will fail. So, being completely dependent upon it to deliver those RPOs and RTOs is not a good plan. And, of course, you can only use disaster recovery technology in the event of a disaster. You can use high availability technology all the time.

Do The Math

I’m not saying that HVR is bad and I certainly don’t want to turn anyone away from using it sensibly. The key word is “sensibly”. Don’t do it because someone on a forum makes an off-the-cuff recommendation for you, especially someone who works for a tech vendor, has only seen HVR in a lab or at customer sites they conned into it, and has never worked in any other industry. One thing that I often stress to other audiences is that to really be effective in IT, you must have a fairly good understanding of the business operations of your organization. HVR is a shining example of that need. In my estimation, HVR makes the most sense for organizations that have enough resources that they don’t have to make an either/or decision.