In this final post we are going to cover a simple recovery, as well as do a quick summary. I’ll throw in a few bonus details for free.
Recovery
Our CG has been running now for over 48 hours with our configuration: a 48-hour Required Protection Window, 48 max snaps, one snap per hour. Notice below that I have exactly (or just under, depending on how you measure) a 48-hour protection window. I have one snap per hour for 48 hours, and that is what is retained. This is because of how I constructed my settings!
If I reduce my Required Protection Window to 24 hours, notice that IMMEDIATELY the snaps past 24 hours are nuked:
The distribution of snaps in this case wouldn't be any different because of how the CG is constructed (one snap per hour, 48 max snaps, 24-hour protection window = one snap per hour for 24 hours), but again, notice that the Required Protection Window is much more than just an alerting setting in RP+XtremIO.
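If it helps to see the interplay in code form, here is a tiny Python sketch of my mental model of the pruning: the Required Protection Window throws out anything older than the window, and the max snap count caps what's left. To be clear, this is just how I think about it, not EMC's actual algorithm, and the function is made up purely for illustration.

```python
from datetime import datetime, timedelta

def retained_snaps(snap_times, now, max_snaps, protection_window_hours):
    # My mental model only (not EMC's documented algorithm): drop anything
    # older than the Required Protection Window, then keep at most
    # max_snaps of the newest snapshots that remain.
    window_start = now - timedelta(hours=protection_window_hours)
    in_window = [t for t in snap_times if t > window_start]
    return sorted(in_window, reverse=True)[:max_snaps]

# One snap per hour for the last 48 hours
now = datetime(2015, 6, 1, 12, 0)
snaps = [now - timedelta(hours=h) for h in range(48)]

print(len(retained_snaps(snaps, now, max_snaps=48, protection_window_hours=48)))  # 48 retained
print(len(retained_snaps(snaps, now, max_snaps=48, protection_window_hours=24)))  # 24 retained: the older half is pruned
```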
Alright, back to our recovery example. Someone dumb like myself ignored all the “Important” naming and decided to delete that VM.
Even worse, they decided to just delete the entire datastore afterwards.
But lucky for us we have RP protection enabled. I’m going to head to RP and use the Test a Copy and Recover Production button.
I’ll choose my replica volume:
Then I decide I don't want to use the latest image because I'm worried that the deletion actually exists in that snapshot, so I choose one hour prior to the latest snap. Quick note: see that virtual access is not even available now? That's because with snap-based promotion there is no need for it. Snaps are instantly promoted to the actual replica LUN, so physical access is always available and always immediate, no matter how old the image.
After I hit next, it spins up the Test a Copy screen. Normally I might want to map this LUN to a host and actually check it to make sure that this is a valid copy. In this case, because (say) I've tracked the bad user's steps through vCenter logging, I know exactly when I need to recover. An important note, though: as you'll see in a second, all snapshots taken AFTER your recovery image will be deleted! But again, because I'm a real maverick, I just tell it to go ahead and do the production recovery.
It gives me a warning that prod is going to be overwritten, and that data transfer will be paused. It doesn’t warn you about the snapshot deletion but this has historically been RP behavior.
On the host side I do a rescan, and there’s my datastore. It is unmounted at the moment so I’ll choose to mount it.
Next, because I deleted that VM I need to browse the datastore and import the VMX file back into vCenter.
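For what it's worth, I did all of this through the vSphere client, but the same host-side steps (rescan, mount the recovered datastore, re-register the VM from its VMX) can be scripted. Below is a rough pyVmomi sketch of that flow; the vCenter, host, datastore, and VM names are placeholders rather than my actual lab objects, so adjust everything for your own environment.

```python
# Hedged pyVmomi sketch of the host-side recovery steps: rescan storage,
# mount the recovered VMFS datastore, and re-register the deleted VM.
# All names below (vCenter, host, datastore, VM) are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab shortcut; use real certs in production
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
content = si.RetrieveContent()
host = content.searchIndex.FindByDnsName(dnsName="esx01.lab.local", vmSearch=False)

# 1. Rescan HBAs and VMFS so the promoted replica LUN shows up
storage = host.configManager.storageSystem
storage.RescanAllHba()
storage.RescanVmfs()

# 2. Mount the recovered datastore (it comes back unmounted)
for mount in host.config.fileSystemVolume.mountInfo:
    vol = mount.volume
    if isinstance(vol, vim.host.VmfsVolume) and vol.name == "Important_DS":
        storage.MountVmfsVolume(vmfsUuid=vol.uuid)

# 3. Re-register the deleted VM from its .vmx file
datacenter = content.rootFolder.childEntity[0]  # first datacenter; adjust for your inventory
datacenter.vmFolder.RegisterVM_Task(
    path="[Important_DS] ImportantVM/ImportantVM.vmx",
    name="ImportantVM", asTemplate=False,
    pool=host.parent.resourcePool, host=host)

Disconnect(si)
```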
And just like that I’ve recovered my VM. Easy as pie!
Now, notice that I recovered using the 2:25 snap, and below is what my snapshot list looks like afterwards. The 3:25 snap and the 2:25 snap that I used are both deleted. This is actually kind of interesting, because an awesome feature of XtremIO is that all snaps (even snaps of snaps) are independent entities; intermediate snaps can be deleted with no consequence. So in this case I don't necessarily think this deletion of all subsequent snaps is a requirement, but it certainly makes logical sense that they should be deleted to avoid confusion. I don't want a snapshot of bad data hanging around in my environment.
Summary
In summary, this snap-based recovery looks fantastic as long as you take the time to understand the behavior. Like most things, planning is essential to ensure you strike a good balance between your required protection and capacity savings. I hope to see some more detailed breakdowns from EMC on the behavior of the snapshot pruning policies and the full impact that settings like Required Protection Window have on the environment.
Also, don't underestimate the 8,192 max snaps+volumes for a single XMS, especially if you are managing multiple clusters per XMS! If I had to guess, I'd say this value will be bumped up in a future release considering these new factors, but in the meantime make sure you don't overrun your environment. Remember, you can still use a single XMS per cluster in order to sort of artificially inflate your snap ceiling.
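To put that ceiling in perspective, here's some back-of-the-envelope Python. The accounting is my own assumption (production volume plus replica volume plus every retained snap, all counting against one XMS), so treat the output as rough numbers rather than an official sizing formula.

```python
# Rough headroom math against the 8,192 volume+snapshot limit on one XMS.
# Assumption (mine, not an official sizing rule): each RP-protected volume
# consumes a production volume, a replica volume, and all retained replica
# snaps when everything is managed by the same XMS.
XMS_OBJECT_LIMIT = 8192

def protected_volume_ceiling(snaps_retained):
    objects_per_volume = 2 + snaps_retained  # prod + replica + snaps
    return XMS_OBJECT_LIMIT // objects_per_volume

for snaps in (8, 24, 48):
    print(f"{snaps:>2} snaps retained -> roughly {protected_volume_ceiling(snaps)} protected volumes per XMS")
```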
Bonus Deets!
A couple of things of note.
First, in my last post I stated that I had noticed a bug with settings not "sticking." After talking with a customer, I learned this doesn't have to do with the settings (the values) but with the process itself; something about the order of operations matters. I now believe this to be true, because if I recreate a CG with those same busted settings, it works every time! I can't get it to break. :) I still believe this to be a bug, so just double check your CG settings after creating them.
Second, keep in mind that today the XtremIO dashboard displays your provisioned capacity based on all volumes and snapshots on the system, with no regard for who created those snaps. So you can imagine that with a snap-based recovery tool, things get out of hand quickly. I'm talking about 1.4PB (no typo, PETAbytes) "provisioned" on a 20TB brick!
While this is definitely a testament to the power (or insanity?) of thin provisioning, I'm trying to put in a feature request to get this fixed in the future because it really messes with the dashboard's relevance. But for the moment, just note that for anything you protect with RP (a quick worked example follows the list below):
- On the Production side, you will see a 2x factor of provisioning. So if you protected 30TB of LUNs, your provisioned space (from those LUNs) will be 60TB.
- On the Replica side, you will see a hilarious factor of provisioning, depending on how many snaps you are keeping.
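Here's the quick worked example I promised, using my own simplified model: production shows double (per the first bullet), and the replica side shows the replica LUN plus every retained snap. The 30TB / 48-snap numbers are illustrative, not my actual environment, but they show how you get to petabyte-scale "provisioned" figures.

```python
# Simplified model of dashboard "provisioned" inflation under RP protection
# (my assumption about the accounting, not an official formula).
protected_tb = 30        # capacity of the LUNs you protect (illustrative)
snaps_retained = 48      # snaps kept per CG (one per hour for 48 hours)

production_provisioned = protected_tb * 2                  # prod LUNs show a 2x factor
replica_provisioned = protected_tb * (1 + snaps_retained)  # replica LUNs + each retained snap

print(f"Production side: {production_provisioned} TB provisioned")
print(f"Replica side:    {replica_provisioned} TB provisioned (~{replica_provisioned / 1024:.1f} PB)")
# 30 TB protected with 48 snaps -> 1,470 TB on the replica side, which is how a
# 20 TB brick ends up showing petabytes of "provisioned" capacity.
```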
I hope this series has been useful – I’m really excited about this new technology pairing!
