All week chasing VMware cluster problems. It'll now roll over into Monday. It always sucks when things fail in ways so subtle that the fault tolerance doesn't catch them and you need to start pressing buttons.

A hypervisor has had some weird storage failure where a couple of devices have just vanished, but the vSAN stuff hasn't properly accommodated this fact, so it's just sitting there protesting.

If it considered the node totally failed, it would all be so much easier.

@sullybiker Ceph has the opposite problem: if a node sneezes, it starts recovering data immediately unless you go out of your way to tell it not to.

