18 April 2018: Performance issues and service interruptions at some US clusters
Here is a short description of the issue that occurred today, 18/04/2018.
After receiving service-down alerts, we confirmed that one of the hosts had failed its instance checks at AWS and gone down.
All our plans are HA and designed to handle cases like this one. However, after one host machine failed, the remaining hosts were not able to handle the current load.
Our initial review indicated that the remaining hosts should have handled the load, but they could not, and we then noticed that one node had disk performance problems. We use high-IOPS disks and EBS-optimised instances at AWS (EBS optimisation is what lets an instance handle high IOPS in disk operations). Unfortunately, the host causing the bottleneck had not been created as EBS optimised, and it failed.
Our first attempt was simply to increase capacity and handle the issue transparently. However, the heavily loaded host was not moving data, and while we were trying to move it everything got worse: the parts that were still partly working stopped working as well.
Replacing that host was not possible without provisioning everything from scratch and restoring the data from backups.
We have checked all hosts and our configuration to ensure that all provisioned AWS instances are EBS optimised, to prevent this from happening again.
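A check like the one described above could be automated. What follows is only a minimal sketch, not the tooling we actually used. It assumes the input has the shape of an EC2 `DescribeInstances` response (`Reservations` → `Instances`, each with `InstanceId` and `EbsOptimized`). The function name `find_non_ebs_optimized` and the sample instance IDs are hypothetical.

```python
from typing import Any, Dict, List


def find_non_ebs_optimized(response: Dict[str, Any]) -> List[str]:
    """Return the IDs of instances whose EbsOptimized flag is not true.

    `response` is assumed to follow the shape of an EC2 DescribeInstances
    response. A missing flag is treated as "not optimised" so that such
    instances are reported rather than silently skipped.
    """
    flagged: List[str] = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            if not instance.get("EbsOptimized", False):
                flagged.append(instance["InstanceId"])
    return flagged


if __name__ == "__main__":
    # Hypothetical sample data; these are not real instance IDs.
    sample = {
        "Reservations": [
            {
                "Instances": [
                    {"InstanceId": "i-aaa", "EbsOptimized": True},
                    {"InstanceId": "i-bbb", "EbsOptimized": False},
                ]
            }
        ]
    }
    for instance_id in find_non_ebs_optimized(sample):
        print(f"{instance_id} is not EBS optimised")
```

In practice the input would presumably come from something like `boto3.client("ec2").describe_instances()`, with pagination handled so that no instance is missed. Any instance the function reports would need to be reviewed before it takes production load.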
We are sorry for all the trouble and the bad experience.