January 2013 Service Interruptions Postmortem
January was something of a nightmare for us because of service interruptions and performance issues. We have resolved most of the issues and made some critical infrastructure changes, which gave us 100% uptime in February.
What was the problem?
We had several issues:
- Latency problems
- Split brain issues
- GC pressure
- Cluster time-out problems, with some nodes getting kicked out by the master
Some of these issues can be solved with well-known best practices; others need more than just configuration. Our cluster was built with m2.medium nodes, and the cluster as a whole had enough memory and CPU. ElasticSearch plays very nicely when it is scaled horizontally, but that does not mean horizontal scaling solves every capacity problem. If a node does not have enough memory, the JVM garbage-collects frequently, and a node that spends too much time in garbage collection becomes a candidate for removal from the cluster because it does not answer ping requests in time. Single-core instances suffer from this much more. Another point to consider when scaling horizontally is network latency: adding more nodes means more network connections, which also increases latency.
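To make the link between GC pauses and node removal concrete, here is a minimal sketch in Python. It assumes a deliberately simplified model of fault detection: the master gives up on a node only after a fixed number of pings have each timed out, so the longest pause a node can survive is roughly the ping time-out multiplied by the number of retries. The function and parameter names are hypothetical, loosely modeled on the discovery.zen.fd.ping_timeout and discovery.zen.fd.ping_retries settings, and the numbers in the usage lines are illustrative rather than our production values.

```python
def survives_gc_pause(pause_s: float, ping_timeout_s: float, ping_retries: int) -> bool:
    """Return True if a node frozen for pause_s seconds would stay in the cluster.

    Simplified assumption: the node is dropped only once ping_retries
    consecutive pings have each waited ping_timeout_s without an answer.
    """
    tolerance_s = ping_timeout_s * ping_retries
    return pause_s < tolerance_s


# Illustrative values only: 3 s time-out with 3 retries tolerates about 9 s.
print(survives_gc_pause(5.0, ping_timeout_s=3.0, ping_retries=3))
print(survives_gc_pause(12.0, ping_timeout_s=3.0, ping_retries=3))
```

Under this model, raising the time-out or the retry count buys tolerance for long pauses, at the cost of noticing genuinely dead nodes later. It does not remove the underlying memory pressure.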
These are the conclusions we have come to:
- Consider scaling vertically instead of horizontally if you are comfortable with your availability and search performance (enough replicas).
- The network is evil: less communication means lower latency and faster responses.
- Single-core CPUs do not scale well; more concurrency helps a lot.
- Frequent GCs can harm your cluster: configure discovery time-out values carefully and make sure each individual node has enough memory.
- Set discovery.zen.minimum_master_nodes to avoid split brain issues.
- Just adding new nodes may save the day but not tomorrow; monitor your cluster carefully to find the root cause of problems.
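For discovery.zen.minimum_master_nodes, the usual recommendation is a strict majority (quorum) of the master-eligible nodes, so that two halves of a partitioned cluster can never both elect a master. A minimal sketch of that arithmetic in Python follows; the function name is hypothetical and the node counts are examples, not a description of our cluster.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum size: a strict majority of the master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# Example node counts and the quorum each one would need.
for nodes in (3, 4, 5):
    print(nodes, "master-eligible nodes -> minimum_master_nodes =", minimum_master_nodes(nodes))
```

Note that with only two master-eligible nodes the quorum is 2, so losing either node leaves the cluster without a master. That is one reason an odd number of master-eligible nodes is generally preferred.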
So we moved our cluster from medium instances to m2.large instances and tuned the configuration to make the cluster more stable.
In February we had 100% uptime and much better latency, for less money.
You can check out service status here.