I've been tasked at work with setting up an Elasticsearch cluster. We use Chef for provisioning, and there's an official cookbook available with some instructions, but they presume you are using Amazon EC2, which we are not. We use our own servers, with Vagrant VMs for testing, so I had to figure a few things out myself.
When I first added the recipe to the node's run list it all installed fine, but then I found that Elasticsearch was not running. When I tried running it manually it just said "Killed" and exited. This had me scratching my head for quite a while, but I finally found the solution.
In some of the official examples they include the following in the Chef node:
"elasticsearch": { "bootstrap.mlockall": true }
It's not explained what this does, but the template config YAML file says it prevents the JVM's memory from being swapped out, since swapping causes Elasticsearch to perform badly. Fair enough. However, on a virtual machine that has very little memory it can mean that the JVM doesn't have enough memory to run, so it crashes. True is the default value, so it's not enough to simply leave this setting out; you have to set it to false explicitly.
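For a low-memory VM, the override would look something like this. It is only a sketch that reuses the attribute key from the official example above, wrapped in braces so it is valid JSON on its own; where exactly it sits in your node or role definition depends on your setup.

```json
{
  "elasticsearch": {
    "bootstrap.mlockall": false
  }
}
```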
Once I got that working my first node had Elasticsearch running and all was well. Then I started up my second node but I couldn't get it to form a cluster with the first.
As per the documentation, I had given them both the same cluster_name. Our servers are spread across different networks, so I couldn't use the default multicast option for discovery. Instead I added the FQDNs of each node to the unicast list:
"elasticsearch": { "discovery.zen.ping.multicast.enabled": false, "discovery.zen.ping.unicast.hosts": "[\"node1.example.com\", \"node2.example.com\"]" }
Each node has a host entry for every other node, and they could telnet to each other on the Elasticsearch discovery port (9300) just fine, but when the second node started up I got an error like:
[node2[inet[/10.0.2.2:9300]] failed to send join request to master [node1], reason [org.elasticsearch.transport.RemoteTransportException: [node2[inet[/10.0.2.2:9300]][discovery/zen/join]; org.elasticsearch.ElasticSearchIllegalStateException: Node [node2[inet[/10.0.2.2:9300]] not master for join request from [node2[inet[/10.0.2.2:9300]]
Huh? Why was node2 trying to connect to node2? It was my colleague who noticed the references to 10.0.2.* IPs where we would've expected 192.168.33.* IPs. It turns out that Vagrant always puts its NAT adapter on eth0, and Elasticsearch was binding to that adapter's IP by default. You can override this with the network.host config:
"elasticsearch": { "network.host": "192.168.33.1" }
Once I'd done that for each node (with their respective IPs) the cluster started working.
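Putting the three fixes together, the Elasticsearch attributes for a single Vagrant node might end up looking roughly like the following. This is a sketch that only combines the keys shown earlier in this post; the network.host value is the example address from above and has to be replaced with each node's own private-network IP, and the cluster name setting is left out here.

```json
{
  "elasticsearch": {
    "bootstrap.mlockall": false,
    "discovery.zen.ping.multicast.enabled": false,
    "discovery.zen.ping.unicast.hosts": "[\"node1.example.com\", \"node2.example.com\"]",
    "network.host": "192.168.33.1"
  }
}
```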