I had to handle a two-month-old issue related to ElasticSearch. A feature I was not familiar with, in one of our large monolithic applications at iAdvize, had stopped working. Instead of digging straight into thousands of lines of code, the fastest way to resolve it was to gather information about the issue from outside the app. I had to answer these questions:
- Is the ElasticSearch server remotely accessible from the frontend servers?
  - If it is not, the first thing to do is to restore the connection between the two.
- Is the ElasticSearch request valid?
  - If it is not, maybe:
    - we are sending the request to the wrong index/alias?
    - the query body is malformed?
- Does it yield results?
  - If it does not
    - but we should get some, then either the query is invalid or ElasticSearch has an issue
  - If it does
    - but from the app's point of view we don't get anything,
    - then it's related to the app itself and we will have to look at the code
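The same questions can also be asked directly with curl from a frontend server. The sketch below is only an illustration: the `ES_URL` default, the function names and the example alias are assumptions, and the status codes it maps are the ones ElasticSearch typically returns (404 for an unknown index or alias, 400 for a body it cannot parse).

```shell
#!/bin/sh
# Assumption: adjust this to wherever your ElasticSearch server listens.
ES_URL="${ES_URL:-http://elasticsearch.domain.com:9200}"

# Map an HTTP status code to one of the diagnoses from the checklist above.
diagnose_status() {
  case "$1" in
    000) echo "unreachable: restore the connection first" ;;
    2??) echo "ok: request accepted" ;;
    404) echo "missing index or alias" ;;
    400) echo "malformed query body" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Print only the HTTP status of a search request.
# curl prints 000 when it cannot connect at all.
# usage: search_status <index-or-alias> <json-body>
search_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
    -H 'Content-Type: application/json' \
    -X POST "$ES_URL/$1/_search" -d "$2"
}

# Example (hypothetical alias name):
# diagnose_status "$(search_status plugin-xxx-log '{"query":{"match_all":{}}}')"
```

This only covers what can be seen from the outside; if the status is fine and hits come back, the problem is in the app.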
To verify the first 3 points in one shot, I started tcpdump on the ElasticSearch server using the command below:
tcpdump -A -nn -s 0 'tcp dst port 9200 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)' -i eth1
Note: don't forget to change the interface (-i) to the one you want to listen on.
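The arithmetic in the capture filter keeps only packets that carry a payload: it takes the IP total length (`ip[2:2]`), subtracts the IP header length (`(ip[0]&0xf)<<2`) and the TCP header length (`(tcp[12]&0xf0)>>2`), and checks that the result is not zero. Here is the same computation as a small shell sketch; the function name and the sample byte values are made up for illustration.

```shell
#!/bin/sh
# usage: payload_len <ip-total-length> <ip-byte-0> <tcp-byte-12>
payload_len() {
  # Low nibble of IP byte 0 is the header length in 32-bit words; <<2 turns it into bytes.
  ip_hdr=$(( ($2 & 0xf) << 2 ))
  # High nibble of TCP byte 12 is the data offset in 32-bit words; >>2 turns it into bytes.
  tcp_hdr=$(( ($3 & 0xf0) >> 2 ))
  echo $(( $1 - ip_hdr - tcp_hdr ))
}

payload_len 52 0x45 0x80    # bare ACK: 52 - 20 - 32 = 0, dropped by the filter
payload_len 180 0x45 0x80   # 180 - 20 - 32 = 128 bytes of HTTP payload, kept
```

That is why the capture shows the ElasticSearch requests themselves and not the empty ACK and handshake packets.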
Looking at the result, I discovered that the app was sending its ElasticSearch query to an alias that did not exist. That explains it! Once the alias was created, the application feature was working again in production.
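For reference, an alias can be created by hand through the ElasticSearch `_aliases` endpoint. The sketch below is only one way to do it: the helper name, the `ES_URL` variable and the `plugin-xxx-log-*` index pattern are assumptions based on the names used in this post, so adapt them to your own indices.

```shell
#!/bin/sh
# Build the JSON body for an "add alias" action.
# usage: alias_add_body <index-or-pattern> <alias>
alias_add_body() {
  printf '{"actions":[{"add":{"index":"%s","alias":"%s"}}]}' "$1" "$2"
}

# Example (assumed host and names, not run here):
# curl -s -X POST "$ES_URL/_aliases" \
#   -H 'Content-Type: application/json' \
#   -d "$(alias_add_body 'plugin-xxx-log-*' plugin-xxx-log)"
```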
The final step was to set up Jenkins (or Rundeck) to run elastic/curator daily in order to refresh the alias:
docker run -it --rm bobrik/curator:3.5.1 --host "elasticsearch.domain.com" --port 80 alias --name plugin-xxx-log indices --prefix plugin-xxx-log- --prefix plugin-salesforce-log- --timestring %Y.%m --time-unit months
Now we will be proactively alerted if anything goes wrong. One less thing to worry about!
[Update] I now also use tcpflow to better display content (too bad it's not maintained anymore):
tcpdump -A -l -nn -s 0 'tcp port 9200' -i eth0 | tcpflow -c -e