Problem Description:
ES cluster pods are crash looping, with the following state in the pod's describe output:
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 143
Ready: False
Restart Count: 7
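The state above and the log snippet below can be pulled with commands like the following (a sketch, assuming the instabase namespace and es-cluster container name used later in this runbook):
kubectl -n instabase describe pod statefulset-es-cluster-0
kubectl -n instabase logs statefulset-es-cluster-0 -c es-cluster --previous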
The error in the logs shows the following stack trace (trimmed to fit):
{"type": "server", "timestamp": "2023-07-03T09:15:52,578Z", "level": "WARN", "component": "o.e.i.e.Engine", "cluster.name": "es-cluster", "node.name": "statefulset-es-cluster-0", "message": " [auditlog-00001][0] failed engine [failed to recover from translog]", "cluster.uuid": "gsyV2x6QQ3CH1HR7C3KjVQ", "node.id": "5x_n1C52T7mR3B1Xu5Qlzw" ,
"stacktrace": ["org.elasticsearch.index.engine.EngineException: failed to recover from translog",
"at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:510) ~[elasticsearch-7.16.2.jar:7.16.2]",
"at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:483)
...............
"at java.lang.Thread.run(Thread.java:833) [?:?]",
"Caused by: org.elasticsearch.index.shard.IllegalIndexShardStateException: CurrentState[CLOSED] operation only allowed when recovering, origin [LOCAL_TRANSLOG_RECOVERY]",......
............
"at org.elasticsearch.index.shard.IndexShard.lambda$openEngineAndRecoverFromTranslog$10(IndexShard.java:2008) ~[elasticsearch-7.16.2.jar:7.16.2]",
"at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:508) ~[elasticsearch-7.16.2.jar:7.16.2]",
"... 14 more"] }
Steps to troubleshoot:
The following message in the logs identifies which shard/index has the problem and could not be recovered from the translog:
"message": " [auditlog-00001][0] failed engine [failed to recover from translog]"
- Extract the URL and auth details from the es-cluster YAML; you will find them in the --es.uri argument. Use them to form a curl command to the indices endpoint (refer to the example in the third step below).
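A minimal sketch of pulling that argument, assuming the StatefulSet is named statefulset-es-cluster and lives in the instabase namespace (adjust both to your environment):
kubectl -n instabase get statefulset statefulset-es-cluster -o yaml | grep 'es.uri'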
- Exec into the statefulset-es-cluster-0 pod.
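For example (assuming the instabase namespace and the es-cluster container name used elsewhere in this runbook):
kubectl -n instabase exec -it statefulset-es-cluster-0 -c es-cluster -- bash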
- Curl http://localhost:9200/_cat/indices/auditlog-00001?v=true. Note that auditlog-00001 is the index name we got from the message above. The output looks like this:
ibuser@statefulset-es-cluster-0:~$ curl -kvv -u elastic:<password> http://localhost:9200/_cat/indices/auditlog-00001?v=true
* Trying 127.0.0.1:9200...
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open auditlog-00001 kAXOglGYQw2iHcQ0d7ppYQ 1 1 3310756 0 892mb 892mb
* Connection #0 to host localhost left intact
- You will need the UUID to locate the index on disk; this output also gives you the size and other details of the index for the discussion.
- From the ES cluster standpoint, we have the option to truncate, delete, or remove the bad shard data from the translog using a command like the one below:
ibuser@statefulset-es-cluster-0:~/bin$ elasticsearch-shard remove-corrupted-data --index auditlog-00001 --truncate-clean-translog
But the catch is that this requires the ES server to be down, which we cannot do here, since stopping it would throw the user out of the container.
- We could also restore from a snapshot using the ES snapshot restore API, but that is not a possibility either, since snapshots are not enabled on this ES cluster.
- That leaves us with one option: copy the index locally and run the core Lucene index fixer (CheckIndex) against it.
- Since you have already determined the UUID of the index, it will help you locate the path. The es-storage mount writes to /data, and all ES indices and segments are stored under /data/data/*. So given that path and the UUID, the specific index path resolves to:
/data/data/nodes/0/indices/kAXOglGYQw2iHcQ0d7ppYQ
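You can sanity-check that path from inside the pod; in a typical ES 7.x data layout each shard directory under the index UUID contains _state, index, and translog subdirectories (a sketch, the UUID and shard number will differ in your environment):
ibuser@statefulset-es-cluster-0:~$ ls /data/data/nodes/0/indices/kAXOglGYQw2iHcQ0d7ppYQ/0/
_state  index  translog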
- Use the following command to copy the index, its segments, and its translog files locally:
kubectl cp instabase/statefulset-es-cluster-0:/data/data/nodes/0/indices/kAXOglGYQw2iHcQ0d7ppYQ/ . -c es-cluster
Note: kAXOglGYQw2iHcQ0d7ppYQ is the index UUID in our case; it will be completely different in your environment.
- Confirm that you copied all the files from statefulset-es-cluster-0:/data/data/nodes/0/indices/kAXOglGYQw2iHcQ0d7ppYQ/ directory to your local current directory.
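One way to verify the copy is to compare file counts on both sides (a sketch; replace <local-copy-dir> with wherever kubectl cp placed the files):
kubectl -n instabase exec statefulset-es-cluster-0 -c es-cluster -- find /data/data/nodes/0/indices/kAXOglGYQw2iHcQ0d7ppYQ -type f | wc -l
find <local-copy-dir> -type f | wc -l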
- Download the lucene-backward-codecs-9.7.0.jar and lucene-core-9.7.0.jar files from the web, choosing versions that support the current ES version, or use the attached ones. These are required to run the CheckIndex utility.
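For example, the jars can be fetched from Maven Central (standard Maven-layout URLs; verify the versions against your ES/Lucene versions before use):
curl -LO https://repo1.maven.org/maven2/org/apache/lucene/lucene-core/9.7.0/lucene-core-9.7.0.jar
curl -LO https://repo1.maven.org/maven2/org/apache/lucene/lucene-backward-codecs/9.7.0/lucene-backward-codecs-9.7.0.jar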
- Execute the following command to check whether there is any problem with your index; point it at the copied shard's Lucene index directory (here /tmp/0/index/, i.e. shard 0 of the copied data). The output of this command and of the -exorcise command below should look like the sample output at the end (trimmed to fit here):
java -cp lucene-core-9.7.0.jar:lucene-backward-codecs-9.7.0.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /tmp/0/index/
Note: Before you execute the next command to fix the index, please read the following warning, which is from the utility itself:
**WARNING**: -exorcise *LOSES DATA*. This should only be used on an emergency basis as it will cause documents (perhaps many) to be permanently removed from the index. Always make
a backup copy of your index before running this! Do not run this tool on an index that is actively being written to. You have been warned!
- Execute the following command to fix the problematic indices:
java -cp lucene-core-9.7.0.jar:lucene-backward-codecs-9.7.0.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex -exorcise /tmp/0/index/
The output would look like the following:
.....
.....
26 of 26: name=_13k maxDoc=1348150
version=8.10.1
id=91vo2qqmwbjx20dbve67219i4
codec=Lucene87
compound=false
numFiles=17
size (MB)=361.263
diagnostics = {os=Linux, java.runtime.version=17.0.1+12, mergeFactor=10, os.arch=amd64, source=merge, java.vendor=Eclipse Adoptium, os.version=5.4.209-116.367.amzn2.x86_64, mergeMaxNumSegments=-1, java.version=17.0.1, java.vm.version=17.0.1+12, lucene.version=8.10.1, timestamp=1681852502704}
no deletions
test: open reader.........OK [took 0.112 sec]
test: check integrity.....OK [took 3.298 sec]
test: check live docs.....OK [took 0.000 sec]
test: field infos.........OK [208 fields] [took 0.000 sec]
test: field norms.........OK [76 fields] [took 0.331 sec]
test: terms, freq, prox...OK [5626251 terms; 49071371 terms/docs pairs; 49200065 tokens] [took 9.171 sec]
test: stored fields.......OK [2697863 total field count; avg 2.0 fields per doc] [took 0.633 sec]
test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
test: docvalues...........OK [129 docvalues fields; 0 BINARY; 3 NUMERIC; 0 SORTED; 50 SORTED_NUMERIC; 76 SORTED_SET] [took 2.229 sec]
test: points..............OK [20 fields, 2709267 points] [took 0.109 sec]
test: vectors.............OK [0 fields, 0 vectors] [took 0.000 sec]
No problems were detected with this index.
Took 16.174 sec total.
- Once the indexes are fixed, copy them back to overwrite the existing ones.
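A sketch of the reverse copy, assuming the local copy sits in ./kAXOglGYQw2iHcQ0d7ppYQ and using the same namespace, container, and index UUID as above (the pod will need to restart afterwards so ES re-opens the shard):
kubectl cp ./kAXOglGYQw2iHcQ0d7ppYQ instabase/statefulset-es-cluster-0:/data/data/nodes/0/indices/kAXOglGYQw2iHcQ0d7ppYQ -c es-cluster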
Learnings: kubectl cp can be replaced by temporarily mounting the volume on an EC2 or network host (if that is an option), where the files are accessible even without the ES cluster running and can be corrected in place (provided Java is on the path for the command above). In that case, the elasticsearch-shard utility could also be used.