Elasticsearch is a nice tool: robust and nicely crafted. Powerful, but not magic. Hit it hard enough and you can hurt it, and when you are an admin, your users' sole occupation seems to be breaking things, creatively.
Elasticsearch provides features for developers and administrators. Marvel is a nice tool for monitoring an Elasticsearch cluster, but it won't tell you about abuses. Don't worry: Elasticsearch is just yet another HTTP server, and the spying tools already exist.
Packetbeat is a monitoring tool built on the ELK stack, with new panels for Kibana. The inconspicuous agent is written in Go and monitors several protocols (HTTP, MySQL, PostgreSQL and Redis) with pcap. The agent runs alongside its target and just spies on it (so you will have to find a spot without TLS). Packetbeat can send the captured events to an Elasticsearch cluster or, for more fun, to a Redis server, one of the brokers usable with Logstash.
Elasticsearch uses Lucene for storing data (and a lot of other tasks). Lucene indices are immutable: when you want to add documents to an index, you write new segments and merge them. The real merge policy is a bit more complex than that. To remove documents, you first write a list of dead documents, which are dropped during the next merge. But who removes documents from an index?
Adding documents is expensive. To minimize this cost, add documents in batches, not one by one.
And it works well: Elasticsearch is now also a time-series database (with Kibana as its UI).
The official technical term is bulk.
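The bulk format itself is plain NDJSON: one action line, then the document source. A minimal sketch of building such a body (the helper name and index name are mine, not an official API; older Elasticsearch versions also want a "_type" in the action line):

```python
import json

def bulk_body(index, docs):
    """Build an NDJSON _bulk request body: one action line,
    then one source line, per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # the bulk API requires a trailing newline
    return "\n".join(lines) + "\n"

# POST the result to http://<cluster>:9200/_bulk
```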
Sometimes you fumble your bulk, or Logstash does.
With the cat API for indices, you can spot huge merges. Try querying _cat/indices with an explicit list of columns.
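A possible query, sketched in Python (the cluster address is an assumption; the column names come from the _cat documentation):

```python
from urllib.request import urlopen

# Columns picked for merge-abuse hunting; search.query_time is for comparison.
COLUMNS = ["index", "merges.total", "merges.total_size",
           "store.size", "merges.total_time", "search.query_time"]

def cat_indices_url(host="http://localhost:9200"):
    """Build a _cat/indices URL with explicit headers (?v prints them)."""
    return "%s/_cat/indices?v&h=%s" % (host, ",".join(COLUMNS))

# print(urlopen(cat_indices_url()).read().decode())  # against a live cluster
```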
Good luck with the abbreviated column names. I pick: index, merge total, merge total size, store size, merge total time, and search query time. The last one is just for comparison. Be careful: cumulative counters are reset when the cluster restarts. By the way, the documentation for column names is on the cat/nodes page, but _cat/indices also helps.
If your merge total size is 20 times the store size in less than a month, bingo: you are grinding your hard drive.
White box analysis
Elasticsearch doesn't provide an extended log format à la Apache, and that's a good idea: an Elasticsearch cluster can handle a huge number of queries, and you never know what to log before the drama.
Packetbeat is polite: you just have to install libpcap, the agent's only dependency. The agent watches HTTP on port 9200 and sends events to a Redis server. Analysis will be done on another machine.
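A Packetbeat configuration for this setup could look roughly like the sketch below; option names vary between Packetbeat versions, so check the reference config shipped with your agent (the Redis host is an assumption):

```yaml
interfaces:
  device: any        # sniff with pcap

protocols:
  http:
    ports: [9200]    # Elasticsearch HTTP traffic

output:
  redis:
    enabled: true
    host: "redis.example.com"   # the broker the analysis machine reads from
    port: 6379
```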
Finding bad bulks is easy: responses are far smaller than requests and contain the information you need. Just throw away events whose URI doesn't contain _bulk, parse the JSON, and count the number of actions per bulk.
Be curious: also read the errors raised during the bulk insert, for each action.
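Counting actions in a captured bulk body, and reading per-action errors from the response, could look like this sketch (helper names are mine; depending on the Elasticsearch version, the error field is a string or an object):

```python
import json

def count_bulk_actions(body):
    """Count actions in an NDJSON bulk request body.

    Each action is one metadata line ({"index": ...}, {"delete": ...}, ...);
    index/create/update are followed by a source line, delete is not.
    """
    actions = 0
    lines = iter(body.strip().splitlines())
    for line in lines:
        meta = json.loads(line)
        op = next(iter(meta))
        actions += 1
        if op != "delete":
            next(lines)  # skip the source document line
    return actions

def bulk_errors(response_body):
    """Yield (operation, error) pairs from a bulk response."""
    resp = json.loads(response_body)
    for item in resp.get("items", []):
        for op, result in item.items():
            if "error" in result:
                yield op, result["error"]
```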
The standard bulk size is 100; run some tests to find the optimal size for your cluster and your usage.
Logstash's elasticsearch_http output plugin has some hidden options. Read the fine print.
flush_size is easy to find; 100 is a usable value.
idle_flush_time can be a trap. The default of 1 second is sneaky.
The default behavior is to buffer at most 100 events and flush them every second.
If you handle only a few events per minute (like monitoring tools do), you are doomed: the buffer is flushed too early, and the promised 100 becomes fewer than 10, often just 1.
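Raising idle_flush_time trades a little latency for fuller bulks; a sketch of the output section (host and values are illustrative):

```
output {
  elasticsearch_http {
    host => "localhost"
    flush_size => 100
    idle_flush_time => 60   # wait up to a minute for the buffer to fill
  }
}
```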
Elasticstat is a Python framework for exploring strange Elasticsearch behaviors.