A journey through your HTTP logs.

Your web server has many things to tell you. How many people talk to it? Is it fast enough, large enough? And by people, do you mean robots or humans?

First there were AWStats and its friends, the static log analyzers. Then came Google Analytics and its clones (like Piwik). Live analyzers are nice and useful, but they lie: they only analyze what real users do in a browser. They see neither RSS, nor REST, nor bot traffic. That is great for commercial analysis, but bad for performance and usage analysis. You can spy on your application with statsd or, more simply, dig into your Apache logs. Good news: Nginx logs are similar.
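To make "digging into your Apache logs" concrete, here is a minimal sketch that turns one line of the usual "combined" log format into a dictionary with a real timestamp. The regular expression, the field names and the sample line are assumptions made for illustration; a custom LogFormat directive would need a different pattern.

```python
import re
from datetime import datetime

# Pattern for the "combined" log format used by default in many Apache
# and Nginx setups. Field names are illustrative, not a standard.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict describing one hit, or None if the line does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    hit = match.groupdict()
    # Example timestamp: 10/Oct/2013:13:55:36 +0200
    hit["time"] = datetime.strptime(hit["time"], "%d/%b/%Y:%H:%M:%S %z")
    hit["status"] = int(hit["status"])
    # The size is logged as "-" when no body was sent.
    hit["size"] = 0 if hit["size"] == "-" else int(hit["size"])
    return hit


# A made-up log line, using a documentation-only IP address.
sample = ('203.0.113.7 - - [10/Oct/2013:13:55:36 +0200] '
          '"GET /blog/ HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (X11; Linux x86_64)"')
hit = parse_line(sample)
```

Each resulting dictionary is already "a JSON object with a timestamp" once the datetime is serialized, for instance with `hit["time"].isoformat()`.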

Logstash is the capital ship of log analysis. Kibana is its dashboard. Things are simple: you push your own data to the index behind Kibana. It's just JSON objects, anything with a timestamp. You can naively push one object per HTTP connection. You'll get a cute report with information about bandwidth usage: just another AWStats, twenty years later.

You can mix the old and the new fashion with session indexing. Naive sessions are simple: a session is a collection of web connections from the same user agent and IP. When there is a hole of more than 15 minutes, it's a new session. Standard HTTP logs lack some information, such as a random session token to tell everybody apart. You will merge some users hidden behind the same NAT, but it doesn't matter.

Let's add some real-world information. User agent parsing is a curse, but you can guess whether it's an honest bot or a web browser. ua-parser provides a big list of regular expressions, and then some, in almost every language on Earth. Converting an IP to geographic information is just as easy with pygeoip.

Let's add some metrics to each session: a per-second histogram of hits to watch speed, plus the duration and the total number of hits. Don't count media assets, just potentially dynamic web pages. That's what Poteau does: it counts things and pushes them to a bare Kibana.
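The naive session rule described above can be sketched in a few lines. This is not Poteau's actual code, only an illustration under stated assumptions: hits are dictionaries with hypothetical `ip`, `agent`, `time` and `path` keys, assets are recognized by a short hand-picked list of file extensions, and only the duration and hit count are tracked (no per-second histogram).

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=15)
# Assumption: a hand-picked list of static asset extensions.
ASSET_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".gif", ".ico")


def is_asset(path):
    """True when the path (query string ignored) looks like a static asset."""
    return path.lower().split("?", 1)[0].endswith(ASSET_EXTENSIONS)


def sessionize(hits):
    """Group hits by (ip, agent); a hole of more than 15 minutes starts a new session."""
    by_visitor = defaultdict(list)
    for hit in sorted(hits, key=lambda h: h["time"]):
        if is_asset(hit["path"]):
            continue  # only count potentially dynamic pages
        sessions = by_visitor[(hit["ip"], hit["agent"])]
        if sessions and hit["time"] - sessions[-1]["end"] <= SESSION_GAP:
            current = sessions[-1]
        else:
            current = {"ip": hit["ip"], "agent": hit["agent"],
                       "start": hit["time"], "end": hit["time"], "hits": 0}
            sessions.append(current)
        current["end"] = hit["time"]
        current["hits"] += 1
    result = [s for sessions in by_visitor.values() for s in sessions]
    for session in result:
        session["duration"] = (session["end"] - session["start"]).total_seconds()
    return result


# Made-up hits: visitor A comes back after a 20 minute hole, visitor B reads a feed.
t0 = datetime(2013, 10, 10, 13, 0, 0)
hits = [
    {"ip": "203.0.113.7", "agent": "A", "time": t0, "path": "/"},
    {"ip": "203.0.113.7", "agent": "A", "time": t0 + timedelta(seconds=1), "path": "/style.css"},
    {"ip": "203.0.113.7", "agent": "A", "time": t0 + timedelta(minutes=5), "path": "/blog"},
    {"ip": "203.0.113.7", "agent": "A", "time": t0 + timedelta(minutes=25), "path": "/about"},
    {"ip": "198.51.100.2", "agent": "B", "time": t0 + timedelta(minutes=2), "path": "/feed"},
]
sessions = sessionize(hits)
```

With this sample, visitor A ends up with two sessions and visitor B with one. Keying on the (IP, user agent) pair is exactly what merges users behind the same NAT when they share a browser version.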

Each website is unique and behaves differently, and some surprising things can happen. Half of the sessions come from bots: half of those from Google (crawling as a desktop computer and as a smartphone), 25% from Baidu, and the rest from the others (Bing or worse). Bots are polite: they drive slowly, but constantly. Monitoring bots break my cute graphs with their never-ending sessions. Bots don't own a watch; unlike humans, they work every hour of the day. Kibana is a promising tool, but it's better to come prepared. Don't push raw logs: they lie.

