Blog/Structure your logs


This is a blog post. If you want to comment on this blog post, please mention me on Twitter @rightfold. I will set up proper comment functionality later.

This blog post was migrated from my old WordPress blog. If you want to link to this blog post, please link to this page instead of to the WordPress blog.

When data has no well-defined structure, extracting useful information from it proves difficult. The most advanced natural language parser to date, SyntaxNet, has a failure rate as high as 10%. You do not want to analyse your data with such a parser if you could have structured it in the first place. This is why most data is kept in a structured form: relational databases, associative arrays, etc.

Log messages are no different. You log events because when something bad happens you want to know why, you want to know how often it happened before, and you want to derive statistics from it. If your log messages are messy blobs of English text, you cannot do this efficiently. You will try to search the log file, and you will either miss log messages or match log messages you didn’t mean to match.

Instead, you want to keep your log messages structured. The way I do this is as follows:

  • Each type of event has its own unique identifier. For example, when a user logs in, the log message has the identifier “logged in”. This way you can quickly find all log messages generated by this type of event, by simply searching the log file for log messages with the identifier “logged in”.
  • Each log message has an associative array that contains extra information. For example, when a user with user ID 1 and IP address logs in, the log message has the associative array {"id": "1", "ip": ""}. All log messages with the same identifier have the same keys in their associative arrays. The advantage is that you can filter and aggregate on those keys without handling special cases per message.
  • Log messages are saved in an unambiguous format. Do not just concatenate the identifier and all the associative array fields with spaces. Use a format that a computer can understand.
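
The three rules above can be sketched in Python. This is only a minimal sketch under some assumptions: the `SCHEMAS` table and the `format_event` helper are hypothetical names, JSON is just one possible unambiguous format, and the IP address is a placeholder from the documentation range.

```python
import json

# Hypothetical schema table: every event identifier maps to the exact
# set of keys its associative array must contain.
SCHEMAS = {
    "logged in": {"id", "ip"},
    "logged out": {"id"},
}


def format_event(identifier, fields):
    """Serialize one log message as a single line of JSON."""
    expected = SCHEMAS.get(identifier)
    if expected is None:
        raise ValueError(f"unknown event identifier: {identifier!r}")
    if set(fields) != expected:
        raise ValueError(
            f"{identifier!r} requires keys {sorted(expected)}, got {sorted(fields)}"
        )
    # sort_keys keeps the output stable, so identical events produce
    # identical lines.
    return json.dumps({"event": identifier, "fields": fields}, sort_keys=True)


# 192.0.2.1 is a placeholder address, not a value from a real log.
line = format_event("logged in", {"id": "1", "ip": "192.0.2.1"})
print(line)
```

Rejecting a message whose keys do not match its identifier is a design choice of this sketch. It keeps the "same identifier, same keys" rule from silently eroding over time.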

I store my logs in PostgreSQL databases, because this lets me index them and search through them quickly, and because it provides ACID guarantees. For the associative arrays, I use the hstore extension and a GIN index. You could just use a text file, though. Or protocol buffers. Anyway: please keep your data structured when you can. English is lossy; do not throw away your data if you do not have to.
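
For the plain text file option, here is a rough Python sketch of how structured lines make counting and searching exact. It assumes one JSON object per line with hypothetical `event` and `fields` keys; the helper names and the sample lines are made up for illustration, and this is not the PostgreSQL setup described above.

```python
import json
from collections import Counter


def parse_lines(lines):
    """Yield one dict per non-empty line of JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_event(records):
    """Count how often each event identifier occurs."""
    return Counter(record["event"] for record in records)


def find(records, identifier, **fields):
    """Return records with this identifier whose given fields match exactly."""
    return [
        r for r in records
        if r["event"] == identifier
        and all(r["fields"].get(k) == v for k, v in fields.items())
    ]


# Hypothetical log contents; in practice these lines would come from a file.
sample = [
    '{"event": "logged in", "fields": {"id": "1", "ip": "192.0.2.1"}}',
    '{"event": "logged in", "fields": {"id": "2", "ip": "192.0.2.7"}}',
    '{"event": "logged out", "fields": {"id": "1"}}',
]
records = list(parse_lines(sample))
print(count_by_event(records)["logged in"])
print(find(records, "logged in", id="1"))
```

Because the identifier and the fields are compared as values rather than matched as substrings, a search for "logged in" cannot accidentally match some other message that merely contains those words.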