Actionable Logging for Smoother Operation and Faster Recovery
Notes from Velocity.
Next session is Actionable Logging for Smoother Operation and Faster Recovery by Mandi Walls (AOL).
Actionable logging:
- no nonsense logging
- concise, easy to understand
- express symptoms of production issues
- anything that makes it into the log is something that needs to be fixed
- expending resources on production systems
- the point of logging in production
- diagnosis of issues
- the 4am test
- diagnosis and recovery
- statistics and monitoring
- provide insight into the behavior of the application
- indicate potential issues, and areas for improvement
- not the same goals as development and QA environments.
- access log
- server log, e.g. catalina.out
- application logs
- special use logs for recording specific groups of activities
- where logs are located on the system should be predictable and obvious
- it may be helpful to locate logs on different disk partitions but link them back to the app.
- keep older logs in an obvious place. Having a couple of days' worth of logs helps you converge on the problem.
- everyone has their own method
- roll logs into files with timestamps
- host-01.log.003 vs host-01.log.06202008
- roll all the logs at the same time for a given app to make coordination of events easier
- roll when the app needs logs rolled: hourly, daily, weekly.
- logs should be expressive but not overly verbose
- keys to making logs more actionable:
- appropriate formats
- Timestamping: what not to do:
- 1213988938:tvdata shows/617/306
- 1213988939:tvdata shows/618/307
- Timestamps that mean something:
- Jun 19,2009 4:20:25PM
- Good timestamps give context for linking to external events like network outages or traffic anomalies
- creating a common format for multiple products and log types.
- limiting the number of log entries that write to multiple log lines for faster parsing
- deciding how much is too much information
- bad: messages with only numerical error codes in them
- bad: debugging messages in production logs.
- bad: log messages that give you nothing to go on.
- bad: multi-line error messages.
- good: module_id with descriptive message
- don't list non-fatal messages as fatal
- incorrect / misleading severity
- not logging anything for fatal errors
- log at the first point an error is encountered: don't log a timeout to a backend as a parse error on the data expected from the request
- messages include the method name and key variables to speed up fixes
- suppress anything in the log that isn't actionable
- actively managing, parsing, and pruning logs makes new errors more obvious
- check the logs after every install for new messages
- keep out of the logs: usernames, passwords, database logins
- provide crib notes for anyone getting into your system.
- a server log with more than 25% of the number of access log entries is a hindrance. Even 10% may be too much in most environments
- if you log more in the server log than in the access log, then it's a problem
- the log is the first line of information when a problem occurs
- a production log should be focused on providing information to operations staff, not developers.
- when, where and how messages are logged can help or hinder recovery
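
Several of the points above (rolling logs into date-stamped files, readable timestamps, one line per entry, module and method names in each message, no debug output in production) can be sketched with Python's standard logging module. This is only an illustration and not something shown in the talk. The names build_logger and OneLineFormatter, the seven-file retention, and the WARNING threshold are all assumptions picked for the example.

```python
import logging
import logging.handlers


class OneLineFormatter(logging.Formatter):
    """Hypothetical helper: collapse multi-line entries (including
    tracebacks) onto a single line so the log is faster to parse."""

    def format(self, record):
        return super().format(record).replace("\n", " | ")


def build_logger(name, log_path):
    """Hypothetical helper: return a logger suited to production use."""
    # Roll at midnight. Rolled files get a date suffix by default
    # (app.log.2008-06-20) rather than an opaque counter (app.log.003).
    # Keeping 7 old files is an assumed retention period.
    handler = logging.handlers.TimedRotatingFileHandler(
        log_path, when="midnight", backupCount=7
    )
    handler.setFormatter(
        OneLineFormatter(
            # timestamp, severity, module id, method name, message
            fmt="%(asctime)s %(levelname)s %(name)s.%(funcName)s: %(message)s",
            # human-readable timestamp instead of raw epoch seconds
            datefmt="%b %d, %Y %I:%M:%S%p",
        )
    )
    logger = logging.getLogger(name)
    # Assumed threshold: debug and info messages stay out of production.
    logger.setLevel(logging.WARNING)
    logger.addHandler(handler)
    # Do not pass records on to the root logger's handlers as well.
    logger.propagate = False
    return logger
```

With this setup, a call such as log.error("backend timeout show_id=%s", show_id) made inside a function writes a single line carrying a readable timestamp, the severity, the logger name, the calling function's name, and the key variable. Call build_logger once per logger name, since each call attaches another handler.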
Labels: aol, logging, mandiwalls, operations, velocity, velocity08