Strategies to Log & Retain Activity Data

My previous article The Need to Log & Retain Activity Data argued the very need of logging & retaining data. In this post, I am listing out various logging strategies along with some brief explanation, utility, associated constraints and effectiveness of each method. As highlighted before, most people fail to understand the difference between logs/traces, audit trails and database time-stamps. Each of Log/Trace, Audit Trail & Timestamping has its purpose, pros and cons.

Log or Trace

When I think of a log, the first thing that comes to my mind is a trace consisting of developer injected SOPs (SysOuts), messages/exceptions generated by the server or any third-party component used. This trace could be written to a flat file or a database table.

Example:

2009-31-12 23:59:59 ::: LoginServlet >>> john.doe >>> Incorrect Password
192.168.10.101 – 10/Nov/09:13:55:36 -0700 “GET /logo.gif HTTP/1.0” 200 2326
instantiated Bean: com.detangle.ejbs.whatever
Java.Lang.NullPointerException at …..
Connected to ProductionDB: Saved record #862
Executed Query: INSERT INTO SUPPLIERS… : 1 row affected
inside getSuppliersForCategory: Category = “Laptops”

Comments:
Being the most detailed trace for system activity, user activity, exceptions, database operations, it is likely to generate thousands of lines per hour in Multi-tenant SaaS hosting. Such logging should surely be done for various reasons, but is not a real information source for administrators.

#Think: Imagine an auditor asking for all the changes to the ‘Amount’ field of Purchase Order #102365 for customer ABC; can you query the trace to generate this report?

If you would agree, it is not the easiest way for an administrator to trace user activity. Thus, writing enough SOPs does not imply that you have great audit capability.

Audit Trail or Change History

Audit trail is nothing but change history at some grain (explained below), accompanied by timestamps and user information. There is thus a table containing the current values for a record, and another history table that maintains versions for each record in the main table. This is the same as ‘Type 2 change’ as understood by Data-warehousing professionals.

Example:
By grain I mean that we could either add one version row in the history table per changed field or per changed record. Let’s understand this:

Main table:

ID (PK) Name Designation
4052 Peter Manager

The day Peter is promoted to ‘Sr Manager’, a Type1 change would be applied to the Main Table and the designation over-ridden.

To record history of Peter’s past designation(s) and the date of promotion, changes would be recorded in the history table with one of the two schemas (and thus the grain set at field or record):

History Table (Grain: Field):

ID (PK) Version Field Old_Val New_Val Date User
4052 1 Designation Manager Sr Manager 2009-11-10… HR_Admin

History Table (Grain: Record):

ID (PK) Version Designation Archived_On User
4052 1 Manager 2009-11-10… HR_Admin

Comments:
Managing versions of records is more of an overhead, unless the end-user or administrator would really want to revert to a previous version or track changes on a frequent basis. ERP databases carry this overhead, but with the advantage of being able to generate any sort of audit report. Without such logging, you could be inviting a lot of trouble from one of the four biggies! Overheads can be minimized by building this mechanism right into the persistence framework.

Time-stamping

During my (good-old) development days, no matter whether I created a table in MS-Access or MySQL, I always started off with a key field (PK, Identity, and Auto-Increment OR FK), a couple of timestamps and a flag to indicate row validity.
With the latest generation of persistence & ORM frameworks, this kind of comes free. But the framework will manage these fields, provided the developer opts in.

Example:
Continuing with the example from my last post, when Darth assigned Malory as Trudy’s approver in the leave management system, Bob programmed to create the following record:

Establishing the relation (in a factless fact table) without any additional information:

Employee Manager
Trudy Malory

David wished that some more information be captured; something like…

Establishing the relation with a time-stamp, validity & login:

Employee Manager Date isValid Login
Trudy Malory 2009-11-10… false Darth

Soft-Delete: Did you notice the ‘isValid’ above and something mentioned about row validity? The flag allows you to perform a logical- or a soft-delete. This way the record is not physically removed, but marked invalid. Transactions should become squatters in the database, and live there forever. I believe that timestamps should always be implemented.

Comments:
Timestamps will always act as a starting point for investigation. It gives an immediate insight into the ownership of the record.

#Think: Even in the absence of the ‘User_Login’ field, if only David had the timestamp, he would only have to go through the 100 or fewer entries generated during that fraction of a second! Makes sense to you?

Few Links:

Logging as required by US regulation/law:
A bit over-engineered, yet comprehensive design

Common Log Format
Extended Log Format