Friday, 25 October 2013

Agile Real Time BI with Pentaho - Part I

A change of direction

Back in June I changed direction in my career and moved from being a business architect trying to deliver business benefit at a large media and publishing house to a pre-sales engineer for Pentaho.  I'd used the community edition of Pentaho quite extensively in the past, so now I've moved on to using the Enterprise Edition to show the possibilities of using a modern pluggable business intelligence platform.

One of the interesting aspects of my role is visiting trade events and shows giving live demonstrations, showing how we can solve problems on the fly with our technology.  While most people are in the talks, the rest of us on the trade stands have an opportunity for some downtime.  This also gives a great opportunity to put together some new technical demos.  Last week I was at the Cassandra Summit Europe 2013 in London and took the opportunity of an hour or so's downtime to put together a small technical demo.

Agile Real-time BI

Trying to come up with a sensible test case for real-time BI is often a significant stumbling block, but in this instance I took inspiration from the work that Matthew Somerville did on a live London Underground map.  Transport for London have an API available with various data sources, which is free to register for and access.

TFL Service Status
The choice of technology in this instance is Pentaho Business Analytics, both the BA server and Data Integration.  I'm using the Enterprise Edition in this case, but most of the functionality needed is in the community edition as well.  Most of the data I'm planning on pulling live but where I need data storage I'll use Cassandra.

The objective that I've set myself is to create three report components:
  1. A report similar to the TfL service status report
  2. A drill in detail report showing the arrivals at each station
  3. An analytic view of real-time and historic data.
In this post I'll cover component one, and I'll follow up with components two and three.

The starting point for the service status is an outline architecture.  One method of producing a report component with tight control over the visual appearance is to use the Pentaho Report Designer.  Looking at the options for real-time data acquisition for this component, one option is to use a PDI transformation as a data source.  So that's where we're going to start.

Real-Time data acquisition in PDI

The TfL API can be accessed by an HTTP client and will return XML.  PDI has several client steps: one for web services (SOAP), one for RESTful interfaces and a general-purpose HTTP client.  In this instance I'll use the HTTP client.  The HTTP client on its own will not generate any data rows, so you have to add a step before the look-up to generate at least one row.  In this case it would be useful to know the query time anyway, so I'll use the Get System Info step to get the current time.

This gives us:
The Get System Info step is used to get the system time:
The connection to the TfL API is quite simple, just connect and get the line status:
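For readers without PDI to hand, the seed-row-then-look-up pattern described above can be sketched in plain Python. This is only an illustration of the idea, not how PDI implements it: `seed_rows`, `http_get` and `http_lookup` are hypothetical names, and the URL is whatever line status endpoint you have registered for.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, Iterable, Iterator
from urllib.request import urlopen

Row = Dict[str, object]


def seed_rows() -> Iterator[Row]:
    """Emit exactly one row holding the query time.

    This plays the role of the Get System Info step: without at least
    one incoming row, the look-up below would never fire.
    """
    yield {"query_time": datetime.now(timezone.utc)}


def http_get(url: str) -> bytes:
    """Plain HTTP GET that returns the raw response body."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def http_lookup(
    rows: Iterable[Row],
    url: str,
    fetch: Callable[[str], bytes] = http_get,
) -> Iterator[Row]:
    """Issue one request per incoming row and attach the body to that row."""
    for row in rows:
        yield {**row, "result": fetch(url)}


# Usage (the URL is supplied by you, it is not part of this sketch):
#   for row in http_lookup(seed_rows(), line_status_url):
#       print(row["query_time"], len(row["result"]), "bytes")
```

Note that an empty input produces no requests at all, which is the reason the transformation needs a row generator in front of the HTTP client.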

One complication at this point is that you might get a BOM (byte order mark) tacked on the front of the response.  On Windows I found this to be 3 characters and on Linux 1.  As a workaround I used a simple OpenFormula step to strip it.  The next part is to extract the interesting fields using the Get XML Data step.
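A plausible explanation for the 3-versus-1 difference, which I have not verified against PDI itself, is the platform default encoding: the UTF-8 BOM is the three bytes EF BB BF, which decode to three characters under a single-byte Windows code page and to the single character U+FEFF under UTF-8. A minimal Python sketch of a strip that handles both cases (`strip_bom` is a hypothetical helper, not the OpenFormula expression used in the transformation):

```python
import codecs

# The same three BOM bytes, seen through two different decoders.
BOM_AS_SINGLE_BYTE = codecs.BOM_UTF8.decode("latin-1")  # three characters
BOM_AS_UTF8 = codecs.BOM_UTF8.decode("utf-8")           # one character, U+FEFF


def strip_bom(text: str) -> str:
    """Remove a leading UTF-8 BOM, however it was decoded."""
    for bom in (BOM_AS_SINGLE_BYTE, BOM_AS_UTF8):
        if text.startswith(bom):
            return text[len(bom):]
    return text
```

Checking for the prefix rather than blindly cutting a fixed number of characters keeps the same logic working on both platforms, and leaves responses without a BOM untouched.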
This extracts the Line and Status fields from the XML as separate data fields in the stream.  I also added a database look-up step to retrieve some color formatting data for the report.  TfL publish a style guide where they specify the color scheme of foreground and background for each line.  So the final transform is:
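As a rough equivalent of the Get XML Data and database look-up steps, the sketch below pulls a line name and status out of XML and joins each row to a color table. Everything specific here is an assumption for illustration: the XML layout is a simplified stand-in rather than the real TfL response, the color values are placeholders rather than figures from the TfL style guide, and `backgroundcolor` is an assumed field name alongside the `foregroundcolor` field used in the report.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Simplified, hypothetical stand-in for the line status response.
SAMPLE_XML = """\
<ArrayOfLineStatus>
  <LineStatus>
    <Line Name="Bakerloo"/>
    <Status Description="Good Service"/>
  </LineStatus>
  <LineStatus>
    <Line Name="Central"/>
    <Status Description="Minor Delays"/>
  </LineStatus>
</ArrayOfLineStatus>
"""

# Placeholder colors standing in for the database look-up table.
LINE_COLORS: Dict[str, Dict[str, str]] = {
    "Bakerloo": {"foregroundcolor": "#FFFFFF", "backgroundcolor": "#111111"},
    "Central": {"foregroundcolor": "#FFFFFF", "backgroundcolor": "#222222"},
}
DEFAULT_COLORS = {"foregroundcolor": "#000000", "backgroundcolor": "#EEEEEE"}


def extract_rows(xml_text: str) -> List[Dict[str, str]]:
    """Return one row per line with its status and formatting colors."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.iter("LineStatus"):
        line = node.find("Line").get("Name")
        status = node.find("Status").get("Description")
        # Fall back to default colors when a line is missing from the table.
        colors = LINE_COLORS.get(line, DEFAULT_COLORS)
        rows.append({"line": line, "status": status, **colors})
    return rows
```

Each returned row carries the same kind of fields the report consumes: the line, its status, and the two colors used to style the element.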

A dashboard report in Report Designer

The desktop Report Designer is a hugely powerful tool for creating visually rich report components.  While the tool is used extensively for creating inventory reports and invoices destined for print, its flexibility makes it suitable for a wide range of purposes.  This flexibility starts with the supported data sources.  These range from simple SQL statements, dynamic queries, metadata repository queries and PDI transformations to, from version 5.0, MongoDB.  In this instance I'll use the PDI transformation that I just created.

When creating a new data source in Report Designer there is an option to use "Pentaho Data Integration".  This opens a dialog where you set a query name and specify the path to the transformation file.  Report Designer then reads the transformation and lists the steps available to use as the source of the data for the report.  It is also possible to add parameters to filter data from the report.
By dragging and dropping the available data fields into the report you can create the layout that you want.  Each component on the report can have all its parameters set statically in the tool or dynamically from a data source.  In this case I'm going to use the foreground color and background color from the query to style the report elements.   

In this case I've set the text color to the value of the data field "foregroundcolor".  All that remains to do now is style the remaining components and preview the report.
That's our finished report.  All that remains to do now is add this into a dashboard.  First you need to publish the report to the BA server.  Once the report is there, in the Enterprise Edition you can create a new dashboard page and drag and drop the report element into place.
In the dashboard designer the refresh interval can be set.  In this case I use 60 seconds.  So every 60 seconds the dashboard element is refreshed: the report is redrawn and the data source queried, which in this case means our PDI transform brings real-time data into the dashboard.

So I now have a dashboard element that updates every 60 seconds, querying and displaying real-time data.  In my next post I'll look at how this report can be linked to a detail drill-down to show the status of each line.  This whole exercise should have taken no more than an hour even for a PDI and Pentaho reporting novice.  You can't get much more agile than that!


  1. This is a great post. I love seeing Pentaho BI solutions taking advantage of webservices

  2. Syntax:
    pentaho bi development services

    Good design, nice explanation with screenshots about Agile Real Time BI with Pentaho, Real-Time data acquisition in PDI, A dashboard report in Report Designer.