Sunday, 5 October 2014

Embedded Pentaho Data Integration - Raspberry Pi

This week I was asked if it was possible to run a Pentaho Data Integration transform in an embedded use case. My initial thoughts were that I didn't know, but I didn't see why not. Best try it out.

In the UK the Raspberry Pi is a very popular, readily available embedded platform that's used in all sorts of fun hacky projects.  Costing about £25 ($40), it's also pretty cheap.  It has a 700 MHz ARM11-based CPU and 512MB RAM, so it's no powerhouse, but it should be enough.


The board itself is about the size of a credit card and comes with a good selection of ports: HDMI, 4x USB, Ethernet and a micro USB for power.  It also has a 40-pin GPIO (General Purpose Input/Output) connector that opens up a wide range of possibilities.

The board can be supplied with an 8GB SD card that comes with a collection of operating systems and can then be used for storage and booting.

To get started I installed Raspbian, a Debian-based distribution optimized for the Pi.  Installing took a couple of minutes and the OS booted.  Initially I was connected to a monitor and keyboard just to get the setup done.  Once the initial setup was complete and I was on the network, I just had to enable SSH and then log in over the network.  After this point I dispensed with the keyboard, mouse and monitor.

I obviously wasn't going to run Spoon, the transformation development environment, on the Pi, but my objective was to see if I could run "a" data transformation on the platform. One way to achieve this was to run a Carte server, which allows you to connect remotely and run transformations.

The Carte server can be copied over from the data integration design tool folder, and to my utter amazement, despite a couple of errors in the console (some X11 related, possibly because I was running headless), the server started up first time.  (I know that's supposed to happen with Java, but still.)


So the next part was to create an ultra simple transformation just to show that things work!
This just creates a few rows, waits 10 seconds between each row, gets some system information and writes it to the log.  Virtually pointless, but it proves the use case nonetheless.
The next part is to register the Carte server as a slave server: View Tab->Slave Server->New, then enter the config.
All configured.  Now just run the transformation, choose to execute remotely, select your new slave config, and off it goes.



Just to be sure that it was executing on the Raspberry, here is the console output on the Pi.
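If you don't have the console to hand, another way to check is to ask Carte itself. Carte serves a status page at /kettle/status/, and adding ?xml=Y returns it as XML. The snippet below is only a rough sketch of that idea: the host name, port, transformation name and the trimmed sample response are my assumptions for illustration, and cluster/cluster is the default Carte login, which you may well have changed.

```python
import base64
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical address of the Pi; substitute your own host and port.
CARTE_URL = "http://raspberrypi.local:8081/kettle/status/?xml=Y"


def build_status_request(url, user="cluster", password="cluster"):
    """Build (but do not send) a basic-auth GET request for the Carte status page."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def parse_transformations(status_xml):
    """Return (name, status) pairs for each transformation in the status XML."""
    root = ET.fromstring(status_xml)
    return [
        (t.findtext("transname"), t.findtext("status_desc"))
        for t in root.iter("transstatus")
    ]


def fetch_transformations(url=CARTE_URL):
    """Call the Carte server over the network and list its transformations."""
    with urllib.request.urlopen(build_status_request(url), timeout=10) as resp:
        return parse_transformations(resp.read())


# Trimmed sample of the XML layout I am assuming Carte returns;
# "pi_test" is a made-up transformation name.
SAMPLE_STATUS = """<serverstatus>
  <statusdesc>Online</statusdesc>
  <transstatuslist>
    <transstatus>
      <transname>pi_test</transname>
      <status_desc>Running</status_desc>
    </transstatus>
  </transstatuslist>
</serverstatus>"""
```

Calling fetch_transformations() from a machine on the same network should list whatever the Pi is running; parse_transformations(SAMPLE_STATUS) lets you try the parsing without a server.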
So it works, what's next?  That's where the fun can begin.  There is a huge range of applications that this could enable, and lots of options for how to communicate with the remote devices and make them as autonomous as possible.  Hopefully I'll find the time to try some of these out!

Monday, 24 March 2014

Real Time CTools Dashboard - Twitter - Part I

Real Time CTools Dashboards  

If you have read any of my previous blogs, you will have noticed that I like the slightly unconventional and challenging.  My last example was a real time dashboard showing the status of the London Underground system using Pentaho Business Analytics Enterprise Edition.  In this post I'm repeating the "Real Time" theme but using CTools for the front end.

I've split the post into multiple sections as it's rather a lot to put in a single post.
  • Part I  Covers the data integration and data sourcing (This post)
  • Part II  Covers the front end.

CTools?  What's that then?

Unless you have been living on another planet for the last 5 years you will surely have come across the great work being produced by WebDetails and others.  Over the last few months I've been fortunate enough to work with both Pedros (Alves and Martins) and the rest of the WebDetails team. I've been inspired to see what I could put together using CTools and the other parts of the Pentaho toolset.  This came along at the same time that we were running an internal competition in the sales engineering group to create a CTools dashboard, so with some assistance from Seb and Leo, I created something a little different.


Real time?  No problem!

One of the really powerful features of CTools is the ability to use a Pentaho Data Integration transform as a direct data source.  As PDI can connect to practically anything at all, transform it and output it as a stream of data, you can put practically any data source behind a CTools dashboard: MongoDB, Hadoop, Solr or perhaps a RESTful API.  Not only that, you can use multiple data sources in a single transformation and blend the results in real time. In effect it's using a tool that had its roots in ETL as an ETR tool, "Extract Transform Report", or to look at it another way, an ultra powerful visual query builder for big data (or any data for that matter).

The first bit is relatively easy: create a search string, get an authentication token, get the results, then clean up.  To enrich the data feed a little more I've added a WEKA scoring model that adds sentiment analysis to the data stream.  At this point I've got a raw feed with the text, some details of the tweeters and some sentiment.  To enliven the dashboard, a few aggregates and metrics are needed.  One option is to add steps to the transformation to create additional aggregations, but I'd rather just run the one Twitter query to create a data set and then work with that.  There is a way to do that...
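For anyone who prefers to see the search steps as code rather than PDI steps, here is a minimal sketch of the first bit. It assumes Twitter's v1.1 REST API with application-only authentication, which is my assumption rather than something the transformation dictates, and the function names and the fields kept by clean_up are made up for illustration. Nothing is sent over the network; the functions only build the request pieces and tidy a response.

```python
import base64
import urllib.parse

# Assumed endpoints: Twitter's v1.1 REST API with application-only auth.
TOKEN_URL = "https://api.twitter.com/oauth2/token"
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

# Body posted to TOKEN_URL to ask for a bearer token.
TOKEN_BODY = "grant_type=client_credentials"


def token_request_headers(consumer_key, consumer_secret):
    """Headers for the token request: key and secret go in as HTTP basic auth."""
    pair = f"{urllib.parse.quote(consumer_key)}:{urllib.parse.quote(consumer_secret)}"
    encoded = base64.b64encode(pair.encode("ascii")).decode("ascii")
    return {
        "Authorization": f"Basic {encoded}",
        "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
    }


def search_url(term, count=100):
    """Create the search string as a URL-encoded query."""
    return SEARCH_URL + "?" + urllib.parse.urlencode({"q": term, "count": count})


def search_headers(bearer_token):
    """Headers for the search request once a token has been obtained."""
    return {"Authorization": f"Bearer {bearer_token}"}


def clean_up(payload):
    """Keep only a few fields from the parsed search response (illustrative choice)."""
    return [
        {
            "text": status["text"].replace("\n", " ").strip(),
            "user": status["user"]["screen_name"],
            "followers": status["user"]["followers_count"],
        }
        for status in payload.get("statuses", [])
    ]
```

The sentiment scoring is left out here, as that part is done by the WEKA model inside the transformation.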

Querying the CDA cache

This is where we can use a novel approach to get at the results of the last query.  In CTools the results of each query are held in the CDA cache.  It's possible to access the results of the CDA cache using its URL directly:

To get at the URL open the CDA page associated with the dashboard, select the data access object, and look at the query URL.

This URL can then be used as a source in a transformation.  In this case I put it in the HTTP Client step:

Then using the JSON parser the individual fields can be split out again:

I can then use a range of sorting, filtering and aggregation steps to create the data views that I want.  Just branching the data flow and copying the data allows you to create multiple views in a single transformation, each of which can be picked up separately by a new CDA data source.
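To make the idea concrete outside of PDI, here is a small sketch of the same fetch, parse and aggregate flow. CDA's JSON output carries a "metadata" list describing the columns and a "resultset" list of rows, and the code relies on that layout. The server address, the .cda path, the data access id and the column names are all hypothetical, and the exact URL shape varies between Pentaho versions, so copy the real query URL from the CDA page rather than trusting mine.

```python
import json
import urllib.parse
from collections import Counter

# Hypothetical server and endpoint; use the query URL shown on your CDA page.
CDA_BASE = "http://localhost:8080/pentaho/plugin/cda/api/doQuery"


def cda_query_url(path, data_access_id):
    """Build a doQuery URL for one data access object in a .cda file."""
    params = {"path": path, "dataAccessId": data_access_id, "outputType": "json"}
    return CDA_BASE + "?" + urllib.parse.urlencode(params)


def rows_from_cda(cda_json):
    """Turn CDA's metadata + resultset layout into a list of dicts."""
    doc = json.loads(cda_json)
    names = [col["colName"] for col in doc["metadata"]]
    return [dict(zip(names, row)) for row in doc["resultset"]]


def count_by(rows, field):
    """One simple 'view' of the cached data: number of rows per value of a field."""
    return Counter(row[field] for row in rows)


# Made-up response in the CDA layout, standing in for the cached Twitter query.
SAMPLE_RESPONSE = json.dumps({
    "metadata": [
        {"colIndex": 0, "colType": "String", "colName": "text"},
        {"colIndex": 1, "colType": "String", "colName": "sentiment"},
    ],
    "resultset": [
        ["great dashboard", "positive"],
        ["train delayed again", "negative"],
        ["nice charts", "positive"],
    ],
})

rows = rows_from_cda(SAMPLE_RESPONSE)
sentiment_counts = count_by(rows, "sentiment")
```

Each extra view is then just another function over the same rows, which mirrors branching the data flow in the transformation.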

That pretty much covers the data integration.  At this point I've used PDI as a visual querying tool against a web API, and the same tool works to query the cached results of the first query - this is powerful stuff.  In addition to this (with virtually no effort at all) I'm doing sentiment analysis on Twitter using a WEKA model.  This process could be enhanced by using MongoDB as a repository for longer term storage of each of the search results, allowing multiple "iterative" guided searches to build up a results "super-set" that could then be analyzed further.

In the next post I'll talk about building the CTools front end using CDE.  I'll include some rather distinct styling, and some very flashy but ultra simple CSS and JavaScript tricks.