Tuesday, 5 November 2013

Agile Real Time BI with Pentaho - Part III

Bringing it all together

This is the third article in a trilogy on using Pentaho Business Analytics to do Business Intelligence on real-time data sources in next to no time.  The first two articles cover using  Pentaho Report Designer reports in combination with Pentaho Data Integration transformations to provide summary and detail level reports in a dashboard.  The details of how to create these are covered in part I and part II.

Lets take a look at what I'm aiming for at the end of this article:
This shows the summary report covered in Part I, the detail level report from Part II, and whats that on the far right?  This is a Pentaho Enterprise Edition Analyzer view running on Mondrian, and in this instance reporting off a Cassandra database being updated every 60 seconds, so that's real-time (well to the nearest 60 seconds) OLAP analysis!

This example is a huge over simplification of what is possible using the technology, but a crucial component here is Pentaho Data Integration's data blending technology.  The outcome of this is that we can use a NoSQL database as a data source for Mondrian via JDBC.

There are three transformations behind this:
  • Read the API and load the data into Cassandra
  • Clear the Mondrian Data cache
  • Read from Cassandra and make the data available over JDBC

Reading the TfL API and putting the results into Cassandra

The first part is to get some data, this reuses the basic API call and XML parsing from Part I, in this case I'm also capturing the status details field which I also want to store in my Cassandra database for possible future enhancement (stay tuned).  I also separate the time and date elements to make the Cassandra Key.  
There is really very little to configure to write the data into Cassandra, just configure details like the column family (think wonky table), specify the fields to use as a key and that is about it. 

Clear the Mondrian Cache

When we make new data available within Cassandra and we're querying via Mondrian we need some way of indicating that the cache needs to be updated.  There are a couple of ways to achieve this, including telling the data model not to cache the data, but in this case I'll take the nuclear option and blow away the cache after loading new data.  This can be done via an API call in version 5 of Pentaho, so I used a HTTP Client step in PDI.
I just used a simple job set to repeat on a 60 second loop to run these two transformations.  

Now for the Magic

So how do we use a traditionally SQL based technology on a NoSQL database?  This is where we can use one of the new features of Pentaho Business Analytics 5.0, data blending.  In effect what this allows a developer to do is create a data transformation using the steps available in PDI and make the resulting data stream available for SQL querying via a thin JDBC driver.  The transformation is stored in the enterprise repository and the data made available via the Data Integration Server, details are available in the Pentaho Wiki on how to do this.  The transformation in this instance could not be simpler, read the data with a Cassandra input step:
add a constant of 1 as the "count" of the number of entries to aggregate up in Mondrian, and then use a step to act as the "output" in this case a select values step.  
The final step in the transformation is to use the Data Service tab in the transformation settings.  This associates a data service name with the data stream from a transformation step.

The easiest way to get at the data is to use the data source wizard in the Pentaho User Console, but before doing this you need to create a new JDBC connection (you can do this in one step, it;s just cleaner to explain in two).   The JDBC driver needs to use the custom driver type, and a custom connection URL.

For the custom connection URL I used:
jdbc:pdi://localhost:9080/kettle?webappname=pentaho-di
and for the driver:
org.pentaho.di.core.jdbc.ThinDriver
The final step was to create a new data source in the data source wizard, I created a simple analysis view and then added a very quick and easy stack column visualization on an analyzer report.  Add the new report into the dashboard created in Part I, set the refresh interval as appropriate and you're done.

So that's it!  Real time BI with a combination of reports with drill down and real time OLAP style analytic reporting.

Where Next?

This has been an interesting quick and easy experiment but has by no means reached the limit of whats possible.  There are three paths that I might pursue in future posts, advanced visualizations, text search or predictive analytics.  The visualization options could by something like a Force Directed Layout showing the stations and relative locations of the trains as a network graph.  Another alternative would be a version of the Live London Underground map.  The text search could be an interesting option to search for particular issues or performance problems, possibly linked with a twitter search?  Finally another option is to use WEKA to do some predictive analytics on historic data and influencing factors on performance, and build a predictive engine to answer questions such as "It's raining, it's 08:15 on a Monday morning I'm at Kings Cross and I need to get to Angel, should I just walk?".  The answer to that question will almost certainly be yes in every instance, but you get the idea.

If you are interested in a one-on-one demo look me up at one of the trade fairs that I regularly attend for #Pentaho. Keep up to date on when I'll be about by following me on twitter @MarkdMelton.



Tuesday, 29 October 2013

Agile Real Time BI with Pentaho - Part II

Where to go next?

This is part II of a trilogy of posts on using the Pentaho Business Analytics platform to do Business intelligence on real-time sources as rapidly as possible (it took me longer to write the first post I that it did to build the solution!).  If you have found this article from elsewhere I'd suggest reading part I first, which is available here:
In Part I I implemented a simple report as part of a dashboard using a PDI transformation as a data source.  I'd also like to point out that I'm not the first person to do this, nor is this aspect of the solution a new Pentaho feature, this has been around since 2011, as shown here by Wayne, Using a Pentaho Data Integration Data Source with the Pentaho Report Designer.  In part II I'm going to expand this to show how a second detail level drill-down can be created passing parameters into the transformation to control the search.  Part III goes on to use some Pentaho 5.0 features.

Part II starts off by creating a new transformation to retrieve and process the detail level data for one specific line.  After creating a new transformation the transformation properties are modified by adding a parameter.  In this case I've added two, one for the URL of the detail level service URL and one for the line that I'm interested in. 

The first part of the transformation repeats the call to the line status API, but gets back more detail on the service, in parallel with this it fetches the summary level detail for the individual line, this lists information for individual trains and stations on a line.  
Again there are a few steps to trim byte order marks etc.  There is a database look-up step to get the color formatting for the line from the same database.  In addition in this case the second detail level look up requires a line code.  These are published by TfL in their developer guides available here.  I have also  put these into a database.  The query URL is then constructed and passed to a second HTTP request.  This generates a large XML response where I use a Get Data From XML step to break out the values I'm interested in.  In this case I loop on the "/ROOT/S/P/T" element and then use XPaths to get the fields I need inside and outside of the looping element.
Platform and Station are one level and two levels back from the looping element.  This then creates one data row for each train active on the line.   I then join this stream with the line status information that I have from the earlier call.  This is then followed by a slightly odd section...
Filter Ghost stations? I was a little confused when I first looked at the data and found that there were stations appearing on the Bakerloo line and Central line that were listed as having overground services.  A little bit of investigation discovered something a little strange, there are many stations that used to be part of the underground, but for one reason or another are no longer part of the underground network, details are available here.  Why a station that ceased to be part of the underground in 1982 is still appearing in an API query for that line I haven't been able to work out. 

If you are interested in some of the more bizarre aspects of the London Undergrounds missing and closed station there is an excellent website dedicated to this topic underground-history.  
A little bit of further tidying of the data was required for the predicted arrival time, the default for "in the station" is a '-' character, and as I wanted to sort by arrival time this was not much use, so I replace this with "00:00".  The next steps just sort the trains by arrival order and filter this down to just the next three arrivals.

Line Status Report

This dashboard panel will again use a PRD report to display the data.  The complexity in this case, aside from rather more data, is passing a parameter to the transformation. In this case I create a parameter "Line".  In the Pentaho data Integration Data Source popup there is an Edit Parameter button.

This just links the name of the parameter as it is known in the report to its name in the transformation parameters. I then added the data fields to the report, did a bit of manual styling  linked in the line status details, station names etc
To make this report work in a pair with the report created in Part I as part of the dashboard, a minor modification was needed to the first report.  The Pentaho EE dash boarding has a capability to very easily link dashboard widgets using parameters and content linking.  To access this in a PRD report you need to add a hyperlink from the cell that will be the drill-down link.  The link itself will be intercepted, so doesn't actually hit it's end point, but I used:

 So at this point I have two transformations and two reports.  All that remains is to edit the dashboard, drag and drop in the second report, enable content linking from report one and attach this as a parameter to report two.  The refresh can be left on the individual panels or moved to the dashboard level.


So the dashboard at the end of Part II looks like this:
Clicking on the line in the left hand report changes the line listed in the right hand panel. In Part III I'll show a very simple example of how this real time data can be stored in Cassandra and queried using a Mondrian Analyzer view.

Friday, 25 October 2013

Agile Real Time BI with Pentaho - Part I

A change of direction

Back in June I changed direction in my career and moved from being a business architect trying to deliver business benefit at a large media and publishing house to a pre-sales engineer for Pentaho.  I'd used the community edition of Pentaho quite extensively in the past, so now I've moved on to using the Enterprise Edition to show the possibilities of using a modern plug-able business intelligence platform.  

One of the interesting aspects of my role is visiting trade events and shows giving live demonstrations, showing how we can solve problems on the fly with our technology.  While most people are in the talks, the rest of us on the trade stands have an opportunity for some downtime.  This also gives a great opportunity to put together some new technical demos.  Last week I was at the Cassandra Summit Europe 2013 in London and took the opportunity of an hour or so's downtime to put together a small technical demo.

Agile Real-time BI

Trying to come up with a sensible test case for real-time BI is often a significant stumbling block, but in this instance I took inspiration from the work that Matthew Summerville did on a live London underground map.  Transport for London have an API available with various data sources, this is free to register for and access.

TFL Service Status
The choice of technology in this instance is Pentaho Business Analytics, both the BA server and Data Integration.  I'm using the Enterprise Edition in this case, but most of the functionality needed is in the community edition as well.  Most of the data I'm planning on pulling live but where I need data storage I'll use Cassandra.

The objective that I've set myself is to create three report components:
  1. A report similar to the the TFL service status report
  2. A drill in detail report showing the arrivals at each station
  3. An analytic view of real-time and historic data.
In this post I'll cover component one and I'll followup with parts 2 and three.

The starting point for the service status is an outline architecture.  One method of producing a report component with tight control over the visual appearance is to use the Pentaho Report Designer.  Looking at the options for the  real-time data acquisition for this component, one option is to use a PDI transformation as a data source.  So that's where we're going to start.


Real-Time data acquisition in PDI

The TfL API can be accessed by a HTTP client and will return XML.  PDI has a couple of HTTP clients for Web-Services (SOAP), RESTful interfaces and general purpose HTTP clients, in this instance I'll use the HTTP client.  The HTTP client on it's own will not generate any data rows, you have to add a step before the look-up to generate at least one row.  In this case it would be useful to know the query time anyway, so I'll use the Get System Info step to get the current time.

To this gives us:
The get system info step is used to get the system time:
The connection to the Tfl API is quite simple, just connect and get the line status:

One complication at this point is that you might get a BOM (byte order mark) tacked on the front of the response.  On Windows I found this to be 3 characters and on Linux 1.  As a simple work around I used a simple OpenFormula step.  The next part is to extract the interesting fields using the Get XML Data step.
This extracts the Line and Status fields from the XML as separate data fields in the stream.  I also added a database look-up step to retrieve some color formatting data for the report.  TfL publish a style guide where they specify the color scheme of foreground and background for each line.  So the final transform is:

A dashboard report in Report Designer

The desktop report designer tool is an hugely powerful tool for creating visually rich report components.  While the tool is used extensively used for creating inventory reports an invoices destined for print its flexibility lends it to being suitable for a wide range of purposes.  This flexibility starts from the supported data sources.  These range from simple SQL statements, dynamic queries, metadata repository queries, PDI transformations and from version 5.0 MongoDB.   In this instance I'll use the PDI transformation that I just created.  

When creating a new data source in PDI there is an option to use "Pentaho Data Integration", this opens a dialog where you set a query name and specify the path to the transformation file.  This will then read the transformation and list the steps available to use as the source of the data for the report.  It is also possible to add parameters to filter data from the report.
By dragging and dropping the available data fields into the report you can create the layout that you want.  Each component on the report can have all its parameters set statically in the tool or dynamically from a data source.  In this case I'm going to use the foreground color and background color from the query to style the report elements.   

In this case I've set the text color to the value of the data field "foregroundcolor".  All that remains to do now is style the remaining components and preview the report.
That's our finished report, all that remains to do now is add this into a dashboard.  Firstly you need to publish the report to the BA server, once the report is there in the EE edition you can create a new dashboard page and drag and drop the report element into place.  
In the dashboard designer the refresh interval can be set.  In this case I use 60 seconds.  So every 60 seconds the dashboard element is refreshed, where the report is redrawn and the data source queried, in this case our PDI transform bringing real-time data into the dashboard.  

So I now have a dashboard element that updates every 60 seconds querying and displaying real time data.  In my next post I'll look at how this report can be linked to a detail drill-down to show the status of each line.  This whole exercise should have taken no more than an hour even for a PDI and Pentaho reporting novice, you cant get much more agile than that!


Monday, 17 June 2013

When Enterprise Architecture and BI collide

Last week I was in London at the Enterprise Architecture Conference Europe, first off I must say that this was a really enlightening conference, containing both good and bad examples of EA.  I'm a firm believer that those presenters who are brave enough to stand up and say "this just didn't work out" are often more valuable than those who want to represent everything as an unmitigated success.  For a while now I've been working around the idea of a post on how your EA strategy can and should impact   your BI/Big Data/Enterprise Analytics (delete as appropriate) strategy.

For those that have grown up on the Kimball method of building data warehouses, there might seem to be one solution of "we do masses of up front analysis and build a dimensional data warehouse", while this is a gross exageration it sums up many approaches I have seen.  The stock BI strategy has lots of nice, fairly obvious statements of "do away with data silos", "single version of the truth", "driven by business requirements" which are all very nice, but how do you actually go about achieving this? If you have the ideal of a CEO/CFO who is a strong advocate of the value of BI and is trusting enough to leave the specialists to get on with it, then you are truly blessed. But this is not always the case and there are often many other factors that now have to be taken into account.  With the velocity of change now increasing massively, not just in technical terms but business change we need to step back and look at the big picture for a moment.

Taking the view from the top, an enterprise or company is a collection  of people, processes and data working towards a common goal.  Our role as BI professionals is to help the enterprise do this more effectively, by enabling an increase in revenue, reducing costs or avoiding costs.  Enterprise Architecture is also a profession that aligns with this.  If you are not familiar with EA I'd recommend reading the excellent book "Enterprise Architecture as Strategy" by Jeanne W. Ross, Peter Weill and David C. Robertson.

There needs to be a clear differentiation between "Enterprise Architecture" and "IT Architecture for an Enterprise"  I've come across several IT architects who describe themselves as "Enterprise architects" when they only deal in IT.  Enterprise Architecture is the view of the business processes, their standardisation and integration and the underlying systems that support them.  Sometimes what was "Enterprise Architecture" is now being called "Business Architecture". A simple example of the difference is the IT view of "how can I redesign or implement this system to be more efficient?", the EA view is "how can I redesign or implement the processes supported by this system to be more efficient?".  There is no point in building the most perfect and cost effective IT system to support a process that no longer needs to be done.

While I'm on the topic one common theme I've come across is an odd behaviour of a commonly held belief that "IT cannot lead business strategy". While it is true to an extent that just because there "is" a new technology you should not necessarily use it, to be blinkered to new technical opportunities is just as bad.  It's much more helpful to see IT and the business as having a symbiotic relationship, each a separate entity with different drivers and opportunities but ultimately part of the same ecosystem, mutually dependent no the success of the other.

So back to the thread,  whether it is explicitly stated or not your Enterprise has an architecture, in my experience it mostly got there by accident rather than design.    The MIT Sloan Centre for information research  demonstrate 4 different operating models depending on the level of business process integration and business process standardisation.

By default most businesses subconsciously fall into the Diversification category, but this should not necessarily be seen as a bad thing, sometimes this is the most effective and cost efficient model for operating in certain markets.  Let me give my paraphrasing for each category, both the "formal" and reality of whats happening on the ground:
  • Diversification:  
    • Formally - Each business unit has the flexibility to implement its own services and products as it sees fit, there may be some shared services but generally each unit has the flexibility to define, build and run its own services and processes as required.  
    • The reality - Its chaos out there.  Lots of duplication of systems, different processes for performing the same action, multiple business definitions for the same entity. Rapid pace of uncontrolled change.  You will probably also come across business units implementing their own IT systems. 
  •  Coordination:
    • Formally - A high degree of data integration, prebuilt delivery channels and standardised technology.  Each business unit still has a degree flexibility variation of the processes it uses on top of the standard services.
    • The reality -  Lots of effort to define and enforce the standards.  The body responsible for enforcing the standards often seen as an obstacle to progress leading to a tendency to move back down to the diversification model.
  •  Replication:
    • Formally - Pretty much a franchise model, standard branding and processes reused repeatedly.  High efficiency from reduced risk of implementing new untried processes.  Local data ownership but enterprise wide definitions as part of the process.
    •  The reality - replication of data systems and small local variations to standards.  No clear view of the customer gives a risk of competing against yourself for business.  Continual risk that the processes may be seen a limiting factor and an obstacle to progress leading to a tendency to move back to a diversification model.
  • Unification:
    • Formally - Highly organised cost effective model, highly effective at identifying cross-sell and up-sell opportunities.
    • The reality - "You will be assimilated", but implementing a global enterprise BI system is easier to achieve in this model than any other.  Generally to have implemented this model means that the architecture and standards governance in the Enterprise must be top notch.
Before I go on to look at the impact that each of these operating models has on a BI strategy there is another important influencing factor.  Where in your organisation do your enterprise architects sit?  Obviously by this I don't mean the nice seats in the corner with the good windows and pot plants, but whats their reporting line?  Generally if they are within the IT organisation their effectiveness will be considerably reduced.  The tendency will be for the business to see them as "IT" so "what do they know about business" .  This was a very common theme at the EAC conference last week.   The risk is that while you may have "IT Architects" who do their best, the business is doing it's own Enterprise Architecture to their own tune.

How does this affect our BI strategy?  Well even if the EAs have a clear vision of the future if they don't have the authority to rigorously enforce this or they are seen as living in an ivory tower the reality is that over time new processes and capabilities will appear that bypass the central organisation.  Essentially your enterprise has reverted to the diversification model.  This becomes your problem when you try to build a BI capability for one model but the reality proves to be very different.  The advice here is form a good relationship with the EAs (you may even be in that group) but also look further afield to see the bigger picture.  Also remember that if they are doing their job well EAs will be having a tough time, they are often a voice of sanity in a enterprise trying to produce change for a better future, keeping that better future in mind is not an easy task when dealing with the day to day politics, especially as your actions may well render parts of the business redundant.  This was best summed up in a presentation last week by "When you are up to your arse in alligators, just remember you are there to drain the swamp".

So getting to the point, how should your BI strategy be influenced by the companies operating model? Lets start with the easy option, your enterprise is following a unification model, firstly, lucky you, someone else has already done most of the hard work of data integration and standardisation. In this model you can pretty much go to the shelf pick off a standard BI strategy of a single enterprise warehouse and stand a good chance of being successful.   All change should be planned and coordinated, where you just need to ensure you are plugged in at the right point.  Your technical implementation can also be considerably simplified by the reduced requirements for data integration and standardisation.  These are the models that the vendors love to tout as examples, being mostly successful and at the lower end of the cost range.

Now onto the slightly more problematic models, firstly replication.  The primary problem here is data integration and standardisation of entities not covered by the replicated processes. But the first question to ask here is do you need the "whole" view of the enterprise.  While the single view of the enterprise might be a nice to have and certainly required from the very top of the business, this may not even be needed at the lower levels.  Think of it as letting your BI strategy model follow that of the business.  Do you really need to go to the level of standardising elements such as customer address right across the enterprise?  As each business unit operates independently with it's own customer base what value are you adding by doing this?   So even if you build a single conceptual warehouse you can probably have dedicated model for each unit and only aggregate data at a level where it makes sense to do so.  While there is more effort involved in maintaining essentially separate solutions this may well be more successful that trying to force the business to change just to fit the niceties of your warehouse architecture.

Another slightly less problematic model is coordination.  Provided that everyone is sticking to the rules again most of the hard work for data standardisation and integration will have been done.  The problem here is how do we tailor reporting to suit local process variation?  The difficulty here is firstly there may be hidden data sources that support the localisation, these could even be the dreaded multi-thousand line spreadsheets, secondly you may well need to have a different view on the same data where local process practise adds a variation to the standard definition.  The latter problem can in part be resolved by moving away from the standard Kimball dimensional model to the more mature models using "Foundation layers" or similar techniques to abstract the dimensional model from the main data storage location.  Here each business unit can have it's own local dimensional model tailored to suit its local process variation.

Finally we reach diversification, this is really the wild west of BI and data warehousing, again you really need to ask the question "Do we need an Enterprise wide view?"  if the answer is yes the solution will be challenging not just from a standards and data perspective but technically as well with the associated costs of such complexity.  But the biggest hurdle will be getting business sponsorship for the project, your estimates will be put alongside example implementations using Qlikview in enterprises using a unification model with the inevitable questions about the extra cost.  Your first hurdle here is going to be getting the funding to just do the work to establish the scale of the problem and produce a governance and technical plan.  The only advice there is you have to do your best to present the vision of the possibilities and attempt to get a good executive sponsor who appreciates the scale of the problem.

While a standard approach here might be to start with the Kimball method of detailed requirements gathering followed by building a dimensional warehouse, this is going to be hugely costly and probably outpaced by the rate of change in the business.  So my best advice here is simple, start with a narrow scope, just a single process or area, but in all its diverse forms.  Use as much data abstraction as possible, expect change to your source systems.  Do just in time requirements gathering for your analytical layer, again requirements will change there is no point in documenting something in great detail only for it to be obsolete by the time you finish writing it.  A good approach to this problem is presented by Oracle in their paper "Information Management and Big Data,  Reference Architecture".  This is not an Oracle specific architecture and could be applied using a range of solutions. Deloitte present a very similar model in their paper on "How to build a successful BI strategy".  The most important role you are going to need in your team here is a top notch data modeller.  The abstract data model that constitutes the foundation layer or Enterprise Data Warehouse as Deloitte refer to it is a critical success factor.  The key here is to be able to focus on the similarities between data sources and processes, not the differences, while the detail and the differences are important it is all too easy to get bogged down in the detail and suffer from analytical paralysis.  Just make sure your data modeller is a "glass half full" personality.  The other key role will be your business analyst, they again need to focus at least initially on the "big picture" of what types of questions does the system need to answer, rather that diving into the detail of the actual questions or worse still report layouts and columns.

So in summary perhaps one of the questions you should ask before taking up a BI architecture role should be "What is your enterprise architecture model and governance process?", if they cannot answer or there is no enterprise model, be vary wary and look carefully at what you are taking on, you might be taking on the role of chief alligator keeper. 

Saturday, 23 March 2013

Endeca Information Discovery 3.x breaks cover

While just looking at the Endeca documentation I noticed that Endeca 3.x has appeared.

I've not had time to do any in-depth digging around, but there look to be some nice headlines:

  • New chart types in studio (Bubble and Scatter look to be the biggest changes)
  • The documentation looks to have been tied up and better structured
  • End users can add their own data from spreadsheets to analyse - the data provisioning service
  • The Content Acquisition System has become the Integrator Acquisition System 
The new chart types may not seem to be that significant but I've always considered them to be two of the most useful ways of presenting many types of data.  I know that in some use cases I've come across their absence was a major blocker.

The data provisioning service is a major enhancement for many use cases.  I could go on at length about the pros and cons of allowing end users or analysts to "bring your own data" but not even having it as an option was a major blocker.  

I always felt that previous versions of the Endeca documentation had something of a "thrown together" quality about them in some areas, so it looks like there has been some tidying up.

The old content acquisition system, seems to have transformed into the Integrator Acquisition system.  I'm not sure what else has changed there.

I'll try to go through the new features in more detail as I get chance to explore them.  But at first glance this looks to be a major step forward, making this tool much more applicable in the general "data discovery" space against the likes of Tableau and Spotfire,  add in the unstructured power of of the Salience engine and I think Endeca looks all set for lift off.  

More information is available on the Oracle website.

Apologies for not being able to put in more detail at this time but I've got a half marathon to run in the morning.






Monday, 18 March 2013

So what is Endeca for?

I'm taking a slight diversion from my planned route with this post.  I was looking at  some of the technical aspects of how to use Endeca, but talking with various people there still seems to be some confusion about what Endeca is or does or more importantly where it fits in to an information management ecosystem.  My impression is that even Oracle are still not quite sure, some of the industry watchers see it as just an enterprise search tool.   So lets try to get to the root of what Endeca is and what are the right use cases for it.

The term BI is like "Big Data" becoming increasingly abused.  It used to be relatively simple to define what BI was, you had sources, ETL tools, Warehouses reporting, OLAP and dashboards.  Then the term "Enterprise Analytics" came along, so it that BI as well?  Usually this involved the implementation of some form of mathematical model to predict future performance based on past performance and a set of variables. So now BI covers what did happen, and what might happen.  But to enact business change and get the desired results of increased sales, improved retention, reduced costs, or whatever the objective might be there is still the human evaluation and partial speculation in the processing of the data.  There is still the gap of knowing why something happened, while I'd freely admit it is possible to refine descriptive BI to provide some answers to the question why did this happen, it is a rather slow and laborious process.  The BI system architects, analysts and developers needed to know where to get the answers to know why and how to process them to get the insight into why events happened.

In the reality of a very competitive business world people need to make decisions much faster than the ability of the data scientists to find the probable cause and the developers can add it into the BI platform.  Which is where tools like Qlikview appear on the scene.  By delivering the nirvana of self service BI business users could follow up on a hunch and confirm it much more quickly than previously.  Used correctly, products like Qlikview, Tibco Spotfire and Tableau  add immense business value alongside the descriptive BI platforms such as OBIEE, Cognos and Business Objects.  So where does Endeca sit in this space?  

My biggest concern about Endeca revolves around Oracles definition of what its for.  The official line would appear to be it's a tool for "Unlock Insights from Any Source".  But I'm not convinced that they have yet figured out how to unlock the value pitch from this proposition.  It is an expensive product and in any business you must be able to put together a reasonable business case to demonstrate that you will get a return on your investment.  Which straight away leads to "what insights?  and "what are they worth to us?"  finally "does it have to be built by IT?" and I think it's here that the problem lies.  Possibly in the parts of the world where the economy is showing signs of recovery and business leaders are less risk averse the "lets take the risk" approach might come off.    Currently in the UK this is certainly not the case, money is tight in most businesses and management is historically more conservative.  Can end users use Endeca on their own without IT involvement?  Certainly not yet, Clover ETL is a nice tool but it's still beyond the skill-set of most non-techies.So Oracle still have some work to do here, there needs to be a real means of demonstrating value to the business if Oracle are to get their UK sales.  

So lets see if I can help them out a little.  I rather like the concept of Endeca and technically it's very clever, so I'd like to see it succeed.  So first lets look at the business cases where we would use Endeca, or actually as a diversion where we would not use it.  If we're looking at just structured data of reasonable quality then Qlikview, Spotfire or Tableau come higher up the list of solutions.  Just from the perspective of product maturity and feature set all three are more capable than Endeca as it comes out of the box.  So do we need to be looking into the realms of "unstructured" or weakly typed data?  There are three aspects to this, where does this unstructured data come from and how do we get it into Endeca, what value is in the data and how do we extract it?

From a business perspective generally unstructured data could appear in a few places: 
  • email 
  • call logs
  • online publications
  • forums
  • twitter
  • special interest websites 
  • review sites
So is there anything of value in that lot?  If we look at the top business applications of Text Analytics:

  • Product/Brand/Reputation Management
  • Customer Experience Management
  • Search or question answering
  • "other research"
  • Competitive intelligence
There are some potential areas for overlap here, but in email communication we can strike the top two business applications.  In Emails from customers there may well be useful insight that could help identify problems with products and services early  that might adversely affect product or brand reputation.  Equally we might resolve issues around customer service experience by identify problematic customer service processes or people.  This is tightly targeted data of known context so it's reasonably easy to see how it could be mined for value.  To extract this value we could just use a simple white-list tagger of product names and have the metric counting the number of occurrences, any sudden change indicates something worth investigating.  But whats the value of this data? In high value consumer goods industries this would be about protecting a brands reputation,  almost certainly enough to justify the investment.  In service industries with high churn rates such as mobile telephony again there is probably value from brand protection, but there is probably a limit to the volume of email with more service calls via voice.  So this is rather simple analysis of "key word counting"  to produce an actionable metric, hardly rocket science but there is probably value there.    

So away from the product white-listing  where do we go?  Particularly in the rather more qualitative area of customer service "quality of engagement" this is not an easy problem to solve.  Which is where Lexalytics Salience engine steps in.  This is an extra licensed component in Endeca, but could be where a lot more value can be unlocked.  By scoring the overall sentiment of a piece of text, deriving metrics and picking out themes  actionable intelligence is created.  

The email and call log analytics is a very good use case, and entirely inside the corporate firewall.  Looking outside the firewall we can begin to address some of the other problems by looking at forums, special interest websites and review sites.  There are plenty of off the shelf sentiment analysis feeds such as  the salesforce marketing cloud or Radian6 as it was previously known.  This type of product is a feed of data for sites and systems that they choose to monitor and how they chose to monitor it.  So if you needed to answer more specific problems you either need to start getting highly skilled programmers, data scientists and analysts involved, or look at the Endeca content acquisition capabilities along with sentiment analysis.  This is now beginning to take you beyond conventional sentiment analysis as consumed by most marketing departments.  These sort of capabilities would have helped many companies get early warning of problems and issues that blindsided them, but is it possible to make the business case for this sort of investment ahead of time?  That is a more difficult case to make, building a business case on the "unknown, unknowns" is never going to be easy.

Is there a market for Endeca?  Undoubtedly, but I don't think it looks quite like what Oracle are trying to sell into, certainly in the UK.  In the hands of a data scientist or data analyst Endeca can do incredible things, but what it cannot yet do is allow end users to do their own analysis on "Any source".  So What else does Endeca need to really fly?  The methods of reading data need to improve, there needs to be end user level "loading of any data", while I find Clover ETL easy to use, most end users will not.  There are two further significant omissions; statistical analysis in the form of R and better visualisations.  While it is possible to do statistical analysis in EQL it is a rather painful business, requiring the developer to produce the functions from base mathematical functions.  The limited range of visualisations is also frustrating, the absence of bubble charts is a major omission.  

Is there anything that can be dome outside of the Endeca platform to address some of it's shortcomings? Possibly there is.  The system has many APIs and developer indexes so the problem of visualisation might be solvable.  The dynamic metadata capabilities of Clover could possibly give a way in for "any data" creating an end user data load capability. 

I'm sure with time Endeca will find its place in the information management ecosystem, I just hope it does not get stuck just being branded as "an enterprise search tool" because it is much more than that, or even caught up in the big data hype and rush towards Hadoop in some circles.





Wednesday, 20 February 2013

Agile BI with Endeca - Part 4 - In search of the secret Sauce

In my previous posts I setup an Endeca evaluation environment, in this post I'll go though a simple but hopefully reasonably realistic scenario to create an application in an attempt to find the Endeca "secret sauce" that has got everyone so excited.

Obviously to start with some sample data is needed. While I could reuse the quickstart data, as this is so well known and covered in the various examples to reuse it again would possibly be cheating.  Partially the unfamiliarity with the data gives a better example.

In this instance I've gone for some data that is freely available from UCAS in the UK.  These are complete datasets of the number of applications and acceptances for universities across the UK.  They are available for download here.  While this represents a reasonably easy entry point to start using Endeca, it also offers some interesting possibilities for enhancing the data from other sources.  I've started by downloading the exe files and extracting the xls files.  These then need to be uploaded to the Endeca server.

To load the files into the Endeca server we need to use an Integrator project.

At this point I will assume you have been through the YouTube Endeca webcast series this should give you the basics of how to use Endeca.

So we need to start up integrator, which needs an X Windows Session of some sort on Linux, so I'm using VNC again.  I put some basic instructions on how to connect to a VNC session in an earlier post. The first step here is to create a new project, I have called my project UKUni.  The only of cheating I'm going to do is copy the workspace file from the quickstart application and the InitDataStore graph.  I changed the DATA_STORE_NAME parameter in the workspace to something new, I used ukuni again.

Use shell command like
mv *.xls /oracle/workspace/UkUni/data-in/
to put your recently downloaded excel files in to the data-in folder of the project. Selecting "refresh" on the folder data-in in the integrator application will make the files appear in the folder.

To load the source data we need a new graph, so right-click on the graph folder and select "New->ETL Graph".  To do any useful analysis across years we need to load and merge the multiple files we have into the datastore, this can be dong using a merge transformation, so firstly drag one of the data files from "data-in" on to the new graph.  This will create a spreadsheet loader associated with the file.

Further down the line we are going to need to create some output metadata for this loader, this is easier to do if we have the output edge connected to the next element.   In this case we know that we will need to create a key and set the year that the file corresponds to, so I'm going to use a reformat transformation to do that.  Drag a Reformat transformation onto your graph and connect the output of the loader to the input of the Reformat transformation.

Double-clicking on the loader opens up the dialog box.  We can see the source file is listed but no mapping.  Click in the mapping cell and then open the mapping dialog box.  The first task is to create some metadata.  Select row 5, the heading row and click Output Metadata "From Selection".  This creates a default metadata mapping based on the row headings.  In addition to the metadata from the file we are also going to need to identify which year this dataset corresponds to and create a key used to join the data later on.  I've added two new fields to the metadata, "Application_year" and "data_key".

Before we dive in and create the transformation we need to connect the output of the Reformat transformation to another component.  In this case I'm going to merge the data from more than one file, so I've added a Merge transformation to the graph.  Link the output of the Reformat to one input of the merge.

The metadata we created in the loader dialog has appeared under the meta-data folder, just not with a very nice name, just edit this and change it to something like institution data, we can thenreuse this for the other files.  I'm also going to use it for the output edge meta-data of the Reformat transformation.

Double clicking on the Reformat transformation opens the dialog, select the transform cell and open the dialog to create the transformation.  Most of the fields can be auto mapped, so use "Automap by name".  The transformations we need to edit are for application_year and data_key.  Editing the transformation for application_year and setting it to "2007" is the simplest way of setting this value.  The data_key is slightly more complex, each row is unique by "Instcode", "JACS Subject Group", "Domicile", "Gender" and "Year".  So if we create a concatenation of these 5 fields in the transformation, as is done for the "FactSales_RecordSpec" in the quickstart application.   This is what my transformation key looks like:
$in.0.InstCode+'-'+$in.0.JACS_Subject_Group+'-'+$in.0.Domicile+'-'+$in.0.Gender+'-'+"2007"

To make this example useful add another years worth of data by dragging in the next years data file.  I just copied the "Reformat" transformation and edited the transformation for the second file.  The meta-data from the first load can be reused.  Using the merge component we added earlier the two streams can be merged using the key of "data_key".   Finally the Endeca bulk loader is needed. Drag in a "Add/Bulk Replace Records" component,  remembering to set the datastore and spec attribute parameters. Connect to the output of the join and select the meta-data to  institution_data.  This is now a basic but workable transformation.

At this point you can save and run the transformation, which hopefully completes without any errors!

The next step is to create an Endeca Studio application to view and explore the data.  To keep things tidy I'll create a new Community to store our new pages in.   Communities are a method of grouping related pages together, along with user access permissions.  You could consider a community to be an application with multiple pages.   Communities are created through the Endeca Studio web interface.  On the control panel Communities menu select Add.  Set the name and a brief description, I set the type to restricted.   Once created you need to assign users to your community.   This is under actions assign members.

While we are in the control panel this is a good opportunity to go in and set the Attribute Settings.  The Attribute settings allow you to group attributes together, select default sort order and multi-select options.  If you are used to a conventional Star Schema type of BI application you could think of this as designing your dimensions on top of our data.

To work out where to start with our attribute groups (dimensional modelling) lets go and look at the data.  The easiest way to begin exploring is to create a  studio page and add some widgets to explore the data.   Start off by going back to the community menu, under actions click  "Manage Pages".  Enter a page name, I used overview, then "Add Page".  Now if you leave the control panel, and navigate to "My Places", the name of the community you created earlier, Overview.  This should bring up a blank page.  Start off by setting the layout template, I used a two column 30/70 split.  There are always a few default components we add to the page, Search Box, Breadcrumbs and a Guided Navigation component.  Each of these will need their datasource setting to out ukuni datasource. Opening up an element in the guided navigation component will now show you the top values for that field.  So looking at "Domicile" and "Gender"  we can see it is information about applicants, instCode and instName are about institutions.  By going through each field, or a subset to start with you can almost sketch out a dimensional model.

The next job to perform is to group the attributes together.  On the Control Panel open the Attribute Settings menu and selecting our datasource loads the attributes for our application.  From looking at the data earlier we know a few things about it, there is data relating to Institutions (name, code), Courses (JACS subject group), Applicants (Domicile, gender) and Applications (Deg_choices, Deg_accepts, etc).  So define these as groups   and assign the attributes to the groups.  Now this is where things are interesting, we effectively have three dimensions (Institutions, Courses and Applicants) and one fact table (Applications).   Isn't that kind of neat? I didn't have to do any of that design in the ETL, and importantly if I want to change it around again I still don't have to go near the ETL.

So now we have a working if really dull and not very useful method of viewing our data, lets change that by adding a chart.  Charts in Endeca are built on top of views, views are the same in concept as views in a relational database such as Oracle, in that they area  query on top of the underlying "physical" tables. They are defined in the View Manager in the Control Panel.

This is what my view looks like:


I've set the dimension flag on our "dimension like" fields (gender etc).  Returning to the application add a Chart component to the right hand panel of the screen.  Use the preferences button on the component to configure the chart.  This is where you set the source (our new view) and chart type etc I went for a stacked bar chart.   On the chart configuration tab the metrics to show and the dimensions to view them by are set, I used Choices_all as the metric, institution region, institution name as the group dimensions and  Gender and Domicile as the series dimensions.  The hidden bit is the "Sort Options" button in the middle towards the bottom, I set the sort order to be by first metric then restrict the number of results to the first 20 to keep things tidy.

So to finish off the chart looks like this:

So thats gone from loading data to viewing it in a meaning full way.

Getting back to the initial objective of this post, was there any "Secret Sauce"?  Was there anything in this process that gives Endeca a crucial advantage?  I'm going to contend that there is, while the overall interface, look and feel and workflow are very slick and certainly up with the best there was something else that was different.  That was the point around the dimensional modelling, rather than the dimensional model being physical it is a logical overlaying on a flat physical store.

The reasons used in the past for having physical dimensional models vary, but generally are based around querying simplicity, search performance and data volumes.  By treating all of our dimensional attributes as degenerate dimensions and adding a logical layer to group them together are we experiencing any trade offs?  What would Dimensional modelling practice say about this?  In a practical sense we are making our fact table exceptionally large, so in theory this should slow down searching.  But if this is counteracted by the speedup factor that an in-memory search orientd database can add do we actually lose anything?  It's difficult to tell with the volume of data in this test application what the performance is, but with the huge difference in performance between disk and memory storage it is quite possible that our "inferior" dimensional design is more than offset by the increase in storage performance.  The MDEX engine appears to given us a tool that can more than offset the design disadvantage of the fully de-normalised data model.

In the next few posts I'll finish off this little application  a little more and "close the loop" on how to make this into a fully functioning "Production quality" Endeca BI application.  Following on from that it would be interesting to see if other open-source in-memory databases can also give some of the advantages of the Endeca MDEX engine.





Saturday, 9 February 2013

Agile BI with Endeca - Part 3

In previous posts I have looked at the broader picture of what Endeca is and given a brief outline of how to setup an evaluation environment on AWS.  In this article I will cover a few minor details of how to make your evaluation environment a little more friendly.

Starting and Stopping


One of the reasons for using AWS is that you only pay for the resource while you are using it, but its rather inconvenient to have to keep logging in via ssh each time to restart the server and portal.   I'm rather surprised that Oracle did not already provide an init script to do this for you, if they have then I have not been able to find it anywhere.  

Linux servers start and stop services with scripts located in /etc/init.d take a look at a few of the examples in there.  

This is one I have written myself to start, stop and check the status of Endeca.  It is not perfect, and there is lots of scope for improvement but it is a starting point.

#!/bin/bash
#
#       /etc/init.d/endeca
#
# Starts the Endeca services
#
# Author: Mark Melton
#
# chkconfig: - 95 5
#   

# Source function library.
. /etc/init.d/functions

RETVAL=0

#
#       See how we were called.
#

server_path="/oracle/Oracle/Endeca/Server/7.4.0/endeca-server"
server_start_cmd="start.sh"
server_stop_cmd="stop.sh"

portal_path="/oracle/Oracle/Endeca/Discovery/2.4.0/endeca-portal/tomcat-6.0.35/bin"
portal_start_cmd="startup.sh"
portal_stop_cmd="shutdown.sh"


start() {


        # Endeca Server
        #
        # Check if endeca-server is already running
        server_pid=$(ps -ef | grep endeca-server | grep -v grep | awk '{print $2}')
        if [ -z "$server_pid" ]; then
            echo -ne $"Starting Endeca Server\n"
            cd $server_path
            $cmd ./$server_start_cmd >> logs/server.log &
            RETVAL=$?
        else
            echo -ne "endeca_server already running with PID $server_pid\n"
        fi

        # Endeca Portal
        #
        # Check if endeca-portal is already running
        portal_pid=$(ps -ef | grep endeca-portal | grep -v grep | awk '{print $2}')
        if [ -z "$portal_pid" ]; then
            echo -ne $"Starting Endeca Portal\n"
            cd $portal_path
            $cmd ./$portal_start_cmd >> portal.log &
            RETVAL=$?
        else
            echo -ne "endeca_portal already running with PID $portal_pid\n"
        fi                                                                                                                                                                   


        return $RETVAL
}

stop() {
        echo -ne $"Stopping endeca-server\n"
        cd $server_path
        $cmd ./$server_stop_cmd server.log &
        RETVAL=$?
        echo

        if [ ! -z "$RETVAL" ]; then
            echo -ne "There was a problem stopping the endeca-server"                                                                                                        
   
            return $RETVAL
        fi
        cd $portal_path  
        $cmd./$portal_stop_cmd >> portal.log &
        RETVAL=$?

        return $RETVAL
}


restart() {
        stop
        start
}

reload() {
        restart
}

status_at() {
        server_pid=$(ps -ef | grep endeca-server | grep -v grep | awk '{print $2}')
        if [ -z "$server_pid" ]; then
            echo -ne $"Endeca Server is not running\n"
        else
            echo -ne $"Endeca Server is running\n"
        fi

        portal_pid=$(ps -ef | grep endeca-portal | grep -v grep | awk '{print $2}')
        if [ -z "$portal_pid" ]; then
            echo -ne $"Endeca Portal is not running\n"
        else
            echo -ne $"Endeca Portal is running\n"
        fi


}

case "$1" in
start)
        start
        ;;
stop)
        stop
        ;;
reload|restart)
         restart                                                                                                                                                             
        ;;
condrestart)
        if [ -f /var/lock/subsys/wdaemon ]; then
            restart
        fi
        ;;
status)
        status_at
        ;;
*)
        echo $"Usage: $0 {start|stop|restart|condrestart|status}"
        exit 1
esac

exit $?
exit $RETVAL

There is just one more command to run:

# chkconfig endeca on

Now when you reboot your Endeca services should restart automatically.

Location


Just one last thing to make it easier to find your Endeca environment is to use an Elastic IP address.  You can have one elastic IP address associated with a running instance at no charge.  Go to the Elastic IP address tab on your AWS console, request a new Elastic IP address and associate it with your Endeca Instance.  You will now be able to access your Endeca instance on  elastic_ip:8080.  You do have to associate the IP address with the server instance each time you start it up, but this can easily be done through the EC2 console.

It just remains to give your server and stop and start just to make sure that everything is working as expected.