Wednesday 20 February 2013

Agile BI with Endeca - Part 4 - In Search of the Secret Sauce

In my previous posts I set up an Endeca evaluation environment. In this post I'll go through a simple but hopefully reasonably realistic scenario to create an application, in an attempt to find the Endeca "secret sauce" that has got everyone so excited.

Obviously, to start with, some sample data is needed.  While I could reuse the quickstart data, it is so well known and covered in the various examples that reusing it again would feel like cheating.  Besides, a little unfamiliarity with the data makes for a better example.

In this instance I've gone for some data that is freely available from UCAS in the UK.  These are complete datasets of the number of applications and acceptances for universities across the UK.  They are available for download here.  While this represents a reasonably easy entry point to start using Endeca, it also offers some interesting possibilities for enhancing the data from other sources.  I've started by downloading the exe files and extracting the xls files.  These then need to be uploaded to the Endeca server.
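If you want to do the extraction and upload from the command line, a sketch along these lines should work.  The file names, user and host are just examples from my environment, and I'm assuming the .exe downloads are self-extracting ZIP archives; if unzip can't open them, run them on a Windows machine and keep the .xls files they produce.

# extract the spreadsheets locally (assumption: the .exe downloads are self-extracting ZIPs)
unzip ucas_2007.exe
unzip ucas_2008.exe
# copy the spreadsheets up to the Endeca server (example user and host)
scp *.xls oracle@endecaserver:~/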

To load the files into the Endeca server we need to use an Integrator project.

At this point I will assume you have been through the YouTube Endeca webcast series; this should give you the basics of how to use Endeca.

So we need to start up Integrator, which needs an X Windows session of some sort on Linux, so I'm using VNC again.  I put some basic instructions on how to connect to a VNC session in an earlier post.  The first step here is to create a new project; I have called my project UkUni.  The only bit of cheating I'm going to do is to copy the workspace file and the InitDataStore graph from the quickstart application.  I changed the DATA_STORE_NAME parameter in the workspace to something new; I used ukuni again.
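If you'd rather do that copy and edit from the command line, it looks something like the sketch below.  The quickstart project location is a guess for my environment, and the workspace file here is the standard workspace.prm, so adjust the paths to match your own install:

cp /oracle/workspace/quickstart/workspace.prm /oracle/workspace/UkUni/
cp /oracle/workspace/quickstart/graph/InitDataStore.grf /oracle/workspace/UkUni/graph/
# point the new project at its own data store
sed -i 's/^DATA_STORE_NAME=.*/DATA_STORE_NAME=ukuni/' /oracle/workspace/UkUni/workspace.prm

A refresh of the project in Integrator will then make the copied files visible.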

Use a shell command like
mv *.xls /oracle/workspace/UkUni/data-in/
to put your recently downloaded Excel files into the data-in folder of the project.  Selecting "Refresh" on the data-in folder in the Integrator application will make the files appear in the folder.

To load the source data we need a new graph, so right-click on the graph folder and select "New->ETL Graph".  To do any useful analysis across years we need to load and merge the multiple files we have into the datastore; this can be done using a Merge transformation.  So firstly drag one of the data files from "data-in" onto the new graph.  This will create a spreadsheet loader associated with the file.

Further down the line we are going to need to create some output metadata for this loader, which is easier to do if we have the output edge connected to the next element.  In this case we know that we will need to create a key and set the year that the file corresponds to, so I'm going to use a Reformat transformation to do that.  Drag a Reformat transformation onto your graph and connect the output of the loader to the input of the Reformat.

Double-clicking on the loader opens up its dialog box.  We can see the source file is listed but there is no mapping.  Click in the mapping cell and then open the mapping dialog box.  The first task is to create some metadata.  Select row 5, the heading row, and click Output Metadata "From Selection".  This creates a default metadata mapping based on the row headings.  In addition to the metadata from the file we are also going to need to identify which year this dataset corresponds to and create a key used to join the data later on, so I've added two new fields to the metadata, "Application_year" and "data_key".

Before we dive in and create the transformation we need to connect the output of the Reformat transformation to another component.  In this case I'm going to merge the data from more than one file, so I've added a Merge transformation to the graph.  Link the output of the Reformat to one input of the merge.

The metadata we created in the loader dialog has appeared under the metadata folder, just not with a very nice name.  Edit this and change it to something like institution_data; we can then reuse it for the other files.  I'm also going to use it for the output edge metadata of the Reformat transformation.

Double-clicking on the Reformat transformation opens the dialog; select the transform cell and open the dialog to create the transformation.  Most of the fields can be auto-mapped, so use "Automap by name".  The transformations we need to edit are for Application_year and data_key.  Editing the transformation for Application_year and setting it to "2007" is the simplest way of setting this value.  The data_key is slightly more complex: each row is unique by "InstCode", "JACS Subject Group", "Domicile", "Gender" and "Year", so we create a concatenation of these fields in the transformation, much as is done for the "FactSales_RecordSpec" in the quickstart application.  This is what my transformation key looks like:
$in.0.InstCode+'-'+$in.0.JACS_Subject_Group+'-'+$in.0.Domicile+'-'+$in.0.Gender+'-'+"2007"

To make this example useful, add another year's worth of data by dragging in the next year's data file.  I just copied the "Reformat" transformation and edited the transformation for the second file, as shown below.  The metadata from the first load can be reused.  Using the Merge component we added earlier, the two streams can be merged using the key "data_key".  Finally the Endeca bulk loader is needed: drag in an "Add/Bulk Replace Records" component, remembering to set the datastore and spec attribute parameters.  Connect it to the output of the Merge and set the edge metadata to institution_data.  This is now a basic but workable transformation.
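For the copied Reformat the only things that need to change are the hard-coded year values.  Assuming the second file covers the 2008 cycle and has the same column layout, the Application_year expression just becomes "2008" and the data_key expression becomes:
$in.0.InstCode+'-'+$in.0.JACS_Subject_Group+'-'+$in.0.Domicile+'-'+$in.0.Gender+'-'+"2008"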

At this point you can save and run the transformation, which hopefully completes without any errors!

The next step is to create an Endeca Studio application to view and explore the data.  To keep things tidy I'll create a new Community to store our new pages in.  Communities are a method of grouping related pages together, along with user access permissions; you could consider a community to be an application with multiple pages.  Communities are created through the Endeca Studio web interface: on the Control Panel Communities menu select Add, then set the name and a brief description.  I set the type to restricted.  Once created you need to assign users to your community, which is under Actions -> Assign Members.

While we are in the Control Panel this is a good opportunity to go in and set the Attribute Settings.  The Attribute Settings allow you to group attributes together and select default sort order and multi-select options.  If you are used to a conventional star schema type of BI application you could think of this as designing your dimensions on top of your data.

To work out where to start with our attribute groups (dimensional modelling) let's go and look at the data.  The easiest way to begin exploring is to create a Studio page and add some widgets to explore the data.  Start off by going back to the Community menu and, under Actions, click "Manage Pages".  Enter a page name, I used Overview, then "Add Page".  Now leave the Control Panel and navigate to "My Places" -> the name of the community you created earlier -> Overview.  This should bring up a blank page.  Start off by setting the layout template; I used a two column 30/70 split.  There are a few default components we always add to the page: Search Box, Breadcrumbs and a Guided Navigation component.  Each of these will need its data source set to our ukuni data source.  Opening up an element in the Guided Navigation component will now show you the top values for that field.  So looking at "Domicile" and "Gender" we can see they hold information about applicants, while InstCode and InstName are about institutions.  By going through each field, or a subset to start with, you can almost sketch out a dimensional model.

The next job is to group the attributes together.  On the Control Panel open the Attribute Settings menu; selecting our datasource loads the attributes for our application.  From looking at the data earlier we know a few things about it: there is data relating to Institutions (name, code), Courses (JACS subject group), Applicants (Domicile, Gender) and Applications (Deg_choices, Deg_accepts, etc).  So define these as groups and assign the attributes to the groups.  Now this is where things get interesting: we effectively have three dimensions (Institutions, Courses and Applicants) and one fact table (Applications).  Isn't that kind of neat?  I didn't have to do any of that design in the ETL, and importantly, if I want to change it around again I still don't have to go near the ETL.

So now we have a working, if really dull and not very useful, way of viewing our data; let's change that by adding a chart.  Charts in Endeca are built on top of views.  Views are the same in concept as views in a relational database such as Oracle, in that they are a query on top of the underlying "physical" tables.  They are defined in the View Manager in the Control Panel.

This is what my view looks like:


I've set the dimension flag on our "dimension like" fields (Gender etc).  Returning to the application, add a Chart component to the right-hand panel of the screen.  Use the preferences button on the component to configure the chart.  This is where you set the source (our new view), the chart type and so on; I went for a stacked bar chart.  On the chart configuration tab the metrics to show and the dimensions to view them by are set: I used Choices_all as the metric, institution region and institution name as the group dimensions, and Gender and Domicile as the series dimensions.  The hidden bit is the "Sort Options" button in the middle towards the bottom; I set the sort order to be by the first metric and then restricted the number of results to the first 20 to keep things tidy.

So to finish off, the chart looks like this:

So that's gone from loading data to viewing it in a meaningful way.

Getting back to the initial objective of this post, was there any "secret sauce"?  Was there anything in this process that gives Endeca a crucial advantage?  I'm going to contend that there is.  While the overall interface, look and feel and workflow are very slick and certainly up there with the best, there was something else that was different: the point around the dimensional modelling.  Rather than the dimensional model being physical, it is a logical overlay on a flat physical store.

The reasons given in the past for having physical dimensional models vary, but generally they are based around querying simplicity, search performance and data volumes.  By treating all of our dimensional attributes as degenerate dimensions and adding a logical layer to group them together, are we accepting any trade-offs?  What would dimensional modelling practice say about this?  In a practical sense we are making our fact table exceptionally large, since every dimensional attribute is repeated on every row, so in theory this should slow down searching.  But if this is counteracted by the speed-up that an in-memory, search-oriented database can add, do we actually lose anything?  It's difficult to tell what the performance is with the volume of data in this test application, but with the huge difference in performance between disk and memory storage it is quite possible that our "inferior" dimensional design is more than offset by the increase in storage performance.  The MDEX engine appears to give us a tool that can more than offset the design disadvantage of the fully de-normalised data model.

In the next few posts I'll polish this little application a little more and "close the loop" on how to make it into a fully functioning, "production quality" Endeca BI application.  Following on from that, it would be interesting to see whether other open-source in-memory databases can also give some of the advantages of the Endeca MDEX engine.




