Saturday, 23 March 2013

Endeca Information Discovery 3.x breaks cover

While looking through the Endeca documentation I noticed that Endeca 3.x has appeared.

I've not had time to do any in-depth digging around, but there look to be some nice headlines:

  • New chart types in Studio (Bubble and Scatter look to be the biggest changes)
  • The documentation looks to have been tidied up and better structured
  • End users can add their own data from spreadsheets to analyse - the data provisioning service
  • The Content Acquisition System has become the Integrator Acquisition System 
The new chart types may not seem that significant, but I've always considered them to be two of the most useful ways of presenting many types of data.  I know that in some use cases I've come across, their absence was a major blocker.

The data provisioning service is a major enhancement for many use cases.  I could go on at length about the pros and cons of allowing end users or analysts to "bring your own data", but not even having it as an option was a major blocker.

I always felt that previous versions of the Endeca documentation had something of a "thrown together" quality about them in some areas, so it looks like there has been some tidying up.

The old Content Acquisition System seems to have transformed into the Integrator Acquisition System; I'm not sure what else has changed there.

I'll try to go through the new features in more detail as I get the chance to explore them.  But at first glance this looks to be a major step forward, making this tool much more applicable in the general "data discovery" space against the likes of Tableau and Spotfire.  Add in the unstructured power of the Salience engine and I think Endeca looks all set for lift-off.

More information is available on the Oracle website.

Apologies for not being able to put in more detail at this time but I've got a half marathon to run in the morning.






Monday, 18 March 2013

So what is Endeca for?

I'm taking a slight diversion from my planned route with this post.  I was looking at some of the technical aspects of how to use Endeca, but talking with various people there still seems to be some confusion about what Endeca is or does, or more importantly where it fits into an information management ecosystem.  My impression is that even Oracle are still not quite sure, and some industry watchers see it as just an enterprise search tool.  So let's try to get to the root of what Endeca is and what the right use cases for it are.

The term BI, like "Big Data", is becoming increasingly abused.  It used to be relatively simple to define what BI was: you had sources, ETL tools, warehouses, reporting, OLAP and dashboards.  Then the term "Enterprise Analytics" came along, so is that BI as well?  Usually this involved the implementation of some form of mathematical model to predict future performance based on past performance and a set of variables.  So now BI covers what did happen and what might happen.  But to enact business change and get the desired results of increased sales, improved retention, reduced costs, or whatever the objective might be, there is still human evaluation and a degree of speculation in the processing of the data.  There is still the gap of knowing why something happened.  While I'd freely admit it is possible to refine descriptive BI to provide some answers to the question of why something happened, it is a rather slow and laborious process: the BI system architects, analysts and developers need to know where to find the answers and how to process them before they can deliver insight into why events happened.

In the reality of a very competitive business world, people need to make decisions much faster than data scientists can find the probable cause and developers can add it to the BI platform, which is where tools like Qlikview appear on the scene.  By delivering the nirvana of self-service BI, business users can follow up on a hunch and confirm it much more quickly than before.  Used correctly, products like Qlikview, Tibco Spotfire and Tableau add immense business value alongside descriptive BI platforms such as OBIEE, Cognos and Business Objects.  So where does Endeca sit in this space?

My biggest concern about Endeca revolves around Oracle's definition of what it's for.  The official line appears to be that it is a tool to "Unlock Insights from Any Source", but I'm not convinced that they have yet figured out how to turn that proposition into a value pitch.  It is an expensive product, and in any business you must be able to put together a reasonable business case to demonstrate that you will get a return on your investment.  That straight away leads to "what insights?", "what are they worth to us?" and finally "does it have to be built by IT?", and I think it's here that the problem lies.  Possibly in the parts of the world where the economy is showing signs of recovery and business leaders are less risk averse, the "let's take the risk" approach might come off.  Currently in the UK this is certainly not the case: money is tight in most businesses and management is historically more conservative.  Can end users use Endeca on their own without IT involvement?  Certainly not yet; Clover ETL is a nice tool but it's still beyond the skill-set of most non-techies.  So Oracle still have some work to do here: there needs to be a real means of demonstrating value to the business if Oracle are to get their UK sales.

So let's see if I can help them out a little.  I rather like the concept of Endeca and technically it's very clever, so I'd like to see it succeed.  First, let's look at the business cases where we would use Endeca, or rather, as a diversion, where we would not use it.  If we're looking at just structured data of reasonable quality then Qlikview, Spotfire or Tableau come higher up the list of solutions; just from the perspective of product maturity and feature set, all three are more capable than Endeca out of the box.  So do we need to be looking into the realms of "unstructured" or weakly typed data?  There are three aspects to this: where does the unstructured data come from and how do we get it into Endeca, what value is in the data, and how do we extract it?

From a business perspective, unstructured data generally appears in a few places:
  • email 
  • call logs
  • online publications
  • forums
  • twitter
  • special interest websites 
  • review sites
So is there anything of value in that lot?  If we look at the top business applications of Text Analytics:

  • Product/Brand/Reputation Management
  • Customer Experience Management
  • Search or question answering
  • "other research"
  • Competitive intelligence
There are some potential areas of overlap here, but email communication lets us address the top two business applications.  In emails from customers there may well be useful insight that could help identify, early on, problems with products and services that might adversely affect product or brand reputation.  Equally, we might resolve issues around the customer service experience by identifying problematic customer service processes or people.  This is tightly targeted data of known context, so it's reasonably easy to see how it could be mined for value.  To extract this value we could just use a simple white-list tagger of product names and a metric counting the number of occurrences; any sudden change indicates something worth investigating.  But what's the value of this data?  In high-value consumer goods industries this would be about protecting a brand's reputation, almost certainly enough to justify the investment.  In service industries with high churn rates, such as mobile telephony, there is again probably value in brand protection, but the volume of email is probably limited, with more service contact coming by voice.  So this is a rather simple "keyword counting" analysis to produce an actionable metric; hardly rocket science, but there is probably value there.
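As a rough sketch of how simple that first pass could be, the shell fragment below counts how many exported emails mention each white-listed product name.  The file and folder names are illustrative assumptions, and in practice this logic would live inside the Integrator pipeline rather than a shell script.

# Minimal white-list tagger sketch (illustrative names and paths)
# product_whitelist.txt: one product name per line
# emails/: a folder of plain-text email exports
while read -r product; do
    count=$(grep -Fril "$product" emails/ | wc -l)   # emails mentioning this product
    echo "$product,$count"
done < product_whitelist.txt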

So away from the product white-listing, where do we go?  Particularly in the rather more qualitative area of customer service "quality of engagement", this is not an easy problem to solve, which is where the Lexalytics Salience engine steps in.  This is an extra licensed component in Endeca, but it could be where a lot more value can be unlocked.  By scoring the overall sentiment of a piece of text, deriving metrics and picking out themes, actionable intelligence is created.

The email and call log analytics is a very good use case, and it sits entirely inside the corporate firewall.  Looking outside the firewall, we can begin to address some of the other problems by looking at forums, special interest websites and review sites.  There are plenty of off-the-shelf sentiment analysis feeds, such as the Salesforce Marketing Cloud (previously known as Radian6).  This type of product is a feed of data for the sites and systems the vendor chooses to monitor, in the way they choose to monitor them.  So if you need to answer more specific questions, you either need to get highly skilled programmers, data scientists and analysts involved, or look at the Endeca content acquisition capabilities along with sentiment analysis.  This begins to take you beyond conventional sentiment analysis as consumed by most marketing departments.  These sorts of capabilities would have helped many companies get early warning of problems and issues that blindsided them, but is it possible to make the business case for this sort of investment ahead of time?  That is a more difficult case to make; building a business case on the "unknown unknowns" is never going to be easy.

Is there a market for Endeca?  Undoubtedly, but I don't think it looks quite like the one Oracle are trying to sell into, certainly in the UK.  In the hands of a data scientist or data analyst Endeca can do incredible things, but what it cannot yet do is allow end users to do their own analysis on "any source".  So what else does Endeca need to really fly?  The methods of reading data need to improve: there needs to be end-user-level "loading of any data", because while I find Clover ETL easy to use, most end users will not.  There are two further significant omissions: statistical analysis in the form of R, and better visualisations.  While it is possible to do statistical analysis in EQL it is a rather painful business, requiring the developer to build the functions up from base mathematical functions.  The limited range of visualisations is also frustrating; the absence of bubble charts is a major omission.

Is there anything that can be done outside of the Endeca platform to address some of its shortcomings?  Possibly there is.  The system has many APIs and developer interfaces, so the problem of visualisation might be solvable.  The dynamic metadata capabilities of Clover could possibly give a way in for "any data", creating an end-user data load capability.

I'm sure with time Endeca will find its place in the information management ecosystem; I just hope it does not get stuck being branded as "an enterprise search tool", because it is much more than that, or get caught up in the big data hype and the rush towards Hadoop in some circles.





Wednesday, 20 February 2013

Agile BI with Endeca - Part 4 - In search of the secret Sauce

In my previous posts I set up an Endeca evaluation environment.  In this post I'll go through a simple but hopefully reasonably realistic scenario to create an application, in an attempt to find the Endeca "secret sauce" that has got everyone so excited.

Obviously, to start with, some sample data is needed.  While I could reuse the quickstart data, it is so well known and covered in the various examples that reusing it again would possibly be cheating.  Besides, a little unfamiliarity with the data makes for a better example.

In this instance I've gone for some data that is freely available from UCAS in the UK.  These are complete datasets of the number of applications and acceptances for universities across the UK.  They are available for download here.  While this represents a reasonably easy entry point to start using Endeca, it also offers some interesting possibilities for enhancing the data from other sources.  I've started by downloading the exe files and extracting the xls files.  These then need to be uploaded to the Endeca server.
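If your Endeca server is the AWS instance from the earlier posts, scp is the simplest way to get the files up there; the key file and hostname below are placeholders for your own.

scp -i my-key.pem *.xls root@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:~/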

To load the files into the Endeca server we need to use an Integrator project.

At this point I will assume you have been through the YouTube Endeca webcast series; this should give you the basics of how to use Endeca.

So we need to start up Integrator, which needs an X Windows session of some sort on Linux, so I'm using VNC again; I put some basic instructions on how to connect to a VNC session in an earlier post.  The first step here is to create a new project; I have called my project UkUni.  The only bit of cheating I'm going to do is to copy the workspace file from the quickstart application and the InitDataStore graph.  I changed the DATA_STORE_NAME parameter in the workspace to something new; I used ukuni again.

Use a shell command like
mv *.xls /oracle/workspace/UkUni/data-in/
to put your recently uploaded Excel files into the data-in folder of the project. Selecting "Refresh" on the data-in folder in the Integrator application will make the files appear there.

To load the source data we need a new graph, so right-click on the graph folder and select "New->ETL Graph".  To do any useful analysis across years we need to load and merge the multiple files we have into the datastore; this can be done using a merge transformation.  So first, drag one of the data files from "data-in" onto the new graph.  This will create a spreadsheet loader associated with the file.

Further down the line we are going to need to create some output metadata for this loader, and this is easier to do if we have the output edge connected to the next element.  In this case we know that we will need to create a key and set the year that the file corresponds to, so I'm going to use a Reformat transformation to do that.  Drag a Reformat transformation onto your graph and connect the output of the loader to the input of the Reformat.

Double-clicking on the loader opens up its dialog box.  We can see the source file is listed, but there is no mapping.  Click in the mapping cell and then open the mapping dialog box.  The first task is to create some metadata: select row 5, the heading row, and click Output Metadata "From Selection".  This creates a default metadata mapping based on the row headings.  In addition to the metadata from the file we are also going to need to identify which year this dataset corresponds to and create a key used to join the data later on, so I've added two new fields to the metadata, "Application_year" and "data_key".

Before we dive in and create the transformation we need to connect the output of the Reformat transformation to another component.  In this case I'm going to merge the data from more than one file, so I've added a Merge transformation to the graph.  Link the output of the Reformat to one input of the merge.

The metadata we created in the loader dialog has appeared under the meta-data folder, just not with a very nice name.  Edit this and change it to something like institution_data; we can then reuse it for the other files.  I'm also going to use it for the output edge metadata of the Reformat transformation.

Double-clicking on the Reformat transformation opens its dialog; select the transform cell and open the dialog to create the transformation.  Most of the fields can be auto-mapped, so use "Automap by name".  The transformations we need to edit are those for application_year and data_key.  Editing the transformation for application_year and setting it to "2007" is the simplest way of setting this value.  The data_key is slightly more complex: each row is unique by "InstCode", "JACS Subject Group", "Domicile", "Gender" and "Year", so we create a concatenation of these five fields in the transformation, as is done for the "FactSales_RecordSpec" in the quickstart application.  This is what my transformation key looks like:
$in.0.InstCode+'-'+$in.0.JACS_Subject_Group+'-'+$in.0.Domicile+'-'+$in.0.Gender+'-'+"2007"

To make this example useful, add another year's worth of data by dragging in the next year's data file.  I just copied the Reformat transformation and edited the transform for the second file; the metadata from the first load can be reused.  Using the Merge component we added earlier, the two streams can be merged using the "data_key" key.  Finally the Endeca bulk loader is needed: drag in an "Add/Bulk Replace Records" component, remembering to set the datastore and spec attribute parameters, connect it to the output of the merge, and set its metadata to institution_data.  This is now a basic but workable transformation.

At this point you can save and run the transformation, which hopefully completes without any errors!

The next step is to create an Endeca Studio application to view and explore the data.  To keep things tidy I'll create a new Community to store our new pages in.  Communities are a method of grouping related pages together, along with user access permissions; you could consider a community to be an application with multiple pages.  Communities are created through the Endeca Studio web interface: on the control panel Communities menu select Add, then set the name and a brief description (I set the type to restricted).  Once created, you need to assign users to your community; this is under Actions -> Assign Members.

While we are in the control panel this is a good opportunity to go in and set the Attribute Settings.  The Attribute Settings allow you to group attributes together and select default sort order and multi-select options.  If you are used to a conventional star schema type of BI application, you could think of this as designing your dimensions on top of the data.

To work out where to start with our attribute groups (dimensional modelling) let's go and look at the data.  The easiest way to begin exploring is to create a Studio page and add some widgets to explore the data.  Start off by going back to the community menu and, under Actions, click "Manage Pages".  Enter a page name (I used Overview), then click "Add Page".  Now leave the control panel and navigate to "My Places", then the name of the community you created earlier, then Overview; this should bring up a blank page.  Start off by setting the layout template; I used a two-column 30/70 split.  There are a few default components I always add to a page: a Search Box, Breadcrumbs and a Guided Navigation component.  Each of these will need its data source set to our ukuni datasource.  Opening up an element in the Guided Navigation component will now show you the top values for that field.  Looking at "Domicile" and "Gender" we can see they are information about applicants, while instCode and instName are about institutions.  By going through each field, or a subset to start with, you can almost sketch out a dimensional model.

The next job is to group the attributes together.  On the Control Panel, open the Attribute Settings menu; selecting our datasource loads the attributes for our application.  From looking at the data earlier we know a few things about it: there is data relating to Institutions (name, code), Courses (JACS subject group), Applicants (Domicile, Gender) and Applications (Deg_choices, Deg_accepts, etc.).  So define these as groups and assign the attributes to them.  Now this is where things get interesting: we effectively have three dimensions (Institutions, Courses and Applicants) and one fact table (Applications).  Isn't that kind of neat?  I didn't have to do any of that design in the ETL, and importantly, if I want to change it around again I still don't have to go near the ETL.

So now we have a working, if really dull and not very useful, way of viewing our data; let's change that by adding a chart.  Charts in Endeca are built on top of views.  Views are the same in concept as views in a relational database such as Oracle, in that they are a query on top of the underlying "physical" tables.  They are defined in the View Manager in the Control Panel.

This is what my view looks like:


I've set the dimension flag on our "dimension-like" fields (Gender etc.).  Returning to the application, add a Chart component to the right-hand panel of the screen.  Use the preferences button on the component to configure the chart; this is where you set the source (our new view), the chart type and so on.  I went for a stacked bar chart.  On the chart configuration tab the metrics to show and the dimensions to view them by are set: I used Choices_all as the metric, institution region and institution name as the group dimensions, and Gender and Domicile as the series dimensions.  The hidden bit is the "Sort Options" button in the middle towards the bottom; I set the sort order to be by the first metric and then restricted the number of results to the first 20 to keep things tidy.

So to finish off the chart looks like this:

So that's gone from loading data to viewing it in a meaningful way.

Getting back to the initial objective of this post, was there any "secret sauce"?  Was there anything in this process that gives Endeca a crucial advantage?  I'm going to contend that there is.  While the overall interface, look and feel and workflow are very slick and certainly up with the best, there was something else that was different: the point around dimensional modelling.  Rather than the dimensional model being physical, it is a logical overlay on a flat physical store.

The reasons given in the past for having physical dimensional models vary, but generally they are based around querying simplicity, search performance and data volumes.  By treating all of our dimensional attributes as degenerate dimensions and adding a logical layer to group them together, are we accepting any trade-offs?  What would dimensional modelling practice say about this?  In a practical sense we are making our fact table exceptionally large, so in theory this should slow down searching.  But if this is counteracted by the speed-up that an in-memory, search-oriented database can add, do we actually lose anything?  It's difficult to tell what the performance is with the volume of data in this test application, but with the huge difference in performance between disk and memory storage it is quite possible that our "inferior" dimensional design is more than offset by the increase in storage performance.  The MDEX engine appears to give us a tool that can more than offset the design disadvantage of the fully de-normalised data model.

In the next few posts I'll finish off this little application and "close the loop" on how to make it into a fully functioning, production-quality Endeca BI application.  Following on from that, it would be interesting to see whether other open-source in-memory databases can also give some of the advantages of the Endeca MDEX engine.





Saturday, 9 February 2013

Agile BI with Endeca - Part 3

In previous posts I have looked at the broader picture of what Endeca is and given a brief outline of how to set up an evaluation environment on AWS.  In this article I will cover a few minor details of how to make your evaluation environment a little more friendly.

Starting and Stopping


One of the reasons for using AWS is that you only pay for the resource while you are using it, but it's rather inconvenient to have to keep logging in via ssh each time to restart the server and portal.  I'm rather surprised that Oracle did not already provide an init script to do this for you; if they have, then I have not been able to find it anywhere.

Linux servers start and stop services with scripts located in /etc/init.d; take a look at a few of the examples in there.

This is one I have written myself to start, stop and check the status of Endeca.  It is not perfect, and there is lots of scope for improvement but it is a starting point.

#!/bin/bash
#
#       /etc/init.d/endeca
#
# Starts the Endeca services
#
# Author: Mark Melton
#
# chkconfig: - 95 5
# description: Starts and stops the Oracle Endeca Server and Studio portal
#

# Source function library.
. /etc/init.d/functions

RETVAL=0

#
#       See how we were called.
#

server_path="/oracle/Oracle/Endeca/Server/7.4.0/endeca-server"
server_start_cmd="start.sh"
server_stop_cmd="stop.sh"

portal_path="/oracle/Oracle/Endeca/Discovery/2.4.0/endeca-portal/tomcat-6.0.35/bin"
portal_start_cmd="startup.sh"
portal_stop_cmd="shutdown.sh"


start() {


        # Endeca Server
        #
        # Check if endeca-server is already running
        server_pid=$(ps -ef | grep endeca-server | grep -v grep | awk '{print $2}')
        if [ -z "$server_pid" ]; then
            echo -ne $"Starting Endeca Server\n"
            cd $server_path
            ./$server_start_cmd >> logs/server.log &
            RETVAL=$?
        else
            echo -ne "endeca_server already running with PID $server_pid\n"
        fi

        # Endeca Portal
        #
        # Check if endeca-portal is already running
        portal_pid=$(ps -ef | grep endeca-portal | grep -v grep | awk '{print $2}')
        if [ -z "$portal_pid" ]; then
            echo -ne $"Starting Endeca Portal\n"
            cd $portal_path
            ./$portal_start_cmd >> portal.log &
            RETVAL=$?
        else
            echo -ne "endeca_portal already running with PID $portal_pid\n"
        fi                                                                                                                                                                   


        # Create a lock file so condrestart knows the services have been started
        [ $RETVAL -eq 0 ] && touch /var/lock/subsys/endeca
        return $RETVAL
}

stop() {
        echo -ne $"Stopping endeca-server\n"
        cd $server_path
        ./$server_stop_cmd >> logs/server.log &
        RETVAL=$?
        echo

        if [ $RETVAL -ne 0 ]; then
            echo -ne "There was a problem stopping the endeca-server\n"
   
            return $RETVAL
        fi
        cd $portal_path
        ./$portal_stop_cmd >> portal.log &
        RETVAL=$?
        # Remove the lock file so condrestart knows the services have stopped
        rm -f /var/lock/subsys/endeca

        return $RETVAL
}


restart() {
        stop
        start
}

reload() {
        restart
}

status_at() {
        server_pid=$(ps -ef | grep endeca-server | grep -v grep | awk '{print $2}')
        if [ -z "$server_pid" ]; then
            echo -ne $"Endeca Server is not running\n"
        else
            echo -ne $"Endeca Server is running\n"
        fi

        portal_pid=$(ps -ef | grep endeca-portal | grep -v grep | awk '{print $2}')
        if [ -z "$portal_pid" ]; then
            echo -ne $"Endeca Portal is not running\n"
        else
            echo -ne $"Endeca Portal is running\n"
        fi


}

case "$1" in
start)
        start
        ;;
stop)
        stop
        ;;
reload|restart)
         restart                                                                                                                                                             
        ;;
condrestart)
        if [ -f /var/lock/subsys/endeca ]; then
            restart
        fi
        ;;
status)
        status_at
        ;;
*)
        echo $"Usage: $0 {start|stop|restart|condrestart|status}"
        exit 1
esac

exit $RETVAL

Save the script as /etc/init.d/endeca and make it executable; then there is just one more command to run:

# chkconfig endeca on

Now when you reboot, your Endeca services should restart automatically.
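The script can also be driven by hand through the standard service wrapper, which is a quick way to check everything after a reboot; the output lines below are what the status function in the script above prints.

# service endeca status
Endeca Server is running
Endeca Portal is running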

Location


One last thing to make it easier to find your Endeca environment is to use an Elastic IP address.  You can have one Elastic IP address associated with a running instance at no charge.  Go to the Elastic IPs tab on your AWS console, request a new Elastic IP address and associate it with your Endeca instance.  You will now be able to access your Endeca instance at elastic_ip:8080.  You do have to re-associate the IP address with the server instance each time you start it up, but this can easily be done through the EC2 console.

It just remains to give your server a stop and start to make sure that everything is working as expected.


Sunday, 3 February 2013

Agile BI with Endeca - Part 2

Yesterday I looked at possible use cases for Endeca in enterprise environments; today I'm going to look at possible starting points for taking advantage of the capabilities it offers.

First I'll take a look at your options for licensing if you want to evaluate Endeca, then move on to some options for hosting an evaluation, before finishing with an outline of setting up a system and getting started.

First off, Oracle Endeca Information Discovery is a commercial package; obviously this means you need a license to use it.  There are time-limited Oracle evaluation licenses available, and if you talk to your Oracle account manager nicely I'm sure trial licenses can be arranged, but I'll leave that as an exercise for the reader.  In future posts I'm planning to look at the possible options for recreating some of the Endeca capabilities using community open source packages, but that's for the future.

So what do you need to install Endeca on?  While we'd all love to have an Exalytics box sat under our desks, that's not really very practical for most.  Also, if you go and talk to your infrastructure department and ask for a server, there is a lot of sucking of teeth and a question of "how many months' time do you want it in?"  So it's time to be a little creative.  I'm sure a lot of you are familiar with Amazon's cloud service; if you are not, I'll briefly explain (I'm no AWS expert, I know as much as I need to get by).  Amazon Web Services is an offering from Amazon that enables you to create and use virtual computing services via a number of methods.  EC2 allows you to create virtual servers and access them over the web.  You pay for the server by the hour and shut it down when you do not need it, which makes it perfect for evaluations and demos.  In addition, instances come in a range of sizes and prices, making it possible to start small at low cost and then move up to the more costly options.

One word of warning here: there are differing types of image, those based on instance store volumes and those based on EBS.  Instance store volumes are lost when the instance shuts down; while useful in some cases, they are not really suitable for what we need here.

So for a start you need an Amazon AWS account; I'm not going to go through the details of how to get started because enough has been written on that subject already.

To get started we need an EC2 instance.  There are a number of different types available, but I started with a 64-bit, EBS-backed Red Hat m1.medium image, which gives 1 virtual core and 3.75GB of RAM.  I also added a second 20GB EBS volume to use for application data storage.  I went for Red Hat because it's a supported OS for installing Endeca (so hopefully no compatibility issues), and I'm a Linux person by preference.

Once you have your instance created there are a few tasks to do.  I'm not going to give step-by-step instructions for the Linux admin tasks; there are plenty of references out there already on how to do this.

The first thing to do after booting is to log in using your SSH client; the connection details can be found using Actions->Connect from the EC2 console instances page.

The next task I performed was to create a mount point and edit the /etc/fstab file to mount the extra EBS storage at boot time.  I created the mount point as /oracle, but you can put this where you like; /apps might be a more appropriate name.
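As a rough outline the commands look something like this; the device name is an assumption (check what your extra volume appears as, for example with fdisk -l) and the mount point is whatever you chose above.

mkfs -t ext4 /dev/xvdf                                   # format the extra EBS volume
mkdir /oracle                                            # create the mount point
echo "/dev/xvdf  /oracle  ext4  defaults  0 2" >> /etc/fstab
mount -a                                                 # mount everything listed in fstab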

You are also going to need an X windows system to be able to run the Integrator application locally on the server.  Instructions for getting an X windows system installed can be found here.

At this point I'll also stress that we only have the root user on the server, which is not a good idea; you really need to create a user and group to run Endeca under.
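A minimal sketch of that, with illustrative names and assuming the application volume is mounted at /oracle:

groupadd endeca                       # dedicated group for the Endeca services
useradd -g endeca -m endeca           # dedicated user to run the server and portal
chown -R endeca:endeca /oracle        # hand over the application volume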

Oracle Endeca is available from the Oracle Software Delivery Cloud (you will have to register; remember Endeca is not free, but there is an evaluation trial license).  There is a bewildering array of downloads, but as a minimum you need:

  • Endeca Server - the "database"
  • Information Discovery Studio - the web interface
  • Information Discovery Integrator - the ETL tool


Again, I'm not going to give you step-by-step instructions on installing Endeca because Oracle have already done that; their documentation is available here.

But as an outline, start with installing the server, then Integrator and Studio.

If you want any demo data in your system you will need to follow the quick start guide; this helps you get out of the blocks and see something running.  I used VNC to access Integrator on X Windows to run the quick start data load.  This can be done by tunnelling port 5901 over your ssh session, and information on how to do that is here.
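In short, something like the following forwards the first VNC display (port 5901) to your local machine; the key file and hostname are placeholders.

ssh -i my-key.pem -L 5901:localhost:5901 root@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
# then point your VNC viewer at localhost:1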

There are only two "gotchas" that you might hit.  The first is the Red Hat firewall, which is closed by default, so you need to open up port 8080; this how-to should point you in the right direction.  Additionally, you will need to ensure that your Amazon network security group has port 8080 open.
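On a stock Red Hat install the firewall side can be handled along these lines (a sketch; adjust to your own iptables setup):

iptables -I INPUT -p tcp --dport 8080 -j ACCEPT   # allow inbound connections to the portal
service iptables save                             # persist the rule across reboots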

If you start the server and the portal (see the installation guides) you should be able to log in to your server at its public URL on port 8080.  You should then see the login page; the default login details are in the Studio installation guide.
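With the default port that address will look something like this (the hostname is a placeholder for your instance's public DNS name):

http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:8080/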

From there I'd suggest that you go through the excellent YouTube videos on how to get started.

Next time I'll follow up with a few practical details on how to start up Endeca at boot time, the hosting ports and how to proxy through Apache.  Then I'll go through a few practical steps to build real apps a little more independently than the "just copy the example file" approach used in the getting started videos.


Saturday, 2 February 2013

Agile BI with Endeca

Recently I've been looking at the wider possibilities for using the capabilities of Oracle Endeca in enterprise environments. Much has been written about Endeca since the Oracle purchase last year; there is a good summary of Endeca here.

Over the next few posts I will look at what Endeca could do for Enterprise BI and how to start evaluating what it can do.



So before diving into the detail and looking at what Endeca can do, I'd like to look at what we are trying to achieve with BI in the enterprise. Any enterprise, old or new, is trying to be better than its competitors in one or more areas: being more efficient and competing on price, or perhaps differentiating itself by having better products. As BI professionals we are trying to enable the business to be better at its targeted competitive strategy, so we can focus on improving the performance of sales organisations, better product research, better marketing, operational efficiencies, and so on. So how do we go about this and where do we start? The path from this point is well trodden and quite well understood.

If you're lucky enough to be working on a green-field site you have a wealth of options available; the traditional enterprise data warehouse is a realistic target.  Much has been written about the likes of the transformational BI at Netflix and Capital One, both essentially green-field sites.  If you are using commercial off-the-shelf business packages then off-the-shelf BI packages such as OBIA are a reasonable starting point and offer a low-cost entry point. Beyond this, warehouse models such as the Oracle warehouse reference model are available as tried and tested methods for best-practice custom BI implementations. But the world is changing: where does Big Data fit with the EDW model, and how do you analyse unstructured data? Various schemes have been proposed, but so far I'm not sure any of them really "feel right".

Sadly though, the reality is that the majority of businesses are distinctly brown field.   Hopefully your enterprise architects have read the excellent Enterprise Architecture as Strategy and are banging on the door of the C-level executives with a copy.  But while you might be looking at a brighter future, the reality is we have to help the business as it exists today, with a large estate of legacy systems and planned system retirements over the next 5 years or so.   Even in this harsh environment it is possible to deliver using traditional EDW techniques, but as many have found to their cost, without the highest-level executive support and the associated funding and resources you face an uphill battle.  To get that executive sponsorship you have to have a proven track record of success, and to get that you need to start somewhere, possibly with little or no budget, resource or support. The business will probably also balk at the costs you start to discuss when talking about EDW solutions; they are used to their landscape of spreadmarts, Access databases and Heath Robinson VB solutions.   While cheap, these quickly become hopelessly inflexible, unscalable, inaccurate silos of data.  But when you roll up and say it will cost them $250k or more to replace their spreadsheet, they are, not unreasonably, a little shocked.

So let's look at where the costs go in EDW.  If we take the Kimball lifecycle as a typical model, we start, unsurprisingly, with project governance and requirements gathering.  This needs to be done to a reasonably detailed level before you can even contemplate the next step of data model design.  I've known some companies get this so wrong it was laughable: because they treated all "development" as being the same, a poor developer was not allowed to start any design work or coding before he had a requirements document for each report signed off.  Naturally the business did not really know what they wanted and kept changing their minds, so after 50 days of effort it was still not signed off; even taking a low estimate of $250 a day, that requirements document cost $12,500.

Even if you have a project team who are effective at BI development this is still not a small undertaking: it will be several weeks of information discovery, requirements gathering, design, development, deployment and testing before the end business partner gets to see anything.  That is without even factoring in the problems around resource planning, budget approvals, business case sign-off and so on.  Factor in the constantly shifting sands of the IT estate as well, and this route looks less and less attractive.

So is this an opportunity for Endeca to ride to the rescue?  Well, let's look at what we could do differently to enable us to trim some fat from the development process.   Why do we need project governance?  Put simply, it is required to coordinate multiple work threads and resources so that they collectively achieve their objectives.  As the size and complexity of the planned work increases, along with the size of the delivery team, so must the quantity of coordination work.  It follows that the smaller the delivery team, the less project management we need.  So let's look at the next step, business requirements definition.  Why are we doing this stage?  Stating the obvious, it's so that the design and build team know what the system they have been tasked with delivering has to do.  Based on the premise that change and rework are expensive, you want to build something once and get it right.  You are probably thinking at this point that a few problems are cropping up here for our brown-field site, but let's carry on.   We now design our dimensional models and BI applications, because we have to design them, right?  How else will we know what to build?    But at this point we are "fixing" aspects of what and how we can report on our data.  Then we go away and design our ETL to take our source data, cleanse and load it, and put it into some data structures we can query.   What if we could keep the flexibility of not having to fix our dimensional model so early?  What if we did not even have to do it until we were exploring how we wanted to query the data?

There looks to be something promising here.  If rework and change stopped being costly and became something we just expect and factor in, we could really back off on the depth of requirements gathering.  What if we did not have to define a dimensional model?  What if we had a tool so simple to use that someone with the broad skill set of a business analyst and developer could use it? That tool is Endeca.   We are now looking at genuinely agile BI development: it is possible for a one- or two-person team to deliver the entire project.   Even if you use Endeca for nothing more than a proof of concept that enables the business to crystallise its requirements and work out exactly what it needs to know, it would be beneficial.

So is it practical to embed a single highly skilled individual, who is architect, analyst and developer, in the business units as required to solve business problems and boost business performance?  I believe it is, and over the next few blog posts I'm going to look in some detail at how you can use Endeca in an agile way to deliver useful BI that enables business performance improvement.





Tuesday, 11 December 2012

Social Data

In my last post I recounted a "manual" data intelligence discovery exercise I did while heading home on the train.   In this post I'll explore this topic and look into some of the possibilities.    

This is made easier by finding a Wikipedia entry that had fallen through a hole in the space-time continuum in a beta version of Time Machine on my Mac...

"The Social Metrix corporation (SoMet) traces its roots back to early 2013 when it received a large venture capital investment from a major hardware vendor.   It launched its first app, 'PFYT', 6 months later.  Penny For Your Thoughts paid subscribers by the megabyte for being left switched on in public places, such as public transport.  SoMet was rather secretive about what it did with this information, but in late 2013 started to offer 'information services' to invited subscribers.  PFYT became increasingly popular, increasing its payment rates per MB to the point that it was possible to pay for half your rail fare by leaving the app switched on for the entire trip.

SoMet became highly profitable, having many high-revenue subscribers.  SoMet went on, through acquisition, to become the major media and information corporation on Planet Earth.    It was not until 2018 that the true nature of the early days of SoMet emerged.  PFYT was an application that simply recorded all available sound while the app was running and uploaded it to the SoMet BigData farm.  Here, powerful audio filtering and natural language algorithms were used to digitize conversations.  Utilizing readily available search farms, this data was then given a contextual framework and added to the SoMet intelligence database.  As its popularity grew and went international, word gap analysis was used to join together either end of conversations, increasing the value of the available information.  SoMet analysts using the intelligence engine would then identify valuable information that subscribers would be offered 'exclusive' access to while they remained subscribed to the SoMet services.

SoMet used freely available data from the public domain to blackmail corporations and individuals on a grand scale.  By the time their information source began to dry up in mid 2015 they had made sufficient profit to move into other, more legitimate business areas.  SoMet are credited with the silence and whispered conversation that are now common in all public areas."

Clearly SoMet is a made-up concept and the reality is that the lid would be blown on such an organisation almost immediately, but what of the data concepts in there?  For the sake of convenience I'll skip over the obvious detail of filtering individual conversations out of the background noise, but as the human ear and brain combination can solve this problem it's clearly not insurmountable.  Voice recognition is another area that, while not easy, is being solved.  So this gets us to the point where we have multiple streams of data.  But what can we do with this information to give it context?

Even in its basic form the audio stream can provide useful information.  By analyzing the pattern of word gaps and the lengths of conversation, simple one-to-one conversations could be matched together.  Obviously multiparty conference calls would be a rather more difficult proposition due to the more complex interleaving of speakers.  Linking both parts of a conversation clearly adds value by filling in context and linking more information.

The real value is in the text stream that comes out of the language processing.  This is quite a well-studied field, with many approaches already available, including implementations on Hadoop.  It is akin to the process I did manually while sat on the train using various search engines; a big data work thread could churn through this automatically. By analyzing the language and relationships, the really useful information could be located.  Once candidate conversations are identified, each could be recalled for analysts to listen to and add further information.

So following this simple exercise, there really is little other than the source of the raw material standing in the way of doing this sort of processing today.  Perhaps someone already is?  So just to be safe it's probably best to leave that work conversation for the office.