Sunday, 3 February 2013

Agile BI with Endeca - Part 2

Yesterday I looked at possible use cases for Endeca in enterprise environments. Today I'm going to look at possible starting points for taking advantage of the capabilities it offers.

First I'll look at your options for licensing if you want to evaluate Endeca, then at some options for hosting an evaluation, and finally at an outline of setting up a system and getting started.

First off, Oracle Endeca Information Discovery is a commercial package, so you need a license to use it. There are time-limited Oracle evaluation licenses available, and if you talk to your Oracle account manager nicely I'm sure trial licenses can be arranged. But I'll leave that as an exercise for the reader. In future posts I'm planning to look at the options for recreating some of the Endeca capabilities using community open source packages, but that's for the future.

So what do you need to install Endeca on? While we'd all love to have an Exalytics box sat under our desks, that's not really practical for most of us. Also, if you go and ask your infrastructure department for a server, there is a lot of sucking of teeth and a question of how many months' time you want it in. So it's time to be a little creative. I'm sure that a lot of you are familiar with Amazon's cloud service; if you are not then I'll briefly explain (I'm no AWS expert, I know as much as I need to to get by). Amazon Web Services is an offering from Amazon that enables you to create and use virtual computing services via a number of methods. EC2 allows you to create virtual servers and access them over the web. You pay for the server by the hour and shut it down when you do not need it. This makes it perfect for evaluations and demos. In addition, instances come in a range of sizes and prices, making it possible to start small at low cost and then move up to the more costly options.
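To put the pay-by-the-hour point into numbers, here is a minimal Python sketch of how the bill for an evaluation scales with the hours an instance is left running. The hourly rate is a made-up placeholder, not Amazon's actual price, so check the current EC2 pricing before relying on any figure.

```python
def evaluation_cost(hourly_rate_usd, hours_per_day, days):
    """Estimated cost of an instance that is billed only for the hours it runs."""
    return round(hourly_rate_usd * hours_per_day * days, 2)


# Hypothetical rate, purely for illustration; not a real AWS price.
HOURLY_RATE_USD = 0.15

always_on = evaluation_cost(HOURLY_RATE_USD, 24, 30)
office_hours = evaluation_cost(HOURLY_RATE_USD, 8, 20)
print(f"Left running all month: ${always_on:.2f}")
print(f"8 hours a day for 20 working days: ${office_hours:.2f}")
```

Bear in mind that this only covers instance hours; EBS storage is charged separately for as long as the volumes exist.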

One word of warning here: there are differing types of images, those backed by instance store volumes and those backed by EBS. Data on instance store volumes is lost when the instance shuts down, so while they are useful in some cases they are not really suitable for what we need here.

So for a start you need an Amazon AWS account. I'm not going to go through the details of how to get started because enough has been written on that subject already.

To get started we need an EC2 instance. There are a number of different types available, but I started with a 64-bit, Red Hat, EBS-backed m1.medium instance, which gives 1 virtual core and 3.75 GB of RAM. I also added a second 20 GB EBS volume to use for application data storage. I went for Red Hat because it's a supported OS for installing Endeca (so hopefully no compatibility issues), and I'm a Linux person by preference.

Once you have your instance created there are a few tasks to do. I'm not going to give step-by-step instructions for the Linux admin tasks, as there are plenty of references out there already on how to do this.

The first thing to do after booting is to log in using your SSH client. This can be done using Actions->Connect from the EC2 console instances page.

The next task I performed was to create a mount point and edit the /etc/fstab file to mount the extra EBS storage at boot time. I created this as /oracle but you can put it where you like; /apps might be a more appropriate name.
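As a rough illustration of what that /etc/fstab line looks like, the Python sketch below just assembles the six whitespace-separated fields of an entry. The device name /dev/xvdf, the ext4 filesystem and the mount options are my assumptions for the example; check what your instance actually reports (for instance with lsblk) before editing the real file.

```python
def fstab_entry(device, mount_point, fs_type="ext4",
                options="defaults,nofail", dump=0, fsck_pass=2):
    """Format one /etc/fstab line.

    Field order: device, mount point, filesystem type, mount options,
    dump flag, fsck pass number.
    """
    fields = [device, mount_point, fs_type, options, str(dump), str(fsck_pass)]
    return "\t".join(fields)


# Assumed device name and filesystem; /oracle is the mount point used in this post.
print(fstab_entry("/dev/xvdf", "/oracle"))
```

The nofail option is there so that the server still boots if the extra volume is ever detached.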

You are also going to need the X Window System to be able to run the Integrator application locally on the server. Instructions for getting it installed can be found here.

At this point I'll also stress that we only have the root user on the server, which is not a good idea. You really need to create a user and group to run Endeca.

Oracle Endeca is available from the Oracle Software Delivery Cloud (you will have to register; remember Endeca is not free, but there is an evaluation trial license). There is a bewildering array of downloads, but as a minimum you need:

  • Endeca Server - the "database"
  • Information Discovery Studio - the web interface
  • Information Discovery Integrator - the ETL tool


Again I'm not going to give you step-by-step instructions on installing Endeca because Oracle have already done that; their documentation is available here.

But as an outline, start with installing the server, then Integrator and Studio.

If you want any demo data in your system you will need to follow the quick start guide. This helps you get out of the blocks and see something running. I used VNC to access Integrator on X Windows to run the quick start data load. This can be done by tunnelling port 5901 over your SSH session; information on how to do that is here.
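For reference, the tunnel boils down to a single ssh command with a local port forward. The Python sketch below only builds and prints that command so you can see its shape; the host name, user name and key file are hypothetical placeholders, and it assumes the VNC server is listening on port 5901 on the instance.

```python
import shlex


def vnc_tunnel_command(host, user, key_file, vnc_port=5901):
    """Build an ssh command that forwards a local port to the VNC port on the server."""
    return [
        "ssh",
        "-i", key_file,                             # private key for the instance
        "-N",                                       # no remote command, tunnel only
        "-L", f"{vnc_port}:localhost:{vnc_port}",   # local port -> VNC port on the server
        f"{user}@{host}",
    ]


# Hypothetical values: substitute your own host name, user and key file.
command = vnc_tunnel_command("ec2-host.example.com", "endeca", "my-key.pem")
print(shlex.join(command))
```

With the tunnel up, the VNC client on your own machine connects to localhost:5901 rather than to the server's public address, so the VNC port never needs opening in the firewall.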

There are only two "gotchas" that you might hit. The first is that the Red Hat firewall is closed by default, so you need to open up port 8080. This how-to should point you in the right direction. The second is that you will need to ensure that your Amazon security group has port 8080 open.
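A quick way to see whether both of those are sorted is to try a plain TCP connection to port 8080 from your own machine. This is a small Python sketch using only the standard library; the function name is my own, and it checks nothing more than whether the port accepts a connection.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Calling port_is_open with your instance's public DNS name and 8080 should return True once the portal is running and both the firewall and the security group allow the traffic. A False result does not tell you which of the two is at fault, so you still need to check each in turn.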

If you start the server and the portal (see the installation guides) you should be able to reach your server at its public URL on port 8080. You should then see the login page; the default login details are in the Studio installation guide.

From there I'd suggest that you go through the excellent YouTube videos on how to get started.

Next time I'll follow up with a few practical details on how to start up Endeca at boot time, the hosting ports and how to proxy through Apache. Then I'll go through a few practical steps to build real apps a little more independently than the "just copy the example file" approach used in the getting started videos.


Saturday, 2 February 2013

Agile BI with Endeca

Recently I've been looking at the wider possibilities for using the capabilities of Oracle Endeca in enterprise environments. Much has been written about Endeca since the Oracle purchase last year; there is a good summary of Endeca here.

Over the next few posts I will look at what Endeca could do for enterprise BI and how to start evaluating what it can do.



So before diving into the detail and looking at what Endeca can do, I'd like to look at what we are trying to achieve with BI in the enterprise. Any enterprise, old or new, is trying to be better than its competitors in one or more areas: being more efficient and competing on price, or perhaps differentiating itself by having better products. As BI professionals we are trying to enable the business to be better at its targeted competitive strategy. So we can focus on improving the performance of sales organisations, better product research, better marketing, operational efficiencies, etc. So how do we go about this and where do we start? The path from this point is well trodden, and quite well understood.

If you're lucky enough to be working in a green field site you have a wealth of options available, and the traditional enterprise data warehouse (EDW) is a realistic target. Much has been written about the likes of the transformational BI at Netflix and Capital One, both essentially green field sites. If you are using commercial off-the-shelf business packages then off-the-shelf BI packages such as OBIA are a reasonable starting point and offer a low-cost entry point. Beyond this, warehouse models such as the Oracle warehouse reference model are available, and are tried and tested methods for best practice custom BI implementations. But the world is changing: where does Big Data fit with the EDW model, and how do you analyse unstructured data? Various schemes have been proposed but so far I'm not sure any of them really "feel right".

Sadly though the reality is that the majority of businesses are distinctly brown field. Hopefully your enterprise architects have read the excellent Enterprise Architecture as Strategy, and are banging on the door of the C-level executives with a copy. So while you might be looking at a brighter future, the reality is we have to help the business as it exists today, with a large estate of legacy systems and planned system retirements over the next 5 years or so. Even in this harsh environment it is possible to deliver using traditional EDW techniques, but as many have found to their cost, without the highest level executive support and the associated funding and resources you face an uphill battle. To get that executive sponsorship you have to have a proven track record of success, and to do that you need to start somewhere, possibly with little or no budget, resource or support. The business will probably also balk at the costs that you start to discuss when talking about EDW solutions; they are used to their landscape of spreadmarts, Access databases and VB Heath Robinson solutions. While cheap, these quickly become hopelessly inflexible, unscalable and inaccurate silos of data. But when you roll up and say it will cost them $250k or more to replace their spreadsheet they are, not unreasonably, a little shocked.

So let's look at where the costs go in EDW. If we take the Kimball lifecycle as being a typical model we, not surprisingly, start with the project governance and requirements gathering. This needs to be done to a reasonably detailed level before you can even contemplate the next level of data model design. I've also known some companies get this so wrong it was laughable. Because they treated all "development" as being the same, a poor developer was not allowed to start any design work or coding before he had got a requirements document for "each" report signed off. Naturally the business did not really know what they wanted and kept changing their minds, so after 50 days of effort it was still not signed off. Even taking a low estimate of $250 a day, this requirements document cost $12,500.
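The sum behind that figure is trivial, but writing it down makes it easy to plug in your own numbers. A minimal Python sketch, using the 50 days and $250 a day from the story above (the function name is just for illustration):

```python
def requirements_cost(days_of_effort, day_rate_usd):
    """Cost of the effort spent on a requirements document that is still unsigned."""
    return days_of_effort * day_rate_usd


cost = requirements_cost(50, 250)
print(f"${cost:,}")  # prints $12,500
```

And that is for a single report, before any design or build work has started.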

Even if you have a project team who are effective at BI development this is still not a small undertaking; it will be several weeks of information discovery, requirements gathering, design, development, deployment and testing before the end point business partner gets to see anything. This is not even factoring in the problems around resource planning, budget approvals, business case sign-off etc. Factor in the constantly shifting sands of the IT estate as well, and this route looks less and less attractive.

So is this an opportunity for Endeca to ride to the rescue? Well, let's look at what we could do differently to enable us to trim some fat from the development process. Why do we need project governance? Put simply, it is required to coordinate multiple work threads and resources to collectively achieve their objectives. As the size and complexity of the planned work increases, along with the size of the delivery team, so must the quantity of coordination work. So it follows that if we can reduce the size of the delivery team, we need less project management. So let's look at the next step, business requirements definition. Why are we doing this stage? Stating the obvious, it's so that the design and build team know what the system they have been tasked with delivering has to do. Based on the premise that change and rework are expensive, you want to build something once and get it right. You are probably at this point thinking that a few problems are cropping up here for our brown field site. But let's carry on. We are now designing our dimensional models and BI applications, because we have to design them, right? How else will we know what to build? But we are at this point "fixing" aspects of what and how we can report on our data. Now we go away and design our ETL to take our source data, cleanse it, and load it into some data structures we can query. What if we could keep our flexibility by not having to fix our dimensional model so early? What if we did not even have to do it until we were exploring how we wanted to query the data?

There looks to be something promising here. If rework and change stop being costly and become something we just expect and factor in, we can really back off on the depth of requirements gathering. What if we did not have to define a dimensional model? What if we had a tool so simple to use that someone with a broad skill set of business analyst and developer could use it? That tool is Endeca. We are now looking at genuinely agile BI development. It is possible to use a one or two person team to deliver the entire project. Even if you use Endeca for nothing more than a proof of concept to enable the business to crystallise the requirements and work out exactly what they need to know, it would be beneficial.

So is it practical to embed a single highly skilled individual who is architect, analyst and developer in the business units as required to solve business problems and boost business performance? I believe it is, and over the next few blog posts I'm going to look in some detail at how you can use Endeca in an agile way to deliver useful BI that enables business performance improvement.





Tuesday, 11 December 2012

Social Data

In my last post I recounted a "manual" data intelligence discovery exercise I did while heading home on the train.   In this post I'll explore this topic and look into some of the possibilities.    

This is made easier by finding a Wikipedia entry that had fallen through a hole in the space-time continuum in a beta version of Time Machine on my Mac...

"The Social Metrix Corporation (SoMet) traces its roots back to early 2013 when it received a large venture capital investment from a major hardware vendor. It launched its first app, 'PFYT', 6 months later. Penny For Your Thoughts paid subscribers by the megabyte for being left switched on in public places, such as public transport. SoMet was rather secretive about what it did with this information, but in late 2013 started to offer 'information services' to invited subscribers. PFYT became increasingly popular, increasing its payment rates per MB to the point that it was possible to pay for half your rail fare by leaving the app switched on for the entire trip.

SoMet became highly profitable, having many high revenue subscribers. SoMet went on through acquisition to become the major media and information corporation on Planet Earth. It was not until 2018 that the true nature of the early days of SoMet emerged. PFYT was an application that just recorded all available sound while the app was running and uploaded this to the SoMet BigData farm. Here powerful audio filtering and natural language algorithms were used to digitize conversations. Utilizing readily available search farms, this data was then given a contextual framework and added to the SoMet intelligence database. As the popularity grew and went international, word gap analysis was used to join together either end of conversations, increasing the value of available information. SoMet analysts using the intelligence engine would then identify valuable information that subscribers would then be offered 'exclusive' access to while they remained subscribed to the SoMet services.

SoMet used freely available data from the public domain to blackmail corporations and individuals on a grand scale. By the time that their information source began to dry up in mid 2015 they had made sufficient profit to move into other more legitimate business areas. SoMet are credited with the silence and whispered conversation that is now common in all public areas."

Clearly SoMet is a made-up concept and the reality is that the lid would be blown on such an organisation almost immediately, but what of the data concepts in there? For the sake of convenience I'll skip over the obvious detail of filtering out individual conversations from the background noise, but as the human ear and brain combination can resolve this problem it's clearly not insurmountable. Voice recognition is another area that, while not easy, is being resolved. So this gets us to the point where we have multiple streams of data. But what can we do with this information to give it context?

Even in its basic form the audio stream can provide useful information. By analyzing the pattern of word gaps and lengths of conversation, simple one-to-one conversations could be matched together. Obviously multiparty conference calls would be a rather more difficult proposition due to the more complex interleaving of speakers. Linking both parts of the conversation clearly adds value by filling in context and linking more information.
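To make the word gap idea a little more concrete, here is a toy Python sketch. It is my own simplification, not an established algorithm: each recording is reduced to a list of (start, end) speech intervals in seconds, and two recordings are scored by how much of the time exactly one of them is speaking, on the basis that the two halves of a phone call take turns. It assumes the recordings share a synchronised clock and that the intervals within one recording do not overlap; all of the data is made up.

```python
def speaking_seconds(intervals):
    """Total time covered by a list of (start, end) speech intervals."""
    return sum(end - start for start, end in intervals)


def overlap_seconds(a, b):
    """Total time during which both streams contain speech."""
    return sum(
        max(0, min(a_end, b_end) - max(a_start, b_start))
        for a_start, a_end in a
        for b_start, b_end in b
    )


def turn_taking_score(a, b):
    """Fraction of the combined time span in which exactly one side is speaking.

    Two halves of the same call should take turns, so they score close to 1.
    """
    span = max(end for _, end in a + b) - min(start for start, _ in a + b)
    both = overlap_seconds(a, b)
    only_one = speaking_seconds(a) + speaking_seconds(b) - 2 * both
    return only_one / span


def best_match(target, candidates):
    """Name of the candidate stream that best interleaves with the target."""
    return max(candidates, key=lambda name: turn_taking_score(target, candidates[name]))


# Made-up speech intervals in seconds, for illustration only.
train_passenger = [(0, 4), (9, 12), (20, 25)]
candidates = {
    "stream_b": [(4.5, 8.5), (12.5, 19.5)],  # talks in the passenger's gaps
    "stream_c": [(1, 6), (10, 15)],          # talks over the passenger
}
print(best_match(train_passenger, candidates))
```

Here stream_b fills the gaps in the passenger's speech and wins comfortably. A real system would also have to cope with clock drift, background noise and people talking over each other, which is exactly why conference calls are the harder case.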

The real value is in the text stream that comes out of the language processing. This is quite a well studied field, with many approaches available already, including implementations on Hadoop. This is akin to the process I did manually while sat on the train using various search engines, except that a big data work thread could churn through it automatically. By analyzing the language and relations, the really useful information could be located. Once candidate conversations are identified, each could be recalled for analysts to listen to and add further information.

So following this simple exercise, there really is little in the way of doing this sort of processing today other than the source of the raw material. Perhaps someone already is? So just to be safe it's probably best to leave that work conversation for the office.




Wednesday, 14 November 2012

17:48 from Paddington

I was planning on starting this blog by talking about my journey through BI, but instead I'm going to talk about a different journey that highlights some relevant points.

This story is based on events on the 17:48 train from Paddington last Wednesday. This isn't a verbatim transcript, but it gives you the idea.

In the UK people are rightly very sensitive about the use of their personal data, particularly in the area of health, yet are perfectly prepared to broadcast the details to all in earshot. A pair of people were heading home after one of them had clearly been for a consultation about quite a serious matter. Their companion then proceeded to ring the rest of the family to let them know the news. In doing this of course they were relating this individual's medical history to the 20 or so people in earshot.

This got me thinking about how many people have double standards on the value and security of their data.  While I was thinking about this topic I was then gifted a piece of solid gold.  


A particularly loud individual two rows in front decided to continue his working day on the way home and make the phone calls he hadn't had time for earlier in the day. His first call was to a colleague, discussing a new opportunity, so he's probably connected with sales. Following this he then named the customer who the opportunity was with and wanted to make sure that his colleague had recorded the opportunity so they secure their 5%. So now I know the name of a potential customer and the sales markup. They then discussed the performance of one of their team members; Dave, it would appear your star is no longer rising. So our friend on the train is clearly a senior player in the sales organisation.

The next call was to a colleague on business in Brussels, whom he named, so hello Brian, you are now slotted into this jigsaw as well. At this point they mentioned a product name, so one brief Google search on my iPhone later and I've identified the organisation you are selling for.

There then followed a more personal conversation with 'darling': I'll see you next Monday at a named London hotel.

So a brief recap: at this point we know a product name, a client name, a markup, a subordinate's first name, and the existence of darling. The next call revealed another client name, another product range, another team member and the total margin on a deal.

The conversation then turned to rearranging the distribution model, so this was clearly someone in the organisation, not just a reseller. So LinkedIn, can you help? A few blind alleys and I get a set of hits that line up the names, so Mr X, I'm 90% certain I know who you are. Then the interesting call: "yes I'm sorry darling, I was tied up with work so I haven't been able to call, yes I'll be home later". Rather revealing, weren't you talking to darling earlier?

So after finding all this out I then noticed the stop you got off at. Is 192.com worth a shot? That's the address sorted.

So, I know who you are, I know you are having an affair, I know who you work for, I know two of your customers' names and the markups, I know which of your team members are on their way out, I know where you are meeting next Monday, and I know your current distributors are in the firing line.

Next time you need to make that call in public, you might want to think about who the 20 or so people in earshot are. By joining the dots and using external data it's possible to deduce far more than the raw data suggests; in essence that is what we data professionals aim to do every day. In my next post I'll explore where this particular data feed might take us.