Tuesday, 11 December 2012

Social Data

In my last post I recounted a "manual" data intelligence discovery exercise I did while heading home on the train.   In this post I'll explore this topic and look into some of the possibilities.    

This is made easier by finding a Wikipedia entry that had fallen through a hole in the space-time continuum in a beta version of Time Machine on my Mac...

"The Social Metrix Corporation (SoMet) traces its roots back to early 2013, when it received a large venture capital investment from a major hardware vendor. It launched its first app, 'PFYT', six months later. Penny For Your Thoughts paid subscribers by the megabyte for leaving the app switched on in public places, such as on public transport. SoMet was rather secretive about what it did with this information, but in late 2013 it started to offer 'information services' to invited subscribers. PFYT became increasingly popular, raising its payment rates per MB to the point that it was possible to pay for half your rail fare by leaving the app switched on for the entire trip.

SoMet became highly profitable, with many high-revenue subscribers. It went on, through acquisition, to become the major media and information corporation on Planet Earth. It was not until 2018 that the true nature of the early days of SoMet emerged. PFYT was an application that simply recorded all available sound while it was running and uploaded this to the SoMet BigData farm. Here powerful audio filtering and natural language algorithms were used to digitize conversations. Using readily available search farms, this data was then given a contextual framework and added to the SoMet intelligence database. As the app's popularity grew and went international, word gap analysis was used to join together either end of conversations, increasing the value of the available information. SoMet analysts using the intelligence engine would then identify valuable information, to which subscribers would be offered 'exclusive' access for as long as they remained subscribed to the SoMet services.

SoMet used freely available data from the public domain to blackmail corporations and individuals on a grand scale. By the time their information source began to dry up in mid-2015, they had made sufficient profit to move into other, more legitimate business areas. SoMet are credited with the silence and whispered conversation that is now common in all public areas."

Clearly SoMet is a made-up concept, and the reality is that the lid would be blown on such an organisation almost immediately, but what of the data concepts in there? For the sake of convenience I'll skip over the obvious detail of filtering out individual conversations from the background noise, but as the human ear and brain combination can resolve this problem, it's clearly not insurmountable. Voice recognition is another area that, while not easy, is being resolved. So this gets us to the point where we have multiple streams of data. But what can we do with this information to give it context?

Even in its basic form the audio stream can provide useful information. By analyzing the pattern of word gaps and lengths of conversation, simple one-to-one conversations could be matched together. Obviously multiparty conference calls would be a rather more difficult proposition, due to the more complex interleaving of speakers. Linking both parts of the conversation clearly adds value by filling in context and linking more information.
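As a rough illustration of the word gap idea, here is a minimal Python sketch. It assumes each recording has already been reduced to a list of (start, end) speech intervals in seconds on a shared clock, which in practice would need voice activity detection and time alignment. The function names and the scoring rule are my own illustrative choices, not an established method: two halves of the same call should mostly take turns, so the score is the fraction of time in which exactly one side is talking.

```python
def speech_mask(intervals, duration, step=0.1):
    """Turn (start, end) speech intervals into a list of booleans, one per time step."""
    n = int(round(duration / step))
    mask = [False] * n
    for start, end in intervals:
        first = int(round(start / step))
        last = min(n, int(round(end / step)))
        for i in range(first, last):
            mask[i] = True
    return mask


def complementarity(a, b, duration, step=0.1):
    """Fraction of time steps in which exactly one of the two parties is speaking."""
    mask_a = speech_mask(a, duration, step)
    mask_b = speech_mask(b, duration, step)
    exclusive = sum(1 for x, y in zip(mask_a, mask_b) if x != y)
    return exclusive / len(mask_a)


def best_match(target, candidates, duration):
    """Return the name of the candidate whose speech pattern best interleaves with target."""
    return max(
        candidates,
        key=lambda name: complementarity(target, candidates[name], duration),
    )
```

A recording whose talking fills the silences of another scores close to 1, while an unrelated conversation scores lower because its speech overlaps or leaves shared gaps.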

The real value is in the text stream that comes out of the language processing. This is already a well-studied field, with many approaches available, including implementations on Hadoop. It is akin to the process I did manually on the train using various search engines, except that a big data work thread could churn through it automatically. By analyzing the language and relations, the really useful information could be located. Once candidate conversations are identified, each could be recalled for analysts to listen to and add further information.
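To make the candidate identification step a little more concrete, here is a deliberately naive Python sketch. It assumes the transcripts already exist as plain text keyed by a conversation id, and it ranks them by how often they mention a hand-picked list of commercially interesting words. The term list and function names are hypothetical, and a real system would use proper language analysis rather than keyword counts.

```python
import re
from collections import Counter

# Hypothetical list of words that hint at commercially sensitive talk.
SENSITIVE_TERMS = {"margin", "markup", "opportunity", "customer", "distributor"}


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score_transcript(text, terms=SENSITIVE_TERMS):
    """Count how many times any of the sensitive terms appears in the text."""
    counts = Counter(tokenize(text))
    return sum(counts[term] for term in terms)


def rank_candidates(transcripts, top_n=3):
    """Return ids of the highest-scoring transcripts, dropping those that score zero."""
    scored = [(score_transcript(text), cid) for cid, text in transcripts.items()]
    scored = [(score, cid) for score, cid in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [cid for _, cid in scored[:top_n]]
```

The scoring function is independent per transcript, so it is the sort of step that could be spread across many workers, with only the final ranking done centrally.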

So, following this simple exercise, there really is little standing in the way of doing this sort of processing today, other than the source of the raw material. Perhaps someone already is? So just to be safe, it's probably best to leave that work conversation for the office.




Wednesday, 14 November 2012

17:48 from Paddington

I was planning on starting this blog by talking about my journey through BI, but instead I'm going to talk about a different journey that highlights some relevant points.

This story is based on events on the 17:48 train from Paddington last Wednesday. It isn't a verbatim transcript, but it gives you the idea.

In the UK people are rightly very sensitive about the use of their personal data, particularly in the area of health, yet are perfectly prepared to broadcast the details to all in earshot. A pair of people were heading home after one of them had clearly been for a consultation about quite a serious matter. Their companion then proceeded to ring the rest of the family to let them know the news. In doing this, of course, they were relating this individual's medical history to the 20 or so people in earshot.

This got me thinking about how many people have double standards on the value and security of their data.  While I was thinking about this topic I was then gifted a piece of solid gold.  


A particularly loud individual two rows in front decided to continue his working day on the way home and make the phone calls he hadn't had time for earlier in the day. His first call was to a colleague, discussing a new opportunity, so he's probably connected with sales. Following this he then named the customer who the opportunity was with and wanted to make sure that his colleague had recorded the opportunity so they secure their 5%. So now I know the name of a potential customer and the sales markup. They then discussed the performance of one of their team members; Dave, it would appear your star is no longer rising. So our friend on the train is clearly a senior player in the sales organisation.

The next call is to a colleague on business in Brussels who he named, so hello Brian, you are now slotted into this jigsaw as well. At this point they mentioned a product name, so a brief Google search on my iPhone later and I've identified the organisation you are selling for.

There then followed a more personal conversation with 'darling': I'll see you next Monday at a named London hotel.

So a brief recap: at this point we know a product name, a client name, a markup, a subordinate's first name, and the existence of darling. The next call revealed another client name, another product range, another team member and the total margin on a deal.

The conversation then turned to rearranging the distribution model, so this was clearly someone in the organisation, not just a reseller. So LinkedIn, can you help? A few blind alleys and I get a set of hits that line up the names, so Mr X, I'm 90% certain I know who you are. Then the interesting call: "yes I'm sorry darling, I was tied up with work so I haven't been able to call, yes I'll be home later". Rather revealing, weren't you talking to darling earlier?

So after finding all this out I then noticed the stop you got off at. Is 192.com worth a shot? That's the address sorted.

So, I know who you are, I know you are having an affair, I know who you work for, I know two of your customers' names and the markups, I know your team members who are on their way out, I know where you are meeting next Monday, and I know your current distributors are in the firing line.

Next time you need to make that call in public, you might want to think about who the 20 or so people in earshot are. By joining the dots and using external data it's possible to deduce far more than the raw data suggests; in essence that is what we data professionals aim to do every day. In my next post I'll explore where this particular data feed might take us.