In my last post I recounted a "manual" data intelligence discovery exercise I did while heading home on the train. In this post I'll explore this topic and look into some of the possibilities.
This was made easier when I found a Wikipedia entry that had fallen through a hole in the space-time continuum in a beta version of Time Machine on my Mac...
"The social metrix corporation (SoMet) traces its roots back to early 2013, when it received a large venture capital investment from a major hardware vendor. It launched its first app, 'PFYT', six months later. Penny For Your Thoughts paid subscribers by the megabyte for leaving it switched on in public places, such as on public transport. SoMet was rather secretive about what it did with this information, but in late 2013 it started to offer 'information services' to invited subscribers. PFYT became increasingly popular, increasing its payment rates per MB to the point that it was possible to pay for half your rail fare by leaving the app switched on for the entire trip.
SoMet became highly profitable, with many high-revenue subscribers, and went on through acquisition to become the major media and information corporation on Planet Earth. It was not until 2018 that the true nature of the early days of SoMet emerged. PFYT was an application that simply recorded all available sound while it was running and uploaded it to the SoMet BigData farm. There, powerful audio filtering and natural language algorithms were used to digitize conversations. Using readily available search farms, this data was then given a contextual framework and added to the SoMet intelligence database. As the app's popularity grew and it went international, word gap analysis was used to join together both ends of a conversation, increasing the value of the available information. SoMet analysts using the intelligence engine would then identify valuable information, to which subscribers were offered 'exclusive' access for as long as they remained subscribed to the SoMet services.
SoMet used freely available data from the public domain to blackmail corporations and individuals on a grand scale. By the time their information source began to dry up in mid 2015, they had made sufficient profit to move into other, more legitimate business areas. SoMet are credited with the silence and whispered conversation that are now common in all public areas."
Clearly SoMet is a made-up concept, and the reality is that the lid would be blown on such an organisation almost immediately, but what of the data concepts in there? For the sake of convenience I'll skip over the obvious detail of filtering individual conversations out of the background noise, but as the human ear and brain combination can solve this problem, it's clearly not insurmountable. Voice recognition is another area that, while not easy, is being solved. So this gets us to the point where we have multiple streams of data. But what can we do with this information to give it context?
Even in its basic form the audio stream can provide useful information. By analyzing the pattern of word gaps and the lengths of conversations, simple one-to-one conversations could be matched together: when one party is talking the other is mostly silent, so the two halves of a call should interlock. Obviously multiparty conference calls would be a rather more difficult proposition due to the more complex interleaving of speakers. Linking both parts of the conversation clearly adds value by filling in context and linking more information.
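To make the word gap idea a little more concrete, here is a minimal sketch of how two recordings might be paired up. It assumes something upstream has already turned each recording into a list of (start, end) times, in seconds, during which the person was speaking, all on a shared clock. The stream names, the half-second slot size and the 0.8 threshold are all made up for illustration, and a real system would need far more than this (clock alignment, noise, people talking over each other).

```python
from itertools import combinations

def to_timeline(intervals, duration, step=0.5):
    """Turn (start, end) speech intervals in seconds into one 0/1 flag per time slot."""
    slots = int(duration / step)
    timeline = [0] * slots
    for start, end in intervals:
        for i in range(int(start / step), min(slots, int(end / step))):
            timeline[i] = 1
    return timeline

def complement_score(a, b):
    """Fraction of slots in which exactly one of the two streams is speaking."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x != y) / n

def best_pairs(streams, duration, threshold=0.8):
    """Return (score, name, name) for every pair of streams that interlock well."""
    timelines = {name: to_timeline(iv, duration) for name, iv in streams.items()}
    scored = [
        (complement_score(timelines[p], timelines[q]), p, q)
        for p, q in combinations(sorted(timelines), 2)
    ]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)

# Hypothetical speech intervals for three recordings made over the same 20 seconds.
streams = {
    "train_a": [(0, 4), (9, 12), (18, 20)],
    "office_b": [(4, 9), (12, 18)],
    "cafe_c": [(0, 3), (5, 6), (10, 20)],
}
print(best_pairs(streams, duration=20))
```

One obvious weakness of this scoring: a recording that is silent throughout would "interlock" perfectly with one that talks non-stop, so a real matcher would also want to check that both sides take a reasonable share of the turns.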
The real value is in the text stream that comes out of the language processing. This is already quite a well-studied field, with many approaches available, including implementations on Hadoop. It is akin to the process I did manually while sat on the train using various search engines, except that a big data work thread could churn through it automatically. By analyzing the language and the relations between the things mentioned, the really useful information could be located. Once candidate conversations are identified, each could be recalled for analysts to listen to and add further information.
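As a rough illustration of that filtering step, the sketch below ranks transcripts by how many "interesting" words they contain and pulls out runs of capitalized words as possible company or personal names. The watch list and the sample transcripts are invented for the example, and the name matching is a crude stand-in for proper named-entity recognition rather than how any real product does it.

```python
import re
from collections import Counter

# Hypothetical watch list; a real system would use something much richer.
WATCH_TERMS = {"acquisition", "merger", "contract", "redundancies", "tender"}

def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def candidate_names(text):
    """Find runs of two or more capitalized words, e.g. a company or full name."""
    return re.findall(r"\b(?:[A-Z][a-z]+\s)+[A-Z][a-z]+\b", text)

def score(text):
    """Count how many times any watch-list term appears in the text."""
    counts = Counter(tokenize(text))
    return sum(counts[term] for term in WATCH_TERMS)

def rank(transcripts):
    """Return (score, names, transcript) for transcripts worth an analyst's time."""
    scored = [(score(t), candidate_names(t), t) for t in transcripts]
    return sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)

transcripts = [
    "We sign the contract with Acme Widgets on Friday, before the merger is announced.",
    "Can you pick up some milk on the way home?",
]
for points, names, text in rank(transcripts):
    print(points, names, text)
```

Only the first transcript survives the filter, which is the point: most of what gets overheard is milk, and the job of the automated pass is to leave analysts with the handful of conversations worth listening to.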
So, following this simple exercise through, there really is little standing in the way of doing this sort of processing today other than the source of the raw material. Perhaps someone already is? So just to be safe, it's probably best to leave that work conversation for the office.