Mining and Remixing Your Personal Data Silos

One of the big ideas we were exploring at Judy’s Book was the topic of tacit or latent knowledge. Everyone’s head is filled with a richly personal blend of facts and opinions: where to get the best burrito in San Francisco, my current favorite band, the name and contact information of my childhood friend who now runs a hedge fund. The hard problem is finding a way to get this information out of people’s heads that feels effortless, or at least that creates so much “me value” that the effort feels worth it.

At Judy’s Book, our attempts to reduce the friction in this process included: licensing a local listings DB to reduce the data entry burden; offering creative ways for users to ask questions (both of their friends and of the community as a whole) to elicit recommendations; building emotional identification with friends and the larger member community to confer “social status” on active contributors; and adding gaming elements (e.g., scoring, leaderboards, promotions) to channel users’ innate competitiveness toward content creation.

Some of these methods were more effective than others, but our ultimate conclusion was this: building a service exclusively or primarily around tacit knowledge is a bad business proposition. Our internal shorthand for this was the (not very original) “icing and cake” metaphor. Fresh, user-generated content (UGC) is “icing”: it makes a service feel engaging and alive and keeps users coming back. But if you want to create value for a large number of users right away, and across a broad spectrum of informational needs (i.e., the “cake”), you need to find a way to deliver that value with the absolute minimum required contribution from your users.

This pattern was burned into my brain so deeply at Judy’s Book that I now apply it instinctively to every new business I learn about. It doesn’t always fit, but when it does it triggers an immediate emotional reaction – positive or negative – depending on how cleverly the business is going about solving the problem. Here’s the 30,000-foot view of the framework:

  • There are three classes of solution to the tacit knowledge problem:
  1. Declared – e.g., user-initiated reviews, recommendations, blog posts
  2. Elicited – information extracted from users via surveys, Q&A, friend requests, etc.
  3. Inferred – data and patterns extracted from existing private and public data stores
  • The utility of the solution (a.k.a. the “me value delta”) is inversely proportional to the level of effort required of the user.
  • The engineering effort required to implement the solution is roughly proportional to the utility delivered (i.e., correctly inferring patterns from large and ‘dirty’ datasets is orders of magnitude harder than capturing and publishing user-entered text).

When we started Judy’s Book, we felt pretty good about our insight that facilitating Elicited information was more effective than relying solely on Declared, but we weren’t smart or technical enough to make the leap to an Inferred solution (though we fantasized about getting access to users’ credit card transaction histories to auto-populate their lists of favorite restaurants and service providers).

Inferred solutions are typically the most powerful and user-delighting, fulfilling Arthur C. Clarke’s famous maxim that “any sufficiently advanced technology is indistinguishable from magic” (which has to be the intellectual root of the excellent and useful coinage: ‘automagic’). Until recently, the data and engineering overhead required to build an Inferred solution has meant that only large enterprises with a significant vested interest (i.e., credit card fraud departments that need to limit losses, or whose thin margins demand efficient methods of driving incremental purchases) have been able to afford the investment.

However, in the past few years, I’ve watched with interest (and envy) as people much smarter than I am have begun to implement Inferred solutions that tackle specific silos of personal information outside the enterprise. A few of my current favorites include:

  • – A website and desktop widget that extracts my music listening history and habits from iTunes to create a listener profile for me, suggest related music I might like, match me with other listeners with similar profiles, and help me track upcoming shows in my area. All I have to do is create a profile and download the widget; everything else happens automagically.
  • Lijit – A blog widget and profile aggregator that indexes all my past blog posts (and my activity on other social media platforms), maintaining a tagcloud and search box for my entire (public) online media output. Again, I don’t have to do anything once I’ve installed the widget – they do all the heavy lifting in the background.
  • Wesabe – I’ve only dabbled with this one, as the user input overhead is significantly greater than the other two, but their goal is to automate tracking and pattern recognition of personal spending habits, using bank account and credit card statements as the base input, and then (like the music service above) searching for patterns and money-saving recommendations across their entire user base.
  • Xobni – I’ve abandoned Outlook, but those who use it report that Xobni’s plugin transforms the experience, parsing your email history to expose and analyze the real world social network represented by your contacts and your pattern of communication with them.
  • (He still has considerable work cut out for him, but my friend Christopher Parks deserves mention here for his MedBillManager, a project to automate and surface helpful and money-saving patterns in the blizzard of communications patients receive from their medical insurers and providers).

Each of these projects represents an effort to unlock value in an individual silo of personal activity (e.g., music listening, social media creation, personal spending, email communications) by freeing the data from the commercial service(s) or application(s) in which it was created and exposing it as a remixable data asset. This is especially powerful when, as with the music service above or Wesabe, each individual user’s data is blended with thousands of others’ to surface statistical patterns, creating new kinds of value for each individual user.
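To make the "blend my data with thousands of others'" idea concrete, here is a minimal sketch of one way pooled listening histories could surface suggestions. It is not how any of the services above actually work; the user names, artists, `histories` dictionary, and `recommend` function are all hypothetical, and the similarity measure (a raw count of shared artists) is deliberately crude.

```python
from collections import Counter

# Hypothetical pooled listening histories: user -> set of artists played.
histories = {
    "ana": {"Wilco", "Spoon", "Neko Case"},
    "ben": {"Wilco", "Spoon", "The National"},
    "cat": {"Wilco", "The National", "Feist"},
    "dev": {"Spoon", "Neko Case"},
}

def recommend(user, histories, top_n=3):
    """Suggest artists the user has not played, weighted by overlap with other listeners."""
    mine = histories[user]
    scores = Counter()
    for other, theirs in histories.items():
        if other == user:
            continue
        overlap = len(mine & theirs)  # shared artists as a crude similarity score
        if overlap == 0:
            continue  # listeners with nothing in common contribute nothing
        for artist in theirs - mine:
            scores[artist] += overlap
    return [artist for artist, _ in scores.most_common(top_n)]

print(recommend("dev", histories))
```

The user contributes nothing beyond the history their player already recorded, which is the whole point of the Inferred approach: the value comes from the pool, not from extra effort.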

In doing so, they point the way toward a future in which all of our personal data is available to us for remixing, not just in individual silos, but as an integrated (or integratable) data store. This won’t happen quickly, as many institutions (particularly those that view your data as a proprietary corporate asset) will resist the demand to free their customers’ data on competitive grounds. (Amazon is an easy example here: they won’t love the idea of offering each customer an easily downloaded XML file of their lifetime purchase history so it can be shared with competing online retailers.) And even among institutions that concede their customers’ right to own their own data, some – think the highly regulated and security-phobic financial services industry – will struggle to implement a workable technology solution in the face of cultural opposition from their IT and compliance teams.

Despite the inevitable competitive and cultural hurdles, customers will ultimately win the right to their entire universe of personal data. This integrated personal data store will become the foundation of a personalized analytics, recommendation and content creation engine that can only be gestured at using current examples. And I don’t think this is a decades-away idea; some companies (think Google) have the engineering and financial wherewithal to begin moving in this direction today.

If anyone wants to join the fight on this, here’s my personal wish list of the remixable data elements I want to pull into my personal analytics suite:

  • Credit card transaction history, including accurate timestamps on each transaction
  • Completed and planned itineraries for any airline, hotel or rental car company
  • Current and historical lat/long time series (maybe at 5-minute increments to make it more manageable) from my cellphone network pings
  • SKU-level product purchase history from all major retailers (starting with Amazon)
  • Email communications history, including full text and a browsable archive of attached files
  • Email and telephone contact history, including timestamp, duration (for calls) and word count (for emails)
  • Complete personal photo and video archive, with timestamps
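As a rough illustration of what "integrated" could mean for a list like this, the sketch below merges a few silos into a single time-ordered timeline. Everything in it is an assumption made for the example: the `Event` record, the silo names, and the sample entries are invented, and each silo is assumed to arrive already sorted by timestamp.

```python
from dataclasses import dataclass
from datetime import datetime
import heapq

@dataclass(frozen=True)
class Event:
    timestamp: datetime
    silo: str      # hypothetical source label, e.g. "card", "travel", "email"
    summary: str

def merge_timelines(*silos):
    """Merge per-silo event lists (each already sorted by time) into one timeline."""
    return list(heapq.merge(*silos, key=lambda e: e.timestamp))

# Invented sample data, one list per silo.
card = [
    Event(datetime(2008, 3, 1, 12, 5), "card", "Lunch, $11.50"),
    Event(datetime(2008, 3, 1, 19, 40), "card", "Dinner, $62.00"),
]
travel = [Event(datetime(2008, 3, 1, 15, 0), "travel", "Flight SEA to SFO")]
email = [Event(datetime(2008, 3, 1, 9, 30), "email", "Sent 120 words to a colleague")]

timeline = merge_timelines(card, travel, email)
for e in timeline:
    print(e.timestamp.isoformat(), e.silo, e.summary)
```

The shared timestamp is what makes the silos joinable, which is why accurate timestamps show up on nearly every item in the wish list.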

Anything I’ve missed? Anyone out there already working on this?


  1. Bill

    Maybe I am one of those security-phobic people, but if the data you are talking about is collected, I could not imagine sharing it with the world.

    Do you envision any type of access control to the data store?

    Will you want to control how and when YOUR data is used?

  2. Chris DeVore

    I didn’t address this point in the post, but I envision this data store as an asset that you control. You can share elements of it with others anonymously (as in the Wesabe example) to surface interesting patterns and recommendations that benefit you, make it public with your identity attached, or share it in a restricted way with friends and family (like Google Docs or Flickr). To me the data pooling for pattern identification (whether anonymous or public) is the most exciting part – just imagine a super-accurate Amazon Suggests for every aspect of your life.
