BIG DATA, BIG ANALYTICS: February 2012

Sunday, February 12, 2012

DATA, DATA EVERYWHERE

There's an interesting article in the Sunday New York Times today, purportedly on Big Data. The piece doesn't particularly provide a definition of Big Data - there's no mention of the Volume, Velocity, or Variety attributes that have become the "standard" measures of Big Data today. Rather, the article is more focused on discussing the practice of data science, and how analytics are an ever-increasing requirement for innovation and competition in today's market. What the article highlights is the increasing need for the competent practice of data science - running "Big Analytics" on the ever-growing volume of Big Data sets available. The referenced McKinsey study in the piece provides a great perspective on what they refer to as Deep Analytical Talent.

The implications of this shift (let's assume it is a shift) are numerous - but one is that the emerging practice and community of data science will require a new set of tools and capabilities. I read a great article recently (if someone can point me to the link, I can reference it appropriately) that compared the emerging data science practice to the emergence of computer science several decades ago. Before there were classes, tools, etc. for computer scientists, we had electrical engineers, applied mathematicians, and people from other quantitative fields contributing to the emerging field of computer science. Today you can major in computer science, and there are software packages built exclusively for computer scientists (IDEs, for example) - it's simply become part of our standard nomenclature. The same thing is happening with data science, and with what we refer to at Greenplum as Big Analytics - a set of tools and capabilities is emerging (and still needs to be developed) that enables the world's data scientists to do their jobs better, faster, and with bigger and bigger data.

Thursday, February 9, 2012

THE BEGINNING OF SOMETHING BIG

Today was my last day at Yahoo!, where I have worked for the past 6.5 years. It's been a wild ride, and (so far) the best career choice I've ever made. When I started working at Yahoo! I had already been working in the area of analytics, first at Broadbase (now KANA) and then at Enkata. At both companies we worked with ever-increasing database sizes - first Gigabytes at Broadbase, and then Terabytes at Enkata. I thought I knew what it meant to work with big data sets.

But at Yahoo! I learned that what I'd been doing so far was child's play. The first application I worked on, an internal tool for measuring the reach and engagement of Yahoo! properties, processed over 4 billion events a day and stored that data in a database that was 5 times larger than anything I had worked with to date. During my time at Yahoo! I worked on data pipelines and analytical applications that routinely process tens of Terabytes a day of raw data, and data marts that easily break the 100 Terabyte mark. And for a while, I think that Yahoo! was one of the few companies in the world that needed to and had the capacity to work with such data sizes.

What I see today is that times have changed: what was once the rarefied air of the few is becoming the daily Big Data challenge of the many. It's no longer sufficient to expect that Big Data systems will be run by a select few who know how to effectively deploy and manage Hadoop clusters or manually tune and maintain an MPP database. In the emerging world of Big Data, the companies that survive and win will be those that can get past the hurdles of just "managing" Big Data and can rapidly iterate and learn using Big Analytics.

So, next week I will embark on a new journey, taking what I've learned over the past 12 years and applying it to the industry's first real Big Data and Big Analytics platform. I am sure it will be a fun ride, and a chance to do something, well... BIG.