BIG DATA, BIG ANALYTICS

Monday, November 19, 2012

Building a Strategic Data Organization

This Big Data thing Is… Well… Big!

Early this year I made a big career move: after almost 7 years working at Yahoo! I joined Greenplum as our VP of Product Management. The excitement of the new job has been exhilarating – new industries to understand, Big Data challenges to solve, and the fast moving pace of a “start-up-like” company. I’ve always enjoyed learning new things – it was what I liked best about working in the central data team at Yahoo!. Greenplum is a place where I have continued to learn.

That said, there are a few patterns that I experienced at Yahoo! that I continue to see as I meet with Greenplum customers and prospects who want to tackle the world of Big Data. The first is an ever-growing need for analytic agility. Organizations are constantly challenged by the time and effort required to extract insights from existing data assets and then convert these into actions (in the form of well-informed business decisions or data-driven applications). A portion of this challenge is solved by having the right platform – one of my favorite parts of being at Greenplum is when I can share with a customer how Greenplum’s Unified Analytics Platform supports an agile analytics environment. But the right platform is only part of the solution. The other consistent query I get from prospects and customers is: “What is the right organizational model for me to be successful with Big Data?” And while it’s fun for me to talk about our platform, I think it’s this second aspect of the Big Data challenge that may be the toughest to solve, and the most critical to success.

Strategic Data Solutions at Yahoo!

In mid-2005 I joined a newly formed group at Yahoo! called Strategic Data Solutions (SDS). I actually sought out a role in the group after reading an article in Information Week about the appointment of Usama Fayyad as Yahoo!’s Chief Data Officer. My persistence paid off and I was lucky enough to get hired, joining a group of other data-loving professionals (including my current Greenplum colleague Annika Jimenez). Yahoo! was among the earlier companies to realize that its data was in fact a strategic asset, and to make a big bet (in the form of what grew to a 500+ person organization) to extract the most value from this data. It turns out that the bet Yahoo! made back in 2005 is similar to what I see many non-Internet companies starting to do today. In industries that people may consider to be old-school when it comes to Big Data – insurance, manufacturing, utilities – we now see executives starting to anoint their own Chief Data Officers and rolling out strategic data initiatives.

So, when these customers ultimately ask, “What is the right organizational model for me to be successful with Big Data?” – I look back to the way that the Strategic Data Solutions organization was set up, and I see a lot of things that we did right. Of course there is no single cookie-cutter approach to organizational design that works in all situations, but I believe that the core philosophy that drove how SDS was set up can help give other companies a strong framework for how to think about their strategic data initiatives. I’ve listed below the core components of the SDS organization – you can view these both as “product lines” as well as organizations within the group. My assertion: when building out your strategic data organization and its capabilities, think in terms of the big functional areas below:

Data Platform: at the core of any strategic data initiative is establishing a strong data platform that meets the core data provisioning needs of the organization’s data consumers. Be careful not to confuse a data platform initiative with a more traditional “data warehouse” initiative. While one of the functions of the data platform may be to host or integrate with a data warehouse, the data platform also needs to support data sets that may not typically be in a data warehouse (documents, machine-generated logs, etc.) and also needs to support workloads that aren’t well suited to a traditional data warehouse (sandbox-based analytics, feed provisioning to production systems, non-SQL data analysis, etc.). At Yahoo! we made a big investment in building out a core data platform (originally a home-grown file-based system, and ultimately a combination of Hadoop and relational databases) to support the broad range of data consumers we needed to support.

Business Intelligence: one of the mistakes we made early on in SDS was to abdicate responsibility for the delivery of the core business intelligence needs of the various Yahoo! business units. It was a convenient decision to make initially: it was an area with demanding consumers, difficult-to-prove ROI, and was frankly not as “sexy” as the other more advanced work that we wanted to do. Over time, however, we realized that supporting the business intelligence needs of our business stakeholders needed to be one of our core offerings. There were benefits in terms of data re-use, stakeholder relationships, and other economies of scale that made this the right thing for SDS to do. By successfully supporting the BI needs of our business partners we were able to “earn the right” to engage with them on the more advanced analytics and data services we had to offer. The key to success here was to (appropriately) view our BI investment as a cost center. We avoided getting caught up in the losing battle of trying to show the ROI of our BI efforts by instead focusing our ROI-based initiatives in areas where we could, in fact, show true returns (see below).

Data Science Services: within SDS we worked hard to enable customers (the various Yahoo! product lines & business units) to derive “actionable insights” from the data asset we created with our Data Platform. Often the data science skills required for anything other than traditional reporting and BI weren’t resident in the various lines of business. (In fact, we continue to see this challenge today, and are working to help solve the Data Scientist skills shortage through things like our innovative partnership with Kaggle.) So SDS built out a consultancy-oriented group to help our customers move to the next level of analysis. The ultimate goal of our engagements with the business was twofold: first, the Data Science team was devoted to solving data-driven problems that resulted in a measurable ROI (increasing ad clickthrough rates, reducing churn, improving customer acquisition); second, we wanted to train our internal business customers on how to use the Data Platform and associated tools to do subsequent Data Science projects on their own.

Data Driven Applications: the ultimate goal of a lot of our Data Science initiatives at Yahoo! was to spur the creation of data-driven applications that could measurably impact the bottom or top line. As the name implies, these applications leveraged the results of some underlying data science efforts (scoring algorithms, recommendation models, pricing optimizations) to drive actions taken in Yahoo!’s customer and internal-facing applications. The team was structured to work on a commissioned project basis: business units would request support to build specific applications and back up their requests with detailed business cases. The Data Driven Applications team would then prioritize the long list of incoming requests and methodically tackle the highest-value projects. This model turned out to be a win-win for both SDS and our internal customers – the business units received value-enhancing data-driven applications; and SDS was able to effectively show how the investment in data as a strategic asset was driving true ROI for Yahoo!.

Data Distribution: a final and important aspect of the strategic data organization is an understanding that in addition to supporting the analytical needs (either via BI support or data science projects) there is also the need to support data distribution. For example, at Yahoo! the core data platform was used to generate segment membership information for billions of users (browser cookies) each day. These profiles needed to be distributed out to the operational systems that consumed them – the ad targeting platforms – so it was important to have the appropriate infrastructure and APIs to allow the consumers of these large data sets to access and move them. Additionally, there were consistent demands to provision subsets of the data in the core data platform to other consumers both inside and outside of Yahoo!. The Data Distribution challenge is one that many of our Greenplum customers today are starting to struggle with as well, and it’s important to think about it when scoping out a big data strategy.

Dive Right In. The Water’s Warm!

Now I can’t guarantee that the structure described above is perfect for every organization – there are likely variations of this perspective that have worked for other successful data groups. However, I do think the emerging themes are consistent, and that if you consider the above elements while diving into the Strategic Data Organization waters, you’ll be more likely to achieve success.

At the end of the day, there is a bit of a leap of faith required to make a strategic bet on big data. But the data shows that it’s worth it. A recent article in the Harvard Business Review revealed: “In particular, companies in the top third of their industry in the use of data-driven decision making were, on average, 5% more productive and 6% more profitable than their competitors.”

Good luck!

Wednesday, May 2, 2012

PHARTS - A Proposed Metric for Measuring Basketball Player Value

As a data guy, I have always enjoyed learning about the power of statistics combined with the right metric(s) – together these two components can be used to drive effective decisions that lead to a desired outcome. A great example of this – as I am sure many folks are already aware – is documented in Michael Lewis’s “Moneyball”. In a nutshell, Moneyball describes how Oakland A’s General Manager Billy Beane was able to use “The Right Metric” – in this case On-Base Percentage – combined with statistical analysis to effectively acquire players that led to overall team success. This approach is nothing new in the world of sports or business, but it’s good to have this constant reminder of how creative thinking (the pursuit of the right metrics, the right attributes) and modeling (using the tools of Data Science) can lead to remarkable results.

In a recent conversation with a few guys that I play basketball with, we were discussing Kevin Love. His statistics this year have been simply amazing – he regularly puts up over 30 points a game and grabs 20 rebounds. We were trying to figure out how to rank him against other top players in the league, guys like Kobe Bryant, Kevin Durant, LeBron James. We didn’t really reach a conclusion (other than the fact that my scruffy facial hair, spot-on three-point shooting, and dominant rebounding closely resemble those of Mr. Love). So I decided that it was time for a “Moneyball-Style” investigation of basketball statistics. (AUTHOR’S NOTE: Why, you may ask, did I decide this? In reality I just like playing with data. I also like creating clever acronyms.) While this analysis is currently in its early stages, and will likely stay that way, I thought it would be fun to share it with the world. I welcome any creative ideas on how to improve on the metric, test its applicability, or out-do my acronym.

Let’s start with the metric itself, which is PHARTS. A player’s PHARTS rating is calculated as follows: ([Points Halved] + [Assists] + [Rebounds] – [Turnovers] + [Steals])/(Minutes)
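For anyone who would rather see the formula as code than as words, here is a minimal sketch in Python. The function name and the sample numbers are made up for illustration; they are not taken from the data set.

```python
def pharts(points, assists, rebounds, turnovers, steals, minutes):
    """PHARTS = (points/2 + assists + rebounds - turnovers + steals) / minutes."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return (points / 2 + assists + rebounds - turnovers + steals) / minutes


# Made-up season totals for a hypothetical player, purely for illustration.
rating = pharts(points=2000, assists=400, rebounds=800,
                turnovers=200, steals=100, minutes=3000)
print(round(rating, 3))  # 0.7
```

Dividing by minutes is what makes the rating a per-minute rate, so a starter and a bench player can be compared on the same scale.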

Once I had the metric, I needed to get some data. Since I am still in the evaluation stage I tried to find a free database of historical basketball statistics – luckily I was able to find this at http://www.databasebasketball.com/, at least through 2009. I downloaded the data into my sandbox (in this case Microsoft Excel) and proceeded to do a little discovery of the shape of the data – for example, in the early years of the data set a player’s minutes for the season weren't tracked. Similarly, statistics on steals and turnovers don’t start showing up until 1975. So as you will see in the analyses below I am only showing PHARTS ratings for players in the seasons between 1975 and 2009. I also needed to filter out players based on the number of minutes they played in the season – otherwise a player who had an amazing streak of 3 games and then sat on the bench the rest of the year (can anyone say Jeremy Lin?) might show up as a top-PHARTS prospect.
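The two filters described above (tracked seasons only, plus a minimum number of minutes) might look something like the sketch below. I did this in Excel, so the Python is only an illustration: the row layout is hypothetical, and the 1,000-minute cutoff is a placeholder, not necessarily the value I used.

```python
FIRST_SEASON = 1975          # steals and turnovers are tracked from here on
LAST_SEASON = 2009           # last season in the downloaded data
MIN_SEASON_MINUTES = 1000    # placeholder cutoff, chosen only for this sketch


def usable_rows(rows, min_minutes=MIN_SEASON_MINUTES):
    """Keep season rows inside the tracked range that clear the minutes cutoff."""
    return [
        row for row in rows
        if FIRST_SEASON <= row["season"] <= LAST_SEASON
        and row["minutes"] is not None
        and row["minutes"] >= min_minutes
    ]


# Hypothetical rows, not real players.
rows = [
    {"player": "A", "season": 1970, "minutes": 2500},  # before steals/turnovers were tracked
    {"player": "B", "season": 1990, "minutes": 120},   # a short hot streak
    {"player": "C", "season": 1990, "minutes": 2800},
]
print([row["player"] for row in usable_rows(rows)])  # ['C']
```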

OK, so enough talk now – let’s get to the results of my analysis. To be clear, there is still work to do (most importantly my analysis remains descriptive – I have yet to correlate PHARTS scores with some objective metric like team winning percentage). But the results are still interesting, and at least merit some discussion.

Let’s start with the simple question: “Based on Career PHARTS, who are the top 25 players of all time?”. The result of this is shown below with a lot of names you would expect, but also with a few surprises. (NOTE: This is filtered to show only players with more than 10,000 career playing minutes in the data set).
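For the curious, a career ranking with the 10,000-minute filter could be computed along these lines. This sketch assumes that Career PHARTS is calculated from career totals (rather than by averaging season ratings), which is one reasonable reading; the row layout and the numbers are hypothetical.

```python
from collections import defaultdict

STAT_KEYS = ("points", "assists", "rebounds", "turnovers", "steals", "minutes")


def career_pharts(rows, min_career_minutes=10_000, top=25):
    """Rank players by PHARTS computed on career totals (an assumption of this sketch)."""
    totals = defaultdict(lambda: dict.fromkeys(STAT_KEYS, 0))
    for row in rows:
        for key in STAT_KEYS:
            totals[row["player"]][key] += row[key]

    ratings = {}
    for player, t in totals.items():
        if t["minutes"] > min_career_minutes:  # "more than 10,000" career minutes
            numerator = (t["points"] / 2 + t["assists"] + t["rebounds"]
                         - t["turnovers"] + t["steals"])
            ratings[player] = numerator / t["minutes"]

    ranked = sorted(ratings.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]


# Hypothetical season rows: player X clears the career cutoff, player Y does not.
season = {"points": 1000, "assists": 200, "rebounds": 300,
          "turnovers": 100, "steals": 50}
rows = [
    {"player": "X", "minutes": 6000, **season},
    {"player": "X", "minutes": 6000, **season},
    {"player": "Y", "minutes": 5000, **season},
]
for player, rating in career_pharts(rows):
    print(player, round(rating, 3))
```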

A lot of the names you see on this list are the expected ones – Magic Johnson, Larry Bird, Michael Jordan. But (at least for me) there are a few surprises – Chris Paul is up there in some rarefied air; and who are Mel Daniels, Dan Issel, and Lafayette Lever? Also, where is Kobe Bryant? Another note – since I needed to filter out the players and years where turnovers and steals weren’t tracked there are some key names that would be added to the top 10 above, including: Wilt Chamberlain, Bob Pettit, Bill Russell, Elgin Baylor, and Oscar Robertson.

Now, you might be looking at this list and saying – well, this is the same answer I would get if I just looked at Points/Minute, isn’t it? (Hint: remember that Kobe isn’t in that top 25 list) The chart below shows the Top 25 players from this same data set – but this time ranked by points. I’ve color coded the bar chart so that anyone in the Top 25 Career PHARTS list is in green (dark green is the best) and anyone in red is NOT in the top 25 list.

Michael Jordan is at the top of this list – he was a great scorer with a great PHARTS rating. But many of these top scorers are not in the original top 25 list – guys like Dominique Wilkins, Carmelo Anthony, and Kobe Bryant. These guys are great scorers, but aren’t as well rounded as the Top 25 list in terms of rebounds or assists. Now, it is fair to say that the PHARTS metric may be unduly influenced by rebounds, so a future version of this metric might look only at offensive rebounds (or at least weight them differently). But that’s outside of the scope of my analysis so far.

There is actually a wide range of factors that contribute to a player’s PHARTS rating, and it isn’t just scoring and rebounding. For example, John Stockton and Chris Paul make the Top 25 list by merit of their ridiculously high number of assists per minute. Swen Nater (who!?) is number 26 by averaging 0.63 rebounds a minute. Lafayette Lever and John Stockton are also helped into the top 25 by having the highest number of steals per minute among their PHARTS-leading counterparts.

Another interesting thing I noticed (I actually started thinking about this when reading up on the history of folks like Fats Lever and Michael Adams – high PHARTS guys I had never heard of) is that there was often an arc to their careers. They started with one team, had several seasons of greatness, and then were traded and never experienced their original success. So I looked at how PHARTS varies based on each player’s number of seasons in the NBA. I also looked at the variation (I used the standard deviation of PHARTS) of their performance by season.
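A rough sketch of that grouping is below. It assumes a player’s season number is counted from his first season in the data set, and it uses the sample standard deviation; both are choices made for this sketch, and the ratings are made up.

```python
from collections import defaultdict
from statistics import mean, stdev


def pharts_by_career_season(ratings):
    """ratings: iterable of (player, season_year, pharts_rating) tuples.

    Returns {career_season_number: (mean_rating, std_dev, n_players)}.
    """
    by_player = defaultdict(list)
    for player, year, rating in ratings:
        by_player[player].append((year, rating))

    by_number = defaultdict(list)
    for seasons in by_player.values():
        # Season 1 is the player's earliest season in the data.
        for number, (_, rating) in enumerate(sorted(seasons), start=1):
            by_number[number].append(rating)

    return {
        number: (mean(values),
                 stdev(values) if len(values) > 1 else 0.0,
                 len(values))
        for number, values in sorted(by_number.items())
    }


# Made-up ratings for two hypothetical players.
sample = [
    ("A", 1990, 0.50), ("A", 1991, 0.60),
    ("B", 2000, 0.70), ("B", 2001, 0.60), ("B", 2002, 0.40),
]
for number, (avg, sd, n) in pharts_by_career_season(sample).items():
    print(number, round(avg, 3), round(sd, 3), n)
```

One caveat with this approach: a player whose career began before 1975 would have his season numbers undercounted, since his earlier seasons are not in the filtered data.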

So what can aspiring data-driven NBA GMs learn from this? Don’t acquire a player with a great PHARTS rating after his 5th season and expect him to continue to perform at that same rate – on average PHARTS scores peak in a player’s 5th or 6th season. That said, the variation among PHARTS scores starts to decrease after a player’s 9th season (if he lasts that long) – so if you are able to pick up a seasoned veteran you should be able to predict how he will perform in subsequent years. It’s a little bit more unpredictable for players in seasons 6 through 9.

Ok, this is all well and good, but I suppose the real question is whether a team of strong PHART-ers (as it were) is actually a strong team. Based on my analysis, here are the Top 20 teams (from 1975 through 2009) based on single season PHARTS average for the entire team.

Now, based on my quick analysis of the data, high TEAM PHARTS does not lead to championships – of this list the 1984 Lakers were in the Finals (they lost) and the 1977 Sixers were in the Finals (they lost too). That’s a 10% success rate – not so good. So, I tried a different approach – does having a SINGLE top ranked PHARTS player on your team indicate success? Here are the top 20 single-season PHARTS ratings by player (with their team listed too).

These results are a little more promising – from this list 6 of the 20 were on Championship Teams. Now – to be true to this analysis I really should be looking at the correlation of high individual PHARTS with team winning percentage. For those of you who remember Moneyball (or who are A’s fans) you know that the Billy Beane approach led to teams with consistently high winning percentages, but no World Series (although they amazingly do win the World Series in the movie of the same name).

So, where does this leave us? Nowhere, really – except that I can now comfortably say that PHARTS seems to be an OK measure of a player’s value. Maybe not the best; maybe I will tweak it some more in the future, but still pretty good. So the one final thing to do is to apply it to the 2011-2012 NBA player list, and see if my boy, K-Love, shows up among the top players. Here’s what the PHARTS ratings look like for an arbitrary selection of top NBA players:

Whoa, wait a minute! Is it possible that Rajon Rondo is the best player in the league? Maybe he is. Both he and Kevin Love are up there above the prolific scorers on the list. And why is that? Well, because Rondo delivers more assists than anyone in the league, and because Kevin Love isn’t afraid of going inside, banging some bodies, and grabbing some boards. Just like me.

A NOTE FOR THOSE OF YOU WHO WERE PAYING ATTENTION:

The point of this entire exercise was to illustrate the process of doing analysis. I spent time finding data. I loaded the data into a tool I was comfortable with. I visualized the data with my favorite visualization tool (Tableau). I iterated on my hypotheses. And I published my insights. Now, my data set in Excel was only 20,936 rows. But imagine if you needed to do this on a billion records. Or on a trillion. Could you? If you want to learn more about how to do this at scale, come visit us at http://www.greenplum.com

Monday, April 30, 2012

THE PROVING GROUND: AGILE ANALYTICS

Agile development has been all the rage for a while now - extreme programming, scrum, user stories, epics, backlogs, etc. have become the lingua franca of any software development organization worth its salt. And although the notion of agile development hasn't yet completely penetrated other parts of the enterprise, there is an increasing awareness of the benefits of agile development. One of the areas where the concept of agile development is starting to gain traction is around analytics. As Jim Kobielus noted in a recent post, organizations that have the ability to quickly learn and iterate using experimentation are able to gain a competitive advantage; analytics vendors like SAS have also been promoting the concept of agile applied to big data analytics.

This development is not surprising, as the values of agile development should resonate with anyone who's involved in delivering data and insights. At its core, agile values:

Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan

In the realm of analytics, what do these values mean? And more importantly, how can they be put into practice to realize the vision of "Agile Analytics"?

Let's start with the implications of "agile" in the world of Big Data Analytics (and also, Big Data Applications). Based on my experience at Yahoo! there is kind of an evolution of needs that takes place during the lifecycle of Big Data Analytics application development - let's call this the Analytics Application Development Lifecycle. Before diving into what this lifecycle looks like, let's first talk about the environment that enterprises with Big Data are dealing with today. In general, I've seen the following characteristics, which end up informing what an Agile Big Data environment needs to support:

Underlying Data Sets are Fast Changing: In this environment, timely analysis of new products and concepts is a competitive advantage. As a result, data processing and analysis systems need to be flexible enough to support underlying changes without requiring a rewrite or a new data model.
Demand for Analytics is Time Sensitive: In the big data world, the ability to analyze new features that are in production and impact revenue/monetization is critical. Delays in turning around new requests can result in serious financial impact or customer risk.
Business Questions and Data Needs are Unpredictable: Anyone who is supporting the Business Intelligence (BI) needs of a "Big-Data-Driven" organization will tell you that reporting and analysis needs for new features can’t be anticipated - additional data needs often arise as the result of first-pass analyses. This means that data query and analysis systems must be built for unpredictable demands.
Volumes of Data and Data Consumers are Extremely Large: Analytics systems need to support deep analysis by data scientists, dashboards and reporting for larger internal user bases, and consumption by operational systems. To complicate things, all of these capabilities need to scale to support massive & growing data sets.

Given the above, what does a Big Data & Analytics platform need to do? It needs to support the analytics lifecycle as shown below.

A system that can easily support the above flow - with a focus on iterative, collaborative development within the "Ad Hoc" and "Proving Ground" quadrants - is well positioned to drive success for Big Data, Big Analytics initiatives. When evaluating your own platform to assess whether it's ready to support this lifecycle my advice is to focus on the capabilities described in the Top Ten list below. Now - this is not a comprehensive list, but it captures the core elements that one should be looking for as part of a data platform rollout.

Ad-hoc access to “raw” event/user level data
Data source agnosticism – structured or semi-structured data (e.g. key value pairs)
Data search and discovery via a semantic layer
Analysis- and Developer-friendly environment – SQL, Any BI Tool
Lower-than-average cost of change for new data, metrics
Schedule and publish capabilities for views, tables, insights
Unified catalog/metadata service
3rd Party Tool “friendliness”
Data processing management for ad-hoc & production workloads
Enterprise features for the entire data system

There are plenty of other things to think about as well: do you have the right "Data Scientists" within your organization to leverage this platform? Are you properly instrumenting your products and processes to drive data into your data platform? Are you thinking about closing the loop by building applications and systems that can leverage the insights delivered by your data science team (operationalization, as it were)? All things to keep in mind as you venture into the exciting new world of Big Data, and Big Analytics.

Sunday, February 12, 2012

DATA, DATA EVERYWHERE

There's an interesting article in the Sunday New York Times today, purportedly on Big Data. The piece doesn't particularly provide a definition of Big Data - there's no mention of the Volume, Velocity, or Variety attributes that have become the "standard" measures of Big Data today. Rather, the article is more focused on discussing the practice of data science, and how analytics are an ever-increasing requirement for innovation and competition in today's market. What the article highlights is the increasing need for the competent practice of data science - running "Big Analytics" on the ever-growing volume of Big Data sets available. The referenced McKinsey study in the piece provides a great perspective on what they refer to as Deep Analytical Talent.

The implications of this shift (let's assume it is a shift) are numerous - but one is that the emerging practice and community of data science will require a new set of tools and capabilities. I read a great article recently (if someone can point me to the link, I can reference it appropriately) that compared the emerging data science practice to the emergence of computer scientists several decades ago. Before there were classes, tools, etc. for computer scientists we had electrical engineers, applied mathematicians, and people from other quantitative fields contributing to the emerging field of computer science. Today you can major in computer science, there are software packages built exclusively for computer scientists (IDEs, for example) - it's simply become part of our standard nomenclature. The same thing is happening with data science, and what we refer to at Greenplum as Big Analytics - a set of tools and capabilities is emerging (and still needs to be developed) that enables the world's data scientists to do their jobs better, faster, and with bigger and bigger data.

Thursday, February 9, 2012

THE BEGINNING OF SOMETHING BIG

Today was my last day at Yahoo!, where I have worked for the past 6.5 years. It's been a wild ride, and (so far) the best career choice I've ever made. When I started working at Yahoo! I had already been working in the area of analytics, first at Broadbase (now KANA) and then at Enkata. At both companies we worked with ever increasing database sizes - first Gigabytes at Broadbase, and then Terabytes at Enkata. I thought I knew what it meant to work with big data sets.

But at Yahoo! I learned that what I'd been doing so far was child's play. The first application I worked on, an internal tool for measuring the reach and engagement of Yahoo! properties, processed over 4 billion events a day and stored that data in a database that was 5 times larger than anything I had worked with to date. During my time at Yahoo! I worked on data pipelines and analytical applications that routinely process tens of Terabytes a day of raw data, and data marts that easily break the 100 Terabyte mark. And for a while, I think that Yahoo! was one of the few companies in the world that needed to and had the capacity to work with such data sizes.

What I see today is that times have changed: what was once the rarefied air of the few is becoming the daily Big Data challenge of the many. It's no longer sufficient to expect that Big Data systems need to be run by a select few who know how to effectively deploy and manage Hadoop clusters or manually tune and maintain an MPP database. In the emerging world of Big Data the companies that survive and win will be those that can get past the hurdles of just "managing" Big Data and rapidly iterate and learn using Big Analytics.
So, next week I will embark on a new journey, taking what I've learned over the past 12 years and applying it to the industry's first real Big Data and Big Analytics platform. I am sure it will be a fun ride, and a chance to do something, well... BIG.