Itinerary
- Part I: Introducing the AI Strategy Framework
- Part II: Crafting a Compelling AI Product Vision and Narrative
- Part III: Data Collection - The Essence of AI
- Part IV: Ensuring Reliable and Accessible Storage
- Part V: Data Exploration and Transformation - Obtaining Clean Data (👈You’re here)
- Part VI: Insights & Analysis - See the Unseen
- Part VII: Machine Learning - The Continuous Improvement Cycle
- Part VIII: AI & Deep Learning - Reaching the Pinnacle
Data runs the show
It’s Friday night. You’ve waited all week for this moment: the chance to run out to the local Blockbuster and walk up and down the aisles, picking up DVD cases and reading the backs to see what catches your eye. On the way in you drop off last week's rental, knowing that it’s a few days late and you’re going to have to pay late fees. You also know you may have to contend with the frustration of the latest and greatest release being fully rented out.
Not long ago, this was reality. Today, things look much different, and that’s largely due to the incredible scale and dedication Netflix has brought to data collection and analysis. Netflix has three key tenets in place:
- Data should be accessible, easy to discover, and easy to process for everyone.
- Whether your dataset is large or small, being able to visualize it makes it easier to explain.
- The longer you take to find the data, the less valuable it becomes.
This is the Netflix way of describing the value of what we’ve been calling the Gold dataset: a set of data that has been transformed, cleaned, and refined so that every data user in the company can easily access it and derive insights from its contents.
One of the ways Netflix has come to dominate the streaming market and master content creation is their use of a centralized, clean dataset. Their first venture into the content creation realm was unique: they purchased two seasons of their first show for $100 million, an outrageous sum to start with, and an especially bold bet for a company brand new to content creation.
Netflix was so bullish because of their data analysis and the insights they were able to glean from their high-quality dataset. First, they looked at which shows were most watched on their platform; second, they examined the particular attributes of the shows that were successful; third, they mapped the common attributes together to come up with a recipe for a successful show.
The show they purchased for $100M was House of Cards - a resounding success critically and commercially. Using the data analysis outlined above, they identified that a huge base of their users liked political dramas, liked films and shows starring Kevin Spacey, and enjoyed the work of director David Fincher. House of Cards combined all of these factors, and Netflix placed a big bet that together they would be a recipe for success.
Netflix's adherence to its data principles demonstrates the profound impact of a focused data strategy—collecting, storing, and transforming data to facilitate easy access and processing. Their commitment to data-driven decisions has not only positioned Netflix as a leading global company but also highlighted the power of data analytics in achieving commercial success even before the deployment of AI models. This story once again demonstrates the transformative potential of our AI Strategy Framework.
SQL - the language of analysis
In past blogs we’ve mentioned the need to think about structured vs. unstructured data. At the Data Analysis layer of the AI Strategy Framework, the data will be standardized, structured, and stored in a SQL database. SQL, or Structured Query Language, is a pivotal tool in the arsenal of any data-driven organization, acting as the common language for database management and data manipulation. Its journey from inception to becoming an indispensable part of modern data analysis is a testament to its flexibility, power, and accessibility.
At its core, SQL's syntax is both elegant and intuitive, designed to articulate complex data queries in a format that is relatively easy to understand. A typical SQL command includes the "SELECT" statement to specify the columns to be returned, "FROM" to indicate the source table, "WHERE" to denote conditions that must be met, "GROUP BY" to aggregate rows that have the same values in specified columns into summary rows, and "ORDER BY" to specify the sort order of the returned data.
Let’s imagine a data analyst at Netflix and how they might uncover useful relationships in their data using what we know about SQL.
-- Most-watched genres in Q1 2023, ranked by number of distinct viewers
SELECT Genre, COUNT(DISTINCT UserID) AS ViewerCount
FROM WatchHistory
WHERE WatchTime BETWEEN '2023-01-01' AND '2023-03-31'
GROUP BY Genre
ORDER BY ViewerCount DESC;
What we can see here is that a data analyst equipped with a structured and trustworthy set of data can derive very useful insights with only a few lines of SQL. Here they are counting how many viewers watched each genre to understand which category was most popular over a given timeframe.
The importance of data visualization
In the realm of analytics and insights, data visualization is like taking individual characters and writing a story. It’s all about transforming complex datasets into comprehensible, actionable information. Dashboards, in particular, serve as a powerful tool, combining critical metrics and trends into a user-friendly interface. They are pivotal for providing data access across teams, enabling data-driven decision-making at all levels.
Returning to our example of a Netflix data analyst, we can imagine setting up a dashboard with a simple pie chart that updates every week, month, or year and shows the popularity of each genre by minutes watched. This simple visual can give decision makers the information they need to make future content creation decisions.
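To make this concrete, here is a minimal sketch of how such a chart could be produced from the Gold dataset. The table and column names (WatchHistory, Genre, MinutesWatched, WatchTime), the local SQLite copy, and the use of matplotlib are assumptions for illustration; your warehouse and BI tool will differ.

# Minimal sketch: total minutes watched per genre, rendered as a pie chart.
# Table/column names and the SQLite file are illustrative assumptions.
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

conn = sqlite3.connect("gold_dataset.db")  # assumed local copy of the Gold dataset
df = pd.read_sql_query(
    """
    SELECT Genre, SUM(MinutesWatched) AS TotalMinutes
    FROM WatchHistory
    WHERE WatchTime BETWEEN '2023-01-01' AND '2023-03-31'
    GROUP BY Genre
    """,
    conn,
)

# One slice per genre, sized by total minutes watched in the period.
df.set_index("Genre")["TotalMinutes"].plot.pie(autopct="%1.0f%%", ylabel="")
plt.title("Minutes watched by genre, Q1 2023")
plt.savefig("genre_popularity.png")  # drop the image into the dashboard of your choice

Scheduling a script like this weekly, monthly, or yearly gives you the refresh cadence described above.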
Not all dashboards are created equal
The effectiveness of a dashboard lies in its design and functionality. A well-designed dashboard is:
- User-Centric: Tailored to the needs and roles of its users, providing relevant data that supports specific decision-making processes.
- Interactive: Offers drill-down capabilities, allowing users to explore data layers beneath the surface metrics.
- Accessible: Available across devices, ensuring stakeholders can access insights regardless of their location.
While dashboards are invaluable, they are not without their challenges, especially when dealing with continuously updating data sources:
- Data Overload: The temptation to include too many metrics can overwhelm users, obscuring key insights.
- Performance Issues: High-frequency updates can strain backend systems, leading to slow response times and outdated information.
- Security and Access Control: Balancing accessibility with data security requires robust governance policies to prevent unauthorized access to sensitive information.
To avoid these common pitfalls and create dashboards that are timely, useful, and trusted, it’s important to take some time to plan and consult users before putting charts and graphs together. We often don’t think of our colleagues as customers, but in the case of critical business data, our internal stakeholders need to be properly consulted. To create data visualizations that actually drive business value, think about following these steps:
- Define Clear Objectives: Start with the end in mind by identifying the decisions the dashboard is designed to support. This ensures relevance and focus.
- Engage Stakeholders Early: Involve potential users from various departments in the design process to gather insights on their data needs and usage patterns.
- Prioritize Key Metrics: Limit the dashboard to essential KPIs that align with business objectives, ensuring clarity and actionable insights.
Dashboards are a cornerstone of a data-driven culture, bridging the gap between raw data and strategic action. However, they are only as useful as their users say they are. Take a human-centric approach to your dashboard development and ensure you have a feedback loop. By adhering to these strategic principles, you can harness the full potential of dashboards, fostering an environment where data empowers decisions at all levels.
Preparing for training
Transitioning from our foundational work on creating continuously updated and trustworthy dashboards, we now delve into the critical phase of preparing for machine learning (ML) model training. This step is where the role of data scientists becomes central to transforming a curated dataset into a powerhouse ready for ML algorithms.
The first task in this journey involves data scientists continuing their exploration of the Gold dataset to uncover underlying patterns and relationships. This exploration is crucial for understanding how different features interact and influence the outcome they're trying to predict. Techniques such as correlation matrices, scatter plots, and advanced statistical methods are employed to identify potential predictors for the ML model.
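As a brief illustration of that exploration step, the sketch below computes a correlation matrix and one scatter plot with pandas. The file name, column names, and the choice of a ‘minutes_watched’ target are assumptions for illustration, not part of the Netflix example.

# Exploratory sketch: rank numeric features by correlation with an assumed target.
import pandas as pd
import matplotlib.pyplot as plt

gold_df = pd.read_parquet("gold_dataset.parquet")  # assumed export of the Gold dataset

# Correlation matrix across numeric columns highlights candidate predictors.
corr = gold_df.select_dtypes("number").corr()
print(corr["minutes_watched"].sort_values(ascending=False))  # 'minutes_watched' is an assumed target

# Quick visual check of one candidate relationship.
gold_df.plot.scatter(x="user_tenure_days", y="minutes_watched", alpha=0.3)
plt.savefig("tenure_vs_minutes.png")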
Once potential predictors are identified, data scientists engage in feature engineering. This process involves creating new features from existing data that can enhance the ML model's predictive power. For example, from a date column, a data scientist might extract day of the week, month, and year as separate features, hypothesizing that these elements might have distinct impacts on the target variable.
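A small sketch of that date decomposition, assuming a pandas DataFrame with a hypothetical ‘watch_date’ column:

import pandas as pd

df = pd.DataFrame({"watch_date": ["2023-01-06", "2023-02-14", "2023-03-31"]})
df["watch_date"] = pd.to_datetime(df["watch_date"])

# Split one date column into three candidate features.
df["day_of_week"] = df["watch_date"].dt.day_name()
df["month"] = df["watch_date"].dt.month
df["year"] = df["watch_date"].dt.year
print(df)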
Let’s return to Netflix and imagine one of their data scientists building a recommendation algorithm. First they dive deep into the dataset, using statistical methods to uncover relationships, and find one between the time of day and the genre of movie watched. With this relationship in mind, they ‘engineer’ features, meaning that they add columns to the dataset such as “Morning”, “Afternoon”, and “Evening”. These columns are created by bucketing the time of day into ranges they consider morning, afternoon, and evening, then placing a True or False in each column depending on whether the viewing happened in that bucket.
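Here is a sketch of that bucketing, assuming an hour-of-day column called ‘watch_hour’; the column names and bucket boundaries are illustrative choices, not Netflix’s.

import pandas as pd

df = pd.DataFrame({"genre": ["Drama", "Comedy", "Drama"], "watch_hour": [8, 14, 21]})

# Bucket the hour of day into three engineered True/False features.
df["Morning"] = df["watch_hour"].between(5, 11)     # 5:00 to 11:59
df["Afternoon"] = df["watch_hour"].between(12, 17)  # 12:00 to 17:59
df["Evening"] = ~(df["Morning"] | df["Afternoon"])  # everything else
print(df)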
Preparing data for ML involves formatting it in a way that algorithms can process effectively. This step includes the following techniques (a short code sketch follows the list):
- One-hot encoding: Transforming categorical variables into a numerical form that ML algorithms can use to make better predictions. For instance, transforming a categorical feature like color (red, blue, green) into separate columns with binary values.
- Scaling: Standardizing the range of continuous initial variables so that each feature contributes approximately proportionately to the final prediction. Techniques like Min-Max scaling or Z-score normalization are common.
- Handling Missing Values: Deciding on strategies for missing data, such as imputation, where missing values are replaced with a statistic like mean, median, or mode.
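The sketch below walks through all three steps on a tiny, made-up DataFrame using pandas; the column names and the choice of median imputation and Min-Max scaling are assumptions for illustration.

import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],
    "minutes_watched": [120.0, 45.0, None, 300.0],
})

# Handling missing values: fill the gap in the continuous column with its median.
df["minutes_watched"] = df["minutes_watched"].fillna(df["minutes_watched"].median())

# Scaling: Min-Max scale the continuous column into the 0-1 range.
mn, mx = df["minutes_watched"].min(), df["minutes_watched"].max()
df["minutes_watched_scaled"] = (df["minutes_watched"] - mn) / (mx - mn)

# One-hot encoding: expand the categorical column into binary columns.
df = pd.get_dummies(df, columns=["color"])
print(df)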
As data scientists navigate through these steps, they transform the cleaned and curated dataset into a format primed for ML model training. This preparation is a blend of art and science, requiring deep knowledge of the data, the problem at hand, and the intricacies of ML algorithms. The culmination of this process is a dataset not just ready for model training but optimized to ensure the best possible outcomes from the ML algorithms applied to it.
Keep it simple
In the midst of refining our approach to data analysis and gearing up for machine learning model training, it's crucial to distill our strategies into a few key principles that ensure efficiency and clarity. Here are three central guidelines to keep in mind:
- Actionable Insights from Dashboards: Ensure your dashboards not only visualize data but also highlight actionable insights. Use clear visual indicators for metrics that need attention and provide straightforward options for users to dive deeper into the data. This focus helps translate insights into actions that drive value.
- Simplify Feature Selection: When moving towards ML model training, concentrate on simplifying feature selection. Utilize correlation analysis and importance ranking methods (see the short sketch after this list) to identify and retain only those features that significantly impact your model's predictive power, making the path to first model deployment as simple as possible.
- Iterative Refinement: Adopt an iterative approach to both dashboard development and model preparation. Start with a basic set of functionalities and improve iteratively based on feedback from stakeholders and ongoing analysis. This method ensures continual enhancement and relevance of your analytics tools and ML models.
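As one illustration of the importance-ranking idea in the second point above, the sketch below ranks features using a random forest's built-in importances. The stand-in data and the model choice are assumptions; correlation analysis or other ranking methods work just as well.

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Stand-in data; in practice X and y come from your prepared Gold dataset.
X_array, y = make_regression(n_samples=200, n_features=5, random_state=0)
X = pd.DataFrame(X_array, columns=[f"feature_{i}" for i in range(5)])

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
ranking = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(ranking)  # keep only the features near the top of this list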
By emphasizing these three key areas, your team can maintain focus. This approach not only streamlines the process of preparing for machine learning but also ensures that your data analytics efforts remain grounded, actionable, and continuously aligned with your organizational objectives.
Wrapping Up
Grab your popcorn, put your feet up, and get ready for the data analysis show. Reaching this point in your AI strategy implementation is a huge accomplishment. It’s a step that many ignore, and they ultimately never get to see its benefits. Without the power of AI, Netflix used heuristics and data analysis to make a $100M bet that changed the entertainment industry forever. By spending time with your data and realizing that humans still have a lot to contribute to your business, you’re another step closer to maximizing the ROI on your AI strategy investments.
Next week we finally get to what you’ve all been waiting for - machine learning. We’ll talk about how to create your first model, in keeping with our theme of simplicity. This will allow you to focus on the big, hard problem of integrating that model with the rest of your technology stack. See you there!
Need Help?
If you're seeking to unlock the full potential of AI within your organization but need help, we’re here for you. Our AI strategies are a no-nonsense way to derive value from AI technology. Reach out. Together we can turn your AI vision into reality.
Mitchell Johnstone
Director of Strategy
Mitch is a Strategic AI leader with 7+ years of transforming businesses through high-impact AI/ML projects. He combines deep technical acumen with business strategy, exemplified in roles spanning AI product management to entrepreneurial ventures. His portfolio includes proven success in driving product development, leading cross-functional teams, and navigating complex enterprise software landscapes.