Itinerary
- Part I: Introducing the AI Strategy Framework
- Part II: Crafting a Compelling AI Product Vision and Narrative
- Part III: Data Collection - The Essence of AI (👈You’re here)
- Part IV: Ensuring Reliable and Accessible Storage
- Part V: Data Exploration and Transformation
- Part VI: Insights & Analysis - See the Unseen
- Part VII: Machine Learning - The Continuous Improvement Cycle
- Part VIII: AI & Deep Learning - Reaching the Pinnacle
Mapping the entire world
In the mid-2000s, Google’s founders, Sergey Brin and Larry Page, were experimenting with researchers at Stanford on a vision of “searching the whole world.” They had heard about camera tech that could take continuous photos which could then be stitched together. To test this theory, they messily strapped some cameras to an old hippie van and began driving it around the Stanford campus and Palo Alto to see if they could validate their image-capturing hypothesis.
The tech proved successful, and with it they discovered they could unlock their vision of “searching the world” by strapping cameras to cars and driving them around the globe. This new data collection strategy has given Google more than a decade of innovation and generated significant customer demand. First came Street View, which let users explore an uncanny view of city streets and sights. Even before applying any data analytics or machine learning, the team at Google found that meticulously collecting a unique dataset could drive user engagement on its own.
As Google's capabilities and the volume of Street View imagery grew, the company saw an opportunity to derive greater value from this extensive data collection. By applying deep learning techniques to their massive image data, Google began to identify and catalogue various elements within the photos – from street signs to business names. This progression from simply capturing images to extracting actionable insights significantly enhanced the overall Maps experience, making it richer and more useful for end-users.
Collecting unique data points and refining that raw data into well-structured, high-quality datasets is essential for unlocking the full potential of AI and machine learning. This story sets the stage for today’s discussion: by focusing on data collection first, you unlock opportunities to drive value for your users well before you deploy your first algorithm.
The Data-Centric Approach
Over the past decade, as machine learning and AI have re-emerged from the “AI Winter,” the industry has assumed that more data is always better. Organizations have been paralyzed by the perceived need to collect massive datasets on par with the petabytes available to Google, Facebook, and Amazon. While it’s true that you can’t build ML models without a large dataset, we’ve learned from Andrew Ng that ‘large’ might be a bit smaller than we thought.
Ng, a pioneer of deep learning methods and co-founder of Coursera, has begun to teach the “Data-Centric AI Approach.” In this approach, you start at the base of the AI Strategy Pyramid with the goal not just of collecting substantial amounts of data, but of ensuring that data is of the highest possible quality. Over the course of today’s blog I hope to help you understand what data collection looks like and what it really means to have ‘high-quality’ data. With this, you’ll be able to de-risk the investment in your AI strategy by ensuring a high-quality data flow, giving your models a much better chance of producing output that closely represents reality and ultimately achieves your business goals.
What is data?
To understand what the data-centric approach to AI looks like, we must first understand what data actually is. Data is information. In our everyday lives, data surrounds us. We collect data visually (reading road signs), through sound (music), scent (freshly baked bread), and touch (the texture of sand). But how do computers collect data? Around the middle of the 20th century, humans began to experiment with encoding our senses into machines using electrical signals. This experimentation resulted in the concept of the bit. A bit, in the simplest sense, is just a measure of whether electricity is flowing (represented as a 1) or not flowing (represented as a 0).
Data is just electrical signals trying to represent the reality that our human senses perceive. We capture images using cameras, sound using recording devices, and pressure and temperature using dedicated sensors. When we talk about data quality, we are talking about how accurately we can go from the richness of the world to this reduced reality of 0’s and 1’s.
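To make that reduction concrete, here is a minimal Python sketch (purely illustrative, not any particular product’s pipeline) showing how a single temperature reading becomes a string of 0’s and 1’s, and how a little fidelity is lost along the way:

```python
import struct

# A "real world" measurement: room temperature in degrees Celsius.
reading = 21.7

# Encode the value into 4 bytes using the IEEE 754 single-precision format.
encoded = struct.pack(">f", reading)

# The underlying 0's and 1's a computer actually stores and transmits.
bits = "".join(f"{byte:08b}" for byte in encoded)
print(bits)  # 01000001101011011001100110011010

# Decoding it back shows the small loss of precision that comes from
# squeezing a real-world value into a finite number of bits.
decoded = struct.unpack(">f", encoded)[0]
print(decoded)  # 21.700000762939453
```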
How do we collect data?
Let’s expand a little bit on how we actually capture data and encode it into computers. We primarily do this in four ways:
- Sensors: Devices like cameras, microphones, thermometers, and pressure sensors capture specific aspects of the environment, generating data streams in real-time.
- The Internet of Things (IoT): A subset of sensors, the Internet of Things refers to everyday appliances and devices that are connected to the internet and send data across it (e.g., smart thermostats). A minimal sketch of this flow follows the list below.
- Input Devices: Keyboards, mice, and touchscreens allow humans to directly input data into the system.
- Cookies: Cookies help companies track user behaviour across the internet; this data can be used to understand users and personalise their experiences.
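To make the IoT item above concrete, here is a minimal, hypothetical sketch of a smart thermostat reading being packaged as JSON and sent to a collection endpoint. The `INGEST_URL`, the device ID, and the simulated reading are all assumptions for illustration, not a real service:

```python
import json
import random
import time
import urllib.request

# Hypothetical ingestion endpoint -- replace with your own collection service.
INGEST_URL = "https://example.com/ingest"

def read_thermostat() -> dict:
    """Simulate a smart thermostat reading (a stand-in for real hardware)."""
    return {
        "device_id": "thermostat-01",
        "timestamp": time.time(),
        "temperature_c": round(random.uniform(18.0, 24.0), 2),
    }

def send_reading(reading: dict) -> None:
    """Encode the reading as JSON bytes and POST it over the network."""
    payload = json.dumps(reading).encode("utf-8")
    request = urllib.request.Request(
        INGEST_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        print("server responded with status", response.status)

if __name__ == "__main__":
    send_reading(read_thermostat())
```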
Focus on qualitative vs. quantitative data quality
Now let’s return to Andrew Ng and his data-centric approach to AI. With our understanding of how data is captured and encoded into machines, we can better grasp what it means to have a high-quality dataset. For quantitative data like temperature or pressure, quality means making sure that each measurement captured is as accurate as possible and that there aren’t any misreadings. Qualitative data capture is much harder, not in the sense of technical implementation, but because consistency across different samples is extremely difficult to achieve.
Let’s look at an example to illustrate why this is difficult. Imagine a manufacturing facility that wants to use computer vision to determine if a part on the line is faulty. To do this, they need to capture images of parts and ‘label’ each one with a faulty/not-faulty tag. This labelled data can then be used to train an algorithm to decide whether a part is faulty. How would we actually get these labels? It would probably have to be a human with lots of experience spotting defects in these specific pieces of equipment.
So we can imagine a manufacturing worker sitting in front of hundreds or thousands of images, inspecting each one to decide whether they think the part is faulty. We can also imagine that as the coffee runs out and the day drags on, the consistency of the labelling degrades and a few defects are missed. If the dataset isn’t consistent, the algorithm will learn those inconsistencies and produce inconsistent results. The company will have invested its money in an algorithm that works, but works poorly, making it harder to see an adequate return on the investment or a real improvement in operations.
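One lightweight way to catch this kind of drift is to have two inspectors label the same sample of images and measure how often they agree. The sketch below uses made-up labels and computes raw agreement plus Cohen's kappa, which corrects for agreement expected by chance; it illustrates the idea rather than prescribing a process:

```python
from collections import Counter

# Hypothetical faulty/ok labels from two inspectors on the same 10 images.
inspector_a = ["faulty", "ok", "ok", "faulty", "ok", "ok", "faulty", "ok", "ok", "ok"]
inspector_b = ["faulty", "ok", "faulty", "faulty", "ok", "ok", "ok", "ok", "ok", "ok"]

# Raw agreement: the fraction of images both inspectors labelled the same way.
n = len(inspector_a)
agreement = sum(a == b for a, b in zip(inspector_a, inspector_b)) / n

# Cohen's kappa corrects for the agreement you would expect by chance alone.
counts_a = Counter(inspector_a)
counts_b = Counter(inspector_b)
expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
kappa = (agreement - expected) / (1 - expected)

print(f"raw agreement: {agreement:.2f}, Cohen's kappa: {kappa:.2f}")  # 0.80, 0.52
```

A low kappa is an early warning that your labelling guidelines, tooling, or reviewer workload need attention before the dataset ever reaches a model.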
Getting from A to B - sending data to storage
Now that we understand the substance of data as electrical signals represented by 0’s and 1’s, we need to understand how those signals get from the sensing device, the web browser, or the smart thermostat to a data storage location. How do we centralize all of this information to make it available for data transformation, analytics, and ultimately training our AI algorithms?
The TCP/IP Model
Back in the 1970s, as researchers were working out how to share the very few computers that were available, they began to develop a networking model that has largely evolved into what we today call the “Internet.” This model describes layers of independent protocols that communicate with each other to pass data from one machine to another.
You may have heard of an IP address and how it is associated with your internet connection, but you probably haven’t thought too much about it otherwise, and for good reason. The internet is an immensely complex web of software and hardware that works together to seamlessly pass 0’s and 1’s around the world at blazing speeds. For the purposes of data collection in an AI strategy, we only need to be somewhat aware of the inner workings, enough to deliver on our vision without compromising security along the way.
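To get a feel for how little the application layer needs to know, the short sketch below (assuming outbound network access) hands a few raw bytes to a TCP connection and lets the operating system's TCP/IP stack handle the packets, addressing, and retransmission underneath:

```python
import socket

# At the application layer we just write bytes; the OS's TCP/IP stack breaks
# them into segments, wraps them in IP packets, and retransmits anything lost.
with socket.create_connection(("example.com", 80)) as conn:
    # A minimal HTTP request, expressed as raw bytes handed to TCP.
    conn.sendall(b"HEAD / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
    reply = conn.recv(4096)

# The first line of whatever came back, e.g. "HTTP/1.1 200 OK".
print(reply.decode("utf-8", errors="replace").splitlines()[0])
```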
High-Speed data transfer
Imagine a scenario where an autonomous vehicle collects gigabytes of data per minute. The transmission speed of this data is critical for real-time analysis and decision-making. Here, leveraging high-speed cellular networks and understanding the bottlenecks in data transmission become paramount. This example shows the importance of evaluating data volume and network capabilities to ensure swift and cost-effective data handling.
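A rough, back-of-the-envelope calculation is often enough to spot a bottleneck like this early. The figures below are illustrative assumptions, not measurements from any real vehicle or network:

```python
# Can the uplink keep up with the data being generated?
data_generated_gb_per_min = 2.0   # assumed sensor output of the vehicle
uplink_mbps = 50.0                # assumed sustained cellular upload speed

# Convert GB per minute into megabits per second (1 GB ~= 8,000 megabits).
generated_mbps = data_generated_gb_per_min * 8_000 / 60
print(f"generated: {generated_mbps:.0f} Mbit/s vs uplink: {uplink_mbps:.0f} Mbit/s")

if generated_mbps > uplink_mbps:
    ratio = generated_mbps / uplink_mbps
    print(f"the link is ~{ratio:.0f}x too slow: compress, sample, or process at the edge")
```

Even a crude estimate like this tells you whether you need full raw-data capture, on-board filtering, or a lower collection rate for your first iteration.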
When crafting your data collection strategy, and while settling on your MVR (Minimum Viable Robot), you’ll want to think about the trade-offs and strategies that will get you the data you need at the rate you need it. Can you validate end-user value with lower data volumes at lower speeds? Doing so reduces implementation complexity and keeps costs low while you confirm that your model outputs deliver the value you outlined in your vision and narrative.
Emerging tech
Innovations like edge computing and 5G networks are revolutionizing how data is transmitted and processed. Edge computing allows data to be processed closer to its source, reducing latency and bandwidth use, while 5G networks offer unprecedented transmission speeds. These technologies are reshaping the landscape of data collection and storage, offering new possibilities for AI applications that require near-instantaneous data analysis and decision-making.
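As a sketch of the edge-computing idea, the function below (hypothetical, with simulated readings) summarises high-frequency sensor samples on the device so that only compact aggregates ever cross the network:

```python
import statistics

def summarise_at_the_edge(readings, window=60):
    """Reduce raw per-second readings to one summary per window before sending.

    Shipping a compact summary instead of every raw sample cuts bandwidth and
    latency while preserving the signal needed for downstream analytics.
    """
    summaries = []
    for start in range(0, len(readings), window):
        chunk = readings[start:start + window]
        summaries.append({
            "mean": statistics.fmean(chunk),
            "min": min(chunk),
            "max": max(chunk),
            "samples": len(chunk),
        })
    return summaries

# One hour of simulated per-second temperature samples becomes 60 summaries.
raw = [20.0 + 0.001 * i for i in range(3600)]
print(len(raw), "raw samples ->", len(summarise_at_the_edge(raw)), "summaries")
```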
Staying on top of what seems like a continuous stream of incredible innovations is daunting, but it is essential to know what tools and technologies are available so that you can construct a vision that takes advantage of the newest technology and stays ahead of the curve. This must be done with caution, though. The press release for a new tool may promise the world, but the tool often needs time to mature. Weigh this trade-off as you settle on your data collection strategy and ask whether the value of the new and cutting-edge truly outweighs the established, well-documented solution available today.
Don’t ignore security
With the advent of sophisticated hacking techniques, securing data in transit has never been more critical. Implementing encryption protocols such as TLS (Transport Layer Security) or IPsec can fortify data security. For instance, a healthcare provider transmitting sensitive patient data can employ these technologies to ensure data integrity and confidentiality, showing how security measures can be integrated without compromising performance. Early consultation with cybersecurity experts can illuminate potential vulnerabilities and the most effective countermeasures.
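As a minimal sketch of TLS in transit, the snippet below posts a JSON payload over HTTPS using Python's standard library, with certificate verification on and a minimum TLS version enforced. The endpoint and payload are placeholders, not a real healthcare API:

```python
import json
import ssl
import urllib.request

# Placeholder endpoint -- in practice this would be your provider's HTTPS API.
SECURE_URL = "https://example.com/observations"

# The default SSL context verifies the server's certificate; we also require
# at least TLS 1.2 so the payload is encrypted with a modern protocol version.
context = ssl.create_default_context()
context.minimum_version = ssl.TLSVersion.TLSv1_2

payload = json.dumps({"observation": "redacted-example"}).encode("utf-8")
request = urllib.request.Request(
    SECURE_URL,
    data=payload,
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request, context=context) as response:
    print("delivered over TLS, status:", response.status)
```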
Keep it simple
- Prioritize Data Integrity: Focus on ensuring the integrity and accuracy of your data. High-quality data is foundational for effective AI models. Consider techniques like anomaly detection and data cleansing early in the collection process to improve the quality of your inputs (see the sketch after this list).
- Evaluate Data Relevance and Representation: Assess the relevance of collected data in relation to your AI objectives. Ensure the data accurately represents the diversity of scenarios your AI solution will encounter. This might involve actively seeking out underrepresented data to avoid biases.
- Iterate with a Feedback Loop: Establish a feedback loop to continuously refine your data collection strategies based on the performance of your AI models. Use insights from data analytics and model outputs to identify gaps in your data and areas for improvement.
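As an example of the early cleansing mentioned in the first point above, here is a small, hypothetical filter that drops likely sensor misreadings using a median-based outlier test. The threshold is a common rule of thumb and would need tuning for your own data:

```python
import statistics

def clean_readings(readings, threshold=3.5):
    """Drop likely misreadings before they reach storage or a training set.

    Uses a median-based (MAD) outlier test, which is more robust to a single
    extreme glitch than a mean/standard-deviation check.
    """
    median = statistics.median(readings)
    mad = statistics.median(abs(r - median) for r in readings)
    if mad == 0:
        return list(readings)
    # 0.6745 scales the MAD so the score behaves like a z-score.
    return [r for r in readings if 0.6745 * abs(r - median) / mad <= threshold]

# A plausible temperature stream with one sensor glitch (999.0).
raw = [21.2, 21.4, 21.3, 21.5, 999.0, 21.4, 21.3]
print(clean_readings(raw))  # the 999.0 glitch is filtered out
```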
Wrapping Up
Data is the foundation for a reason. Lots of data is good, but lots of good data is better. By understanding data as a digital reflection of our complex world, we can focus on what’s important for the next layers of our AI Strategy Framework. I hope our discussion has led to an understanding that the quality of your data sets the stage for the transformative potential of AI.
As we move forward, the next instalment will talk about data storage, examining various storage options and their significance for your AI projects. We'll explore how these choices impact your ability to harness data effectively and adapt to evolving technological landscapes.
Need Help?
If you're seeking to unlock the full potential of AI within your organization but need help, we’re here for you. Our AI strategies are a no-nonsense way to derive value from AI technology. Reach out. Together we can turn your AI vision into reality.
Mitchell Johnstone
Director of Strategy
Mitch is a Strategic AI leader with 7+ years of transforming businesses through high-impact AI/ML projects. He combines deep technical acumen with business strategy, exemplified in roles spanning AI product management to entrepreneurial ventures. His portfolio includes proven success in driving product development, leading cross-functional teams, and navigating complex enterprise software landscapes.