Data scientists have long been aware of the concept of “garbage in, garbage out”: the idea that the quality of the results can be no better than the quality of the data that produced them. Indeed, much effort has been expended on cleansing data to ensure its accuracy. It should come as no surprise, then, that AI and machine learning (ML) algorithms are subject to the same quality standards.
Why is it, then, that even systems trained on the most accurate information are often plagued by erratic and biased results? How is it that we can put in gold and still get garbage? Apocryphal or not, the internet is replete with stories of what can go wrong when seemingly perfect data turns out to be biased in ways no one noticed. Whether attempting to find enemy tanks hidden in the forest or to tell a wolf from a house pet, what the machines learn is often not what we set out to teach them.
Thus, the solution to reducing bias cannot lie solely in the quality of the data. And perhaps surprisingly, it does not lie in the volume of data either. Even the most modern neural networks are susceptible to some form of gradient saturation, a condition that causes training to slow down or cease altogether.1 One important way to reduce bias in your AI and ML systems is to diversify your data.2 In the remainder of this post, we will look at what exactly this means, along with the challenges that come with creating a diversified data set.
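To make the idea of gradient saturation concrete, here is a minimal sketch. It assumes a sigmoid activation, which is our choice for illustration and not something the cited article prescribes. As the input moves away from zero, the gradient shrinks toward zero, so weight updates driven by it become vanishingly small no matter how much data is fed in.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The gradient peaks at x = 0 and collapses as the unit saturates.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:>4}: gradient = {sigmoid_grad(x):.6f}")
```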
What Is Data Diversity?
What exactly does it mean for data to be diverse? An obvious answer is that it means data that is simply different. But, hopefully, your intuition suggests there is more to it. After all, if all we cared about was data being different, we could randomly sample our training sets and expect similar results. Anyone who has tried this knows it's not that simple. Let’s examine the types of differences we actually care about.
- Regional: Regional differences can be thought of either as data from multiple locations collected at a single point or as data from a single point collected from multiple locations. In either case, the geographic context can be exploited to gain insight into anomalies.
- Temporal: This refers to data collected from different points in the pipeline or life cycle. Unfortunately, this is often overlooked as a form of data diversity — perhaps because it is a layer of information that human analysts have difficulty grappling with. Luckily, it's not difficult for machines to deal with data that has a temporal dimension. Furthermore, this type of data can be invaluable when determining anomalies — especially when multiple (think hundreds or thousands) timelines overlap.
- Structural: Data can be collected in different structured and unstructured formats. With modern collection engines, there is no bound to the types of information that can be processed by AI and ML algorithms. This can include all forms of digital data, such as images and video. Proper tools must be used to extract the desired data and transform it into usable information.
- Organizational: It is imperative that data be collected from different organizations, both internal and external. This is another category that is often overlooked, because it may not be immediately evident why data from separate IT organizations, for example, should be mixed together. However, excluding certain organizations, for whatever reason, is a form of selection bias, which is one of the most prominent factors in biased ML outcomes.3
- Varietal: This includes all forms of data collected from traditional monitoring and collector sources and from all nontraditional sources, such as social media, data lakes and cloud service providers. There is a virtually unlimited variety of data that can be consumed. To this end, several providers, such as Fivetran and Domo, have facilitated the collection, transformation and movement of structured, semistructured and unstructured data from a plethora of sources.
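One rough way to check a training set against the five dimensions above is to count how many distinct values each dimension takes and how evenly the records are spread across them. The sketch below is only an illustration: the field names (`region`, `day`, `format`, `org`, `source`) and the sample records are hypothetical, and normalized Shannon entropy is just one possible balance score.

```python
from collections import Counter
from math import log


def balance(values):
    """Normalized Shannon entropy in [0, 1].

    0.0 means every record shares one value; 1.0 means the records
    are spread evenly across all observed values.
    """
    counts = Counter(values)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * log(c / total) for c in counts.values())
    return entropy / log(len(counts))


def diversity_report(records, dimensions):
    """Summarize distinct values and balance for each dimension."""
    report = {}
    for dim in dimensions:
        values = [r[dim] for r in records]
        report[dim] = {
            "distinct": len(set(values)),
            "balance": round(balance(values), 3),
        }
    return report


# Hypothetical records; one field per diversity dimension.
records = [
    {"region": "us-east", "day": "mon", "format": "syslog", "org": "netops", "source": "collector"},
    {"region": "us-east", "day": "mon", "format": "syslog", "org": "netops", "source": "collector"},
    {"region": "us-east", "day": "tue", "format": "json", "org": "netops", "source": "collector"},
    {"region": "eu-west", "day": "wed", "format": "syslog", "org": "storage", "source": "cloud"},
]

report = diversity_report(records, ["region", "day", "format", "org", "source"])
for dim, stats in report.items():
    print(dim, stats)
```

A dimension with a single distinct value, or a balance score near zero, is a hint that the training set is dominated by one region, time window, format, organization or source.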
Creating More Diverse Data Sets
When it comes to creating a more diverse set of training data, there are some challenges that need to be overcome. Some of these are self-inflicted by outdated or lackadaisical practices, and some are inherent to big data itself. However, all of them require the proper tools to mitigate.
- Insufficient domain knowledge: Let’s face it: raw sensor data is often cryptically verbose and inherently noisy. Even if it were possible to make sense of it all, no data scientist wants to sift through the minutiae of raw data to ensure that only the important parts remain. What is needed is a strong abstraction layer that captures the essence of each data point, along with the ability to connect each data point to its corresponding model. With this capability, powerful tools can facilitate and even aid in the selection process.
- Inconsistent data formats and retention policies: Similar data collected from multiple sources may have wildly different formats and use cases. For example, one IT department may collect networking information with off-the-shelf tools and store it for years in a structured database, while another may use in-house scripts to create files that get rotated every few days. Without the appropriate tools, combining and transforming such data into a usable training set can range from cumbersome to downright impractical.
- Inadequate data selection: Whether the cause is an oversimplified sampling method or a query language that is inefficient, ineffective or, worse, nonexistent, the biggest barrier to creating a diversified data set is often the inability to separate the good from the unwanted.
  - Sampling: All methods of random or probabilistic sampling of data streams are prone to bias in one form or another.4 The weakness of these methods is intuitive yet often ignored. Suppose, for example, that you would like to sample a book. If the book has 100 pages and you read just 10 of them, chosen by random or probabilistic means, there is a very good chance you will not come away with a deep understanding of its contents. What you actually need is for someone who has read the entire book to tell you which pages are most critical. With big data, this is obviously not possible, so we have to settle for techniques that make the best attempt to capture the critical information in the data stream. This requires a priori knowledge of how the data interacts with the system, which makes tools that let you visualize these interactions imperative.
  - Filtering: Another common form of data selection is filtering, the process of reducing a large data set to a smaller, more manageable one. Filtering can succumb to the same biases as sampling, with the slight advantage that the entire data set is known ahead of time. Filtering effectively requires a rich, robust query language that allows data to be viewed from orthogonal perspectives. A simple example is the ability to view data across a time range, grouped by different attributes, to understand how those attributes change over time.
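The sampling and filtering points can be sketched in a few lines. This is an illustration under stated assumptions, not a description of how any particular product works: the records, the `ts` and `org` field names, and both helper functions are hypothetical. Stratified sampling is used here as one simple way to apply a priori knowledge (the grouping key) so that a rare group is not lost, and the bucketed count stands in for a "time range grouped by attribute" query.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_group, seed=0):
    """Take up to `per_group` records from every distinct value of `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


def count_by_bucket(records, bucket_seconds, attribute):
    """Count records per (start of time bucket, attribute value)."""
    counts = defaultdict(int)
    for r in records:
        bucket = r["ts"] - r["ts"] % bucket_seconds
        counts[(bucket, r[attribute])] += 1
    return dict(counts)


# Hypothetical stream: 990 events from one organization, 10 from another.
records = [{"ts": i, "org": "netops"} for i in range(990)]
records += [{"ts": 990 + i, "org": "storage"} for i in range(10)]

# Uniform sampling will usually capture few or none of the rare events.
uniform = random.Random(0).sample(records, 20)
stratified = stratified_sample(records, "org", per_group=10)

print("storage events, uniform:   ", sum(r["org"] == "storage" for r in uniform))
print("storage events, stratified:", sum(r["org"] == "storage" for r in stratified))

# View the same data across time, grouped by organization.
print(count_by_bucket(records, 500, "org"))
```

The stratified sample still encodes a choice (which key to stratify on), so it trades one kind of bias for another that is at least explicit and inspectable.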
Unleashing the Power of Zenoss
Zenoss Cloud is a SaaS-based intelligent IT operations management tool for streaming and transforming big data into contextual patterns amenable to AI and ML operations. A rich ecosystem of collectors facilitates the collection of diverse data sets, and the Smart View allows users to upvote and downvote data streams according to their relative contribution to training sets. This level of control is made possible by a robust modeling engine that lets users create customized schemas to describe and query their data. Developing an effective and unbiased anomaly detection and prevention system can be hard without the proper tools. Put the power of the Zenoss Cloud monitoring platform to work for you by scheduling a demo today.
1 https://www.informit.com/articles/article.aspx?p=3131594&seqNum=2
2 https://news.mit.edu/2016/variety-subsets-large-data-sets-machine-learning-1216