Part 1: What Data-Centric Machine Learning Is and Why We Need It
1 Exploring Data-Centric Machine Learning
Understanding data-centric ML
The origins of data centricity
The components of ML systems
Data is the foundational ingredient
Data-centric versus model-centric ML
Data centricity is a team sport
The importance of quality data in ML
Identifying high-value legal cases with natural language processing
Predicting cardiac arrests in emergency calls
Summary
References
2 From Model-Centric to Data-Centric – ML’s Evolution
Exploring why ML development ended up being mostly model-centric
The 1940s to 1970s – the early days
The 1980s to 1990s – the rise of personal computing and the internet
The 2000s – the rise of tech giants
2010–now – big data drives AI innovation
Model-centricity was the logical evolutionary outcome
Unlocking the opportunity for small data ML
Why we need data-centric AI more than ever
The cascading effects of data quality
Avoiding data cascades and technical debt
Summary
References
Part 2: The Building Blocks of Data-Centric ML
3 Principles of Data-Centric ML
Sometimes, all you need is the right data
Principle 1 – data should be the center of ML development
A checklist for data-centricity
Principle 2 – leverage annotators and SMEs effectively
Direct labeling with human annotators
Verifying output quality with human annotators
Codifying labeling rules with programmatic labeling
Principle 3 – use ML to improve your data
Principle 4 – follow ethical, responsible, and well-governed ML practices
Summary
References
4 Data Labeling Is a Collaborative Process
Understanding the benefits of diverse human labeling
Understanding common challenges arising from human labelers
Designing a framework for high-quality labels
Designing clear instructions
Aligning motivations and using SMEs
Collaborating iteratively
Dealing with ambiguity and reflecting diversity
Understanding approaches for dealing with ambiguity in labeling
Measuring labeling consistency
Summary
References
Part 3: Technical Approaches to Better Data
5 Techniques for Data Cleaning
The six key dimensions of data quality
Installing the required packages
Introducing the dataset
Ensuring the data is consistent
Checking that the data is unique
Ensuring that the data is complete and not missing
Ensuring that the data is valid
Ensuring that the data is accurate
Ensuring that the data is fresh