Unit 5: Statistical Representation of Data
Data Quality
Data quality refers to the condition or fitness of data to serve its intended
purpose in a given context. High-quality data helps ensure that decisions based
on it are accurate, effective, and reliable.
Key Dimensions of Data Quality:
1. Accuracy: The data correctly describes the "real-world" object or
event.
o Example: A person's name or address is correctly spelled.
2. Completeness: All required data is present.
o Example: Customer records include names, emails, and phone
numbers without missing fields.
3. Consistency: Data is consistent across different systems or datasets.
o Example: A customer's email address is the same in both the
CRM and billing system.
4. Timeliness: Data is up to date and available when needed.
o Example: Stock levels are updated in real time for e-commerce
websites.
5. Validity: Data conforms to the syntax (format, type, range) of its
definition.
o Example: Dates follow the DD/MM/YYYY format, and phone
numbers have the correct number of digits.
6. Uniqueness: Each entity is represented only once in the dataset.
o Example: No duplicate entries for the same product.
7. Relevance: The data is useful and applicable to the business goals.
o Example: Collecting customer feedback data that's relevant to
improving product design.
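Several of these dimensions can be measured directly. The following is a minimal Python sketch, not a standard method: it scores completeness, validity, and uniqueness as simple ratios over a small set of made-up customer records. The record fields, the REQUIRED list, and the function names are assumptions chosen for illustration.

```python
from datetime import datetime

# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "name": "Asha Rao", "email": "asha@example.com", "signup": "05/03/2024"},
    {"id": 2, "name": "Ben Lee", "email": None, "signup": "17/11/2023"},
    {"id": 3, "name": "Chen Wu", "email": "chen@example.com", "signup": "2024-01-09"},
    {"id": 1, "name": "Asha Rao", "email": "asha@example.com", "signup": "05/03/2024"},
]

# Assumed list of fields that every record must contain.
REQUIRED = ("name", "email", "signup")


def is_complete(rec):
    """Completeness: every required field is present and non-empty."""
    return all(rec.get(field) not in (None, "") for field in REQUIRED)


def has_valid_date(rec):
    """Validity: the signup date parses as DD/MM/YYYY."""
    try:
        datetime.strptime(rec["signup"], "%d/%m/%Y")
        return True
    except (TypeError, ValueError):
        return False


def completeness(recs):
    """Share of records with all required fields filled in."""
    return sum(is_complete(r) for r in recs) / len(recs)


def validity(recs):
    """Share of records whose signup date has the expected format."""
    return sum(has_valid_date(r) for r in recs) / len(recs)


def uniqueness(recs):
    """Share of records left after collapsing repeated ids."""
    return len({r["id"] for r in recs}) / len(recs)


print("completeness:", completeness(records))
print("validity:", validity(records))
print("uniqueness:", uniqueness(records))
```

Each score is 1.0 for a perfect dataset. Accuracy, timeliness, and relevance are harder to automate because they need an external reference (the real-world value, the current time, or the business goal) to compare against.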
Why Data Quality Matters:
Enables better decision-making
Reduces operational costs
Improves customer satisfaction
Ensures regulatory compliance
Boosts efficiency and productivity
How to Improve Data Quality:
Perform data profiling and audits
Set data governance policies
Use data validation rules
Implement ETL (Extract, Transform, Load) processes
Maintain metadata and documentation
Conduct regular cleansing and deduplication
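As a rough illustration of validation rules, cleansing, and deduplication working together, the sketch below normalizes records, rejects those that fail the rules, and drops duplicates. The field names, the 10-digit phone rule, the simple "@" check for emails, and the use of email as the duplicate key are all assumptions for this example, not fixed rules.

```python
def normalize(rec):
    """Cleansing: collapse extra spaces, lowercase the email, keep phone digits only."""
    return {
        "name": " ".join(rec["name"].split()),
        "email": rec["email"].strip().lower(),
        "phone": "".join(ch for ch in rec["phone"] if ch.isdigit()),
    }


def is_valid(rec, phone_digits=10):
    """Validation rules (assumed): email contains '@', phone has the expected digit count."""
    return "@" in rec["email"] and len(rec["phone"]) == phone_digits


def clean(records):
    """Return (kept, rejected): kept records are normalized, valid, and unique by email."""
    seen_emails = set()
    kept, rejected = [], []
    for raw in records:
        rec = normalize(raw)
        if not is_valid(rec):
            rejected.append(raw)  # keep the raw record for later review
            continue
        if rec["email"] in seen_emails:
            continue  # deduplication: same email already kept
        seen_emails.add(rec["email"])
        kept.append(rec)
    return kept, rejected


# Hypothetical input: two spellings of the same customer and one invalid record.
raw_records = [
    {"name": "  Asha   Rao ", "email": "Asha@Example.com ", "phone": "98765-43210"},
    {"name": "Asha Rao", "email": "asha@example.com", "phone": "9876543210"},
    {"name": "Ben Lee", "email": "ben.example.com", "phone": "12345"},
]

kept, rejected = clean(raw_records)
print(kept)
print(rejected)
```

Normalizing before comparing matters here: without it, the two spellings of the same customer would not be recognized as duplicates. Rejected records are returned rather than silently dropped so that they can be audited.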
Data Objects and Attribute Types
In data mining and data analytics, a data object represents an entity (for
example, a customer, a product, or a transaction), and attributes are the
properties or characteristics that describe that object.
1. Data Objects: