De Mast, J. & Trip, A. – Exploratory Data Analysis in Quality-Improvement Projects
Confirmatory data analysis (CDA) is concerned with testing a prespecified hypothesis.
Exploratory data analysis (EDA), during which data are screened for cues, leads to hypothesis
generation. Generating a hypothesis differs from hypothesis testing and estimation; it
can take the form of brainstorming or of exploiting suggestions that arise from the problem.
CDA and EDA are both contrasted with descriptive data analysis, which is the summary of a
dataset in a number of descriptive statistics. EDA goes somewhat further than descriptive
statistics, in that its aim is not merely to present salient features of a dataset, but also to
speculate and formulate hypotheses that have the potential to explain these salient features.
The authors view EDA from the perspectives of philosophy (discovery), AI (problem solving) and
medical sciences (diagnosis). They conclude that guidelines for hypothesis generation should
not consist of algorithms, but should have the form of heuristics. EDA does not follow fixed
rules: it is flexible, so there is no fixed plan, and it is speculative, so its hypotheses
have potential but are not established as true.
The purpose of EDA is the identification of dependent (Y) and independent (X) variables that
may prove to be of interest for understanding or solving the problem under study.
Y can be decomposed in two ways:
* Y = Y1 + Y2 + ... (the sum of a number of components, used to identify the interesting Y’s).
Here EDA helps to go from a broad problem description to a focused description of the
problem, so that causes and solutions can be found more easily.
* Y = E1 + E2 + ... (each E is the effect of one or several causal factors X, and the inquirer
wishes to discover those X’s). Here EDA helps to identify potential causal influence
factors.
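The first decomposition, Y = Y1 + Y2 + ..., can be illustrated with a minimal Python sketch. It is not taken from the article: the function name `rank_components` and the rework-hour figures are hypothetical, chosen only to show how splitting a broad Y into components points to the component worth focusing on.

```python
def rank_components(components):
    """Return (name, value, share_of_total) tuples, largest component first."""
    total = sum(components.values())
    if total == 0:
        raise ValueError("total of all components is zero; shares are undefined")
    ranked = sorted(components.items(), key=lambda item: item[1], reverse=True)
    return [(name, value, value / total) for name, value in ranked]


# Hypothetical data: total rework hours Y split by defect type (Y1, Y2, Y3).
rework_hours = {"scratches": 12.0, "misalignment": 30.0, "leaks": 6.0}

ranking = rank_components(rework_hours)
for name, hours, share in ranking:
    print(f"{name}: {hours} h ({share:.1%} of total)")
```

The largest component becomes the focused problem description; finding the X’s behind it is then a separate step.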
The process of EDA:
* Display the data: data should be organized to maximally exploit the pattern-recognition
capacities of our brains. The display should reveal the distribution of the data, for example
through stratified data, data in time order, or multivariate data.
* Identify salient features: assuming a neutral reference distribution, look for deviations
from this reference distribution.
* Interpret salient features: this is what turns descriptive statistics into EDA. A form of
discovery that often leads to more interesting hypotheses is explanation driven and is called
abduction. Abduction means that the inquirer compares conceptual combinations to the
observations until all the pieces seem to fit together and a possible explanation pops up.
The driving principle is explanatory coherence: the extent to which the pieces fit together.
An idea is more coherent the wider the range of observations it explains, the more consistent
it is with context knowledge, and the simpler it is.
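The first two steps above (display by stratum, then look for deviations from a neutral reference) can be sketched in Python. This is an illustration under stated assumptions, not the authors' procedure: the neutral reference is assumed to be "all strata share the same mean", the reference level is taken as the median of the stratum means, the threshold of 3 standard errors is arbitrary, and the machine names and measurements are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, median


def stratify(records):
    """Group (stratum, value) pairs into {stratum: [values]}."""
    groups = defaultdict(list)
    for stratum, value in records:
        groups[stratum].append(value)
    return dict(groups)


def salient_strata(groups, threshold=3.0):
    """Flag strata whose mean deviates from the reference level.

    Assumed neutral reference: every stratum has the same mean.
    The reference level is the median of the stratum means, and the
    spread is the pooled within-stratum standard deviation.
    """
    n_total = sum(len(values) for values in groups.values())
    dof = n_total - len(groups)
    if dof <= 0:
        raise ValueError("need more observations than strata")
    stratum_means = {name: mean(values) for name, values in groups.items()}
    reference = median(stratum_means.values())
    squared_dev = sum(
        (value - stratum_means[name]) ** 2
        for name, values in groups.items()
        for value in values
    )
    pooled_sd = sqrt(squared_dev / dof)
    if pooled_sd == 0:
        raise ValueError("no within-stratum variation; cannot scale deviations")
    flagged = {}
    for name, values in groups.items():
        std_error = pooled_sd / sqrt(len(values))
        score = (stratum_means[name] - reference) / std_error
        if abs(score) > threshold:
            flagged[name] = score
    return flagged


# Hypothetical measurements, stratified by machine.
records = [
    ("A", 10.0), ("A", 10.1), ("A", 9.9), ("A", 10.0),
    ("B", 10.1), ("B", 9.9), ("B", 10.0), ("B", 10.0),
    ("C", 10.6), ("C", 10.5), ("C", 10.7), ("C", 10.6),
]
groups = stratify(records)
flagged = salient_strata(groups)
print(sorted(flagged))
```

A flagged stratum is only a salient feature. Interpreting it, for instance asking what distinguishes that machine, is the abductive step and is not automated here.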
Conclusion:
- EDA is only suitable for the discovery of factors that actually vary during data
collection.
- EDA stimulates and gives direction to the subsequent use of other discovery tools,
such as brainstorming, pairwise comparison, etc.
See the article for a summary table.