Behavioural data science (BDS) = a field that combines psychology, statistics, data science, and
technology to understand, predict, and change human behavior using large datasets
Goals of BDS:
o Understanding = building theories that explain behavior (e.g. why do people become
depressed?)
o Prediction = using statistical models to predict behavior (e.g. predicting whether
someone is likely to drop out of university)
Machine learning = computer methods that learn patterns from data and
make predictions
o Change = developing interventions to change behavior (e.g. creating an app that
helps people exercise more often)
BDS is needed because behavior underlies major world problems (e.g. poverty, climate change,
war). Behavior has traditionally been studied with simple, small-scale tools, whereas BDS
makes it possible to work with huge datasets
o Psychology and AI can be combined to simulate human behavior (e.g. AI agents that
behave like people in a virtual town)
o BDS can analyze huge social networks and explain how polarization develops
Studying Twitter users during US elections revealed echo chambers = social
environments where people mainly hear opinions they already agree with
(e.g. conservatives mostly interact with conservatives)
The architecture of the data world
Data → phenomena → theory
Collect data to establish phenomena and build theories that explain phenomena
Data = recorded observations
o Big data = extremely large datasets generated through digital technologies
Phenomena = robust patterns that exist in the world and can be found in data (it is not data
itself) (e.g. smarter people often do well on many intelligence tests, insomnia and depression
are related, etc)
o Statistical model = a mathematical tool used to detect patterns in data
o Identical twins’ scores are more similar than fraternal twins' → this is a phenomenon
(not data, not yet an explanation)
Theory = a set of ideas that explain why a phenomenon exists (e.g. why does this
pattern/phenomenon occur?)
o Mathematical model = a theory written in mathematical form that can be tested and
simulated
Variable = a characteristic that can differ between people (e.g. age, IQ, happiness)
Experiments
Lexical decision task = an experiment where participants decide, by pressing a key, whether a
letter string is a word (e.g. tango) or a nonword (e.g. drapa). The response time (RT)
indicates how easily words can be activated from memory
o Evidence shows that people recognize common words faster than rare words
o General slowing hypothesis = older adults respond slower than younger adults as all
cognitive processes operate slower with age
By looking at the mean RT alone, older adults look slower (seemingly confirming
general slowing), but they were also more accurate (trading speed for accuracy),
which shows that slower responses do not automatically mean poorer ability
Speed-accuracy trade-off (SAT) = people can either respond quickly and
make more mistakes or respond slowly and make fewer mistakes
Ratcliff diffusion model = a mathematical model describing how people
collect evidence before making a decision (e.g. the brain collects clues until
enough evidence has built up to make a decision)
Drift rate (v) = the speed and quality of evidence accumulation,
which measures task difficulty and ability (e.g. an easy word
produces a higher drift rate)
Boundary separation (a) = the amount of evidence required before
deciding, where wider boundaries result in slower but more accurate
responses (parameter responsible for the SAT)
Starting point (z) = the initial bias toward one response
Non-decision time (Ter) = the time spent on perception and
movement rather than decision making (e.g. seeing the word and
pressing the key)
The model showed that younger and older adults had equal drift rates
(v = .25), but older adults had wider boundaries (a = .12 vs .08)
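The effect of boundary separation can be illustrated with a small simulation. This is a minimal sketch, not the fitting procedure used in the study: it uses the drift rate and boundary values from the notes, but the noise scale (s = 0.1), the non-decision time (ter = 0.3 s), the unbiased starting point (z = a/2), and the time step are assumptions chosen for illustration.

```python
import numpy as np

def simulate_ddm(v, a, z=None, ter=0.3, s=0.1, dt=0.001, n_trials=2000, seed=0):
    """Simulate a basic two-boundary diffusion process.

    v: drift rate, a: boundary separation, z: starting point,
    ter: non-decision time in seconds (assumed value),
    s: within-trial noise scale (assumed value).
    Returns (mean RT in seconds, proportion of correct responses).
    """
    rng = np.random.default_rng(seed)
    if z is None:
        z = a / 2  # assumption: no bias toward either response
    x = np.full(n_trials, z, dtype=float)   # accumulated evidence per trial
    t = np.zeros(n_trials)                  # decision time per trial
    active = np.ones(n_trials, dtype=bool)  # trials still accumulating
    while active.any():
        n = active.sum()
        # Evidence changes by drift plus Gaussian noise in each small time step
        x[active] += v * dt + s * np.sqrt(dt) * rng.standard_normal(n)
        t[active] += dt
        # A trial stops once the evidence crosses either boundary
        active &= (x > 0) & (x < a)
    correct = x >= a  # upper boundary is treated as the correct response
    return (t + ter).mean(), correct.mean()

rt_young, acc_young = simulate_ddm(v=0.25, a=0.08)
rt_old, acc_old = simulate_ddm(v=0.25, a=0.12)
print(f"a = .08: mean RT {rt_young:.3f} s, accuracy {acc_young:.3f}")
print(f"a = .12: mean RT {rt_old:.3f} s, accuracy {acc_old:.3f}")
```

With the same drift rate, the wider boundary should produce slower but more accurate responses, which is the speed-accuracy trade-off described above.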
Importance of models
Models are stupid, meaning that they simplify reality, and this simplification is what helps us think clearly
Formal models force you to specify the parts, explain their relationships, and think about
dynamics
Theory construction method = a five-step method to generate a theory
1) Identify phenomena: find patterns that need explanation
2) Generate a proto-theory = an initial explanation of the phenomenon
3) Formalize the proto-theory: turn the theory into a mathematical model
4) Evaluate the explanatory adequacy/power
5) Improve the theory: adjust the theory if it fails and generate new predictions
o Analogical abduction = borrowing a model from another field that shows a similar
pattern (e.g. the positive manifold of intelligence resembles correlated growth
between animal populations → borrow mutualism (positively coupled growth) as the
explanation)
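The formalization step for the mutualism example can be sketched as follows. This is only one possible formalization, assuming coupled logistic growth in the style of ecological population models; the number of abilities, the coupling strength, the growth rate, and the distribution of capacities are illustrative assumptions, not values from the notes. Each person's capacities K are drawn independently, so any correlation between the final abilities comes from the coupling alone.

```python
import numpy as np

def simulate_mutualism(n_people=1000, n_abilities=4, coupling=0.1,
                       growth=0.5, dt=0.05, n_steps=1200, seed=0):
    """Euler simulation of positively coupled logistic growth.

    Returns the correlation matrix of the capacities K and of the
    final ability levels x (people in rows, abilities in columns).
    """
    rng = np.random.default_rng(seed)
    # Independent capacities: no shared general factor is built in
    K = rng.normal(10.0, 1.0, size=(n_people, n_abilities))
    # Every ability supports every other ability with the same positive weight
    M = coupling * (np.ones((n_abilities, n_abilities)) - np.eye(n_abilities))
    x = np.full((n_people, n_abilities), 0.1)  # all abilities start low
    for _ in range(n_steps):
        support = x @ M.T  # help each ability receives from the others
        dx = growth * x * (1 - x / K) + growth * x * support / K
        x = x + dt * dx
    return np.corrcoef(K, rowvar=False), np.corrcoef(x, rowvar=False)

corr_K, corr_x = simulate_mutualism()
print(np.round(corr_K, 2))  # capacities: off-diagonal values near zero
print(np.round(corr_x, 2))  # final abilities: off-diagonal values positive
```

In this sketch a positive manifold emerges among the final ability levels even though the underlying capacities are uncorrelated, which is the kind of explanation that evaluating the formalized theory (step 4) would then test against data.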
Smaldino: Models are Stupid, and We Need More of Them
Formal models are deliberately oversimplified (stupid), which is why they're useful (as humans are
boundedly rational and language is imprecise)
Bounded rationality = humans have limited resources for modelling a fast-changing world, so
we reason poorly about complex systems
Monty Hall problem = pick 1 of 3 doors; the host reveals a goat behind another door and offers
a switch. You should switch (2/3 vs 1/3): the host can always reveal a goat, so the reveal gives
no information about your original door, which keeps its 1/3 chance
o The common intuition that the choice is 50/50 is wrong: there is a 1/3
probability that the initial choice was correct, and a 2/3 probability that it was wrong
This means that switching is the right move 2/3 of the time
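The 1/3 vs 2/3 result can be checked by simulation. The sketch below is a plain Monte Carlo run using only the rules stated above; the function name, the number of games, and the seed are arbitrary choices.

```python
import random

def monty_hall(n_games=100_000, seed=0):
    """Return the win rates for (staying, switching) over simulated games."""
    rng = random.Random(seed)
    stay_wins = 0
    switch_wins = 0
    for _ in range(n_games):
        car = rng.randrange(3)   # door hiding the prize
        pick = rng.randrange(3)  # contestant's initial choice
        # The host opens a door that is neither the pick nor the prize
        opened = rng.choice([d for d in range(3) if d != pick and d != car])
        # Switching means taking the one remaining closed door
        switched = next(d for d in range(3) if d != pick and d != opened)
        stay_wins += (pick == car)
        switch_wins += (switched == car)
    return stay_wins / n_games, switch_wins / n_games

stay, switch = monty_hall()
print(f"stay: {stay:.3f}, switch: {switch:.3f}")  # roughly 1/3 and 2/3
```

Writing the problem out this explicitly is itself a small example of a formal model correcting a faulty intuition.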
Verbal model = a theory stated in words, where parts of a system and their relationships are usually
not well articulated (e.g. saying an individual has an array of social identities)
Verbal models can appear superior to formal models only by employing strategic ambiguity
= the danger of verbal models: by being vague, they let each reader pick their preferred
interpretation, which gives an illusion of understanding (or allows theories that are unfalsifiable)
o One must ignore some details about complexity and organization to make any
headway, hence why verbal models are usually vague in their explanation
o Cubist chicken parable: two friends agree a LEGO build is a chicken, but discover on
closer inspection that it looked more like a rooster
Formal model = a verbal model instantiated as explicit mathematics/algorithms (e.g. specifying precisely
that an individual has an array of social identities, modeled as a computational object, which in
turn might be represented by simple numerical values for the sake of comparisons between individuals)
Formal models make assumptions explicit, so that the conclusions that follow from them are clear
o Conclusions will be flawed as the assumptions are ultimately incorrect, or at least
incomplete
o By examining how conclusions differ from reality, one can refine the models, and
thereby refine the theories, to become less wrong
Formal models are often unrealistic and ignore huge swaths of reality; however, this is not
necessarily a downside
o Humans can’t function without ignoring most of the facts of the world (selective
attention, cocktail party effect), where the ignorance is fundamentally adaptive
o Formal models systematize our stupidity, and ensure that we are all talking about the
same thing
The difference between formal and verbal models is that formal models make it clear which
factors are being considered and which are being excluded
Linear model = a model mostly used in statistics, which is obviously wrong yet useful, as it describes
relationships between variables
It assumes that data are generated by random sampling from some distribution, so the
model says little about the processes that actually generate the data, or about the
mechanistic relationships between variables
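A small example can show what "wrong yet useful" means here. In this sketch the data come from a hypothetical saturating process (an assumption made purely for illustration), and a straight line is fitted to them with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
# Hypothetical data-generating mechanism: a saturating curve plus noise
y = 5 * (1 - np.exp(-0.5 * x)) + rng.normal(0, 0.3, size=x.size)

# Fit a straight line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

print(f"slope: {slope:.2f}")  # positive: the line captures the overall trend
print(f"mean residual, low x: {residuals[:20].mean():.2f}")
print(f"mean residual, mid x: {residuals[60:100].mean():.2f}")
```

The fitted slope correctly indicates that y increases with x, which is the useful part. The residuals are systematically negative at low x and positive in the middle range, which shows that the line misses the saturating mechanism that actually generated the data.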
Wimsatt's 12 functions of false models = a catalogue of useful goals a known-false model can
achieve
1) An oversimplified model may act as a starting point in a series of models of increasing
complexity and realism
2) A known incorrect but otherwise suggestive model may undercut the too ready acceptance
of a preferred hypothesis by suggesting new alternative lines for the explanation of the
phenomena
3) An incorrect model may suggest new predictive tests or new refinements of an established
model, or highlight specific features of it as particularly important