2024-500 Period 5
Lecture 2
▪ What is statistics?
It’s the science of collecting, analysing, interpreting, and presenting data. It helps us make
sense of the world.
▪ What is a causal question?
- It implies that one thing influences or determines another. (e.g., does studying
European Studies affect graduate income?).
• Why is it difficult?
- A correlation between two things doesn’t always mean one causes the other; the correlation could result from:
➢ Pure chance.
➢ Alternative explanations
➢ Simultaneous trends
➢ True causal links, but further analysis is needed
- Example: the correlation between ice cream sales and violent crimes – both
increase in summer, but one doesn’t cause the other.
▪ Research questions:
- Should be answerable, falsifiable, interesting, and feasible.
• Quantitative: tests hypotheses, finds cause-and-effect relationships, surveys
large populations.
• Qualitative: explores processes, builds theories, explains complex
behaviour.
▪ Operationalization and variables
• Operationalization: translating abstract concepts into measurable variables.
- Easy concepts to measure: socio-economic status, education, economic
development.
- Harder concepts: happiness, health, democracy.
▪ Statistical models: Statistical models are simplified versions of reality. They help us
understand and explain the world by using data. With these models, we can make guesses
about things we can’t directly see, like the hidden patterns that create the data we observe.
▪ Sampling: we can’t study everyone, so we look at a smaller group.
▪ Types of sampling:
• Random: simple random sampling, stratified sampling, cluster sampling.
➢ Simple random sampling: everyone is chosen completely at random, like drawing
names from a hat. Every individual has the same chance of being picked.
➢ Stratified sampling: the population is divided into groups (called strata) based on a
certain characteristic (e.g. gender, age), and then people are randomly chosen from each
group. This ensures all groups are represented.
➢ Cluster sampling: the population is divided into clusters (e.g. schools, cities). Then,
some clusters are randomly selected, and all individuals in those clusters are surveyed.
• Non-random sampling: convenience sampling, quota sampling, snowball sampling.
➢ Convenience sampling: you use whoever is easiest to reach, for example, asking your
friends or people walking by.
➢ Quota sampling: you pick a certain number of people from specific groups, but not
randomly. For example, choosing 10 men and 10 women, but based on whoever is available.
➢ Snowball sampling: you start with a few people and ask them to refer others. It’s often
used for hard-to-reach groups (like undocumented workers or specific communities).
Week 2, PWP slides:
• The research process & Variables
- Research starts with a question, e.g., “is there discrimination in the job market?”
- Each row in a dataset = a unit (like a different person/country).
- Each column = a variable, i.e., a characteristic you’re collecting data on (e.g., age,
income, political interest).
- Types of variables:
o Categorical variable (nominal/ordinal): eye color, political opinion.
o Numeric (continuous/discrete): age, time spent studying.
- Attributes: a variable is a general category, and the attributes are the answers or options
people can choose or give.
➢ Example:
- Variable: citizenship
- Attributes: citizen/non-citizen.
• Getting to know your data
1. Checking for missing values:
- Sometimes in your dataset, some answers are missing (people didn’t respond, data wasn’t
recorded) = missing values, and they can mess up your analysis if you don’t notice them.
➢ You can check for missing values using the command mdesc in Stata (a user-written
command; install it once with ssc install mdesc). It tells you how many missing values
there are for each variable.
2. Use summary statistics for numeric variables:
- If a variable is numeric, you can summarize it using statistics like:
o Mean (average)
o Median (middle value)
o Min/max (smallest/largest value)
➢ In Stata, the command is: **summarize varname**, e.g., summarize age
(this gives you the mean, standard deviation, min, and max for one variable; add the
detail option to also get the median)
3. Use frequency tables for categorical variables:
- If a variable is categorical (gender, eye color), you can’t calculate a mean – but you can
see how often each value appears in your data.
➢ In Stata, the command is: **tabulate varname**
(this gives you a table showing how many people picked each category).
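The three checks above can be tried out in a short do file. This is a minimal sketch that assumes the auto example dataset shipped with Stata (not the course data); misstable summarize is used here as a built-in alternative to mdesc.

```stata
* Load an example dataset that ships with Stata (assumption: not the course data)
sysuse auto, clear

* 1. Missing values: built-in report on variables that have missing values
misstable summarize

* Count the missing values of a single variable
count if missing(rep78)

* 2. Summary statistics for a numeric variable
summarize price

* The detail option adds percentiles, including the median (50%)
summarize price, detail

* 3. Frequency table for a categorical variable
tabulate foreign

* The missing option lists missing values as their own category
tabulate rep78, missing
```

With your own data, replace price, foreign, and rep78 with the variable names from your dataset.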
• Using Stata
o Main windows in Stata
- Output window: this is where you see the results of what you’ve done (a feedback
screen).
- Command line: this is where you type one command at a time, for example, summarize age.
- Do file editor: this is a separate place where you write your code before running it. It’s
like a notebook where you can organize your work, write comments, and save everything.
- Variables window: this shows you all the variables (columns) in your dataset. You can
browse them and see their names and labels.
o Why use do files?
- You can write code first, then run it when ready (instead of doing everything one-by-one).
- You can add comments to explain what you’re doing:
▪ // this is a comment, e.g., summarize age // This shows basic stats for the age variable
▪ * this is also a comment (the * must come at the start of the line)
▪ /* this is a longer comment that can go across multiple lines */
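Put together, a small do file using all three comment styles might look like the sketch below. It assumes the auto example dataset shipped with Stata, which is not the course data.

```stata
* Assumption: example data shipped with Stata, not the course dataset
sysuse auto, clear

// A comment on its own line
summarize mpg   // a comment after a command (leave a space before the //)

* A star comment has to start the line

/* A block comment
   can span several lines */
tabulate foreign
```

As far as I know, the // and /* */ styles only work inside do files; when typing in the command line, only the * style is accepted.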
• Common commands (Stata is case-sensitive, so commands are typed in lowercase)
o generate newvar = …
➢ This creates a new variable in your dataset.
For example
- generate young = 1 if age < 25
➢ Creates a new variable called young, which equals 1 for everyone younger than 25
(everyone else gets a missing value).
o replace varname = …
➢ This changes the values of an existing variable.
For example
- replace age = 18 if age == .
➢ Replacing missing ages with 18.
o drop varname
➢ This deletes a variable from the dataset.
For example
- drop eye_color
➢ Removes the variable eye_color from your data.
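o recode varname
➢ This changes values into new categories.
For example
- recode age (0/18 = 1 "young") (19/64 = 2 "adult") (65/max = 3 "senior"), generate(age_group)
➢ Turns age into three groups: young, adult, and senior. The generate() option stores the
result in a new variable (age_group is just an example name) and is needed when the rules
include labels; without generate(), recode overwrites age itself.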
o label variable varname "label"
➢ This adds a description to a variable.
For example
- label variable polintr "political interest"
➢ Now, when you look at polintr, it will show the label political interest.
o tabulate varname
➢ This creates a frequency table for a variable.
For example:
- tabulate gender
➢ Shows how many people are male, female, etc.
o summarize varname
➢ This gives you summary statistics for a numeric variable.
For example:
- summarize income
➢ Shows the mean, min, max, etc. for income.
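The commands above can be chained in one do file. This is a minimal sketch that assumes the auto example dataset shipped with Stata; the variable names cheap and mpg_group and the cut-off values are made up for illustration.

```stata
* Assumption: example data shipped with Stata, not the course dataset
sysuse auto, clear

* generate: new variable equal to 1 for cars priced under 5000
generate cheap = 1 if price < 5000

* replace: set the remaining cars to 0 (otherwise they stay missing)
replace cheap = 0 if price >= 5000

* recode: group mpg into three labelled categories, stored in a new variable
recode mpg (min/19 = 1 "low") (20/29 = 2 "medium") (30/max = 3 "high"), generate(mpg_group)

* label variable: attach a description to the new variable
label variable mpg_group "Fuel efficiency group"

* drop: remove a variable that is no longer needed
drop cheap

* Check the result
tabulate mpg_group
summarize price
```

Pairing generate with a follow-up replace is one common way to avoid leaving missing values in a new 0/1 variable.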
• Using conditions in commands (so that your analysis or changes only apply to certain
subgroups in your data – not everyone).
o What are conditions?
They let you tell Stata:
Do this command, but only for people/rows that meet these specific criteria.
➢ “Only look at people under 40.”
o Logical symbols used in conditions:
- == equal to
➢ gndr == 2 (means “gender is 2”; in this dataset, 2 = female)
- != not equal to
➢ gndr != 2 (means “not female”)
- > greater than
- < less than
- & = and (both things must be true)
- | = or (at least one must be true)
o Examples explained:
1. generate newvar = 1 if gndr == 2 & polintr < 2
➢ Creates a new variable that is 1 for women who are very politically interested
(gndr == 2 = female, polintr < 2 = “very interested”)
- Other people (who don’t meet both conditions) get a missing value for this new variable.
2. tabulate polintr if gndr == 2
➢ Shows a frequency table of political interest only for women.
3. summarize inwtm if agea < 40 & gndr != 2
➢ Gives summary statistics (mean, min, max, etc.) for interview time
- But only for men under 40.
- (agea < 40 = younger than 40, gndr != 2 = not female)
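The same logic can be tried on any dataset. This is a minimal sketch that assumes the auto example dataset shipped with Stata instead of the survey data used above; it also shows one pitfall with > and missing values.

```stata
* Assumption: example data shipped with Stata, not the course dataset
sysuse auto, clear

* and (&): foreign cars that also do more than 25 miles per gallon
summarize price if foreign == 1 & mpg > 25

* or (|): cars that are foreign or priced under 4000
tabulate foreign if foreign == 1 | price < 4000

* not equal (!=): all cars that are not foreign
summarize mpg if foreign != 1

* Stata treats missing numeric values as larger than any number,
* so a "greater than" condition also picks up missing values
count if rep78 > 3

* Excluding the missing values explicitly
count if rep78 > 3 & !missing(rep78)
```

Comparing the two counts at the end shows how many rows were included only because rep78 is missing.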
• Best practices
o Always write code in Do files: Instead of typing commands directly into Stata’s
command line (which disappears after you close it), you should write your code in a
Do file. This is a separate file where you can write, save, and rerun all your
commands.
- You can save your code and reuse it later.
- You can edit your commands easily.
o Save do files often!
o Use comments to explain what you’re doing: In the Do file, write comments next to
or above your commands to remind yourself what the code does.
o Save your session output using log files: A log file keeps track of everything Stata
prints in the output window: your results, tables, errors, etc.
- To start recording, type: log using "filename"
- To stop recording, type: log close
Given document
o Log files:
- Creating log files: you can create a log file to record your session output (results) using:
log using my_first_log.log
- Closing log files:
log close
- Appending or replacing log files: append adds new output to an existing log file, while
replace overwrites it:
log using my_first_log.log, append
log using my_first_log.log, replace
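A typical do file wraps the analysis between opening and closing the log. This is a minimal sketch that assumes the auto example dataset shipped with Stata; the capture log close line is a common habit, not something from the slides.

```stata
* Close any log left open by an earlier run
* (capture suppresses the error if no log is open)
capture log close

* Start recording; replace overwrites an existing file with the same name
log using my_first_log.log, replace

* Assumption: example data shipped with Stata, not the course dataset
sysuse auto, clear
summarize price

* Stop recording
log close
```

Using replace here means the do file can be rerun without Stata complaining that the log file already exists; use append instead if you want to keep the earlier output.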