Midterm
• Each chapter has two parts: a Summary and Practice questions (106 questions in total).
Please read the lecture materials and the summary before you study the practice questions. The practice
questions are similar to the midterm questions. If you have any questions about the material,
please go back to the course materials first to make sure you have reviewed all the concepts discussed in the
course.
• Exam Instructions:
Notes:
1) This is an ONLINE exam. Please log in to Blackboard at 8:05 pm on March 23rd; you can find
the midterm under the “Assignments” section.
2) All questions in the midterm will be different from the practice questions. However, if you
understand all the questions in the mock exam, you can do well.
3) Please note that the number of questions has changed to 25.
4) Academic Integrity: No communication with others is allowed while you are taking the exam.
Academic misconduct is a serious offense. If it happens, I will follow the academic misconduct
policy of Hofstra University.
Date: Monday, March 23, 2020
Duration: 1 hour 55 mins (8:05 pm – 10:00 pm)
Number of Questions: 25 Questions
Grading: All questions are worth 4 points each, for a total of 100 points. (The midterm is 30% of the final grade.)
Range: Chapter 2, Chapter 3, and Chapter 5 (Lecture Note Page 1 – 13)
Chapter 2: Describing the Distribution of a Variable
1. Summary
Chapter 2-1. Introduction
4 steps in data analysis:
1) Recognize a problem
2) Gather data
3) Analyze the data using the tools you learned in Chapter 2.
4) Act on this analysis by changing policies, publishing reports, and so on.
Chapter 2-2. Basic Concepts
2-2a. Population and Sample
• Population: all members (data) of interest in a study.
• Sample: a subset of the population. A sample should be representative of the population so that observed
characteristics of the sample can be generalized to the population as a whole.
2-2b. Data Sets, Variables, and Observations
Data set: a rectangular array of data, with variables in columns and observations in rows.
Columns: Variables (characteristics of members of a population, such as height, gender, or salary)
Rows: Observations (a list of all variable values for a single member of a population)
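As a rough illustration of the rows-and-columns idea, the sketch below stores a tiny data set in Python. The variable names (height_cm, gender, salary) and all values are made up for this example; they do not come from the course data.

```python
# Hypothetical data set: each dictionary is one observation (row),
# and each key is one variable (column).
data_set = [
    {"height_cm": 170, "gender": "F", "salary": 52000},
    {"height_cm": 182, "gender": "M", "salary": 48000},
    {"height_cm": 165, "gender": "F", "salary": 61000},
]

# The variables are the column names shared by every observation.
variables = list(data_set[0].keys())

# The number of observations is the number of rows.
n_observations = len(data_set)

# Reading one variable across all observations gives one column.
salary_column = [row["salary"] for row in data_set]

print(variables)
print(n_observations)
print(salary_column)
```

Each row holds every variable value for one member of the population, which matches the definition of an observation above.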
2-2c. Data Types
1) Categorical data type: values are expressed as text, not numbers
Ordinal: there is a natural ordering of the possible values
Nominal: there is no natural ordering
* dummy variable: a 0-1 coded variable for a specific category (ex. Gender)
* binned variable: a numerical variable that has been categorized into discrete
categories
2) Numerical data type: values are expressed as numbers, not text
Discrete: the data is countable, with clear gaps between values
Continuous: the data is measurable and falls on a continuous scale
Cross-sectional: data on a cross-section of a population at a distinct point in time
Time Series data: data collected over time
3) Date (Excel stores dates as numbers, but dates are treated differently from typical numbers): we
did not spend much time on this third ‘date’ data type in class, because we did not do any data analysis on
date variables.
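The dummy variable and binned variable ideas from the data types above can be sketched in a few lines of Python. This is only an illustration: the records, the choice of "F" as the coded category, and the age cutoffs (30 and 60) are assumptions made for the example, and the helper age_bin is hypothetical.

```python
# Hypothetical records: one categorical variable and one numerical variable.
genders = ["F", "M", "F", "M"]
ages = [19, 34, 52, 67]

# Dummy variable: code the category "F" as 1 and everything else as 0.
is_female = [1 if g == "F" else 0 for g in genders]


def age_bin(age):
    """Place a numerical age into one of three discrete categories (assumed cutoffs)."""
    if age < 30:
        return "Under 30"
    if age < 60:
        return "30-59"
    return "60 and over"


# Binned variable: the numerical ages become ordered categories.
age_groups = [age_bin(a) for a in ages]

print(is_female)
print(age_groups)
```

Note that the binned result is an ordinal categorical variable, because the age groups have a natural ordering.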
Chapter 2-3, Chapter 2-4 and Chapter 2-5. Summarizing Categorical, Numerical Variables and Time Series Data
Below is the summary for chapter 2-3, 2-4 and 2-5.
Each data analysis concept is listed with its Excel function (marked in bold) or its calculation formula.
1) Count: =COUNT(RANGE), =COUNTIF(RANGE, CRITERIA) or =COUNTIFS(RANGE1, CRITERIA1, RANGE2, CRITERIA2,…)
Criteria in the COUNTIF or COUNTIFS function: the condition that controls which cells are counted.
Examples: =COUNTIF(D5:D12, ">100") // Count cells greater than 100
=COUNTIF(D5:D12, 100) // Count cells equal to 100 (if you are looking for a specific number, there
is no need to add quotation marks).
=COUNTIF(D5:D12, "CA") // Count cells equal to ‘CA’
2) Percentage: Count/Total number
3) Mean: =AVERAGE(RANGE)
4) Median: =MEDIAN(RANGE)
5) Mode: =MODE(RANGE)
6) Min: =MIN(RANGE)
7) Max: =MAX(RANGE)
8) Quartile: =QUARTILE(RANGE, NUMBER)
Number – can be 1, 2, 3, or 4
There are 4 quartiles: Q1 (25%), Q2 (50%), Q3 (75%), and Q4 (100%). To calculate Q1 of a data set, the
number in the Excel formula is 1, so the formula is =QUARTILE(RANGE,1), and so on.
9) Percentile: =PERCENTILE(RANGE, k)
k – can be a number between 0 and 1 or the equivalent percentage (ex. k can be 0.3 or 30% if you want the 30th percentile
of the data)
10) Range: Maximum – Minimum
11) Interquartile Range (IQR): Q3 – Q1
12) Variance: =VAR.P(RANGE) or =VAR.S(RANGE)
VAR.P can be used if the data is from the population, and VAR.S can be used if the data is from the sample.
13) Standard Deviation: =STDEV.P(RANGE) or =STDEV.S(RANGE)
STDEV.P can be used if the data is from the population, and STDEV.S can be used if the data is from the sample.
14) Skewness: =SKEW(RANGE) or =SKEW.P(RANGE)
SKEW can be used in both cases; SKEW.P can be used when the data is from the population.
As I mentioned during the lecture, if the skewness is greater than +3 or less than -3, the distribution is not normal. If the
skewness is greater than +1 or less than -1 (some say the cutoff should be ±2), we can say the distribution is highly skewed.
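To see what the functions in this list compute, the sketch below reproduces several of them in Python on a small made-up list of numbers. It is an illustration under stated assumptions, not the course's method: the data values are hypothetical, the helper percentile_inc is hypothetical and assumes that QUARTILE and PERCENTILE interpolate linearly between sorted values, and the two skewness formulas are the commonly documented sample (SKEW) and population (SKEW.P) versions.

```python
import statistics

# Hypothetical data standing in for an Excel range such as D5:D11.
values = [2, 4, 4, 5, 7, 9, 11]
n = len(values)

# 1) and 2) COUNTIF(RANGE, ">5") and the percentage (count / total number).
count_over_5 = sum(1 for v in values if v > 5)
pct_over_5 = count_over_5 / n

# 3) to 7) and 10) AVERAGE, MEDIAN, MODE, and range (MAX minus MIN).
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)
value_range = max(values) - min(values)


def percentile_inc(data, k):
    """Percentile for 0 <= k <= 1 using linear interpolation (assumed method)."""
    ordered = sorted(data)
    pos = k * (len(ordered) - 1)          # fractional position in the sorted list
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


# 8), 9) and 11) Quartiles as the 25% and 75% percentiles, then IQR = Q3 - Q1.
q1 = percentile_inc(values, 0.25)
q3 = percentile_inc(values, 0.75)
iqr = q3 - q1

# 12) and 13) Population versions divide by n; sample versions divide by n - 1.
var_p = statistics.pvariance(values)   # like VAR.P
var_s = statistics.variance(values)    # like VAR.S
std_p = statistics.pstdev(values)      # like STDEV.P
std_s = statistics.stdev(values)       # like STDEV.S

# 14) Skewness: population formula (like SKEW.P) and sample formula (like SKEW).
skew_p = sum(((v - mean) / std_p) ** 3 for v in values) / n
skew_s = n / ((n - 1) * (n - 2)) * sum(((v - mean) / std_s) ** 3 for v in values)

print(count_over_5, round(pct_over_5, 3))
print(mean, median, mode, value_range)
print(q1, q3, iqr)
print(round(var_p, 3), var_s, round(std_p, 3), round(std_s, 3))
print(round(skew_p, 3), round(skew_s, 3))
```

Both skewness values here are positive and well below 1, so by the rule of thumb above this made-up data set would count as only mildly right-skewed. The sample variance is larger than the population variance because it divides by n - 1 instead of n.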