Programming for Business Analytics Notes

Pages: 69
Uploaded on: 21-03-2022
Written in: 2021/2022

Programming for Business Analytics Notes taken at National University of Singapore

Week 1: Summarising, plotting and gathering data

Variable types

- Categorical / qualitative variable: when a variable consists of categories (names or labels) and
answers questions about how cases fall into those categories
- Numerical / quantitative variable: when a variable has measured numerical values with units and
the variable tells us about the quantity of what is measured

Categorical variables

- Arise from descriptive responses to questions like ‘what kind of advertising do you use?’
- May only have 2 possible values (like yes or no)
- May be a number like a zip code
- Is also called nominal, since the labels for the categories are merely distinguishing names
- A dummy variable is a 0-1 coded variable for a specific category. It is coded 1 for all observations
in that category and 0 for all observations that are not in that category
- Categorising a numerical variable as categorical is called binning (putting data into discrete bins)
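
The dummy coding and binning described above can be sketched in a few lines of Python. The data, the variable names and the age cut-offs below are made up for illustration and are not taken from the course.

```python
# Hypothetical survey data: advertising channel (categorical) and age (numerical).
channels = ["online", "print", "online", "tv", "print"]
ages = [23, 37, 45, 61, 18]

# Dummy variable for the "online" category: 1 if the case is in it, 0 otherwise.
online_dummy = [1 if c == "online" else 0 for c in channels]


def age_bin(age):
    """Bin a numerical age into a labelled category (cut-offs are arbitrary)."""
    if age < 30:
        return "under 30"
    if age < 50:
        return "30 to 49"
    return "50 and over"


age_groups = [age_bin(a) for a in ages]

print(online_dummy)
print(age_groups)
```

Each category would get its own 0-1 column if more than one dummy were needed; only the "online" dummy is shown here.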

Numerical variables

- A numerical variable has units
- Is discrete if it results from a count, such as the number of children.
- A continuous variable is the result of an essentially continuous measurement, such as weight or
height
- Cross-sectional data are data on a cross section of a population at a distinct point in time. Time
series data are data collected over time

A categorical variable has 3 aspects

- The number of categories
- The category names
- The number of observations, or counts, in each category
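
A frequency table captures all three aspects at once. A minimal sketch using Python's `collections.Counter`, with hypothetical survey responses:

```python
from collections import Counter

# Hypothetical responses to "what kind of advertising do you use?"
responses = ["online", "print", "online", "tv", "print", "online"]

counts = Counter(responses)        # number of observations in each category
n_categories = len(counts)         # number of categories
category_names = sorted(counts)    # the category names

for name in category_names:
    print(name, counts[name])
```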

Displaying quantitative variables

Histograms

- Similar to a bar chart with the bin counts used as the heights of the bars. There is no gap
between bars unless there are actual gaps in the data
- Decide how wide to make the bins
- Determine the count of each bin
- Decide where to place values that land on the endpoint of a bin
- Can choose to create a relative frequency histogram by displaying the % of cases in each bin
instead of the count. This makes the histogram more like a density function
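
The histogram steps above (choose a bin width, count each bin, settle the endpoint rule, optionally convert to relative frequencies) can be sketched as follows. The data and the bin width of 10 are assumptions, and sending an endpoint value to the bin that starts there is only one possible convention.

```python
# Hypothetical data; bins are left-closed intervals [lo, lo + width).
values = [3, 7, 12, 15, 18, 20, 24, 31, 35, 48]
width = 10

counts = {}
for v in values:
    lo = (v // width) * width          # a value on an endpoint, e.g. 20, goes in [20, 30)
    counts[lo] = counts.get(lo, 0) + 1

n = len(values)
rel_freq = {lo: c / n for lo, c in counts.items()}   # share of cases per bin

# Text histogram: one row per bin, including empty bins (true gaps in the data).
for lo in range(min(counts), max(counts) + width, width):
    c = counts.get(lo, 0)
    bar = "#" * c
    print(f"[{lo}, {lo + width})", bar, f"count={c}", f"rel={c / n:.0%}")
```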

When describing a distribution, attention should be paid to

- The shape (including any outliers)
- The center
- The spread

Shape

- Mode: (1) peaks or humps seen in a histogram are called the modes of the distribution (2) a
distribution whose histogram has one main peak is called unimodal, two peaks- bimodal, 3 or
more- multimodal
- Symmetry: (1) a distribution is symmetric if the halves on either side of the center look, at least
approximately, like mirror images (2) the thinner ends of a distribution are called the tails. If one
tail stretches out further than the other, the distribution is called skewed to the side of the
longer tail.
- Outliers: (1) values that stand off away from the body of the distribution (2) can affect every
statistical method (3) can be the most informative part of the data (4) may be an error in the
data (5) should be discussed in any conclusions drawn about the data
- Characterising the shape of a distribution is often a judgement call
- An honest desire to understand what is happening in the data should guide decisions

Center

- To find the mean, add up all the values and divide by the number of values
- If a distribution is skewed, contains gaps, or contains outliers, then it might be better to use the
median- the value that splits the histogram into 2 equal areas
- The median is said to be resistant because it isn’t affected by unusual observations or by the
shape of the distribution; therefore the median is a better choice for skewed data
- If a distribution is roughly symmetric, the mean and median are expected to be close
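
The resistance of the median can be seen in a small sketch with Python's `statistics` module. The salary figures are hypothetical.

```python
import statistics

# Hypothetical salaries in thousands; roughly symmetric, no outliers.
salaries = [30, 32, 35, 38, 40]
mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)

# Add one unusual observation and recompute.
with_outlier = salaries + [400]
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("without outlier:", mean_before, median_before)
print("with outlier:   ", round(mean_after, 1), median_after)
```

The single extreme value drags the mean far away from the bulk of the data, while the median barely moves.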

Spread

- The more the data vary, the less a measure of center can tell us
- Range: defined as the difference between the extremes = max – min
- Interquartile range (IQR) = Q3 – Q1
- Variance = average of the squared deviations of the values of the variable y from the mean
- Taking the square root of the variance gives us the standard deviation (SD)
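
A minimal sketch of these measures of spread on hypothetical data. The formulas are missing from this copy of the notes, so the divisor is an assumption: "average" could mean dividing by n (`pvariance`) or by n - 1 (`variance`, the usual sample version), and both are shown. Quartile conventions also differ between textbooks and software; the `inclusive` method of `statistics.quantiles` is used here as one choice.

```python
import math
import statistics

y = [2, 4, 4, 5, 7, 8]    # hypothetical values

data_range = max(y) - min(y)                       # max - min

q1, q2, q3 = statistics.quantiles(y, n=4, method="inclusive")
iqr = q3 - q1                                      # Q3 - Q1

var_n = statistics.pvariance(y)                    # squared deviations divided by n
var_n_minus_1 = statistics.variance(y)             # squared deviations divided by n - 1
sd = math.sqrt(var_n_minus_1)                      # SD is the square root of the variance

print(data_range, iqr, var_n, var_n_minus_1, round(sd, 2))
```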




Summary

- If the shape is skewed, the median and IQR should be reported
- If the shape is unimodal and symmetric, the mean and SD and possibly the median and IQR
should be reported
- If there are multiple modes, try to determine if the data can be split into separate groups
- If there are unusual observations, point them out and report the mean and SD with and without
the values
- Always pair the median with the IQR and the mean with the SD

5 number summary and boxplots

- The five number summary reports the median, the quartiles and the extremes

- Once we have the five number summary, we can display the information in a boxplot
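
As a rough sketch, the five number summary can be computed with the standard library. The data are hypothetical, and the `inclusive` quartile method is an assumption; other quartile conventions can give slightly different values.

```python
import statistics

data = [1, 3, 4, 6, 7, 9, 10, 12, 40]    # hypothetical values

q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number = (min(data), q1, median, q3, max(data))

print(five_number)
```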

Boxplot

- The central box shows the middle half of the data, between the quartiles- the height/length of
the box equals the IQR
- If the median is roughly centred between the quartiles, then the middle half of the data is
roughly symmetric. If it is not centred, the distribution is skewed
- The whiskers show skewness as well if they are not roughly the same length
- The outliers are displayed individually to keep them out of the way in judging skewness and to
display them for special attention

How to draw a boxplot

- Locate the median and quartiles on an axis and draw 3 short lines.
- Connect the quartile lines to form a box
- Erect ‘fences’ around the main part of the data, placing the upper fence 1.5 IQRs above the
upper quartile and lower fence 1.5 IQRs below the lower quartile
- Draw whiskers from each end of the box up and down to the most extreme data values found
within the fences
- Add any outliers by displaying data values that lie beyond the fences with special symbols
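
The drawing steps above reduce to a few numbers: the quartiles, the two fences, the whisker ends and the outliers. A sketch on hypothetical data, again assuming the `inclusive` quartile convention:

```python
import statistics

data = [1, 3, 4, 6, 7, 9, 10, 12, 40]    # hypothetical values

q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences sit 1.5 IQRs beyond the quartiles; they are not drawn, only used.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers reach the most extreme data values that lie within the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Values beyond the fences are displayed individually as outliers.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("box:", q1, median, q3)
print("fences:", lower_fence, upper_fence)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```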

7 number summary and boxplot

- The box is same as before
- The whiskers end at 5th and 95th percentile
- Beyond those, only the min and max are plotted as points
- The average is included with an x
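
A sketch of the seven numbers behind this variant: the minimum, 5th percentile, Q1, median, Q3, 95th percentile and maximum, plus the mean that is marked with an x. The data are hypothetical and the percentile convention (`inclusive`) is an assumption.

```python
import statistics

# Hypothetical data, 21 values.
data = [2, 5, 7, 8, 9, 10, 11, 12, 12, 13, 14,
        15, 15, 16, 17, 18, 19, 21, 24, 30, 55]

q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

# n=20 gives cut points at 5%, 10%, ..., 95%; keep the first and the last.
twentieths = statistics.quantiles(data, n=20, method="inclusive")
p5, p95 = twentieths[0], twentieths[-1]

seven_number = (min(data), p5, q1, median, q3, p95, max(data))
average = statistics.mean(data)     # shown on the plot with an x

print(seven_number)
print(round(average, 2))
```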

Population and parameters

- Models use mathematics to represent reality
- A population includes all of the entities of interest, whether they be people, households,
machines or whatever
- Key numbers in the population are called parameters
- A sample is a subset of the population, often randomly chosen

- Any summary formed from the data is a statistic
- Sometimes, especially when we match statistics with the parameters they estimate, we use the term
sample statistic. Some statistics are used as estimates for the population parameters
- A sample that estimates the corresponding parameters accurately is said to be representative
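
The difference between a parameter and a sample statistic can be illustrated with a simulated population. Everything below is hypothetical: the population is generated at random, and in practice the parameter would be unknown.

```python
import random
import statistics

random.seed(1)    # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 values.
population = [random.gauss(50, 10) for _ in range(10_000)]
mu = statistics.mean(population)          # parameter: the population mean

# A randomly chosen subset of the population.
sample = random.sample(population, 100)
y_bar = statistics.mean(sample)           # sample statistic: estimates mu

print(round(mu, 2), round(y_bar, 2))
```

The sample mean will usually be close to, but not equal to, the population mean.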

A data set is usually a rectangular array of data, with variables in columns and observations in rows

A variable is a characteristic of members of a population, such as height, gender or salary

An observation is a list of all variable values for a single member of a population
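
One simple way to hold such a rectangular data set in plain Python is a list of dictionaries. The values and column names are invented for illustration.

```python
# Hypothetical data set: each dict is one observation (row),
# each key is a variable (column).
data_set = [
    {"height_cm": 172, "gender": "F", "salary": 4200},
    {"height_cm": 181, "gender": "M", "salary": 3900},
    {"height_cm": 165, "gender": "F", "salary": 5100},
]

variables = list(data_set[0])                         # the column names
salary_column = [row["salary"] for row in data_set]   # one variable, all observations
first_observation = data_set[0]                       # all variables, one observation

print(variables)
print(salary_column)
print(first_observation)
```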

Common sampling designs

Simple random sample (SRS)

- A sample drawn so that every possible sample of the same size has an equal chance of being selected
- With this method, each combination of individuals has an equal chance of being selected as well
- A sampling frame is a list of individuals from which the sample can be drawn
- Once we have the sampling frame, we can assign a sequential number to each individual in the
sampling frame and draw random numbers to identify those to be sampled
- Sample-to-sample differences in the values for the variables we measured are called sampling
variability
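
The numbering procedure above can be sketched with `random.sample`, which draws without replacement. The sampling frame of 200 customers and the sample size of 10 are assumptions.

```python
import random

rng = random.Random(42)    # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: a list of the individuals who can be sampled.
frame = [f"customer_{i:03d}" for i in range(1, 201)]

# Number the frame 1..N and draw random numbers without replacement.
numbers = rng.sample(range(1, len(frame) + 1), k=10)
srs = [frame[i - 1] for i in numbers]

print(srs)
```

Drawing again with a different seed gives a different sample, which is the source of sampling variability.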

Stratified sampling (basically keeping each group's share of the sample equal to its share of the population)

- Slice the population into homogeneous groups, called strata, use simple random
sampling within each stratum, and combine the results at the end
- Reduced sampling variability (since a rare stratum is never left out) is the most important benefit
of stratifying
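
A minimal sketch of stratified sampling with proportional allocation, which is the variant the heading describes. The population, the two strata and the sample size are hypothetical.

```python
import random

rng = random.Random(0)    # fixed seed so the sketch is reproducible

# Hypothetical population: 80 retail and 20 corporate customers.
population = ([("retail", i) for i in range(80)]
              + [("corporate", i) for i in range(20)])
sample_size = 10

# Slice the population into strata.
strata = {}
for member in population:
    strata.setdefault(member[0], []).append(member)

# Simple random sample within each stratum, sized by the stratum's share.
# (With awkward proportions the rounded sizes may not add up exactly.)
stratified = []
for name, members in strata.items():
    k = round(sample_size * len(members) / len(population))
    stratified.extend(rng.sample(members, k))

print(stratified)
```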

Cluster and multistage sampling

- Splitting the population into parts or clusters that each represent the population, and performing a
census (complete count) within one or a few clusters chosen at random, is called cluster sampling
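
Cluster sampling can be sketched the same way. The five branches and their customers are hypothetical; two clusters are chosen at random and everyone in them is included.

```python
import random

rng = random.Random(7)    # fixed seed so the sketch is reproducible

# Hypothetical clusters: 5 store branches with 4 customers each.
clusters = {
    f"branch_{b}": [f"b{b}_cust{c}" for c in range(4)]
    for b in range(1, 6)
}

# Choose a few clusters at random.
chosen = rng.sample(sorted(clusters), k=2)

# Census within the chosen clusters: every member is included.
cluster_sample = [m for name in chosen for m in clusters[name]]

print(chosen)
print(cluster_sample)
```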
