Class notes

Programming for Business Analytics Notes

Uploaded on: 21-03-2022

Written in: 2021/2022

Programming for Business Analytics Notes taken at National University of Singapore

Week 1: Summarising, plotting and gathering data

Variable types

- Categorical / qualitative variable: when a variable consists of categories (names or labels) and
answers questions about how cases fall into those categories
- Numerical / quantitative variable: when a variable has measured numerical values with units and
the variable tells us about the quantity of what is measured

Categorical variables

- Arise from descriptive responses to questions like ‘what kind of advertising do you use?’
- May only have 2 possible values (like yes or no)
- May be a number like a zip code
- Is also called nominal, since the labels for each category are merely distinguishing names
- A dummy variable is a 0-1 coded variable for a specific category. It is coded 1 for all observations
in that category and 0 for all observations that are not in that category
- Categorising a numerical variable as categorical is called binning (putting data into discrete bins)
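
A minimal sketch of dummy coding and binning in plain Python. The channel names, the ages and the three bin boundaries are made-up assumptions for illustration; they do not come from the notes.

```python
# Hypothetical advertising channels for six firms (illustrative data only).
channels = ["online", "print", "online", "tv", "print", "online"]

# Dummy variable for the category "online":
# 1 for observations in that category, 0 for all others.
online_dummy = [1 if c == "online" else 0 for c in channels]

# Binning: turn a numerical variable (age in years) into categories.
ages = [19, 34, 52, 27, 68, 45]

def age_bin(age):
    """Assign an age to one of three assumed bins."""
    if age < 30:
        return "under 30"
    if age < 60:
        return "30 to 59"
    return "60 and over"

age_groups = [age_bin(a) for a in ages]

print(online_dummy)
print(age_groups)
```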

Numerical variables

- A numerical variable has units
- Is discrete if it results from a count, such as the number of children.
- A continuous variable is the result of an essentially continuous measurement, such as weight or
height
- Cross-sectional data are data on a cross section of a population at a distinct point in time. Time
series data are data collected over time

A categorical variable has 3 aspects

- The number of categories
- The category names
- The number of observations, or counts, in each category
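
The three aspects above can be read off a frequency table. This is a small sketch using `collections.Counter` from the Python standard library; the survey responses are invented for illustration.

```python
from collections import Counter

# Hypothetical survey responses (illustrative data only).
responses = ["yes", "no", "yes", "yes", "no", "unsure"]

counts = Counter(responses)       # count of observations in each category

n_categories = len(counts)        # the number of categories
category_names = sorted(counts)   # the category names
for name in category_names:
    print(name, counts[name])     # the count in each category
```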

Displaying quantitative variables

Histograms

- Similar to a bar chart with the bin counts used as the heights of the bars. There is no gap
between bars unless there are actual gaps in the data
- Decide how wide to make the bins
- Determine the count of each bin
- Decide where to place values that land on the endpoint of a bin
- Can choose to create a relative frequency histogram by displaying the % of cases in each bin
instead of the count. This makes the histogram more like a density function
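
The histogram decisions above (bin width, bin counts, endpoint rule, relative frequencies) can be sketched as follows. The data, the bin width of 10 and the rule that a value on an endpoint goes into the higher bin are all assumptions chosen for illustration; other endpoint rules are equally valid as long as they are applied consistently.

```python
def histogram_counts(values, start, width, n_bins):
    """Count values in bins [start, start + width), [start + width, ...), etc.

    A value that lands exactly on an endpoint is placed in the higher bin.
    """
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

# Illustrative data; 10, 20 and 30 sit exactly on bin endpoints.
values = [3, 7, 10, 12, 15, 18, 20, 21, 29, 30]

counts = histogram_counts(values, start=0, width=10, n_bins=4)

# Relative frequency histogram: share of cases in each bin instead of the count.
relative = [c / len(values) for c in counts]

print(counts)
print(relative)
```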

When describing a distribution, attention should be paid to

- The shape (including any outliers)
- The center
- The spread

Shape

- Mode: (1) peaks or humps seen in a histogram are called the modes of the distribution (2) a
distribution whose histogram has one main peak is called unimodal, two peaks is bimodal, 3 or
more is multimodal
- Symmetry: (1) a distribution is symmetric if the halves on either side of the center look, at least
approximately, like mirror images (2) the thinner ends of a distribution are called the tails. If one
tail stretches out further than the other, the distribution is called skewed to the side of the
longer tail.
- Outliers: (1) values that stand off away from the body of the distribution (2) can affect every
statistical method (3) can be the most informative part of the data (4) may be an error in the
data (5) should be discussed in any conclusions drawn about the data
- Characterising the shape of a distribution is often a judgement call
- An honest desire to understand what is happening in the data should guide decisions

Center

- To find the mean, add up all the values and divide by the number of values
- If a distribution is skewed, contains gaps, or contains outliers, then it might be better to use the
median: the value that splits the histogram into 2 equal areas
- The median is said to be resistant because it isn’t affected by unusual observations or by the
shape of the distribution, so the median is a better choice for skewed data
- If a distribution is roughly symmetric, the mean and median are expected to be close
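
A short sketch of why the median is called resistant, using the standard library `statistics` module. The salary figures (in thousands) are invented; the single value of 400 plays the role of an unusual observation.

```python
import statistics

# Hypothetical salaries in thousands (illustrative data only).
salaries = [30, 32, 35, 38, 40]
with_outlier = salaries + [400]

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)

mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

# The mean is pulled strongly toward the outlier; the median barely moves.
print(mean_before, median_before)
print(mean_after, median_after)
```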

Spread

- The more the data vary, the less a measure of center can tell us
- Range: defined as the difference between the extremes = max – min
- Interquartile range (IQR) = Q3 – Q1
- Variance = average of the squared deviations of the values of the variable y from the mean
- Taking the square root of the variance gives us the standard deviation (SD)
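
The measures of spread above can be computed with the standard library, as a rough sketch. Two caveats are assumptions on my part rather than statements from the notes: quartile conventions differ between textbooks and software (this uses the default method of `statistics.quantiles`), and "average" of squared deviations can mean dividing by n (population variance) or by n - 1 (sample variance), so both are shown.

```python
import statistics

# Illustrative data only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)            # range = max - min

q1, _, q3 = statistics.quantiles(data, n=4)   # quartiles, default convention
iqr = q3 - q1                                 # IQR = Q3 - Q1

pop_variance = statistics.pvariance(data)     # squared deviations divided by n
sample_variance = statistics.variance(data)   # squared deviations divided by n - 1

pop_sd = statistics.pstdev(data)              # square root of pop_variance
sample_sd = statistics.stdev(data)            # square root of sample_variance

print(data_range, iqr, pop_variance, pop_sd)
```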

Summary

- If the shape is skewed, the median and IQR should be reported
- If the shape is unimodal and symmetric, the mean and SD and possibly the median and IQR
should be reported
- If there are multiple modes, try to determine if the data can be split into separate groups
- If there are unusual observations, point them out and report the mean and SD with and without
the values
- Always pair the median with the IQR and the mean with the SD

5 number summary and boxplots

- The five number summary reports the median, the quartiles and the extremes

- Once we have the five number summary, we can display the information in a boxplot
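
A minimal sketch of a five number summary. The helper name `five_number_summary` and the data are hypothetical, and the quartiles follow the default convention of `statistics.quantiles`, which may differ slightly from the method used in class.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return (ordered[0], q1, median, q3, ordered[-1])

# Illustrative data only.
data = [1, 3, 4, 6, 7, 8, 10, 13, 15]
summary = five_number_summary(data)
print(summary)
```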

Boxplot

- The central box shows the middle half of the data, between the quartiles. The height/length of
the box equals the IQR
- If the median is roughly centred between the quartiles, then the middle half of the data is
roughly symmetric. If it is not centred, the distribution is skewed
- The whiskers show skewness as well if they are not roughly the same length
- The outliers are displayed individually to keep them out of the way in judging skewness and to
display them for special attention

How to draw a boxplot

- Locate the median and quartiles on an axis and draw 3 short lines.
- Connect the quartile lines to form a box
- Erect ‘fences’ around the main part of the data, placing the upper fence 1.5 IQRs above the
upper quartile and lower fence 1.5 IQRs below the lower quartile
- Draw whiskers from each end of the box up and down to the most extreme data values found
within the fences
- Add any outliers by displaying data values that lie beyond the fences with special symbols
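
The drawing steps above reduce to a few calculations, sketched here without any plotting. The data are invented, and the quartiles again use the default convention of `statistics.quantiles`, so fence positions could shift slightly under another quartile rule.

```python
import statistics

# Illustrative data with one unusually large value.
data = sorted([1, 3, 4, 6, 7, 8, 10, 13, 15, 40])

q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Fences sit 1.5 IQRs beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers reach the most extreme data values that are still within the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Values beyond the fences are displayed individually as outliers.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence)
print(whisker_low, whisker_high, outliers)
```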

7 number summary and boxplot

- The box is same as before
- The whiskers end at 5th and 95th percentile
- Beyond those, only the min and max are plotted as points
- The average is included with an x
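
One way to compute the numbers behind this variant, as a sketch: the 5th and 95th percentiles are taken as the first and last of the 19 cut points returned by `statistics.quantiles(data, n=20)`. The data (the integers 1 to 99) and that percentile convention are assumptions for illustration.

```python
import statistics

# Illustrative data: the integers 1 to 99.
data = list(range(1, 100))

q1, median, q3 = statistics.quantiles(data, n=4)   # the box, same as before
twentieths = statistics.quantiles(data, n=20)      # cut points at 5%, 10%, ..., 95%
p5, p95 = twentieths[0], twentieths[-1]            # where the whiskers end

seven_numbers = (min(data), p5, q1, median, q3, p95, max(data))
average = statistics.mean(data)                    # marked with an x on the plot

print(seven_numbers, average)
```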

Population and parameters

- Models use mathematics to represent reality
- A population includes all of the entities of interest, whether they be people, households,
machines or whatever
- Key numbers in the population are called parameters
- A sample is a subset of the population, often randomly chosen

- Any summary formed from the data is a statistic
- Sometimes, especially when we match statistics with the parameters they estimate, we use the term
sample statistic. Some statistics are used as estimates for the population parameters
- A sample that estimates the corresponding parameters accurately is said to be representative

A data set is usually a rectangular array of data, with variables in columns and observations in rows

A variable is a characteristic of members of a population, such as height, gender or salary

An observation is a list of all variable values for a single member of a population

Common sampling designs

Simple random sample (SRS)

- A sample drawn so that every possible sample of the same size has an equal chance of being selected
- With this method, each combination of individuals has an equal chance of being selected as well
- A sampling frame is a list of individuals from which the sample can be drawn
- Once we have the sampling frame, we can assign a sequential number to each individual in the
sampling frame and draw random numbers to identify those to be sampled
- Sample-to-sample differences in the values for the variables we measured are called sampling
variability
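
A small sketch of drawing a simple random sample from a numbered sampling frame and of sampling variability. The frame of 100 ids, the made-up income values and the seed are all assumptions; `random.Random.sample` draws without replacement.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Sampling frame: sequential numbers 1..100 assigned to individuals.
sampling_frame = list(range(1, 101))

# Hypothetical measured value for each individual (illustrative only).
incomes = {i: 20 + (i * 37) % 60 for i in sampling_frame}

# Draw random numbers to identify the 10 individuals to be sampled.
sample_ids = rng.sample(sampling_frame, 10)

def srs_mean(frame, values, n, rng):
    """Mean of the measured values in one simple random sample of size n."""
    chosen = rng.sample(frame, n)
    return statistics.mean(values[i] for i in chosen)

# Repeating the draw gives different sample means: sampling variability.
means = [srs_mean(sampling_frame, incomes, 10, rng) for _ in range(5)]
print(sample_ids)
print(means)
```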

Stratified sampling (basically keeping each group's % of the sample in line with its % of the population)

- Slice the population into homogeneous groups, called strata, use simple random
sampling within each stratum, and combine the results at the end
- Reduced sampling variability (since a rare stratum is never left out) is the most important benefit
of stratifying
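
A sketch of stratified sampling with proportional allocation. The two strata, their sizes and the overall sample size of 10 are hypothetical. With other sizes, rounding each stratum's share may make the pieces add up to slightly more or less than the target, which this sketch does not handle.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population split into two homogeneous strata.
population = {
    "undergraduate": [f"U{i}" for i in range(80)],
    "postgraduate": [f"P{i}" for i in range(20)],
}
sample_size = 10
total = sum(len(members) for members in population.values())

stratified_sample = {}
for name, members in population.items():
    # Each stratum's share of the sample matches its share of the population.
    n_stratum = round(sample_size * len(members) / total)
    # Simple random sampling within the stratum.
    stratified_sample[name] = rng.sample(members, n_stratum)

# Combine the results at the end.
combined = [m for members in stratified_sample.values() for m in members]
print(combined)
```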

Cluster and multistage sampling

- Splitting the population into parts or clusters that each represent the population, and performing a
census (complete count) within one or a few clusters chosen at random, is called cluster sampling
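
For contrast with stratified sampling, a sketch of cluster sampling: pick a few whole clusters at random and take every member of them. The branch names, their members and the choice of 2 clusters are invented for illustration.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical clusters, each assumed to resemble the whole population.
clusters = {
    "branch A": ["A1", "A2", "A3"],
    "branch B": ["B1", "B2"],
    "branch C": ["C1", "C2", "C3", "C4"],
    "branch D": ["D1", "D2", "D3"],
}

# Choose a few clusters at random.
chosen_clusters = rng.sample(sorted(clusters), 2)

# Census (complete count) within each chosen cluster.
cluster_sample = [member for name in chosen_clusters for member in clusters[name]]
print(chosen_clusters, cluster_sample)
```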

Written for

Institution: National University Of Singapore
Course: DAO2702 Programming for Business Analytics

Document information

Uploaded on: March 21, 2022
Number of pages: 69
Written in: 2021/2022
Type: Class notes
Professor(s): -
Contains: All classes
