Exam (worked solutions)

SYE-6501 Week 1 Introduction To Analytics Modeling - GTX

Pages: 40
Grade: A+
Uploaded on: 07-03-2022
Written in: 2022/2023

# Week 1 Introduction To Analytics Modeling - GTX
ISYE 6501

- Introduction to Analytics Modeling

answer important types of questions:
what happened? = descriptive
what is going to happen? = predictive
what actions are best? = prescriptive
how do we create value with data?

when can analytics answer these questions?

Modeling: taking a real life situation and expressing it in math
analyze in math and turn it into a solution

best ways to learn: ask questions, discuss answers

Course Structure:
- knowledge building
- experience building based on knowledge built in part 1

Knowledge Building:
Models - learn all the models
Cross Cutting - data prep, output quality, missing data
will include mathematical intuition but keep it agile
all developed with situations and examples
basic mathematical detail

Experience Based:
case studies
practice using models
practice using models in commonly used analytics software
make sure you learn key basic concepts
link material with real analytic questions
develop learning beyond the videos
learn to use software without being told exactly what to do

Summary: knowledge building and then experience

What is Modeling?

- real life situation described in math
- analyze the math
- turn math analysis back to real-life solution

the mathematical description of the problem is the model
all the detail involved in modeling is 'the model'

Introduction to Classification
classification = putting things into categories

put into groups of 'yes' and 'no'

many analytic questions need to bin answers into a group based on past examples

we can use classification models to sort these items into these groups

we can also have multiple classification groups - not just 'yes' or 'no'

we need data to get these answers!
we can infer and model from the data to classify a new point into the correct
group!

credit score and income example: scatterplot
if repaid - green
if defaulted - red
each point could have an entire set of features associated with it
we can draw a decision line between the points and sort them based on our
decision line

there are many lines! how do we know the 'right' line?
they could all separate the groups equally well!
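
A minimal sketch of the "many lines" point, assuming a tiny made-up data set and two hand-picked lines (none of these numbers come from the lecture). Both lines separate the two groups with zero mistakes, so counting mistakes alone cannot tell us which line is 'right'.

```python
# Hypothetical toy data: (credit score, income in $1000s, repaid?)
applicants = [
    (720, 85, True),
    (690, 60, True),
    (640, 72, True),
    (580, 40, False),
    (610, 30, False),
    (550, 55, False),
]

def on_repaid_side(credit_score, income, a1, a2, a0):
    """True if the point lies on the positive side of a1*score + a2*income + a0 = 0."""
    return a1 * credit_score + a2 * income + a0 > 0

# Two different candidate decision lines, chosen by hand for illustration.
candidate_lines = {
    "line A": (1.0, 2.0, -740.0),  # uses both credit score and income
    "line B": (1.0, 0.0, -625.0),  # uses credit score only
}

mistakes_by_line = {}
for name, (a1, a2, a0) in candidate_lines.items():
    mistakes_by_line[name] = sum(
        on_repaid_side(score, income, a1, a2, a0) != repaid
        for score, income, repaid in applicants
    )
    print(name, "misclassifies", mistakes_by_line[name], "of", len(applicants), "points")
```
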

Choosing a Classifier

what are the trade-offs in building classification models?

we want to put things into categories!

should we give someone a loan?

we draw a line to sort groups into classification groups...
what is the right line to draw?

which one should we choose? the line that is furthest from making mistakes!
we might not have all the data - we want to find the line that is not close to
making misclassifications

what if it is impossible to avoid making classification mistakes...
i.e. no line to separate between points?

we need a 'soft' classifier rather than a 'hard' classifier
we need as good a separation as possible - minimize the number of
misclassified points

we want to trade off between actual mistakes and 'near' mistakes
not all mistakes are equal!

the best separator - the more costly one type of mistake is, the further we
shift our line away from the group that causes it!

we can set a stricter classifier in order to limit the cost of classification errors
we can use the same idea for 'soft' classification also!
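
A rough sketch of "not all mistakes are equal", simplified to one variable (a credit score cutoff instead of a line). The scores and the cost values are invented for illustration. When a bad loan is assumed to cost five times as much as a missed good loan, the cheapest cutoff moves higher, away from the defaulters.

```python
# Hypothetical 1-D example: approve everyone whose score is at or above a cutoff.
scores_repaid = [640, 660, 700, 720, 750]
scores_default = [560, 600, 620, 650, 670]

def total_cost(cutoff, cost_bad_loan, cost_missed_loan):
    """Total cost of the two kinds of mistakes for a given cutoff."""
    bad_loans = sum(s >= cutoff for s in scores_default)   # defaulters we approved
    missed_loans = sum(s < cutoff for s in scores_repaid)  # good payers we rejected
    return bad_loans * cost_bad_loan + missed_loans * cost_missed_loan

def best_cutoff(cost_bad_loan, cost_missed_loan):
    """Cheapest cutoff on a coarse grid (first one wins on ties)."""
    candidates = range(550, 761, 10)
    return min(candidates, key=lambda c: total_cost(c, cost_bad_loan, cost_missed_loan))

cutoff_equal_costs = best_cutoff(1, 1)
cutoff_costly_defaults = best_cutoff(5, 1)
print("equal costs:", cutoff_equal_costs)
print("bad loans 5x as costly:", cutoff_costly_defaults)
```
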

we can tell from our decision line which variable is important to the
classifier based on the scatterplot between the two variables

horizontal line = the classifier only takes the vertical axis into account
vertical line = the classifier only takes the horizontal axis into account
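
A small sketch of the same idea in terms of coefficients, assuming x1 is plotted on the horizontal axis and x2 on the vertical axis, and that the line is written as a1*x1 + a2*x2 + a0 = 0 (the numbers are made up). A horizontal line has a1 = 0, so changing x1 never changes the label.

```python
def classify(x1, x2, a1, a2, a0):
    """Return +1 or -1 depending on which side of the line the point falls."""
    return 1 if a1 * x1 + a2 * x2 + a0 >= 0 else -1

# Horizontal line x2 = 50: the coefficient on x1 is zero.
horizontal = (0.0, 1.0, -50.0)
# Vertical line x1 = 600: the coefficient on x2 is zero.
vertical = (1.0, 0.0, -600.0)

# With the horizontal line, very different x1 values give the same label.
labels_horizontal = {classify(x1, 60, *horizontal) for x1 in (0, 300, 850)}
# With the vertical line, very different x2 values give the same label.
labels_vertical = {classify(650, x2, *vertical) for x2 in (0, 40, 200)}
print(labels_horizontal, labels_vertical)
```
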

Data Definitions

what data comes up in analytics?
what terminology do we use in different types of data?

important to understand the analytic vernacular

Data Tables:
rows are data points
columns are variables - information about each data point - features, predictors
response - the outcome we want to predict for each data point - this is a column

Data Types:
Structured - described and stored in a structured way
Unstructured - not easily described and stored in a structured way - ex. written text

Structured Data:
quantitative - numbers with meaning
categorical - numbers without meaning - categories of data - numbers denote groupings
binary - takes on 1 or 0, takes on two values only!
unrelated data - no relationships between data points
related data - data points linked together - ex. time series data, recorded at regular intervals
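
A minimal sketch of these definitions on a made-up loan table (column names and values are hypothetical). Each dict is a row (data point), each key is a column (variable), and `repaid` plays the role of the response. Averaging makes sense for a quantitative column, while a categorical column is counted per category.

```python
from collections import Counter

# Hypothetical data table: one dict per row (data point).
rows = [
    {"credit_score": 720, "income": 85000, "region_code": 3, "owns_home": 1, "repaid": 1},
    {"credit_score": 580, "income": 40000, "region_code": 1, "owns_home": 0, "repaid": 0},
    {"credit_score": 690, "income": 60000, "region_code": 3, "owns_home": 1, "repaid": 1},
    {"credit_score": 610, "income": 30000, "region_code": 2, "owns_home": 0, "repaid": 0},
]

RESPONSE = "repaid"
predictors = [name for name in rows[0] if name != RESPONSE]  # feature columns
response = [row[RESPONSE] for row in rows]                   # outcome column

# quantitative: the numbers have meaning, so an average is meaningful
mean_income = sum(row["income"] for row in rows) / len(rows)

# categorical: the numbers only denote groups, so count per group instead
region_counts = Counter(row["region_code"] for row in rows)

print(predictors, response, mean_income, dict(region_counts))
```
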

Support Vector Machines

basic mathematical model for classification models
we want to put things into categories
should we give loans to people based on who they are?

think of the scatterplot - green is repaid, red is default

different lines can be better - be far away from mistakes and further away from
more costly mistakes

Support Vector Machines
n = number of data points
m = number of attributes
xij = ith attribute of the jth data point
x1j = credit score of person j (i indexes the attribute, j indexes the row)
x2j = income of person j
yj = the response for data point j
yj = 1 if data point j is green (repaid)
yj = -1 if data point j is red (default) - SVM uses +1/-1 labels rather than 1/0

a line through our classification space (scatterplot) is defined by a set of coefficients:

a1*x1 + a2*x2 + ... + am*xm + a0 = 0
where a1 through am are the coefficients (one per attribute or feature) and a0 is the intercept

we can also write this as:
Σ(ai*xi) + a0 = 0, summing over i = 1 to m
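
A short sketch of how this equation is used, with hypothetical coefficients for credit score and income. The left-hand side is computed for a point, and its sign tells us which side of the line the point is on (zero means the point lies exactly on the line).

```python
def decision_value(x, a, a0):
    """Compute a1*x1 + a2*x2 + ... + am*xm + a0 for one data point x."""
    return sum(ai * xi for ai, xi in zip(a, x)) + a0

def classify(x, a, a0):
    """+1 on the positive side of the line (or on it), -1 on the negative side."""
    return 1 if decision_value(x, a, a0) >= 0 else -1

# Hypothetical coefficients: a1 for credit score, a2 for income in $1000s.
a = (1.0, 2.0)
a0 = -740.0

print(decision_value((720, 85), a, a0), classify((720, 85), a, a0))
print(decision_value((580, 40), a, a0), classify((580, 40), a, a0))
print(decision_value((700, 20), a, a0))  # this point sits on the line itself
```
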


we can draw two parallel lines through our classification space:
parallel lines have the same coefficients a1...am but different intercepts!

we want to draw two parallel lines that separate our red and green points...
such that the line with intercept a0 lies exactly in the middle of the two lines (splitting the two groups)
this will be our classifier - the line that evenly splits the two groups we want to classify

we want to find values of a0, a1...am that classify the points correctly and have the maximum
margin between the two groups - we need the maximum gap between the parallel lines

we push the two parallel lines apart until each one touches the nearest points of its group
the line on the green side is a1*x1 + ... + am*xm + a0 = 1
the line on the red side is a1*x1 + ... + am*xm + a0 = -1
we will use the midpoint of these two lines (a1*x1 + ... + am*xm + a0 = 0) as our classifier
the support vector machine aims to find the lines with the largest distance from the
classifier (midpoint) to the margin (the individual lines bounding the green and red points)

Distance between the two parallel lines:
= 2 / √(Σ(ai)^2)
this is 2 divided by the square root of the sum of the squared coefficients a1...am
maximizing this distance is equivalent to minimizing
Σ(ai)^2 (sum of coefficients squared, for all coefficients a1...am)
if we can minimize this sum - we can maximize the margin between the two groups of data!
this is our objective function - we aim to find lines that minimize this sum and so
maximize the margin!
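
A quick numeric check of the distance formula, using made-up coefficients. Shrinking the coefficients makes the sum of squares smaller and the gap between the two parallel lines wider, which is why minimizing the sum maximizes the margin.

```python
import math

def sum_of_squares(a):
    """Σ(ai)^2 over the coefficients a1...am (the intercept a0 is not included)."""
    return sum(ai ** 2 for ai in a)

def margin(a):
    """Distance between the two parallel lines: 2 / sqrt(Σ(ai)^2)."""
    return 2.0 / math.sqrt(sum_of_squares(a))

large_coefficients = (3.0, 4.0)   # sum of squares = 25
small_coefficients = (0.3, 0.4)   # sum of squares = 0.25

print(sum_of_squares(large_coefficients), margin(large_coefficients))
print(sum_of_squares(small_coefficients), margin(small_coefficients))
```
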

Hard separation problem: minimize the sum to maximize the margin
minimize over all a's the sum of the squares of the a's (a1...am)
subject to (a1*x1j + a2*x2j + ... + am*xmj + a0) * yj >= 1 for every data point j
(with yj = 1 for green and yj = -1 for red)
we minimize the sum of squares of the a's, but only among lines that classify every point correctly!
the constraint keeps every point on or outside the margin line of its own group
we want to find two separation lines that accurately classify all points and
have the largest distance between the two lines!!
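
A sketch of the hard separation problem on a tiny invented data set. This is not an optimizer: it only checks the constraint for three hand-picked candidate lines and keeps the feasible one with the smallest sum of squares. A real solution would come from a quadratic programming solver or an SVM library.

```python
import math

# Hypothetical 2-D points with labels yj = +1 (green) and yj = -1 (red).
green = [(3, 3), (4, 3), (3, 4)]
red = [(1, 1), (0, 1), (1, 0)]
points = [(x, 1) for x in green] + [(x, -1) for x in red]

def is_feasible(a, a0):
    """Check (a1*x1j + ... + am*xmj + a0) * yj >= 1 for every data point j."""
    return all(
        (sum(ai * xi for ai, xi in zip(a, x)) + a0) * y >= 1
        for x, y in points
    )

def objective(a):
    """Sum of squared coefficients, the quantity we want to minimize."""
    return sum(ai ** 2 for ai in a)

# Hand-picked candidates (a, a0); all describe the same direction of line.
candidates = [
    ((1.0, 1.0), -4.0),
    ((0.5, 0.5), -2.0),
    ((0.25, 0.25), -1.0),  # smallest sum of squares, but violates the constraint
]

feasible = [c for c in candidates if is_feasible(*c)]
best_a, best_a0 = min(feasible, key=lambda c: objective(c[0]))
best_margin = 2.0 / math.sqrt(objective(best_a))
print(best_a, best_a0, best_margin)
```

The third candidate shows why the constraint matters: without it, shrinking the coefficients toward zero would always look better.
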


what if there is no way to separate between the two groups?
we need a 'soft' classifier!
this means we accept some classification errors while trading off against the most
costly errors
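
The notes stop before giving a formula for the soft classifier. As an assumption, the sketch below uses the standard soft-margin form: total error Σ max(0, 1 - (Σ(ai*xij) + a0) * yj) plus lam * Σ(ai)^2, where `lam` is a hypothetical name for the weight that trades margin against errors. The data are invented, with one red point placed among the green ones so no line can separate the groups.

```python
def hinge_error(x, y, a, a0):
    """Zero if the point is on or outside its own margin line, positive otherwise."""
    value = sum(ai * xi for ai, xi in zip(a, x)) + a0
    return max(0.0, 1.0 - value * y)

def soft_objective(points, a, a0, lam):
    """Total error plus lam times the sum of squared coefficients."""
    total_error = sum(hinge_error(x, y, a, a0) for x, y in points)
    return total_error + lam * sum(ai ** 2 for ai in a)

# Hypothetical points: +1 = green, -1 = red; the last red point sits inside the green group.
points = [
    ((3, 3), 1), ((4, 3), 1), ((3, 4), 1),
    ((1, 1), -1), ((0, 1), -1), ((1, 0), -1),
    ((3.5, 3.5), -1),
]

a, a0 = (0.5, 0.5), -2.0
print(soft_objective(points, a, a0, lam=1.0))   # error term and margin term weighted equally
print(soft_objective(points, a, a0, lam=0.1))   # margin term matters less
```
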
