Khan Academy AP CompSci Principles: Data analysis
Part 1:
In 2016, researchers studied the differences between two dialects of English used in tweets on Twitter.
They first categorized over eight million tweets as either African-American-aligned (AA-aligned) or White-aligned. The researchers then tried out three different language identification algorithms to see whether each tweet would be categorized as the English language, a different language, or an unknown language.
The following table shows the proportion of tweets from each dialect that were classified as non-English by each algorithm:

             AA-aligned  White-aligned
Algorithm 1  13.2%       7.6%
Algorithm 2  8.4%        5.9%
Algorithm 3  24.4%       17.6%

Which of the following statements is supported by that data?

Answers:
1. All three algorithms contain biases that cause them to misidentify an AA-aligned tweet as non-English more often than they misidentify the language of a White-aligned tweet.
2. The analytics team of a reviews website uses the language identification algorithm to display a pie chart of the number of reviews written per language in the last year.
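
One way to check a statement against the table is to compare the two rates for each algorithm. The sketch below is only an illustration: the numbers are copied from the table, while the function name `misclassification_gaps` and the dictionary layout are made up for this example.

```python
# Proportion of tweets classified as non-English, copied from the table above.
rates = {
    "Algorithm 1": {"AA-aligned": 13.2, "White-aligned": 7.6},
    "Algorithm 2": {"AA-aligned": 8.4, "White-aligned": 5.9},
    "Algorithm 3": {"AA-aligned": 24.4, "White-aligned": 17.6},
}


def misclassification_gaps(rates):
    """Return the AA-aligned rate minus the White-aligned rate, in percentage points."""
    return {
        name: round(row["AA-aligned"] - row["White-aligned"], 1)
        for name, row in rates.items()
    }


gaps = misclassification_gaps(rates)
for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f} percentage points")

# True only if every algorithm misclassifies AA-aligned tweets more often.
all_higher_for_aa = all(gap > 0 for gap in gaps.values())
print(all_higher_for_aa)
```

A positive gap for every algorithm is what supports a statement about all three algorithms. The table alone does not say why the gap exists.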
StackOverflow is a popular question & answers site. Each time a user asks a new question, they insert a row in a database table.
Each row contains:
The user ID
The user display name
The timestamp of the question
The text of the question
The spam score of the question (0-5)
Here are a few rows from the table:

user_id  display_name      timestamp            question                               spam_score
62038    TheAskerator      11/27/2012 06:15:28  How do I geocode a lat/lng?            0
20394    NewCoder123       03/12/2015 10:55:10  Where can I host my website for free?  1
36917    QuestionErrthing  05/04/2014 11:34:25  Wanna download this free file?         3

The team wants to display question statistics on an internal dashboard.
Which statistic can not be calculated from the table of questions?

Answer:
The user ID with the most number of unanswered questions
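
A quick way to reason about this kind of question is to ask which columns a statistic needs. The sketch below uses only the three sample rows. The list-of-dictionaries layout and the variable names are assumptions for illustration, and the timestamps are assumed to be month/day/year, which the sample rows suggest.

```python
from collections import Counter
from datetime import datetime

# The three sample rows shown above; the real table would hold many more.
questions = [
    {"user_id": 62038, "display_name": "TheAskerator",
     "timestamp": "11/27/2012 06:15:28",
     "question": "How do I geocode a lat/lng?", "spam_score": 0},
    {"user_id": 20394, "display_name": "NewCoder123",
     "timestamp": "03/12/2015 10:55:10",
     "question": "Where can I host my website for free?", "spam_score": 1},
    {"user_id": 36917, "display_name": "QuestionErrthing",
     "timestamp": "05/04/2014 11:34:25",
     "question": "Wanna download this free file?", "spam_score": 3},
]

# Computable: number of questions asked by each user (needs user_id only).
questions_per_user = Counter(row["user_id"] for row in questions)

# Computable: average spam score (needs spam_score only).
average_spam = sum(row["spam_score"] for row in questions) / len(questions)

# Computable: questions asked per year (needs timestamp only).
questions_per_year = Counter(
    datetime.strptime(row["timestamp"], "%m/%d/%Y %H:%M:%S").year
    for row in questions
)

# Not computable: no column records whether a question was answered,
# so "unanswered questions per user" has nothing to count.
has_answer_data = any("answer" in column for column in questions[0])
print(has_answer_data)
```

Any statistic about answers would need another column or another table, which the description of the rows does not mention.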
A team of scientists and engineers is putting together a research project to study whale sounds. In order to develop the infrastructure for the project, they need to first determine how much data storage space their observational data will require.
This is an example of a single observation:

Sound                     Location                Date/time                 Species
3 minute long audio file  63.776871, -171.742193  May 27, 2019, 2:23:13 PM  Beluga

The team hopes to capture thousands of whale sounds from all the world's oceans.
Which piece of data will increase their data storage needs the most?

Answer:
Recording of whale sound
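
A rough back-of-the-envelope estimate shows why the audio dominates. Every size below is an assumption made for illustration, not a figure from the question: a compressed audio bitrate of 128 kilobits per second, two 8-byte numbers for the location, an 8-byte timestamp, and about 16 bytes of text for the species name.

```python
# All sizes are illustrative assumptions, not values given in the question.
AUDIO_SECONDS = 3 * 60            # the example observation is 3 minutes long
ASSUMED_BITRATE = 128_000         # bits per second (hypothetical compressed audio)

sizes = {
    "Sound": AUDIO_SECONDS * ASSUMED_BITRATE // 8,  # bytes of audio
    "Location": 2 * 8,            # latitude and longitude as 8-byte numbers
    "Date/time": 8,               # one 8-byte timestamp
    "Species": 16,                # short text such as "Beluga", with headroom
}

# Field that needs the most storage per observation.
largest = max(sizes, key=sizes.get)

for field, size in sizes.items():
    print(f"{field}: {size:,} bytes")
print("Largest field:", largest)
```

Under these assumptions the audio takes millions of bytes per observation while the other three fields together take a few dozen, so changing the assumed bitrate or text length would not change which field is largest.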