https://www.stuvia.com/user/openstaxstudyhub
Principles of Data Science
Chapter 1
What Are Data and Data Science?
Chapter Review
[1.1, LO 1.1.1, 1.1.2]
1. Select the incorrect step and goal pair of the data science cycle.
a. Data collection: collect the data so that you have something for analysis.
b. Data preparation: have the collected data stored in a server as is so that you can start
the analysis.
c. Data analysis: analyze the prepared data to retrieve some meaningful insights.
d. Data reporting: present the data in an effective way so that you can highlight the
insights found from the analysis.
Solution: b. Data preparation: have the collected data stored in a server as is so that you can
start the analysis.
Collected data is rarely in good shape for analysis as is. Most of the time, it needs to be
processed before it suits the analysis of interest. One example of preparation is handling
missing data, either by removing the affected entries or by filling in the missing values.
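As a minimal sketch of the two options mentioned above (removing or filling missing values), assuming the pandas library is available. The table, its column names, and its values are made up for illustration and do not come from any dataset in the chapter.

```python
import pandas as pd

# Hypothetical toy data with one missing value in each column.
raw = pd.DataFrame({
    "bpm": [120.0, None, 95.0, 140.0],
    "danceability": [80.0, 65.0, None, 55.0],
})

# Option 1: remove every row that contains a missing value.
dropped = raw.dropna()

# Option 2: fill each missing value with the mean of its column.
filled = raw.fillna(raw.mean())

print(len(raw), len(dropped))          # row counts before and after dropping
print(int(filled.isna().sum().sum()))  # missing values left after filling
```

Which option is appropriate depends on the analysis: dropping rows loses data, while filling introduces estimated values.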
[1.2, LO 1.2.1]
3. Which of the following best exemplifies the interdisciplinary nature of data science in various
fields?
a. A historian traveling to Italy to study ancient manuscripts to uncover historical insights
about the Roman Empire
b. A mathematician solving complex equations to model physical phenomena
c. A biologist analyzing a large dataset of genetic sequences to gain insights about the
genetic basis of diseases
d. A chemist synthesizing new compounds in a laboratory
Solution: c. A biologist analyzing a large dataset of genetic sequences to gain insights about the
genetic basis of diseases
Traditionally, biologists would conduct lab experiments to answer questions in their field;
however, nowadays data science is being used to analyze large datasets to extract valuable
information that can shed light on complex topics such as the genetic basis of diseases. Option
a) is incorrect as studying primary sources does not inherently involve data science. Option b) is
incorrect as solving equations is not in the domain of data science. Option d) is incorrect as it
describes the traditional work of a chemist as a lab scientist.
11/11/24 For more free, peer-reviewed, openly licensed resources visit OpenStax.org.
Critical Thinking
[1.3, LO 1.3.4]
1. For each dataset, list the attributes.
a. Spotify dataset
b. CancerDoc dataset
Solution a: Following is the list of attributes in the Spotify dataset:
track_name, artist(s)_name, artist_count, released_year, released_month, released_day,
in_spotify_playlists, in_spotify_charts, streams, in_apple_playlists, in_apple_charts,
in_deezer_playlists, in_deezer_charts, in_shazam_charts, bpm, key, mode, danceability_%,
valence_%, energy_%, acousticness_%, instrumentalness_%, liveness_%, speechiness_%
Solution b: The CancerDoc dataset has three attributes; however, none of these attributes have
a clear name. They are: the column with numeric identifiers (the first column), the column with
cancer type (the second column), and the actual text (the third column).
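As a rough sketch, attribute lists like these can be read from the columns of a pandas DataFrame. The two tiny tables below are hypothetical stand-ins, not the actual datasets: the first reuses a few Spotify column names from the list above with made-up rows, and the second is a headerless three-column file shaped like CancerDoc.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt with a few of the Spotify column names.
spotify_csv = io.StringIO(
    "track_name,artist_count,bpm,danceability_%\n"
    "Song A,1,120,80\n"
    "Song B,2,95,65\n"
)
spotify = pd.read_csv(spotify_csv)
print(list(spotify.columns))

# Hypothetical headerless file: identifier, cancer type, free-form text.
cancerdoc_csv = io.StringIO(
    "0,type A,some free-form text\n"
    "1,type B,more free-form text\n"
)
cancerdoc = pd.read_csv(cancerdoc_csv, header=None)
print(list(cancerdoc.columns))  # integer positions, since no names exist
```

With `header=None`, pandas labels the columns by position, which mirrors the observation that the CancerDoc attributes have no clear names.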
[1.3, LO 1.3.2]
3. For each dataset, identify the type of the dataset—structured vs. unstructured. Explain why.
a. Spotify dataset
b. CancerDoc dataset
Solution a: The Spotify dataset is a structured dataset since each item in the dataset has the
same form: every row records the same set of attributes.
Solution b: The CancerDoc dataset is an unstructured dataset since the third column is the main
information while the first and second columns serve as labels of each entry (i.e., used to
distinguish each item in the dataset). The third column is a free-form text, so this dataset is
unstructured.
[1.3, LO 1.3.4]
5. Open the WikiHow dataset (ch1-wikiHow.json) and list the attributes of the dataset.
Solution: The ch1-wikiHow.json file has a list of items in an array (i.e., [ ]). Each element of the
array is an object (i.e., { }) with nine attributes in total. The attributes are: “Time”, “URL”,
“MainTask”, “MainTaskSummary”, “Steps”, “Categories”, “Ingredients”, “Requirements”, and
“Tips”.
Note that some attributes have data in the form of an array as well. For example, “Steps” is an
array of which each element is also an object with three fields—“Headline”, “Description”, and
“Links”.
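The structure described above can be inspected with Python's built-in json module. The following is a sketch only: the one-item sample mirrors the attribute names given in the solution, but every value and value type in it is a placeholder assumption, not content from ch1-wikiHow.json.

```python
import json

# Hypothetical one-item sample; values are empty placeholders.
sample = """
[
  {
    "Time": "", "URL": "", "MainTask": "", "MainTaskSummary": "",
    "Steps": [{"Headline": "", "Description": "", "Links": []}],
    "Categories": [], "Ingredients": [], "Requirements": [], "Tips": []
  }
]
"""

items = json.loads(sample)                 # the outer array becomes a list
attributes = list(items[0].keys())         # attributes of the first object
step_fields = list(items[0]["Steps"][0].keys())  # fields of one step

print(attributes)
print(step_fields)
```

For the real file, the string passed to `json.loads` would be replaced by reading the file with `json.load` on an open file handle.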
[1.5, LO 1.5.3]
7. Regenerate the scatterplot of the Spotify dataset, but with a custom title and x-/y-axis label.
The title should be “BPM vs. Danceability.” The x-axis label should be titled “bpm” and range
from the minimum to the maximum bpm value. The y-axis label should be titled “danceability”
and range from the minimum to the maximum Danceability value.
a. Python Matplotlib (Hint: The DataFrame.min() and DataFrame.max() methods
return the minimum and maximum values of each column of the DataFrame. You can also
call these methods on a specific column. For example, if a DataFrame is named df and has a
column named "col1", df["col1"].min() will return the minimum value of the
"col1" column of df.)
b. A spreadsheet program such as MS Excel or Google Sheets (Hint: Calculate the minimum
and maximum value of each column somewhere else first, then simply use the value
when editing the scatterplot.)
Solution a: The following code draws the same scatterplot with the custom title and axis labels.
import matplotlib.pyplot as plt

# `data` is assumed to be the Spotify DataFrame already loaded with pandas
plt.scatter(data["bpm"], data["danceability_%"])  # draw the scatterplot
plt.title("BPM vs. Danceability")  # set the title
# set the x-axis label and its range of values
plt.xlabel("bpm")
plt.xlim(data["bpm"].min(), data["bpm"].max())
# set the y-axis label and its range of values
plt.ylabel("danceability")
plt.ylim(data["danceability_%"].min(), data["danceability_%"].max())
plt.show()
Solution b: (This solution is based on MS Excel.) You can edit the chart title by double-clicking
the title text. A cursor will show up, and you can edit the title text. The axis labels can be added
by clicking Chart Design > Add Chart Element > Axis Titles. Primary Horizontal and Primary
Vertical will add a text box for the x- and y-axes, respectively. You can edit the text boxes by
double-clicking them.
To make each axis range from the minimum to the maximum value of the bpm and danceability
columns, you first need to calculate those values in Excel. You can do so by using =MIN() and
=MAX() on each column. Note those values somewhere and enter them in the Minimum and
Maximum text boxes under Format Axis > Axis Options > Bounds. You can open the Format Axis
menu by either 1) double-clicking the axis elements or 2) right-clicking the axis elements and
then selecting Format Axis.
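The bounds entered in the spreadsheet can be cross-checked in Python. This is a minimal sketch assuming pandas; the five rows below are made-up values standing in for the two Spotify columns, not actual data from the dataset.

```python
import pandas as pd

# Hypothetical values standing in for the two Spotify columns.
data = pd.DataFrame({
    "bpm": [125, 92, 138, 170, 144],
    "danceability_%": [80, 71, 51, 55, 65],
})

# Equivalent of =MIN() and =MAX() applied to each column.
bounds = {col: (data[col].min(), data[col].max()) for col in data.columns}

for col, (low, high) in bounds.items():
    print(f"{col}: minimum bound = {low}, maximum bound = {high}")
```

The pair computed for each column corresponds to the Minimum and Maximum values typed into the axis bounds.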