University of California, Berkeley · DS 100 · sp18 hw2_solution.ipynb at master · DS-100/sp18 · GitHub



4/18/2018 sp18/hw2_solution.ipynb at master · DS-100/sp18 · GitHub


https://github.com/DS-100/sp18/blob/master/hw/hw2/solution/hw2_solution.ipynb

Homework 2: Food Safety
Course Policies
Here are some important course policies. These are also located at http://www.ds100.org/sp18/.

Collaboration Policy

Data science is a collaborative activity. While you may talk with others about the homework, we ask that you write your solutions individually. If
you do discuss the assignments with others, please include their names at the top of your solution.


Due Date
This assignment is due at 11:59pm Tuesday, February 6th. Instructions for submission are on the website.



Homework 2: Food Safety
Cleaning and Exploring Data with Pandas
<img src="scoreCard.jpg" width=400>

In this homework, you will investigate restaurant food safety scores for restaurants in San Francisco. Above is a sample score card for a restaurant.
The scores and violation information have been made available by the San Francisco Department of Public Health, and we have made these data
available to you via the DS 100 repository. The main goal for this assignment is to understand how restaurants are scored. We will walk through the
various steps of exploratory data analysis to do this. To give you a sense of how we think about each discovery we make and what next steps it
leads to, we will provide comments and insights along the way.

As we clean and explore these data, you will gain practice with:

Reading simple csv files
Working with data at different levels of granularity
Identifying the type of data collected, missing values, anomalies, etc.
Exploring characteristics and distributions of individual variables



Question 0
To start the assignment, run the cell below to set up some imports and the automatic tests that we will need for this assignment:

In many of these assignments (and your future adventures as a data scientist) you will use os, zipfile, pandas, numpy, matplotlib.pyplot, and
seaborn.

1. Import each of these libraries as their commonly used abbreviations (e.g., pd, np, plt, and sns).
2. Don't forget to use the Jupyter notebook "magic" to enable inline matplotlib plots
(http://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-matplotlib).
3. Add the line sns.set() to make your plots look nicer.

In [1]: import os
import zipfile
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

In [2]: import sys

assert 'zipfile' in sys.modules
assert 'pandas' in sys.modules and pd
assert 'numpy' in sys.modules and np
assert 'matplotlib' in sys.modules and plt
assert 'seaborn' in sys.modules and sns



Downloading the data
As you saw in lectures, we can download data from the internet with Python.
Using the utils.py file from the lectures (see link (http://www.ds100.org/sp18/assets/lectures/lec05/utils.py)), define a helper function
fetch_and_cache to download the data with the following arguments:

data_url: the web address to download
file: the file in which to save the results
data_dir: (default="data") the location to save the data
force: if true the file is always re-downloaded

This function should return a pathlib.Path object representing the file.

In [3]: import requests
from pathlib import Path

def fetch_and_cache(data_url, file, data_dir="data", force=False):
    """
    Download and cache a url and return the file object.

    data_url: the web address to download
    file: the file in which to save the results.
    data_dir: (default="data") the location to save the data
    force: if true the file is always re-downloaded

    return: The pathlib.Path object representing the file.
    """

    ### BEGIN SOLUTION
    data_dir = Path(data_dir)
    data_dir.mkdir(exist_ok=True)
    file_path = data_dir / Path(file)
    # If the file already exists and we want to force a download then
    # delete the file first so that the creation date is correct.
    if force and file_path.exists():
        file_path.unlink()
    if force or not file_path.exists():
        print('Downloading...', end=' ')
        resp = requests.get(data_url)
        with file_path.open('wb') as f:
            f.write(resp.content)
        print('Done!')
    else:
        import time
        last_modified_time = time.ctime(file_path.stat().st_mtime)
        print("Using cached version last modified (UTC):", last_modified_time)
    return file_path
    ### END SOLUTION
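
The download-or-reuse decision described above can be tried out without network access. What follows is a minimal sketch, not the course's implementation: `cache_bytes` and `fake_download` are hypothetical names introduced here, and the download step is passed in as a callable so that a stand-in can take the place of `requests.get`.

```python
import tempfile
from pathlib import Path


def cache_bytes(download, file, data_dir="data", force=False):
    """Write download() to data_dir/file unless a cached copy already exists.

    Hypothetical helper: `download` is any zero-argument callable returning
    bytes, standing in for requests.get(data_url).content.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(exist_ok=True)
    file_path = data_dir / file
    if force or not file_path.exists():
        file_path.write_bytes(download())
    return file_path


calls = []


def fake_download():
    # Stand-in for a network request; records each invocation.
    calls.append(1)
    return b"payload"


with tempfile.TemporaryDirectory() as tmp:
    first = cache_bytes(fake_download, "data.zip", data_dir=tmp)   # downloads
    second = cache_bytes(fake_download, "data.zip", data_dir=tmp)  # reuses the cached file
    third = cache_bytes(fake_download, "data.zip", data_dir=tmp, force=True)  # downloads again
    downloads = len(calls)
    cached_content = third.read_bytes()
    same_path = first == second == third

print(downloads)
```

Passing the downloader as an argument is a choice made for this sketch only; it keeps the caching logic separate from the HTTP call so that the `force` flag can be checked in isolation.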


Now use the previously defined function to download the data from the following URL:
http://www.ds100.org/sp18/assets/datasets/hw2-SFBusinesses.zip

In [4]: data_url = 'http://www.ds100.org/sp18/assets/datasets/hw2-SFBusinesses.zip'
file_name = 'data.zip'
data_dir = '.'


dest_path = fetch_and_cache(data_url=data_url, data_dir=data_dir, file=file_name)
print('Saved at {}'.format(dest_path))

Using cached version last modified (UTC): Wed Feb 7 17:46:26 2018
Saved at data.zip



Loading Food Safety Data
To begin our investigation, we need to understand the structure of the data. Recall that this involves answering questions such as:

Is the data in a standard format or encoding?
Is the data organized in records?
What are the fields in each record?
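
As a rough illustration of how these three questions might be checked in code, here is a small sketch using only the standard library. The sample bytes, column names, and values are made up for this example and are not taken from the homework data.

```python
import csv
import io

# Made-up sample bytes standing in for a file read from disk.
raw = b"business_id,name,score\n1,Cafe A,94\n2,Cafe B,\n"

# Encoding: decoding raises UnicodeDecodeError if the bytes are not valid UTF-8.
text = raw.decode("utf-8")

# Records and fields: csv.reader yields one list of strings per line.
rows = list(csv.reader(io.StringIO(text)))
header, records = rows[0], rows[1:]

# A single field count across all records suggests a regular tabular layout.
field_counts = {len(r) for r in records}

# Empty strings are one way missing values show up in raw CSV text.
missing = [(r[0], name) for r in records for name, v in zip(header, r) if v == ""]

print(header)
print(field_counts, missing)
```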

There are 4 files in the data directory. Let's use Python to understand how this data is laid out.

Use the zipfile library to list all the files stored in the zip archive at dest_path.

Creating a ZipFile object might be a good start (the Python docs (https://docs.python.org/3/library/zipfile.html) have further details).
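
For orientation, the following sketch shows the relevant `zipfile` calls on a tiny archive built in memory. The member names and contents are made up for this example; `namelist()`, `infolist()`, and `ZipInfo.file_size` are part of the standard library.

```python
import io
import zipfile

# Build a tiny archive in memory; member names and contents are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("a.csv", "id,value\n1,2\n")
    zf.writestr("b.csv", "id\n")

with zipfile.ZipFile(buffer, "r") as zf:
    # namelist() returns member names in archive order.
    names = zf.namelist()
    # infolist() returns ZipInfo objects carrying per-member metadata.
    sizes = {info.filename: info.file_size for info in zf.infolist()}
    # Members open as binary file objects, so lines must be decoded.
    with zf.open("a.csv") as f:
        first_line = f.readline().decode().rstrip()

print(names, sizes, first_line)
```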

In [5]: # Fill in the list_names variable with a list of all the names of the files in the zip file
my_zip = ...
list_names = ...

### BEGIN SOLUTION
my_zip = zipfile.ZipFile(dest_path, 'r')
list_names = [f.filename for f in my_zip.filelist]
print(list_names)
### END SOLUTION

['violations.csv', 'businesses.csv', 'inspections.csv', 'legend.csv']

In [6]: assert isinstance(my_zip, zipfile.ZipFile)
assert isinstance(list_names, list)
assert all([isinstance(file, str) for file in list_names])

### BEGIN HIDDEN TESTS
assert set(list_names) == set(['violations.csv', 'businesses.csv', 'inspections.csv', 'legend.csv'])
### END HIDDEN TESTS

Now display the files' names and their sizes. You might want to check the attributes of a ZipFile object.

In [7]: ### BEGIN SOLUTION
zf = zipfile.ZipFile(dest_path, 'r')
for file in zf.filelist:
    print('{}\t{}'.format(file.filename, file.file_size))
### END SOLUTION


Question 1a
From the above output we see that one of the files is relatively small. Still based on the HTML notebook, display the first 5 lines of this file.

In [8]: file_to_open = ...

### BEGIN SOLUTION
file_to_open = ''
with my_zip.open(file_to_open) as f:
    for i in range(5):
        print(f.readline().rstrip().decode())
### END SOLUTION

In [9]: assert isinstance(file_to_open, str)
### BEGIN HIDDEN TESTS
assert file_to_open == ''
### END HIDDEN TESTS

