Project 2: Python and Pandas
Setup
- Copy the following code into a file named mini-pandas.py.
import pandas as pd
import matplotlib.pyplot as plt
# URLs for the datasets
# TO BE REPLACED LATER WITH URLs
products_url = "products.csv"
reviews_url = "reviews.csv"
# load in the data
products = pd.read_csv(products_url)
reviews = pd.read_csv(reviews_url)
Programming Portion
In this project we’ll be maintaining and analyzing ice cream product and review data. The data is stored in two CSV files: products.csv, which stores product information such as the flavor, ingredients, and rating, and reviews.csv, which contains the reviews that individuals have written for the various products.
Part 1: Data Cleaning and Inspection
The data tables that you loaded in aren’t quite ready for use. The ingredients column in the products table is a string of ingredients separated by commas, which is a common way to store lists in CSV files. In addition, there may be missing or implausible data in the files.
Your data cleaning and inspection pass should do the following with the two dataframes:
- identify any rows that don’t include an author or a title (this is normal for reviews, but let’s just say it’s bad in this case)
- convert the date from a string to a datetime object
- convert the sequence of ingredients to a list
Task: Write a function called inspect that takes in a reviews dataframe and returns a reviews dataframe containing only the suspicious rows described in item 1. You don’t have to remove these from the table, just identify them.
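Example (a sketch, not a solution): one way to pick out rows with missing values is a boolean mask built with isna(). The toy table and its column names ("author", "title", "stars") are assumptions made for illustration; check the real column names in reviews.csv.

```python
import pandas as pd

# Toy reviews table; the column names here are assumptions for illustration.
toy_reviews = pd.DataFrame({
    "author": ["ana", None, "cy", "dee"],
    "title": ["Great", "Okay", None, "Yum"],
    "stars": [5, 3, 4, 2],
})

# True for rows where either column is missing (None/NaN).
missing_mask = toy_reviews["author"].isna() | toy_reviews["title"].isna()

# Boolean indexing returns a new dataframe; toy_reviews itself is unchanged.
suspicious = toy_reviews[missing_mask]
```

Because indexing with a mask builds a new dataframe, the original table keeps all of its rows.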
Task: Make a function called prep_reviews that takes in a reviews dataframe as input and modifies the dataframe as stated in item 2.
Hint: use the pandas.to_datetime() function
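Example (a sketch on a toy column): converting date strings to datetime values. The column name "date" and the YYYY-MM-DD format are assumptions; the dates in the real file may be formatted differently.

```python
import pandas as pd

# Toy table with dates stored as strings (format assumed for illustration).
toy = pd.DataFrame({"date": ["2020-01-15", "2020-02-03", "2021-07-30"]})

# Replace the string column with a datetime column.
toy["date"] = pd.to_datetime(toy["date"])
```

Once the column holds datetimes, the .dt accessor (for example toy["date"].dt.year) becomes available.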
Task: Create a function prep_products that takes a products dataframe and modifies the dataframe as stated in item 3.
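Example (a sketch on a toy column): splitting a comma-separated string into a list with the .str.split() accessor, then stripping stray spaces from each piece. The toy values are made up, and the sketch assumes no ingredient strings are missing.

```python
import pandas as pd

# Toy products table; the ingredient strings are made up for illustration.
toy = pd.DataFrame({"ingredients": ["CREAM, SUGAR, VANILLA", "MILK, COCOA"]})

# Split each string on commas, then strip whitespace around every piece.
toy["ingredients"] = (
    toy["ingredients"]
    .str.split(",")
    .apply(lambda items: [item.strip() for item in items])
)
```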
Part 2: Updating Data
In this section we will make updates to the dataframes.
Task: Write a function add_ingredient that takes in a products dataframe, a key (string), and an ingredient (string), then adds the ingredient to that flavor’s list of ingredients.
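Example (a sketch, not a solution): locating a row by its key and appending to the list stored in that row. The column names "key" and "ingredients" and the toy keys are assumptions, and the sketch assumes the key appears exactly once.

```python
import pandas as pd

# Toy products table whose ingredients column already holds lists.
toy = pd.DataFrame({
    "key": ["0_x", "1_x"],
    "ingredients": [["CREAM", "SUGAR"], ["MILK"]],
})

# Find the index label of the row whose key matches (assumes exactly one match).
row_label = toy.index[toy["key"] == "1_x"][0]

# .at returns the list object stored in the cell, so append changes it in place.
toy.at[row_label, "ingredients"].append("COCOA")
```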
Task: Write a function new_flavor that adds a new flavor to the products dataframe. The function takes in a products dataframe and data for all of the columns for a new row, in order. It should not return any output.
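Example (a sketch on a toy table): assigning to .loc with a label that does not exist yet adds a row in place, which is why nothing needs to be returned. The column names are assumptions, and the sketch assumes the table uses the default 0, 1, 2, ... index.

```python
import pandas as pd

# Toy products table with the default integer index (columns are assumed).
toy = pd.DataFrame({
    "key": ["0_x", "1_x"],
    "name": ["Vanilla", "Chocolate"],
    "rating": [4.0, 3.5],
})

# len(toy) is the next unused label, so this assignment appends a new row.
# The values must be given in column order.
toy.loc[len(toy)] = ["2_x", "Mint", 4.5]
```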
Part 3: Extra Credit
Task: Create a function favorite_ingredients that takes in a reviews dataframe containing a single reviewer and a products dataframe. Return a list of ingredients that appear in every ice cream the reviewer gave 4 stars or higher.
Note: A single reviewer may have multiple reviews. This means the reviews dataframe may contain multiple rows, each for a review of a different flavor.
Hint: You can get the intersection of 2 lists lst1 and lst2 by writing set(lst1).intersection(lst2). To get the elements that are not in the intersection of 2 lists a and b, write set(a) ^ set(b).
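Example (a sketch with made-up ingredient lists): the same intersection call can be repeated in a loop to find what several lists have in common, and ^ gives the items that appear in exactly one of two lists.

```python
# Made-up ingredient lists for illustration.
lists = [
    ["CREAM", "SUGAR", "VANILLA"],
    ["CREAM", "SUGAR", "COCOA"],
    ["SUGAR", "CREAM", "MINT"],
]

# Start from the first list and keep only what every later list also contains.
common = set(lists[0])
for lst in lists[1:]:
    common = common.intersection(lst)

# Items that are in exactly one of the first two lists.
not_shared = set(lists[0]) ^ set(lists[1])
```

Sets are unordered, so convert with list(common) or sorted(common) when a list is required.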
Other task idea: Write a function popular_ingredients that takes in a products dataframe and returns a list containing the 5 most popular ingredients based on the rating of each product (not including the ingredients shared between all products).
Part 4: Analysis
Task: Write a function plot_rating_data that takes a dataframe and produces a scatterplot of its rating vs rating_count columns. Your function should check whether the given dataframe has columns with these names before generating the plot, raising a ValueError if it does not.
Note: You can get the names of columns in a dataframe named df by writing df.columns.to_list().
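Example (a sketch of the column check only, with no plotting): the helper name require_columns is hypothetical and not part of the assignment.

```python
import pandas as pd

def require_columns(df, names):
    """Raise ValueError if any of the given column names is missing from df."""
    present = df.columns.to_list()
    missing = [name for name in names if name not in present]
    if missing:
        raise ValueError(f"missing columns: {missing}")

# Toy tables: one with both columns, one with rating_count missing.
good = pd.DataFrame({"rating": [4.5], "rating_count": [120]})
bad = pd.DataFrame({"rating": [4.5]})

require_columns(good, ["rating", "rating_count"])  # passes silently
```

Calling require_columns(bad, ["rating", "rating_count"]) raises a ValueError that names the missing column.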
Task: In a comment, state whether you believe there is a relationship between the rating of a product and the number of ratings.
Task: Write a function reviews_per_month that takes a reviews dataframe and produces a line plot of the number of reviews made each month within the dataframe.
Hint: groupby will be useful here. To extract the month and year from a column of dates in a dataframe df, write df['date'].dt.to_period('M').
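Example (a sketch on toy dates): grouping by month and counting the rows in each group. The column name "date" matches the hint; the dates themselves are made up.

```python
import pandas as pd

# Toy reviews table with an already-converted datetime column.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-01-05", "2020-01-20", "2020-02-11", "2020-04-01"]
    )
})

# Group rows by their year-month period and count the rows in each group.
monthly_counts = toy.groupby(toy["date"].dt.to_period("M")).size()
```

The result is a Series indexed by month, and calling monthly_counts.plot() on it draws a line plot. Months with no reviews (2020-03 in this toy data) do not appear in the result at all.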
Task: In a comment, what is the trend of reviews per month? Are there periods where there is an increase in number of reviews or is it more uniform? Why do you think this is the case?
Task: Write a function reviews_per_flavor that takes in a reviews dataset and produces a line plot of the number of reviews over time for 3 flavors in the dataset (10_bj, 16_bj, and 28_bj), using the months and years as the horizontal axis. Put them all in the same figure window (which will happen automatically unless you explicitly create a new window).
Hint: To select a specific group from a GroupBy object gf, use the get_group method by writing gf.get_group('key').
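Example (a sketch on a toy table): pulling one flavor's rows out of a GroupBy object. The column names "key" and "stars" are assumptions for illustration.

```python
import pandas as pd

# Toy reviews table; the column names are assumed for illustration.
toy = pd.DataFrame({
    "key": ["10_bj", "16_bj", "10_bj", "28_bj"],
    "stars": [5, 4, 3, 5],
})

grouped = toy.groupby("key")

# All rows whose key is "10_bj", returned as a dataframe.
ten_bj = grouped.get_group("10_bj")
```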
Note: If you get a SettingWithCopyWarning, create a deep copy of a dataframe df by writing df.copy().
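Example (a sketch on a toy table): copying a filtered subset before adding a column to it, so the new column lands on the copy and the original table is untouched. The column names and the "is_high" column are made up for illustration.

```python
import pandas as pd

# Toy reviews table; the column names are assumed for illustration.
toy = pd.DataFrame({
    "key": ["10_bj", "16_bj", "10_bj"],
    "stars": [5, 4, 3],
})

# Filter, then copy, so later assignments clearly target an independent table.
subset = toy[toy["key"] == "10_bj"].copy()
subset["is_high"] = subset["stars"] >= 4
```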
Part 5: Testing (everyone)
For the testing portion, we are largely interested in the contents of your testing tables, as well as which cases you check using your testing tables. Pay attention to the variety of cases represented in your testing tables and how they align with the needs of the functions that you are testing.
Task: Create a small reviews table and use it to test both the prep_reviews and inspect functions. Write separate testing functions for these (called test_prep_reviews and test_inspect).
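Example (a sketch of the shape of a table-based test): keep_high_ratings is a hypothetical stand-in for a function under test, not one of the assignment's functions. The point is that the small table covers a typical case, a boundary case, and an empty table.

```python
import pandas as pd

def keep_high_ratings(df):
    """Hypothetical function under test: keep rows with 4 or more stars."""
    return df[df["stars"] >= 4]

def test_keep_high_ratings():
    # Typical case plus the boundary value (exactly 4 stars).
    table = pd.DataFrame({"stars": [5, 4, 3, 1]})
    assert keep_high_ratings(table)["stars"].tolist() == [5, 4]

    # No row qualifies.
    low = pd.DataFrame({"stars": [1, 2]})
    assert len(keep_high_ratings(low)) == 0

    # Empty table: the function should still return an empty result.
    empty = pd.DataFrame({"stars": pd.Series([], dtype="int64")})
    assert len(keep_high_ratings(empty)) == 0

test_keep_high_ratings()
```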
Task: Create both a small products table and reviews table to generate analysis plots with the plotting functions in part 4.
Task (HW4 Testing Makeup): Develop a solid set of tests for the favorite_ingredients function and any complicated helpers you made for it. You may wish to develop a separate small reviews table designed specifically for testing this function (though this is not required). Put your tests in a function called test_favorite_ingredients (with separate test functions for any helpers that you decide should be tested).
Handin
Submit your Python file to Gradescope.