Project 2: Python and Pandas

Setup

Copy the following code into a file named mini-pandas.py.

import pandas as pd
import matplotlib.pyplot as plt

# URLs for the datasets
# TO BE REPLACED LATER WITH URLs
products_url = "products.csv"

reviews_url = "reviews.csv"

# load in the data
products = pd.read_csv(products_url)

reviews = pd.read_csv(reviews_url)

Programming Portion

In this project we’ll be maintaining and analyzing ice cream product and review data. The data is stored in two CSV files: products.csv stores information about each product, such as its flavor, ingredients, and rating, while reviews.csv contains the reviews that individuals have written for the various products.

Part 1: Data Cleaning and Inspection

The data tables that you loaded in aren’t quite ready for use. In the products table, the ingredients column is a string of ingredients separated by commas, which is a common way to store lists in CSV files. In addition, there may be missing or implausible data in the files.

Your data cleaning and inspection pass should do the following with the two dataframes:

1. Identify any rows that don’t include an author or a title (this is normal for reviews, but let’s just say it’s bad in this case).
2. Convert the date from a string to a datetime object.
3. Convert the comma-separated string of ingredients to a list.

Task: Write a function called inspect that takes in a reviews dataframe and returns a reviews dataframe containing only the suspicious rows described in item 1. You don’t have to remove these from the table, just identify them.
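
As a rough illustration of the idea behind this task, here is a minimal sketch of selecting rows with missing values on a toy table. The column names author, title, and stars are assumptions taken from the task wording; the real reviews.csv may name them differently.

```python
import pandas as pd

# Toy reviews table. The column names are assumed, not taken from reviews.csv.
toy_reviews = pd.DataFrame({
    "author": ["ana", None, "cy"],
    "title": ["Great", "Too sweet", None],
    "stars": [5, 2, 4],
})

# isna() marks missing cells; "|" combines the two masks row by row.
missing_mask = toy_reviews["author"].isna() | toy_reviews["title"].isna()

# Boolean indexing returns a new dataframe; toy_reviews itself is unchanged.
suspicious = toy_reviews[missing_mask]
print(suspicious)
```

Because the mask only selects rows, the original table keeps all of its rows, which matches the requirement to identify rather than remove them.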

Task: Make a function called prep_reviews that takes in a reviews dataframe as input and modifies the dataframe as stated in item 2.

Hint: use the pandas.to_datetime() function.
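
To make the hint concrete, the following sketch converts a column of date strings on a toy table. The column name date and the YYYY-MM-DD format are assumptions; the dates in the real file may be formatted differently.

```python
import pandas as pd

# Toy table with dates stored as strings (assumed format: YYYY-MM-DD).
toy = pd.DataFrame({"date": ["2020-01-15", "2020-02-03", "2021-07-30"]})

# Replace the string column with a datetime column.
toy["date"] = pd.to_datetime(toy["date"])

# Datetime columns expose date parts through the .dt accessor.
print(toy["date"].dt.year.tolist())  # [2020, 2020, 2021]
```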

Task: Create a function prep_products that takes a products dataframe and modifies the dataframe as stated in item 3.
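
One possible way to turn a comma-separated string into a list is sketched below on a toy table. The column names key and ingredients and the sample values are assumptions, and the sketch assumes no cell in the column is missing.

```python
import pandas as pd

# Toy products table. Names and values are made up for illustration.
toy_products = pd.DataFrame({
    "key": ["0_x", "1_x"],
    "ingredients": ["CREAM, SUGAR, COCOA", "MILK, SUGAR"],
})

# Split each string on commas, then strip the whitespace around each piece.
toy_products["ingredients"] = (
    toy_products["ingredients"]
    .str.split(",")
    .apply(lambda parts: [p.strip() for p in parts])
)

print(toy_products.loc[0, "ingredients"])  # ['CREAM', 'SUGAR', 'COCOA']
```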

Part 2: Updating Data

In this section we will make updates to the dataframes.

Task: Write a function add_ingredient that takes in a products dataframe, a key (string), and an ingredient (string) then adds the ingredient to that flavor’s list of ingredients.
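
The sketch below shows one way to reach a list stored in a single cell and extend it. It assumes the ingredients column already holds Python lists and that the column identifying a flavor is named key; both are assumptions about the real table.

```python
import pandas as pd

# Toy products table whose ingredients column already holds lists.
toy_products = pd.DataFrame({
    "key": ["0_x", "1_x"],
    "ingredients": [["CREAM", "SUGAR"], ["MILK"]],
})

# Find the index labels of the rows whose key matches.
row_labels = toy_products[toy_products["key"] == "1_x"].index

for label in row_labels:
    # .at reads one cell; the cell is a list, so append changes it in place.
    toy_products.at[label, "ingredients"].append("VANILLA")

print(toy_products.loc[1, "ingredients"])  # ['MILK', 'VANILLA']
```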

Task: Write a function new_flavor that adds a new flavor to the products dataframe. The function takes in a products dataframe and data for all of the columns for a new row in order. It should not return any output.
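
For reference, here is a hedged sketch of appending a row in place. The helper add_row and the columns key, name, and rating are hypothetical. The sketch assumes the dataframe uses the default 0, 1, 2, ... index and that every value is a scalar; cells that hold lists are not covered here.

```python
import pandas as pd

# Toy table with three scalar columns (hypothetical names).
toy_products = pd.DataFrame({
    "key": ["0_x"],
    "name": ["Vanilla"],
    "rating": [4.5],
})

def add_row(df, *values):
    """Append one row in place; values must follow the column order of df."""
    if len(values) != len(df.columns):
        raise ValueError("expected one value per column")
    # With a default integer index, len(df) is the next unused row label.
    df.loc[len(df)] = list(values)

add_row(toy_products, "1_x", "Mint Chip", 4.1)
print(len(toy_products))  # 2
```

Assigning through df.loc changes the dataframe that was passed in, so the function has no need to return anything.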

Part 3: Extra Credit

Task: Create a function favorite_ingredients that takes in a reviews dataframe containing a single reviewer and a products dataframe. Return a list of ingredients that appear in every ice cream the reviewer gave 4 stars or higher.

Note: A single reviewer may have multiple reviews. This means the reviews dataframe may contain multiple rows, each a review of a different flavor.

Hint: You can get the intersection of 2 lists, lst1 and lst2, by writing set(lst1).intersection(lst2). To get the elements that appear in exactly one of the two lists (that is, the elements not in their intersection), write set(lst1) ^ set(lst2).
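
The set operations in the hint can be tried on small lists as follows. The ingredient names are made up, and the last step (intersecting more than two lists at once) is an extension of the hint, not something the assignment requires.

```python
lst1 = ["cream", "sugar", "cocoa"]
lst2 = ["cream", "sugar", "mint"]

# Elements present in both lists.
common = set(lst1).intersection(lst2)

# Elements present in exactly one of the two lists (symmetric difference).
not_shared = set(lst1) ^ set(lst2)

# intersection() accepts several iterables, so many lists can be combined.
lists = [lst1, lst2, ["sugar", "cream", "salt"]]
in_all = set(lists[0]).intersection(*lists[1:])

print(sorted(common))      # ['cream', 'sugar']
print(sorted(not_shared))  # ['cocoa', 'mint']
print(sorted(in_all))      # ['cream', 'sugar']
```

Sets are unordered, so convert back with list() or sorted() when a list is needed.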

OTHER TASK IDEA Task: write a function popular_ingredients that takes in a products dataframe and returns a list containing the 5 most popular ingredients based on the rating of each product (not including the ingredients shared between all products).

Part 4: Analysis

Task: Write a function plot_rating_data that takes a dataframe and produces a scatterplot of its rating vs rating_count columns. Your function should check whether the given dataframe has columns with these names before generating the plot, raising a ValueError if it does not.

Note: You can get the names of columns in a dataframe named df by writing df.columns.to_list().
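
A column check along these lines might look like the sketch below. The helper name require_columns is hypothetical, and the plotting step is left out on purpose.

```python
import pandas as pd

def require_columns(df, required):
    """Raise ValueError if any name in `required` is not a column of df."""
    present = df.columns.to_list()
    missing = [name for name in required if name not in present]
    if missing:
        raise ValueError(f"missing columns: {missing}")

good = pd.DataFrame({"rating": [4.5, 3.9], "rating_count": [120, 45]})
bad = pd.DataFrame({"rating": [4.5, 3.9]})

# Passes silently because both columns exist.
require_columns(good, ["rating", "rating_count"])

# Raises because rating_count is absent.
try:
    require_columns(bad, ["rating", "rating_count"])
except ValueError as err:
    print(err)
```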

Task: In a comment, explain whether you believe there is a relationship between the rating of a product and the number of ratings.

Task: Write a function reviews_per_month that takes a reviews dataframe and produces a line plot of the number of reviews made each month within the dataframe.

Hint: groupby will be useful here. To extract the month and year from a column of dates in a dataframe df, write df['date'].dt.to_period('M')
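
As a small illustration of the hint, this sketch counts rows per month on a toy table. The column names date and stars are assumptions, and the date column is assumed to have been converted to datetime already.

```python
import pandas as pd

# Toy reviews table with a datetime column.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-05", "2020-01-20", "2020-03-02"]),
    "stars": [5, 3, 4],
})

# Reduce each date to its year and month.
month = toy["date"].dt.to_period("M")

# Group the rows by that period and count the rows in each group.
counts = toy.groupby(month).size()
print(counts)
```

Months with no reviews (February 2020 here) do not appear in the result at all, which is worth keeping in mind when reading the plot. Calling counts.plot() on a Series like this draws a line plot by default.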

Task: In a comment, what is the trend of reviews per month? Are there periods where there is an increase in number of reviews or is it more uniform? Why do you think this is the case?

Task: Write a function reviews_per_flavor that takes in a reviews dataframe and produces a line plot of the number of reviews for 3 flavors (10_bj, 16_bj, and 28_bj) in the dataset over time (using the months and years as the horizontal axis). Put them all in the same figure window (which will happen automatically unless you explicitly create a new window).

Hint: To select a specific group from a GroupBy object gf, call gf.get_group('key').
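
Here is a minimal sketch of get_group on a toy table. It assumes the column that identifies a flavor is named key, which may not match the real reviews file.

```python
import pandas as pd

# Toy reviews table; "key" is the assumed name of the flavor identifier.
toy = pd.DataFrame({
    "key": ["10_bj", "16_bj", "10_bj"],
    "stars": [5, 2, 4],
})

by_key = toy.groupby("key")

# get_group returns the rows of one group as a regular dataframe.
ten = by_key.get_group("10_bj")
print(ten["stars"].tolist())  # [5, 4]
```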

Note: If you get a SettingWithCopyWarning, create a deep copy of a dataframe df by writing df.copy()
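
The note above can be illustrated as follows: take the copy before adding a column to a filtered table. The column names and values are made up for the example.

```python
import pandas as pd

toy = pd.DataFrame({"key": ["10_bj", "16_bj"], "stars": [5, 2]})

# Filter first, then copy, so that later changes touch only the copy.
subset = toy[toy["key"] == "10_bj"].copy()
subset["month"] = "2020-01"

print(toy.columns.to_list())     # ['key', 'stars']
print(subset.columns.to_list())  # ['key', 'stars', 'month']
```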

Part 5: Testing (everyone)

For the testing portion, we are largely interested in the contents of your testing tables, as well as which cases you check using your testing tables. Pay attention to the variety of cases represented in your testing tables and how they align with the needs of the functions that you are testing.

Task: Create a small reviews table and use it to test both the prep_reviews and inspect functions. Write separate testing functions for these (called test_prep_reviews and test_inspect).
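
To show what a varied testing table can look like, the sketch below tests a hypothetical stand-in function, inspect_stub. It is not the required inspect implementation, and the column names author and title are assumptions.

```python
import pandas as pd

def inspect_stub(reviews):
    """Hypothetical stand-in: rows that are missing an author or a title."""
    return reviews[reviews["author"].isna() | reviews["title"].isna()]

def test_inspect_stub():
    # One row per case: nothing missing, no author, no title, both missing.
    table = pd.DataFrame({
        "author": ["ana", None, "cy", None],
        "title": ["Great", "Too sweet", None, None],
    })
    flagged = inspect_stub(table)
    assert flagged.index.tolist() == [1, 2, 3]

    # A table with nothing missing should give an empty result.
    clean = table.iloc[[0]]
    assert inspect_stub(clean).empty

    # The input table must keep all of its rows.
    assert len(table) == 4

test_inspect_stub()
```

Each row of the table exists to exercise one distinct case, which makes it easy to see which cases are covered and which are not.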

Task: Create both a small products table and reviews table to generate analysis plots with the plotting functions in part 4.

Task (HW4 Testing Makeup): Develop a solid set of tests for the favorite_ingredients function and any complicated helpers you made for it. You may wish to develop a separate small reviews table designed specifically for testing this function (though this is not required). Put your tests in a function called test_favorite_ingredients (with separate test functions for any helpers that you decide should be tested).

Handin

Submit your Python file to Gradescope.