Working with tables

Released	Due
TBD	TBD

A vast portion of previously collected data is housed in tabular formats. Be that a CSV file, an SQL database, a Pandas dataframe, or et cetera–knowing how to work with tabular data is crucial skill for data scientists and programmers alike. Even outside of the data analysis realm, we will find many situations where storing a simple listing of values becomes unintuitive or computationally restrictive, and being able to reason about data with a two-dimensional row/column format becomes all the more important. Luckily for us, Pyret’s table implementation provides us with a number of ways to manipulate and query tabular data (relevant DCIC section).

That being said, there is also no such thing as a perfect dataset. Issues with duplicated entries, missing values, biased inputs, and so on are all common when working with real-world data. You may find that the data you seek simply does not exist in one place, and indeed it often the case that you will need to pull information from multiple sources to find what you are looking for. The process of pulling data from several sources into some central repository is known as data warehousing, and you will be getting hands-on practice with the process in this assignment! We will not be doing a deep dive into data-cleaning just yet, but we will also begin to explore biase in traditional computing systems and ways to stay aware of data errors.

Part 1: aggregating movie data

To get comfortable with pulling data from multiple sources, we’re going to take a trip to the movies! We will be working with data from the following sources:

Worldwide box office sales set
- https://www.kaggle.com/datasets/kkhandekar/all-time-worldwide-box-office
MovieLens review set
- https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?select=movies_metadata.csv

Due to browser limitations, the aforementioned datasets cannot be imported using pyret’s web editor. Starter code has hence been provided to import reduced copeis of the above records from a UMD-hosted mirror. Box office sales data can be found in the SALES table, and general production information can be found in PRODUCTION table.

If you are interested in seeing how the reduced datasets were created, feel free to take a look at this python script.

Task 1: finding box office sales from IMDB identifier

The Internet Movie Database (IMDb) is a popular, online movie database housing production information, user reviews, and so on. While searching IMDB for something to watch, we want to be able to see how much money any given movie made upon release. To that end, please fill in the function imdb-to-box-office(imdb-id :: String, production :: Table, sales :: Table) -> Number:

Your function should take the input IMDB identifier and return it’s worldwide box office revenues in numeric format
- Hint: while there are no IMBD identifiers stored in the SALES table, you should be able to find a movie’s name from its associated identifier using the PRODUCTION table
Note that there are some movies which do not appear in both datasets. Your function will not be evaluated for such cases.

Task 2: making sales lookup robust

As mentioned in the footnote for task 1, some movies are not present in both datasets. In the real world, we would not want our program to crash and burn the moment it comes across a missing entry. For now, we can address this issue by throwing an error whenever a given movie cannot be found. Please add error messages to your task 1 code imdb-to-box-office-robust(imdb-id :: String, production :: Table, sales :: Table) -> Number: for the following cases…

An IMDB identifier is missing –> throw an exception with the message “IMDB identifier not found”
IMDB movie name is missing from box office set –> throw an exception with the message “Missing box office data for {insert movie title}”
Movie name is not unique (i.e. “Robinhood” has several adaptations) –> throw an exception with the message “Multiple movies matched”

To test error messages, we can use the raises keyword described in pyret’s test documentation. I.e:

check
    imdb-to-box-office-robust("im not a real identifier") raises "IMDB identifier not found"
end

Task 3: computing profit margins

We now want to know how much money a movie made after accounting for the cost of production (which can be found in the PRODUCTION table). Please implement the function imdb-to-profit(imdb-id :: String, production :: Table, sales :: Table) -> Number which returns a movie’s worldwide box office sales less the cost of production. Your function should error as described in task 2 for the case of missing IMDB identifiers or box office data.

Hint: You should be able to call your imdb-to-box-office-robust()function to make this easier!

Task 4: test-driven planning

As a global campus, we want to be able to see box office data in any currency. Seeing that the present SALES data is recorded in US dollars, we will need to convert accordingly. To do so, we will be using a conversions table storing the original currency, currency being converted to, and conversion rate. Your job for the next task will be to return a movie’s international box office sales data for any specified currency string.

Instead of jumping straight into development, please take a moment to draft a psuedocode plan for how this could be implemented. In addition to your draft, please provide a set of three unique conversion tables to test the proposed function. Each of your tables should be labelled with the case it is testing – be that missing data, duplicate entries, etc. Please provide these tables in the following format: {from-currency :: String, to-currency :: String, conversion-rate :: Number}

Task 5: converted sales data

Implement your plan from task 4 inside the sales-in-currency(imdb-id :: String, desired-currency :: String, conv-table :: Table, production :: Table, sales :: Table) -> Number: placeholder. Your function should throw an error (with any message of your choosing) in the event of an unrecorded IMDB id, missing conversion rates, or duplicate conversion rate entries. You can run your function with the provided CONVERSIONS-SMALL table, but note that this table does not demonstrate every possible error case! Please use the tables you produced in task 4 to fill in the gaps.

Recall that you can always use a where block (pyret docs%20is%2020%0A%20%20double(15)%20is%2030%0Aend,-2.2.2%C2%A0Testing)) to conveniently bundle student tests with your function

Task 6: improved conversions

For this problem, we want to make better use of minimal conversion rate data. To that end, you will be adding support for reverse conversions. I.e. if we were given a value in USD and wanted a result in JPY, then we could calculate the answer by multiplying the USD value by the reciprocal of the JPY –> USD conversion rate. Your function should error in the same cases of missing IMDB ids, duplicate conversion rates, etc – but it should NOT error for missing conversions if the reverse conversion exists!

All that being said, please implement the improved conversion function inside robust-sales-in-currency(imdb-id :: String, desired-currency :: String, conv-table :: Table, production :: Table, sales :: Table) -> Number:
Again, tables will not be given for every error case. Don’t forget to write your own tests!

Part 2: visualizing Titanic data

For the rest of this assignment, we will be delving into an exploratory analysis of the following Titanic passenger dataset: titanic.csv.** Each row contains a single passenger’s information, including…

Whether or not they survived (0/1)
Class of ticket purchased
Passenger name and title
Sex assigned at birth
Age at time of trip
Whether their siblings or spouses were on board
Whether their children or parents were on board
The price of their ticket

Please refer to the pre-loaded TITANIC table from the starter code in order to complete the following tasks.

** Should be mirrored wherever we put the movie data.

Task 7: finding most frequent entries

Please fill in the function get-six-most-popular(titanic-data :: Table, birth-sex :: String) -> Table which computes the six most popular first names for the desired sex. You may find it useful to plan your function prior to implementation. For helpful table methods, please refer to the pyret documentation or condensed Brown University reference guide here.

Hint: all first names will be delimited with a space before and after (i.e. title(space)first-name(space)...)
Hint: pyret provides functions for sorting, and you may find the provided helper method counter(table, column-name) useful.

Task 8: visualizing title data

The ability to present information in a concise and informative manner is crucial. To that end, please implement the function visualize-titles(titanic-data :: Table) which should produce some graph or plot of your choosing to show the frequency of each Mr/Ms/Dr/etc title in the Titanic dataset. Your implementation will be graded manually with respect to readability, so make sure your chosen format is reasonable for the data in question.

Note that pyret’s charting functionality is currently limited to list-based instantiation. For example…
- Suppose we have a table named data with the columns [name, num]. If we wanted to make a bar chart of this information, we would do the following:
  from-list.bar-chart(counter.get-column('name'), counter.get-column('num'))

Task 9: making a point

As demonstrated with the previous tasks, the TITANIC table could be used to extrapolate a variety of patterns and interesting datapoints about the Titanic passengers. In this problem, you are tasked to identify some pattern or relation of your choosing and present it to the graders in a clear and concise format. Your final implementation should:

Show how to extract the pattern in question
Produce at least one table or graph
Include at least a one line comment about why you felt the relation was interesting

If you are unsure what to look at, you could consider comparing fare prices across ticket class, title and ticket class, or survial rates to start. There is a placeholder for your implementation in making-a-point(titanic-data :: Table), but feel free to add any helpers or other functions that you deem necessary.

Task 10: reflection

As mentioned in the introduction to this assignment, we will often run into biases when processing real world data. Please type up a brief reflection answering the following questions in sentence form (a simple numeric list of responses is fine):

Did anything suprise you aboute your outputs in task 6?
What were some assumptions made/implied in the Titanic dataset? Were any groups excluded?
If you were to go back and record data for Titanic passengers, what information would you want to collect or drop? Would you use the same format as the TITANIC table in question?

In formulating your responses, please consider perusing the following readings: