Homework Eight (retake)
Due 2025-05-12 at 11:59p
Download Starter Files
Submit on Gradescope
This is your second shot at the problems from Homework Eight. If you missed any of the problems, you are strongly advised to try again. Your goal is to complete (Satisfactory or Excellent) up to four problems across Homework Eight and this set, but you are encouraged to do more, especially if you were challenged by the first set of problems. Remember that we evaluate each problem individually, so there are no issues if you only do a subset of the problems (including none).
Objectives
- Demonstrate your ability to read text files
- Demonstrate your ability to perform string manipulation for data cleaning
- Demonstrate your ability to work with CSV and JSON file formats
Getting started
For this assignment, we have another four functions for you to write. Please pay attention to all of these instructions. Even this front matter, which may look the same (and which many of you skip over), has changed.
For this assignment, we have provided starter code. This primarily consists of the data files used in the examples below. These do not need to be submitted with your code to Gradescope.
Feel free to work through the functions in any order. Whatever order you use, we suggest that you take the time to test each one as you go rather than trying to write all four out like they were part of an essay and then testing at the very end. Treat these as four totally separate and distinct problems (which they are).
Submit your solution on Gradescope using the button above. Feel free to resubmit as you complete each problem to check your progress. To repeat: you only need to submit homework08r.py.
Python subset
For this assignment, we have not placed any restrictions on what you may use.
Satisfactory vs. Excellence
A solution for these questions that is excellent will have all of the following qualities.
Style
An excellent function will have a docstring that is formatted in the way shown in the lectures. It should include:
- the purpose of the function
- the type and purpose of each parameter (if any)
- the type and meaning of the output (if any)
In addition, you should follow some of the PEP8 guidelines on whitespace. The ones we will be looking at are:
- no whitespace between names and ( or [ (e.g., f (5) should be f(5) and s [3:5] should be s[3:5])
- there should be a single space around operators (e.g., x=4+1 should be x = 4 + 1 and y = 3 -2 should be y = 3 - 2)
- there should be a space after commas, but not before (e.g., f(4 , 5) or f(4,5) should be f(4, 5))
Special cases and requirements
For some of the problems, we have identified special cases or requirements that we have deemed potentially more challenging and not essential to a satisfactory solution. An excellent solution will cover all cases. These cases will be identified in the autograder by tests with * at the end of the title (so make sure you submit frequently as you are working).
Problem 1: medal_tally
Write a function called medal_tally. It should take in two string parameters: input_csv_filename, containing the path to a valid CSV file, and output_csv_filename, containing the path to the file you’d like your function to write data to.
The input CSV file will contain results from the Olympic Games and have the following columns:
- medal_type: the medal color (Gold/Silver/Bronze)
- discipline: the discipline (sport)
- event: the specific event within the discipline
- athlete: the name of the athlete who won a medal
- country: the country of the athlete who won a medal
Your goal is to compute a sorted medal tally for countries at these games. Imagine you have the following input file (provided in the starter code as paris_2024_medals_partial.csv):
medal_type,athlete_name,event,country,discipline
Gold,Siniaková Kateřina,Mixed Doubles,Czech Republic,Tennis
Gold,Macháč Tomáš,Mixed Doubles,Czech Republic,Tennis
Silver,Wang Xinyu,Mixed Doubles,China,Tennis
Silver,Zhang Zhizhen,Mixed Doubles,China,Tennis
Bronze,Dabrowski Gabriela,Mixed Doubles,Canada,Tennis
Bronze,Auger-Aliassime Félix,Mixed Doubles,Canada,Tennis
Gold,Biles Simone,Women's All-Around,United States,Artistic Gymnastics
Silver,Andrade Rebeca,Women's All-Around,United States,Artistic Gymnastics
Bronze,Lee Sunisa,Women's All-Around,United States,Artistic Gymnastics
You can’t just count the pairs of countries and medals, because sometimes a medal is given to a team (e.g., in tennis mixed doubles) and sometimes it is given to an individual (e.g., in the artistic gymnastics women’s all-around), but either way it should count as a single medal for the country. You should consider medal_type, event, and discipline together as a unique identifier for a medal to add to your medal tally; you may assume that there are no ties in the dataset.
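One way to count a shared medal only once is to remember which (medal_type, event, discipline) combinations have already been seen. The following is a minimal sketch under that assumption; count_unique_medals is a hypothetical helper (not part of the starter code), and it expects rows that have already been read into dictionaries.

```python
def count_unique_medals(rows):
    """Count medals per country, treating teammates' rows as one medal."""
    seen = set()   # (medal_type, event, discipline) combinations already counted
    tally = {}     # country -> {"Gold": n, "Silver": n, "Bronze": n}
    for row in rows:
        medal = (row["medal_type"], row["event"], row["discipline"])
        if medal in seen:
            continue  # a teammate's row for a medal we already counted
        seen.add(medal)
        counts = tally.setdefault(row["country"], {"Gold": 0, "Silver": 0, "Bronze": 0})
        counts[row["medal_type"]] += 1
    return tally


rows = [
    {"medal_type": "Gold", "event": "Mixed Doubles", "discipline": "Tennis", "country": "Czech Republic"},
    {"medal_type": "Gold", "event": "Mixed Doubles", "discipline": "Tennis", "country": "Czech Republic"},
    {"medal_type": "Silver", "event": "Mixed Doubles", "discipline": "Tennis", "country": "China"},
]
result = count_unique_medals(rows)
```

Here the two Czech rows collapse into a single gold medal because they share the same identifying triple.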
Your function should write the medal tally to the output CSV file, sorted by total medals. Any ordering is OK if countries are tied in total medal count. For example, here’s an appropriate output file for paris_2024_medals_partial.csv:
Country,Gold,Silver,Bronze,Total
United States,1,1,1,3
Czech Republic,1,0,0,1
China,0,1,0,1
Canada,0,0,1,1
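As a rough illustration of producing a file like this one, the sketch below sorts a small hand-made tally by its Total value and writes it with the standard csv module. The tally list is made-up sample data, and an in-memory io.StringIO stands in for the real output file; both are assumptions for the sake of the example.

```python
import csv
import io

tally = [
    {"Country": "China", "Gold": 0, "Silver": 1, "Bronze": 0, "Total": 1},
    {"Country": "United States", "Gold": 1, "Silver": 1, "Bronze": 1, "Total": 3},
]
# Largest totals first; sorted() returns a new list and leaves tally alone.
ranked = sorted(tally, key=lambda row: row["Total"], reverse=True)

out = io.StringIO()  # stands in for open(output_csv_filename, "w", newline="")
writer = csv.DictWriter(out, fieldnames=["Country", "Gold", "Silver", "Bronze", "Total"])
writer.writeheader()        # header row, in the order given by fieldnames
writer.writerows(ranked)    # one CSV line per dictionary
lines = out.getvalue().splitlines()
```

The fieldnames list controls both the header text and the column order of every row.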
The headers for your output CSV file should be capitalized and appear in the order shown above. In the starter code, you’ve been provided a function that should help with the sorting - the function header and docstring are provided below for reference:
def sort_list_of_dict(list_of_dict, sort_key, descending):
    """
    Sorts a list of dictionaries by the values associated with one of the keys
    from the dictionaries

    Parameters:
        list_of_dict (List[Dict]): _description_
        sort_key (str): the key from the dictionaries to sort by
        descending (bool): sort in descending order if True (otherwise ascending)

    Returns:
        List[Dict]: The sorted list

    Note: does not modify the original list
    """
To earn an excellent, you should handle the case where the input file has a column order that differs from the example shown above.
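One possible way to be independent of column order is to look values up by header name rather than by position, which is what csv.DictReader does. This is only a sketch: the sample text below is made up, and io.StringIO stands in for a real file opened with open(input_csv_filename, newline="").

```python
import csv
import io

# Same columns as the example file, but in a different order.
sample = io.StringIO(
    "country,medal_type,discipline,event,athlete_name\n"
    "Canada,Bronze,Tennis,Mixed Doubles,Dabrowski Gabriela\n"
)
rows = list(csv.DictReader(sample))  # each row is a dict keyed by header name
first = rows[0]
```

Because first["country"] is found by name, the same lookup works no matter where the country column sits in the file.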
medal_tally | |
---|---|
Parameters | input_csv_filename (str), output_csv_filename (str) |
Return type | None |
Examples
You can test your script on paris_2024_medals_partial.csv and check that your results match those shown above. Once that is working, test on beijing_2022_medals.csv; your output should essentially match the real medal tally, with some minor changes due to doping violations.
Problem 2: html_checker
Write a function called html_checker that takes a filename html_filename and returns a dictionary containing the location of the first error (if any). As you may have guessed from the name, the input file will be an HTML (Hypertext Markup Language) file. HTML is the markup language that is used to create webpages.
The key thing to know about HTML for this problem is that the markup is in the form of nested tags. A basic HTML tag actually consists of a start tag and an end tag with content in between them. So, for example, we might see the header level one tag (h1) used like this: <h1>Header</h1>. Note that the end tag is indicated by the /.
Because tags have a start and an end, they can contain content, including other tags. The nesting can be arbitrarily deep. So, we could have content like this:
<body>
<h1>Header</h1>
<div>
<p>This is a paragraph</p>
<p>This paragraph has text in <em>italics</em> </p>
</div>
</body>
The role of html_checker is to report nesting errors. We have two things we are concerned with:
- all start tags must have a matching end tag
- tags cannot be interleaved. If a tag is opened inside of a parent, it must be closed before the parent is closed.
So, this is invalid:
<body>
<h1>Header</h1>
<div>
<p>This is a paragraph</p>
<p>This paragraph has text in <em>italics</em> </p>
As is this:
<body>
<h1>Header</h1>
<div>
<p>This is a paragraph</p>
<p>This paragraph has text in <em>italics</em>
</div>
</p>
</body>
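Both rules can be checked with a stack: push each start tag, and require every end tag to match the tag on top of the stack. The sketch below works on a simplified list of tag names rather than on a file; first_nesting_error and the leading-"/" convention for end tags are assumptions made for illustration.

```python
def first_nesting_error(tags):
    """Return the index of the first tag that breaks nesting, or None.

    tags is a list such as ["body", "p", "/p", "/body"]; a leading "/"
    marks an end tag. If start tags are still open after the last tag,
    the error index is len(tags), meaning "end of input".
    """
    open_tags = []  # stack of start tags that have not been closed yet
    for index, tag in enumerate(tags):
        if tag.startswith("/"):
            # An end tag must match the most recently opened start tag.
            if not open_tags or open_tags[-1] != tag[1:]:
                return index
            open_tags.pop()
        else:
            open_tags.append(tag)
    return len(tags) if open_tags else None


ok = first_nesting_error(["body", "h1", "/h1", "/body"])
interleaved = first_nesting_error(["body", "div", "p", "em", "/em", "/div", "/p", "/body"])
unclosed = first_nesting_error(["body", "div", "p", "/p"])
```

In the interleaved case the error is detected at "/div", because "p" is still on top of the stack at that point.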
If there is an error, html_checker will return a dictionary that contains two keys: "line number" and "character". The line number is the line where the error is detected, and the character is the position within that line where the error was detected. The point of detection will be either the end of the file, or the close of a tag while a child tag is still open.
As an additional wrinkle, the line numbers and the character numbers should start at 1, not zero, so they correspond to what is reported in a text editor.
If the file is fine, html_checker should return an empty dictionary.
An excellent solution will correctly process header tags with attributes, e.g., <h1 class="title">, which should be treated as a <h1> tag. You can assume that the < and > characters always indicate the start and end of a tag.
An excellent solution will also correctly handle self-closing tags. A self-closing tag will have the form <br />. The / at the end closes the tag without the need for an explicit end tag.
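To make the three kinds of tags concrete, here is a small sketch that takes the raw text of one tag (from < to >) and reports its name and kind. classify_tag is a hypothetical helper; it ignores things like comments and doctype declarations, which this sketch does not attempt to cover.

```python
def classify_tag(raw):
    """Split raw tag text such as '<h1 class="title">' into (name, kind).

    kind is "start", "end", or "self-closing".
    """
    inner = raw[1:-1].strip()  # drop the surrounding < and >
    if inner.startswith("/"):
        return inner[1:].split()[0], "end"
    if inner.endswith("/"):
        return inner[:-1].split()[0], "self-closing"
    # Anything after the first word is an attribute and is ignored.
    return inner.split()[0], "start"


with_attribute = classify_tag('<h1 class="title">')
closing = classify_tag("</h1>")
self_closing = classify_tag("<br />")
```

Only the first word inside the tag is kept as the name, so attributes never affect matching.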
html_checker | |
---|---|
Parameters | html_filename (str) |
Return type | dict, either empty or containing keys “line number” and “character” with int values |
Examples
The examples below are included with the starter code in the html_samples directory.
>>> html_checker("html_samples/example01.html")
{}
>>> html_checker("html_samples/example02.html")
{'line number': 6, 'character': 63}
>>> html_checker("html_samples/example03.html")
{'line number': 12, 'character': 0}
>>> html_checker("html_samples/example04.html")
{}
>>> html_checker("html_samples/syllabus.html")
{}
Problem 3: ris_to_bib
Write a function called ris_to_bib that takes in a filename ris_filename as input and returns a string representing the APA citations for the references in the RIS file. RIS is a structured file format for citations. You’ll need to handle a subset of the fields that can be specified in a RIS file, and you can assume that all of these fields are specified in your input files:
- TY is the type of reference, and comes first. In our case, it will be either JOUR (journal) or CONF (conference).
- AU is an author name. The citation may have multiple authors; in the APA citation, the order of the authors should match the order in the file.
- JO is the journal or conference name.
- PY is the publication year.
A single citation will always start with the TY field; the other fields may be in any order. There may be multiple AU fields in one citation, but the other fields will only appear once. A RIS file may contain multiple citations; the end of a citation is indicated by the field ER -. There is no newline between citations.
As an example, imagine you have the following RIS file as input:
TY  - CONF
TI  - Bert: Pre-training of deep bidirectional transformers for language understanding
AU  - Devlin, Jacob
AU  - Chang, Ming-Wei
AU  - Lee, Kenton
AU  - Toutanova, Kristina
JO  - Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers)
PY  - 2019
ER  -
TY  - JOUR
TI  - On computable numbers, with an application to the Entscheidungsproblem
AU  - Turing, Alan Mathison
JO  - Proc. of London Mathematical Society
PY  - 1937
ER  -
You’ll note that the field names all have two characters, and they are separated from the values with two spaces and a hyphen.
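Given that layout, one line can be split into its field name and value at the first hyphen. The following is a minimal sketch; parse_ris_line is a hypothetical helper name, and it relies on the assumption that the field name itself never contains a hyphen.

```python
def parse_ris_line(line):
    """Split one RIS line into (field_name, value)."""
    # The first hyphen is the separator, because the two-character field
    # name and the spaces before the hyphen cannot contain one.
    name, _, value = line.partition("-")
    return name.strip(), value.strip()


author = parse_ris_line("AU  - Chang, Ming-Wei")
end_marker = parse_ris_line("ER  -")
```

Splitting only at the first hyphen keeps hyphens inside values, such as the one in "Ming-Wei", intact.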
The output of your program should be:
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers).
Turing, A. M. (1937). On computable numbers, with an application to the Entscheidungsproblem. Proc. of London Mathematical Society.
There are a couple things to note about the formatting:
- In the bibliography, most authors are separated with just a comma; the last author is separated from the first \(n-1\) authors with a comma and an ampersand.
- Authors are represented in the format Last, F. M. Hyphenated first names are split and represented with initials. If a name has already been abbreviated in the RIS file (e.g., it ends in “.”), it should not be abbreviated further. You should allow for multiple middle names.
- The word “In” is placed before the journal or conference name if the publication type is CONF.
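The author rules above can be sketched as two small functions, one that abbreviates a single name and one that joins the formatted names. Both format_author and join_authors are hypothetical helpers, and the sketch assumes every name has the form "Last, Given names".

```python
def format_author(name):
    """Turn 'Last, First Middle' into 'Last, F. M.'."""
    last, _, given = name.partition(",")
    initials = []
    # Treat hyphens like spaces so 'Ming-Wei' yields two initials.
    for part in given.replace("-", " ").split():
        if part.endswith("."):
            initials.append(part)           # already abbreviated, keep as is
        else:
            initials.append(part[0] + ".")
    return last.strip() + ", " + " ".join(initials)


def join_authors(authors):
    """Join formatted authors with commas and a final ', & '."""
    if len(authors) == 1:
        return authors[0]
    return ", ".join(authors[:-1]) + ", & " + authors[-1]


names = ["Devlin, Jacob", "Chang, Ming-Wei", "Lee, Kenton", "Toutanova, Kristina"]
citation_authors = join_authors([format_author(n) for n in names])
```

With the four authors from the example file, citation_authors matches the author portion of the expected output.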
To earn an excellent, your solution should ignore fields that are not TY/AU/JO/PY but still process the rest of the reference. For instance, many references will include the VL (volume) field; your program should ignore it and still process the required fields.
ris_to_bib | |
---|---|
Parameters | ris_filename (str) |
Return type | str |
Problem 4: read_tab
Write a function called read_tab that takes in a single string argument (filename), which is the name of a file containing guitar tablature (or “tab”). Your function will read the tab file and return a list of the notes played.
This problem will come with a small dose of music theory and guitar instruction. It should be enough for those of you who have never held a guitar in your hand. For those of you who do have a guitar background, this will be a simplified explanation and set up for the sake of the problem…
What you need to know
Basic music theory
- Western music is dominated by the chromatic scale
- the scale consists of 12 pitches or notes
- the names of the notes are A, A#, B, C, C#, D, D#, E, F, F#, G, G#
- the scale is cyclic, so the next note after G# is A, and the note before A is G#
Guitars
- The guitar has six strings.
- In standard tuning, they are tuned to E, A, D, G, B, and e
- The lowercase e is the same note, one octave (one full cycle) higher
- The frets allow the player to raise the pitch of the string. So, pressing fret 1 on the E string will give us an F. Pressing fret 3 on a B string will give us a D.
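Because the scale is cyclic, the note at a given fret can be found with modular arithmetic on a list of the twelve note names. This is a minimal sketch; NOTES and note_at are names chosen for illustration.

```python
NOTES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]


def note_at(open_string, fret):
    """Return the note sounded by pressing `fret` on a string."""
    start = NOTES.index(open_string.upper())    # high 'e' is the same note as 'E'
    return NOTES[(start + fret) % len(NOTES)]   # wrap around after G#


f_note = note_at("E", 1)
d_note = note_at("B", 3)
```

The modulo step is what makes the note after G# come back around to A.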
Tablature
- Tab is a notation that shows a guitarist which strings to play, and on which frets to put their fingers
- There is one line per string in the diagram
- Numbers indicate a string is to be played and where the string should be fretted
- Tab is read from left to right, with an instant of time being a vertical slice across the strings
e|-0----------|
B|-------0----|
G|------------|
D|----2-----2-|
A|-------2----|
E|-0----------|
So, in the example above, the guitarist should play the E and e strings together with no fretting. Then they play the D string at fret 2, then the A string at fret 2 and the B string open. Finally they will play the D string at fret 2 again.
ASCII style tab is not really standardized, so we will make some refinements
- Every string will start with the string name followed by a "|"
- Each slice of time will be three characters wide. The first two characters represent the frets and the third is set to "-" to provide space between moments. If a number is only a single character wide, the leftmost character will also be a "-"
- A “|” indicates the end of the line
- If there are multiple lines in the file, they will be separated by a single blank line
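With these refinements, one string's line can be cut into three-character slices and each slice checked for a fret number. The sketch below does that for a single line; frets_in_line is a hypothetical helper and it assumes the line follows the format just described.

```python
def frets_in_line(line):
    """Return one entry per time slice: a fret number, or None if unplayed."""
    body = line.strip()[2:-1]  # drop the string name and '|' at the start, '|' at the end
    frets = []
    for start in range(0, len(body), 3):
        digits = body[start:start + 2].strip("-")  # first two characters hold the fret
        frets.append(int(digits) if digits else None)
    return frets


d_string = frets_in_line("D|----2-----2-|")
high_e = frets_in_line("e|-0----------|")
```

Slices from different strings that share the same index belong to the same moment in time.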
The problem
Your function should read in the tab in the file and return a list of the notes to play in order. You will need to detect which string is played, and then use the string and the fret to figure out which note was played.
Since we can play more than one note at a time, your output should be a list of lists, with the notes in each inner list ordered from high string (“e”) to low (“E”). So, given the tab above, the output would be [['E', 'E'], ['E'], ['B', 'B'], ['E']].
Note that in the tab, the high e is indicated with a lowercase e. You can safely ignore that in your solution – I am not going to have you try to indicate the octave of each note.
read_tab | |
---|---|
Parameters | filename (str) |
Return type | list of lists of strings |
Examples
To recreate these examples, you will need the files that created them. You will find them in the tab directory of the starter code. You are encouraged to make some of your own as you find things you need to test.
>>> read_tab("tab/tab01.txt")
[['E', 'E'], ['E'], ['B', 'B'], ['E']]
>>> read_tab("tab/tab02.txt")
[['E', 'B', 'G#', 'E', 'B', 'E'], ['E', 'B', 'G#', 'E', 'B', 'E'], ['E', 'C#', 'A', 'E', 'A', 'E'], ['E', 'C#', 'A', 'E', 'A', 'E'], ['G', 'D', 'G', 'D', 'B', 'G']]
>>> read_tab("tab/tab03.txt")
[['B', 'F#', 'D', 'A', 'E', 'B'], ['C', 'G', 'D#', 'A#', 'F', 'C'], ['C#', 'G#', 'E', 'B', 'F#', 'C#'], ['D', 'A', 'F', 'C', 'G', 'D']]
>>> read_tab("tab/tab04.txt")
[['E', 'E'], ['D'], ['B', 'B'], ['E'], ['B', 'E'], ['E'], ['B'], ['E'], ['G', 'G'], ['D'], ['D', 'B'], ['D'], ['D', 'G'], ['D'], ['B'], ['D'], ['E', 'E'], ['E'], ['B', 'B'], ['E'], ['B', 'E'], ['E'], ['B'], ['E'], ['E', 'A'], ['A'], ['C#', 'E'], ['A'], ['C#', 'A'], ['A'], ['E'], ['A']]