Homework 3: Part-of-Speech Tagging with the Viterbi Algorithm
Implementation Details
You should implement the Viterbi algorithm as described in the textbook. The starter code should help you out a lot - it includes code that will compute the initialization log probabilities (`self._init_log_probs`), transition log probabilities (`self._transition_log_probs`), emission log probabilities (`self._emission_log_probs`), and the unique tags (`self._tags`) for you. In the pseudocode in the textbook, these are referred to as \(\pi\), \(A\), \(B\), and \(Q\), respectively (see slp section A.4).
The final implementation should be similar to the pseudocode in the textbook, with a few modifications. The following modifications are made in the starter code:
- In the transition likelihoods, we implement Laplace smoothing to avoid 0 probabilities. Instead of adding 1, add `self.k_transition` (see the starter code - this is called add-k smoothing, and we discussed it briefly when discussing n-gram language models). If `k = 0`, there is no smoothing. Specifically, \(P_{smooth}(t_i \vert t_{i-1})\) can be computed as \(\frac{C(t_{i-1}, t_i) + k}{C(t_{i-1}) + k \cdot \vert Q \vert}\), where \(k\) is `self.k_transition` and \(\vert Q \vert\) is the number of unique tags.
- In your emission likelihoods, we add an unknown word, `"<UNK>"`, for each POS tag. The unknown word will have count `self.k_emission` (which may be a float). `self.k_emission` is also added to all other counts. Specifically, for all words where \(C(t_i, w_i) > 0\), \(P_{smooth}(w_i \vert t_i) = \frac{C(t_i, w_i) + k}{C(t_i) + k \cdot (\vert W \vert + 1)}\), where \(k\) is `self.k_emission` and \(\vert W \vert\) is the number of words that appear with tag \(t_i\). \(P_{smooth}(\text{UNK} \vert t_i) = \frac{k}{C(t_i) + k \cdot (\vert W \vert + 1)}\).
- When you see unknown word/tag pairs when executing the Viterbi algorithm, you should look up the probability of `"<UNK>"` associated with that tag.
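For concreteness, here is a minimal sketch of the two smoothed estimates, assuming the raw counts live in tuple-keyed dictionaries; the function and parameter names are hypothetical, since the starter code computes these tables for you:

```python
def smoothed_transition_prob(prev_tag, tag, transition_counts, tag_counts,
                             num_tags, k):
    """P_smooth(t_i | t_{i-1}) = (C(t_{i-1}, t_i) + k) / (C(t_{i-1}) + k * |Q|).

    transition_counts is keyed by (prev_tag, tag); num_tags is |Q|.
    """
    numerator = transition_counts.get((prev_tag, tag), 0) + k
    return numerator / (tag_counts[prev_tag] + k * num_tags)


def smoothed_emission_prob(tag, word, emission_counts, tag_counts,
                           num_words_with_tag, k):
    """P_smooth(w_i | t_i) = (C(t_i, w_i) + k) / (C(t_i) + k * (|W| + 1)).

    Looking up word="<UNK>" yields count 0, so the formula reduces to
    k / (C(t_i) + k * (|W| + 1)), matching the unknown-word case above.
    """
    numerator = emission_counts.get((tag, word), 0) + k
    return numerator / (tag_counts[tag] + k * (num_words_with_tag + 1))
```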
In the code that you write, your calculations should all be performed in log space to avoid underflow. We’ll convert probabilities of 0 to log probabilities of \(-\infty\).
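Since `math.log(0)` raises a `ValueError`, a tiny helper along these lines can do the conversion (the name `safe_log` is an illustrative choice, not something in the starter code):

```python
import math

def safe_log(p: float) -> float:
    """Return log(p), mapping a probability of 0 to -inf instead of
    raising ValueError. The 0 case only arises when k = 0."""
    return math.log(p) if p > 0 else float("-inf")
```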
Resources
Starter Code
The starter code for this assignment provides a main method and an abstract `POSTagger` class. Your main task is to fill in the `_predict_one` method of the `HMMPOSTagger` subclass. You might also edit the `__init__` method of the subclass. Do not edit any of the other classes.
The starter code also provides a baseline POS tagger, `BaselinePOSTagger`. The `train` method creates a dictionary that stores the most common tag for each token in the training data. That tag is chosen by the `_predict_one` method if the token appeared in the training data. If the token didn’t occur in the training data, the most common tag overall is chosen.
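In rough outline, that prediction logic looks like the sketch below (the argument names are illustrative, not the actual attributes of `BaselinePOSTagger`):

```python
def baseline_predict_one(sentence, tag_for_token, overall_most_common):
    """Tag each token with its most frequent training tag; unseen tokens
    fall back to the most frequent tag overall.

    tag_for_token: dict mapping token -> its most common tag in training
    overall_most_common: single most frequent tag across all training data
    """
    return [tag_for_token.get(tok, overall_most_common) for tok in sentence]
```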
You may find that some code from homework 2 is helpful, and you should feel free to copy it over!
Data Files
Two data files are available for you, `POS_train.txt` (training data) and `POS_dev.txt` (development data). Each file has sentences separated by newlines. Each sentence is formatted as

`tok_1/tag_1 tok_2/tag_2 ... tok_n/tag_n`
The data is drawn from Universal Dependencies. It has been reformatted for our class to make it easier to process without external libraries.
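If you want to inspect the data yourself, one possible parsing sketch is below; the function name is hypothetical, and `rsplit("/", 1)` splits on the last slash in case a token itself contains one:

```python
def read_tagged_sentences(path):
    """Parse newline-separated sentences of tok/tag pairs into a list of
    (tokens, tags) tuples, one per sentence."""
    sentences = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            pairs = [item.rsplit("/", 1) for item in line.split()]
            sentences.append(([tok for tok, _ in pairs],
                              [tag for _, tag in pairs]))
    return sentences
```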
Allowed External Resources
External libraries that are not part of the python standard library, including numpy, are not allowed on this assignment and will lead to failed autograder tests. The python `math` module is sufficient to handle the mathematics needed in this assignment. You may use any built-in python modules (in particular, you may find that `Counter` and `defaultdict` from the `collections` module are useful).
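For instance, a sketch of how those two classes could gather the counts that the smoothing formulas above use (all names here are illustrative):

```python
from collections import Counter, defaultdict

def collect_counts(training_sentences):
    """Gather the counts used by the smoothing formulas from a list of
    (tokens, tags) sentence pairs."""
    tag_counts = Counter()           # C(t)
    transition_counts = Counter()    # C(t_{i-1}, t_i), keyed by (prev_tag, tag)
    emission_counts = Counter()      # C(t, w), keyed by (tag, word)
    vocab_by_tag = defaultdict(set)  # distinct words per tag; len(...) is |W|
    for tokens, tags in training_sentences:
        for i, (tok, tag) in enumerate(zip(tokens, tags)):
            tag_counts[tag] += 1
            emission_counts[(tag, tok)] += 1
            vocab_by_tag[tag].add(tok)
            if i > 0:
                transition_counts[(tags[i - 1], tag)] += 1
    return tag_counts, transition_counts, emission_counts, vocab_by_tag
```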
While you may not search for code that solves this problem, you may search for ideas for improving your smoothing and handling of unknown words.
Deliverables
Code
You should submit a python script `model.py` on Gradescope. Your main goal is to write an HMM part-of-speech tagger that exceeds the accuracy of the baseline tagger. This is actually not a trivial task, as the `BaselinePOSTagger` is a pretty good baseline.
Suggestions for Developing your Model
To make sure that you have gotten the basic implementation of the Viterbi algorithm correct, I would recommend starting out by running `test_mini.py`, which runs through two examples we have completed in class. You might need to debug by checking intermediate computations if your final results are incorrect.
Once those test cases are working, you can test your full model using `test.py` on the larger dataset from Universal Dependencies. If your model is working on the `test_mini.py` examples but doesn’t exceed the baseline, that’s expected - you need to test out different values for `k_emission` and `k_transition` to optimize your model!
Your tasks
The following components are required:
`_predict_one` method
The `_predict_one` method should return a list of POS tags given a sentence (represented as a list of tokens). It should be implemented as described in slp Figure A.9. Here are a couple of things to be aware of in your implementation (a sketch follows below):
- While slp uses a matrix to represent the `viterbi` and `backpointer` arrays, using a dictionary or dictionaries is also an appropriate approach, and might be easier than using a list of lists.
- Your probabilities will be log probabilities, which means that you’ll need to use `+` instead of `*` when the pseudocode shows multiplication of probabilities, because \(\log(M \cdot N) = \log(M) + \log(N)\). This is what we did in our examples in class!
Be wary of an approach that represents the Viterbi table with words as dictionary keys. This approach leads to bugs if the same word appears multiple times in the text.
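Here is a minimal sketch of a log-space Viterbi decode using dictionaries keyed by (position, tag). The table shapes passed in (`init_lp`, `trans_lp`, `emit_lp`) are assumptions about how the log-probability tables might be indexed, not a description of the actual starter code attributes:

```python
def viterbi_decode(tokens, tags, init_lp, trans_lp, emit_lp):
    """Log-space Viterbi decode for one sentence.

    Assumed (hypothetical) table shapes:
      init_lp[t]     = log pi(t)
      trans_lp[s][t] = log P(t | s)
      emit_lp[t][w]  = log P(w | t), with an "<UNK>" entry per tag
    Keys of viterbi/backpointer are (position, tag), so a word that repeats
    in the sentence cannot overwrite an earlier cell.
    """
    def emit(t, w):
        return emit_lp[t][w] if w in emit_lp[t] else emit_lp[t]["<UNK>"]

    viterbi, backpointer = {}, {}
    for t in tags:  # initialization step
        viterbi[(0, t)] = init_lp[t] + emit(t, tokens[0])
        backpointer[(0, t)] = None
    for i in range(1, len(tokens)):  # recursion step: add logs, don't multiply
        for t in tags:
            best_prev = max(tags, key=lambda s: viterbi[(i - 1, s)] + trans_lp[s][t])
            viterbi[(i, t)] = (viterbi[(i - 1, best_prev)]
                               + trans_lp[best_prev][t] + emit(t, tokens[i]))
            backpointer[(i, t)] = best_prev
    # termination: best final tag, then follow backpointers to the front
    last = max(tags, key=lambda t: viterbi[(len(tokens) - 1, t)])
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(backpointer[(i, path[-1])])
    return list(reversed(path))
```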
`__init__` method
There is already an `__init__` method provided in the code. However, once you write your code, you’ll see that you don’t necessarily improve upon the baseline with these arguments. To actually make an improvement, test out different values for the `k_transition` and `k_emission` arguments (try both higher and lower values than 1). Once you find something that works, change the default values for `k_transition` and `k_emission` by updating these two lines of code:
```python
def __init__(self, k_transition: float = 1,
             k_emission: float = 1, extension: bool = False):
```
Possible Extensions (OPTIONAL)
If you’d like to learn more, you can try additional methods for handling unknown words (which would require updates to `train`). Here are some ideas:
- Some types of words are more likely to be unseen in the training data than others - in particular, this includes open class words like nouns. One way to identify these types of words in a data-driven fashion is to count the number of words that appear only once with each POS tag - the more such words there are, the more likely it is for an unknown word to have that tag. Try to incorporate this observation in your model (see the sketch after this list). This webpage might be helpful:
- Morphology might also give you some hints about the tag of an unknown word - for instance, words ending in `"ly"` are more likely to be adverbs. Try to incorporate this observation in your model. Here’s a resource that might be helpful:
- Finally, the TnT paper (one of our supplemental reading assignments) has some ideas for improving HMM POS taggers!
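As a hedged illustration of the first idea, hapax counts per tag could be turned into an unknown-word tag distribution roughly as follows (all names are hypothetical, and this is a sketch of the idea rather than a tuned method):

```python
from collections import Counter

def hapax_tag_distribution(emission_counts, tag_counts):
    """Estimate P(tag | unknown word) from hapax legomena: words that
    appear exactly once with a given tag. Tags that attract many
    one-count words are more plausible tags for unseen words.

    emission_counts: Counter keyed by (tag, word); tag_counts: Counter of tags.
    """
    hapax_per_tag = Counter()
    for (tag, word), count in emission_counts.items():
        if count == 1:
            hapax_per_tag[tag] += 1
    total = sum(hapax_per_tag.values())
    if total == 0:  # no hapaxes at all: fall back to a uniform distribution
        return {tag: 1 / len(tag_counts) for tag in tag_counts}
    return {tag: hapax_per_tag[tag] / total for tag in tag_counts}
```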
If you implement one of these methods, your code should still work as described in the assignment unless the argument `extension` is set to `True` in your `HMMPOSTagger`! Use conditionals to switch between the two functionalities. The `extension` flag will be used when testing your code for the leaderboard.
Report
In addition to your code, you must fill out the report in `report.md` (from the starter code).