Rule-Based Tokenization Activity

In this activity, you will experiment with NLTK’s tokenizer and try to implement some of its functionality. However, a sub-goal of this activity is to get you set up using vscode and introduce you to the vscode debugger, which I hope you will use on your homework assignments! If that’s most of what you complete during the time allotted for today’s activity, I’ll still consider it time well spent.

Part 0: Setting up vscode

Download vscode

Many students already use vscode as their main editor, but if you don’t, you’ll need to download it here. Go through the installation process before continuing to the next step.

I strongly recommend using vscode in this class. I think that getting acclimated to its debugger will be helpful for your programming skills. Additionally, later in the semester, I will have you connect to the ada cluster using vscode - it’s certainly possible to do that directly in your terminal, but I think you’ll have an easier time if you’re already familiar with using vscode!

Downloading and opening starter code

Click here to download the starter code for this activity. This will download a zip file, which you should unzip.

Then, in vscode, click the Open… button and open the directory containing all of the files.

At this point, your vscode window should look something like what you see below.

There are four files:

  • emma_nltk_tokenized.json: Sentences from Emma by Jane Austen and their corresponding tokens extracted using the nltk tokenizer. I pre-processed this for you because I don’t want you to worry about installing nltk as part of this activity!
  • report.md: If you get far enough in today’s activity, you’ll write some of your findings here.
  • tokenizer_activity.py: This is the main python script you’ll be working with today.
  • .vscode/launch.json: This is a configuration file for the debugger in vscode. In this case, it sets up vscode to run the tokenizer_activity.py script in the debugger. I’ll distribute a similar file with homework assignments, and they may contain multiple “Launch Targets” that will allow you to run multiple scripts in the debugger. In those assignments, you can access different scripts with the dropdown. For now though, there’s just one script!
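Once you’ve opened the folder, you can poke at the data file in a few lines of Python. The exact structure of emma_nltk_tokenized.json may differ from this sketch - I’m assuming, hypothetically, a list of records pairing each sentence with its NLTK tokens, so check the actual file to see its true structure:

```python
import json

# A tiny stand-in for the real file; the record keys here ("sentence",
# "tokens") are assumptions for illustration, not the guaranteed format.
example = json.loads("""
[
  {"sentence": "Emma Woodhouse was happy.",
   "tokens": ["Emma", "Woodhouse", "was", "happy", "."]}
]
""")

for record in example:
    print(record["sentence"], "->", record["tokens"])
```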

Set up your python environment

You should install the Python Debugger extension. You can do that by clicking the link, then clicking Install, which will take you to the installation page in vscode.

Open the Command Palette (⇧⌘P), start typing the Python: Create Environment command to search, and then select the command. For this particular activity, you can use any python installation that is already on your laptop, so select any existing interpreter that you see in the drop-down.1

If you aren’t able to select a python interpreter, it’s possible that either you don’t have python installed or vscode does not work with your python installation. Please follow the instructions on the vscode website to install a python interpreter. You shouldn’t need a virtual environment for this activity, but you can set one up if you want to.

Once you’re done, try selecting your interpreter again.

Part 1: Catching a bug

Without reading through tokenizer_activity.py, run it in debug mode in vscode. To do that, navigate to the Run and Debug view by clicking the play/bug button on the left of your screen.

Then, click on the green play button that you’ll now see in the top left. Note that this is being configured by .vscode/launch.json.

There is a pretty obvious bug, visible from the syntax highlighting, that will lead to a runtime error in tokenizer_activity.py. Still, humor me and run the script in the debugger before fixing the bug.

If you have your debugger set up to set a breakpoint on uncaught exceptions, you should find that the execution of your program stops when the exception is raised. Now let’s take a minute to answer some questions using the debugger:

If you don’t have this set up, look for checkboxes in the BREAKPOINTS tab at the bottom of your screen, and check Uncaught Exceptions.

What has caused the exception?

This should be pretty easy to figure out, if you haven’t already. Look at the grey box showing the exception on your screen to see the type of error.

What line did the exception occur on?

Again, this one should be pretty easy to figure out. Look at the line highlighted in yellow, and the text in the exception itself.

What is the string that is being tokenized?

There are a number of ways to figure this out, but I’d like you to look at the VARIABLES tab on the top-left side of your screen. This shows the values of all local variables at the point when the exception occurred.

What are NLTK’s tokens for the sentence that the custom tokenizer just failed on?

OK, this one’s a harder question and seems a bit obscure. However, in this class, you’ll often want to figure out which edge case is causing your program to fail, and that might be related to something that is happening outside of the current function. To figure this out, we’ll explore the call stack.

Look for the call stack in your interface, and click around. See if you can answer the question on your own by exploring, but if not, you can look at the hint below.

Look for CALL STACK on the bottom-left side of the screen, then click on the compare_tokenizers function.

Then, look at VARIABLES on the top-left of your screen. You should be able to see a variable nltk_tokens in your local variables.

Fix the bug, and then continue to the next part.

Part 2: Setting breakpoints

You may have set breakpoints in the debugger before. To set a breakpoint, hover your mouse to the left of a line number, then click. You’ll see a red dot next to the line number. The breakpoint will pause the execution of your program at the line it is set on.

Once you’ve set your breakpoint, try clicking the Step Over button. This will allow you to trace the execution of the tokenize function. Once you’ve done that, stop the debugger, and we’ll try something a bit more complex.

More advanced breakpoints

The compare_tokenizers function tokenizes sentences and prints a summary of how the tokenization differs, showing 10 random sentences. Looking at the output, you might see that our tokenizer treats "Mr." as two tokens ("Mr" and "."), while the NLTK tokenizer treats it as a single token, which makes more sense. If you don’t see this in the output, try running the script a few more times to get a different random set of sentences!
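To make the "Mr." behavior concrete, here’s a minimal regex-based tokenizer in the spirit of what you might see. This is an illustrative guess, not the actual tokenize function from tokenizer_activity.py:

```python
import re

def naive_tokenize(string):
    # Split the string into runs of word characters or single punctuation
    # marks. Note how this splits the period off of "Mr." - an illustrative
    # stand-in for the kind of behavior you might observe in the starter code.
    return re.findall(r"\w+|[^\w\s]", string)

print(naive_tokenize("Mr. Knightley was there."))
# → ['Mr', '.', 'Knightley', 'was', 'there', '.']
```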

Vscode has a feature that allows you to set a conditional breakpoint. We are going to set a breakpoint that pauses execution only if some expression is true - in this case, "Mr." in string. Essentially, we want to see what our function does when the string contains "Mr.", so that we can update our function to be more in line with the NLTK tokenizer.

To set a conditional breakpoint, right click to the left of the first line of the function, and click Add Conditional Breakpoint…. You’ll insert the expression "Mr." in string, then hit your return/enter key to set the breakpoint.

Now, run the script again to see what happens. You’ll see that execution is paused only for those sentences containing "Mr.". Step through the code to see how that token is processed. Once you have a sense of what happens, stop the debugger, update the code to match NLTK’s behavior, and try again to ensure that the token is processed correctly. You might want to keep the breakpoint for now, to ensure that you are able to see how a sentence with the "Mr." token is processed!
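One possible shape for the fix is to check each whitespace-separated chunk against a small abbreviation list before splitting punctuation into its own token. This is a hedged sketch, not the actual starter code - the abbreviation set and function name here are made up for illustration, and your tokenize function may be structured quite differently:

```python
import re

# Hypothetical abbreviation list; the real set you need depends on what
# differences you actually observe against the NLTK output.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Ms."}

def tokenize_keeping_abbrevs(string):
    # Check whitespace-separated chunks against the abbreviation list
    # before splitting trailing punctuation off as a separate token.
    tokens = []
    for chunk in string.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

print(tokenize_keeping_abbrevs("Mr. Knightley was there."))
# → ['Mr.', 'Knightley', 'was', 'there', '.']
```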

Part 3: More tokens

Now that you’ve learned some debugging skills, work on the following tasks related to improving the tokenizer:

Task 1: more tokens

Open up report.md and list at least three ways (besides the one above) in which the behavior of the tokenizers differ from each other. Feel free to run the compare_tokenizers function multiple times or change the num_sents argument to print more sentences!

Note that this is a markdown file. You’ll be working with markdown files in your homework, so it’s good to get used to them. If you don’t have experience with markdown, this is a good resource to get started. You can use this site to render/preview your markdown file (you can also render it in vscode).

Task 2: updating the tokenization

Now, update the tokenize function in tokenizer_activity.py such that the differences you listed no longer exist. You might want to use the debugger to help you!
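When eyeballing token lists gets tedious, a tiny helper that reports the first mismatch can speed things up. This is an illustrative utility, not part of the starter code:

```python
def first_difference(ours, nltk_tokens):
    """Return (index, our_token, nltk_token) for the first mismatch, or None."""
    for i, (a, b) in enumerate(zip(ours, nltk_tokens)):
        if a != b:
            return (i, a, b)
    if len(ours) != len(nltk_tokens):
        # One list is a prefix of the other; the mismatch is at the shorter length.
        i = min(len(ours), len(nltk_tokens))
        return (i,
                ours[i] if i < len(ours) else None,
                nltk_tokens[i] if i < len(nltk_tokens) else None)
    return None

print(first_difference(["Mr", ".", "Knightley"], ["Mr.", "Knightley"]))
# → (0, 'Mr', 'Mr.')
```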

As long as you have time, you can continue to iterate - identify more differences (which you can add to your list above), then update your tokenizer to fix them. At some point, you might find differences that you don’t think should be fixed. Perhaps the NLTK tokenizer does something that you think is wrong or inconsistent! If you find any such instances, list them in report.md.


This is the main formal debugging guide I’ll give you in class, as debugging isn’t the focus of an NLP course. However, I hope that you’ll continue to hone your skills with the debugger, as it will be incredibly helpful when you’re working through the homework assignments in this course. If you want to learn more (there’s always more to learn), I’d be happy to talk during my drop-in hours.

Footnotes

  1. To learn more about setting up Python environments in vscode, you can look at this page. If you want to look into this, please do it after class.↩︎