Course Project

In the final project you will apply what you have learned about data analysis, machine learning and generative AI to implement a model or analysis of moderate scope. The project will culminate in an electronic poster session at the end of the semester. You will ideally work in teams of 2-3.

Picking a project

Your project must:

  1. Incorporate data in some way, either by collecting your own data or using an existing dataset.
  2. Use or create at least one machine learning model, which could include using a pre-trained generative model.
  3. Incorporate quantitative evaluation of a relevant metric, e.g., measuring the accuracy of a classifier.
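
As one illustration of the third requirement, the sketch below computes classification accuracy on a held-out set. It is a minimal example, not a required approach: the `accuracy` helper and the label lists are made up for illustration, and your project may call for a different metric entirely.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical held-out labels and model predictions (invented for this example).
y_true = ["spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "ham"]

print(f"accuracy = {accuracy(y_true, y_pred):.2f}")  # 4 of 5 correct
```

Reporting the metric alongside a simple baseline (for example, always predicting the most common label) usually makes the number easier to interpret.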

Within those very broad requirements there are many potential projects. Some ideas are listed at the end of the assignment, but you are encouraged to explore whatever topics are of interest to you. We are happy and eager to help you brainstorm ideas!

You are not expected to tackle a “big” dataset or achieve state-of-the-art (or even “positive”) results. We have limited time and computational resources. A successful project satisfies the above requirements and demonstrates your understanding of the course material. It does not need to be “groundbreaking”, novel (in the research sense of the word), or even “work” at its intended goal. Replicating previous work with different data or a slightly different approach is OK. Clearly documenting your results and why you hypothesize the model didn’t work as intended is a successful project. Your project does need to be independent work, i.e., simply rehashing an existing tutorial or blog post is not sufficient.

Deliverables:

Project proposal
A one-paragraph description of the goals of your project and your plan to achieve those goals. This paragraph should include the data you plan to use, what model(s) you will create or use, and how you will quantitatively evaluate your results. This proposal is not a contract; you are permitted (and expected) to deviate from it as your project evolves.
Project diary
We expect you to work approximately TBD hours on this project. For each hour you spend working, write exactly two sentences: one describing what you set out to do in that hour and a second describing what you accomplished in that hour. These are individual, not team, progress reports; every student will submit their own. The purpose of these reports is to encourage you to make steady and substantive progress and to practice breaking your project down into small, manageable tasks. That doesn’t mean your effort has to be evenly distributed (I recognize you have other responsibilities), and you may have extended debugging periods where it doesn’t feel like you are making any progress. Both are OK, and the latter is normal!
Project code
Either submit a link to a publicly viewable GitHub repository, or zip up your code and submit it to the relevant assignment. Your code should include a README file with instructions on how to run your code and reproduce your results. Other artifacts, such as data/model cards, should also be included here.
Poster

There is no formal final report. Instead, you will prepare and present a poster at a class-wide electronic poster session. Check out the poster template for additional instructions about size, fonts, etc. You will likely want to use Google Slides/PowerPoint or Illustrator to make the poster. Whatever software you use, please submit a PDF of your poster to the relevant assignment.

Your poster should describe your project and results in a story that traces from the upper left to lower right. The necessary components are:

  1. Goal: In just a sentence or two describe the goal of your project
  2. Background: In a few paragraphs (and a figure if relevant), provide the necessary background information so a classmate could understand your project
  3. Data: Describe the data you used, including how you collected/generated it (if relevant)
  4. Model/Analysis: What did you do with your data? Describe your model architecture or analysis approach/methodology.
  5. Results: Present your results with relevant figures and quantitative metrics. This section should include a synthesis of results, i.e., some discussion of what you could conclude from those results, not just the results alone.
  6. Responsible computing: Consider, and if relevant, address any ethical, societal, or environmental implications of your project.

Your poster only has so much space, so you will need to be concise and make deliberate choices about what is most important to include. Use figures and bullet points where possible (a poster is not a report and so does not need to be written in prose). In many cases, what appears on the poster will be a distillation of more detailed explanations found in your notebooks, data/model cards, etc.

Some additional notes about your poster:

  • Make sure your poster clearly communicates your results. As a practical matter, multiple groups will be presenting at one time during the poster session, thus we can’t meaningfully take in your quantitative results orally, in the moment. Instead we will review your poster and other materials in depth afterwards.
  • Aim for a generic scientific audience (i.e., don’t reference our class) with a similar level of background knowledge as you and your classmates. Recall that while your audience is familiar with the topics from class (e.g., you don’t need to explain what a neural network is), they likely don’t have any specific knowledge about your project topic (i.e., the specific algorithm or data you used).
  • Watch out for and eliminate weasel words (like “interestingly” and other beholder words) that sound quantitative without actually conveying information. Concision is key. Wherever relevant, aim for a “just the facts” style.
  • Use inline citations, i.e., numbers indexing into a references list in the corner of your poster.

For inspiration check out the posters in 75SHS.

Evaluation

Your project will be evaluated based on the following attributes using the “EMRN” rubric (Exemplary, Meets expectations, Revision needed, Not assessable). An exemplary project will have the following attributes:

  • Methodology: Approach is methodologically sound and sufficiently comprehensive to solve the intended problem (within resource constraints). Any software artifacts, such as notebooks, are high-quality with appropriate use of libraries, clear and mechanistically solid text, effective visualizations, and appropriate citations.
  • Results: Obtained specific results relevant to the intended problem. Derived thoughtful, insightful and carefully qualified conclusions from those results.
  • Poster: Clearly and effectively visually presents problem, data, methods, results and implications for responsible computing. Poster is high quality with effective figures, clear and mechanistically solid text and appropriate citations.
  • Responsible computing: Thoughtfully considered and, where possible, acted to mitigate ethical, societal, and/or environmental implications of the project. Any data/model cards are high quality.

The evaluation will be weighted in part by difficulty. When choosing a project you should aim to balance ambition with the likelihood of successfully achieving your goals. Perfect execution of a very limited project would not meet expectations, while imperfect execution of a more ambitious project could meet expectations or even be considered exemplary. But an impossible problem will be impossible to execute. Meeting your proposed goals is not a requirement (we can’t always predict the obstacles we will face), but you are expected to make an appropriate effort to achieve realistic goals within the constraints of the course.

You will receive feedback on your initial project submission, and have the opportunity to revise and resubmit before the final deadline.

Project ideas

Here are some potential project ideas to get you started. These are just suggestions; we encourage you to pursue whatever topics interest you! Please come talk to us about your ideas. We are eager to help you brainstorm and appropriately scope your project.

  • Identify interesting or relevant Middlebury data sources such as campus environmental data, the course catalog, IRS filings, etc. The specific analysis would depend on the data, but could include building predictive models of quantitative data, embedding-based analysis or search tools for text sources, etc. For example, predict future energy demand, build a course recommendation system, compare Middlebury IRS filings to peer institutions, etc.
  • Create and quantitatively optimize prompts for LLMs to automate or augment an existing “manual” workflow, e.g., interpreting/classifying historical documents, performing content moderation, making auditing decisions, etc.
  • Train simple generative models on small datasets of interest, e.g., plays, then quantitatively evaluate the impact of different modeling choices (e.g., tokenization, architectures) on performance.
  • Generate synthetic data to protect privacy. Train a model to create synthetic data in a domain of interest (e.g., images, text, tabular data), then evaluate how predictive models trained on synthetic data perform on real data.
  • Identify how problems from other domains, e.g., genomics, can be mapped to use “off-the-shelf” text or image analysis techniques. Quantitatively evaluate how well these techniques perform compared to domain-specific approaches.
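
To make the synthetic-data idea above more concrete, here is a toy sketch of the "train on synthetic, test on real" comparison. Everything in it is an assumption chosen for illustration: the "real" data is simulated, the "generative model" is a per-class Gaussian, and the classifier is a nearest-centroid rule. A real project would substitute an actual dataset, generator, and predictive model.

```python
import random
import statistics


def make_real(n, rng):
    """Simulated stand-in for 'real' data: one feature, two classes."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(2.0 * label, 1.0), label))
    return data


def fit_generator(data):
    """Toy generative model: a (mean, stdev) pair fitted per class."""
    params = {}
    for label in sorted({y for _, y in data}):
        xs = [x for x, y in data if y == label]
        params[label] = (statistics.mean(xs), statistics.stdev(xs))
    return params


def sample_synthetic(params, n_per_class, rng):
    """Draw synthetic (feature, label) pairs from the fitted Gaussians."""
    return [(rng.gauss(mu, sd), label)
            for label, (mu, sd) in params.items()
            for _ in range(n_per_class)]


def fit_centroids(data):
    """Nearest-centroid classifier: store the mean feature value per class."""
    labels = sorted({y for _, y in data})
    return {label: statistics.mean([x for x, y in data if y == label])
            for label in labels}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def accuracy(centroids, data):
    """Fraction of examples in data that the classifier labels correctly."""
    return sum(predict(centroids, x) == y for x, y in data) / len(data)


rng = random.Random(0)  # fixed seed so the comparison is repeatable
real_train = make_real(200, rng)
real_test = make_real(200, rng)
synthetic = sample_synthetic(fit_generator(real_train), 100, rng)

# Both classifiers are evaluated on the same held-out real data.
acc_real = accuracy(fit_centroids(real_train), real_test)
acc_synth = accuracy(fit_centroids(synthetic), real_test)
print(f"train on real: {acc_real:.2f}, train on synthetic: {acc_synth:.2f}")
```

The gap between the two accuracies is the quantity of interest: a small gap suggests the synthetic data preserves the signal the predictive model needs, though it says nothing by itself about how much privacy is protected.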



© Michael Linderman and Phil Chodrow, 2026