Text Analysis Pedagogy Institute
In partnership with:
- University of Virginia, School of Data Science
- University of Virginia, Library
- University of Arizona, University Libraries
Funding for 2021-2022 provided by:
The National Endowment for the Humanities
Funding for 2023-2024 provided by:
ITHAKA
The Text Analysis Pedagogy (TAP) Institute offered free training for text analysis teachers from 2021 to 2024. During this time, thousands of educators from around the globe participated in classes, panels, and other events. All of the teaching and learning materials for the TAP Institute have been released as open educational resources under the CC BY license.
2024
Python Basics (5 notebooks)
Author: Nathan Kelber
If you've never programmed before, this course is a great introduction. Taught from a humanist perspective, this course will help you start writing your first code and unlock the potential of text analysis.
Automated Text Classification Using LLMs (3 notebooks)
Author: Erik Fredner
Learn the basics of using a large language model—specifically ChatGPT—for text classification. Using the ChatGPT application programming interface (API), explore how LLMs can assist humanists with various text classification tasks, including binary classification, labeling, applying confidence intervals to judgments, and more.
Large Language Models and Embeddings for Retrieval Augmented Generation (3 notebooks)
Author: Grant Glass
Explore large language models and embeddings, with a specific focus on their application in retrieval augmented generation (RAG) solutions. RAG is a groundbreaking approach that combines the strengths of pre-trained language models with external knowledge retrieval, enabling the generation of highly informative and contextually relevant text.
spaCy in the World of LLMs (3 notebooks)
Author: William Mattingly
Past TAP Institute courses have focused on spaCy. Over the last year, however, large language models (LLMs) have radically changed the natural language processing (NLP) landscape. This course teaches students the basics of spaCy within this new world.
Introduction to Vector Databases and Semantic Searching (3 notebooks)
Author: William Mattingly
Learn the fundamentals of vector databases: what they are, how they work, and how to build them. Build local vector databases and learn to create cloud-based ones. This includes the basics of semantic searching, its advantages and disadvantages, and ways to perform semantic search on local and cloud-based vector databases.
Introduction to Retrieval Augmented Generation (3 notebooks)
Author: William Mattingly
LLMs have two major limitations: their training data is fixed and they have the potential to hallucinate. In this course you will learn how to address these issues with retrieval augmented generation (RAG). RAG systems combine vector databases with large language models. The user poses a question, which is used to retrieve relevant material from a vector database. The question and the retrieved material are then combined and given to a large language model to frame a domain-specific and up-to-date response. Learn these concepts and build your own RAG applications, specifically with Verba.
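The retrieve-then-combine flow described above can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: the "embedding" is a toy bag-of-words count rather than a real embedding model, the "vector database" is an in-memory list, the sample documents are made up, and the helper names (`embed`, `cosine`, `retrieve`, `build_prompt`) are hypothetical. It does not use Verba and it stops short of calling any LLM; it only builds the prompt that would be sent to one.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding: a bag-of-words count vector (assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    q_vec = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q_vec, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Combine the retrieved context with the question into one prompt string."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical mini-corpus standing in for a vector database.
documents = [
    "The TAP Institute offered free training for text analysis teachers.",
    "Tesseract is an engine for optical character recognition.",
    "Altair is a Python library for statistical visualization.",
]
question = "Which engine performs optical character recognition?"
prompt = build_prompt(question, documents)
print(prompt)
```

In a real RAG application the `embed` function would be replaced by a trained embedding model and the list by a vector database, but the shape of the pipeline (embed the question, rank stored documents, paste the best matches into the prompt) stays the same.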
Small Language Models (3 notebooks)
Author: J.D. Porter
Demystify some of the ideas behind LLMs by building a basic model. The models will not have the generative power of OpenAI, Llama, or Gemini, but you will be able to make comprehensible models yourself using only a laptop and a small corpus of texts, tweak them, and have fun playing with parameters.
Hugging Face Transformers (3 notebooks)
Author: Nathan Kelber
Find and use datasets, models, and more with the Hugging Face Transformers library. This introduction will help you start applying open language models in your research projects.
2023
Python Basics (5 notebooks)
Author: Nathan Kelber
If you've never programmed before, this course is a great introduction. Taught from a humanist perspective, this course will help you start writing your first code and unlock the potential of text analysis.
Pandas Basics (3 notebooks)
Author: Zhuo Chen
Pandas is a Python library designed for working with tabular data. In this three-day course, you will learn the two fundamental objects in Pandas: Pandas Series and DataFrame. You will also learn how to index and select data from a DataFrame, how to slice and filter a DataFrame, and how to work with missing values in a DataFrame.
Pandas Intermediate (3 notebooks)
Author: Zhuo Chen
Pandas is a Python library designed for working with tabular data. After completing Pandas Basics, take this course to learn how to use the summary functions and maps in Pandas, how to group and sort data, how to work with time series data, and how to create simple plots in Pandas.
spaCy (9 notebooks)
Author: William Mattingly
spaCy is one of the more widely used libraries for performing natural language processing (NLP) in Python. It allows you to leverage pre-designed pipelines for processing texts or create custom pipelines tailored to your own use case. This course will introduce you to all the basics of spaCy.
Finding Word Meaning Through Context (3 notebooks)
Author: J.D. Porter
J.R. Firth once wrote, “You shall know a word by the company it keeps.” In this workshop, we learn how to find the company that words keep. We will introduce basic methods for working with text files in Python, as well as some common statistical measures for determining which words are characteristic of which texts. Then we will learn how to find “collocates”, or the words that appear near any given term in a text.
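As a rough illustration of what finding collocates can look like, here is a minimal sketch that counts the words appearing within a fixed window around a target term. The function name `find_collocates`, the window size, the simple regex tokenizer, and the sample sentence are all assumptions made for this example; they are not taken from the course notebooks.

```python
import re
from collections import Counter


def find_collocates(text, target, window=2):
    """Count the words that appear within `window` tokens of `target`."""
    # Lowercase and split into simple word tokens (letters and apostrophes only).
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        # Gather neighbours on both sides, leaving out the target itself.
        start = max(0, i - window)
        neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
        counts.update(neighbours)
    return counts


# Hypothetical sample text.
sample = "The whale surfaced. The white whale dived, and the whale was gone."
collocates = find_collocates(sample, "whale", window=2)
print(collocates.most_common(3))
```

Raw counts like these are dominated by very common words, which is one reason a statistical measure of how characteristic a word is tends to be applied on top of them.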
Web-Scraping Toolkit (3 notebooks)
Author: Elizabeth Wickes
The data we need is often stored on web pages or behind APIs and getting access to it involves tools beyond just core Python. This course provides an overview of common strategies, tools, and skills needed for a variety of web scraping projects. Whether you are using non-coding tools like Google Sheets or more advanced Python modules, this course will jump-start your skills to start on a web scraping project.
Tools for taming text: RegEx and XPath (3 notebooks)
Author: Elizabeth Wickes
XPath and Regular Expressions are two powerful techniques that can aid data extraction from text, websites, XML documents, and more. Regular Expressions are ideal for matching patterns within free text, while XPath is an expression language for extracting content from HTML and XML documents. This course aims to introduce learners to the appropriate context for using each tool, foundational to intermediate syntax for each, and finally how to appropriately use them together within Python. Learners will also be provided with some take-home challenges to practice their skills after the course.
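To make the division of labor concrete, the sketch below uses an XPath-style query to pick elements out of a small XML document and then a regular expression to pull patterns out of the free text inside them. It relies only on the standard library: `xml.etree.ElementTree` supports a limited subset of XPath, so full XPath work would need a third-party library. The XML snippet and its element names are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical XML document, made up for this example.
xml_doc = """
<letters>
  <letter id="a1"><date>1851-10-18</date><body>Sent from London on 18 October.</body></letter>
  <letter id="a2"><date>1852-03-02</date><body>No reply yet from New York.</body></letter>
</letters>
"""

root = ET.fromstring(xml_doc)

# XPath: navigate the document structure to select elements.
bodies = [el.text for el in root.findall(".//letter/body")]
second_date = root.find(".//letter[@id='a2']/date").text

# Regex: match a pattern inside a text value (the year of an ISO date).
years = [re.match(r"(\d{4})-\d{2}-\d{2}", d.text).group(1) for d in root.iter("date")]

# Together: XPath picks the free-text bodies, regex finds "day Month" phrases in them.
day_month = [re.findall(r"\b\d{1,2} [A-Z][a-z]+\b", body) for body in bodies]

print(second_date)
print(years)
print(day_month)
```

The pattern to notice is that XPath answers "where in the document is it?" while the regular expression answers "what does the text look like?".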
Teaching Data Literacy (3 notebooks)
Author: Nathan Kelber
An introduction to teaching data literacy for librarians, faculty, and staff. This course will help you teach your first class using Constellate, including creating your own repository in GitHub to track your notebooks. You will be able to use any TAP Institute lessons and modify them for your own use.
2022
Python Basics (5 notebooks)
Author: Nathan Kelber
If you've never programmed before, this course is a great introduction. Taught from a humanist perspective, this course will help you start writing your first code and unlock the potential of text analysis.
Introduction to Pandas (3 notebooks)
Author: William Mattingly
This course introduces students to working with tabular data in Python through the Pandas library. On Day 1, you will learn how to install and import Pandas; you will also learn about some of its basic features, such as the DataFrame. Day 2 will focus on finding, organizing, and sorting data. Day 3 will focus on advanced searching methods, such as filtering, querying, grouping, and GroupBy. A few additional lessons will be provided on plotting data in Pandas.
Introduction to Natural Language Processing with spaCy (3 notebooks)
Author: William Mattingly
This course will introduce the key concepts of natural language processing (NLP) and an NLP Python library, spaCy. spaCy allows users to build robust pipelines for text analysis. On Day 1, we will learn about NLP concepts and how to install and use the spaCy library generally. On Day 2, we will learn how to use spaCy to identify linguistic features within a document. On Day 3, we will learn about how to apply those features to solve real-world problems for information extraction.
Working with Twitter Data (3 notebooks)
Author: Melanie Walsh
This course will prepare students to collect, analyze, and visualize Twitter data. Students will learn how to work with the Twitter API and with the Python library twarc, one of the most popular tools for Twitter data. We will also introduce basic text analysis methods that are appropriate for short documents like tweets. Participants who are eligible for the Academic Research Track of the Twitter API will have the opportunity to work with the entire historical archive of tweets (2006-2022).
A Practical Guide to Text Data Curation (3 notebooks)
Author: Xanda Schofield
No matter how exciting your research question is or how fancy your models are, all text analysis projects depend on having text data that is tidy enough to analyze. This course surveys some practices of text data curation to filter out irrelevant text, refine a corpus vocabulary, and identify text artifacts in real world text collections. We will explore how to approach these tasks using Python libraries such as NLTK and spaCy, as well as explore how some text models, like LDA topic models, can actually serve as a tool for diagnosing recurring corpus issues.
Introduction to Multilingual Named Entity Recognition (3 notebooks)
Author: William Mattingly
This course will introduce students to named entity recognition with emphasis placed on multilingual documents. On Day 1, we will address some of the common issues one faces in handling multilingual documents, such as inconsistent text encoding and text standardization, and some of the current state-of-the-art transformer-based language models. We will also meet some of the key features of spaCy’s NER pipelines. On Day 2, we will jump into rules-based NER with spaCy. On Day 3, we will explore machine learning (ML) based NER in spaCy. Here, we will learn the essentials of creating good datasets for training NER models.
Machine Learning for Humanists (3 notebooks)
Author: Grant Glass
This course will introduce students to the variety of machine learning (ML) algorithms available for textual analysis. Throughout the three days of the course, we will address how ML can be used to solve text-based problems. Day 1 will focus on the basics of ML and students will use supervised learning to work through a research question. Day 2 will be dedicated to a common ML technique: Topic Modeling. Day 3 will focus on more advanced techniques such as using language models to classify text. Every day, students will be provided with a workflow for using these techniques on their own research questions.
2021
Python Basics (5 notebooks)
Author: Nathan Kelber
If you've never programmed before, this course is a great introduction. Taught from a humanist perspective, this course will help you start writing your first code and unlock the potential of text analysis.
A Gentle Introduction to Optical Character Recognition (3 notebooks)
Author: Hannah Jacobs
This course will introduce the concept of “Optical Character Recognition” (OCR), various tools available for performing OCR, and important considerations for successfully OCRing digitized text. Using Tesseract in Python, we’ll walk through the entire process using a variety of examples to show the range of challenges scholars can face when performing OCR. By the end of the course, participants should be able to use the course’s Jupyter Notebooks to perform OCR on their own; they should be able to identify possible technical challenges presented by specific texts and propose potential solutions; and they should be able to assess the degree of accuracy they have achieved in performing OCR.
Data Analysis with Pandas (3 notebooks)
Author: Melanie Walsh
This workshop will introduce students to a popular Python package known as Pandas, a tool for data analysis and manipulation that is widely used among data scientists. Participants will learn how to work with CSV files and JSON files, how to filter and aggregate data, how to make bar charts and time series plots, how to merge datasets with common values, and more. All case studies and examples will feature data relevant to the humanities, such as (potentially) library circulation data, screenplay data, and social media data.
Introduction to Machine Learning (3 notebooks)
Author: Grant Glass
This course will introduce you to a range of Machine Learning techniques for analyzing textual data in Python. You will be introduced to the theory and method of Machine Learning and given some practical skills on how to write and execute machine learning code in Python. Some basic experience with Python will be required for participation in the class coding projects, but feel free to join us if you want to have a better understanding of what Machine Learning techniques can do for humanists. Generally speaking, this class will help you think about humanities problems through the lens of Machine Learning.
Named Entity Recognition (3 notebooks)
Author: Zoe LeBlanc
This course will introduce participants to one of the core areas of natural language processing: named entity recognition. While annotating datasets with set standards is one of the oldest areas of DH research (particularly with the Text Encoding Initiative), this course will focus on some of the newer approaches for identifying and annotating objects of interest in any given text. The course will focus on using the Python library spaCy, both with its built-in functionality and by learning how to expand upon it for more specific uses. While this course is taught in English, participants are encouraged to bring sources in multiple languages. Ultimately, participants will learn both how to leverage NER in their research and how to tailor NER to their specific textual sources.
Text Analysis with Ancient/Medieval Languages (3 notebooks)
Author: William Mattingly
This workshop will introduce students to natural language processing (NLP) and text analysis in ancient and medieval languages. We will use Latin as a case study. Day 1 will focus on the basics of NLP and spaCy, one of the leading NLP libraries for Python. Day 2 will address the textual problems of working with ancient/medieval languages, including how to handle highly inflected languages; lemmatization without a lemmatizer; and accounting for textual, geographical, and temporal variances of the language. Day 3 will address a single text analysis problem: named entity recognition (NER) in Latin. On this final day, we will develop a workflow for solving this problem. Students will leave this workshop with a strong understanding of NLP and NER. They will also have an understanding of how to solve text analysis problems in highly inflected or dead languages. Students will be provided with the resources for further learning. Finally, students will leave the workshop with a working NER model that they can use and improve in the future.
Visualizing Humanities Data (3 notebooks)
Author: Zoe LeBlanc
This course will introduce participants to some of the foundations and horizons of visualizing humanities data. To help us generate datasets we will lightly explore some text analysis methods, and then focus on some of the possibilities and pitfalls of visualizing data derived from these methods. In particular, this course will introduce participants to the principles of the grammar of graphics and exploratory data analysis through using the Python library Altair and Jupyter Notebooks. The goal of this course is to help participants learn how to incorporate visualizing humanities data into their research workflows, for both sharing aggregated information and making arguments.
Introduction to Machine Learning (3 notebooks)
Author: William Mattingly
This workshop will introduce students to machine learning (ML), from its early beginnings to its modern applications; students will also be introduced to a branch of ML known as deep learning. We will specifically address how ML can be used to solve text-based problems. Day 1 will focus on the basics of ML, the key concepts and terms that practitioners must know. Day 2 will be dedicated to a common ML problem: text classification. Day 3 will focus on an adjacent problem: topic modeling. On both days, students will be provided with a workflow for solving these problems. Students will leave this workshop with a firm understanding of ML conceptually and a basic understanding of how to engage in ML via Python. Finally, students will be provided with the resources for further learning.
How to do Things with Topic Models (4 notebooks)
Author: Rafael Alvarado
This workshop will introduce students to the concept of topic models and how they have been used to advance humanistic research. Topics to be covered include topic models as a general task in text analytics, creating topic models from scratch using Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF), visualizing their results, evaluating their performance, and interpreting their results. In addition, students will be exposed to examples of how topic models have been used in humanistic and social science research. Work will be conducted using Python 3 and Jupyter Notebooks.