Date Lecture Readings Logistics
Thu 08/31/23 Lecture #1:
  • Course introduction
  • Logistics
  • Transformers - high level overview
[ slides ]
Main readings:
  • Attention is all you need (2017) [link]

Tue 09/05/23 Lecture #2:
  • Optimization, backpropagation, and training (see the autograd sketch below)
[ slides ]
Main readings:
  • Deep Feedforward Networks. Ian Goodfellow, Yoshua Bengio, & Aaron Courville (2016). Deep Learning, Chapter 6.5. [link]
  • An overview of gradient descent optimization algorithms [link]
  • A Gentle Introduction to Torch Autograd [link]
  • Autograd Mechanics [link]
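
A minimal sketch of what the autograd readings cover, assuming PyTorch is installed; the toy objective, data, and learning rate below are illustrative, not course code:

    import torch

    # Toy least-squares objective: autograd computes the gradient, and a
    # plain SGD step updates the parameters.
    w = torch.randn(3, requires_grad=True)   # parameters
    x = torch.randn(8, 3)                    # toy inputs
    y = torch.randn(8)                       # toy targets
    lr = 0.1                                 # learning rate

    loss = ((x @ w - y) ** 2).mean()         # forward pass
    loss.backward()                          # backpropagation fills in w.grad

    with torch.no_grad():                    # gradient-descent update
        w -= lr * w.grad
    w.grad.zero_()                           # reset the gradient for the next step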

Thu 09/07/23 Lecture #3:
  • Word embeddings
  • Tokenization (see the BPE sketch below)
[ slides ]
Main readings:
  • Distributed Representations of Words and Phrases and their Compositionality (2013) [link]
  • GloVe: Global Vectors for Word Representation (2014) [link]
  • BPE: Neural Machine Translation of Rare Words with Subword Units (2016) [link]
  • SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (2018) [link]
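
A toy sketch of a single BPE merge step in plain Python; the miniature vocabulary below is made up for illustration (see the Sennrich et al. reading above for the full algorithm):

    from collections import Counter

    # word (as a tuple of symbols) -> corpus frequency
    vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 6}

    def most_frequent_pair(vocab):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs.most_common(1)[0][0]

    def merge_pair(vocab, pair):
        # Replace every occurrence of the pair with a single merged symbol.
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        return merged

    pair = most_frequent_pair(vocab)   # ('l', 'o'), with frequency 7
    vocab = merge_pair(vocab, pair)    # 'l', 'o' -> 'lo' in every word

BPE repeats these two steps until the target vocabulary size is reached.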

Tue 09/12/23 Lecture #4:
  • Transformers
  • Implementation details (see the attention sketch below)
[ slides ]
[ notebook transformer.ipynb ]
Main readings:
  • Attention is all you need (2017) [link]
  • The Annotated Transformer [link]
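
A minimal sketch of the scaled dot-product attention at the core of the readings above, assuming PyTorch; the sizes are illustrative, and batching, masking, and multiple heads are omitted:

    import math
    import torch

    d_model = 16
    q = torch.randn(5, d_model)   # 5 query positions
    k = torch.randn(7, d_model)   # 7 key positions
    v = torch.randn(7, d_model)   # one value vector per key

    scores = q @ k.T / math.sqrt(d_model)     # (5, 7) similarity scores
    weights = torch.softmax(scores, dim=-1)   # each row sums to 1
    output = weights @ v                      # (5, d_model) weighted sum of values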

Thu 09/14/23 Lecture #5:
  • Positional information: absolute, relative, RoPE, ALiBi
  • Multi-Query Attention
  • Grouped Multi-Query Attention
  • Inference
  • KV caching (see the sketch below)
  • Encoder-only and decoder-only vs. encoder-decoder
[ slides ]
Main readings:
  • Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation [link]
  • RoFormer: Enhanced Transformer with Rotary Position Embedding [link]
  • Fast Transformer Decoding: One Write-Head is All You Need [link]
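
A sketch of the KV-caching idea during autoregressive decoding, assuming PyTorch; the projection matrices and token embeddings below are random stand-ins, not a real model:

    import math
    import torch

    d = 16
    wq, wk, wv = (torch.randn(d, d) for _ in range(3))   # stand-in projection matrices
    k_cache, v_cache = [], []                            # grows by one entry per decoded token

    x = torch.randn(d)                                   # embedding of the current token
    for step in range(4):
        q = x @ wq
        k_cache.append(x @ wk)                           # only the new token's K/V are computed
        v_cache.append(x @ wv)
        K, V = torch.stack(k_cache), torch.stack(v_cache)    # (t, d) cached keys and values
        attn = torch.softmax(q @ K.T / math.sqrt(d), dim=-1)
        out = attn @ V                                   # attention output for the new token
        x = torch.randn(d)                               # stand-in for the next token's embedding

Multi-query and grouped-query attention shrink this cache by sharing key/value projections across heads.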

Tips on choosing a project [slides]

HW 1 out [link]

Tue 09/19/23 Lecture #6:
  • Transfer Learning
[ slides ]
Main readings:
  • ELMo: Deep Contextualized Word Representations [link]
  • ULMFiT: Universal Language Model Fine-tuning for Text Classification [link]
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [link]
  • ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators [link]

Project teams due. [Team submission form]

Thu 09/21/23 Lecture #7:
  • Model architectures and training objectives: encoder-decoder, decoder-only; UL2 / FIM
[ slides ]
Main readings:
  • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [link]
  • UL2: Unifying Language Learning Paradigms (2022) [link]
  • What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? (2022) [link]
  • BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension [link]

Tue 09/26/23 Lecture #8:
  • Scale
  • Compute analysis in transformers (see the estimate below)
[ slides ]
Main readings:
  • Scaling Laws for Neural Language Models [link]
  • Training Compute-Optimal Large Language Models [link]
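
A back-of-the-envelope compute estimate using the common C ≈ 6·N·D approximation (N parameters, D training tokens) discussed in the readings above; the model size and token count below are illustrative, chosen to match the roughly 20-tokens-per-parameter ratio from the compute-optimal (Chinchilla) paper:

    N = 70e9                   # 70B parameters
    D = 1.4e12                 # 1.4T training tokens (~20 tokens per parameter)
    C = 6 * N * D              # total training FLOPs (forward + backward)
    print(f"~{C:.2e} FLOPs")   # ~5.88e+23 FLOPs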

Thu 09/28/23 Lecture #9:
  • Scaling laws and GPT-3
  • Few-shot Learning
  • Prompting
  • In-context learning
[ slides ]
Main readings:
  • Language Models are Few-shot Learners (2020) [link]
  • Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? (2022) [link]
  • Data Distributional Properties Drive Emergent In-Context Learning in Transformers (2022) [link]

Tue 10/03/23 Lecture #10:
  • Prompting
  • Emergence
  • Reasoning
  • Instruction tuning
[ slides ]
Main readings:
  • Chain of Thought Prompting Elicits Reasoning in Large Language Models [link]
  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models (2023) [link]
  • The curious case of neural text degeneration (2020) [link]
  • Training language models to follow instructions with human feedback (2022) [link]
  • Multitask Prompted Training Enables Zero-Shot Task Generalization (2021) [link]
  • Finetuned Language Models Are Zero-Shot Learners (2021) [link]
  • Scaling Instruction-Finetuned Language Models (2022) [link]

Thu 10/05/23 Lecture #11:
  • Adaptation
  • Reinforcement Learning for fine-tuning language models (see the DPO sketch below)
[ slides ]
Main readings:
  • Fine-Tuning Language Models from Human Preferences (2019) [link]
  • Learning to Summarize with Human Feedback (2020) [link]
  • InstructGPT: Training Language Models to Follow Instructions with Human Feedback (2022) [link]
  • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (2022) [link]
  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model (2023) [link]
Optional readings:
  • Parameter-Efficient Transfer Learning for NLP (2019) [link]
  • Prefix-Tuning: Optimizing Continuous Prompts for Generation (2021) [link]
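
A sketch of the objective from the Direct Preference Optimization reading above, computed on pre-summed sequence log-probabilities, assuming PyTorch; the log-probability values are placeholders:

    import torch
    import torch.nn.functional as F

    beta = 0.1
    # log-probs of the chosen (w) and rejected (l) responses under the
    # policy being trained and under the frozen reference model
    logp_w_policy, logp_l_policy = torch.tensor(-12.0), torch.tensor(-15.0)
    logp_w_ref, logp_l_ref = torch.tensor(-13.0), torch.tensor(-14.0)

    margin = (logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref)
    loss = -F.logsigmoid(beta * margin)   # smaller when the policy prefers the chosen response more than the reference does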

HW 1 due (10/8)

Tue 10/10/23 Lecture #12:
  • Challenges and Opportunities of Building Open LLMs
Guest lecturer:
Iz Beltagy, Allen Institute for AI
[ slides ]
Main readings:
  • What Language Model to Train if You Have One Million GPU Hours (2022) [link]
  • Dolma: Trillion Token Open Corpus for Language Model Pretraining (2023) [link]
  • Llama 2: Open Foundation and Fine-Tuned Chat Models (2023) [link]
  • Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling (2023) [link]
  • BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (2022) [link]
  • Scaling Language Models: Methods, Analysis & Insights from Training Gopher (2022) [link]

Thu 10/12/23 Lecture #13:
  • Parameter-efficient fine-tuning (see the LoRA sketch below)
[ slides ]
Main readings:
  • Parameter-Efficient Transfer Learning for NLP (2019) [link]
  • Prefix-Tuning: Optimizing Continuous Prompts for Generation (2021) [link]
  • Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning (2022) [link]
  • LoRA: Low-Rank Adaptation of Large Language Models [link]
  • Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning [link]
  • Efficient Transformers: A Survey [link]
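
A minimal LoRA-style adapter around a frozen linear layer, assuming PyTorch; this is a sketch of the low-rank update described in the LoRA reading, not a full implementation (no dropout, weight merging, or per-module targeting):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():   # freeze the pretrained weights
                p.requires_grad = False
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: the update starts at zero
            self.scale = alpha / r

        def forward(self, x):
            # y = W x + (alpha / r) * B A x, with only A and B trainable
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    layer = LoRALinear(nn.Linear(64, 64))
    y = layer(torch.randn(2, 64))              # gradients flow only into A and B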

Tue 10/17/23 Lecture #14:
  • Reasoning with (De)Composition
Guest lecturer:
Tushar Khot, Allen Institute for AI
[ slides ]
Main readings:
  • Hey AI, Can You Solve Complex Tasks by Talking to Agents? [link]
  • Toolformer: Language Models Can Teach Themselves to Use Tools (2023) [link]
  • Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback (2023) [link]
  • ReAct: Synergizing Reasoning and Acting in Language Models [link]

10/20 Project proposal due

10/18/23 - 10/23/23 October recess - No classes

Tue 10/24/23 Lecture #15:
  • Modular deep learning
  • Mixture of experts (see the routing sketch below)
[ slides ]
Main readings:
  • Modular Deep Learning (2022) [link]
  • A Review of Sparse Expert Models in Deep Learning (2022) [link]
  • Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (2017) [link]
  • Switch Transformers: Scaling to Trillion Parameter Models (2021) [link]
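
A toy sketch of top-2 mixture-of-experts routing for a single token, assuming PyTorch; the experts and sizes are illustrative, and real routers also renormalize the kept gates and add load-balancing losses:

    import torch
    import torch.nn as nn

    d, n_experts, k = 16, 4, 2
    experts = nn.ModuleList([nn.Linear(d, d) for _ in range(n_experts)])
    router = nn.Linear(d, n_experts)

    x = torch.randn(d)                          # one token representation
    gate = torch.softmax(router(x), dim=-1)     # routing probabilities over experts
    topv, topi = gate.topk(k)                   # keep only the k highest-scoring experts
    y = sum(topv[j] * experts[int(topi[j])](x) for j in range(k))  # sparse weighted combination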

Thu 10/26/23 Midterm

Tue 10/31/23 Lecture #16:
  • Retrieval augmented language models
Guest lecturer:
Sewon Min, University of Washington
[ slides ]
Main readings:
  • Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020) [link]
  • Improving language models by retrieving from trillions of tokens (2021) [link]
  • REPLUG: Retrieval-Augmented Black-Box Language Models [link]

Thu 11/02/23 Lecture #17:
  • Build an Ecosystem, Not a Monolith
Guest lecturer:
Colin Raffel, University of Toronto
[ slides ]
Main readings:
  • Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning (2022) [link]
  • Exploring and Predicting Transferability across NLP Tasks (2020) [link]
  • Editing Models with Task Arithmetic (2022) [link]

Tue 11/07/23 Lecture #18:
  • Modeling long sequences
  • Hierarchical and graph-based methods
  • Recurrence and memory
[ slides ]
Main readings:
  • Higher-order Coreference Resolution with Coarse-to-fine Inference (2018) [link]
  • Entity, Relation, and Event Extraction with Contextualized Span Representations (2020) [link]
  • Memorizing Transformers (2022) [link]
  • Hierarchical Graph Network for Multi-hop Question Answering [link]
  • Compressive Transformers for Long-Range Sequence Modelling (2020) [link]
  • Efficient Transformers: A Survey (2022) [link]

HW 2 due

Thu 11/09/23 Lecture #19:
  • Modeling long sequences
  • Sparse attention patterns (see the masking sketch below)
  • Approximating attention
  • Hardware aware efficiency
[ slides ]
Main readings:
  • Longformer: The Long-Document Transformer (2020) [link]
  • BigBird: Transformers for Longer Sequences (2020) [link]
  • Performer: Rethinking Attention with Performers (2021) [link]
  • Reformer: The Efficient Transformer (2020) [link]
  • LongT5: Efficient Text-To-Text Transformer for Long Sequences (2022) [link]
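
A sketch of a Longformer-style sliding-window attention mask, assuming PyTorch; for clarity this builds the full score matrix, whereas a real implementation computes only the in-window scores to save memory and compute:

    import torch

    seq_len, window = 8, 2                                 # each position attends within +/- 2
    idx = torch.arange(seq_len)
    mask = (idx[None, :] - idx[:, None]).abs() <= window   # (seq_len, seq_len) boolean mask

    scores = torch.randn(seq_len, seq_len)                 # stand-in attention scores
    scores = scores.masked_fill(~mask, float("-inf"))      # block out-of-window positions
    weights = torch.softmax(scores, dim=-1)                # masked positions get weight exactly 0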

Tue 11/14/23 Lecture #20:
  • Training approaches for long sequences
  • Hardware aware efficiency
  • Societal considerations and impacts of foundation models
[ slides ]
Main readings:
  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (2022) [link]
  • PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization (2022) [link]
  • Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering (2023) [link]
  • What's in my big data? (2023) [link]
  • Red Teaming Language Models with Language Models (2022) [link]
  • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (2022) [link]

Thu 11/16/23 Lecture #21:
  • Vision transformers
  • Diffusion models
[ slides ]
Main readings:
  • An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020) [link]
  • Training data-efficient image transformers & distillation through attention (2021) [link]
  • Denoising Diffusion Probabilistic Models (2020) [link]

11/17/23 - 11/26/23 Thanksgiving recess - No classes

Tue 11/28/23 Lecture #22:
  • Final project presentations -- session 1

Thu 11/30/23 Lecture #23:
  • Towards Large Foundation Vision Models
Guest lecturer:
Neil Houlsby, Google DeepMind
[ slides ]
Main readings:
  • Scaling Vision Transformers to 22 Billion Parameters (2023) [link]
  • From Sparse to Soft Mixtures of Experts (2023) [link]
  • Scaling Vision Transformers (2021) [link]
Optional readings:
  • PaLI-X: On Scaling up a Multilingual Vision and Language Model [link]
  • PaLI: A Jointly-Scaled Multilingual Language-Image Model [link]

Fri 12/01/23 Lecture #24:
  • Final project presentations -- session 2

12/9 HW 3 due

Tue 12/05/23 Lecture #25:
  • Foundation Models for Code and Math
Guest lecturer:
Ansong Ni, Yale University
[ slides ]
Main readings:
  • Evaluating Large Language Models Trained on Code (2021) [link]
  • Solving Quantitative Reasoning Problems with Language Models (2022) [link]
  • StarCoder: May the source be with you! (2023) [link]
Optional readings:
  • Program Synthesis with Large Language Models (2021) [link]
  • Show Your Work: Scratchpads for Intermediate Computation with Language Models (2021) [link]

Thu 12/07/23 Lecture #26:
  • Moved to 12/1 (see above)

12/18 Final project report due