Date Lecture Readings Logistics
Thu 08/31/23 Lecture #1:
  • Course introduction
  • Logistics
  • Transformers - high level overview
[ slides ]
Main readings:
  • Attention is all you need (2017) [link]

Tue 09/05/23 Lecture #2:
  • Optimization, backpropagation, and training (see the autograd sketch below)
[ slides ]
Main readings:
  • Deep Feedforward Networks. Ian Goodfellow, Yoshua Bengio, & Aaron Courville (2016). Deep Learning, Chapter 6.5. [link]
  • An overview of gradient descent optimization algorithms [link]
  • A Gentle Introduction to Torch Autograd [link]
  • Autograd Mechanics [link]
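
A minimal sketch of what the autograd readings cover, assuming PyTorch is installed; the toy objective, data, and learning rate below are illustrative, not course code:

    import torch

    # Toy least-squares objective: autograd computes the gradient, and a
    # plain SGD step updates the parameters.
    w = torch.randn(3, requires_grad=True)   # parameters
    x = torch.randn(8, 3)                    # toy inputs
    y = torch.randn(8)                       # toy targets
    lr = 0.1                                 # learning rate

    loss = ((x @ w - y) ** 2).mean()         # forward pass
    loss.backward()                          # backpropagation fills in w.grad

    with torch.no_grad():                    # gradient-descent update
        w -= lr * w.grad
    w.grad.zero_()                           # reset the gradient for the next step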

Thu 09/07/23 Lecture #3:
  • Word embeddings
  • Tokenization (see the BPE sketch below)
[ slides ]
Main readings:
  • Distributed Representations of Words and Phrases and their Compositionality (2013) [link]
  • GloVe: Global Vectors for Word Representation (2014) [link]
  • BPE: Neural Machine Translation of Rare Words with Subword Units (2016) [link]
  • SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (2018) [link]
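
A toy sketch of a single BPE merge step in plain Python; the miniature vocabulary below is made up for illustration (see the Sennrich et al. reading above for the full algorithm):

    from collections import Counter

    # word (as a tuple of symbols) -> corpus frequency
    vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 6}

    def most_frequent_pair(vocab):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs.most_common(1)[0][0]

    def merge_pair(vocab, pair):
        # Replace every occurrence of the pair with a single merged symbol.
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        return merged

    pair = most_frequent_pair(vocab)   # ('l', 'o'), with frequency 7
    vocab = merge_pair(vocab, pair)    # 'l', 'o' -> 'lo' in every word

BPE repeats these two steps until the target vocabulary size is reached.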

Tue 09/12/23 Lecture #4:
  • Transformers
  • Implementation details (see the attention sketch below)
[ slides ]
[ notebook transformer.ipynb ]
Main readings:
  • Attention is all you need (2017) [link]
  • The Annotated Transformer [link]
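
A minimal sketch of the scaled dot-product attention at the core of the readings above, assuming PyTorch; the sizes are illustrative, and batching, masking, and multiple heads are omitted:

    import math
    import torch

    d_model = 16
    q = torch.randn(5, d_model)   # 5 query positions
    k = torch.randn(7, d_model)   # 7 key positions
    v = torch.randn(7, d_model)   # one value vector per key

    scores = q @ k.T / math.sqrt(d_model)     # (5, 7) similarity scores
    weights = torch.softmax(scores, dim=-1)   # each row sums to 1
    output = weights @ v                      # (5, d_model) weighted sum of values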

Thu 09/14/23 Lecture #5:
  • Positional information: absolute, relative, RoPE, ALiBi
  • Multi-Query Attention
  • Grouped Multi-Query Attention
  • Inference
  • KV caching (see the sketch below)
  • Encoder-only and decoder-only vs. encoder-decoder
[ slides ]
Main readings:
  • Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation [link]
  • RoFormer: Enhanced Transformer with Rotary Position Embedding [link]
  • Fast Transformer Decoding: One Write-Head is All You Need [link]
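
A sketch of the KV-caching idea during autoregressive decoding, assuming PyTorch; the projection matrices and token embeddings below are random stand-ins, not a real model:

    import math
    import torch

    d = 16
    wq, wk, wv = (torch.randn(d, d) for _ in range(3))   # stand-in projection matrices
    k_cache, v_cache = [], []                            # grows by one entry per decoded token

    x = torch.randn(d)                                   # embedding of the current token
    for step in range(4):
        q = x @ wq
        k_cache.append(x @ wk)                           # only the new token's K/V are computed
        v_cache.append(x @ wv)
        K, V = torch.stack(k_cache), torch.stack(v_cache)    # (t, d) cached keys and values
        attn = torch.softmax(q @ K.T / math.sqrt(d), dim=-1)
        out = attn @ V                                   # attention output for the new token
        x = torch.randn(d)                               # stand-in for the next token's embedding

Multi-query and grouped-query attention shrink this cache by sharing key/value projections across heads.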

Tips on choosing a project [slides]

HW 1 out [link]

Tue 09/19/23 Lecture #6:
  • Transfer Learning
[ slides ]
Main readings:
  • ELMo: Deep Contextualized Word Representations [link]
  • ULMFiT: Universal Language Model Fine-tuning for Text Classification [link]
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [link]
  • ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators [link]

Project teams due. [Team submission form]

Thu 09/21/23 Lecture #7:
  • Model architectures and training objectives: encoder-decoder, decoder-only; UL2 / FIM
[ slides ]
Main readings:
  • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [link]
  • UL2: Unifying Language Learning Paradigms (2022) [link]
  • What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? (2022) [link]
  • BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension [link]

Tue 09/26/23 Lecture #8:
  • Scale
  • Compute analysis in transformers (see the estimate below)
[ slides ]
Main readings:
  • Scaling Laws for Neural Language Models [link]
  • Training Compute-Optimal Large Language Models [link]
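
A back-of-the-envelope compute estimate using the common C ≈ 6·N·D approximation (N parameters, D training tokens) discussed in the readings above; the model size and token count below are illustrative, chosen to match the roughly 20-tokens-per-parameter ratio from the compute-optimal (Chinchilla) paper:

    N = 70e9                   # 70B parameters
    D = 1.4e12                 # 1.4T training tokens (~20 tokens per parameter)
    C = 6 * N * D              # total training FLOPs (forward + backward)
    print(f"~{C:.2e} FLOPs")   # ~5.88e+23 FLOPs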

Thu 09/28/23 Lecture #9:
  • Scaling laws and GPT-3
  • Few-shot Learning
  • Prompting
  • In-context learning
[ slides ]
Main readings:
  • Language Models are Few-shot Learners (2020) [link]
  • Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? (2022) [link]
  • Data Distributional Properties Drive Emergent In-Context Learning in Transformers (2022) [link]

Tue 10/03/23 Lecture #10:
  • Prompting
  • Emergence
  • Reasoning
  • Instruction tuning
[ slides ]
Main readings:
  • Chain of Thought Prompting Elicits Reasoning in Large Language Models [link]
  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models (2023) [link]
  • The curious case of neural text degeneration (2020) [link]
  • Training language models to follow instructions with human feedback (2022) [link]
  • Multitask Prompted Training Enables Zero-Shot Task Generalization (2021) [link]
  • Finetuned Language Models Are Zero-Shot Learners (2021) [link]
  • Scaling Instruction-Finetuned Language Models (2022) [link]

Thu 10/05/23 Lecture #11:
  • Adaptation
  • Reinforcement Learning for fine-tuning language models (see the DPO sketch below)
[ slides ]
Main readings:
  • Fine-Tuning Language Models from Human Preferences (2019) [link]
  • Learning to Summarize with Human Feedback (2020) [link]
  • InstructGPT: Training Language Models to Follow Instructions with Human Feedback (2022) [link]
  • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (2022) [link]
  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model (2023) [link]
Optional readings:
  • Parameter-Efficient Transfer Learning for NLP (2019) [link]
  • Prefix-Tuning: Optimizing Continuous Prompts for Generation (2021) [link]
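
A sketch of the objective from the Direct Preference Optimization reading above, computed on pre-summed sequence log-probabilities, assuming PyTorch; the log-probability values are placeholders:

    import torch
    import torch.nn.functional as F

    beta = 0.1
    # log-probs of the chosen (w) and rejected (l) responses under the
    # policy being trained and under the frozen reference model
    logp_w_policy, logp_l_policy = torch.tensor(-12.0), torch.tensor(-15.0)
    logp_w_ref, logp_l_ref = torch.tensor(-13.0), torch.tensor(-14.0)

    margin = (logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref)
    loss = -F.logsigmoid(beta * margin)   # smaller when the policy prefers the chosen response more than the reference does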

HW 1 due (10/8)

Tue 10/10/23 Lecture #12:
  • Challenges and Opportunities of Building Open LLMs
Guest lecturer:
Iz Beltagy, Allen Institute for AI
[ slides ]
Main readings:
  • What Language Model to Train if You Have One Million GPU Hours (2022) [link]
  • Dolma: Trillion Token Open Corpus for Language Model Pretraining (2023) [link]
  • Llama 2: Open Foundation and Fine-Tuned Chat Models (2023) [link]
  • Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling (2023) [link]
  • BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (2022) [link]
  • Scaling Language Models: Methods, Analysis & Insights from Training Gopher (2022) [link]

Thu 10/12/23 Lecture #13:
  • Parameter-efficient fine-tuning (see the LoRA sketch below)
[ slides ]
Main readings:
  • Parameter-Efficient Transfer Learning for NLP (2019) [link]
  • Prefix-Tuning: Optimizing Continuous Prompts for Generation (2021) [link]
  • Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning (2022) [link]
  • LoRA: Low-Rank Adaptation of Large Language Models [link]
  • Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning [link]
  • Efficient Transformers: A Survey [link]
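
A minimal LoRA-style adapter around a frozen linear layer, assuming PyTorch; this is a sketch of the low-rank update described in the LoRA reading, not a full implementation (no dropout, weight merging, or per-module targeting):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():   # freeze the pretrained weights
                p.requires_grad = False
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: the update starts at zero
            self.scale = alpha / r

        def forward(self, x):
            # y = W x + (alpha / r) * B A x, with only A and B trainable
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    layer = LoRALinear(nn.Linear(64, 64))
    y = layer(torch.randn(2, 64))              # gradients flow only into A and B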

Tue 10/17/23 Lecture #14:
  • Reasoning with (De)Composition
Guest lecturer:
Tushar Khot, Allen Institute for AI
[ slides ]
Main readings:
  • Hey AI, Can You Solve Complex Tasks by Talking to Agents? [link]
  • Toolformer: Language Models Can Teach Themselves to Use Tools (2023) [link]
  • Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback (2023) [link]
  • ReAct: Synergizing Reasoning and Acting in Language Models [link]

10/20 Project proposal due

10/18/23 - 10/23/23 October recess - No classes

Tue 10/24/23 Lecture #15:
  • Modular deep learning
  • Mixture of experts (see the routing sketch below)
[ slides ]
Main readings:
  • Modular Deep Learning (2022) [link]
  • A Review of Sparse Expert Models in Deep Learning (2022) [link]
  • Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (2017) [link]
  • Switch Transformers: Scaling to Trillion Parameter Models (2021) [link]
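
A toy sketch of top-2 mixture-of-experts routing for a single token, assuming PyTorch; the experts and sizes are illustrative, and real routers also renormalize the kept gates and add load-balancing losses:

    import torch
    import torch.nn as nn

    d, n_experts, k = 16, 4, 2
    experts = nn.ModuleList([nn.Linear(d, d) for _ in range(n_experts)])
    router = nn.Linear(d, n_experts)

    x = torch.randn(d)                          # one token representation
    gate = torch.softmax(router(x), dim=-1)     # routing probabilities over experts
    topv, topi = gate.topk(k)                   # keep only the k highest-scoring experts
    y = sum(topv[j] * experts[int(topi[j])](x) for j in range(k))  # sparse weighted combination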

Thu 10/26/23 Midterm

Tue 10/31/23 Lecture #16:
  • Retrieval augmented language models
Guest lecturer:
Sewon Min, University of Washington
[ slides ]
Main readings:
  • Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020) [link]
  • Improving language models by retrieving from trillions of tokens (2021) [link]
  • REPLUG: Retrieval-Augmented Black-Box Language Models [link]

Thu 11/02/23 Lecture #17:
  • Build an Ecosystem, Not a Monolith
Guest lecturer:
Colin Raffel, University of Toronto
[ slides ]
Main readings:
  • Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning (2022) [link]
  • Exploring and Predicting Transferability across NLP Tasks (2020) [link]
  • Editing Models with Task Arithmetic (2022) [link]

Tue 11/07/23 Lecture #18:
  • Modeling long sequences
  • Hierarchical and graph-based methods
  • Recurrence and memory
[ slides ]
Main readings:
  • Higher-order Coreference Resolution with Coarse-to-fine Inference (2018) [link]
  • Entity, Relation, and Event Extraction with Contextualized Span Representations (2020) [link]
  • Memorizing Transformers (2022) [link]
  • Hierarchical Graph Network for Multi-hop Question Answering [link]
  • Compressive Transformers for Long-Range Sequence Modelling (2020) [link]
  • Efficient Transformers: A Survey (2022) [link]

HW 2 due

Thu 11/09/23 Lecture #19:
  • Modeling long sequences
  • Sparse attention patterns (see the masking sketch below)
  • Approximating attention
  • Hardware aware efficiency
[ slides ]
Main readings:
  • Longformer: The Long-Document Transformer (2020) [link]
  • BigBird: Transformers for Longer Sequences (2020) [link]
  • Performer: Rethinking Attention with Performers (2021) [link]
  • Reformer: The Efficient Transformer (2020) [link]
  • LongT5: Efficient Text-To-Text Transformer for Long Sequences (2022) [link]
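
A sketch of a Longformer-style sliding-window attention mask, assuming PyTorch; for clarity this builds the full score matrix, whereas a real implementation computes only the in-window scores to save memory and compute:

    import torch

    seq_len, window = 8, 2                                 # each position attends within +/- 2
    idx = torch.arange(seq_len)
    mask = (idx[None, :] - idx[:, None]).abs() <= window   # (seq_len, seq_len) boolean mask

    scores = torch.randn(seq_len, seq_len)                 # stand-in attention scores
    scores = scores.masked_fill(~mask, float("-inf"))      # block out-of-window positions
    weights = torch.softmax(scores, dim=-1)                # masked positions get weight exactly 0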

Tue 11/14/23 Lecture #20:
  • Training approaches for long sequences
  • Hardware aware efficiency
  • Societal considerations and impacts of foundation models
[ slides ]
Main readings:
  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (2022) [link]
  • PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization (2022) [link]
  • Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering (2023) [link]
  • What's in my big data? (2023) [link]
  • Red Teaming Language Models with Language Models (2022) [link]
  • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (2022) [link]

Thu 11/16/23 Lecture #21:
  • Vision transformers
  • Diffusion models
[ slides ]
Main readings:
  • An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020) [link]
  • Training data-efficient image transformers & distillation through attention (2021) [link]
  • Denoising Diffusion Probabilistic Models (2020) [link]

11/17/23 - 11/26/23 Thanksgiving recess - No classes

Tue 11/28/23 Lecture #22:
  • Final project presentations -- session 1

Thu 11/30/23 Lecture #23:
  • Towards Large Foundation Vision Models
Guest lecturer:
Neil Houlsby, Google DeepMind
[ slides ]
Main readings:
  • Scaling Vision Transformers to 22 Billion Parameters (2023) [link]
  • From Sparse to Soft Mixtures of Experts (2023) [link]
  • Scaling Vision Transformers (2021) [link]
Optional readings:
  • PaLI-X: On Scaling up a Multilingual Vision and Language Model [link]
  • PaLI: A Jointly-Scaled Multilingual Language-Image Model [link]

Fri 12/01/23 Lecture #24:
  • Final project presentations -- session 2

12/9 HW 3 due

Tue 12/05/23 Lecture #25:
  • Foundation Models for Code and Math
Guest lecturer:
Ansong Ni, Yale University
[ slides ]
Main readings:
  • Evaluating Large Language Models Trained on Code (2021) [link]
  • Solving Quantitative Reasoning Problems with Language Models (2022) [link]
  • StarCoder: May the source be with you! (2023) [link]
Optional readings:
  • Program Synthesis with Large Language Models (2021) [link]
  • Show Your Work: Scratchpads for Intermediate Computation with Language Models (2021) [link]

Thu 12/07/23 Lecture #26:
  • Moved to 12/1 (see above)

12/18 Final project report due