Pretrained Models I

Course Description

Large-scale pretrained models (BERT, T5, and GPT) have dramatically changed the landscape of artificial intelligence, including natural language processing, computer vision, and robotics. With their huge numbers of parameters, these models encode rich knowledge and serve as effective backbones for downstream tasks, compared to training models from scratch.

This seminar is the first part of the “Pretrained Models” course and focuses on Transformer-based pretrained models with an encoder-only architecture. It is a transition course, designed to equip students with the essential knowledge to engage with cutting-edge NLP research. It consists of four parts:

  • Part 1: Introduction
  • Part 2: Model Architecture and Learning
  • Part 3: Model Analysis and Interpretation
  • Part 4: Efficient Pretrained Models
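
To make the “backbone” idea in the course description concrete, here is a minimal sketch (not part of the official course materials) of loading an encoder-only pretrained model with the Hugging Face transformers library, the same library behind the Huggingface NLP course linked in the syllabus, and placing a small task-specific head on top of its [CLS] representation. The model name, the toy classifier, and the example sentence are illustrative assumptions, not course requirements.

```python
# A minimal sketch (assumed setup: PyTorch + Hugging Face `transformers`);
# it only illustrates the "pretrained backbone" idea, not an official exercise.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")  # encoder-only backbone

# Encode a sentence and take the final-layer [CLS] vector as a pooled representation.
inputs = tokenizer("Pretrained encoders are effective backbones.", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)
cls_vector = outputs.last_hidden_state[:, 0]  # shape: (1, hidden_size)

# Hypothetical downstream head: a linear classifier on top of the encoder,
# instead of training a whole model from scratch.
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)
logits = classifier(cls_vector)
print(logits.shape)  # torch.Size([1, 2])
```

In practice the encoder is usually fine-tuned together with the head on labeled downstream data; the snippet only shows the wiring.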

Instructor: Meng Li

Time: Tuesdays, 2:15-3:45pm (first meeting on April 16)

Room: 2.14.0.32

Course Management System: Moodle

Syllabus

(Note: You are welcome to suggest other topics or papers.)

Part 1: Introduction

2024/04/16: Transformer
  • Readings: Attention Is All You Need
  • Related materials: The Annotated Transformer; The Illustrated Transformer; Huggingface NLP course on transformer; Attention? Attention!
  • Presenter: Meng
  • Slides: slides

2024/04/23: Pretrained Models with Encoder-only Architecture
  • Readings: (1) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding; (2) RoBERTa: A Robustly Optimized BERT Pretraining Approach; (3) ALBERT: A Lite BERT for Self-supervised Learning of Language Representations; (4) SpanBERT: Improving Pre-training by Representing and Predicting Spans; (5) XLNet: Generalized Autoregressive Pretraining for Language Understanding
  • Related materials: A Primer in BERTology: What We Know About How BERT Works; The Illustrated BERT, ELMo
  • Presenter: Meng
  • Slides: slides

2024/04/30: no class

Part 2: Model Architecture and Learning

2024/05/07: Tokenization
  • Readings: (1) Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP; (2) Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance
  • Related materials: Neural Machine Translation of Rare Words with Subword Units; Huggingface NLP course on tokenizers; BPE Explainer
  • Presenters: (1) Mai; (2) Imge

2024/05/14: Self-Supervised Learning
  • Readings: (1) Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks; (2) A Simple Framework for Contrastive Learning of Visual Representations
  • Related materials: A Cookbook of Self-Supervised Learning; A Primer on Contrastive Pretraining in Language Processing: Methods, Lessons Learned and Perspectives; Tutorial on SimCLR
  • Presenters: (1) Dimitrije; (2) Theodore

2024/05/21: Transfer Learning
  • Readings: (1) CogTaskonomy: Cognitively Inspired Task Taxonomy Is Beneficial to Transfer Learning in NLP; (2) Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT
  • Related materials: Tutorial on transfer learning for NLP (NAACL 2019) [code]
  • Presenters: (1) Elvira; (2) Emma

Part 3: Model Analysis and Interpretation

2024/05/28: Linguistic Knowledge of Pretrained Models
  • Readings: (1) A Structural Probe for Finding Syntax in Word Representations; (2) Probing Pretrained Language Models for Lexical Semantics
  • Related materials: Probing Classifiers: Promises, Shortcomings, and Advances; Designing and Interpreting Probes with Control Tasks
  • Presenter: (2) Nicolas

2024/06/04: World Knowledge of Pretrained Models
  • Readings: (1) Evaluating Commonsense in Pre-Trained Language Models; (2) Probing Pre-Trained Language Models for Cross-Cultural Differences in Values
  • Presenters: (1) Altar; (2) Ruilin

Part 4: Efficient Pretrained Models

2024/06/11: Pruning
  • Readings: (1) The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks; (2) Are Sixteen Heads Really Better than One?
  • Related materials: Compressing Large-Scale Transformer-Based Models: A Case Study on BERT; Pytorch pruning tutorial; Diving Into Model Pruning in Deep Learning
  • Presenter: (2) Jerycho

2024/06/18: Quantization
  • Readings: (1) Understanding and Overcoming the Challenges of Efficient Transformer Quantization; (2) I-BERT: Integer-only BERT Quantization
  • Related materials: Pytorch quantization recipe; A Tale of Model Quantization in TF Lite

2024/06/25: Knowledge Distillation
  • Readings: (1) TinyBERT: Distilling BERT for Natural Language Understanding; (2) MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
  • Related materials: Distilling the Knowledge in a Neural Network; Pytorch knowledge distillation tutorial; Distilling Knowledge in Neural Networks With Weights & Biases
  • Presenter: (1) Kseniya

Requirement

Prerequisites. You are expected to have a solid understanding of neural networks and to have completed at least one course on natural language processing. In addition, you should be a curious and active learner, ready to prepare your presentation and to take part in discussions.

Registration. If you would like to participate, please register directly through PULS. In addition, please send an email to meng.li (at) uni-potsdam.de by April 19 (23:59), 2024. In your email, please:

  • Tell me your name, semester, and major.
  • Name your top three paper choices from the syllabus that you would like to present.
  • Explain why you want to take this course.
  • List some of your related experience in deep learning, natural language processing, or implementing NLP models.

Format. We will focus on one topic each week.

  • In the first two units, I will give tutorials on the Transformer and on pretrained models.
  • From the fourth week on, there are two readings for each topic, and two students present in each unit. Each student presents one paper and leads the discussion or activities that follow. The presentation should last 20-30 minutes, leaving 15-25 minutes for discussion or activities. All students are expected to read both papers every week and to submit one question for each paper by Monday evening (23:59).

Grading

Scheme. Questions about readings: 20%; Presentation: 30%; Final paper: 50%.
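
To illustrate how the weights combine (hypothetical scores, assuming a simple weighted average): a student who earns 90% for the reading questions, 80% for the presentation, and 85% for the final paper would receive 0.2 × 90 + 0.3 × 80 + 0.5 × 85 = 84.5% overall.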

Questions. From the fourth week on, each student submits one question for each paper on Moodle by Monday evening (23:59). Questions are graded on a 3-point scale (0: no question submitted; 1: superficial question; 2: insightful question). A superficial question merely rephrases the content of a paper, and its answer can be found in the paper itself. An insightful question is inquisitive: it may identify an information gap or a logical flaw in the paper, or connect it to earlier literature from a new perspective.

Presentations.

  • Your presentation is expected to motivate the paper with meaningful questions in a broad context and to outline its claims and the evidence supporting them. Not every detail needs to be covered in the limited time.
  • Your presentation in this seminar is not supposed to be a perfect pitch. It will not affect your presentation grade if you do not understand some points of the assigned paper. Rather, the presentation is expected to be open and transparent: points of confusion can be a starting point for in-class discussion and for your future research.
  • Rehearse your presentation and, if time permits, improve it with feedback from your friends or fellow students. There are numerous books and videos on presenting; feel free to learn from them and practice as needed.

Final paper.

Note: We will discuss this in the first meeting. Requirements may be changed based on popular demand.

The final paper should have 5 pages of main content (plus unlimited references and appendices), following the ACL template.

  • Option 1: A technical report of a small independent project.
  • Option 2: A review paper.
  • Option 3: A topic of your choice, agreed upon in discussion with me.

Both the proposal and the final paper should be uploaded on Moodle.

  • Proposal due date: June 16 (23:59), 2024
  • Final paper due date: October 13 (23:59), 2024

Contact

Please contact Meng at meng.li (at) uni-potsdam.de for any questions.