This is the archived website for the Spring 2024 offering of CS336.
The latest offering is here.
Logistics
- Lectures: Monday/Wednesday 1:30-2:50pm in McMurtry Art & Art History Building, Oshman Presentation Space, Room 102
- Office hours:
  - Tatsunori Hashimoto: Friday 3-4pm in Gates 354
  - Percy Liang: Monday 3-4pm in Gates 350
  - Nelson Liu: Wednesday 3-4pm in Gates 315
  - Gabriel Poesia: Thursday 10-11am in Gates 315
- Contact: Students should ask all course-related questions in public Slack channels. All announcements will also be made in Slack. For personal matters, email cs336-spr2324-staff@lists.stanford.edu.
Content
What is this course about?
Language models serve as the cornerstone of modern natural language processing (NLP) applications and open up a new paradigm of having a single general-purpose system address a range of downstream tasks. As the fields of artificial intelligence (AI), machine learning (ML), and NLP continue to grow, a deep understanding of language models becomes essential for scientists and engineers alike. This course is designed to provide students with a comprehensive understanding of language models by walking them through the entire process of developing their own. Drawing inspiration from operating systems courses that create an entire operating system from scratch, we will lead students through every aspect of language model creation, including data collection and cleaning for pre-training, transformer model construction, model training, and evaluation before deployment.
Prerequisites
- Proficiency in Python

  The majority of class assignments will be in Python. Unlike most other AI classes, students will be given minimal scaffolding. The amount of code you will write will be at least an order of magnitude greater than for other classes. Therefore, being proficient in Python and software engineering is paramount.
- Experience with deep learning and systems optimization

  A significant part of the course will involve making neural language models run quickly and efficiently on GPUs across multiple machines. We expect students to have a strong familiarity with PyTorch and to know basic systems concepts like the memory hierarchy.
- College Calculus, Linear Algebra (e.g. MATH 51, CME 100)

  You should be comfortable understanding matrix/vector notation and operations.
- Basic Probability and Statistics (e.g. CS 109 or equivalent)

  You should know the basics of probabilities, Gaussian distributions, mean, standard deviation, etc.
- Machine Learning (e.g. CS221, CS229, CS230, CS124, CS224N)

  You should be comfortable with the basics of machine learning and deep learning.
Note that this is a 5-unit class. This is a very implementation-heavy class, so please allocate enough time for it.
Coursework
Assignments
- Assignment 1: Basics [leaderboard]
  - Implement all of the components (tokenizer, model architecture, optimizer) necessary to train a standard Transformer language model.
  - Train a minimal language model.
- Assignment 2: Systems
  - Profile and benchmark the model and layers from Assignment 1, and optimize RMSNorm with a custom GPU kernel.
  - Build a memory-efficient, distributed version of the Assignment 1 model.
- Assignment 3: Scaling
  - Query a training API to fit a scaling law and project model scaling.
- Assignment 4: Data [leaderboard]
  - Convert raw Common Crawl dumps into usable pretraining data.
  - Perform filtering and deduplication to improve model performance.
- Assignment 5: Alignment
  - Annotate instruction-tuning data for the model.
  - Implement and apply RLHF to align the model.
Honor code
As in all other classes at Stanford, we take the student Honor Code seriously. Please respect the following policies:
- Collaboration: Study groups are allowed, but students must understand and complete their own assignments, and hand in one assignment per student. If you worked in a group, please put the names of the members of your study group at the top of your assignment. Please ask if you have any questions about the collaboration policy.
- AI tools: Use of language models such as ChatGPT is permitted for low-level programming questions or high-level conceptual questions about language models, but using them directly to solve the assignment problems is prohibited.
- Existing code: Implementations for many of the things you will implement exist online. The handouts we'll give will be self-contained, so you will not need to consult third-party code to produce your own implementation. Thus, you should not look at any existing code unless otherwise specified in the handouts.
Submitting coursework
- All coursework is submitted via Gradescope by the deadline. Do not submit your coursework via email.
- If anything goes wrong, please ask a question in Slack or contact a course assistant.
- You can submit as many times as you'd like until the deadline: we will only grade the last submission.
- Partial work is better than not submitting any work.
Late days
- Each student has 6 late days to use. A late day extends the deadline by 24 hours.
- You can use up to 3 late days per assignment.
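As a rough illustration of how the two limits interact, the sketch below computes an extended deadline. The function `effective_deadline`, its parameters, and the example date are hypothetical and not part of any course tooling; only the 6-day budget, the 3-day per-assignment cap, and the 24-hour extension come from the policy above.

```python
from datetime import datetime, timedelta

TOTAL_LATE_DAYS = 6      # per-student budget, from the policy above
MAX_PER_ASSIGNMENT = 3   # per-assignment cap, from the policy above


def effective_deadline(deadline, days_requested, days_already_used):
    """Return (new_deadline, days_granted) for a hypothetical late-day request.

    days_already_used counts late days spent on earlier assignments.
    """
    if days_requested < 0 or days_already_used < 0:
        raise ValueError("day counts must be non-negative")
    # Late days still available in the overall budget.
    remaining = max(TOTAL_LATE_DAYS - days_already_used, 0)
    # A request is limited by both the per-assignment cap and the budget.
    granted = min(days_requested, MAX_PER_ASSIGNMENT, remaining)
    # Each granted late day extends the deadline by 24 hours.
    return deadline + timedelta(hours=24 * granted), granted


# Hypothetical example: the deadline time shown here is an assumption.
deadline = datetime(2024, 4, 15, 23, 59)
new_deadline, granted = effective_deadline(deadline, days_requested=5, days_already_used=0)
print(new_deadline, granted)
```

In this sketch a request for 5 days is capped at 3 by the per-assignment limit, and a student who has already used 5 late days can extend a later deadline by at most 1 day.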
Regrade requests
If you believe that the course staff made an objective error in grading, you may submit a regrade request on Gradescope within 3 days after the grades are released.
Sponsor
We would like to thank Together AI for sponsoring the compute for this class.
Schedule
Percy's lectures are all in Python and available at this repository.
| # | Date | Description | Course Materials | Events | Deadlines |
|---|---|---|---|---|---|
| 1 | Mon April 1 | Overview, tokenization (Percy) | lecture_01.py | Assignment 1 out [code] [preview] [leaderboard] | |
| 2 | Wed April 3 | PyTorch, resource accounting (Percy) | lecture_02.py | | |
| 3 | Mon April 8 | Architectures, hyperparameters (Tatsu) | lecture 3.pdf | | |
| 4 | Wed April 10 | Mixture of experts (Tatsu) | lecture 4.pdf | | |
| 5 | Mon April 15 | GPUs (Tatsu) | lecture 5.pdf | Assignment 1 due | |
| 6 | Wed April 17 | Kernels, Triton (Percy) | lecture_06.py | Assignment 2 out [code] [preview] | |
| 7 | Mon April 22 | Parallelism (Tatsu) | lecture 7.pdf | | |
| 8 | Wed April 24 | Parallelism (Percy) | lecture_08.py | | |
| 9 | Mon April 29 | Scaling laws (Tatsu) | lecture 9.pdf | | |
| 10 | Wed May 1 | Scaling laws (Tatsu) | lecture 10.pdf | Assignment 2 due; Assignment 3 out [code] [preview] | |
| 11 | Mon May 6 | Data (Percy) | lecture_11.py | | |
| 12 | Wed May 8 | Data (Percy) | lecture_12.py | Assignment 3 due | |
| | Sat May 11 | | | Assignment 4 out [code] [preview] [leaderboard] | |
| 13 | Mon May 13 | Data (Percy) | lecture_13.py | | |
| 14 | Wed May 15 | Data (Percy) | lecture_14.py | | |
| 15 | Mon May 20 | Alignment (Tatsu) | lecture 15.pdf | | |
| 16 | Wed May 22 | Alignment (Tatsu) | lecture 16.pdf | | |
| - | Mon May 27 | Memorial Day - no classes | | Assignment 4 due | |
| 17 | Wed May 29 | Evals (Tatsu) | lecture 17.pdf | Assignment 5 out [code] [preview] | |
| 18 | Mon June 3 | Guest lecture by Ce Zhang | | | |
| 19 | Wed June 5 | Guest lecture by Aakanksha Chowdhery | | | |