August 25, 2025

CS 594: RLHF Theory for AI-Alignment and Fine-Tuning LLMs (Fall 2025)

πŸ“‘ Instructor: Aadirupa Saha

---------- This website is under construction ----------

πŸ“… Course Schedule (NOT FINALIZED)

Columns: Date | Topic | Reading | Materials | Notes | Top-3 Paper Recommendations

Aug 26: Introduction; Basics of AI-Alignment
Aug 28: Concentration Bounds; Multi-Armed Bandits (MAB)
Sep 2: UCB Algorithm
Sep 4: Online Mirror Descent
Sep 9: Linear Bandits
Sep 11: Paper presentation
  • Heavy-tailed bandits
  • Bandits with side info
  • Bandits with budgets
Sep 16: Linear Bandits (contd.)
Sep 18: Dueling Bandits: Learning from Preferences
Sep 23: Paper presentation
  • Beyond Linear Bandits: Kernel Bandits
  • Lipschitz bandits
  • Neural Bandits
Sep 25: Paper presentation
  • MNL Bandits
  • Dynamic Bandits
  • Sleeping Bandits
Sep 30: Dueling Bandits (contd.)
Oct 2: Paper presentation
  • Robust DB
  • Fair DB
  • DB with correlated preferences
Oct 7: Contextual Bandits: EXP4 (one-step RL)
Oct 9: Paper presentation
Oct 14: Contextual MAB (contd.)
Oct 16: Contextual Dueling Bandits
Oct 21: Intro to RL + MDP Basics
Oct 23: Tabular MDPs: UCB-VI
Oct 28: Paper presentation
Oct 30: --- Happy Halloween! ---
Nov 4: Linear function approximation
Nov 6: Policy Gradient (PG) Methods
Nov 11: PPO + TRPO
Nov 13: Paper presentation
Nov 18: Imitation Learning
Nov 20: Paper presentation
Nov 25: Project presentations (20 min/team)
Nov 27: --- Thanksgiving Break! ---
Dec 2: Project presentations (20 min/team)
Dec 4: Project presentations (20 min/team)

πŸ“š Course Description (Tentative)

Overview

This course aims to provide a rigorous mathematical foundation for understanding and implementing Reinforcement Learning from Human Feedback (RLHF). The curriculum is structured around interconnected modules that progress from theoretical foundations to practical implementation.

Modules

Topics Set 1: Formalizing the Alignment Problem

Establishes the conceptual framework for AI alignment, examining outer versus inner alignment, reward misspecification, and Goodhart’s Law in human feedback systems. Students analyze real-world failure modes and develop intuition for how alignment can degrade in deployed systems.

Topics Set 2: Evaluation & Alignment Verification

Addresses measuring alignment beyond simple reward maximization, covering multi-dimensional evaluation frameworks (HHH: Helpfulness, Harmlessness, Honesty), human evaluation pipeline design, adversarial testing, and the robustness verification techniques essential for production deployment.

Topics Set 3: Reinforcement Learning Theory

Provides the mathematical foundations underlying RLHF algorithms. Beginning with MDP fundamentals and Bellman equations, the module progresses through policy gradient methods, exploration strategies in tabular and linear settings, and advanced topics including low-rank MDPs and uniform convergence theory. Students master both classical results and recent theoretical developments.
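As a concrete illustration of the MDP fundamentals and Bellman equations covered in this module, here is a minimal tabular value-iteration sketch. The toy two-state, two-action MDP (its transition and reward tensors) is invented for illustration and is not taken from the course materials.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[s, a, s'] = transition probability; R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # discount factor

# Value iteration: repeatedly apply the Bellman optimality operator
#   V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s, a, s') V(s') ]
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * P @ V        # Q[s, a]; P @ V sums over s'
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

policy = Q.argmax(axis=1)  # greedy policy w.r.t. the converged Q-values
```

Since the Bellman operator is a gamma-contraction, the loop converges geometrically; UCB-VI (on the schedule) adds exploration bonuses on top of this same backup.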

Topics Set 4: RLHF Theory and Practice

Synthesizes preference learning, contextual bandits, and human-in-the-loop optimization. Topics include active learning for efficient feedback collection, handling noisy and biased human inputs, integrating multiple feedback sources, and maintaining safety and robustness guarantees throughout the pipeline.
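Preference learning of the kind covered here is commonly grounded in the Bradley-Terry model, where the probability that item A is preferred over item B is a sigmoid of the reward gap. A minimal sketch of fitting per-item rewards from pairwise comparisons follows; the toy comparison data and learning rate are illustrative assumptions, not course material.

```python
import math

def bt_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry model: P(A preferred over B) = sigmoid(r_a - r_b)."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

# Toy data: (winner, loser) index pairs among 3 items (illustrative only).
comparisons = [(0, 1), (0, 2), (1, 2), (0, 1)]
r = [0.0, 0.0, 0.0]  # learned reward per item
lr = 0.5

# Gradient ascent on the Bradley-Terry log-likelihood:
#   d/dr_w log P(w > l) = 1 - sigmoid(r_w - r_l), and the negative for r_l.
for _ in range(200):
    for w, l in comparisons:
        g = 1.0 - bt_prob(r[w], r[l])
        r[w] += lr * g
        r[l] -= lr * g
```

With this data, item 0 (which wins every comparison) ends up with the highest fitted reward and item 2 with the lowest; in RLHF the same log-likelihood is optimized over a neural reward model instead of per-item scalars.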

Topics Set 5: Large Language Model Theory

Bridges abstract RL theory and practical LLM deployment. Covers transformer architectures, fine-tuning methodologies, parameter-efficient adaptation (LoRA, adapters), preference modeling for reward extraction, and specialized RL algorithms (e.g., PPO with KL regularization) tailored to language model optimization.
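The KL-regularized PPO objective mentioned above is often implemented by subtracting a penalty, proportional to the log-probability gap between the fine-tuned policy and a frozen reference model, from the reward-model score. This is a hedged sketch of that shaping step: the per-token decomposition and beta = 0.1 are common conventions, not specifics from the course.

```python
def kl_shaped_reward(reward: float,
                     logp_policy: float,
                     logp_ref: float,
                     beta: float = 0.1) -> float:
    """KL-regularized RLHF reward for one sampled token:
        r_total = r_RM - beta * (log pi_theta(a|s) - log pi_ref(a|s)).
    Exact schemes vary across implementations; beta trades off reward
    maximization against staying close to the reference model."""
    kl_penalty = logp_policy - logp_ref
    return reward - beta * kl_penalty
```

When the policy assigns the same log-probability as the reference, the penalty vanishes and the raw reward-model score is used unchanged; as the policy drifts toward higher-probability (over-optimized) tokens, the shaped reward shrinks.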

Prerequisites

This course demands strong mathematical maturity and technical proficiency. Advanced linear algebra and probability theory are essential (matrix analysis, eigendecompositions, concentration inequalities, stochastic processes). Knowledge of machine learning theory (optimization, generalization bounds, statistical learning theory) is required.

Programming competency in Python is required (PyTorch or similar recommended), and LaTeX proficiency is mandatory for assignments and the final project. The theoretical content assumes familiarity with measure theory, basics of functional analysis, and advanced calculus. Students unsure about preparation should complete prerequisite coursework before enrolling.

Learning Outcomes

Graduates will understand the mathematical principles governing preference learning, design and evaluate alignment verification systems, and implement production-grade RLHF pipelines. The course prepares students for advanced research roles in AI labs, senior engineering roles deploying large language models, and independent research in AI alignment. Students will be equipped to drive innovation, lead technical teams, and contribute to the theoretical and practical advances shaping the future of safe AI deployment.

⚠️ Prerequisites

Expect this to be a fairly math-intensive course. Please familiarize yourself with the basics of Probability-Statistics (PS) and Linear-Algebra (LA). Recommended introductory lectures to check whether you are comfortable with the basics:

Familiarity with LaTeX for scientific writing, used for scribing the lecture notes (required only for MS students). You can learn the basics from A Simple Quickstart Guide. Many other online tutorials are available for beginners β€” feel free to explore and use whichever best suits your needs.

Programming experience in Python, for the experiments. Be prepared to code: [Python ML Tutorials], [Google Colab] (many other online tutorials are available for beginners).

A strong grasp of the foundational material outlined above is expected of all students taking the course for credit. Insufficient preparation may adversely affect your ability to engage with the course content and perform successfully in assessments, which may impact your final grades.

πŸ† Grading Policy

  • Project + Report: 25%
    Problem selection 5% β€’ Motivation 5% β€’ Solution 10% β€’ Experiments 5%
  • Paper Presentation + Coding: 20%
    Topic choice + Motivation 5% β€’ Theoretical explanation 10% β€’ Experiments 5%
  • Scribe: 15%
    Approximately 2 lectures
  • Piazza Weekly Problem: 10% (+5% extra credit)
    Post one new unresolved problem per week, with motivation, based on that week's lectures
  • Piazza Weekly Paper: 10% (+5% extra credit)
    Find one related paper per week based on the topics covered that week
  • Class Participation: 10% (+5% extra credit)
    Lecture questions/answers 5% β€’ Presentation questions 5%
  • Quiz: 10%
    Random iClicker quizzes

πŸ“– Resources

Class lectures will be based on, but not limited to, the following books:

🎯 Course Logistics

  • πŸ“ Location: CDRLC
  • ⏰ Schedule: Tuesday & Thursday, 2:00 - 3:15 PM
  • πŸ›οΈ Office Hours: Thurdays 5:00–6:00 PM or by appointment
  • πŸ“§ Piazza: TBA

πŸ“Œ Important Dates

  • Select Project Topic: Oct 3rd, 2025
  • Project Presentation: Nov 25th, Dec 2nd, Dec 4th, 2025
  • Project Report: Dec 6th, 2025