August 25, 2025

CS 594: RLHF Theory for AI-Alignment and Fine-Tuning LLMs (Fall 2025)

πŸ“‘ Instructor: Aadirupa Saha

---------- This website is under construction ----------

πŸ“… Course Schedule (NOT FINALIZED)

Columns: Date | Topic | Reading | Materials | Notes | Top-3 Paper Recommendations

Aug 26: Introduction; Basics of AI-Alignment
Aug 28: Concentration Bounds; Multi-Armed Bandits (MAB)
Sep 2: UCB Algorithm
Sep 4: Online Mirror Descent
Sep 9: Linear Bandits
Sep 11: Paper presentation
  • Heavy-tailed bandits
  • Bandits with side info
  • Bandits with budgets
Sep 16: Linear Bandits (contd.)
Sep 18: Dueling Bandits: Learning from Preferences
Sep 23: Paper presentation
  • Beyond Linear Bandits: Kernel Bandits
  • Lipschitz bandits
  • Neural Bandits
Sep 25: Paper presentation
  • MNL Bandits
  • Dynamic Bandits
  • Sleeping Bandits
Sep 30: Dueling Bandits (contd.)
Oct 2: Paper presentation
  • Robust DB
  • Fair DB
  • DB with correlated preferences
Oct 7: Contextual Bandits: EXP4 (one-step RL)
Oct 9: Paper presentation
Oct 14: Contextual MAB (contd.)
Oct 16: Contextual Dueling Bandits
Oct 21: Intro to RL + MDP Basics
Oct 23: Tabular MDPs: UCB-VI
Oct 28: Paper presentation
Oct 30: --- Happy Halloween! ---
Nov 4: Linear function approximation
Nov 6: Policy Gradient (PG) Methods
Nov 11: PPO + TRPO
Nov 13: Paper presentation
Nov 18: Imitation Learning
Nov 20: Paper presentation
Nov 25: Project presentations (20 min/team)
Nov 27: --- Thanksgiving Break! ---
Dec 2: Project presentations (20 min/team)
Dec 4: Project presentations (20 min/team)

πŸ“š Course Description (Tentative)

Overview

This course aims to provide a rigorous mathematical foundation for understanding and implementing Reinforcement Learning from Human Feedback (RLHF). The curriculum is structured around interconnected modules that progress from theoretical foundations to practical implementation.

Modules

Topics Set 1: Formalizing the Alignment Problem

Establishes the conceptual framework for AI alignment, examining outer versus inner alignment, reward misspecification, and Goodhart’s Law in human feedback systems. Students analyze real-world failure modes and develop intuition for how alignment can degrade in deployed systems.

Topics Set 2: Evaluation & Alignment Verification

Addresses measuring alignment beyond simple reward maximization, covering multi-dimensional evaluation frameworks (HHH: Helpfulness, Harmlessness, Honesty), human evaluation pipeline design, adversarial testing, and the robustness verification techniques essential for production deployment.

Topics Set 3: Reinforcement Learning Theory

Provides the mathematical foundations underlying RLHF algorithms. Beginning with MDP fundamentals and Bellman equations, the module progresses through policy gradient methods, exploration strategies in tabular and linear settings, and advanced topics including low-rank MDPs and uniform convergence theory. Students master both classical results and recent theoretical developments.
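As a concrete illustration of the MDP fundamentals and Bellman equations covered in this module, here is a minimal tabular value-iteration sketch. The toy two-state, two-action MDP (its transition and reward tensors) is invented for illustration and is not taken from the course materials.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[s, a, s'] = transition probability; R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # discount factor

# Value iteration: repeatedly apply the Bellman optimality operator
#   V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s, a, s') V(s') ]
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * P @ V        # Q[s, a]; P @ V sums over s'
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

policy = Q.argmax(axis=1)  # greedy policy w.r.t. the converged Q-values
```

Since the Bellman operator is a gamma-contraction, the loop converges geometrically; UCB-VI (on the schedule) adds exploration bonuses on top of this same backup.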

Topics Set 4: RLHF Theory and Practice

Synthesizes preference learning, contextual bandits, and human-in-the-loop optimization. Topics include active learning for efficient feedback collection, handling noisy and biased human inputs, integrating multiple feedback sources, and maintaining safety and robustness guarantees throughout the pipeline.
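Preference learning of the kind covered here is commonly grounded in the Bradley-Terry model, where the probability that item A is preferred over item B is a sigmoid of the reward gap. A minimal sketch of fitting per-item rewards from pairwise comparisons follows; the toy comparison data and learning rate are illustrative assumptions, not course material.

```python
import math

def bt_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry model: P(A preferred over B) = sigmoid(r_a - r_b)."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

# Toy data: (winner, loser) index pairs among 3 items (illustrative only).
comparisons = [(0, 1), (0, 2), (1, 2), (0, 1)]
r = [0.0, 0.0, 0.0]  # learned reward per item
lr = 0.5

# Gradient ascent on the Bradley-Terry log-likelihood:
#   d/dr_w log P(w > l) = 1 - sigmoid(r_w - r_l), and the negative for r_l.
for _ in range(200):
    for w, l in comparisons:
        g = 1.0 - bt_prob(r[w], r[l])
        r[w] += lr * g
        r[l] -= lr * g
```

With this data, item 0 (which wins every comparison) ends up with the highest fitted reward and item 2 with the lowest; in RLHF the same log-likelihood is optimized over a neural reward model instead of per-item scalars.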

Topics Set 5: Large Language Model Theory

Bridges abstract RL theory and practical LLM deployment. Covers transformer architectures, fine-tuning methodologies, parameter-efficient adaptation (LoRA, adapters), preference modeling for reward extraction, and specialized RL algorithms (e.g., PPO with KL regularization) tailored to language model optimization.
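The KL-regularized PPO objective mentioned above is often implemented by subtracting a penalty, proportional to the log-probability gap between the fine-tuned policy and a frozen reference model, from the reward-model score. This is a hedged sketch of that shaping step: the per-token decomposition and beta = 0.1 are common conventions, not specifics from the course.

```python
def kl_shaped_reward(reward: float,
                     logp_policy: float,
                     logp_ref: float,
                     beta: float = 0.1) -> float:
    """KL-regularized RLHF reward for one sampled token:
        r_total = r_RM - beta * (log pi_theta(a|s) - log pi_ref(a|s)).
    Exact schemes vary across implementations; beta trades off reward
    maximization against staying close to the reference model."""
    kl_penalty = logp_policy - logp_ref
    return reward - beta * kl_penalty
```

When the policy assigns the same log-probability as the reference, the penalty vanishes and the raw reward-model score is used unchanged; as the policy drifts toward higher-probability (over-optimized) tokens, the shaped reward shrinks.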

Prerequisites

This course demands strong mathematical maturity and technical proficiency. Advanced linear algebra and probability theory are essential (matrix analysis, eigendecompositions, concentration inequalities, stochastic processes). Knowledge of machine learning theory (optimization, generalization bounds, statistical learning theory) is required.

Programming competency in Python is required (PyTorch or similar recommended), and LaTeX proficiency is mandatory for assignments and the final project. The theoretical content assumes familiarity with measure theory, basics of functional analysis, and advanced calculus. Students unsure about preparation should complete prerequisite coursework before enrolling.

Learning Outcomes

Graduates will understand the mathematical principles governing preference learning, design and evaluate alignment verification systems, and implement production-grade RLHF pipelines. The course prepares students for advanced research roles in AI labs, senior engineering roles deploying large language models, and independent research in AI alignment. Students will be equipped to drive innovation, lead technical teams, and contribute to the theoretical and practical advances shaping the future of safe AI deployment.

⚠️ Prerequisites

Expect this to be a fairly math-intensive course. Please familiarize yourself with the basics of Probability-Statistics (PS) and Linear-Algebra (LA). Recommended introductory lectures to check whether you are comfortable with the basics:

Familiarity with LaTeX for scientific writing, used for scribing the lecture notes (required only for MS students). You can learn the basics from A Simple Quickstart Guide. Many other online tutorials are available for beginners β€” feel free to explore and use whichever best suits your needs.

Programming experience in Python, for the experiments. Be prepared to code: [Python ML Tutorials], [Google Colab] (many other online tutorials are available for beginners).

A strong grasp of the foundational material outlined above is expected of all students taking the course for credit. Insufficient preparation may adversely affect your ability to engage with the course content and perform successfully in assessments, which may impact your final grades.

πŸ† Grading Policy

  • Project + Report: 25%
    Problem selection 5% β€’ Motivation 5% β€’ Solution 10% β€’ Experiments 5%
  • Paper Presentation + Coding: 20%
    Topic choice + Motivation 5% β€’ Theoretical explanation 10% β€’ Experiments 5%
  • Scribe: 15%
    Approximately 2 lectures
  • Piazza Weekly Problem: 10% (+5% extra credit)
    Post one new unresolved problem per week, with motivation, based on that week's lectures
  • Piazza Weekly Paper: 10% (+5% extra credit)
    Find one related paper per week based on the topics covered that week
  • Class Participation: 10% (+5% extra credit)
    Lecture questions/answers 5% β€’ Presentation questions 5%
  • Quiz: 10%
    Random iClicker quizzes

πŸ“– Resources

Class lectures will be based on, but not limited to, the following books:

🎯 Course Logistics

  • πŸ“ Location: CDRLC
  • ⏰ Schedule: Tuesday & Thursday, 2:00 - 3:15 PM
  • πŸ›οΈ Office Hours: Thurdays 5:00–6:00 PM or by appointment
  • πŸ“§ Piazza: TBA

πŸ“Œ Important Dates

  • Select Project Topic: Oct 3rd, 2025
  • Project Presentation: Nov 25th, Dec 2nd, Dec 4th, 2025
  • Project Report: Dec 6th, 2025