Part 1 - LLMs: Backbone of Modern AI
Introduction
(Required) The Illustrated Transformer
Jay Alammar. Blog post.
(Required) The Llama 3 Herd of Models (Sections 2, 3.3, and 4.1)
Llama Team, AI @ Meta. arXiv 2024.
(Required) Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. SC 21.
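Megatron-LM's central technique, tensor (intra-layer) model parallelism, shards a linear layer's weight matrix across GPUs; each device computes a partial product and an all-reduce combines them. A minimal pure-Python sketch of the row-parallel case, with the all-reduce modeled as an elementwise sum (illustrative only, not Megatron-LM's API):

```python
# Sketch of tensor parallelism for a linear layer Y = X @ W, with W split by
# rows across two "devices" and the all-reduce modeled as a sum.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

X = [[1.0, 2.0, 3.0, 4.0]]   # 1x4 activation
W = [[1.0, 0.0],             # 4x2 weight
     [0.0, 1.0],
     [1.0, 1.0],
     [2.0, 0.0]]

# Row-parallel split: device 0 holds rows 0-1 of W, device 1 holds rows 2-3.
# Each device also gets the matching slice of X's columns.
W0, W1 = W[:2], W[2:]
X0, X1 = [X[0][:2]], [X[0][2:]]

partial0 = matmul(X0, W0)    # device 0's partial product
partial1 = matmul(X1, W1)    # device 1's partial product

# "All-reduce": elementwise sum of the partial outputs.
Y = [[a + b for a, b in zip(partial0[0], partial1[0])]]

assert Y == matmul(X, W)     # matches the unsharded computation
print(Y)
```

The column-parallel variant splits W by columns instead and replaces the all-reduce with an all-gather of output shards; Megatron alternates the two so an MLP block needs only one all-reduce per pass.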
Scaling LLM Pre-Training
(Required - Context Parallelism) WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training
Zheng Wang, Anna Cai, Xinfeng Xie, Zaifeng Pan, Yue Guan, Weiwei Chu, Jie Wang, Shikai Li, Jianyu Huang, Chris Cai, Yuchen Hao, and Yufei Ding. OSDI 25.
Enabling Parallelism Hot Switching for Efficient Training of Large Language Models
Hao Ge, Fangcheng Fu, Haoyang Li, Xuanyu Wang, Sheng Lin, Yujie Wang, Xiaonan Nie, Hailin Zhang, Xupeng Miao, and Bin Cui. SOSP 24.
(Required - Auto Parallelism) Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
Lianmin Zheng, Siyuan Zhuang, Zhuohan Li, Heng Zhang, Joseph E. Gonzalez, and Ion Stoica. OSDI 22.
PartIR: Composing SPMD Partitioning Strategies for Machine Learning
Sami Alabed, Daniel Belov, Bart Chrzaszcz, Juliana Franco, Dominik Grewe, Dougal Maclaurin, James Molloy, Tom Natan, Tamara Norman, Xiaoyue Pan, Adam Paszke, Norman A. Rink, Michael Schaarschmidt, Timur Sitdikov, Agnieszka Swietlik, Dimitrios Vytiniotis, and Joel Wee. ASPLOS 25.
(Required - Network) RDMA over Ethernet for Distributed Training at Meta Scale
Adithya Gangidi, Rui Miao, Shengbao Zheng, Sai J. Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Mohammad Riftadi, Ashmitha J. Shetty, Jingyi Yang, Shuqiang Zhang, Mikel Jimenez Fernandez, Shashidhar Gandham, and Hongyi Zeng. SIGCOMM 24.
CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters
Sudarsanan Rajasekaran, Manya Ghobadi, and Aditya Akella. NSDI 24.
(Required - Silent Data Corruption) Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks
Yuxuan Jiang, Ziming Zhou, Boyu Xu, Beijie Liu, Runhui Xu, and Peng Huang. OSDI 25.
SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation
Yifan Xiong, Yuting Jiang, Ziyue Yang, Lei Qu, Guoshuai Zhao, Shuguang Liu, Dong Zhong, Boris Pinzur, Jie Zhang, Yang Wang, Jithin Jose, Hossein Pourreza, Jeff Baxter, Kushal Datta, Prabhat Ram, Luke Melton, Joe Chau, Peng Cheng, Yongqiang Xiong, and Lidong Zhou. ATC 24.
(Required - Fault-Tolerance) Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates
Insu Jang, Zhenning Yang, Zhen Zhang, Xin Jin, and Mosharaf Chowdhury. SOSP 23.
Tenplex: Dynamic Parallelism for Deep Learning Using Parallelizable Tensor Collections
Marcel Wagenländer, Guo Li, Bo Zhao, Luo Mai, and Peter Pietzuch. SOSP 24.
LLM Post-Training for Alignment
(Required - Resource Efficiency) HybridFlow: A Flexible and Efficient RLHF Framework
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. EuroSys 25.
Optimizing RLHF Training for Large Language Models with Stage Fusion
Yinmin Zhong, Zili Zhang, Bingyang Wu, Shengyu Liu, Yukun Chen, Changyi Wan, Hanpeng Hu, Lei Xia, Ranchen Ming, Yibo Zhu, and Xin Jin. NSDI 25.
(Required - Async RL) AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, and Yi Wu. arXiv 2025.
Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, and Aaron Courville. ICLR 25.
Efficient LLM Serving
(Required - KV Cache Management) Efficient Memory Management for Large Language Model Serving with PagedAttention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. SOSP 23.
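PagedAttention borrows virtual-memory paging for the KV cache: each sequence keeps a block table mapping logical token positions to fixed-size physical blocks, so memory is allocated on demand and fragmentation is bounded by one block per sequence. A minimal sketch of the bookkeeping (names like `BlockAllocator` are illustrative, not vLLM's API):

```python
# Sketch of PagedAttention-style KV-cache management: logical token
# positions map through a per-sequence block table to physical blocks.

BLOCK_SIZE = 16  # tokens per physical KV block

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)

class Sequence:
    """Maps a sequence's logical token positions to physical blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full,
        # so waste is at most one partially filled block per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def slot(self, pos):
        # Physical location of token `pos`: (block id, offset within block).
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

alloc = BlockAllocator(num_blocks=64)
seq = Sequence(alloc)
for _ in range(20):
    seq.append_token()
print(len(seq.block_table))  # 20 tokens fit in 2 blocks of 16
print(seq.slot(17))          # token 17: second block, offset 1
```

Because blocks need not be contiguous, sequences can grow independently, and (in the full system) block tables also enable sharing blocks across sequences with copy-on-write.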
Orca: A Distributed Serving System for Transformer-Based Generative Models
Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. OSDI 22.
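Orca's key idea, iteration-level (continuous) batching, re-forms the batch at every decoding step: finished requests leave immediately and waiting requests join without draining the batch. A toy scheduler illustrating the effect (illustrative names, not Orca's API):

```python
# Sketch of iteration-level scheduling: batch membership is reconsidered
# every decode step instead of once per batch.
from collections import deque

def serve(requests, max_batch=2):
    """requests: list of (name, tokens_to_generate). Returns per-step batches."""
    waiting = deque(requests)
    running = []           # [name, remaining_tokens]
    trace = []
    while waiting or running:
        # Admit new requests whenever a slot is free (the iteration-level part).
        while waiting and len(running) < max_batch:
            name, n = waiting.popleft()
            running.append([name, n])
        trace.append([name for name, _ in running])
        for r in running:
            r[1] -= 1      # each batch member decodes one token this step
        running = [r for r in running if r[1] > 0]
    return trace

steps = serve([("A", 2), ("B", 3), ("C", 1)])
print(steps)
# A finishes after step 2, so C joins B at step 3 instead of waiting
# for the whole batch to drain.
```

With request-level batching, C would have waited until both A and B finished; iteration-level scheduling is what keeps GPU slots from idling on stragglers.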
(Required - Optimal Throughput) NanoFlow: Towards Optimal Large Language Model Serving Throughput
Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, and Baris Kasikci. OSDI 25.
Taming Throughput–Latency Trade-off in LLM Inference with Sarathi-Serve
Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. OSDI 24.
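Sarathi-Serve's chunked prefill splits a long prompt's prefill across iterations under a per-step token budget, leaving headroom to piggyback ongoing decode tokens in the same batch so decodes are never stalled behind a monolithic prefill. A minimal sketch under assumed budgets (function names are illustrative):

```python
# Sketch of chunked prefill with a per-iteration token budget: decode
# tokens get priority, and prefill fills whatever budget remains.

def schedule_iteration(prefill_remaining, num_decodes, token_budget=8):
    """Returns (prefill_tokens_this_step, decode_tokens_this_step)."""
    decode_tokens = min(num_decodes, token_budget)   # one token per decode
    prefill_tokens = min(prefill_remaining, token_budget - decode_tokens)
    return prefill_tokens, decode_tokens

# A 20-token prompt arrives while 3 requests are mid-decode.
remaining, steps = 20, []
while remaining > 0:
    p, d = schedule_iteration(remaining, num_decodes=3)
    remaining -= p
    steps.append((p, d))
print(steps)  # the prefill is spread over 4 steps; decodes never stall
```

The budget is the knob behind the paper's throughput-latency trade-off: larger chunks raise throughput, smaller chunks tighten decode latency.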
(Required - Prefill/Decode Disaggregation) DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. OSDI 24.
LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism
Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, and Xin Jin. SOSP 24.
(Required - New Hardware) WaferLLM: Large Language Model Inference at Wafer Scale
Congjie He, Yeqi Huang, Pei Mu, Ziming Miao, Jilong Xue, Lingxiao Ma, Fan Yang, and Luo Mai. OSDI 25.
Aqua: Network-Accelerated Memory Offloading for LLMs in Scale-Up GPU Domains
Abhishek Vijaya Kumar, Gianni Antichi, and Rachee Singh. ASPLOS 25.
Mixture-of-Experts
(Required - Training) FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models
Xinglin Pan, Wenxiang Lin, Lin Zhang, Shaohuai Shi, Zhenheng Tang, Rui Wang, Bo Li, and Xiaowen Chu. ASPLOS 25.
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia. MLSys 23.
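The sparsity both MoE papers exploit comes from top-k gating: a router scores all experts per token, but only the k highest-scoring experts run, and their outputs are combined with renormalized gate weights. A pure-Python sketch of the routing math (illustrative, not either system's API):

```python
# Sketch of top-k expert routing in a sparse MoE layer.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_layer(x, gate_logits, experts, k=2):
    probs = softmax(gate_logits)
    topk = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in topk)   # renormalize over the chosen experts
    # Only the selected experts are evaluated -- the source of MoE sparsity.
    return sum(probs[i] / norm * experts[i](x) for i in topk)

experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
y = moe_layer(3.0, gate_logits=[0.1, 2.0, -1.0, 2.0], experts=experts, k=2)
print(y)  # experts 1 and 3 tie for the top gates: 0.5 * 6 + 0.5 * 9 = 7.5
```

The systems challenge these papers tackle is downstream of this math: token counts per expert are data-dependent and imbalanced, which MegaBlocks handles with block-sparse kernels instead of padding or dropping tokens.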
(Required - Serving) MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E. Gonzalez, Matei Zaharia, and Ion Stoica. ASPLOS 25.
Read-ME: Refactorizing LLMs as Router-Decoupled Mixture-of-Experts with System Co-Design
Ruisi Cai, Yeonju Ro, Geon-Woo Kim, Peihao Wang, Babak Ehteshami Bejnordi, Aditya Akella, and Zhangyang Wang. NeurIPS 24.
Part 2 - GenAI: Beyond Simple Text Generation
Multi-Modal Generation
(Required) The Illustrated Stable Diffusion
Jay Alammar. Blog post.
(Required - Diffusion Model Serving) Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models
Shubham Agarwal, Subrata Mitra, Sarthak Chakraborty, Srikrishna Karanam, Koyel Mukherjee, and Shiv K. Saini. NSDI 24.
DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling
Sohaib Ahmad, Qizheng Yang, Haoliang Wang, Ramesh K. Sitaraman, and Hui Guan. MLSys 24.
(Required - Video Gen Model) CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. ICLR 25.
MovieGen: A Cast of Media Foundation Models
Meta AI Research. arXiv 2024.
Retrieval-Augmented Generation
(Required) RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
Wenqi Jiang, Suvinay Subramanian, Cat Graves, Gustavo Alonso, Amir Yazdanbakhsh, and Vidushi Dadu. ISCA 25.
(Required) Patchwork: A Unified Framework for RAG Serving
Bodun Hu, Luis Pabon, Saurabh Agarwal, and Aditya Akella. arXiv 2025.
METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation
Siddhant Ray, Rui Pan, Zhuohan Gu, Kuntai Du, Shaoting Feng, Ganesh Ananthanarayanan, Ravi Netravali, and Junchen Jiang. arXiv 2025.
TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval
Chien-Yu Lin, Keisuke Kamahori, Yiyu Liu, Xiaoxiang Shi, Madhav Kashyap, Yile Gu, Zihao Ye, Stephanie Wang, Arvind Krishnamurthy, and Baris Kasikci. arXiv 2025.
LLM-Augmented Agentic Systems
(Required) Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. OSDI 24.
(Required) Autellix: An Efficient Serving Engine for LLM Agents as General Programs
Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E. Gonzalez, and Ion Stoica. arXiv 2025.
Towards End-to-End Optimization of LLM-based Applications with Ayo
Xin Tan, Yimin Jiang, Yitao Yang, and Hong Xu. ASPLOS 25.
Part 3 - GenAI for Systems
(Required) Intent-Driven Network Management with Multi-Agent LLMs: The Confucius Framework
Zhaodong Wang, Samuel Lin, Guanqing Yan, Soudeh Ghorbani, Minlan Yu, Jiawei Zhou, Nathan Hu, Lopa Baruah, Sam Peters, Srikanth Kamath, Jerry Yang, and Ying Zhang. SIGCOMM 25.
(Required) NetLLM: Adapting Large Language Models for Networking
Duo Wu, Xianda Wang, Yaqi Qiao, Zhi Wang, Junchen Jiang, Shuguang Cui, and Fangxin Wang. SIGCOMM 24.
Zoom2Net: Constrained Network Telemetry Imputation
Xizheng Wang, Yuxiong He, Maria Apostolaki, Xin Jin, Mohammad Alizadeh, and Jennifer Rexford. SIGCOMM 24.
(Required) Resource-efficient Inference with Foundation Model Programs
Lunyiu Nie, Zhimin Ding, Kevin Yu, Marco Cheung, Christopher Jermaine, and Swarat Chaudhuri. COLM 25.
TextGrad: Automatic “Differentiation” via Text
Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. arXiv 2024.
(Required) AlphaEvolve: A Coding Agent for Scientific and Algorithmic Discovery
Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Z. Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, et al. DeepMind White Paper 2025.