Part 1 - LLMs: Backbone of Modern AI
Scaling LLM Training
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. SC 21.
Reducing Activation Recomputation in Large Transformer Models
Vijay A. Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. MLSys 23.
Concerto: Automatic Communication Optimization and Scheduling for Large-Scale Deep Learning
Shenggan Cheng, Shengjie Lin, Lansong Diao, Hao Wu, Siyu Wang, Chang Si, Ziming Liu, Xuanlei Zhao, Jiangsu Du, Wei Lin, and Yang You. ASPLOS 25.
RDMA over Ethernet for Distributed Training at Meta Scale
Adithya Gangidi, Rui Miao, Shengbao Zheng, Sai J. Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Mohammad Riftadi, Ashmitha J. Shetty, Jingyi Yang, Shuqiang Zhang, Mikel Jimenez Fernandez, Shashidhar Gandham, and Hongyi Zeng. SIGCOMM 24.
CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters
Sudarsanan Rajasekaran, Manya Ghobadi, and Aditya Akella. NSDI 24.
Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates
Insu Jang, Zhenning Yang, Zhen Zhang, Xin Jin, and Mosharaf Chowdhury. SOSP 23.
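As a bridge between the papers above, here is a minimal pure-Python sketch (my illustration, not Megatron-LM code) of the tensor-model-parallel MLP from the Megatron-LM paper: the first weight matrix is split by columns and the second by rows, so each "device" computes a partial output independently and a single all-reduce (here just a sum) finishes the layer. The tiny matrices and ReLU (the paper uses GeLU) are simplifications.

```python
# Dense helpers over plain nested lists; each simulated device holds one shard.
def matmul(X, W):
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)] for row in X]

def relu(M):  # stand-in for the paper's GeLU nonlinearity
    return [[max(v, 0.0) for v in row] for row in M]

def add(M, N):  # the "all-reduce": sum the partial outputs
    return [[a + b for a, b in zip(r, s)] for r, s in zip(M, N)]

X = [[1.0, -2.0], [0.5, 3.0]]                          # activations (batch x hidden)
A = [[1.0, 0.0, -1.0, 2.0], [0.5, 1.0, 0.0, -1.0]]     # hidden -> 4, split by columns
B = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [1.0, 1.0]]  # 4 -> hidden, split by rows

A1 = [row[:2] for row in A]; A2 = [row[2:] for row in A]  # column shards of A
B1 = B[:2]; B2 = B[2:]                                    # row shards of B

# The elementwise nonlinearity stays local: the column split keeps each
# activation entirely on one device, so no communication is needed mid-layer.
Y1 = matmul(relu(matmul(X, A1)), B1)   # device 0's partial result
Y2 = matmul(relu(matmul(X, A2)), B2)   # device 1's partial result
Y = add(Y1, Y2)                        # the layer's single all-reduce

assert Y == matmul(relu(matmul(X, A)), B)  # matches the unsharded MLP
```

This column-then-row split is why the Megatron MLP needs only one all-reduce per forward pass instead of communicating after every matmul.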
Efficient LLM Serving
Efficient Memory Management for Large Language Model Serving with PagedAttention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. SOSP 23.
Splitwise: Efficient Generative LLM Inference Using Phase Splitting
Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. ISCA 24.
NanoFlow: Towards Optimal Large Language Model Serving Throughput
Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, and Baris Kasikci. arXiv 2024.
Llumnix: Dynamic Scheduling for Large Language Model Serving
Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. OSDI 24.
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. OSDI 24.
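To make the serving section concrete, here is a toy sketch (my illustration, not vLLM's implementation) of the core PagedAttention idea: the KV cache is carved into fixed-size physical blocks, and each request keeps a block table mapping logical token positions to physical blocks, so memory is allocated on demand and reclaimed without fragmentation, much like OS virtual-memory paging.

```python
class PagedKVCache:
    def __init__(self, num_blocks, block_size=4):   # vLLM defaults to 16 tokens/block
        self.block_size = block_size
        self.free = list(range(num_blocks))          # pool of physical block ids
        self.tables = {}                             # request id -> [physical block ids]
        self.lens = {}                               # request id -> tokens cached so far
        self.slots = {}                              # (block id, offset) -> KV entry

    def append(self, req, kv):
        """Cache one token's KV entry, grabbing a new block only when needed."""
        table = self.tables.setdefault(req, [])
        n = self.lens.get(req, 0)
        if n % self.block_size == 0:                 # current block full (or first token)
            table.append(self.free.pop())
        block = table[n // self.block_size]
        self.slots[(block, n % self.block_size)] = kv
        self.lens[req] = n + 1

    def read(self, req, pos):
        """Translate a logical token position through the request's block table."""
        block = self.tables[req][pos // self.block_size]
        return self.slots[(block, pos % self.block_size)]

    def release(self, req):
        """Return all of a finished request's blocks to the pool."""
        self.free.extend(self.tables.pop(req))
        del self.lens[req]

cache = PagedKVCache(num_blocks=8)
for t in range(6):
    cache.append("req0", f"kv{t}")
assert cache.read("req0", 5) == "kv5"
assert len(cache.tables["req0"]) == 2    # 6 tokens at 4 tokens/block -> 2 blocks
cache.release("req0")
assert len(cache.free) == 8              # everything returned to the pool
```

The real system stores key/value tensors per attention head rather than strings, but the block-table indirection is the mechanism that lets it pack many requests without per-request contiguous reservations.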
Mixture-of-Experts
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. ICLR 17.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
William Fedus, Barret Zoph, and Noam Shazeer. JMLR 23.
FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models
Xinglin Pan, Wenxiang Lin, Lin Zhang, Shaohuai Shi, Zhenheng Tang, Rui Wang, Bo Li, and Xiaowen Chu. ASPLOS 25.
Pre-Gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, and Mao Yang. ISCA 24.
Read-ME: Refactorizing LLMs as Router-Decoupled Mixture-of-Experts with System Co-Design
Ruisi Cai, Yeonju Ro, Geon-Woo Kim, Peihao Wang, Babak Ehteshami Bejnordi, Aditya Akella, and Zhangyang Wang. NeurIPS 24.
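The common thread of this section is sparse routing. Below is a hedged, pure-Python sketch of the sparsely-gated MoE layer from Shazeer et al.: a learned gate scores every expert for a token, only the top-k experts actually run, and their outputs are combined with renormalized gate weights. The "experts" here are toy scalings standing in for feed-forward networks, and the gate weights are random rather than learned.

```python
import math, random

random.seed(0)
NUM_EXPERTS, K, DIM = 4, 2, 3

# Toy experts: fixed elementwise scalings (stand-ins for expert FFNs).
experts = [lambda x, s=s: [s * v for v in x] for s in (0.5, 1.0, 1.5, 2.0)]
# Gate weight matrix (DIM x NUM_EXPERTS); learned in the real layer.
gate_w = [[random.gauss(0, 1) for _ in range(NUM_EXPERTS)] for _ in range(DIM)]

def moe_forward(x):
    # Gate logits = x @ gate_w, then a softmax over experts.
    logits = [sum(x[d] * gate_w[d][e] for d in range(DIM)) for e in range(NUM_EXPERTS)]
    exps = [math.exp(l - max(logits)) for l in logits]
    probs = [v / sum(exps) for v in exps]
    # Keep only the top-k experts and renormalize their gate weights.
    topk = sorted(range(NUM_EXPERTS), key=probs.__getitem__, reverse=True)[:K]
    norm = sum(probs[e] for e in topk)
    out = [0.0] * DIM
    for e in topk:                      # only k of the experts execute per token
        y = experts[e](x)
        out = [o + (probs[e] / norm) * v for o, v in zip(out, y)]
    return out, topk

y, routed = moe_forward([1.0, -0.5, 0.2])
assert len(routed) == K                 # each token touches exactly k experts
assert len(y) == DIM
```

Because compute scales with k rather than with the total expert count, parameter count can grow enormously (the "outrageously large" and trillion-parameter results above) at roughly constant FLOPs per token; the systems papers in this section attack the communication and load-balancing costs that routing introduces.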
Part 2 - GenAI: Beyond Simple Text Generation
Multi-Modal Generation
Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models
Shubham Agarwal, Subrata Mitra, Sarthak Chakraborty, Srikrishna Karanam, Koyel Mukherjee, and Shiv K. Saini. NSDI 24.
DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling
Sohaib Ahmad, Qizheng Yang, Haoliang Wang, Ramesh K. Sitaraman, and Hui Guan. MLSys 25.
Movie Gen: A Cast of Media Foundation Models
Meta AI Research. arXiv 2024.
ModServe: Scalable and Resource-Efficient Large Multimodal Model Serving
Haoran Qiu, Yihan Wang, Yao Zhang, Minjia Zhang, Jianping Fan, and Yuxiong He. Preprint 2024.
DREAM: A Dynamic Scheduler for Dynamic Real-Time Multi-Model ML Workloads
Seah Kim, Hyoukjun Kwon, Jinook Song, Jihyuck Jo, Yu-Hsin Chen, Liangzhen Lai, and Vikas Chandra. ASPLOS 23.
LLM-Augmented Agentic Systems
RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
Wenqi Jiang, Suvinay Subramanian, Cat Graves, Gustavo Alonso, Amir Yazdanbakhsh, and Vidushi Dadu. arXiv 2025.
TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval
Chien-Yu Lin, Keisuke Kamahori, Yiyu Liu, Xiaoxiang Shi, Madhav Kashyap, Yile Gu, Zihao Ye, Stephanie Wang, Arvind Krishnamurthy, and Baris Kasikci. arXiv 2025.
Autellix: An Efficient Serving Engine for LLM Agents as General Programs
Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E. Gonzalez, and Ion Stoica. arXiv 2025.
Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. OSDI 24.
Resource-efficient Inference with Foundation Model Programs
Lunyiu Nie, Zhimin Ding, Kevin Yu, Marco Cheung, Christopher Jermaine, and Swarat Chaudhuri. arXiv 2025.
Efficiently Serving LLM Reasoning Programs with Certaindex
Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Aurick Qiao, and Hao Zhang. arXiv 2024.
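Several papers in this section optimize the same two-stage structure, so a minimal sketch of a RAG pipeline helps frame them. This is my toy illustration, not any paper's system: the corpus, lexical scorer, and stub generator are all stand-ins, but the retrieve-then-generate critical path (which systems like TeleRAG overlap and RAGO schedules) is the real shape.

```python
# Toy corpus: document id -> passage text (a stand-in for a vector store).
CORPUS = {
    "megatron": "Megatron-LM combines tensor and pipeline parallelism.",
    "vllm": "vLLM manages the KV cache with PagedAttention.",
    "moe": "MoE layers route each token to a few experts.",
}

def retrieve(query, k=2):
    # Toy lexical scorer: count query-word overlap with each passage.
    # Real systems use approximate nearest-neighbor search over embeddings.
    scores = {key: sum(w in text.lower() for w in query.lower().split())
              for key, text in CORPUS.items()}
    return sorted(CORPUS, key=scores.__getitem__, reverse=True)[:k]

def generate(prompt):
    # Stub generator standing in for an LLM call.
    return f"[answer grounded in {prompt.count('CONTEXT')} passages]"

def rag_answer(query):
    passages = [CORPUS[k] for k in retrieve(query)]              # retrieval stage
    prompt = "".join(f"CONTEXT: {p}\n" for p in passages) + f"QUESTION: {query}"
    return generate(prompt)                                       # generation stage

print(rag_answer("how does vLLM manage the KV cache?"))
```

Because generation cannot start until retrieval returns the passages, retrieval latency sits squarely on the end-to-end critical path; hiding or restructuring that dependency is the shared concern of the serving papers above.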
Part 3 - GenAI for Systems
Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents
Patrick T. J. Kon, Jiachen Liu, Qiuyi Ding, Yiming Qiu, Zhenning Yang, Yibo Huang, Jayanth Srinivasa, Myungjin Lee, Mosharaf Chowdhury, and Ang Chen. arXiv 2025.
Mathematical Discoveries from Program Search with Large Language Models (FunSearch)
Matej Balog, Bernhard Schölkopf, Karl Tuyls, Oriol Vinyals, Pushmeet Kohli, and Ulrich Paquet. Nature 2023.
AlphaEvolve: A Coding Agent for Scientific and Algorithmic Discovery
Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Z. Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, et al. DeepMind White Paper 2025.
NetLLM: Adapting Large Language Models for Networking
Duo Wu, Xianda Wang, Yaqi Qiao, Zhi Wang, Junchen Jiang, Shuguang Cui, and Fangxin Wang. SIGCOMM 24.
TextGrad: Automatic “Differentiation” via Text
Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. arXiv 2024.
Zoom2Net: Constrained Network Telemetry Imputation
Xizheng Wang, Yuxiong He, Maria Apostolaki, Xin Jin, Mohammad Alizadeh, and Jennifer Rexford. SIGCOMM 24.