I am a fifth-year Ph.D. student in the Department of Computer Science at the University of Chicago, advised by Prof. Junchen Jiang and Prof. Shan Lu. My research builds efficient inference systems for large language models. In particular, I built CacheGen, the first compression and streaming system for the KV cache, designed to reduce its inline network transmission latency, and DroidSpeak, the first translation system for the KV cache between two different LLMs.
I received my B.S. in Computer Science from the University of Wisconsin–Madison in 2021, where I was fortunate to be advised by Prof. Shivaram Venkataraman.
* denotes equal contribution. Full list available on Google Scholar.
A production-ready stack for deploying LLM inference systems at scale.