ML System Papers and Resources
ML System Papers
July 2025
- any4: Learned 4-bit Numeric Representation for LLMs
- AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs
- Fast and Simplex: 2-Simplicial Attention in Triton
- ZeCO: Zero Communication Overhead Sequence Parallelism for Linear Attention
- AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training
- Kwai Keye-VL Technical Report
June 2025
- Ovis-U1 Technical Report
- VMoBA: Mixture-of-Block Attention for Video Diffusion Models
- Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation
- GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization
- BestServe: Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures
- Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity
- Truncated Proximal Policy Optimization
- MoTE: Mixture of Ternary Experts for Memory-efficient Large Multimodal Models
- SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation
- CommVQ: Commutative Vector Quantization for KV Cache Compression
- DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster
- Scaling Speculative Decoding with Lookahead Reasoning
- OAgents: An Empirical Study of Building Effective Agents
- Efficient RL Training - Optimizing Memory Usage in verl
- FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
- Hardware-Efficient Attention for Fast Decoding
- PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention in Visual Generation Models
- Scaling Test-time Compute for LLM Agents
- MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
- Magistral
- Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
- Seedance 1.0: Exploring the Boundaries of Video Generation Models
- NoLoCo: No-all-reduce Low Communication Training Method for Large Models
- Reinforcement Pre-Training
- SeerAttention-R: Sparse Attention Adaptation for Long Reasoning
- Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
- Inference-Time Hyper-Scaling with KV Cache Compression
- MiMo-VL Technical Report
May 2025
- KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
- SageAttention2++: A More Efficient Implementation of SageAttention2
- QwenLong-CPRS: Towards $\infty$-LLMs with Dynamic Context Optimization
- Scaling Law for Quantization-Aware Training
- Quartet: Native FP4 Training Can Be Optimal for Large Language Models
- Emerging Properties in Unified Multimodal Pretraining
- SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training
- Chain-of-Model Learning for Language Model
- AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning
- Fast and Accurate Sparse Attention Inference by Delta Correction
- Model Merging in Pre-training of Large Language Models
- EfficientLLM: Efficiency in Large Language Models
- Qwen3 Technical Report
- DanceGRPO: Unleashing GRPO on Visual Generation
- AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection
- MiMo: Unlocking the Reasoning Potential of Language Model – From Pretraining to Posttraining
- Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
- Seed1.5-VL Technical Report
- Llama-Nemotron: Efficient Reasoning Models
- An Empirical Study of Qwen3 Quantization
- BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs
April 2025
- Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler
- TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives
- Multi-Token Attention
- BitNet b1.58 2B4T Technical Report
- Efficient Pretraining Length Scaling
- Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
- Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models
March 2025
- Qwen2.5-Omni Technical Report
- Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
- Gemma 3 Technical Report
- Frac-Connections: Fractional Extension of Hyper-Connections
- DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- Outperforming cuBLAS on H100: a Worklog (Pranjal Shankhdhar)
- GitHub - bertmaher/simplegemm
- Training Video Foundation Models with NVIDIA NeMo
- Technologies on Effectiveness and Efficiency: A Survey of State Spaces Models
- A Review of DeepSeek Models’ Key Innovative Techniques
- Transformers without Normalization
- Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo
- Quantization for OpenAI’s Whisper Models: A Comparative Analysis
- Gemini Embedding: Generalizable Embeddings from Gemini
February 2025
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
- Qwen2.5-VL Technical Report
- Memory-Efficient LoRA Training for Large Language Models
- Eager Updates For Overlapped Communication and Computation in DiLoCo
- Logical Reasoning in Large Language Models: A Survey
- InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU
- TransMLA: Multi-Head Latent Attention Is All You Need
- Matryoshka Quantization
- How To Scale Your Model
- FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation
January 2025
- s1: Simple test-time scaling
- Streaming DiLoCo with overlapping communication: Towards
- Optimizing Large Language Model Training Using FP4 Quantization
- SFT Memorizes, RL Generalizes: A Comparative Study of Founda…
- Sigma: Differential Rescaling of Query, Key and Value for…
- Parameter-Efficient Fine-Tuning for Foundation Models
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via…
- Kimi k1.5: Scaling Reinforcement Learning with LLMs
- Foundations of Large Language Models
- MiniMax-01: Scaling Foundation Models with Lightning Attenti…
- Tensor Product Attention Is All You Need
- O1 Replication Journey – Part 3: Inference-time Scaling for…
- Transformer-Squared: Self-adaptive LLMs
- PyTorch Forums: [Distributed w/ TorchTitan] Breaking Barriers: Training Long…
- Scaling Laws for Floating Point Quantization Training
- Titans: Learning to Memorize at Test Time
December 2024
- DeepSeek-V3 Technical Report
- 1.58-bit FLUX
- OpenAI o1 System Card
- MixLLM: LLM Quantization with Global Mixed-precision between…
- Qwen2.5 Technical Report
- PyTorch Forums: [Distributed w/ TorchTitan] Training with Zero-Bubble Pipeli…
- No More Adam: Learning Rate Scaling at Initialization is All…
- Byte Latent Transformer: Patches Scale Better Than Tokens