Discover a novel mixed-precision quantization method for Mixture-of-Experts models that boosts accuracy and reduces inference costs with theoretical guaran...
Discover how Hybrid QUBO optimization improves neural network pruning by combining sensitivity metrics and dynamic search for better performance and effici...
Discover an ordered pipeline that combines pruning, quantization, and distillation to compress neural networks while keeping latency low and accuracy high.
Discover SoLA, a training-free method using soft activation sparsity and low-rank decomposition to compress large language models efficiently without perfo...