jp6/cu128/: spas-sage-attn-0.1.0 metadata and description
Accurate and efficient Sparse SageAttention.
| Field | Value |
|---|---|
| author | Jintao Zhang, Chendong Xiang, Haofeng Huang |
| author_email | jt-zhang6@gmail.com |
| classifiers | |
| description_content_type | text/markdown |
| dynamic | |
| license | BSD 3-Clause License |
| license_file | |
| metadata_version | 2.4 |
| requires_python | >=3.9 |
Because this project isn't in the mirror_whitelist, no releases from root/pypi are included.
| File | Tox results | History |
|---|---|---|
| spas_sage_attn-0.1.0-cp312-cp312-linux_aarch64.whl | | |
Sparge Attention
The official implementation of SpargeAttn, a universal sparse attention mechanism that accelerates language, image, and video models.
Project Updates
- [2025-05-02]: 🎉SpargeAttn and SageAttention2 are accepted by ICML 2025!
- [2025-01-24]: 🎉SageAttention is accepted by ICLR 2025!
Installation
Base environment
python >= 3.9, torch >= 2.3.0
CUDA:
- >= 12.8 for Blackwell
- >= 12.4 for fp8 support on Ada
- >= 12.3 for fp8 support on Hopper
- >= 12.0 for Ampere
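As a quick sanity check before installing, a small script like the one below (a sketch, not part of the package) prints the relevant versions and the GPU compute capability so they can be compared against the requirements above:

```python
import sys
import torch

print("Python:", sys.version.split()[0])      # needs >= 3.9
print("PyTorch:", torch.__version__)          # needs >= 2.3.0
print("CUDA (torch build):", torch.version.cuda)
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    # 8.0/8.6: Ampere, 8.9: Ada, 9.0: Hopper, 10.x/12.x: Blackwell
    print(f"Compute capability: {major}.{minor}")
```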
Install Package
pip install ninja # for parallel compilation
python setup.py install # or pip install -e .
Available API
- `spas_sage2_attn_meansim_cuda`: SpargeAttn based on SageAttention2.
- `spas_sage_attn_meansim_cuda`: SpargeAttn based on SageAttention.
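A minimal usage sketch follows. It assumes the functions are importable from the top-level `spas_sage_attn` package, take `q`, `k`, `v` in a (batch, heads, seq_len, head_dim) layout, and accept an `is_causal` flag; check the installed function signatures for the exact arguments.

```python
import torch
from spas_sage_attn import spas_sage2_attn_meansim_cuda  # assumed import path

# Random half-precision tensors in an assumed (batch, heads, seq_len, head_dim) layout.
b, h, s, d = 2, 16, 4096, 64
q = torch.randn(b, h, s, d, dtype=torch.float16, device="cuda")
k = torch.randn(b, h, s, d, dtype=torch.float16, device="cuda")
v = torch.randn(b, h, s, d, dtype=torch.float16, device="cuda")

# Used here as a drop-in sparse attention kernel; `is_causal` is an assumed keyword.
out = spas_sage2_attn_meansim_cuda(q, k, v, is_causal=False)
print(out.shape)  # expected: (b, h, s, d)
```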
Usage Examples
CogVideoX
Tuning:
# sequential tuning
python evaluate/cogvideo_example.py --use_spas_sage_attn --model_out_path evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt --tune
# parallel tuning; this will use all GPUs available on the machine
python evaluate/cogvideo_example.py --use_spas_sage_attn --model_out_path evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt --tune --parallel_tune
Inference:
# `--compile` is optional and slows down only the first inference run (compilation overhead).
python evaluate/cogvideo_example.py --use_spas_sage_attn --model_out_path evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt --compile
Note: We provide pre-tuned hyper-parameters in `CogVideoX-2b_0.06_0.07.pt` that allow running the inference script directly. However, for better speed and quality, we recommend re-tuning, because the provided hyper-parameters were tuned with SpargeAttn based on SageAttention, whereas the default API is now based on SageAttention2.
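For a quick look at what a pre-tuned hyper-parameter file contains before deciding to re-tune, a rough sketch is shown below; it only assumes the `.pt` file is loadable with `torch.load`, since its internal structure is not documented here.

```python
import torch

# Load the provided hyper-parameter checkpoint on CPU and peek at its contents.
# On newer torch versions, weights_only=False may be needed if the file stores
# arbitrary Python objects rather than plain tensors.
saved = torch.load("evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt", map_location="cpu")
print(type(saved))
if isinstance(saved, dict):
    for key in list(saved)[:5]:  # show only the first few entries
        print(key, "->", type(saved[key]))
```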
Note: `--compile` is optional; it further accelerates video generation but adds overhead to the first generation.
Llama
The tuning and inference usage is similar to that of CogVideoX.
Supported models
Here’s a list of the tuned models so far; see Hugging Face for all tuned checkpoints. Our approach is universal, and we warmly welcome contributions! Feel free to submit a pull request to support more models. 🚀
| model name | example script | tuned ckpt |
|---|---|---|
| CogVideoX-2b | evaluate/cogvideo_example.py | link |
| want2v-1.3B | evaluate/wan_example.py | link |
| Flux | evaluate/flux_example.py | TBD |
Performance

Note: All experiments in the above table and in our paper used SpargeAttn based on SageAttention. An updated implementation based on SageAttention2 is now available and offers a further 30% speedup.
[Figure] The quality of video generation on Mochi.
[Figure] End-to-end performance of NIAH.
Citation
If you use this code or find our work valuable, please cite:
@inproceedings{zhang2025spargeattn,
title={Spargeattn: Accurate sparse attention accelerating any model inference},
author={Zhang, Jintao and Xiang, Chendong and Huang, Haofeng and Wei, Jia and Xi, Haocheng and Zhu, Jun and Chen, Jianfei},
booktitle={International Conference on Machine Learning (ICML)},
year={2025}
}
@inproceedings{zhang2025sageattention,
title={SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration},
author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Zhu, Jun and Chen, Jianfei},
booktitle={International Conference on Learning Representations (ICLR)},
year={2025}
}
@inproceedings{zhang2024sageattention2,
title={Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization},
author={Zhang, Jintao and Huang, Haofeng and Zhang, Pengle and Wei, Jia and Zhu, Jun and Chen, Jianfei},
booktitle={International Conference on Machine Learning (ICML)},
year={2025}
}