jp6/cu128/: spas-sage-attn-0.1.0 metadata and description
Accurate and efficient Sparse SageAttention.
| Field | Value |
|---|---|
| author | Jintao Zhang, Chendong Xiang, Haofeng Huang |
| author_email | jt-zhang6@gmail.com |
| classifiers | |
| description_content_type | text/markdown |
| dynamic | |
| license | BSD 3-Clause License |
| license_file | |
| metadata_version | 2.4 |
| requires_python | >=3.9 |
Because this project isn't in the mirror_whitelist, no releases from root/pypi are included.
| File | Tox results | History |
|---|---|---|
| spas_sage_attn-0.1.0-cp312-cp312-linux_aarch64.whl | | |
Sparge Attention
The official implementation of SpargeAttn, a universal sparse attention mechanism that accelerates language, image, and video models.
Project Updates
- [2025-05-02]: 🎉SpargeAttn and SageAttention2 are accepted by ICML 2025!
- [2025-01-24]: 🎉SageAttention is accepted by ICLR 2025!
Installation
Base environment
python >= 3.9, torch >= 2.3.0
CUDA:
- >= 12.8 for Blackwell
- >= 12.4 for fp8 support on Ada
- >= 12.3 for fp8 support on Hopper
- >= 12.0 for Ampere
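As a quick sanity check before installing, a small script like the one below (a sketch, not part of the package) prints the relevant versions and the GPU compute capability so they can be compared against the requirements above:

```python
import sys
import torch

print("Python:", sys.version.split()[0])      # needs >= 3.9
print("PyTorch:", torch.__version__)          # needs >= 2.3.0
print("CUDA (torch build):", torch.version.cuda)
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    # 8.0/8.6: Ampere, 8.9: Ada, 9.0: Hopper, 10.x/12.x: Blackwell
    print(f"Compute capability: {major}.{minor}")
```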
Install Package
pip install ninja # for parallel compilation
python setup.py install # or pip install -e .
Available API
- `spas_sage2_attn_meansim_cuda`: SpargeAttn based on SageAttention2.
- `spas_sage_attn_meansim_cuda`: SpargeAttn based on SageAttention.
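A minimal usage sketch follows. It assumes the functions are importable from the top-level `spas_sage_attn` package, take `q`, `k`, `v` in a (batch, heads, seq_len, head_dim) layout, and accept an `is_causal` flag; check the installed function signatures for the exact arguments.

```python
import torch
from spas_sage_attn import spas_sage2_attn_meansim_cuda  # assumed import path

# Random half-precision tensors in an assumed (batch, heads, seq_len, head_dim) layout.
b, h, s, d = 2, 16, 4096, 64
q = torch.randn(b, h, s, d, dtype=torch.float16, device="cuda")
k = torch.randn(b, h, s, d, dtype=torch.float16, device="cuda")
v = torch.randn(b, h, s, d, dtype=torch.float16, device="cuda")

# Used here as a drop-in sparse attention kernel; `is_causal` is an assumed keyword.
out = spas_sage2_attn_meansim_cuda(q, k, v, is_causal=False)
print(out.shape)  # expected: (b, h, s, d)
```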
Usage Examples
CogVideoX
Tuning:
# sequential tuning
python evaluate/cogvideo_example.py --use_spas_sage_attn --model_out_path evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt --tune
# parallel tuning; this will use all GPUs available on the machine
python evaluate/cogvideo_example.py --use_spas_sage_attn --model_out_path evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt --tune --parallel_tune
Inference:
# `--compile` is optional and slows down only the first inference run (compilation overhead).
python evaluate/cogvideo_example.py --use_spas_sage_attn --model_out_path evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt --compile
Note: We provide pre-tuned hyper-parameters in `CogVideoX-2b_0.06_0.07.pt` that allow running the inference script directly. However, for better speed and quality, we recommend re-tuning, because the provided hyper-parameters were tuned with SpargeAttn based on SageAttention, whereas the default API is now based on SageAttention2.
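For a quick look at what a pre-tuned hyper-parameter file contains before deciding to re-tune, a rough sketch is shown below; it only assumes the `.pt` file is loadable with `torch.load`, since its internal structure is not documented here.

```python
import torch

# Load the provided hyper-parameter checkpoint on CPU and peek at its contents.
# On newer torch versions, weights_only=False may be needed if the file stores
# arbitrary Python objects rather than plain tensors.
saved = torch.load("evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt", map_location="cpu")
print(type(saved))
if isinstance(saved, dict):
    for key in list(saved)[:5]:  # show only the first few entries
        print(key, "->", type(saved[key]))
```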
Note: `--compile` is optional; it further accelerates video generation but adds overhead to the first generation.
Llama
The tuning and inference usage is similar to that of CogVideoX.
Supported models
Here’s a list of the tuned models so far; see Hugging Face for all tuned checkpoints. Our approach is universal, and we warmly welcome contributions! Feel free to submit a pull request to support more models. 🚀
| model name | example script | tuned ckpt |
|---|---|---|
| CogVideoX-2b | evaluate/cogvideo_example.py | link |
| want2v-1.3B | evaluate/wan_example.py | link |
| Flux | evaluate/flux_example.py | TBD |
Performance

Note: All experiments in the above table and in our paper used SpargeAttn based on SageAttention. An updated implementation based on SageAttention2 is now available and offers a further 30% speedup.
[Figure] The quality of video generation on Mochi.
[Figure] End-to-end performance of NIAH.
Citation
If you use this code or find our work valuable, please cite:
@inproceedings{zhang2025spargeattn,
title={Spargeattn: Accurate sparse attention accelerating any model inference},
author={Zhang, Jintao and Xiang, Chendong and Huang, Haofeng and Wei, Jia and Xi, Haocheng and Zhu, Jun and Chen, Jianfei},
booktitle={International Conference on Machine Learning (ICML)},
year={2025}
}
@inproceedings{zhang2025sageattention,
title={SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration},
author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Zhu, Jun and Chen, Jianfei},
booktitle={International Conference on Learning Representations (ICLR)},
year={2025}
}
@inproceedings{zhang2024sageattention2,
title={Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization},
author={Zhang, Jintao and Huang, Haofeng and Zhang, Pengle and Wei, Jia and Zhu, Jun and Chen, Jianfei},
booktitle={International Conference on Machine Learning (ICML)},
year={2025}
}