(Experimental) A high-throughput and memory-efficient inference and serving engine for LLMs on DGX Spark / GB10
Topics: nvidia, cuda-kernels, cutlass, local-inference, vllm, llm-inference, qwen, paged-attention, self-hosted-ai, gb10, sm120, nvfp4, dgx-spark, fp4-quantization, attention-kernel, fp8-kv-cache
Updated May 4, 2026 · Python