Resource scheduling and cluster management for AI
Updated Jun 6, 2024 - JavaScript
Best practices & guides on how to write distributed pytorch training code
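Guides like this typically launch multi-node PyTorch training with torchrun; a minimal sketch of such a launch, where the node count, process count, and rendezvous endpoint (node0.example.com:29500) are placeholder assumptions rather than values from any listed guide:

```shell
# Illustrative torchrun launch: one process per GPU on each of 2 nodes.
# node0.example.com:29500 is a placeholder rendezvous endpoint.
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=node0.example.com:29500 \
  train.py
```

The same command is run on every node; torchrun uses the c10d rendezvous to assign ranks and set up the process group.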
GPU management system on Kubernetes for AI, deep-learning, and machine-learning researchers.
qihoo360 xlearning with GPU support; AI on Hadoop
Mixed-vendor GPU inference cluster manager with speculative decoding
Example code for deploying GPU workloads on ECS
Project "Springfield"
Detailed analysis traces for AI jobs leveraging spot GPU resources
Detailed analysis traces for GPU-disaggregated deep learning recommendation models
AI Inference Gateway - orchestrates Ollama, vLLM, cloud providers, and vision services into a unified, production-ready platform
Web UI for orchestrating distributed llama.cpp RPC GPU clusters with auto node discovery, telemetry, and one-click deployment.
Docker Images for the GPU Cluster
Patent-pending K8s DaemonSet that validates GPU nodes via GEMM latency, fail-slow CV detection, and NVLink ring profiling before Slurm resumes them
MCP bridge for OpenClaw: Connect remote AI agents (OpenCode, Cursor, Windsurf) to your local MacBook browser and shell. Features secure Ed25519 signing and token-efficient unified control.
A simple tool to expose only specified number of GPUs with desired memory to Tensorflow
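Tools like this usually work by setting CUDA_VISIBLE_DEVICES before the framework initializes, optionally adding a per-GPU memory cap through TensorFlow's logical device configuration. A minimal sketch under that assumption — the helper name `expose_gpus` and its defaults are illustrative, not this project's actual API:

```python
import os

def expose_gpus(gpu_ids, memory_limit_mb=None):
    """Restrict which GPUs TensorFlow can see, with an optional memory cap.

    Must be called before TensorFlow (or any CUDA library) initializes,
    because CUDA_VISIBLE_DEVICES is read once at process startup.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    if memory_limit_mb is not None:
        try:
            import tensorflow as tf  # optional: apply a per-GPU memory cap
            for gpu in tf.config.list_physical_devices("GPU"):
                tf.config.set_logical_device_configuration(
                    gpu,
                    [tf.config.LogicalDeviceConfiguration(
                        memory_limit=memory_limit_mb)],
                )
        except ImportError:
            # TensorFlow not installed; the env var alone still limits visibility.
            pass
    return os.environ["CUDA_VISIBLE_DEVICES"]

# Example: expose only GPUs 0 and 2, capped at 4 GiB each.
print(expose_gpus([0, 2], memory_limit_mb=4096))  # prints "0,2"
```

Setting the environment variable after the framework has touched CUDA has no effect, which is why such tools are typically invoked as wrappers or at the very top of a script.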
Complete setup guide for a 2-node NVIDIA DGX Spark cluster — distributed training, CUDA inference with EXO, NCCL tuning for Grace Blackwell, NVMe-TCP shared storage, and 200 Gb/s direct fabric networking.
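NCCL tuning of the kind this guide covers is done through environment variables; the values below are a generic sketch for an RDMA fabric, with interface and adapter names (eth0, mlx5_0) as placeholders, not the settings from this guide:

```shell
# Illustrative NCCL environment tuning; adjust interface/HCA names to your fabric.
export NCCL_DEBUG=INFO              # log topology and algorithm choices at startup
export NCCL_SOCKET_IFNAME=eth0      # NIC for bootstrap/out-of-band traffic
export NCCL_IB_HCA=mlx5_0           # InfiniBand/RoCE adapter used for collectives
export NCCL_P2P_LEVEL=NVL           # prefer NVLink for intra-node peer-to-peer
```

NCCL_DEBUG=INFO is usually the first step: its startup log shows which transports and rings NCCL actually selected, which guides the remaining settings.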
Single-handedly modernized and rebuilt a 28-GPU cluster, left idle after end of vendor support, into a K8s + YOLOv8 MLOps environment