Resource scheduling and cluster management for AI
Updated Jun 6, 2024 - JavaScript
Best practices & guides on how to write distributed pytorch training code
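Guides like this typically launch multi-node PyTorch training with torchrun; a minimal sketch of such a launch, where the node count, process count, and rendezvous endpoint (node0.example.com:29500) are placeholder assumptions rather than values from any listed guide:

```shell
# Illustrative torchrun launch: one process per GPU on each of 2 nodes.
# node0.example.com:29500 is a placeholder rendezvous endpoint.
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=node0.example.com:29500 \
  train.py
```

The same command is run on every node; torchrun uses the c10d rendezvous to assign ranks and set up the process group.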
GPU management system on Kubernetes for AI, deep-learning, and machine-learning researchers.
qihoo360 xlearning with GPU support; AI on Hadoop
Mixed-vendor GPU inference cluster manager with speculative decoding
Example code for deploying GPU workloads on ECS
Project "Springfield"
Detailed analysis traces for AI jobs leveraging spot GPU resources
Detailed analysis traces for GPU-disaggregated deep learning recommendation models
AI Inference Gateway - orchestrates Ollama, vLLM, cloud providers, and vision services into a unified, production-ready platform
Web UI for orchestrating distributed llama.cpp RPC GPU clusters with auto node discovery, telemetry, and one-click deployment.
Docker Images for the GPU Cluster
Patent-pending K8s DaemonSet that validates GPU nodes via GEMM latency, fail-slow CV detection, and NVLink ring profiling before Slurm resumes them
MCP bridge for OpenClaw: Connect remote AI agents (OpenCode, Cursor, Windsurf) to your local MacBook browser and shell. Features secure Ed25519 signing and token-efficient unified control.
A simple tool to expose only specified number of GPUs with desired memory to Tensorflow
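Tools like this usually work by setting CUDA_VISIBLE_DEVICES before the framework initializes, optionally adding a per-GPU memory cap through TensorFlow's logical device configuration. A minimal sketch under that assumption — the helper name `expose_gpus` and its defaults are illustrative, not this project's actual API:

```python
import os

def expose_gpus(gpu_ids, memory_limit_mb=None):
    """Restrict which GPUs TensorFlow can see, with an optional memory cap.

    Must be called before TensorFlow (or any CUDA library) initializes,
    because CUDA_VISIBLE_DEVICES is read once at process startup.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    if memory_limit_mb is not None:
        try:
            import tensorflow as tf  # optional: apply a per-GPU memory cap
            for gpu in tf.config.list_physical_devices("GPU"):
                tf.config.set_logical_device_configuration(
                    gpu,
                    [tf.config.LogicalDeviceConfiguration(
                        memory_limit=memory_limit_mb)],
                )
        except ImportError:
            # TensorFlow not installed; the env var alone still limits visibility.
            pass
    return os.environ["CUDA_VISIBLE_DEVICES"]

# Example: expose only GPUs 0 and 2, capped at 4 GiB each.
print(expose_gpus([0, 2], memory_limit_mb=4096))  # prints "0,2"
```

Setting the environment variable after the framework has touched CUDA has no effect, which is why such tools are typically invoked as wrappers or at the very top of a script.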
Complete setup guide for a 2-node NVIDIA DGX Spark cluster — distributed training, CUDA inference with EXO, NCCL tuning for Grace Blackwell, NVMe-TCP shared storage, and 200 Gb/s direct fabric networking.
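NCCL tuning of the kind this guide covers is done through environment variables; the values below are a generic sketch for an RDMA fabric, with interface and adapter names (eth0, mlx5_0) as placeholders, not the settings from this guide:

```shell
# Illustrative NCCL environment tuning; adjust interface/HCA names to your fabric.
export NCCL_DEBUG=INFO              # log topology and algorithm choices at startup
export NCCL_SOCKET_IFNAME=eth0      # NIC for bootstrap/out-of-band traffic
export NCCL_IB_HCA=mlx5_0           # InfiniBand/RoCE adapter used for collectives
export NCCL_P2P_LEVEL=NVL           # prefer NVLink for intra-node peer-to-peer
```

NCCL_DEBUG=INFO is usually the first step: its startup log shows which transports and rings NCCL actually selected, which guides the remaining settings.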
Single-handedly modernized and rebuilt a 28-GPU cluster, left idle after end of vendor support, into a K8s + YOLOv8 MLOps environment