Show HN: InferX – An AI-native OS that runs 50 LLMs per GPU with hot-swapping
Hey folks, we've been building InferX, an AI-native runtime that snapshots the full GPU execution state of an LLM (weights, KV cache, CUDA context) and restores it in under 2 seconds. This lets us hot-swap models like threads: no reloading, no cold starts.
We treat each model as a lightweight, resumable process, like an OS for LLM inference.
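To make the "model as a resumable process" idea concrete, here is a minimal Python sketch of what an OS-like scheduler over snapshottable models could look like. The class and method names (`ModelProcess`, `GpuScheduler`, `snapshot`, `restore`) are illustrative assumptions, not InferX's actual API, and the toy obviously does not touch real GPU state.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Opaque handle to captured GPU execution state (weights, KV cache, CUDA context)."""
    model_id: str
    captured_at: float = field(default_factory=time.time)


class ModelProcess:
    """Treats one loaded LLM like an OS process: it can be paused (snapshotted) and resumed."""

    def __init__(self, model_id: str):
        self.model_id = model_id
        self.resident = False  # whether the model currently occupies GPU memory

    def snapshot(self) -> Snapshot:
        # A real runtime would serialize weights, KV cache, and CUDA context here.
        self.resident = False
        return Snapshot(self.model_id)

    def restore(self, snap: Snapshot) -> None:
        # A real runtime would map the saved state back onto the GPU here.
        assert snap.model_id == self.model_id
        self.resident = True


class GpuScheduler:
    """Keeps at most `max_resident` models on the GPU and parks the rest as snapshots."""

    def __init__(self, max_resident: int = 2):
        self.max_resident = max_resident
        self.resident: list[ModelProcess] = []
        self.parked: dict[str, tuple[ModelProcess, Snapshot]] = {}

    def run(self, proc: ModelProcess) -> ModelProcess:
        """Make `proc` resident on the GPU, restoring its snapshot if one exists."""
        if proc in self.resident:
            return proc
        if proc.model_id in self.parked:
            proc, snap = self.parked.pop(proc.model_id)
            proc.restore(snap)             # resume from snapshot instead of reloading
        if len(self.resident) >= self.max_resident:
            victim = self.resident.pop(0)  # evict the oldest resident model
            self.parked[victim.model_id] = (victim, victim.snapshot())
        proc.resident = True
        self.resident.append(proc)
        return proc
```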
Why it matters:
- Run 50+ LLMs per GPU (7B–13B range)
- 90% GPU utilization (vs. ~30–40% with conventional setups)
- Avoids cold starts by snapshotting and restoring directly on the GPU
- Designed for agentic workflows, toolchains, and multi-tenant use cases
- Helpful for Codex CLI-style orchestration or bursty multi-model apps (a usage sketch follows this list)
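Here is a hedged sketch of that bursty, multi-model pattern: an agent-style loop that alternates between specialist models while the scheduler swaps snapshots in and out instead of reloading weights. It reuses the hypothetical `GpuScheduler` and `ModelProcess` classes from the sketch above; the model names are made up and nothing here reflects a real InferX interface.

```python
# Hypothetical usage; GpuScheduler / ModelProcess come from the sketch above.
scheduler = GpuScheduler(max_resident=2)
models = {name: ModelProcess(name) for name in ("planner-13b", "coder-7b", "critic-7b")}

# An agent loop that bounces between specialist models; each switch is a
# snapshot/restore on the GPU rather than a full reload from disk.
for step in ("plan", "code", "review", "code"):
    target = {"plan": "planner-13b", "code": "coder-7b", "review": "critic-7b"}[step]
    scheduler.run(models[target])
    print(step, "->", [p.model_id for p in scheduler.resident])
```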
Still early, but we're already seeing strong interest from builders and infra folks. We'd love your thoughts, feedback, or edge cases you'd want to see tested.
Demo: [https://inferx.net](https://inferx.net)
X: @InferXai