Show HN: InferX – an AI-native OS that runs 50+ LLMs per GPU with hot-swapping

Posted by pveldandi · 2 months ago
Hey folks,

We've been building InferX, an AI-native runtime that snapshots the full GPU execution state of LLMs (weights, KV cache, CUDA context) and restores it in under 2s. This lets us hot-swap models like threads: no reloading, no cold starts.

We treat each model as a lightweight, resumable process, like an OS for LLM inference.

Why it matters:

• Run 50+ LLMs per GPU (7B–13B range)
• 90% GPU utilization (vs ~30–40% with conventional setups)
• Avoids cold starts by snapshotting and restoring directly on GPU
• Designed for agentic workflows, toolchains, and multi-tenant use cases
• Helpful for Codex CLI-style orchestration or bursty multi-model apps

Still early, but we're seeing strong interest from builders and infra folks. Would love thoughts, feedback, or edge cases you'd want to see tested.

Demo: https://inferx.net
X: @InferXai
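To make the hot-swap idea more concrete, here is a minimal Python sketch of how a snapshot-and-restore runtime might be driven from the caller's side. The names used here (`HotSwapRuntime`, `Snapshot`, `restore_gpu_state`) are hypothetical illustrations for the concept described in the post, not InferX's actual API.

```python
# Minimal sketch of snapshot-based hot-swapping, assuming a hypothetical
# runtime API. Not InferX's real interface; names are illustrative only.

from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Snapshot:
    """Opaque handle to a model's captured GPU state (weights, KV cache, CUDA context)."""
    model_id: str
    blob: bytes = b""  # stand-in; a real system would reference GPU/pinned-host memory


class HotSwapRuntime:
    """Keeps many models snapshotted and swaps one onto the GPU on demand."""

    def __init__(self) -> None:
        self.snapshots: Dict[str, Snapshot] = {}
        self.active: Optional[str] = None

    def register(self, model_id: str) -> None:
        # Load the model once, warm it up, then capture its full GPU state.
        self.snapshots[model_id] = Snapshot(model_id=model_id)

    def activate(self, model_id: str) -> None:
        # Restore the requested model from its snapshot instead of reloading
        # weights from disk (the cold-start path).
        if self.active == model_id:
            return
        snap = self.snapshots[model_id]
        # restore_gpu_state(snap)  # placeholder for the actual restore step
        self.active = snap.model_id

    def infer(self, model_id: str, prompt: str) -> str:
        self.activate(model_id)
        return f"[{model_id}] response to: {prompt}"


# Usage: route requests for many models through one GPU without reloads.
rt = HotSwapRuntime()
for m in ("llama-7b", "mistral-7b", "codellama-13b"):
    rt.register(m)

print(rt.infer("llama-7b", "Summarize this diff"))
print(rt.infer("codellama-13b", "Write a unit test"))  # hot-swap, no cold start
```

The point of the sketch is only the control flow: each model is registered once, and every subsequent request pays a snapshot-restore rather than a full reload, which is what the "hot-swap models like threads" framing above refers to.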