Show HN: InferX – An AI-native OS that runs 50 LLMs per GPU with hot-swapping
Hey folks, we've been building InferX, an AI-native runtime that snapshots the full GPU execution state of an LLM (weights, KV cache, CUDA context) and restores it in under 2 seconds. This lets us hot-swap models like threads: no reloading, no cold starts.
We treat each model as a lightweight, resumable process, like an OS for LLM inference.
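To make the "model as a resumable process" idea concrete, here is a minimal Python sketch of what an OS-like scheduler over snapshottable models could look like. The class and method names (`ModelProcess`, `GpuScheduler`, `snapshot`, `restore`) are illustrative assumptions, not InferX's actual API, and the toy obviously does not touch real GPU state.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Opaque handle to captured GPU execution state (weights, KV cache, CUDA context)."""
    model_id: str
    captured_at: float = field(default_factory=time.time)


class ModelProcess:
    """Treats one loaded LLM like an OS process: it can be paused (snapshotted) and resumed."""

    def __init__(self, model_id: str):
        self.model_id = model_id
        self.resident = False  # whether the model currently occupies GPU memory

    def snapshot(self) -> Snapshot:
        # A real runtime would serialize weights, KV cache, and CUDA context here.
        self.resident = False
        return Snapshot(self.model_id)

    def restore(self, snap: Snapshot) -> None:
        # A real runtime would map the saved state back onto the GPU here.
        assert snap.model_id == self.model_id
        self.resident = True


class GpuScheduler:
    """Keeps at most `max_resident` models on the GPU and parks the rest as snapshots."""

    def __init__(self, max_resident: int = 2):
        self.max_resident = max_resident
        self.resident: list[ModelProcess] = []
        self.parked: dict[str, tuple[ModelProcess, Snapshot]] = {}

    def run(self, proc: ModelProcess) -> ModelProcess:
        """Make `proc` resident on the GPU, restoring its snapshot if one exists."""
        if proc in self.resident:
            return proc
        if proc.model_id in self.parked:
            proc, snap = self.parked.pop(proc.model_id)
            proc.restore(snap)             # resume from snapshot instead of reloading
        if len(self.resident) >= self.max_resident:
            victim = self.resident.pop(0)  # evict the oldest resident model
            self.parked[victim.model_id] = (victim, victim.snapshot())
        proc.resident = True
        self.resident.append(proc)
        return proc
```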
Why it matters:
- Run 50+ LLMs per GPU (7B–13B range)
- 90% GPU utilization (vs. ~30–40% with conventional setups)
- Avoids cold starts by snapshotting and restoring directly on the GPU
- Designed for agentic workflows, toolchains, and multi-tenant use cases
- Helpful for Codex CLI-style orchestration or bursty multi-model apps (a usage sketch follows this list)
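Here is a hedged sketch of that bursty, multi-model pattern: an agent-style loop that alternates between specialist models while the scheduler swaps snapshots in and out instead of reloading weights. It reuses the hypothetical `GpuScheduler` and `ModelProcess` classes from the sketch above; the model names are made up and nothing here reflects a real InferX interface.

```python
# Hypothetical usage; GpuScheduler / ModelProcess come from the sketch above.
scheduler = GpuScheduler(max_resident=2)
models = {name: ModelProcess(name) for name in ("planner-13b", "coder-7b", "critic-7b")}

# An agent loop that bounces between specialist models; each switch is a
# snapshot/restore on the GPU rather than a full reload from disk.
for step in ("plan", "code", "review", "code"):
    target = {"plan": "planner-13b", "code": "coder-7b", "review": "critic-7b"}[step]
    scheduler.run(models[target])
    print(step, "->", [p.model_id for p in scheduler.resident])
```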
Still early, but we're already seeing strong interest from builders and infra folks. We'd love your thoughts, feedback, or edge cases you'd want to see tested.
Demo: [https://inferx.net](https://inferx.net)
X: @InferXai