kvcache-ai/Mooncake

[TE] Refactor: unify libfabric-based transports (EFA / CXI) shared infrastructure

Open

#2,657 opened on Jun 29, 2026

good first issue

Description

Background

With #2535 (CXI backend) merged, we now have two libfabric-based transports (EFA and CXI) that share significant structural overlap but are implemented as independent forks. This was the pragmatic choice for initial integration (see discussion in #2535), but it leaves maintenance debt.

Scope

  1. Topology — CXI currently reuses InfiniBand device discovery paths in topology.cpp. CXI is not IB; it should have its own topology functions or a shared libfabric_topology abstraction (ref: @alogfans' comment).

  2. Shared base class or utilities — Extract common libfabric patterns (endpoint management, CQ polling, MR registration, error handling) into a shared layer that both EFA and CXI can consume, reducing duplicated code.

  3. mr_key_t consolidation — The current #if defined(USE_EFA) || defined(USE_CXI) guards work but will grow unwieldy if more libfabric providers are added. Consider a compile-time or runtime abstraction.
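
A minimal sketch of one compile-time option for item 3: derive a single umbrella macro from the per-provider flags in one place, then select the key width through a traits type. `MC_HAS_LIBFABRIC`, `MrKeyTraits`, and `key_round_trips` are hypothetical names used for illustration, not existing Mooncake identifiers. The widths assume the usual 32-bit verbs rkey and the 64-bit key returned by libfabric's `fi_mr_key`.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical umbrella macro: the only place that lists individual providers.
#if defined(USE_EFA) || defined(USE_CXI)
#define MC_HAS_LIBFABRIC 1
#else
#define MC_HAS_LIBFABRIC 0
#endif

namespace sketch {

// Primary template: non-libfabric builds keep a 32-bit key (verbs-style rkey).
template <bool UsesLibfabric>
struct MrKeyTraits {
    using type = std::uint32_t;
};

// Specialization: libfabric builds use a 64-bit key (fi_mr_key returns uint64_t).
template <>
struct MrKeyTraits<true> {
    using type = std::uint64_t;
};

// Call sites use this alias and never mention USE_EFA / USE_CXI directly.
using mr_key_t = MrKeyTraits<(MC_HAS_LIBFABRIC != 0)>::type;

static_assert(std::is_same<MrKeyTraits<true>::type, std::uint64_t>::value,
              "libfabric keys must be 64-bit");

// True if a raw 64-bit provider key survives a round trip through Key.
template <typename Key>
constexpr bool key_round_trips(std::uint64_t raw) {
    return static_cast<std::uint64_t>(static_cast<Key>(raw)) == raw;
}

}  // namespace sketch
```

With this shape, adding another libfabric provider would only touch the umbrella macro. A runtime abstraction (for example, always carrying 64 bits and narrowing at the verbs boundary) is the other direction mentioned above and is not shown here.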

Non-goals

  • Changing the public API or metadata wire format
  • Merging EFA and CXI into a single transport (they have meaningful behavioral differences in MR handling and transfer semantics)
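
To make the boundary between scope item 2 and the second non-goal concrete, here is a rough sketch of a shared layer that two separate transport classes consume by composition. Everything in it is hypothetical: `LibfabricShared`, the policy structs, `Completion`, and the `std::deque` standing in for a completion queue are placeholders, and no real libfabric calls are made. Only the provider name strings `"efa"` and `"cxi"` correspond to actual libfabric provider names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>

namespace sketch {

// Hypothetical per-provider policies; real ones would carry MR and
// transfer-semantics differences, which are deliberately not modeled here.
struct EfaPolicy { static constexpr const char* kProviderName = "efa"; };
struct CxiPolicy { static constexpr const char* kProviderName = "cxi"; };

// Placeholder completion entry; status == 0 means success.
struct Completion {
    std::uint64_t context;
    int status;
};

// Shared helper parameterized by provider policy. It owns only the
// bookkeeping that is identical across providers.
template <typename Policy>
class LibfabricShared {
public:
    const char* provider_name() const { return Policy::kProviderName; }

    // Drain at most `budget` entries from `cq`; returns the number of failures.
    std::size_t poll(std::deque<Completion>& cq, std::size_t budget) {
        std::size_t failures = 0;
        for (std::size_t i = 0; i < budget && !cq.empty(); ++i) {
            const Completion c = cq.front();
            cq.pop_front();
            if (c.status == 0) {
                ++completed_;
            } else {
                ++failures;
            }
        }
        return failures;
    }

    std::size_t completed() const { return completed_; }

private:
    std::size_t completed_ = 0;
};

// The transports stay distinct types; each composes the shared helper.
class EfaTransportSketch {
public:
    LibfabricShared<EfaPolicy> shared;
};

class CxiTransportSketch {
public:
    LibfabricShared<CxiPolicy> shared;
};

}  // namespace sketch
```

Composition is used here rather than a common base class so that neither transport inherits behavior it does not want. A base class with virtual hooks would be an equally valid reading of item 2.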

References

  • #2535 — CXI backend PR
  • #2564 — 64-bit MR key fix
