milvus-io/milvus

[FR] streamingcoord: support load-aware weighting in vchannelFair balancer (byte-rate or memory-pressure term)

Open

#49,568 opened on May 7, 2026

Labels: help wanted, kind/feature, triage/accepted

Description

Summary

The `vchannelFair` balancer (the only registered `streaming.walBalancer.balancePolicy` in 2.6.10–2.6.15) computes its cost function from pchannel/vchannel count deviations only:

```
cost = PChannelWeight     * (pDiff)^2
     + VChannelWeight     * (vDiff)^2
     + AntiAffinityWeight * (1 - affinity)
```

(internal/streamingcoord/server/balancer/policy/vchannelfair/expected_layout.go @ v2.6.15)

There is no byte-rate, throughput, or memory-pressure term. When per-pchannel write rates are highly skewed — e.g. one collection has bursty writes while peers are quiet — count-balance leaves the hot pchannel pinned to a single streamingnode while peers sit idle. The policy is doing exactly what it was designed for; this request is to extend it.
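To make the blind spot concrete, here is a minimal Go sketch of a count-only cost. It is not the code in `expected_layout.go`: the `Node` and `Weights` types, the per-node summation, and the weight values are assumptions for illustration, and the anti-affinity term is left out for brevity. It shows that two layouts with identical channel counts get the same cost no matter how skewed their byte rates are.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of one streamingnode's assignment.
type Node struct {
	PChannels int     // number of pchannels assigned
	VChannels int     // number of vchannels assigned
	BytesRate float64 // WAL append rate in bytes/s; never read by countCost
}

// Weights reuses the weight names from the issue; the values below are illustrative.
type Weights struct {
	PChannelWeight float64
	VChannelWeight float64
}

// countCost sums the weighted squared deviations of each node's channel
// counts from the cluster mean. Load is not an input.
func countCost(nodes []Node, w Weights) float64 {
	if len(nodes) == 0 {
		return 0
	}
	var pSum, vSum float64
	for _, n := range nodes {
		pSum += float64(n.PChannels)
		vSum += float64(n.VChannels)
	}
	pMean := pSum / float64(len(nodes))
	vMean := vSum / float64(len(nodes))

	var cost float64
	for _, n := range nodes {
		pDiff := float64(n.PChannels) - pMean
		vDiff := float64(n.VChannels) - vMean
		cost += w.PChannelWeight*pDiff*pDiff + w.VChannelWeight*vDiff*vDiff
	}
	return cost
}

func main() {
	w := Weights{PChannelWeight: 0.4, VChannelWeight: 0.3}
	// Same counts on every node; only the byte rates differ.
	even := []Node{{4, 8, 10e6}, {4, 8, 10e6}, {4, 8, 10e6}, {4, 8, 10e6}}
	skewed := []Node{{4, 8, 37e6}, {4, 8, 1e6}, {4, 8, 1e6}, {4, 8, 1e6}}
	fmt.Println("even cost:  ", countCost(even, w))
	fmt.Println("skewed cost:", countCost(skewed, w))
}
```

Both layouts score the same, so a balancer minimizing this cost has no reason to touch the skewed one.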

Reproduction profile (paraphrased)

A production cluster running Milvus 2.6.15 with embedded Woodpecker, rootCoord.dmlChannelNum=32, eight streamingnode pods at requests=limits=8 GiB, ingesting a 4M-row batch via a delete-then-upsert pattern on a high-write-rate collection.

Observed during a campaign:

| Time | Hot pod mem | Peer pods mem | Action | Result |
|---|---|---|---|---|
| T+0 | pod-A 165% of 2 GiB request | 7/8 pods <10% util | (none yet; diagnosed) | |
| T+17m | pod-A 94% of 4 GiB after request bump | 7/8 pods 150–500 MiB | killed pod-A | pod-B becomes hot |
| T+29m | pod-B 61% of 4 GiB | 7/8 pods <10% util | watching | pattern repeats on next ramp |

Bumping streamingnode count from 5 → 8 mid-campaign produced no redistribution: walBalancer left the hot pchannel where it was because count was already even. Only manual `kubectl delete pod` on the hot pod forced reassignment, and the load just rotated to another single pod.

Workarounds attempted

  1. Increase pod count — does not help. Count-balance has nothing to redistribute.
  2. Increase per-pod memory — buys time, doesn't fix the asymmetry. Pushes OOM further out, not away.
  3. Tighten `walBalancer.triggerInterval` / `minRebalanceIntervalThreshold` / `vchannelFair.rebalanceTolerance` / `rebalanceMaxStep` / `antiAffinityWeight` — shortens the duration of a hot-spot but doesn't prevent re-concentration on the next pod, because the cost function still has nothing to weight against.
  4. `limitWriting.memProtection` — global write-deny when one pod hits ~85% mem. Not a balance fix; it just hard-denies cluster-wide writes when the asymmetry causes one pod to climb. Worse than the OOM. We disable it.
  5. Raising `shards_num` per collection would fix it (spreads writes across more pchannels so count-balance becomes effectively load-balance). Requires collection recreation; high operational cost.
  6. Manual operator intervention (`kubectl delete pod`) — current standing playbook. Forces reassignment but rotates the problem rather than solving it.

Proposed enhancement

Extend the `vchannelFair` cost function with an optional load-weight term:

```
cost += LoadWeight * (loadDiff)^2
```

Where `loadDiff[node]` is the per-streamingnode deviation from the cluster mean of an existing Prometheus signal, e.g. `streamingnode_wal_append_bytes_rate` over a 5–60s rolling window.
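A rough sketch of how `loadDiff` could be computed from per-node rates, in Go. The function names are hypothetical. Dividing the deviation by the cluster mean is my assumption, not part of the proposal: without some normalization, squared byte-rate deviations would be many orders of magnitude larger than the count terms and `LoadWeight` would be hard to tune.

```go
package main

import "fmt"

// loadDiffs returns each node's deviation from the cluster-mean rate,
// expressed as a fraction of that mean (assumed normalization).
// An idle cluster (mean 0) yields all zeros.
func loadDiffs(rates []float64) []float64 {
	diffs := make([]float64, len(rates))
	if len(rates) == 0 {
		return diffs
	}
	var sum float64
	for _, r := range rates {
		sum += r
	}
	mean := sum / float64(len(rates))
	if mean == 0 {
		return diffs
	}
	for i, r := range rates {
		diffs[i] = (r - mean) / mean
	}
	return diffs
}

// loadTerm is the proposed additive term, summed over nodes:
// LoadWeight * sum(loadDiff^2).
func loadTerm(rates []float64, loadWeight float64) float64 {
	var sum float64
	for _, d := range loadDiffs(rates) {
		sum += d * d
	}
	return loadWeight * sum
}

func main() {
	// One hot node at 37 MB/s, three quiet peers at 1 MB/s (illustrative values).
	rates := []float64{37e6, 1e6, 1e6, 1e6}
	fmt.Println("loadDiff per node:", loadDiffs(rates))
	fmt.Println("load term:", loadTerm(rates, 1.0))
}
```

With these inputs the hot node sits well above the mean and the peers slightly below it, so the term is dominated by the hot node, which is the signal the count-only cost lacks.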

Add corresponding config keys (defaults preserve current behaviour):

```yaml
streaming:
  walBalancer:
    balancePolicy:
      vchannelFair:
        loadWeight: 0.0          # default 0 = backward-compatible
        loadMetric: bytes_rate   # bytes_rate | memory | (extensible)
        loadWindow: 30s
```

The cost-function structure already accepts weighted squared-diff terms; this is an additive extension rather than a redesign. Operators who don't set `loadWeight` get exactly today's behaviour.

Workflow impact if implemented

  1. Eliminates manual mid-campaign rebalancing. Today an operator watches streamingnode mem skew and does `kubectl delete pod` on the hot one every 15–30 minutes during peak load. With load-weighted balance, the policy would proactively reassign the hot pchannel before mem reaches the protection threshold.
  2. Restores `memProtection` as a viable safety net. Today it's disabled because asymmetric load makes it fire as a global write-deny rather than a per-pod safety bound. Memory-aware balance would keep all pods within the protection threshold under steady load, letting `memProtection` fire only on genuine cluster-wide overload.
  3. Streamingnode horizontal scaling becomes useful again. Today adding pods doesn't help — count-balance has nothing to redistribute. Load-weighted balance lets a freshly-scaled pod absorb hot pchannels.
  4. Reduces operator paging. Hot-pod-mem OOM is currently the dominant on-call signal during heavy ingest; a load-weighted policy prevents the asymmetric climb in the first place.
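To illustrate the first and third points, here is a toy single-move search in Go. Everything in it is an assumption for illustration: the `layout` type, the exhaustive search, the mean-normalized load term, and the weights (0.1 for counts, 1.0 for load). With counts already even, the count-only cost finds no improving move, which matches the behaviour reported above. With a load term, the search finds a move that relieves the hot node. Note that in this toy model the relieving move sheds a cold pchannel from the hot node, since a single hot pchannel cannot be split; whether a move pays off depends on the ratio of the two weights.

```go
package main

import "fmt"

// layout[i] holds the byte rates (bytes/s) of the pchannels on node i.
type layout [][]float64

// cost combines a pchannel-count term with an optional mean-normalized load term.
func cost(l layout, pWeight, loadWeight float64) float64 {
	n := float64(len(l))
	if n == 0 {
		return 0
	}
	rates := make([]float64, len(l))
	var totalCount, totalRate float64
	for i, chans := range l {
		totalCount += float64(len(chans))
		for _, r := range chans {
			rates[i] += r
		}
		totalRate += rates[i]
	}
	meanCount, meanRate := totalCount/n, totalRate/n
	var c float64
	for i, chans := range l {
		pDiff := float64(len(chans)) - meanCount
		c += pWeight * pDiff * pDiff
		if loadWeight != 0 && meanRate > 0 {
			loadDiff := (rates[i] - meanRate) / meanRate
			c += loadWeight * loadDiff * loadDiff
		}
	}
	return c
}

// move returns a copy of l with pchannel idx moved from node `from` to node `to`.
func move(l layout, from, idx, to int) layout {
	out := make(layout, len(l))
	for i := range l {
		out[i] = append([]float64(nil), l[i]...)
	}
	moved := out[from][idx]
	out[from] = append(out[from][:idx], out[from][idx+1:]...)
	out[to] = append(out[to], moved)
	return out
}

// bestMove tries every single-pchannel move and reports the cheapest one,
// or ok == false when no move lowers the cost.
func bestMove(l layout, pWeight, loadWeight float64) (from, idx, to int, ok bool) {
	best := cost(l, pWeight, loadWeight)
	for f := range l {
		for i := range l[f] {
			for t := range l {
				if t == f {
					continue
				}
				if c := cost(move(l, f, i, t), pWeight, loadWeight); c < best {
					best, from, idx, to, ok = c, f, i, t, true
				}
			}
		}
	}
	return
}

func main() {
	// Four nodes, two pchannels each; node 0 carries the single hot pchannel.
	l := layout{{30e6, 1e6}, {1e6, 1e6}, {1e6, 1e6}, {1e6, 1e6}}
	_, _, _, ok := bestMove(l, 0.1, 0)
	fmt.Println("count-only finds a move:", ok)
	from, idx, to, ok := bestMove(l, 0.1, 1.0)
	fmt.Println("load-aware finds a move:", ok, "from node", from, "pchannel", idx, "to node", to)
}
```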

Backward compatibility

`loadWeight: 0.0` default preserves current behaviour exactly. Existing `vchannelFair` deployments that don't opt in see no change. The signal source (`streamingnode_wal_append_bytes_rate`) is already exported — no new metrics infrastructure needed.

Related issues

  • #40638 — vchannels unevenly distributed (closed for 2.6.0; introduced `vchannelFair` but didn't add load awareness)
  • #46026 — streamingnode memory leak under upsert workloads (compounds the asymmetry)
  • #48564 — `sessionDiscoverer.initDiscover` retains stale streamingnode sessions, blocking the balancer (separate but adjacent)
  • #47716 — streamingnode "freeze" / drain admin path inconsistent across components, deferred to 3.0 (so manual pchannel pin is not a viable interim lever)
