[FR] streamingcoord: support load-aware weighting in vchannelFair balancer (byte-rate or memory-pressure term) · milvus-io/milvus#49568

Labels: help wanted, kind/feature, triage/accepted

Description

Summary

The `vchannelFair` balancer (the only registered `streaming.walBalancer.balancePolicy` in 2.6.10–2.6.15) computes its cost function from pchannel/vchannel count deviations only:

cost = PChannelWeight    * (pDiff)^2
     + VChannelWeight    * (vDiff)^2
     + AntiAffinityWeight * (1 - affinity)

(internal/streamingcoord/server/balancer/policy/vchannelfair/expected_layout.go @ v2.6.15)

There is no byte-rate, throughput, or memory-pressure term. When per-pchannel write rates are highly skewed — e.g. one collection has bursty writes while peers are quiet — count-balance leaves the hot pchannel pinned to a single streamingnode while peers sit idle. The policy is doing exactly what it was designed for; this request is to extend it.
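
To illustrate the point, here is a minimal sketch. It is not the code in `expected_layout.go`: the `Node` and `Weights` types, the `CountCost` function, the weight values, and the byte rates are all hypothetical. It assumes `pDiff` and `vDiff` are per-node deviations from the cluster mean count, and it leaves out the anti-affinity term, which does not depend on load either.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of one streamingnode's assignment.
type Node struct {
	PChannels   int     // pchannels assigned to the node
	VChannels   int     // vchannels assigned to the node
	BytesPerSec float64 // observed WAL append rate; never read by CountCost
}

// Weights holds the two count weights named in the cost function above.
type Weights struct {
	PChannelWeight float64
	VChannelWeight float64
}

// CountCost sums the weighted squared deviations of each node's channel
// counts from the cluster mean. Assumption: this mirrors the shape of the
// count terms, not their exact implementation.
func CountCost(nodes []Node, w Weights) float64 {
	if len(nodes) == 0 {
		return 0
	}
	var pSum, vSum float64
	for _, n := range nodes {
		pSum += float64(n.PChannels)
		vSum += float64(n.VChannels)
	}
	pMean := pSum / float64(len(nodes))
	vMean := vSum / float64(len(nodes))

	var cost float64
	for _, n := range nodes {
		pDiff := float64(n.PChannels) - pMean
		vDiff := float64(n.VChannels) - vMean
		cost += w.PChannelWeight*pDiff*pDiff + w.VChannelWeight*vDiff*vDiff
	}
	return cost
}

func main() {
	w := Weights{PChannelWeight: 0.4, VChannelWeight: 0.3} // illustrative values

	// Eight nodes with four pchannels each (32 in total).
	even := make([]Node, 8)
	skewed := make([]Node, 8)
	for i := range even {
		even[i] = Node{PChannels: 4, VChannels: 4, BytesPerSec: 10}
		skewed[i] = Node{PChannels: 4, VChannels: 4, BytesPerSec: 1}
	}
	skewed[0].BytesPerSec = 73 // one hot node carries almost all writes

	fmt.Println(CountCost(even, w), CountCost(skewed, w))
}
```

Both layouts score 0, so under these assumptions a count-only balancer sees no reason to move the hot pchannel.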

Reproduction profile (paraphrased)

A production cluster running Milvus 2.6.15 with embedded Woodpecker, rootCoord.dmlChannelNum=32, eight streamingnode pods at requests=limits=8 GiB, ingesting a 4M-row batch via a delete-then-upsert pattern on a high-write-rate collection.

Observed during a campaign:

| Time | Hot pod memory | Peer pod memory | Action | Result |
| --- | --- | --- | --- | --- |
| T+0 | `pod-A` 165% of 2 GiB request | 7/8 pods <10% util | none yet (diagnosed only) | n/a |
| T+17m | `pod-A` 94% of 4 GiB after request bump | 7/8 pods 150–500 MiB | killed `pod-A` | `pod-B` becomes hot |
| T+29m | `pod-B` 61% of 4 GiB | 7/8 pods <10% util | watching | pattern repeats on next ramp |

Bumping streamingnode count from 5 → 8 mid-campaign produced no redistribution: walBalancer left the hot pchannel where it was because count was already even. Only manual `kubectl delete pod` on the hot pod forced reassignment, and the load just rotated to another single pod.

Workarounds attempted

Increase pod count — does not help. Count-balance has nothing to redistribute.
Increase per-pod memory — buys time, doesn't fix the asymmetry. Pushes OOM further out, not away.
Tighten `walBalancer.triggerInterval` / `minRebalanceIntervalThreshold` / `vchannelFair.rebalanceTolerance` / `rebalanceMaxStep` / `antiAffinityWeight` — shortens the duration of a hot-spot but doesn't prevent re-concentration on the next pod, because the cost function still has nothing to weight against.
`limitWriting.memProtection` — global write-deny when one pod hits ~85% mem. Not a balance fix; it just hard-denies cluster-wide writes when the asymmetry causes one pod to climb. Worse than the OOM. We disable it.
`shards_num` per collection — would fix it (spreads writes across more pchannels so count-balance becomes effectively load-balance). Requires collection recreation; high operational cost.
Manual operator intervention (`kubectl delete pod`) — current standing playbook. Forces reassignment but rotates the problem rather than solving it.

Proposed enhancement

Extend the `vchannelFair` cost function with an optional load-weight term:

cost += LoadWeight * (loadDiff)^2

Where `loadDiff[node]` is the per-streamingnode deviation from the cluster mean of an existing Prometheus signal, e.g. `streamingnode_wal_append_bytes_rate` over a 5–60s rolling window.
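
A rough sketch of how the term could be computed, under stated assumptions: `LoadDiffs` and `LoadCost` are hypothetical names, the input is one already-windowed rate per streamingnode, and the deviation is divided by the cluster mean so the term is unitless and comparable to the count terms. That normalisation is a suggestion of this sketch, not something the existing code does.

```go
package main

import "fmt"

// LoadDiffs returns each node's deviation from the cluster mean rate,
// divided by that mean (assumption: relative deviation keeps the term
// on a scale similar to the count terms). An idle cluster yields zeros.
func LoadDiffs(rates []float64) []float64 {
	diffs := make([]float64, len(rates))
	if len(rates) == 0 {
		return diffs
	}
	var sum float64
	for _, r := range rates {
		sum += r
	}
	mean := sum / float64(len(rates))
	if mean == 0 {
		return diffs // no load anywhere, nothing to balance
	}
	for i, r := range rates {
		diffs[i] = (r - mean) / mean
	}
	return diffs
}

// LoadCost is the proposed additive term: LoadWeight * sum(loadDiff^2).
// With loadWeight == 0 it contributes nothing, matching today's behaviour.
func LoadCost(rates []float64, loadWeight float64) float64 {
	if loadWeight == 0 {
		return 0
	}
	var cost float64
	for _, d := range LoadDiffs(rates) {
		cost += d * d
	}
	return loadWeight * cost
}

func main() {
	// Illustrative rates in MiB/s: one hot node versus an even spread.
	hot := []float64{64, 0, 0, 0, 0, 0, 0, 0}
	even := []float64{8, 8, 8, 8, 8, 8, 8, 8}

	fmt.Println(LoadCost(hot, 0.5), LoadCost(even, 0.5), LoadCost(hot, 0))
}
```

This prints `28 0 0`: the concentrated layout is penalised, the even spread is not, and a zero weight switches the term off.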

Add corresponding config keys (defaults preserve current behaviour):

streaming:
  walBalancer:
    balancePolicy:
      vchannelFair:
        loadWeight: 0.0          # default 0 = backward-compatible
        loadMetric: bytes_rate   # bytes_rate | memory | (extensible)
        loadWindow: 30s

The cost-function structure already accepts weighted squared-diff terms; this is an additive extension rather than a redesign. Operators who don't set `loadWeight` get exactly today's behaviour.
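
On the config side, a possible shape for the three keys is sketched below. `LoadConfig`, `DefaultLoadConfig`, `Enabled`, and `Validate` are hypothetical names that do not exist in Milvus, and the 5s to 60s bound simply reuses the window range suggested above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// LoadConfig is a hypothetical holder for the three proposed keys.
type LoadConfig struct {
	LoadWeight float64       // 0 disables the load term
	LoadMetric string        // "bytes_rate" or "memory"
	LoadWindow time.Duration // rolling window for the signal
}

// DefaultLoadConfig returns the proposed defaults, which leave the
// load term switched off.
func DefaultLoadConfig() LoadConfig {
	return LoadConfig{
		LoadWeight: 0,
		LoadMetric: "bytes_rate",
		LoadWindow: 30 * time.Second,
	}
}

// Enabled reports whether the load term would affect the cost.
func (c LoadConfig) Enabled() bool { return c.LoadWeight > 0 }

// Validate rejects values outside the ranges described in this issue.
func (c LoadConfig) Validate() error {
	if c.LoadWeight < 0 {
		return errors.New("loadWeight must be >= 0")
	}
	switch c.LoadMetric {
	case "bytes_rate", "memory":
	default:
		return fmt.Errorf("unknown loadMetric %q", c.LoadMetric)
	}
	if c.LoadWindow < 5*time.Second || c.LoadWindow > 60*time.Second {
		return fmt.Errorf("loadWindow %s outside 5s..60s", c.LoadWindow)
	}
	return nil
}

func main() {
	cfg := DefaultLoadConfig()
	fmt.Println(cfg.Enabled(), cfg.Validate()) // defaults: disabled and valid

	cfg.LoadWeight = 0.5
	cfg.LoadWindow = 2 * time.Second // too short, rejected by Validate
	fmt.Println(cfg.Enabled(), cfg.Validate())
}
```

Keeping the "off" state as the zero value of `loadWeight` means an existing `milvus.yaml` without these keys would need no changes.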

Workflow impact if implemented

Eliminates manual mid-campaign rebalancing. Today an operator watches streamingnode mem skew and does `kubectl delete pod` on the hot one every 15–30 minutes during peak load. With load-weighted balance, the policy would proactively reassign the hot pchannel before mem reaches the protection threshold.
Restores `memProtection` as a viable safety net. Today it's disabled because asymmetric load makes it fire as a global write-deny rather than a per-pod safety bound. Memory-aware balance would keep all pods within the protection threshold under steady load, letting `memProtection` fire only on genuine cluster-wide overload.
Streamingnode horizontal scaling becomes useful again. Today adding pods doesn't help — count-balance has nothing to redistribute. Load-weighted balance lets a freshly-scaled pod absorb hot pchannels.
Reduces operator paging. Hot-pod-mem OOM is currently the dominant on-call signal during heavy ingest; a load-weighted policy prevents the asymmetric climb in the first place.

Backward compatibility

`loadWeight: 0.0` default preserves current behaviour exactly. Existing `vchannelFair` deployments that don't opt in see no change. The signal source (`streamingnode_wal_append_bytes_rate`) is already exported — no new metrics infrastructure needed.

Related issues

#40638 — vchannels unevenly distributed (closed for 2.6.0; introduced `vchannelFair` but didn't add load awareness)
#46026 — streamingnode memory leak under upsert workloads (compounds the asymmetry)
#48564 — `sessionDiscoverer.initDiscover` retains stale streamingnode sessions, blocking the balancer (separate but adjacent)
#47716 — streamingnode "freeze" / drain admin path inconsistent across components, deferred to 3.0 (so manual pchannel pin is not a viable interim lever)

Contributor guide

Research direction: Examine the `vchannelFair` balancer cost function in `internal/streamingcoord/server/balancer/policy/vchannelfair/expected_layout.go`. Understand the existing weighted terms (`pDiff`, `vDiff`, anti-affinity). Design an optional load-weight term using a Prometheus metric like `streamingnode_wal_append_bytes_rate`. Implement config keys `loadWeight`, `loadMetric`, `loadWindow`. Ensure the default `loadWeight=0` preserves backward compatibility. Write tests for load-aware balancing behavior.
Tech stack: Go
Domain: backend
Issue type: Feature
Difficulty: 3
Estimated time: 3-5 days
Activity status: Active
Clarity: Clear
Prerequisites: Go, Milvus streaming coordination
Newbie friendliness: 45