flyteorg/flyte

[flyte2] Instrument the actions service (watcher metrics + dropped-updates counter)

Open

#7450 opened on May 29, 2026

Labels: flyte2, good first issue

Description

Part of #7445. Depends on #7446 (the /metrics endpoint + initialized Scope must exist first).

Summary

Instrument the actions service with Prometheus metrics: implement the existing dropped-updates counter TODO, and add throughput / latency / queue-depth metrics for the TaskAction watcher.

Background

The actions service is already partly wired for metrics — it just has nothing to plug into yet:

  • actions/setup.go:39 already passes sc.Scope into NewActionsClient(...).
  • actions/k8s/client.go:91 already uses scope.NewSubScope("actions_filter") for the dedup bloom filter.
  • actions/k8s/client.go:65 has an explicit TODO: // TODO: add a prometheus counter for dropped updates when metrics are wired up.

Note on the metrics scope: When run via the unified manager (manager/cmd/main.go:75), sc.Scope is already initialized (promutils.NewScope("flyte")) before actions.Setup runs, so the bloom-filter sub-scope at client.go:91 works without panicking. This issue depends on #7446 because #7446 mounts the /metrics endpoint: without it, the metrics you add here are registered in the default registry but never exposed to a scrape. #7446 also initializes sc.Scope at the framework level, which makes the standalone actions/cmd/main.go binary safe as well. That path currently leaves sc.Scope nil, so the scope.NewSubScope(...) call at client.go:90-91 would panic there (the call is reached because RecordFilterSize defaults to 1 << 23, which is greater than 0).

What to do

Using the Scope available on ActionsClient (passed in via NewActionsClient), add metrics under a dedicated sub-scope (e.g. scope.NewSubScope("watcher")):

  1. Dropped updates counter — implement the TODO at actions/k8s/client.go:65. Increment a counter whenever a watch update is dropped (e.g. buffer full / channel send would block).
  2. Watcher throughput — counter of TaskAction events processed, labeled by result (success/error).
  3. Processing latency — a timer/histogram around per-event handling in the watch worker loop.
  4. Queue/buffer depth — a gauge for the watch buffer occupancy (config WatchBufferSize), updated as events are enqueued/dequeued (or sampled periodically).
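
As a rough illustration of items 1 and 4, the sketch below shows one way a non-blocking channel send can count dropped updates and track buffer occupancy. It uses only the Go standard library: the `watchBuffer` type, its methods, and the atomic fields are hypothetical stand-ins, not the actual code in actions/k8s/client.go. In the real change the two atomics would presumably be replaced by a counter and a gauge created from the sub-scope (for example via MustNewCounter and MustNewGauge).

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// watchBuffer is a hypothetical stand-in for the watcher's update buffer.
// The atomic fields stand in for a Prometheus counter and gauge.
type watchBuffer struct {
	updates chan string  // buffered channel; its capacity plays the role of WatchBufferSize
	dropped atomic.Int64 // stand-in for the dropped-updates counter
	depth   atomic.Int64 // stand-in for the buffer-depth gauge
}

func newWatchBuffer(size int) *watchBuffer {
	return &watchBuffer{updates: make(chan string, size)}
}

// enqueue attempts a non-blocking send. If the buffer is full, the update
// is dropped, the drop is counted, and false is returned.
func (b *watchBuffer) enqueue(u string) bool {
	select {
	case b.updates <- u:
		b.depth.Add(1)
		return true
	default:
		b.dropped.Add(1)
		return false
	}
}

// dequeue removes one update if any is buffered and lowers the depth gauge.
func (b *watchBuffer) dequeue() (string, bool) {
	select {
	case u := <-b.updates:
		b.depth.Add(-1)
		return u, true
	default:
		return "", false
	}
}

func main() {
	b := newWatchBuffer(2)
	for _, u := range []string{"a", "b", "c"} {
		b.enqueue(u) // the third send finds the buffer full and is dropped
	}
	fmt.Println("dropped:", b.dropped.Load(), "depth:", b.depth.Load())
}
```

Updating the gauge on enqueue and dequeue keeps it exact at the cost of two extra atomic operations per event; sampling `len(b.updates)` periodically is the cheaper alternative the item mentions.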

Acceptance criteria

  • /metrics exposes a dropped-updates counter, watcher event throughput (by result), processing latency, and buffer depth for the actions service.
  • The TODO at actions/k8s/client.go:65 is implemented and removed.
  • Metrics are created once under a dedicated sub-scope (no Prometheus duplicate-registration panics).
  • A unit test verifies the dropped-updates counter increments when an update is dropped, and that the throughput counter increments on event processing.

Pointers

  • actions/k8s/client.go — the watcher, worker loop, buffer, and the dropped-updates TODO (line 65); constructor NewActionsClient (line 77) already receives a promutils.Scope.
  • actions/setup.go:31-40 — where NewActionsClient is constructed with sc.Scope.
  • flytestdlib/promutils/scope.go: Scope helpers (MustNewCounter, MustNewGauge, MustNewStopWatch, NewSubScope).

Notes for contributors

  • Keep label cardinality bounded — label by result/status, never by action/run IDs or other user input.
  • This is independent of the runs-service instrumentation issues (#7447, #7448, #7449); all consume the same Scope from #7446.
