[flyte2] Instrument the actions service (watcher metrics + dropped-updates counter) · flyteorg/flyte#7450


flyte2, good first issue

Description

Part of #7445. Depends on #7446 (the /metrics endpoint + initialized Scope must exist first).

Summary

Instrument the actions service with Prometheus metrics: implement the existing dropped-updates counter TODO, and add throughput / latency / queue-depth metrics for the TaskAction watcher.

Background

The actions service is already partly wired for metrics — it just has nothing to plug into yet:

actions/setup.go:39 already passes sc.Scope into NewActionsClient(...).
actions/k8s/client.go:91 already uses scope.NewSubScope("actions_filter") for the dedup bloom filter.
actions/k8s/client.go:65 has an explicit TODO: // TODO: add a prometheus counter for dropped updates when metrics are wired up.

Note on the metrics scope: When run via the unified manager (manager/cmd/main.go:75), sc.Scope is already initialized (promutils.NewScope("flyte")) before actions.Setup runs, so the bloom-filter sub-scope at client.go:91 works and there is no panic. This issue depends on #7446 because #7446 mounts the /metrics endpoint; without it, the metrics you add here are registered in the default registry but never exposed to a scrape. #7446 also initializes sc.Scope at the framework level, which additionally makes the standalone actions/cmd/main.go binary safe. That path currently leaves sc.Scope nil, so the scope.NewSubScope(...) call at client.go:90-91 would panic there, because RecordFilterSize defaults to 1 << 23, which is greater than 0.

What to do

Using the Scope available on ActionsClient (passed in via NewActionsClient), add metrics under a dedicated sub-scope (e.g. scope.NewSubScope("watcher")):

Dropped updates counter — implement the TODO at actions/k8s/client.go:65. Increment a counter whenever a watch update is dropped (e.g. buffer full / channel send would block).
Watcher throughput — counter of TaskAction events processed, labeled by result (success/error).
Processing latency — a timer/histogram around per-event handling in the watch worker loop.
Queue/buffer depth — a gauge for the watch buffer occupancy (config WatchBufferSize), updated as events are enqueued/dequeued (or sampled periodically).
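A minimal sketch of how the four metrics above could hang together, assuming updates are dropped by a non-blocking channel send (the issue names this only as an example, so the real drop condition in client.go may differ). To stay runnable with the standard library alone, the `counter` and `gauge` types are hypothetical stand-ins for the Prometheus types a promutils.Scope would return, and `watcherMetrics`, `update`, `enqueue`, and `drain` are illustrative names that do not come from the codebase.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// counter and gauge are hypothetical stand-ins for Prometheus metric types;
// only the operations used in this sketch are modeled.
type counter struct{ n atomic.Int64 }

func (c *counter) Inc()         { c.n.Add(1) }
func (c *counter) Value() int64 { return c.n.Load() }

type gauge struct{ v atomic.Int64 }

func (g *gauge) Set(v int64)  { g.v.Store(v) }
func (g *gauge) Value() int64 { return g.v.Load() }

// watcherMetrics groups the metrics requested in the issue.
type watcherMetrics struct {
	dropped     counter      // updates discarded because the buffer was full
	success     counter      // events processed with result "success"
	failure     counter      // events processed with result "error"
	bufferDepth gauge        // current occupancy of the watch buffer
	totalNanos  atomic.Int64 // stand-in for a latency timer or histogram
}

// update is a placeholder for a TaskAction watch event.
type update struct{ name string }

// enqueue attempts a non-blocking send. A full buffer counts as a drop.
func enqueue(ch chan update, u update, m *watcherMetrics) bool {
	select {
	case ch <- u:
		m.bufferDepth.Set(int64(len(ch)))
		return true
	default:
		m.dropped.Inc()
		return false
	}
}

// drain handles every buffered update, recording result and latency.
func drain(ch chan update, handle func(update) error, m *watcherMetrics) {
	for {
		select {
		case u := <-ch:
			m.bufferDepth.Set(int64(len(ch)))
			start := time.Now()
			err := handle(u)
			m.totalNanos.Add(int64(time.Since(start)))
			if err != nil {
				m.failure.Inc()
			} else {
				m.success.Inc()
			}
		default:
			return // buffer is empty
		}
	}
}

func main() {
	m := &watcherMetrics{}
	ch := make(chan update, 2) // capacity plays the role of WatchBufferSize
	for _, n := range []string{"a", "b", "c"} {
		enqueue(ch, update{name: n}, m) // "c" does not fit and is dropped
	}
	drain(ch, func(u update) error {
		if u.name == "b" {
			return errors.New("handler failed")
		}
		return nil
	}, m)
	// Order: dropped, success, failure, buffer depth.
	fmt.Println(m.dropped.Value(), m.success.Value(), m.failure.Value(), m.bufferDepth.Value())
}
```

In the real client the stand-ins would be replaced by metrics created from the watcher sub-scope, and the two result counters would more likely be a single counter with a result label. The shape of the send and the worker loop is the part this sketch is meant to illustrate.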

Acceptance criteria

/metrics exposes a dropped-updates counter, watcher event throughput (by result), processing latency, and buffer depth for the actions service.
The TODO at actions/k8s/client.go:65 is implemented and removed.
Metrics are created once under a dedicated sub-scope (no Prometheus duplicate-registration panics).
A unit test verifies the dropped-updates counter increments when an update is dropped, and that the throughput counter increments on event processing.
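The "created once" criterion can be illustrated with a small sketch. Prometheus registries reject a second collector with the same fully qualified name, and the Must-style helpers turn that rejection into a panic, so metrics should be built in the constructor rather than per event or per watcher restart. The `registry` type below is a hypothetical stand-in that mimics only that one behavior, and the metric names and the `watcher_` prefix are made up for illustration.

```go
package main

import "fmt"

// registry is a hypothetical stand-in for a Prometheus registry: it panics
// when the same fully qualified metric name is registered twice.
type registry struct{ names map[string]bool }

func newRegistry() *registry { return &registry{names: map[string]bool{}} }

func (r *registry) mustRegister(name string) {
	if r.names[name] {
		panic("duplicate metric registration: " + name)
	}
	r.names[name] = true
}

// newWatcherMetrics registers every watcher metric under one prefix.
// It is meant to be called exactly once, from the client constructor.
func newWatcherMetrics(r *registry, prefix string) []string {
	names := []string{"dropped_updates", "events_total", "processing_latency", "buffer_depth"}
	out := make([]string, 0, len(names))
	for _, n := range names {
		full := prefix + n
		r.mustRegister(full)
		out = append(out, full)
	}
	return out
}

// registersCleanly reports whether f ran without panicking.
func registersCleanly(f func()) (ok bool) {
	defer func() {
		if recover() != nil {
			ok = false
		}
	}()
	f()
	return true
}

func main() {
	r := newRegistry()
	first := registersCleanly(func() { newWatcherMetrics(r, "watcher_") })
	second := registersCleanly(func() { newWatcherMetrics(r, "watcher_") })
	fmt.Println(first, second) // prints "true false"
}
```

For the unit test in the last criterion, one practical consequence is that each test case should construct its client with a fresh scope or registry, since reusing one across cases would hit the same duplicate-name failure.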

Pointers

actions/k8s/client.go — the watcher, worker loop, buffer, and the dropped-updates TODO (line 65); constructor NewActionsClient (line 77) already receives a promutils.Scope.
actions/setup.go:31-40 — where NewActionsClient is constructed with sc.Scope.
flytestdlib/promutils/scope.go — Scope helpers (MustNewCounter, MustNewGauge, MustNewStopWatch, NewSubScope).

Notes for contributors

Keep label cardinality bounded — label by result/status, never by action/run IDs or other user input.
This is independent of the runs-service instrumentation issues (#7447, #7448, #7449); all consume the same Scope from #7446.

Contributor guide

Research direction: Explore the actions service code in actions/k8s/client.go and actions/setup.go, and flytestdlib/promutils/scope.go, to understand the existing Scope usage. Add Prometheus counters, a gauge, and a timer/histogram under a new sub-scope, implement the dropped-updates TODO, and write unit tests to verify metric increments.
Tech stack: Go
Domain: backend
Issue type: Feature
Difficulty: 3
Estimated time: 1-2 days
Activity status: Active
Clarity: Clear
Prerequisites: Go, Prometheus metrics
Newbie friendliness: 65