headscale

mirror of https://github.com/juanfont/headscale.git synced 2026-07-17 16:36:02 +00:00

Author	SHA1	Message	Date
Kristoffer Dalby	4946d1c88d	state: log nodes with map-breaking data at startup Scan a node-health check registry at boot and log each node whose name can't form a valid FQDN, with the rename fix. Log-only, no mutation. Updates #3346	2026-07-01 15:19:08 +02:00
Kristoffer Dalby	c497612c99	state: reject renames whose FQDN exceeds the hostname limit A valid label can still overflow 255 chars under a long base_domain; gate RenameNode with the new types.ValidateGivenName. Updates #3346	2026-07-01 15:19:08 +02:00
Kristoffer Dalby	ee95cf58d9	hscontrol: add the OAuth client and access-token model Add the OAuth client type, its database storage, the scope grant package, policy tag-ownership exposure, and the state operations backing the v2 OAuth client-credentials flow.	2026-06-26 09:18:06 +02:00
Kristoffer Dalby	6b413fe237	api/v2: soft-revoke auth keys with a configurable collector Tailscale's keys API has no separate expire verb: DELETE is the revoke. Map it to a soft revoke so the key stays retrievable as invalid afterwards instead of vanishing, matching the SDK and Terraform's expectations. Add a revoked timestamp to pre-auth keys (migration plus schema), mark a key invalid once revoked, and have the keys DELETE handler stamp it rather than destroy the row. A background collector reaps revoked keys after a configurable retention window (preauth_keys.revoked_retention, default 168h), so the table does not grow without bound.	2026-06-21 04:17:03 +02:00
Kristoffer Dalby	2ff342764f	state: return ErrGivenNameInvalid for invalid node rename RenameNode wrapped the raw dnsname error, so the v2 API mapped it to a 500. Emit the sentinel so an invalid name surfaces as a 400.	2026-06-21 04:17:03 +02:00
Kristoffer Dalby	3d811f29a6	db: add API-key owner and pre-auth-key description for v2 Give API keys an optional owning user and pre-auth keys a free-text description, plus the state accessors the v2 API needs to act as a key's owner and to persist descriptions.	2026-06-21 04:17:03 +02:00
Kristoffer Dalby	8f314797ce	all: fix full-tree golangci-lint findings The new full-tree golangci-lint check reports issues the --new-from-rev diff lint hid: nine wsl_v5 whitespace gaps, a prealloc, and an unparam (setCSRFCookie never errored, so drop the return and update callers). gocyclo on the central UpdateNodeFromMapRequest and an SA1019 NetMap deprecation in an integration helper are suppressed with reasons.	2026-06-20 21:08:01 +02:00
Kristoffer Dalby	560b6d81ad	hscontrol: remove the proto, gRPC and grpc-gateway stack With the v1 API served by Huma and the CLI on the HTTP client, nothing uses the gRPC service or its generated code. Delete proto/, the generated gen/go and gen/openapiv2, grpcv1.go, the buf config, the proto .Proto() helpers and gRPC config, and the proto build tooling from the flake and CI.	2026-06-19 15:21:00 +02:00
Kristoffer Dalby	ea5165e325	util, db: generate key material as hex via tailscale rands	2026-06-17 16:12:19 +02:00
Kristoffer Dalby	368b9e7edd	state: add regression test for NodeKey-rotation binding	2026-06-17 16:12:19 +02:00
Kristoffer Dalby	27468f944b	all: adopt maps and cmp helpers	2026-06-17 16:12:19 +02:00
Kristoffer Dalby	60f0544b78	dns, change, noise, auth, capver: misc consolidation	2026-06-17 16:12:19 +02:00
Kristoffer Dalby	145672f0b6	state: consolidate route-election and node helpers	2026-06-17 16:12:19 +02:00
Kristoffer Dalby	55b9e3cfda	db, sqliteconfig: consolidate query and config helpers	2026-06-17 16:12:19 +02:00
Kristoffer Dalby	1689478485	state: return all nodes for a machine key, reject ambiguous ownership Collapse the single-pick machine-key lookups onto GetNodesByMachineKeyAllUsers so callers see every node sharing a machine key and reject the ambiguous or impossible cases (tagged and user-owned coexistence; a tagged key with several user-owned candidates) instead of mutating an arbitrarily-picked node. Updates #3312	2026-06-15 20:29:15 +02:00
Kristoffer Dalby	a1d3e98255	state: allow key expiry to be set on tagged nodes Tagged nodes disable key expiry by default but can still have one set explicitly, and changing tags leaves expiry unchanged, matching Tailscale. Updates #3312	2026-06-15 20:29:15 +02:00
Kristoffer Dalby	b83bf3f993	state: serialise registration per machine key Concurrent registrations of one machine key each saw no existing node and created their own, duplicating nodes and IPs. Hold a per-machine lock across the find-then-create section. Updates #3312	2026-06-15 20:29:15 +02:00
Kristoffer Dalby	96d2e6ed60	state: roll back node store when re-registration write fails Re-registration mutated the node store before the database write and did not revert on failure, so a restart could drop the client's current node key. Snapshot the node and restore it if the write fails. Updates #3312	2026-06-15 20:29:15 +02:00
Kristoffer Dalby	bff216a184	state: update node in place on pre-auth-key re-registration A reusable key on a converted node, or a tagged key on a user-owned node, fell through to new-node creation, leaving two nodes per machine. Match by machine key and update or convert in place. Updates #3312	2026-06-15 20:29:15 +02:00
Kristoffer Dalby	fd08b8fa8c	state: make any-user machine-key lookup deterministic The lookup returned the first map match, so re-auth branch choice varied with map order once a machine key had more than one node. Prefer the tagged node, else the lowest node ID. Updates #3312	2026-06-15 20:29:15 +02:00
Kristoffer Dalby	a73d38bb3f	state: reject re-registration claiming another node's key The pre-auth-key path wrote the client node key without the collision check the auth path applies, so a re-registration could claim another node's key and poison the node-key index. Reject keys bound to a different machine. Updates #3312	2026-06-15 20:29:15 +02:00
Kristoffer Dalby	e759d9fc90	auth: re-validate key when an expired node re-registers The re-registration fast-path skipped validation for a matching node key without checking expiry, so an expired node could re-auth with a spent key. Gate it on the node not being expired. Updates #3312	2026-06-15 20:29:15 +02:00
Kristoffer Dalby	0961e79e16	state: re-register converted tagged nodes with reused key A node converted to tagged is re-indexed under no user, so re-registration keyed on the key's owner missed it and rejected the spent one-shot key. Match the existing tagged node by machine key. Fixes #3312	2026-06-15 20:29:15 +02:00
Kristoffer Dalby	a5ef3aff15	state: patch relogins and gate endpoint broadcasts Relogin sent as a peer patch with endpoints preserved; endpoint-only deltas broadcast only on useful (non-STUN) changes.	2026-06-15 12:02:39 +02:00
Kristoffer Dalby	f497b4efd7	state, poll: refcount poll sessions, mark offline only on last release A cancelled map request whose handler ran late could Connect after the live session, steal the newest SessionEpoch, then exit without disconnecting (stillConnected path); the live session's final Disconnect was rejected as stale and the node stayed online forever (relogin flake). Counted releases are order-independent, so overlapping sessions cannot strand a node in either direction.	2026-06-11 16:28:25 +02:00
kloba	71a4ce3c9f	noise: re-delegate SSH check when the auth session is missing (#3306)	2026-06-10 11:48:02 +02:00
Kristoffer Dalby	88044f43ff	hscontrol: satisfy golangci-lint on changed lines Sentinel ErrNodeKeyInUse (err113); key the visible-peer set by tailcfg.NodeID to drop an int64->uint64 cast (gosec G115); NewRequestWithContext (noctx); wsl.	2026-06-09 15:21:18 +02:00
Kristoffer Dalby	5e05652a78	mapper: derive incremental visibility from one shared filter filterVisiblePeerPatches and filterVisibleNodes now share one visiblePeerIDs helper using the live MatchersForNode/ReduceNodes set, so paths cannot drift.	2026-06-09 15:21:18 +02:00
Kristoffer Dalby	4914f9f2fd	state: reject re-auth claiming another machine's NodeKey applyAuthNodeUpdate rotated the node's NodeKey to the client-supplied value without the 1:1 NodeKey/MachineKey check createAndSaveNewNode (f8f08cf7) and getAndValidateNode enforce. A re-authenticating node could thus rotate its key to a victim's and poison the NodeStore NodeKey index, denying the victim service. Apply the same uniqueness check on the re-auth path.	2026-06-09 15:21:18 +02:00
Kristoffer Dalby	eb57a3a62b	state: reject registration claiming another machine's NodeKey createAndSaveNewNode trusted the client NodeKey without checking it was already bound to a different machine, so an authenticated party could register a node carrying a victim's public NodeKey, poison the NodeStore NodeKey index, and make the victim's MapRequest resolve to the wrong node (rejected by getAndValidateNode = DoS). Enforce the 1:1 NodeKey/MachineKey binding at registration, as poll time already does.	2026-06-09 15:21:18 +02:00
Kristoffer Dalby	a518a5076a	state: batch route auto-approval into one policy rebuild Per-node SetApprovedRoutes made each policy reload O(m*n^2).	2026-06-08 10:04:49 +02:00
Kristoffer Dalby	08f186f22a	state: skip database persist for keepalive-only map requests A lone LastSeen bump no longer triggers a full-row write and policy rescan.	2026-06-08 10:04:49 +02:00
Kristoffer Dalby	017162dac1	state: signal NodeStore shutdown without closing writeQueue Writes racing Stop now drop cleanly instead of panicking on send.	2026-06-08 10:04:49 +02:00
Kristoffer Dalby	06d6816dc9	state: keep nil expiry for nodes that stay tagged on reauth The convert-from-tag arm wrongly set an expiry on still-tagged nodes.	2026-06-08 10:04:49 +02:00
Kristoffer Dalby	2e2401833b	state: persist live NodeStore node in persistNodeToDB Avoids clobbering concurrent admin writes (tags, routes, rename).	2026-06-08 10:04:49 +02:00
Kristoffer Dalby	5228cb1a40	change: drop subnet-router full update, use policy change Don't send a full update when a subnet router goes up or down; the gated policy change already recomputes peers and is a smaller payload. Updates #3293	2026-06-03 19:05:24 +02:00
Kristoffer Dalby	7706552c99	state: gate reconnect PolicyChange on NodeNeedsPeerRecompute Connect and Disconnect appended change.PolicyChange() on every reconnect. PolicyChange sets RequiresRuntimePeerComputation, so the batcher rebuilt a full netmap (packet filters, SSH policy, peer serialization) for every connected node — O(N) per reconnect, O(N^2) on a restart storm. On a small VM this saturated CPU after the v0.28 -> v0.29 upgrade. Emit it only when the node's online state changes what peers compute: subnet routers, relay targets, and via targets. An ordinary reconnect now sends just the lightweight online/offline peer patch. Relay and via targets still recompute, so peers drop a stale PeerRelay allocation when a relay goes offline. Fixes #3293	2026-06-03 14:51:57 +02:00
Kristoffer Dalby	4cca63155d	all: apply godoc [Name] link conventions across comments Every Go-identifier reference in // and /* */ comments now uses godoc's [Name] linking syntax so pkg.go.dev and `go doc` render them as clickable cross-references. No behaviour change. Pattern applied across the tree: in-package [Foo], [Foo.Bar]; cross-package [pkg.Foo], [pkg.Foo.Bar]; stdlib [netip.Prefix], [errors.Is], [context.Context]; Tailscale [tailcfg.MapResponse], [tailcfg.Node.CapMap], [tailcfg.NodeAttrSuggestExitNode]. Skip rules: File:line refs left as plain text; HuJSON wire keys inside backtick raw strings untouched; ACL/policy syntax tokens (tag:foo, autogroup:self, ...) are not Go symbols and are left as plain text; JSON/OIDC wire keys, gorm tags, RFC IPv6 placeholders, markdown link tags, and decorative dividers are all left as-is.	2026-05-19 09:55:22 +02:00
Kristoffer Dalby	17236fd284	all: annotate complex functions with gocyclo rationale Splitting these functions does not buy clarity; each has been extracted before and put back. Pin the //nolint:gocyclo on each with the reason their shape resists clean decomposition: policy/v2/policy.go ViaRoutesForPeer (three-pass via-grant resolution); policy/v2/filter.go compileSSHPolicy (per-rule branches with intertwined autogroup:self handling, annotated in the earlier nil-error commit); state/state.go HandleNodeFromPreAuthKey (security-sensitive sequential validation order); servertest/routes_test TestRoutes (table-driven test driver with many independent subtests). Also: //nolint:recvcheck on policy/v2.SSHUser, because UnmarshalJSON requires a pointer receiver while the other methods on this string newtype use value receivers by convention.	2026-05-19 09:55:22 +02:00
Kristoffer Dalby	e2f2f9211f	state, servertest: property-test HA election + invariant catalogue Expand TestPrimaryRoutesProperty (5 -> 9 ops). New ops mirror the production shapes the failure cases hit: BatchProbeResults via UpdateNodes, SimultaneousDisconnect via UpdateNodes, SetApprovedRoutes that leaves announced RoutableIPs intact, OfflineExpiry that keeps Unhealthy set. The model now tracks announced and approved separately and recomputes the intersection. Strengthen the per-op assertions to cover invariants the model alone cannot prove: every primary must be online, every primary must currently advertise its prefix, no flap onto an unhealthy candidate when a healthy one was available, no flap off a previous primary that remains a healthy candidate. The check now takes a pre-op snapshot so the anti-flap rule has a stable reference. Add TestHAProberProperty in servertest. It drives a real TestServer with three HA-route-advertising clients through rapid-drawn sequences of ClientDisconnect / ClientReconnect / ProberTick / WaitForSnapshot ops and re-checks the same shape invariants after every step. Document the system in hscontrol/state/HA_INVARIANTS.md: a state machine over (Healthy+Online, Unhealthy+Online, Offline, OfflineExpired), fifteen numbered invariants with predicates and violation paths, and a coverage matrix mapping each invariant to its unit, servertest, and integration tests. Three rows pin the recent fixes to the invariants they enforce.	2026-05-18 17:18:08 +02:00
Kristoffer Dalby	c7630b505b	state: leave prefix unmapped when all primary candidates unhealthy electPrimaryRoutes' all-unhealthy fallback picked candidates[0] when the previous primary was no longer a candidate. The Phase-5 simultaneous dual-disconnect path in TestHASubnetRouterFailoverDockerDisconnect hits this asymmetrically: a batched probe cycle marks both routers unhealthy with prev=r2 preserved, then the grace-period Disconnect for r2 drops it from candidates. With prev gone and the remaining r1 still carrying its Unhealthy bit, the fallback pointed peers at the cable-pulled r1, flapping primary to an unreachable node and tripping requirePrimaryStable. Leave the prefix unmapped when prev is gone and every candidate is unhealthy. Peers see no advertiser instead of an unreachable one, which is honest: the next probe cycle re-evaluates and picks whichever node responds. The property-test model that mirrored the old behaviour is updated to match.	2026-05-18 17:18:08 +02:00
Kristoffer Dalby	de6be71a86	state: batch HA probe results so dual-disconnect cannot flap primary requirePrimaryStable in TestHASubnetRouterFailoverDockerDisconnect Phase 5a (simultaneous cable-pull of both routers) intermittently caught the primary flipping to the offline r1. Both probe goroutines mark their target unhealthy back-to-back; SetNodeUnhealthy publishes a fresh NodeStore snapshot each call, so the intermediate snapshot — r1 unhealthy, r2 still healthy — runs the election with one healthy candidate left and picks it. The next snapshot then enters the all-unhealthy preserve-prev path, which preserves the wrong choice. Collect probe results from the cycle and apply them through a new NodeStore.UpdateNodes batched op so the election only runs once, with the cycle's final health state. PolicyChange dispatch moves outside the wg.Go goroutines and fires once if the primary assignment actually changed.	2026-05-18 17:18:08 +02:00
Kristoffer Dalby	fb8eecae25	state: defer HA failover when probe target reconnected mid-cycle The HA prober dispatches a PingRequest, waits ProbeTimeout (5s), and marks the node unhealthy if no callback arrives. A node that bounced its poll session between probe cycles satisfies two conditions that conspire to fail TestHASubnetRouterFailover: a probe queued against the previous session is silently dropped when the worker writes to the closed connection (timeout always fires), and a probe sent immediately after reconnect lands while wgengine is still rebuilding magicsock state from the new netmap. Either path installs a spurious unhealthy bit, which sends the preserved-primary anti-flap the wrong way. Record the session observed at dispatch time and drop the timeout path if the node reconnected since. Require the session to survive a full probe cycle before a timeout can drive a failover.	2026-05-18 17:18:08 +02:00
Kristoffer Dalby	b1196baf6d	state: add regression test for Node slice persistence Drives the persist path for ApprovedRoutes, Tags and Endpoints — seed a non-empty value, clear to nil, read the column back from disk, then close the State and reopen one against the same sqlite file to simulate a server restart. Pins the contract the named IsZero slice types enforce so future changes to the persist path cannot silently drop a cleared slice column. Updates #3110	2026-05-15 11:21:58 +02:00
Kristoffer Dalby	7a20db9f49	types: persist Node JSON slices via named IsZero types Endpoints, Tags and ApprovedRoutes serialize as JSON on Node. GORM's struct Updates path skips fields it considers zero, and reflect treats a nil slice as zero — clearing any of these columns via the State persist path would leave the previous value in the database. Introduce Strings, Prefixes and AddrPorts as named slice types whose IsZero() always reports false, so GORM keeps the column in the UPDATE regardless of the slice being nil or empty. JSON marshalling is unchanged: nil serializes to null, empty to []. List() returns the underlying unnamed slice for callers (mainly testify assertions over reflect.DeepEqual) that distinguish the named type from its base. Regenerated types_clone.go and types_view.go follow the field-type swap. Test assertions across hscontrol/{db,state,servertest} updated to call .List() where reflect.DeepEqual previously matched the raw slice type. Fixes #3110	2026-05-15 11:21:58 +02:00
Kristoffer Dalby	6fcff9e352	mapper, state: deliver nodeAttrs through MapResponse and harden nextdns DoH rewrite WithSelfNode and buildTailPeers merge each node's policy CapMap into the tailcfg.Node.CapMap they emit. State.NodeCapMap and State.NodeCapMaps wrap the policy manager: NodeCapMap returns a defensive clone per call; NodeCapMaps snapshots the full per-node map once for batched callers, amortising pm.mu acquisition across a peer build. generateDNSConfig grew a per-node CapMap argument so it can apply nodeAttr-driven DNS overlays. The nextdns DoH rewrite hardens against policy-controlled inputs: - nextDNSDoHHost anchors the prefix match instead of substring, so a hostile resolver URL cannot smuggle a nextdns hostname in a path or query. - nextDNSProfileFromCapMap accepts only profile names matching [A-Za-z0-9._-]{1,64} and picks the lexicographically first when multiple are granted -- deterministic, no shell metacharacters or URL fragments through. - addNextDNSMetadata composes the rewritten URL via url.Parse + url.Values rather than fmt.Sprintf, so existing query strings on the resolver URL survive and metadata cannot inject a new component. WithTaildropEnabled in servertest controls cfg.Taildrop.Enabled per test so cap/file-sharing emission can be toggled in tests that need to verify the off path.	2026-05-13 14:22:30 +02:00
SAY-5	01e548e030	state: avoid nil deref in registration handlers when old user is missing Mirror the guard from HandleNodeFromPreAuthKey in HandleNodeFromAuthPath. Both functions log the old user's name in the "different user" branch when an existing NodeStore entry under the same machine key belongs to another user. UserView.Name dereferences the backing User pointer unconditionally, so when the cached node was loaded with a non-nil UserID but a nil User (Preload join missed the row, or upstream code left the snapshot in that shape), the log call panics with a nil-pointer dereference at hscontrol/types/types_view.go:97. The panic is caught by the http2 server's runHandler for the noise control plane, so the process keeps running but every retry produces a new panic; production has observed bursts of ~1.9k panics per hour during a tailscaled reconnect loop. The gRPC/OIDC entry has no equivalent recover and would surface the panic to the caller. Guard both call sites with oldUser.Valid() and fall back to an empty old-user name when the pointer is nil. The "Creating new node for different user" log line still includes the existing node ID, hostname, machine key, and new user, so operator visibility is preserved. Add reproduction tests for both handlers seeding the orphan shape directly into NodeStore via PutNodeInStoreForTest. Co-Authored-By: Kristoffer Dalby <kristoffer@dalby.cc>	2026-05-06 07:23:02 +01:00
Kristoffer Dalby	94ec607bca	state: per-goroutine deadline in HA probe cycle `time.After(ProbeTimeout)` returned a single channel shared by every probe goroutine in the cycle. Only the first goroutine to receive the deadline tick drains the channel; any other goroutine still waiting on its `responseCh` is then stuck forever, `wg.Wait()` never returns, and the scheduler loop in `app.go` stalls on the next tick. The condition fires whenever two or more nodes time out in the same cycle — common under cable-pull where IsOnline lags reality and both routers stay in the candidate set as half-open TCP. Move the timer inside each goroutine so every probe has its own deadline. Updates #3234	2026-04-30 12:52:05 +01:00
Kristoffer Dalby	3d5c0af4e7	state: preserve previous primary when all HA advertisers unhealthy electPrimaryRoutes' all-unhealthy fallback picked candidates[0] (lowest NodeID) regardless of who was prev. Under cable-pull semantics IsOnline lags reality (long-poll TCP half-open), so both routers stay in candidates and both go Unhealthy via the prober; the fallback then churned primary to a node that was itself unreachable. Prefer prev when still in candidates; fall through to candidates[0] only when prev is gone. Anti-blackhole holds. Update the property test reference model and split the unit test into existence (KeepsAPrimary) and identity (PreservesPrevious) cases. Fixes #3203	2026-04-29 18:08:39 +01:00
Kristoffer Dalby	9f7c8e9a07	state: clear Unhealthy when node leaves HA candidate set Restore the legacy auto-clear at write boundaries that drop HA candidacy: Disconnect, SetApprovedRoutes(empty), and UpdateNodeFromMapRequest shrinking advertised routes to empty. Plus a defensive guard in SetNodeUnhealthy. Updates #3203	2026-04-29 18:08:39 +01:00


144 commits