kat/docs/plan/phase2.md
2025-05-09 19:15:50 -04:00

# **Phase 2: Basic Agent & Node Lifecycle (Init, Join, PKI)**
* **Goal**: Implement the secure registration of a new agent node to an existing leader, including PKI for mTLS, and establish periodic heartbeating for status updates and failure detection.
* **RFC Sections Primarily Used**: 2.3 (Node Communication Protocol), 4.1.1 (Initial Leader Setup - CA), 4.1.2 (Agent Node Join - CSR), 10.1 (API Security - mTLS), 10.6 (Internal PKI), 4.1.3 (Node Heartbeat), 4.1.4 (Node Departure and Failure Detection - basic).

**Tasks & Sub-Tasks:**

1. **Implement Internal PKI Utilities (`internal/pki/ca.go`, `internal/pki/certs.go`)**
    * **Purpose**: Create and manage the Certificate Authority and sign certificates for mTLS.
    * **Details**:
        * `GenerateCA()`: Creates a new RSA key pair and a self-signed X.509 CA certificate, then saves both to disk (e.g., `/var/lib/kat/pki/ca.key`, `/var/lib/kat/pki/ca.crt`). The PKI directory is either derived from the parent directory of `backupPath` in `cluster.kat` or set by a new `pkiPath` option.
        * `GenerateCertificateRequest(commonName, keyOutPath, csrOutPath)`: Used by the agent. Generates an RSA key and creates a CSR.
        * `SignCertificateRequest(caKeyPath, caCertPath, csrData, certOutPath, duration)`: Used by the leader. Loads the CA key/cert, parses the CSR, and issues a signed certificate.
        * Helper functions to load keys and certs from disk.
    * **Potential Challenges**: Handling cryptographic operations correctly and securely. Setting restrictive file permissions on stored private keys.
    * **Verification**: Unit tests for `GenerateCA`, `GenerateCertificateRequest`, `SignCertificateRequest`. Generated certs should be verifiable against the CA.
2. **Leader: Initialize CA & Its Own mTLS Certs on `init` (`cmd/kat-agent/main.go`)**
    * **Purpose**: The first leader needs to establish the PKI and secure its own API endpoint.
    * **Details**:
        * During `kat-agent init`, after etcd is up and leadership is confirmed:
            * Call `pki.GenerateCA()` if the CA files don't exist.
            * Generate the leader's own server key and CSR (e.g., for `leader.kat.cluster.local`).
            * Sign that CSR with the CA to obtain the leader's server certificate.
            * Configure the (future) API HTTP server to use this server key/cert for TLS and to require client certs (mTLS).
    * **Verification**: After `kat-agent init`, the CA key/cert and the leader's server key/cert exist in the configured PKI path.
3. **Implement Basic API Server with mTLS on Leader (`internal/api/server.go`, `internal/api/router.go`)**
    * **Purpose**: Provide the initial HTTP endpoints required for agent join, secured with mTLS.
    * **Details**:
        * Set up `http.Server` with `tls.Config`:
            * `Certificates`: Leader's server key/cert.
            * `ClientAuth: tls.RequireAndVerifyClientCert`.
            * `ClientCAs`: Pool containing the cluster CA certificate.
        * Minimal router (e.g., `gorilla/mux` or `http.ServeMux`) for:
            * `POST /internal/v1alpha1/join`: Endpoint for the agent to submit its CSR. (Internal, as it is part of bootstrap.)
    * **Verification**: An HTTPS client (e.g., `curl` with appropriate client certs) can connect to the leader's API port if it presents a cert signed by the cluster CA. The connection fails without a client cert or with a cert from a different CA.
4. **Agent: `join` Command & CSR Submission (`cmd/kat-agent/main.go`, `internal/cli/join.go` - or similar for agent logic)**
    * **Purpose**: Allow a new agent to request to join the cluster and obtain its mTLS credentials.
    * **Details**:
        * `kat-agent join` subcommand:
            * Flags: `--leader-api <ip:port>`, `--advertise-address <ip_or_interface_name>`, `--node-name <name>` (optional, leader can generate).
            * Generate its own key pair and CSR using `pki.GenerateCertificateRequest()`.
            * Make an HTTP POST to the Leader's `/internal/v1alpha1/join` endpoint:
                * Payload: CSR data, advertise address, requested node name, initial WireGuard public key (placeholder for now).
                * Bootstrap trust (open question): the agent has no signed certificate yet, so this first request cannot present a client cert. The RFC implies mTLS is mandatory, while RFC 4.1.2 states "Leader, upon validating the join request (V1 has no strong token validation, relies on network trust)". Options under consideration:
                    * The agent trusts the leader's CA cert via an out-of-band mechanism or a `--leader-ca-cert` flag.
                    * A pre-shared token authorizes the CSR signing before the agent has its own signed cert.
                    * For simplicity in V1, the initial join POST happens over HTTPS where the agent trusts the leader's self-signed cert (if the leader has one before the CA is used for client auth).
                * This needs clarification. *Assume network trust for now: agent connects, sends CSR, leader signs.* Note that a listener configured with `tls.RequireAndVerifyClientCert` (Task 3) rejects a client without a certificate during the TLS handshake, so the join endpoint needs either its own listener or a relaxed client-auth mode with per-route enforcement.
            * Receive the signed certificate and the CA certificate from the Leader. Store them locally.
    * **Potential Challenges**: Securely bootstrapping trust for the very first communication to the leader to submit the CSR.
    * **Verification**: `kat-agent join` command:
        * Generates key/CSR.
        * Successfully POSTs CSR to leader.
        * Receives and saves its signed certificate and the CA certificate.
5. **Leader: CSR Signing & Node Registration (Handler for `/internal/v1alpha1/join`)**
    * **Purpose**: Validate the joining agent, sign its CSR, and record its registration.
    * **Details**:
        * Handler for `/internal/v1alpha1/join`:
            * Parse CSR, advertise address, WG pubkey from request.
            * Validate (minimal for now).
            * Generate a unique Node Name if not provided. Assign a Node UID.
            * Sign the CSR using `pki.SignCertificateRequest()`.
            * Store Node registration data in etcd via `StateStore` (`/kat/nodes/registration/{nodeName}`: UID, advertise address, WG pubkey placeholder, join timestamp).
            * Return the signed agent certificate and the cluster CA certificate to the agent.
    * **Verification**:
        * After agent joins, its certificate is signed by the cluster CA.
        * Node registration data appears correctly in etcd under `/kat/nodes/registration/{nodeName}`.
6. **Agent: Establish mTLS Client for Subsequent Comms & Implement Heartbeating (`internal/agent/agent.go`)**
    * **Purpose**: Agent uses its new mTLS certs to communicate status to the Leader.
    * **Details**:
        * Agent configures its HTTP client to use its signed key/cert and the cluster CA cert for all future Leader communications.
        * Periodic Heartbeat (RFC 4.1.3):
            * Ticker (e.g., every `agentTickSeconds` from `cluster.kat`, default 15s).
            * On tick, gather basic node status (node name, timestamp, initial resource capacity stubs).
            * HTTP `POST` to Leader's `/v1alpha1/nodes/{nodeName}/status` endpoint using the mTLS-configured client.
    * **Verification**: Agent logs successful heartbeat POSTs.
7. **Leader: Receive Heartbeats & Basic Failure Detection (Handler for `/v1alpha1/nodes/{nodeName}/status`, `internal/leader/leader.go`)**
    * **Purpose**: Leader tracks agent status and detects failures.
    * **Details**:
        * API endpoint `/v1alpha1/nodes/{nodeName}/status` (mTLS required):
            * Receives status update from agent.
            * Updates node's actual state in etcd (`/kat/nodes/status/{nodeName}/heartbeat`: timestamp, reported status). Could use an etcd lease for this key, renewed by agent heartbeats.
        * Failure Detection (RFC 4.1.4):
            * Leader has a reconciliation loop or periodic check.
            * Scans `/kat/nodes/status/` in etcd.
            * If a node's last heartbeat timestamp is older than `nodeLossTimeoutSeconds` (from `cluster.kat`), update its status in etcd to `NotReady` (e.g., `/kat/nodes/status/{nodeName}/condition: NotReady`).
    * **Potential Challenges**: Efficiently scanning for dead nodes without excessive etcd load.
* **Milestone Verification**:
    * `kat-agent init` runs as Leader, the CA is created, and its API is up with mTLS.
    * A second `kat-agent join ...` process successfully:
        * Generates a CSR and gets it signed by the Leader.
        * Saves its cert and the CA cert.
        * Starts sending heartbeats to the Leader using mTLS.
    * Leader logs receipt of heartbeats from the joined Agent.
    * Node status (last heartbeat time) is updated in etcd by the Leader.
    * If the joined Agent process is stopped, after `nodeLossTimeoutSeconds`, the Leader updates the node's status in etcd to `NotReady`. This can be verified using `etcdctl` or a `StateStore.Get` call.