# **Phase 2: Basic Agent & Node Lifecycle (Init, Join, PKI)**

* **Goal**: Implement the secure registration of a new agent node to an existing leader, including PKI for mTLS, and establish periodic heartbeating for status updates and failure detection.
* **RFC Sections Primarily Used**: 2.3 (Node Communication Protocol), 4.1.1 (Initial Leader Setup - CA), 4.1.2 (Agent Node Join - CSR), 10.1 (API Security - mTLS), 10.6 (Internal PKI), 4.1.3 (Node Heartbeat), 4.1.4 (Node Departure and Failure Detection - basic).

**Tasks & Sub-Tasks:**

1. **Implement Internal PKI Utilities (`internal/pki/ca.go`, `internal/pki/certs.go`)**
    * **Purpose**: Create and manage the Certificate Authority and sign certificates for mTLS.
    * **Details**:
        * `GenerateCA()`: Creates a new RSA key pair and a self-signed X.509 CA certificate. Saves both to disk (e.g., `/var/lib/kat/pki/ca.key`, `/var/lib/kat/pki/ca.crt`). The path is derived from the parent directory of `backupPath` in `cluster.kat`, or from a new `pkiPath` setting.
        * `GenerateCertificateRequest(commonName, keyOutPath, csrOutPath)`: Used by the agent. Generates an RSA key and creates a CSR.
        * `SignCertificateRequest(caKeyPath, caCertPath, csrData, certOutPath, duration)`: Used by the leader. Loads the CA key/cert, parses the CSR, and issues a signed certificate.
        * Helper functions to load keys and certs from disk.
    * **Potential Challenges**: Handling cryptographic operations correctly and securely. Permissions for key storage.
    * **Verification**: Unit tests for `GenerateCA`, `GenerateCertificateRequest`, `SignCertificateRequest`. Generated certs should be verifiable against the CA.

2. **Leader: Initialize CA & Its Own mTLS Certs on `init` (`cmd/kat-agent/main.go`)**
    * **Purpose**: The first leader needs to establish the PKI and secure its own API endpoint.
    * **Details**:
        * During `kat-agent init`, after etcd is up and leadership is confirmed:
            * Call `pki.GenerateCA()` if CA files don't exist.
            * Generate its own server key and CSR (e.g., for `leader.kat.cluster.local`).
            * Sign its own CSR using the CA to get its server certificate.
            * Configure its (future) API HTTP server to use this server key/cert for TLS and require client certs (mTLS).
    * **Verification**: After `kat-agent init`, the CA key/cert and the leader's server key/cert exist in the configured PKI path.

3. **Implement Basic API Server with mTLS on Leader (`internal/api/server.go`, `internal/api/router.go`)**
    * **Purpose**: Provide the initial HTTP endpoints required for agent join, secured with mTLS.
    * **Details**:
        * Set up `http.Server` with `tls.Config`:
            * `Certificates`: Leader's server key/cert.
            * `ClientAuth: tls.RequireAndVerifyClientCert`.
            * `ClientCAs`: Pool containing the cluster CA certificate.
        * Minimal router (e.g., `gorilla/mux` or `http.ServeMux`) for:
            * `POST /internal/v1alpha1/join`: Endpoint for the agent to submit its CSR. (Internal, as it is part of bootstrap.)
    * **Verification**: An HTTPS client (e.g., `curl` with appropriate client certs) can connect to the leader's API port if it presents a cert signed by the cluster CA. The connection fails without a client cert or with a cert from a different CA.

4. **Agent: `join` Command & CSR Submission (`cmd/kat-agent/main.go`, `internal/cli/join.go` - or similar for agent logic)**
    * **Purpose**: Allow a new agent to request to join the cluster and obtain its mTLS credentials.
    * **Details**:
        * `kat-agent join` subcommand:
            * Flags: `--leader-api`, `--advertise-address`, `--node-name` (optional, leader can generate).
            * Generate its own key pair and CSR using `pki.GenerateCertificateRequest()`.
            * Make an HTTP POST to the Leader's `/internal/v1alpha1/join` endpoint:
                * Payload: CSR data, advertise address, requested node name, initial WireGuard public key (placeholder for now).
                * Bootstrap trust for this *initial* join is still an open question:
                    * The client may need to trust the leader's CA cert via an out-of-band mechanism or a `--leader-ca-cert` flag, or use a token for initial auth if mTLS is strictly enforced from the start.
                    * *The RFC implies mTLS is mandatory, so the agent needs the CA cert to trust the leader, and the leader needs to accept the CSR, perhaps based on a pre-shared token, before the agent has its own signed cert.*
                    * For simplicity in V1, the initial join POST might happen over HTTPS where the agent trusts the leader's self-signed cert (if the leader has one before the CA is used for client auth), or a pre-shared token authorizes the CSR signing.
                    * RFC 4.1.2 states "Leader, upon validating the join request (V1 has no strong token validation, relies on network trust)". This needs clarification.
                    * *Assume network trust for now: agent connects, sends CSR, leader signs.*
            * Receive the signed certificate and the CA certificate from the Leader. Store them locally.
    * **Potential Challenges**: Securely bootstrapping trust for the very first communication to the leader to submit the CSR.
    * **Verification**: The `kat-agent join` command:
        * Generates key/CSR.
        * Successfully POSTs the CSR to the leader.
        * Receives and saves its signed certificate and the CA certificate.

5. **Leader: CSR Signing & Node Registration (Handler for `/internal/v1alpha1/join`)**
    * **Purpose**: Validate the joining agent, sign its CSR, and record its registration.
    * **Details**:
        * Handler for `/internal/v1alpha1/join`:
            * Parse CSR, advertise address, and WG pubkey from the request.
            * Validate (minimal for now).
            * Generate a unique Node Name if not provided. Assign a Node UID.
            * Sign the CSR using `pki.SignCertificateRequest()`.
            * Store Node registration data in etcd via `StateStore` (`/kat/nodes/registration/{nodeName}`: UID, advertise address, WG pubkey placeholder, join timestamp).
            * Return the signed agent certificate and the cluster CA certificate to the agent.
    * **Verification**:
        * After the agent joins, its certificate is signed by the cluster CA.
        * Node registration data appears correctly in etcd under `/kat/nodes/registration/{nodeName}`.

6. **Agent: Establish mTLS Client for Subsequent Comms & Implement Heartbeating (`internal/agent/agent.go`)**
    * **Purpose**: Agent uses its new mTLS certs to communicate status to the Leader.
    * **Details**:
        * Agent configures its HTTP client to use its signed key/cert and the cluster CA cert for all future Leader communications.
        * Periodic Heartbeat (RFC 4.1.3):
            * Ticker (e.g., every `agentTickSeconds` from `cluster.kat`, default 15s).
            * On tick, gather basic node status (node name, timestamp, initial resource capacity stubs).
            * HTTP `POST` to the Leader's `/v1alpha1/nodes/{nodeName}/status` endpoint using the mTLS-configured client.
    * **Verification**: Agent logs successful heartbeat POSTs.

7. **Leader: Receive Heartbeats & Basic Failure Detection (Handler for `/v1alpha1/nodes/{nodeName}/status`, `internal/leader/leader.go`)**
    * **Purpose**: Leader tracks agent status and detects failures.
    * **Details**:
        * API endpoint `/v1alpha1/nodes/{nodeName}/status` (mTLS required):
            * Receives the status update from the agent.
            * Updates the node's actual state in etcd (`/kat/nodes/status/{nodeName}/heartbeat`: timestamp, reported status). Could use an etcd lease for this key, renewed by agent heartbeats.
        * Failure Detection (RFC 4.1.4):
            * Leader has a reconciliation loop or periodic check.
            * Scans `/kat/nodes/status/` in etcd.
            * If a node's last heartbeat timestamp is older than `nodeLossTimeoutSeconds` (from `cluster.kat`), update its status in etcd to `NotReady` (e.g., `/kat/nodes/status/{nodeName}/condition: NotReady`).
    * **Potential Challenges**: Efficiently scanning for dead nodes without excessive etcd load.

* **Milestone Verification**:
    * `kat-agent init` runs as Leader, the CA is created, and its API is up with mTLS.
    * A second `kat-agent join ...` process successfully:
        * Generates a CSR and gets it signed by the Leader.
        * Saves its cert and the CA cert.
        * Starts sending heartbeats to the Leader using mTLS.
    * Leader logs receipt of heartbeats from the joined Agent.
    * Node status (last heartbeat time) is updated in etcd by the Leader.
    * If the joined Agent process is stopped, after `nodeLossTimeoutSeconds`, the Leader updates the node's status in etcd to `NotReady`. This can be verified using `etcdctl` or a `StateStore.Get` call.
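
The PKI flow from tasks 1, 4, and 5 (CA creation, CSR generation, CSR signing) can be sketched with the Go standard library alone. This is a minimal in-memory illustration, not the planned `internal/pki` implementation: the lowercase function names, the `newTemplate` helper, the 2048-bit key size, and the key-usage choices are all assumptions made for the example, and nothing is written to disk.

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newTemplate (hypothetical helper) returns a template with a random 128-bit serial.
func newTemplate(subject pkix.Name, validFor time.Duration) (*x509.Certificate, error) {
	serial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 128))
	if err != nil {
		return nil, err
	}
	now := time.Now()
	return &x509.Certificate{
		SerialNumber: serial,
		Subject:      subject,
		NotBefore:    now.Add(-time.Minute), // tolerate small clock skew
		NotAfter:     now.Add(validFor),
	}, nil
}

// generateCA creates an RSA key and a self-signed CA certificate.
func generateCA(commonName string, validFor time.Duration) (*rsa.PrivateKey, *x509.Certificate, error) {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, nil, err
	}
	tmpl, err := newTemplate(pkix.Name{CommonName: commonName}, validFor)
	if err != nil {
		return nil, nil, err
	}
	tmpl.IsCA, tmpl.BasicConstraintsValid = true, true
	tmpl.KeyUsage = x509.KeyUsageCertSign | x509.KeyUsageCRLSign
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return key, cert, err
}

// generateCSR creates a fresh RSA key and a DER-encoded CSR for commonName.
func generateCSR(commonName string) (*rsa.PrivateKey, []byte, error) {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.CertificateRequest{Subject: pkix.Name{CommonName: commonName}}
	der, err := x509.CreateCertificateRequest(rand.Reader, tmpl, key)
	return key, der, err
}

// signCSR checks the CSR's self-signature and issues a certificate signed by the CA.
func signCSR(caKey *rsa.PrivateKey, caCert *x509.Certificate, csrDER []byte, validFor time.Duration) (*x509.Certificate, error) {
	csr, err := x509.ParseCertificateRequest(csrDER)
	if err != nil {
		return nil, err
	}
	if err := csr.CheckSignature(); err != nil {
		return nil, err
	}
	tmpl, err := newTemplate(csr.Subject, validFor)
	if err != nil {
		return nil, err
	}
	tmpl.KeyUsage = x509.KeyUsageDigitalSignature | x509.KeyUsageKeyEncipherment
	tmpl.ExtKeyUsage = []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth, x509.ExtKeyUsageServerAuth}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, caCert, csr.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	caKey, caCert, err := generateCA("kat-cluster-ca", 24*time.Hour)
	if err != nil {
		panic(err)
	}
	_, csrDER, err := generateCSR("node-1")
	if err != nil {
		panic(err)
	}
	cert, err := signCSR(caKey, caCert, csrDER, time.Hour)
	if err != nil {
		panic(err)
	}
	fmt.Println("issued to:", cert.Subject.CommonName, "signed by CA:", cert.CheckSignatureFrom(caCert) == nil)
}
```

A real implementation would additionally PEM-encode the outputs, write keys with restrictive file permissions, and decide which subject fields from the CSR the leader is willing to honour; the sketch copies the CSR subject unchanged.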
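
The verification criterion in task 3 (a client with a cluster-signed cert connects; a client with no cert or a cert from a different CA is rejected) can be exercised end to end against a local test server. The sketch below is an assumption-heavy illustration: `mustIssue`, `leaderTLSConfig`, `get`, and `demo` are hypothetical names, the certificates use ECDSA P-256 instead of the RSA keys the plan calls for (purely to keep the demo fast and short), and `httptest` stands in for the leader's real API server.

```go
package main

import (
	"crypto"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// mustIssue creates a key pair and certificate for cn and panics on failure
// (acceptable in a demo only). A nil parent yields a self-signed CA.
func mustIssue(cn string, serial int64, parent *tls.Certificate) tls.Certificate {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(serial),
		Subject:      pkix.Name{CommonName: cn},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth},
		IPAddresses:  []net.IP{net.ParseIP("127.0.0.1")}, // the test server listens on loopback
	}
	signerCert, signerKey := tmpl, crypto.PrivateKey(key)
	if parent == nil {
		tmpl.IsCA, tmpl.BasicConstraintsValid = true, true
		tmpl.KeyUsage |= x509.KeyUsageCertSign
	} else {
		signerCert, signerKey = parent.Leaf, parent.PrivateKey
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, signerCert, &key.PublicKey, signerKey)
	if err != nil {
		panic(err)
	}
	leaf, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return tls.Certificate{Certificate: [][]byte{der}, PrivateKey: key, Leaf: leaf}
}

// leaderTLSConfig mirrors the tls.Config fields listed in task 3.
func leaderTLSConfig(server tls.Certificate, clusterCA *x509.CertPool) *tls.Config {
	return &tls.Config{
		Certificates: []tls.Certificate{server},
		ClientAuth:   tls.RequireAndVerifyClientCert,
		ClientCAs:    clusterCA,
		MinVersion:   tls.VersionTLS12,
	}
}

// get issues one GET with the given client certificates (possibly none),
// trusting only the CAs in roots, and returns the HTTP status code.
func get(url string, roots *x509.CertPool, certs ...tls.Certificate) (int, error) {
	client := &http.Client{Transport: &http.Transport{
		DisableKeepAlives: true,
		TLSClientConfig:   &tls.Config{RootCAs: roots, Certificates: certs},
	}}
	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	resp.Body.Close()
	return resp.StatusCode, nil
}

// demo starts an mTLS server and tries three clients: one with a
// cluster-signed cert, one with no cert, one with a cert from another CA.
func demo() (okStatus int, okErr, noCertErr, foreignErr error) {
	ca := mustIssue("cluster-ca", 1, nil)
	otherCA := mustIssue("other-ca", 1, nil)
	pool := x509.NewCertPool()
	pool.AddCert(ca.Leaf)
	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello", r.TLS.PeerCertificates[0].Subject.CommonName)
	}))
	srv.TLS = leaderTLSConfig(mustIssue("leader", 2, &ca), pool)
	srv.StartTLS()
	defer srv.Close()
	okStatus, okErr = get(srv.URL, pool, mustIssue("node-1", 3, &ca))
	_, noCertErr = get(srv.URL, pool)
	_, foreignErr = get(srv.URL, pool, mustIssue("stranger", 2, &otherCA))
	return
}

func main() {
	fmt.Println(demo())
}
```

Note that with `tls.RequireAndVerifyClientCert` applied to the whole listener, a joining agent that has no certificate yet cannot reach `/internal/v1alpha1/join` at all, which is the bootstrap-trust question task 4 leaves open.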
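
The failure-detection rule in task 7 (mark a node `NotReady` once its last heartbeat is older than `nodeLossTimeoutSeconds`) reduces to a timestamp comparison. The following sketch assumes heartbeats have already been read out of etcd into a map of Unix timestamps; `conditionFor`, `lostNodes`, and the `NodeCondition` type are hypothetical names, and no etcd client is involved.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// NodeCondition is the coarse health state the leader records for a node.
type NodeCondition string

const (
	Ready    NodeCondition = "Ready"
	NotReady NodeCondition = "NotReady"
)

// conditionFor returns NotReady when the last heartbeat is strictly older
// than lossTimeout, and Ready otherwise.
func conditionFor(lastHeartbeat, now time.Time, lossTimeout time.Duration) NodeCondition {
	if now.Sub(lastHeartbeat) > lossTimeout {
		return NotReady
	}
	return Ready
}

// lostNodes takes each node's last heartbeat as Unix seconds and returns the
// sorted names of the nodes that should be marked NotReady.
func lostNodes(lastHeartbeatUnix map[string]int64, nowUnix, nodeLossTimeoutSeconds int64) []string {
	now := time.Unix(nowUnix, 0)
	timeout := time.Duration(nodeLossTimeoutSeconds) * time.Second
	var lost []string
	for name, ts := range lastHeartbeatUnix {
		if conditionFor(time.Unix(ts, 0), now, timeout) == NotReady {
			lost = append(lost, name)
		}
	}
	sort.Strings(lost) // map iteration order is random; sort for stable output
	return lost
}

func main() {
	heartbeats := map[string]int64{"node-a": 1000, "node-b": 1055}
	// With now=1060 and a 30s timeout, node-a is 60s stale and node-b is 5s stale.
	fmt.Println(lostNodes(heartbeats, 1060, 30)) // prints [node-a]
}
```

If the heartbeat key were instead attached to an etcd lease, as the task suggests as an option, key expiry would replace this periodic scan and the leader could react to delete events rather than comparing timestamps.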