# **Phase 2: Basic Agent & Node Lifecycle (Init, Join, PKI)**

* **Goal**: Implement the secure registration of a new agent node to an existing leader, including PKI for mTLS, and establish periodic heartbeating for status updates and failure detection.
* **RFC Sections Primarily Used**: 2.3 (Node Communication Protocol), 4.1.1 (Initial Leader Setup - CA), 4.1.2 (Agent Node Join - CSR), 10.1 (API Security - mTLS), 10.6 (Internal PKI), 4.1.3 (Node Heartbeat), 4.1.4 (Node Departure and Failure Detection - basic).

**Tasks & Sub-Tasks:**

1. **Implement Internal PKI Utilities (`internal/pki/ca.go`, `internal/pki/certs.go`)**
    * **Purpose**: Create and manage the Certificate Authority and sign certificates for mTLS.
    * **Details**:
        * `GenerateCA()`: Creates a new RSA key pair and a self-signed X.509 CA certificate, then saves both to disk (e.g., `/var/lib/kat/pki/ca.key`, `/var/lib/kat/pki/ca.crt`). The PKI directory is either the parent directory of `backupPath` from `cluster.kat` or a new dedicated `pkiPath` setting.
        * `GenerateCertificateRequest(commonName, keyOutPath, csrOutPath)`: Used by the agent. Generates an RSA key and creates a CSR.
        * `SignCertificateRequest(caKeyPath, caCertPath, csrData, certOutPath, duration)`: Used by the leader. Loads the CA key/cert, parses the CSR, and issues a signed certificate.
        * Helper functions to load keys and certs from disk.
    * **Potential Challenges**: Handling cryptographic operations correctly and securely. Setting restrictive file permissions on stored private keys.
    * **Verification**: Unit tests for `GenerateCA`, `GenerateCertificateRequest`, `SignCertificateRequest`. Generated certs should be verifiable against the CA.

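
The three PKI operations above can be sketched with the Go standard library alone. This is a minimal in-memory illustration, not the planned `internal/pki` API: the helper names (`newCA`, `newCSR`, `signCSR`, `verifyAgainstCA`, `roundTrip`), the CA common name, the 2048-bit key size, and the validity periods are all assumptions, and the real functions would additionally write PEM files to disk with restrictive permissions.

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newCA creates an RSA key and a self-signed CA certificate (kept in memory).
func newCA() (*x509.Certificate, *rsa.PrivateKey, error) {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "kat-cluster-ca"}, // hypothetical name
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().AddDate(10, 0, 0),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageCRLSign,
	}
	// Template and parent are the same certificate, so the result is self-signed.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// newCSR creates a fresh RSA key and a DER-encoded CSR for commonName.
func newCSR(commonName string) ([]byte, *rsa.PrivateKey, error) {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.CertificateRequest{Subject: pkix.Name{CommonName: commonName}}
	der, err := x509.CreateCertificateRequest(rand.Reader, tmpl, key)
	return der, key, err
}

// signCSR checks the CSR's self-signature and issues a certificate for it.
func signCSR(ca *x509.Certificate, caKey *rsa.PrivateKey, csrDER []byte, validFor time.Duration) (*x509.Certificate, error) {
	csr, err := x509.ParseCertificateRequest(csrDER)
	if err != nil {
		return nil, err
	}
	if err := csr.CheckSignature(); err != nil {
		return nil, err
	}
	// Random serial below 2^127, shifted by one so it is never zero.
	serial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 127))
	if err != nil {
		return nil, err
	}
	serial.Add(serial, big.NewInt(1))
	tmpl := &x509.Certificate{
		SerialNumber: serial,
		Subject:      csr.Subject,
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(validFor),
		KeyUsage:     x509.KeyUsageDigitalSignature | x509.KeyUsageKeyEncipherment,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth, x509.ExtKeyUsageServerAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, csr.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// verifyAgainstCA reports whether leaf chains to ca for client authentication.
func verifyAgainstCA(ca, leaf *x509.Certificate) error {
	pool := x509.NewCertPool()
	pool.AddCert(ca)
	_, err := leaf.Verify(x509.VerifyOptions{Roots: pool, KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth}})
	return err
}

// roundTrip runs CA creation, CSR creation, signing, and verification in order.
func roundTrip() error {
	ca, caKey, err := newCA()
	if err != nil {
		return err
	}
	csrDER, _, err := newCSR("node-1")
	if err != nil {
		return err
	}
	leaf, err := signCSR(ca, caKey, csrDER, 24*time.Hour)
	if err != nil {
		return err
	}
	return verifyAgainstCA(ca, leaf)
}

func main() {
	fmt.Println("verification error:", roundTrip())
}
```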
2. **Leader: Initialize CA & Its Own mTLS Certs on `init` (`cmd/kat-agent/main.go`)**
    * **Purpose**: The first leader needs to establish the PKI and secure its own API endpoint.
    * **Details**:
        * During `kat-agent init`, after etcd is up and leadership is confirmed:
            * Call `pki.GenerateCA()` if CA files don't exist.
            * Generate its own server key and CSR (e.g., for `leader.kat.cluster.local`).
            * Sign its own CSR using the CA to get its server certificate.
            * Configure its (future) API HTTP server to use these server key/cert for TLS and require client certs (mTLS).
    * **Verification**: After `kat-agent init`, CA key/cert and leader's server key/cert exist in the configured PKI path.

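
The "generate only if the CA files don't exist" step is easy to get subtly wrong, for example by regenerating a CA on restart and invalidating every issued certificate. One possible shape for that guard is sketched below. The helper `writeIfAbsent` and the demo function are hypothetical, the key material is a placeholder string rather than a real key, and the `0600`/`0700` modes are an assumption about what "restrictive permissions" should mean here.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// writeIfAbsent creates path with owner-only permissions unless it already
// exists. generate is only called when the file is missing. The first return
// value reports whether a new file was written.
func writeIfAbsent(path string, generate func() ([]byte, error)) (bool, error) {
	if _, err := os.Stat(path); err == nil {
		return false, nil // already initialized; keep the existing material
	} else if !errors.Is(err, fs.ErrNotExist) {
		return false, err
	}
	data, err := generate()
	if err != nil {
		return false, err
	}
	if err := os.MkdirAll(filepath.Dir(path), 0o700); err != nil {
		return false, err
	}
	// O_EXCL makes creation fail if another process created the file meanwhile.
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o600)
	if err != nil {
		return false, err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return false, err
	}
	return true, f.Close()
}

// demoInit runs the guarded step twice against a temporary directory, the way
// two consecutive `kat-agent init` runs would hit the same PKI path.
func demoInit() (first, second bool, err error) {
	dir, err := os.MkdirTemp("", "kat-pki-")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	caKey := filepath.Join(dir, "pki", "ca.key")
	gen := func() ([]byte, error) { return []byte("placeholder key material\n"), nil }

	if first, err = writeIfAbsent(caKey, gen); err != nil {
		return first, false, err
	}
	second, err = writeIfAbsent(caKey, gen)
	return first, second, err
}

func main() {
	first, second, err := demoInit()
	fmt.Println(first, second, err) // prints: true false <nil>
}
```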
3. **Implement Basic API Server with mTLS on Leader (`internal/api/server.go`, `internal/api/router.go`)**
    * **Purpose**: Provide the initial HTTP endpoints required for agent join, secured with mTLS.
    * **Details**:
        * Set up `http.Server` with `tls.Config`:
            * `Certificates`: Leader's server key/cert.
            * `ClientAuth: tls.RequireAndVerifyClientCert`.
            * `ClientCAs`: Pool containing the cluster CA certificate.
        * Minimal router (e.g., `gorilla/mux` or `http.ServeMux`) for:
            * `POST /internal/v1alpha1/join`: Endpoint for the agent to submit its CSR. (Internal, as it is part of bootstrap.)
    * **Verification**: An HTTPS client (e.g., `curl` with appropriate client certs) can connect to the leader's API port if it presents a cert signed by the cluster CA. Connection fails without a client cert or with a cert from a different CA.

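
The `tls.Config` settings listed above, and the verification criterion that follows them, can be exercised end to end in a single process. The sketch below is an illustration under several assumptions: it uses `httptest` on loopback instead of the real API server, the helpers `issue`, `mtlsServerConfig`, and `demo` are hypothetical, and one leaf certificate is reused as both server and client identity purely to keep the example short.

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// issue creates a certificate for cn. With parent == nil it is a self-signed
// CA; otherwise it is a leaf signed by parent/parentKey.
func issue(cn string, serial int64, parent *x509.Certificate, parentKey *rsa.PrivateKey) (*x509.Certificate, *rsa.PrivateKey, error) {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(serial),
		Subject:      pkix.Name{CommonName: cn},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
	}
	if parent == nil {
		tmpl.IsCA, tmpl.BasicConstraintsValid = true, true
		tmpl.KeyUsage |= x509.KeyUsageCertSign
		parent, parentKey = tmpl, key
	} else {
		tmpl.ExtKeyUsage = []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth}
		tmpl.IPAddresses = []net.IP{net.IPv4(127, 0, 0, 1), net.IPv6loopback}
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, &key.PublicKey, parentKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// mtlsServerConfig mirrors the three tls.Config fields named in the task.
func mtlsServerConfig(serverCert tls.Certificate, ca *x509.Certificate) *tls.Config {
	pool := x509.NewCertPool()
	pool.AddCert(ca)
	return &tls.Config{
		Certificates: []tls.Certificate{serverCert},
		ClientAuth:   tls.RequireAndVerifyClientCert,
		ClientCAs:    pool,
	}
}

// demo returns the HTTP status seen by a client that presents a CA-signed
// certificate, and whether a client without any certificate was rejected.
func demo() (status int, rejected bool, err error) {
	ca, caKey, err := issue("kat-ca", 1, nil, nil)
	if err != nil {
		return 0, false, err
	}
	leaf, leafKey, err := issue("node-1", 2, ca, caKey)
	if err != nil {
		return 0, false, err
	}
	pair := tls.Certificate{Certificate: [][]byte{leaf.Raw}, PrivateKey: leafKey}

	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello", r.TLS.PeerCertificates[0].Subject.CommonName)
	}))
	srv.TLS = mtlsServerConfig(pair, ca)
	srv.StartTLS()
	defer srv.Close()

	roots := x509.NewCertPool()
	roots.AddCert(ca)
	newClient := func(certs ...tls.Certificate) *http.Client {
		cfg := &tls.Config{RootCAs: roots, Certificates: certs}
		return &http.Client{Transport: &http.Transport{TLSClientConfig: cfg}}
	}

	resp, err := newClient(pair).Get(srv.URL)
	if err != nil {
		return 0, false, err
	}
	resp.Body.Close()

	_, noCertErr := newClient().Get(srv.URL) // no client certificate presented
	return resp.StatusCode, noCertErr != nil, nil
}

func main() {
	status, rejected, err := demo()
	fmt.Println(status, rejected, err)
}
```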
4. **Agent: `join` Command & CSR Submission (`cmd/kat-agent/main.go`, `internal/cli/join.go` - or similar for agent logic)**
    * **Purpose**: Allow a new agent to request to join the cluster and obtain its mTLS credentials.
    * **Details**:
        * `kat-agent join` subcommand:
            * Flags: `--leader-api <ip:port>`, `--advertise-address <ip_or_interface_name>`, `--node-name <name>` (optional, leader can generate).
        * Generate its own key pair and CSR using `pki.GenerateCertificateRequest()`.
        * Make an HTTP POST to Leader's `/internal/v1alpha1/join` endpoint:
            * Payload: CSR data, advertise address, requested node name, initial WireGuard public key (placeholder for now).
            * Bootstrap trust: on this *initial* join the agent does not yet hold a signed certificate, so it cannot present one for mTLS. The RFC implies mTLS is mandatory, which leaves two open questions. First, how the agent trusts the leader: via an out-of-band copy of the CA cert, a `--leader-ca-cert` flag, or by trusting the leader's self-signed cert (if the leader has one before the CA is used for client auth). Second, how the leader authorizes the CSR before the agent has its own signed cert: possibly a pre-shared token. RFC 4.1.2 states "Leader, upon validating the join request (V1 has no strong token validation, relies on network trust)". This needs clarification. *Assume network trust for now: agent connects, sends CSR, leader signs.*
        * Receive signed certificate and CA certificate from Leader. Store them locally.
    * **Potential Challenges**: Securely bootstrapping trust for the very first communication to the leader to submit the CSR.
    * **Verification**: `kat-agent join` command:
        * Generates key/CSR.
        * Successfully POSTs CSR to leader.
        * Receives and saves its signed certificate and the CA certificate.

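
As a concrete picture of the join payload, the sketch below builds (but does not send) the POST request an agent might issue. The plan does not define a wire format, so the `JoinRequest` struct, its JSON field names, the PEM encoding of the CSR, and the `buildJoinRequest` helper are all assumptions; only the endpoint path and the list of payload items come from the task description. The example address and port are made up.

```go
package main

import (
	"bytes"
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/json"
	"encoding/pem"
	"fmt"
	"net/http"
)

// JoinRequest is a hypothetical wire format for POST /internal/v1alpha1/join.
type JoinRequest struct {
	CSRPEM           string `json:"csrPem"`
	AdvertiseAddress string `json:"advertiseAddress"`
	NodeName         string `json:"nodeName,omitempty"` // optional: leader may generate one
	WireGuardPubKey  string `json:"wireguardPubKey"`    // placeholder in this phase
}

// buildJoinRequest generates a key and CSR and wraps them in an HTTP request.
// A real agent would also persist the private key before sending anything.
func buildJoinRequest(leaderAPI, advertiseAddr, nodeName string) (*http.Request, error) {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, err
	}
	csrTmpl := &x509.CertificateRequest{Subject: pkix.Name{CommonName: nodeName}}
	csrDER, err := x509.CreateCertificateRequest(rand.Reader, csrTmpl, key)
	if err != nil {
		return nil, err
	}
	csrPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE REQUEST", Bytes: csrDER})

	body, err := json.Marshal(JoinRequest{
		CSRPEM:           string(csrPEM),
		AdvertiseAddress: advertiseAddr,
		NodeName:         nodeName,
		WireGuardPubKey:  "placeholder",
	})
	if err != nil {
		return nil, err
	}

	url := "https://" + leaderAPI + "/internal/v1alpha1/join"
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := buildJoinRequest("10.0.0.1:9116", "10.0.0.2", "node-1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.Path)
}
```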
5. **Leader: CSR Signing & Node Registration (Handler for `/internal/v1alpha1/join`)**
    * **Purpose**: Validate joining agent, sign its CSR, and record its registration.
    * **Details**:
        * Handler for `/internal/v1alpha1/join`:
            * Parse CSR, advertise address, WG pubkey from request.
            * Validate (minimal for now).
            * Generate a unique Node Name if not provided. Assign a Node UID.
            * Sign the CSR using `pki.SignCertificateRequest()`.
            * Store Node registration data in etcd via `StateStore` (`/kat/nodes/registration/{nodeName}`: UID, advertise address, WG pubkey placeholder, join timestamp).
            * Return the signed agent certificate and the cluster CA certificate to the agent.
    * **Verification**:
        * After agent joins, its certificate is signed by the cluster CA.
        * Node registration data appears correctly in etcd under `/kat/nodes/registration/{nodeName}`.

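
The naming and registration part of the handler can be sketched separately from CSR signing. In the illustration below a plain map stands in for the etcd-backed `StateStore`, and everything except the key prefix and the list of stored fields is an assumption: the `NodeRegistration` struct, the `node-<hex>` name format, the 128-bit random UID, and the choice to reject a name that is already registered.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"time"
)

// NodeRegistration is a hypothetical record stored per joined node.
type NodeRegistration struct {
	UID              string
	NodeName         string
	AdvertiseAddress string
	WireGuardPubKey  string
	JoinedAt         time.Time
}

const registrationPrefix = "/kat/nodes/registration/"

// randomHex returns n random bytes encoded as 2n hex characters.
func randomHex(n int) (string, error) {
	buf := make([]byte, n)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// registerNode performs minimal validation, fills in a generated name if none
// was requested, assigns a UID, and writes the record under the registration
// prefix. store is an in-memory stand-in for the StateStore.
func registerNode(store map[string]NodeRegistration, nodeName, advertiseAddr, wgPubKey string) (string, NodeRegistration, error) {
	if advertiseAddr == "" {
		return "", NodeRegistration{}, errors.New("advertise address is required")
	}
	if nodeName == "" {
		suffix, err := randomHex(4)
		if err != nil {
			return "", NodeRegistration{}, err
		}
		nodeName = "node-" + suffix
	}
	key := registrationPrefix + nodeName
	if _, exists := store[key]; exists {
		return "", NodeRegistration{}, fmt.Errorf("node %q is already registered", nodeName)
	}
	uid, err := randomHex(16)
	if err != nil {
		return "", NodeRegistration{}, err
	}
	reg := NodeRegistration{
		UID:              uid,
		NodeName:         nodeName,
		AdvertiseAddress: advertiseAddr,
		WireGuardPubKey:  wgPubKey,
		JoinedAt:         time.Now().UTC(),
	}
	store[key] = reg
	return key, reg, nil
}

func main() {
	store := map[string]NodeRegistration{}
	key, reg, err := registerNode(store, "", "10.0.0.2", "placeholder")
	fmt.Println(key, reg.UID, err)
}
```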
6. **Agent: Establish mTLS Client for Subsequent Comms & Implement Heartbeating (`internal/agent/agent.go`)**
    * **Purpose**: Agent uses its new mTLS certs to communicate status to the Leader.
    * **Details**:
        * Agent configures its HTTP client to use its signed key/cert and the cluster CA cert for all future Leader communications.
        * Periodic Heartbeat (RFC 4.1.3):
            * Ticker (e.g., every `agentTickSeconds` from `cluster.kat`, default 15s).
            * On tick, gather basic node status (node name, timestamp, initial resource capacity stubs).
            * HTTP `POST` to Leader's `/v1alpha1/nodes/{nodeName}/status` endpoint using the mTLS-configured client.
    * **Verification**: Agent logs successful heartbeat POSTs.

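
A ticker-driven heartbeat loop might look like the sketch below. The transport is abstracted behind a `send` callback so the loop can run without a leader; in the agent that callback would POST to the status endpoint with the mTLS client. The names `NodeStatus`, `runHeartbeats`, and `demoHeartbeats` are hypothetical, and the 10 ms interval exists only to keep the demo fast (the plan's default is 15 s). Treating a failed send as non-fatal and retrying on the next tick is also an assumption.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// NodeStatus is a hypothetical minimal heartbeat payload.
type NodeStatus struct {
	NodeName  string
	Timestamp time.Time
}

// runHeartbeats calls send once per tick until ctx is cancelled. A failed
// send is logged and retried on the next tick rather than ending the loop.
func runHeartbeats(ctx context.Context, nodeName string, interval time.Duration, send func(NodeStatus) error) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case now := <-ticker.C:
			if err := send(NodeStatus{NodeName: nodeName, Timestamp: now}); err != nil {
				fmt.Println("heartbeat failed, retrying on next tick:", err)
			}
		}
	}
}

// demoHeartbeats runs the loop with a short interval and stops it after at
// least n heartbeats have been sent. It returns the number actually sent.
func demoHeartbeats(n int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	sent := 0
	_ = runHeartbeats(ctx, "node-1", 10*time.Millisecond, func(s NodeStatus) error {
		sent++
		if sent >= n {
			cancel() // stop after the requested number of heartbeats
		}
		return nil
	})
	return sent
}

func main() {
	fmt.Println("heartbeats sent:", demoHeartbeats(3))
}
```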
7. **Leader: Receive Heartbeats & Basic Failure Detection (Handler for `/v1alpha1/nodes/{nodeName}/status`, `internal/leader/leader.go`)**
    * **Purpose**: Leader tracks agent status and detects failures.
    * **Details**:
        * API endpoint `/v1alpha1/nodes/{nodeName}/status` (mTLS required):
            * Receives status update from agent.
            * Updates node's actual state in etcd (`/kat/nodes/status/{nodeName}/heartbeat`: timestamp, reported status). Could use an etcd lease for this key, renewed by agent heartbeats.
        * Failure Detection (RFC 4.1.4):
            * Leader has a reconciliation loop or periodic check.
            * Scans `/kat/nodes/status/` in etcd.
            * If a node's last heartbeat timestamp is older than `nodeLossTimeoutSeconds` (from `cluster.kat`), update its status in etcd to `NotReady` (e.g., `/kat/nodes/status/{nodeName}/condition: NotReady`).
    * **Potential Challenges**: Efficiently scanning for dead nodes without excessive etcd load.

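
The timeout comparison at the heart of failure detection is small enough to isolate and unit-test on its own. The sketch below assumes the leader has already read the last heartbeat timestamps into a map; the function names, the sample node names, and the 60-second timeout are illustrative values, not settings taken from `cluster.kat`.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// staleNodes returns, in sorted order, the nodes whose last heartbeat is
// older than timeout relative to now. These are the NotReady candidates.
func staleNodes(lastHeartbeat map[string]time.Time, now time.Time, timeout time.Duration) []string {
	var stale []string
	for name, seen := range lastHeartbeat {
		if now.Sub(seen) > timeout {
			stale = append(stale, name)
		}
	}
	sort.Strings(stale)
	return stale
}

// demoStale evaluates two nodes against a 60-second timeout using a fixed
// clock so the result does not depend on when the example is run.
func demoStale() []string {
	now := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	heartbeats := map[string]time.Time{
		"node-a": now.Add(-10 * time.Second), // recent heartbeat
		"node-b": now.Add(-90 * time.Second), // older than the timeout
	}
	return staleNodes(heartbeats, now, 60*time.Second)
}

func main() {
	fmt.Println(demoStale()) // prints: [node-b]
}
```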
* **Milestone Verification**:
    * `kat-agent init` runs as Leader, CA created, its API is up with mTLS.
    * A second `kat-agent join ...` process successfully:
        * Generates CSR, gets it signed by Leader.
        * Saves its cert and CA cert.
        * Starts sending heartbeats to Leader using mTLS.
    * Leader logs receipt of heartbeats from the joined Agent.
    * Node status (last heartbeat time) is updated in etcd by the Leader.
    * If the joined Agent process is stopped, after `nodeLossTimeoutSeconds`, the Leader updates the node's status in etcd to `NotReady`. This can be verified using `etcdctl` or a `StateStore.Get` call.