# **Phase 2: Basic Agent & Node Lifecycle (Init, Join, PKI)**

*   **Goal**: Implement the secure registration of a new agent node to an existing leader, including PKI for mTLS, and establish periodic heartbeating for status updates and failure detection.
*   **RFC Sections Primarily Used**: 2.3 (Node Communication Protocol), 4.1.1 (Initial Leader Setup - CA), 4.1.2 (Agent Node Join - CSR), 10.1 (API Security - mTLS), 10.6 (Internal PKI), 4.1.3 (Node Heartbeat), 4.1.4 (Node Departure and Failure Detection - basic).

**Tasks & Sub-Tasks:**

1.  **Implement Internal PKI Utilities (`internal/pki/ca.go`, `internal/pki/certs.go`)**
    *   **Purpose**: Create and manage the Certificate Authority and sign certificates for mTLS.
    *   **Details**:
        *   `GenerateCA()`: Creates a new RSA key pair and a self-signed X.509 CA certificate, then saves both to disk (e.g., `/var/lib/kat/pki/ca.key`, `/var/lib/kat/pki/ca.crt`). The PKI directory is either derived from the parent directory of `backupPath` in `cluster.kat` or set by a new `pkiPath` option.
        *   `GenerateCertificateRequest(commonName, keyOutPath, csrOutPath)`: Used by the agent. Generates an RSA key and creates a CSR for `commonName`.
        *   `SignCertificateRequest(caKeyPath, caCertPath, csrData, certOutPath, duration)`: Leader uses this. Loads CA key/cert, parses CSR, issues a signed certificate.
        *   Helper functions to load keys and certs from disk.
    *   **Potential Challenges**: Handling cryptographic operations correctly and securely. Restricting file permissions on stored private keys.
    *   **Verification**: Unit tests for `GenerateCA`, `GenerateCertificateRequest`, `SignCertificateRequest`. Generated certs should be verifiable against the CA.

2.  **Leader: Initialize CA & Its Own mTLS Certs on `init` (`cmd/kat-agent/main.go`)**
    *   **Purpose**: The first leader needs to establish the PKI and secure its own API endpoint.
    *   **Details**:
        *   During `kat-agent init`, after etcd is up and leadership is confirmed:
            *   Call `pki.GenerateCA()` if CA files don't exist.
            *   Generate its own server key and CSR (e.g., for `leader.kat.cluster.local`).
            *   Sign its own CSR using the CA to get its server certificate.
            *   Configure its (future) API HTTP server to use this server key/cert for TLS and to require client certs (mTLS).
    *   **Verification**: After `kat-agent init`, CA key/cert and leader's server key/cert exist in the configured PKI path.

3.  **Implement Basic API Server with mTLS on Leader (`internal/api/server.go`, `internal/api/router.go`)**
    *   **Purpose**: Provide the initial HTTP endpoints required for agent join, secured with mTLS.
    *   **Details**:
        *   Set up `http.Server` with `tls.Config`:
            *   `Certificates`: Leader's server key/cert.
            *   `ClientAuth: tls.RequireAndVerifyClientCert`.
            *   `ClientCAs`: Pool containing the cluster CA certificate.
        *   Minimal router (e.g., `gorilla/mux` or `http.ServeMux`) for:
            *   `POST /internal/v1alpha1/join`: Endpoint for agent to submit CSR. (Internal as it's part of bootstrap).
    *   **Verification**: An HTTPS client (e.g., `curl` with appropriate client certs) can connect to the leader's API port if it presents a cert signed by the cluster CA. Connection fails without a client cert or with a cert from a different CA.

4.  **Agent: `join` Command & CSR Submission (`cmd/kat-agent/main.go`, `internal/cli/join.go` - or similar for agent logic)**
    *   **Purpose**: Allow a new agent to request to join the cluster and obtain its mTLS credentials.
    *   **Details**:
        *   `kat-agent join` subcommand:
            *   Flags: `--leader-api <ip:port>`, `--advertise-address <ip_or_interface_name>`, `--node-name <name>` (optional, leader can generate).
            *   Generate its own key pair and CSR using `pki.GenerateCertificateRequest()`.
            *   Make an HTTP POST to Leader's `/internal/v1alpha1/join` endpoint:
                *   Payload: CSR data, advertise address, requested node name, initial WireGuard public key (placeholder for now).
                *   Trust bootstrap (open question): at this point the agent has no signed client certificate, so it cannot complete an mTLS handshake with the leader. The RFC implies mTLS is mandatory, which leaves two gaps:
                    *   Agent trusting the leader: the agent needs the cluster CA certificate, obtained through an out-of-band mechanism or a `--leader-ca-cert` flag. Alternatively, for simplicity in V1, the initial join POST might happen over HTTPS where the agent trusts the leader's self-signed cert (if the leader has one before the CA is used for client auth).
                    *   Leader trusting the agent: the leader could authorize CSR signing with a pre-shared token until the agent has its own signed cert.
                    *   RFC 4.1.2 states "Leader, upon validating the join request (V1 has no strong token validation, relies on network trust)". This needs clarification.
                    *   *Assume network trust for now: agent connects, sends CSR, leader signs.*
            *   Receive signed certificate and CA certificate from Leader. Store them locally.
    *   **Potential Challenges**: Securely bootstrapping trust for the very first communication to the leader to submit the CSR.
    *   **Verification**: `kat-agent join` command:
        *   Generates key/CSR.
        *   Successfully POSTs CSR to leader.
        *   Receives and saves its signed certificate and the CA certificate.

5.  **Leader: CSR Signing & Node Registration (Handler for `/internal/v1alpha1/join`)**
    *   **Purpose**: Validate joining agent, sign its CSR, and record its registration.
    *   **Details**:
        *   Handler for `/internal/v1alpha1/join`:
            *   Parse CSR, advertise address, WG pubkey from request.
            *   Validate the request (minimal for now, given the network-trust assumption).
            *   Generate a unique Node Name if not provided. Assign a Node UID.
            *   Sign the CSR using `pki.SignCertificateRequest()`.
            *   Store Node registration data in etcd via `StateStore` (`/kat/nodes/registration/{nodeName}`: UID, advertise address, WG pubkey placeholder, join timestamp).
            *   Return the signed agent certificate and the cluster CA certificate to the agent.
    *   **Verification**:
        *   After agent joins, its certificate is signed by the cluster CA.
        *   Node registration data appears correctly in etcd under `/kat/nodes/registration/{nodeName}`.

6.  **Agent: Establish mTLS Client for Subsequent Comms & Implement Heartbeating (`internal/agent/agent.go`)**
    *   **Purpose**: Agent uses its new mTLS certs to communicate status to the Leader.
    *   **Details**:
        *   Agent configures its HTTP client to use its signed key/cert and the cluster CA cert for all future Leader communications.
        *   Periodic Heartbeat (RFC 4.1.3):
            *   Ticker (e.g., every `agentTickSeconds` from `cluster.kat`, default 15s).
            *   On tick, gather basic node status (node name, timestamp, initial resource capacity stubs).
            *   HTTP `POST` to Leader's `/v1alpha1/nodes/{nodeName}/status` endpoint using the mTLS-configured client.
    *   **Verification**: Agent logs successful heartbeat POSTs.

7.  **Leader: Receive Heartbeats & Basic Failure Detection (Handler for `/v1alpha1/nodes/{nodeName}/status`, `internal/leader/leader.go`)**
    *   **Purpose**: Leader tracks agent status and detects failures.
    *   **Details**:
        *   API endpoint `/v1alpha1/nodes/{nodeName}/status` (mTLS required):
            *   Receives status update from agent.
            *   Updates the node's actual state in etcd (`/kat/nodes/status/{nodeName}/heartbeat`: timestamp, reported status). This key could be attached to an etcd lease that is renewed on each heartbeat.
        *   Failure Detection (RFC 4.1.4):
            *   Leader has a reconciliation loop or periodic check.
            *   Scans `/kat/nodes/status/` in etcd.
            *   If a node's last heartbeat timestamp is older than `nodeLossTimeoutSeconds` (from `cluster.kat`), update its status in etcd to `NotReady` (e.g., `/kat/nodes/status/{nodeName}/condition: NotReady`).
    *   **Potential Challenges**: Efficiently scanning for dead nodes without excessive etcd load.
    *   **Milestone Verification**:
        *   `kat-agent init` runs as Leader, creates the CA, and brings up its API with mTLS.
        *   A second `kat-agent join ...` process successfully:
            *   Generates CSR, gets it signed by Leader.
            *   Saves its cert and CA cert.
            *   Starts sending heartbeats to Leader using mTLS.
        *   Leader logs receipt of heartbeats from the joined Agent.
        *   Node status (last heartbeat time) is updated in etcd by the Leader.
        *   If the joined Agent process is stopped, after `nodeLossTimeoutSeconds`, the Leader updates the node's status in etcd to `NotReady`. This can be verified using `etcdctl` or a `StateStore.Get` call.