Phase 2: Basic Agent & Node Lifecycle (Init, Join, PKI)
- Goal: Implement the secure registration of a new agent node to an existing leader, including PKI for mTLS, and establish periodic heartbeating for status updates and failure detection.
- RFC Sections Primarily Used: 2.3 (Node Communication Protocol), 4.1.1 (Initial Leader Setup - CA), 4.1.2 (Agent Node Join - CSR), 10.1 (API Security - mTLS), 10.6 (Internal PKI), 4.1.3 (Node Heartbeat), 4.1.4 (Node Departure and Failure Detection - basic).
Tasks & Sub-Tasks:
- Implement Internal PKI Utilities (`internal/pki/ca.go`, `internal/pki/certs.go`)
  - Purpose: Create and manage the Certificate Authority and sign certificates for mTLS.
  - Details:
    - `GenerateCA()`: Creates a new RSA key pair and a self-signed X.509 CA certificate. Saves to disk (e.g., `/var/lib/kat/pki/ca.key`, `/var/lib/kat/pki/ca.crt`). Path from `cluster.kat` `backupPath` parent dir, or a new `pkiPath`.
    - `GenerateCertificateRequest(commonName, keyOutPath, csrOutPath)`: Agent uses this. Generates RSA key, creates a CSR.
    - `SignCertificateRequest(caKeyPath, caCertPath, csrData, certOutPath, duration)`: Leader uses this. Loads CA key/cert, parses CSR, issues a signed certificate.
    - Helper functions to load keys and certs from disk.
  - Potential Challenges: Handling cryptographic operations correctly and securely. Permissions for key storage.
  - Verification: Unit tests for `GenerateCA`, `GenerateCertificateRequest`, `SignCertificateRequest`. Generated certs should be verifiable against the CA.
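The three PKI operations can be sketched in memory with Go's standard library. This is a minimal sketch under stated assumptions, not the project's implementation: the names `newCA`, `newCSR`, `signCSR`, and `verifyAgainstCA` are hypothetical, the 2048-bit key size and key-usage flags are illustrative choices, and the disk paths and PEM encoding described in the task are left out.

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"math/big"
	"time"
)

// newCA creates an RSA key and a self-signed CA certificate.
func newCA(commonName string, validFor time.Duration) (*rsa.PrivateKey, *x509.Certificate, error) {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: commonName},
		NotBefore:             time.Now().Add(-time.Minute), // tolerate small clock skew
		NotAfter:              time.Now().Add(validFor),
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageCRLSign,
		BasicConstraintsValid: true,
		IsCA:                  true,
	}
	// Template and parent are the same certificate, so the result is self-signed.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return key, cert, err
}

// newCSR creates an RSA key and a DER-encoded CSR for commonName.
func newCSR(commonName string) (*rsa.PrivateKey, []byte, error) {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.CertificateRequest{Subject: pkix.Name{CommonName: commonName}}
	der, err := x509.CreateCertificateRequest(rand.Reader, tmpl, key)
	return key, der, err
}

// signCSR checks the CSR's self-signature and issues a leaf certificate.
func signCSR(caKey *rsa.PrivateKey, caCert *x509.Certificate, csrDER []byte, validFor time.Duration) (*x509.Certificate, error) {
	csr, err := x509.ParseCertificateRequest(csrDER)
	if err != nil {
		return nil, err
	}
	if err := csr.CheckSignature(); err != nil {
		return nil, err
	}
	serial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 128))
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: serial,
		Subject:      csr.Subject,
		DNSNames:     csr.DNSNames,    // copied unchecked in this sketch
		IPAddresses:  csr.IPAddresses, // copied unchecked in this sketch
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(validFor),
		KeyUsage:     x509.KeyUsageDigitalSignature | x509.KeyUsageKeyEncipherment,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth, x509.ExtKeyUsageServerAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, caCert, csr.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// verifyAgainstCA returns nil if cert chains to caCert for client authentication.
func verifyAgainstCA(cert, caCert *x509.Certificate) error {
	pool := x509.NewCertPool()
	pool.AddCert(caCert)
	_, err := cert.Verify(x509.VerifyOptions{
		Roots:     pool,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	})
	return err
}
```

A unit test along the lines of the Verification bullet could call `newCA`, `newCSR`, and `signCSR` in sequence and assert that `verifyAgainstCA` succeeds for the issuing CA and fails for an unrelated one. A server certificate for the leader would also need SAN entries (`DNSNames` or `IPAddresses`), because Go's verifier does not match hostnames against the Common Name.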
- Leader: Initialize CA & Its Own mTLS Certs on `init` (`cmd/kat-agent/main.go`)
  - Purpose: The first leader needs to establish the PKI and secure its own API endpoint.
  - Details:
    - During `kat-agent init`, after etcd is up and leadership is confirmed:
      - Call `pki.GenerateCA()` if CA files don't exist.
      - Generate its own server key and CSR (e.g., for `leader.kat.cluster.local`).
      - Sign its own CSR using the CA to get its server certificate.
      - Configure its (future) API HTTP server to use these server key/cert for TLS and require client certs (mTLS).
  - Verification: After `kat-agent init`, CA key/cert and leader's server key/cert exist in the configured PKI path.
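The "generate only if the CA files don't exist" step, together with the key-permission concern raised earlier, might look like the following. It is a sketch under assumptions: `caExists` and `writePEM` are hypothetical helpers, the file names `ca.key` and `ca.crt` follow the example paths given for `GenerateCA()`, and the permission modes are a suggestion rather than something the RFC specifies.

```go
package main

import (
	"encoding/pem"
	"errors"
	"io/fs"
	"os"
	"path/filepath"
)

// caExists reports whether both ca.key and ca.crt are present in pkiDir.
func caExists(pkiDir string) (bool, error) {
	for _, name := range []string{"ca.key", "ca.crt"} {
		_, err := os.Stat(filepath.Join(pkiDir, name))
		if errors.Is(err, fs.ErrNotExist) {
			return false, nil
		}
		if err != nil {
			return false, err
		}
	}
	return true, nil
}

// writePEM writes a single PEM block to path with the given permissions.
// O_EXCL makes the call fail instead of overwriting an existing key or cert.
func writePEM(path, blockType string, der []byte, perm fs.FileMode) error {
	if err := os.MkdirAll(filepath.Dir(path), 0o700); err != nil {
		return err
	}
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_EXCL, perm)
	if err != nil {
		return err
	}
	if err := pem.Encode(f, &pem.Block{Type: blockType, Bytes: der}); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}
```

During `init`, the leader would call `caExists` first and only generate and write a new CA when it returns false, using a mode such as `0o600` for private keys and `0o644` for certificates. Refusing to overwrite keeps a repeated `init` from silently replacing a CA that agents already trust.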
- Implement Basic API Server with mTLS on Leader (`internal/api/server.go`, `internal/api/router.go`)
  - Purpose: Provide the initial HTTP endpoints required for agent join, secured with mTLS.
  - Details:
    - Set up `http.Server` with `tls.Config`:
      - `Certificates`: Leader's server key/cert.
      - `ClientAuth: tls.RequireAndVerifyClientCert`.
      - `ClientCAs`: Pool containing the cluster CA certificate.
    - Minimal router (e.g., `gorilla/mux` or `http.ServeMux`) for:
      - `POST /internal/v1alpha1/join`: Endpoint for agent to submit CSR. (Internal as it's part of bootstrap).
  - Verification: An HTTPS client (e.g., `curl` with appropriate client certs) can connect to the leader's API port if it presents a cert signed by the cluster CA. The connection fails without a client cert or with a cert from a different CA.
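A minimal sketch of that server setup using only the standard library follows. The function names `newMTLSConfig` and `newAPIServer` are hypothetical, and the TLS 1.2 minimum is an assumption. With `RequireAndVerifyClientCert` applied to the whole listener, an agent that does not yet hold a certificate cannot reach the join endpoint, so how the bootstrap request is admitted stays an open question here, as it does in the plan.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"net/http"
	"os"
)

// newMTLSConfig builds a server-side tls.Config that only admits clients
// presenting a certificate signed by the CA stored at caCertPath.
func newMTLSConfig(certPath, keyPath, caCertPath string) (*tls.Config, error) {
	serverCert, err := tls.LoadX509KeyPair(certPath, keyPath)
	if err != nil {
		return nil, err
	}
	caPEM, err := os.ReadFile(caCertPath)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no CA certificates found in " + caCertPath)
	}
	return &tls.Config{
		Certificates: []tls.Certificate{serverCert},
		ClientAuth:   tls.RequireAndVerifyClientCert,
		ClientCAs:    pool,
		MinVersion:   tls.VersionTLS12, // assumption, not stated in the plan
	}, nil
}

// newAPIServer wires the TLS config into an http.Server with a minimal router.
func newAPIServer(addr string, tlsCfg *tls.Config, join http.HandlerFunc) *http.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/internal/v1alpha1/join", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		join(w, r)
	})
	return &http.Server{Addr: addr, Handler: mux, TLSConfig: tlsCfg}
}
```

Because the certificate is already in `TLSConfig`, the server can be started with `srv.ListenAndServeTLS("", "")`.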
- Agent: `join` Command & CSR Submission (`cmd/kat-agent/main.go`, `internal/cli/join.go` - or similar for agent logic)
  - Purpose: Allow a new agent to request to join the cluster and obtain its mTLS credentials.
  - Details:
    - `kat-agent join` subcommand:
      - Flags: `--leader-api <ip:port>`, `--advertise-address <ip_or_interface_name>`, `--node-name <name>` (optional, leader can generate).
      - Generate its own key pair and CSR using `pki.GenerateCertificateRequest()`.
      - Make an HTTP POST to Leader's `/internal/v1alpha1/join` endpoint:
        - Payload: CSR data, advertise address, requested node name, initial WireGuard public key (placeholder for now).
        - Trust for this initial join is an open question. The RFC implies mTLS is mandatory, so the agent needs the CA cert to trust the leader (obtained out of band or via a `--leader-ca-cert` flag), while the leader has to accept a CSR from an agent that does not yet hold a signed cert, perhaps authorized by a pre-shared token. For simplicity in V1, the initial join POST might instead happen over HTTPS where the agent trusts the leader's self-signed cert (if the leader has one before the CA is used for client auth), or a pre-shared token authorizes the CSR signing. RFC 4.1.2 states "Leader, upon validating the join request (V1 has no strong token validation, relies on network trust)". This needs clarification. Assume network trust for now: agent connects, sends CSR, leader signs.
      - Receive signed certificate and CA certificate from Leader. Store them locally.
  - Potential Challenges: Securely bootstrapping trust for the very first communication to the leader to submit the CSR.
  - Verification: `kat-agent join` command:
    - Generates key/CSR.
    - Successfully POSTs CSR to leader.
    - Receives and saves its signed certificate and the CA certificate.
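The join exchange could be modelled as a small JSON request and response. Everything about the wire format below is an assumption made for illustration: the plan lists which fields travel in the payload but not their names or encoding, so the JSON field names, the types `JoinRequest` and `JoinResponse`, and the helpers `joinURL`, `decodeJoinResponse`, and `submitJoin` are all hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
)

// JoinRequest is a hypothetical payload for POST /internal/v1alpha1/join.
type JoinRequest struct {
	CSRPEM           string `json:"csrPem"`
	AdvertiseAddress string `json:"advertiseAddress"`
	NodeName         string `json:"nodeName,omitempty"` // optional, leader can generate
	WireGuardPubKey  string `json:"wireguardPubKey"`    // placeholder for now
}

// JoinResponse is the hypothetical reply carrying the agent's credentials.
type JoinResponse struct {
	NodeName      string `json:"nodeName"`
	NodeUID       string `json:"nodeUid"`
	SignedCertPEM string `json:"signedCertPem"`
	CACertPEM     string `json:"caCertPem"`
}

// joinURL builds the endpoint URL from the --leader-api value (ip:port).
func joinURL(leaderAPI string) string {
	return "https://" + leaderAPI + "/internal/v1alpha1/join"
}

// decodeJoinResponse parses the reply and insists on both certificates.
func decodeJoinResponse(data []byte) (*JoinResponse, error) {
	var out JoinResponse
	if err := json.Unmarshal(data, &out); err != nil {
		return nil, err
	}
	if out.SignedCertPEM == "" || out.CACertPEM == "" {
		return nil, errors.New("join response is missing the signed certificate or the CA certificate")
	}
	return &out, nil
}

// submitJoin posts the request with the supplied client. How that client
// comes to trust the leader for this first call is decided by the caller.
func submitJoin(client *http.Client, leaderAPI string, req JoinRequest) (*JoinResponse, error) {
	body, err := json.Marshal(req)
	if err != nil {
		return nil, err
	}
	resp, err := client.Post(joinURL(leaderAPI), "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("join rejected: %s", resp.Status)
	}
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return decodeJoinResponse(data)
}
```

Passing the `*http.Client` in keeps the unresolved trust decision (CA flag, token, or network trust) outside the submission logic, so it can change without touching this code.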
- Leader: CSR Signing & Node Registration (Handler for `/internal/v1alpha1/join`)
  - Purpose: Validate joining agent, sign its CSR, and record its registration.
  - Details:
    - Handler for `/internal/v1alpha1/join`:
      - Parse CSR, advertise address, WG pubkey from request.
      - Validate (minimal for now).
      - Generate a unique Node Name if not provided. Assign a Node UID.
      - Sign the CSR using `pki.SignCertificateRequest()`.
      - Store Node registration data in etcd via `StateStore` (`/kat/nodes/registration/{nodeName}`: UID, advertise address, WG pubkey placeholder, join timestamp).
      - Return the signed agent certificate and the cluster CA certificate to the agent.
  - Verification:
    - After agent joins, its certificate is signed by the cluster CA.
    - Node registration data appears correctly in etcd under `/kat/nodes/registration/{nodeName}`.
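The validation and registration steps of the handler, without the HTTP and etcd plumbing, might be sketched as below. The record layout, the `node-` name prefix, the UID format, and the rule that rejects `/` and spaces in node names are assumptions made for this example, and all function and type names are hypothetical.

```go
package main

import (
	"crypto/rand"
	"crypto/x509"
	"encoding/hex"
	"encoding/pem"
	"errors"
	"strings"
	"time"
)

// NodeRegistration is a hypothetical record stored at registrationKey(nodeName).
type NodeRegistration struct {
	UID              string    `json:"uid"`
	AdvertiseAddress string    `json:"advertiseAddress"`
	WireGuardPubKey  string    `json:"wireguardPubKey"`
	JoinedAt         time.Time `json:"joinedAt"`
}

// registrationKey returns the etcd key named in the task description.
func registrationKey(nodeName string) string {
	return "/kat/nodes/registration/" + nodeName
}

// randomHex returns n random bytes, hex-encoded.
func randomHex(n int) (string, error) {
	buf := make([]byte, n)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// resolveNodeName keeps a requested name or generates one. Names with "/" or
// spaces are rejected because the name becomes part of an etcd key and a URL path.
func resolveNodeName(requested string) (string, error) {
	if requested == "" {
		suffix, err := randomHex(4)
		if err != nil {
			return "", err
		}
		return "node-" + suffix, nil
	}
	if strings.ContainsAny(requested, "/ ") {
		return "", errors.New("node name must not contain '/' or spaces")
	}
	return requested, nil
}

// parseCSR decodes a PEM-encoded CSR and verifies its self-signature.
func parseCSR(csrPEM []byte) (*x509.CertificateRequest, error) {
	block, _ := pem.Decode(csrPEM)
	if block == nil || block.Type != "CERTIFICATE REQUEST" {
		return nil, errors.New("payload does not contain a PEM certificate request")
	}
	csr, err := x509.ParseCertificateRequest(block.Bytes)
	if err != nil {
		return nil, err
	}
	if err := csr.CheckSignature(); err != nil {
		return nil, err
	}
	return csr, nil
}

// newRegistration assigns a UID and join timestamp to a validated request.
func newRegistration(advertiseAddress, wgPubKey string) (NodeRegistration, error) {
	if advertiseAddress == "" {
		return NodeRegistration{}, errors.New("advertise address is required")
	}
	uid, err := randomHex(16)
	if err != nil {
		return NodeRegistration{}, err
	}
	return NodeRegistration{UID: uid, AdvertiseAddress: advertiseAddress, WireGuardPubKey: wgPubKey, JoinedAt: time.Now().UTC()}, nil
}
```

A handler would call `parseCSR`, `resolveNodeName`, and `newRegistration` in that order, sign the CSR, write the record under `registrationKey(name)`, and return the certificates. A randomly generated name is only probably unique, so a real handler would still need to check etcd for a collision before writing.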
- Agent: Establish mTLS Client for Subsequent Comms & Implement Heartbeating (`internal/agent/agent.go`)
  - Purpose: Agent uses its new mTLS certs to communicate status to the Leader.
  - Details:
    - Agent configures its HTTP client to use its signed key/cert and the cluster CA cert for all future Leader communications.
    - Periodic Heartbeat (RFC 4.1.3):
      - Ticker (e.g., every `agentTickSeconds` from `cluster.kat`, default 15s).
      - On tick, gather basic node status (node name, timestamp, initial resource capacity stubs).
      - HTTP `POST` to Leader's `/v1alpha1/nodes/{nodeName}/status` endpoint using the mTLS-configured client.
  - Verification: Agent logs successful heartbeat POSTs.
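The mTLS client and the heartbeat loop could be sketched as follows. The payload type `NodeStatus`, its JSON field names, the 10-second client timeout, and all function names are assumptions for illustration; resource capacity stubs are omitted.

```go
package main

import (
	"bytes"
	"context"
	"crypto/tls"
	"crypto/x509"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
	"os"
	"time"
)

// NodeStatus is a hypothetical heartbeat payload.
type NodeStatus struct {
	NodeName  string    `json:"nodeName"`
	Timestamp time.Time `json:"timestamp"`
}

// newMTLSClient presents the agent's certificate and trusts only the cluster CA.
func newMTLSClient(certPath, keyPath, caCertPath string) (*http.Client, error) {
	cert, err := tls.LoadX509KeyPair(certPath, keyPath)
	if err != nil {
		return nil, err
	}
	caPEM, err := os.ReadFile(caCertPath)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("no CA certificates found in %s", caCertPath)
	}
	tlsCfg := &tls.Config{Certificates: []tls.Certificate{cert}, RootCAs: pool, MinVersion: tls.VersionTLS12}
	return &http.Client{Timeout: 10 * time.Second, Transport: &http.Transport{TLSClientConfig: tlsCfg}}, nil
}

// statusURL builds the heartbeat endpoint for a node.
func statusURL(leaderAPI, nodeName string) string {
	return "https://" + leaderAPI + "/v1alpha1/nodes/" + url.PathEscape(nodeName) + "/status"
}

// sendHeartbeat POSTs one status update and treats any non-2xx reply as an error.
func sendHeartbeat(ctx context.Context, client *http.Client, leaderAPI string, status NodeStatus) error {
	body, err := json.Marshal(status)
	if err != nil {
		return err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, statusURL(leaderAPI, status.NodeName), bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("leader returned %s", resp.Status)
	}
	return nil
}

// runHeartbeats calls send on every tick until ctx is cancelled.
func runHeartbeats(ctx context.Context, interval time.Duration, send func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := send(ctx); err != nil {
				log.Printf("heartbeat failed: %v", err)
			}
		}
	}
}
```

A failed heartbeat is logged and retried on the next tick rather than ending the loop, so a brief leader outage does not stop the agent from reporting. If the agent dials the leader by IP while the leader's certificate names `leader.kat.cluster.local`, `ServerName` in the client's `tls.Config` would need to be set to that name, or the certificate would need a matching IP SAN.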
- Leader: Receive Heartbeats & Basic Failure Detection (Handler for `/v1alpha1/nodes/{nodeName}/status`, `internal/leader/leader.go`)
  - Purpose: Leader tracks agent status and detects failures.
  - Details:
    - API endpoint `/v1alpha1/nodes/{nodeName}/status` (mTLS required):
      - Receives status update from agent.
      - Updates node's actual state in etcd (`/kat/nodes/status/{nodeName}/heartbeat`: timestamp, reported status). Could use an etcd lease for this key, renewed by agent heartbeats.
    - Failure Detection (RFC 4.1.4):
      - Leader has a reconciliation loop or periodic check.
      - Scans `/kat/nodes/status/` in etcd.
      - If a node's last heartbeat timestamp is older than `nodeLossTimeoutSeconds` (from `cluster.kat`), update its status in etcd to `NotReady` (e.g., `/kat/nodes/status/{nodeName}/condition: NotReady`).
  - Potential Challenges: Efficiently scanning for dead nodes without excessive etcd load.
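The timeout comparison at the heart of failure detection can be isolated as a pure function, which keeps it testable without etcd. This sketch assumes heartbeat timestamps are held as Unix seconds, which the plan does not specify; the function names are hypothetical, and reading from and writing to etcd are left to the caller.

```go
package main

import "sort"

// conditionNotReady is the value written for a node that missed its heartbeats.
const conditionNotReady = "NotReady"

// conditionKey returns the etcd key used in the task's example.
func conditionKey(nodeName string) string {
	return "/kat/nodes/status/" + nodeName + "/condition"
}

// staleNodes returns, sorted by name, every node whose last heartbeat is more
// than timeoutSeconds older than now. All times are Unix seconds.
func staleNodes(lastHeartbeat map[string]int64, now, timeoutSeconds int64) []string {
	var stale []string
	for name, ts := range lastHeartbeat {
		if now-ts > timeoutSeconds {
			stale = append(stale, name)
		}
	}
	sort.Strings(stale)
	return stale
}

// notReadyUpdates maps each stale node's condition key to the value to write.
func notReadyUpdates(stale []string) map[string]string {
	updates := make(map[string]string, len(stale))
	for _, name := range stale {
		updates[conditionKey(name)] = conditionNotReady
	}
	return updates
}
```

For example, with heartbeats `{"a": 100, "b": 40, "c": 10}`, `now = 100`, and a 60-second timeout, only `c` is stale: `b` is exactly 60 seconds old, which is not older than the timeout. Recording the leader's own receipt time for each heartbeat, rather than the agent-reported timestamp, would keep agent clock skew out of this comparison.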
- Milestone Verification:
  - `kat-agent init` runs as Leader, CA created, its API is up with mTLS.
  - A second `kat-agent join ...` process successfully:
    - Generates CSR, gets it signed by Leader.
    - Saves its cert and CA cert.
    - Starts sending heartbeats to Leader using mTLS.
  - Leader logs receipt of heartbeats from the joined Agent.
  - Node status (last heartbeat time) is updated in etcd by the Leader.
  - If the joined Agent process is stopped, after `nodeLossTimeoutSeconds`, the Leader updates the node's status in etcd to `NotReady`. This can be verified using `etcdctl` or a `StateStore.Get` call.