Init Docs
**docs/plan/filestructure.md** (new file, 134 lines)

# Directory/File Structure

This structure assumes a Go-based project, as hinted by the Go interface definitions in the RFC.

```
kat-system/
├── README.md                      # Project overview, build instructions, contribution guide
├── LICENSE                        # Project license (e.g., Apache 2.0, MIT)
├── go.mod                         # Go modules definition
├── go.sum                         # Go modules checksums
├── Makefile                       # Build, test, lint, generate code, etc.
│
├── api/
│   └── v1alpha1/
│       ├── kat.proto              # Protocol Buffer definitions for all KAT resources (Workload, Node, etc.)
│       └── generated/             # Generated Go code from .proto files (e.g., using protoc-gen-go)
│                                  # Potentially OpenAPI/Swagger specs generated from protos too.
│
├── cmd/
│   ├── kat-agent/
│   │   └── main.go                # Entrypoint for the kat-agent binary
│   └── katcall/
│       └── main.go                # Entrypoint for the katcall CLI binary
│
├── internal/
│   ├── agent/
│   │   ├── agent.go               # Core agent logic, heartbeating, command processing
│   │   ├── runtime.go             # Interface with ContainerRuntime (Podman)
│   │   ├── build.go               # Git-native build process logic
│   │   └── dns_resolver.go        # Embedded DNS server logic
│   │
│   ├── leader/
│   │   ├── leader.go              # Core leader logic, reconciliation loops
│   │   ├── schedule.go            # Scheduling algorithm implementation
│   │   ├── ipam.go                # IP Address Management logic
│   │   ├── state_backup.go        # etcd backup logic
│   │   └── api_handler.go         # HTTP API request handlers (connects to api/v1alpha1)
│   │
│   ├── api/                       # Server-side API implementation details
│   │   ├── server.go              # HTTP server setup, middleware (auth, logging)
│   │   ├── router.go              # API route definitions
│   │   └── auth.go                # Authentication (mTLS, Bearer token) logic
│   │
│   ├── cli/
│   │   ├── commands/              # Subdirectories for each katcall command (apply, get, logs, etc.)
│   │   │   ├── apply.go
│   │   │   └── ...
│   │   ├── client.go              # HTTP client for interacting with the KAT API
│   │   └── utils.go               # CLI helper functions
│   │
│   ├── config/
│   │   ├── types.go               # Go structs for Quadlet file kinds if not directly from proto
│   │   ├── parse.go               # Logic for parsing and validating *.kat files (Quadlets, cluster.kat)
│   │   └── defaults.go            # Default values for configurations
│   │
│   ├── store/
│   │   ├── interface.go           # Definition of StateStore interface (as in RFC 5.1)
│   │   └── etcd.go                # etcd implementation of StateStore, embedded etcd setup
│   │
│   ├── runtime/
│   │   ├── interface.go           # Definition of ContainerRuntime interface (as in RFC 6.1)
│   │   └── podman.go              # Podman implementation of ContainerRuntime
│   │
│   ├── network/
│   │   ├── wireguard.go           # WireGuard setup and peer management logic
│   │   └── types.go               # Network-related internal types
│   │
│   ├── pki/
│   │   ├── ca.go                  # Certificate Authority management (generation, signing)
│   │   └── certs.go               # Certificate generation and handling utilities
│   │
│   ├── observability/
│   │   ├── logging.go             # Logging setup for components
│   │   ├── metrics.go             # Metrics collection and exposure logic
│   │   └── events.go              # Event recording and retrieval logic
│   │
│   ├── types/                     # Core internal data structures if not covered by API protos
│   │   ├── node.go
│   │   ├── workload.go
│   │   └── ...
│   │
│   ├── constants/
│   │   └── constants.go           # Global constants (etcd key prefixes, default ports, etc.)
│   │
│   └── utils/
│       ├── utils.go               # Common utility functions (error handling, string manipulation)
│       └── tar.go                 # Utilities for handling tar.gz Quadlet archives
│
├── docs/
│   ├── rfc/
│   │   └── RFC001-KAT.md          # The source RFC document
│   ├── user-guide/                # User documentation (installation, getting started, tutorials)
│   │   ├── installation.md
│   │   └── basic_usage.md
│   └── api-guide/                 # API usage documentation (perhaps generated)
│
├── examples/
│   ├── simple-service/            # Example Quadlet for a simple service
│   │   ├── workload.kat
│   │   └── VirtualLoadBalancer.kat
│   ├── git-build-service/         # Example Quadlet for a service built from Git
│   │   ├── workload.kat
│   │   └── build.kat
│   ├── job/                       # Example Quadlet for a Job
│   │   ├── workload.kat
│   │   └── job.kat
│   └── cluster.kat                # Example cluster configuration file
│
├── scripts/
│   ├── setup-dev-env.sh           # Script to set up development environment
│   ├── lint.sh                    # Code linting script
│   ├── test.sh                    # Script to run all tests
│   └── gen-proto.sh               # Script to generate Go code from .proto files
│
└── test/
    ├── unit/                      # Unit tests (mirroring internal/ structure)
    ├── integration/               # Integration tests (e.g., agent-leader interaction)
    └── e2e/                       # End-to-end tests (testing full cluster operations via katcall)
        ├── fixtures/              # Test Quadlet files
        └── e2e_test.go
```

**Description of Key Files/Directories and Relationships:**

*   **`api/v1alpha1/kat.proto`**: The source of truth for all resource definitions. `make generate` (or `scripts/gen-proto.sh`) converts this into Go structs in `api/v1alpha1/generated/`. These structs are used across the `internal/` packages.
*   **`cmd/kat-agent/main.go`**: Initializes and runs the `kat-agent`. It instantiates components from `internal/store` (for etcd), `internal/agent`, `internal/leader`, `internal/pki`, `internal/network`, and `internal/api` (for the API server if elected leader); a wiring sketch follows this list.
*   **`cmd/katcall/main.go`**: Entry point for the CLI. It uses `internal/cli` components to parse commands and interact with the KAT API via `internal/cli/client.go`.
*   **`internal/config/parse.go`**: Used by the Leader to parse submitted Quadlet `tar.gz` archives and by `kat-agent init` to parse `cluster.kat`.
*   **`internal/store/etcd.go`**: Implements `StateStore` and manages the embedded etcd instance. Used by both Agent (for watching) and Leader (for all state modifications and leader election).
*   **`internal/runtime/podman.go`**: Implements `ContainerRuntime`. Used by `internal/agent/runtime.go` to manage containers via Podman.
*   **`internal/agent/agent.go`** and **`internal/leader/leader.go`**: Contain the core state machines and logic for the respective roles. The `kat-agent` binary decides which role's logic to activate based on leader election status.
*   **`internal/pki/ca.go`**: Used by `kat-agent init` to create the CA, and by the Leader to sign CSRs from joining agents.
*   **`internal/network/wireguard.go`**: Used by agents to configure their local WireGuard interface based on data synced from etcd (managed by the Leader).
*   **`internal/leader/api_handler.go`**: Implements the HTTP handlers for the API, using other leader components (scheduler, IPAM, store) to fulfill requests.

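As a rough illustration of how these packages fit together, a minimal sketch of `cmd/kat-agent/main.go` could look like the following. The wiring calls are shown as comments because every constructor name (`config.ParseClusterKat`, `store.NewEtcdStore`, `leader.RunIfElected`, `agent.Run`) is a hypothetical placeholder, not something defined by the RFC:

```go
package main

import (
	"context"
	"log"
	"os/signal"
	"syscall"
)

func main() {
	// Run until SIGINT/SIGTERM, which is roughly how a long-lived agent behaves.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	// Hypothetical wiring of the internal/ packages described above:
	//   cfg, _ := config.ParseClusterKat("cluster.kat")        // internal/config
	//   st, _  := store.NewEtcdStore(ctx, cfg.EtcdClientURLs)  // internal/store
	//   go leader.RunIfElected(ctx, st, cfg)                   // internal/leader (+ internal/api when leader)
	//   agent.Run(ctx, st, cfg)                                // internal/agent (runtime, network, pki helpers)
	log.Println("kat-agent: structural sketch only")
	<-ctx.Done()
}
```
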
**docs/plan/overview.md** (new file, 183 lines)

# Implementation Plan

This plan breaks down the implementation into manageable phases, each with a testable milestone.

**Phase 0: Project Setup & Core Types**

*   **Goal**: Basic project structure, version control, build system, and core data type definitions.
*   **Tasks**:
    1.  Initialize Git repository and `go.mod`.
    2.  Create the initial directory structure (as laid out in the directory structure above).
    3.  Define core Proto3 messages in `api/v1alpha1/kat.proto` for: `Workload`, `VirtualLoadBalancer`, `JobDefinition`, `BuildDefinition`, `Namespace`, `Node` (internal representation), `ClusterConfiguration`.
    4.  Set up `scripts/gen-proto.sh` and generate initial Go types.
    5.  Implement parsing and basic validation for `cluster.kat` (`internal/config/parse.go`).
    6.  Implement parsing and basic validation for Quadlet files (`workload.kat`, etc.) and their `tar.gz` packaging/unpackaging (see the sketch after this phase).
*   **Milestone**:
    *   `make generate` successfully creates Go types from protos.
    *   Unit tests pass for parsing `cluster.kat` and a sample Quadlet directory (as `tar.gz`) into their respective Go structs.

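The `tar.gz` handling in Task 6 could start from a helper along these lines (the function name and the rule of keeping only `*.kat` files are assumptions, not RFC requirements):

```go
package config

import (
	"archive/tar"
	"compress/gzip"
	"fmt"
	"io"
	"path/filepath"
	"strings"
)

// UnpackQuadletArchive reads a tar.gz stream and returns the contents of the
// *.kat files it contains, keyed by their cleaned relative path.
func UnpackQuadletArchive(r io.Reader) (map[string][]byte, error) {
	gz, err := gzip.NewReader(r)
	if err != nil {
		return nil, fmt.Errorf("not a gzip stream: %w", err)
	}
	defer gz.Close()

	files := make(map[string][]byte)
	tr := tar.NewReader(gz)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, fmt.Errorf("reading tar: %w", err)
		}
		if hdr.Typeflag != tar.TypeReg {
			continue // skip directories, symlinks, etc.
		}
		name := filepath.Clean(hdr.Name)
		// Reject path traversal and keep only Quadlet files.
		if strings.HasPrefix(name, "..") || !strings.HasSuffix(name, ".kat") {
			continue
		}
		data, err := io.ReadAll(tr)
		if err != nil {
			return nil, fmt.Errorf("reading %s: %w", name, err)
		}
		files[name] = data
	}
	if len(files) == 0 {
		return nil, fmt.Errorf("archive contains no *.kat files")
	}
	return files, nil
}
```
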
**Phase 1: State Management & Leader Election**

*   **Goal**: A functional embedded etcd and leader election mechanism.
*   **Tasks**:
    1.  Implement the `StateStore` interface (RFC 5.1) with an etcd backend (`internal/store/etcd.go`).
    2.  Integrate the embedded etcd server into `kat-agent` (RFC 2.2, 5.2), configurable via `cluster.kat` parameters.
    3.  Implement leader election using `go.etcd.io/etcd/client/v3/concurrency` (RFC 5.3).
    4.  Basic `kat-agent init` functionality:
        *   Parse `cluster.kat`.
        *   Start a single-node embedded etcd.
        *   Campaign for and become leader.
        *   Store the initial cluster configuration (UID, CIDRs from `cluster.kat`) in etcd.
*   **Milestone**:
    *   A single `kat-agent init --config cluster.kat` process starts, initializes etcd, and logs that it has become the leader.
    *   The cluster configuration from `cluster.kat` can be verified in etcd using an etcd client.
    *   `StateStore` interface methods (`Put`, `Get`, `Delete`, `List`) are testable against the embedded etcd.

**Phase 2: Basic Agent & Node Lifecycle (Init, Join, PKI)**

*   **Goal**: Initial Leader setup, a second Agent joining with mTLS, and heartbeating.
*   **Tasks**:
    1.  Implement the internal PKI (RFC 10.6) in `internal/pki/`:
        *   CA key/cert generation on `kat-agent init`.
        *   CSR generation by the agent on join.
        *   CSR signing by the Leader.
    2.  Implement the initial Node Communication Protocol (RFC 2.3) for join:
        *   Agent (`kat-agent join --leader-api <...> --advertise-address <...>`) sends a CSR to the Leader.
        *   Leader validates, signs, and returns certs & CA. Stores the node registration (name, UID, advertise address, WG pubkey placeholder) in etcd.
    3.  Implement basic mTLS for this join communication.
    4.  Implement the Node Heartbeat (`POST /v1alpha1/nodes/{nodeName}/status`) from Agent to Leader (RFC 4.1.3). The Leader updates node status in etcd.
    5.  Leader implements basic failure detection (marks a Node `NotReady` in etcd if heartbeats cease) (RFC 4.1.4).
*   **Milestone**:
    *   `kat-agent init` establishes a Leader with a CA.
    *   `kat-agent join` allows a second agent to securely register with the Leader, obtain certificates, and store its info in etcd.
    *   The Leader's API receives heartbeats from the joined Agent.
    *   If a joined Agent is stopped, the Leader marks its status as `NotReady` in etcd after `nodeLossTimeoutSeconds`.

**Phase 3: Container Runtime Interface & Local Podman Management**

*   **Goal**: The Agent can manage containers locally via Podman through the `ContainerRuntime` interface.
*   **Tasks**:
    1.  Define the `ContainerRuntime` interface in `internal/runtime/interface.go` (RFC 6.1).
    2.  Implement the Podman backend for `ContainerRuntime` in `internal/runtime/podman.go` (RFC 6.2). Focus on: `CreateContainer`, `StartContainer`, `StopContainer`, `RemoveContainer`, `GetContainerStatus`, `PullImage`, `StreamContainerLogs`.
    3.  Implement the rootless execution strategy (RFC 6.3):
        *   Mechanism to ensure dedicated user accounts (initially, assume pre-existing or manual creation for tests).
        *   Podman systemd unit generation (`podman generate systemd`).
        *   Managing units via `systemctl --user`.
*   **Milestone**:
    *   The Agent process (upon a mocked internal command) can pull a specified image (e.g., `nginx`) and run it rootlessly using Podman and systemd user services.
    *   The Agent can stop, remove, and get the status/logs of this container.
    *   All operations are performed via the `ContainerRuntime` interface.

**Phase 4: Basic Workload Deployment (Single Node, Image Source Only, No Networking)**

*   **Goal**: The Leader can instruct an Agent to run a simple `Service` workload (single container, image source) on itself (if the leader is also an agent) or on a single joined agent.
*   **Tasks**:
    1.  Implement basic API endpoints on the Leader for Workload CRUD (`POST/PUT /v1alpha1/n/{ns}/workloads` accepting `tar.gz`) (RFC 8.3, 4.2). The Leader stores Quadlet files in etcd.
    2.  Simplistic scheduling (RFC 4.4): if there is only one agent node, assign the workload to it. The Leader creates an "assignment" or "task" for the agent in etcd.
    3.  The Agent watches etcd for assigned tasks.
    4.  On receiving a task, the Agent uses `ContainerRuntime` to deploy the container (image from `workload.kat`).
    5.  The Agent reports container instance status in its heartbeat. The Leader updates the overall workload status in etcd.
    6.  Basic `katcall apply -f <dir>` and `katcall get workload <name>` functionality.
*   **Milestone**:
    *   A user can deploy a simple single-container `Service` (e.g., `nginx`) using `katcall apply`.
    *   The container runs on the designated Agent node.
    *   `katcall get workload my-service` shows its status as running.
    *   `katcall logs <instanceID>` streams container logs.

**Phase 5: Overlay Networking (WireGuard) & IPAM**

*   **Goal**: Nodes establish a WireGuard overlay network. The Leader allocates IPs for containers.
*   **Tasks**:
    1.  Implement WireGuard setup on Agents (`internal/network/wireguard.go`) (RFC 7.1):
        *   Key generation and public key reporting to the Leader during join/heartbeat.
        *   The Leader stores node WireGuard public keys and advertise endpoints in etcd.
        *   The Agent configures its `kat0` interface and peers by watching etcd.
    2.  Implement IPAM in the Leader (`internal/leader/ipam.go`) (RFC 7.2); see the sketch after this phase:
        *   Node subnet allocation from `clusterCIDR` (from `cluster.kat`).
        *   Container IP allocation from the node's subnet when a workload instance is scheduled.
    3.  The Agent uses the Leader-assigned IP when creating the container network/container with Podman.
*   **Milestone**:
    *   All joined KAT nodes form a WireGuard mesh; `wg show` on each node confirms peer connections.
    *   The Leader allocates a unique overlay IP for each container instance.
    *   Containers on different nodes can ping each other using their overlay IPs.

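The node-subnet half of the IPAM task could start from a pure function that carves the idx-th fixed-size subnet out of `clusterCIDR`. Everything below (names, the /24-per-node split, IPv4-only handling) is illustrative, not specified by the RFC:

```go
package ipam

import (
	"encoding/binary"
	"fmt"
	"net"
)

// NthNodeSubnet carves the idx-th subnet of size /nodeSubnetBits out of clusterCIDR.
// Example: clusterCIDR "10.100.0.0/16" with nodeSubnetBits 24 yields
// 10.100.0.0/24 for idx 0, 10.100.1.0/24 for idx 1, and so on.
func NthNodeSubnet(clusterCIDR string, nodeSubnetBits, idx int) (*net.IPNet, error) {
	_, cluster, err := net.ParseCIDR(clusterCIDR)
	if err != nil {
		return nil, err
	}
	clusterOnes, bits := cluster.Mask.Size()
	if bits != 32 || nodeSubnetBits <= clusterOnes || nodeSubnetBits > 32 {
		return nil, fmt.Errorf("unsupported CIDR split %s -> /%d", clusterCIDR, nodeSubnetBits)
	}
	maxSubnets := 1 << (nodeSubnetBits - clusterOnes)
	if idx < 0 || idx >= maxSubnets {
		return nil, fmt.Errorf("subnet index %d out of range (max %d)", idx, maxSubnets-1)
	}
	base := binary.BigEndian.Uint32(cluster.IP.To4())
	subnetSize := uint32(1) << (32 - nodeSubnetBits)
	ip := make(net.IP, 4)
	binary.BigEndian.PutUint32(ip, base+uint32(idx)*subnetSize)
	return &net.IPNet{IP: ip, Mask: net.CIDRMask(nodeSubnetBits, 32)}, nil
}
```

The Leader would persist the index-to-node mapping (and per-node container IP allocations) in etcd so allocations survive restarts.
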
**Phase 6: Distributed Agent DNS & Service Discovery**

*   **Goal**: Basic service discovery using agent-local DNS for deployed services.
*   **Tasks**:
    1.  Implement the agent-local DNS server (`internal/agent/dns_resolver.go`) using `miekg/dns` (RFC 7.3); a sketch follows this phase.
    2.  The Leader writes DNS `A` records to etcd (e.g., `<workloadName>.<namespace>.<clusterDomain> -> <containerOverlayIP>`) when service instances become healthy/active.
    3.  The agent DNS server watches etcd for DNS records and updates its local zones.
    4.  The Agent configures `/etc/resolv.conf` in managed containers to use its `kat0` IP as the nameserver.
*   **Milestone**:
    *   A service (`service-a`) deployed on one node can be resolved by its DNS name (e.g., `service-a.default.kat.cluster.local`) by a container on another node.
    *   DNS resolution returns the correct overlay IP(s) of the `service-a` instances.

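To make the resolver task concrete, a minimal `miekg/dns` server answering A queries from an in-memory table (which the etcd watcher from Task 3 would keep up to date) might look like this. The zone name, map layout, and TTL are assumptions:

```go
package dnsresolver

import (
	"net"
	"sync"

	"github.com/miekg/dns"
)

// Resolver serves A records from an in-memory table that an etcd watcher
// is expected to keep up to date.
type Resolver struct {
	mu      sync.RWMutex
	records map[string]net.IP // FQDN (with trailing dot) -> overlay IP
}

func (r *Resolver) handle(w dns.ResponseWriter, req *dns.Msg) {
	m := new(dns.Msg)
	m.SetReply(req)
	m.Authoritative = true

	r.mu.RLock()
	defer r.mu.RUnlock()
	for _, q := range req.Question {
		if q.Qtype != dns.TypeA {
			continue
		}
		if ip, ok := r.records[q.Name]; ok {
			m.Answer = append(m.Answer, &dns.A{
				Hdr: dns.RR_Header{Name: q.Name, Rrtype: dns.TypeA, Class: dns.ClassINET, Ttl: 30},
				A:   ip,
			})
		}
	}
	if len(m.Answer) == 0 {
		m.Rcode = dns.RcodeNameError // NXDOMAIN for unknown names
	}
	_ = w.WriteMsg(m)
}

// ListenAndServe binds the resolver to addr (e.g., the node's kat0 IP, port 53).
func (r *Resolver) ListenAndServe(addr string) error {
	mux := dns.NewServeMux()
	mux.HandleFunc("kat.cluster.local.", r.handle) // assumed cluster domain
	srv := &dns.Server{Addr: addr, Net: "udp", Handler: mux}
	return srv.ListenAndServe()
}
```
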
**Phase 7: Advanced Workload Features & Full Scheduling**

*   **Goal**: Implement `Job`, `DaemonService`, richer scheduling, health checks, volumes, and restart policies.
*   **Tasks**:
    1.  Implement the `Job` type (RFC 3.4, 4.8): scheduling, completion tracking, backoff.
    2.  Implement the `DaemonService` type (RFC 3.2): ensures one instance per eligible node.
    3.  Implement the full scheduling logic in the Leader (RFC 4.4): resource requests (`cpu`, `memory`), `nodeSelector`, taints/tolerations, GPU (basic), "most empty" scoring (see the sketch after this phase).
    4.  Implement `VirtualLoadBalancer.kat` parsing and Agent-side health checks (RFC 3.3, 4.6.3). The Leader uses health status for service readiness and DNS.
    5.  Implement container `restartPolicy` (RFC 3.2, 4.6.4) via systemd unit configuration.
    6.  Implement `volumeMounts` and `volumes` (RFC 3.2, 4.7): `HostMount`, `SimpleClusterStorage`. The Agent ensures paths are set up.
*   **Milestone**:
    *   `Job`s run to completion and their status is tracked.
    *   `DaemonService`s run one instance on every eligible node.
    *   Services are scheduled according to resource requests, selectors, and taints.
    *   Unhealthy service instances are identified by health checks and reflected in status.
    *   Containers restart based on their policy.
    *   Workloads can mount host paths and simple cluster storage.

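The RFC's "most empty" scoring is not reproduced in this plan; one simple interpretation is to score each feasible node by its remaining capacity fraction after placement and pick the highest. Averaging CPU and memory headroom, as below, is an assumption:

```go
package scheduler

// NodeCapacity is an illustrative view of a node's allocatable and
// already-requested resources (units: millicores and bytes).
type NodeCapacity struct {
	AllocatableCPU, RequestedCPU int64
	AllocatableMem, RequestedMem int64
}

// mostEmptyScore returns a value in [0, 1]; higher means the node is emptier
// after placing a workload that requests (reqCPU, reqMem). It returns -1 if
// the workload does not fit on the node.
func mostEmptyScore(n NodeCapacity, reqCPU, reqMem int64) float64 {
	if n.AllocatableCPU == 0 || n.AllocatableMem == 0 {
		return -1
	}
	freeCPU := n.AllocatableCPU - n.RequestedCPU - reqCPU
	freeMem := n.AllocatableMem - n.RequestedMem - reqMem
	if freeCPU < 0 || freeMem < 0 {
		return -1 // infeasible
	}
	cpuFrac := float64(freeCPU) / float64(n.AllocatableCPU)
	memFrac := float64(freeMem) / float64(n.AllocatableMem)
	return (cpuFrac + memFrac) / 2
}
```
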
**Phase 8: Git-Native Builds & Workload Updates/Rollbacks**

*   **Goal**: Enable on-agent builds from Git sources and implement workload update strategies.
*   **Tasks**:
    1.  Implement `BuildDefinition.kat` parsing (RFC 3.5).
    2.  Implement the Git-native build process on the Agent (`internal/agent/build.go`) using Podman (RFC 4.3).
    3.  Implement `cacheImage` pull/push for build caching (the Agent needs registry credentials configured locally).
    4.  Implement workload update strategies in the Leader (RFC 4.5): `Simultaneous`, `Rolling` (with `maxSurge`).
    5.  Implement a manual rollback mechanism (`katcall rollback workload <name>`) (RFC 4.5).
*   **Milestone**:
    *   A workload can be successfully deployed from a Git repository source, with the image built on the agent.
    *   A deployed service can be updated using the `Rolling` strategy, with observable incremental instance replacement.
    *   A workload can be rolled back to its previous version.

**Phase 9: Full API Implementation & CLI (`katcall`) Polish**

*   **Goal**: A robust and comprehensive HTTP API and `katcall` CLI.
*   **Tasks**:
    1.  Implement all remaining API endpoints and features per RFC Section 8. Ensure the Proto3/JSON contracts are met.
    2.  Implement API authentication: bearer tokens for `katcall` (RFC 8.1, 10.1).
    3.  Flesh out `katcall` with all necessary commands and options (RFC 1.5 Terminology: katcall; RFC 8.3 hints):
        *   `drain <nodeName>`, `get nodes/namespaces`, `describe <resource>`, etc.
    4.  Improve error reporting and user feedback in the CLI and API.
*   **Milestone**:
    *   All functionality defined in the RFC can be managed and introspected via the `katcall` CLI interacting with the secure KAT API.
    *   API documentation (e.g., Swagger/OpenAPI generated from protos or code) is available.

**Phase 10: Observability, Backup/Restore, Advanced Features & Security**

*   **Goal**: Implement observability features, state backup/restore, and other advanced functionality.
*   **Tasks**:
    1.  Implement Agent & Leader logging to the systemd journal/files; the API for streaming container logs was already delivered in Phase 4 (RFC 9.1).
    2.  Implement basic metrics exposure (a `/metrics` JSON endpoint on Leader/Agent) (RFC 9.2).
    3.  Implement the events system: the Leader records significant events in etcd, with an API to query them (RFC 9.3).
    4.  Implement Leader-driven etcd state backup (`etcdctl snapshot save`) (RFC 5.4).
    5.  Document and test the etcd state restore procedure (RFC 5.5).
    6.  Implement detached node operation and rejoin (RFC 4.9).
    7.  Provide standard Quadlet files and documentation for the Traefik ingress recipe (RFC 7.4).
    8.  Review and harden security aspects: API security, build security, network security, secrets handling (document current limitations per RFC 10.5).
*   **Milestone**:
    *   Container logs are streamable via `katcall logs`; Agent/Leader logs are accessible.
    *   Basic metrics are available via the API. Cluster events can be listed.
    *   Automated etcd backups are created by the Leader, and the restore procedure is tested.
    *   A detached node can operate locally and rejoin the main cluster.
    *   Traefik can be deployed using the provided Quadlets to achieve ingress.

**Phase 11: Testing, Documentation, and Release Preparation**

*   **Goal**: Ensure KAT v1.0 is robust, well-documented, and ready for release.
*   **Tasks**:
    1.  Write comprehensive unit tests for all core logic.
    2.  Develop integration tests for component interactions (e.g., Leader-Agent, Agent-Podman).
    3.  Create an E2E test suite using `katcall` to simulate real user scenarios.
    4.  Write detailed user documentation: installation, configuration, tutorials for all features, troubleshooting.
    5.  Perform performance testing on key operations (e.g., deployment speed, agent density).
    6.  Conduct a thorough security review/audit against the RFC's security considerations.
    7.  Establish a release process: versioning, changelog, building release artifacts.
*   **Milestone**:
    *   High test coverage.
    *   Comprehensive user and API documentation is complete.
    *   Known critical bugs are fixed.
    *   KAT v1.0 is packaged and ready for its first official release.

**docs/plan/phase1.md** (new file, 81 lines)

# **Phase 1: State Management & Leader Election**

*   **Goal**: Establish the foundational state layer using embedded etcd and implement a reliable leader election mechanism. A single `kat-agent` can initialize a cluster, become its leader, and store the initial configuration.
*   **RFC Sections Primarily Used**: 2.2 (Embedded etcd), 3.9 (ClusterConfiguration), 5.1 (State Store Interface), 5.2 (etcd Implementation Details), 5.3 (Leader Election).

**Tasks & Sub-Tasks:**

1.  **Define `StateStore` Go Interface (`internal/store/interface.go`)**
    *   **Purpose**: Create the abstraction layer for all state operations, decoupling the rest of the system from direct etcd dependencies.
    *   **Details**: Transcribe the Go interface from RFC 5.1 verbatim. Include the `KV`, `WatchEvent`, `EventType`, `Compare`, `Op`, and `OpType` structs/constants. (A rough sketch of the expected shape follows this task.)
    *   **Verification**: Code compiles. Interface definition matches the RFC.

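Since RFC 5.1 is not reproduced in this plan, the following is only an approximation of the shape the interface is expected to take, inferred from the method names used in Task 3 below; the authoritative definition remains the RFC, and `Compare`/`Op` are shown as placeholders:

```go
package store

import "context"

// KV is a stored key/value pair; Version carries the etcd ModRevision.
type KV struct {
	Key     string
	Value   []byte
	Version int64
}

type EventType string

const (
	EventTypePut    EventType = "PUT"
	EventTypeDelete EventType = "DELETE"
)

// WatchEvent is delivered for each change under a watched key or prefix.
type WatchEvent struct {
	Type EventType
	KV   KV
}

// Compare and Op are placeholders; their exact fields come from RFC 5.1.
type Compare struct{ Key, Value string }
type Op struct{ Type, Key, Value string }

// StateStore abstracts the cluster state backend (embedded etcd in KAT).
type StateStore interface {
	Put(ctx context.Context, key string, value []byte) error
	Get(ctx context.Context, key string) (*KV, error)
	Delete(ctx context.Context, key string) error
	List(ctx context.Context, prefix string) ([]KV, error)
	Watch(ctx context.Context, keyOrPrefix string, startRevision int64) (<-chan WatchEvent, error)
	Close() error

	// Campaign blocks until leadership is acquired; the returned context is
	// cancelled when leadership is lost.
	Campaign(ctx context.Context, leaderID string, leaseTTLSeconds int64) (context.Context, error)
	Resign(ctx context.Context) error
	GetLeader(ctx context.Context) (string, error)

	// DoTransaction applies onSuccess ops atomically if all checks pass,
	// otherwise onFailure ops.
	DoTransaction(ctx context.Context, checks []Compare, onSuccess, onFailure []Op) (bool, error)
}
```
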
2.  **Implement Embedded etcd Server Logic (`internal/store/etcd.go`)**
    *   **Purpose**: Allow `kat-agent` to run its own etcd instance for single-node clusters or as part of a multi-node quorum.
    *   **Details**:
        *   Use `go.etcd.io/etcd/server/v3/embed`.
        *   Function to start an embedded etcd server (a start/stop helper sketch follows this task):
            *   Input: configuration parameters (data directory, peer URLs, client URLs, name). These come from `cluster.kat` or defaults.
            *   Output: a running `embed.Etcd` instance or an error.
        *   Graceful shutdown logic for the embedded etcd server.
    *   **Verification**: A test can start and stop an embedded etcd server. The data directory is created and used.

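A minimal start helper could look roughly like this; it leans on the embed package's default listen addresses, and the real implementation would fill in peer/client URLs from `cluster.kat` instead:

```go
package store

import (
	"fmt"
	"time"

	"go.etcd.io/etcd/server/v3/embed"
)

// StartEmbeddedEtcd launches an embedded etcd server with default listen
// addresses (localhost:2379/2380) and blocks until it is ready to serve.
// Call (*embed.Etcd).Close() for graceful shutdown.
func StartEmbeddedEtcd(name, dataDir string) (*embed.Etcd, error) {
	cfg := embed.NewConfig()
	cfg.Name = name
	cfg.Dir = dataDir // data directory is created on first start

	e, err := embed.StartEtcd(cfg)
	if err != nil {
		return nil, err
	}
	select {
	case <-e.Server.ReadyNotify():
		return e, nil // ready to serve client requests
	case <-time.After(60 * time.Second):
		e.Close() // shut down if startup stalls
		return nil, fmt.Errorf("embedded etcd did not become ready in time")
	}
}
```

Tests can point `dataDir` at a temporary directory and call `Close()` on the returned instance to exercise the start/stop path.
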
3.  **Implement `StateStore` with etcd Backend (`internal/store/etcd.go`)**
    *   **Purpose**: Provide the concrete implementation for interacting with an etcd cluster (embedded or external).
    *   **Details**:
        *   Create a struct that implements the `StateStore` interface and holds an `etcd/clientv3.Client`.
        *   Implement `Put(ctx, key, value)`: use `client.Put()`.
        *   Implement `Get(ctx, key)`: use `client.Get()`. Handle key-not-found. Populate `KV.Version` with `ModRevision`.
        *   Implement `Delete(ctx, key)`: use `client.Delete()`.
        *   Implement `List(ctx, prefix)`: use `client.Get()` with `clientv3.WithPrefix()`.
        *   Implement `Watch(ctx, keyOrPrefix, startRevision)`: use `client.Watch()`. Translate etcd events into `WatchEvent`.
        *   Implement `Close()`: close the `clientv3.Client`.
        *   Implement `Campaign(ctx, leaderID, leaseTTLSeconds)`:
            *   Use `concurrency.NewSession()` to create a lease.
            *   Use `concurrency.NewElection()` and `election.Campaign()`.
            *   Return a context that is cancelled when leadership is lost (e.g., by watching the campaign context or the session's done channel). A sketch follows this task.
        *   Implement `Resign(ctx)`: use `election.Resign()`.
        *   Implement `GetLeader(ctx)`: observe the election or query the leader key.
        *   Implement `DoTransaction(ctx, checks, onSuccess, onFailure)`: use `client.Txn()` with `clientv3.Compare` and `clientv3.Op`.
    *   **Potential Challenges**: Correctly handling etcd transaction semantics, context propagation, and error translation. Efficiently managing watches.
    *   **Verification**:
        *   Unit tests for each `StateStore` method using a real embedded etcd instance (test-scoped).
        *   Verify `Put` then `Get` retrieves the correct value and version.
        *   Verify `List` with a prefix.
        *   Verify `Delete` removes the key.
        *   Verify `Watch` receives correct events for puts/deletes.
        *   Verify `DoTransaction` commits on success and rolls back on failure.

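The trickiest piece above is returning a context that ends with leadership. A sketch using `clientv3/concurrency` follows; the election key prefix and function signature are illustrative choices, not RFC-defined:

```go
package store

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

// campaign blocks until this node becomes leader, then returns a context that
// is cancelled when leadership is lost (lease expiry, session close, or parent
// context cancellation).
func campaign(ctx context.Context, cli *clientv3.Client, leaderID string, leaseTTLSeconds int) (context.Context, *concurrency.Election, error) {
	session, err := concurrency.NewSession(cli, concurrency.WithTTL(leaseTTLSeconds))
	if err != nil {
		return nil, nil, err
	}
	election := concurrency.NewElection(session, "/kat/leader_election") // assumed key prefix

	if err := election.Campaign(ctx, leaderID); err != nil {
		session.Close()
		return nil, nil, err
	}

	// Leadership context: cancelled when the session's lease is lost.
	leaderCtx, cancel := context.WithCancel(ctx)
	go func() {
		select {
		case <-session.Done(): // lease expired or session closed
		case <-ctx.Done():
		}
		cancel()
	}()
	return leaderCtx, election, nil
}
```
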
4.  **Integrate Leader Election into `kat-agent` (`cmd/kat-agent/main.go`, possibly a new `internal/leader/election.go`)**
    *   **Purpose**: Enable an agent instance to attempt to become the cluster leader.
    *   **Details**:
        *   The `kat-agent` main function initializes its `StateStore` client.
        *   A dedicated goroutine calls `StateStore.Campaign()`.
        *   The outcome of `Campaign` (e.g., leadership acquired, a context scoped to the leadership term) determines whether the agent activates its Leader-specific logic (Phase 2+).
        *   The leader ID could be the `nodeName` or a UUID. The lease TTL comes from `cluster.kat`.
    *   **Verification**:
        *   Start one `kat-agent` with etcd enabled; it should log "became leader".
        *   Start a second `kat-agent` configured to connect to the first's etcd; it should log "observing leader <leaderID>" or similar, but not become leader itself.
        *   If the first agent (the leader) is stopped, the second agent should eventually log "became leader".

5.  **Implement Basic `kat-agent init` Command (`cmd/kat-agent/main.go`, `internal/config/parse.go`)**
    *   **Purpose**: Initialize a new KAT cluster (single node initially).
    *   **Details**:
        *   Define the `init` subcommand in `kat-agent` using a CLI library (e.g., `cobra`); a sketch follows this task.
        *   Flag: `--config <path_to_cluster.kat>`.
        *   Parse `cluster.kat` (from Phase 0, now used to extract etcd peer/client URLs, data directory, backup paths, etc.).
        *   Generate a persistent cluster UID and store it in etcd (e.g., `/kat/config/cluster_uid`).
        *   Store the relevant `cluster.kat` parameters (or the whole sanitized config) in etcd (e.g., under `/kat/config/cluster_config`).
        *   Start the embedded etcd server using the parsed configuration.
        *   Initiate leader election.
    *   **Potential Challenges**: Ensuring `cluster.kat` parsing is robust. Handling existing data directories.
    *   **Milestone Verification**:
        *   Running `kat-agent init --config examples/cluster.kat` on a clean system:
            *   Starts the `kat-agent` process.
            *   Creates the etcd data directory.
            *   Logs "Successfully initialized etcd".
            *   Logs "Became leader: <nodeName>".
        *   Using `etcdctl` (or a simple `StateStore.Get` test client):
            *   Verify `/kat/config/cluster_uid` exists and holds a UUID.
            *   Verify `/kat/config/cluster_config` (or similar keys) contains data from `cluster.kat` (e.g., `clusterCIDR`, `serviceCIDR`, `agentPort`, `apiPort`).
            *   Verify the leader election key exists for the current leader.

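A skeleton of the `init` subcommand with `cobra` might look like the following; the commented-out calls refer to the hypothetical helpers sketched earlier and are not real APIs:

```go
package main

import (
	"fmt"
	"log"

	"github.com/spf13/cobra"
)

func newInitCmd() *cobra.Command {
	var configPath string

	cmd := &cobra.Command{
		Use:   "init",
		Short: "Initialize a new KAT cluster on this node",
		RunE: func(cmd *cobra.Command, args []string) error {
			if configPath == "" {
				return fmt.Errorf("--config is required")
			}
			// Hypothetical calls into the packages sketched earlier:
			//   cfg, err := config.ParseClusterKat(configPath)
			//   etcd, err := store.StartEmbeddedEtcd(cfg.NodeName, cfg.DataDir)
			//   leaderCtx, _, err := campaign(cmd.Context(), client, cfg.NodeName, cfg.LeaseTTLSeconds)
			log.Printf("kat-agent init: would parse %s, start etcd, and campaign for leadership", configPath)
			return nil
		},
	}
	cmd.Flags().StringVar(&configPath, "config", "", "path to cluster.kat")
	return cmd
}

func main() {
	root := &cobra.Command{Use: "kat-agent"}
	root.AddCommand(newInitCmd())
	if err := root.Execute(); err != nil {
		log.Fatal(err)
	}
}
```
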
**docs/plan/phase2.md** (new file, 98 lines)

# **Phase 2: Basic Agent & Node Lifecycle (Init, Join, PKI)**

*   **Goal**: Implement the secure registration of a new agent node to an existing leader, including PKI for mTLS, and establish periodic heartbeating for status updates and failure detection.
*   **RFC Sections Primarily Used**: 2.3 (Node Communication Protocol), 4.1.1 (Initial Leader Setup - CA), 4.1.2 (Agent Node Join - CSR), 10.1 (API Security - mTLS), 10.6 (Internal PKI), 4.1.3 (Node Heartbeat), 4.1.4 (Node Departure and Failure Detection - basic).

**Tasks & Sub-Tasks:**

1.  **Implement Internal PKI Utilities (`internal/pki/ca.go`, `internal/pki/certs.go`)**
    *   **Purpose**: Create and manage the Certificate Authority and sign certificates for mTLS.
    *   **Details**:
        *   `GenerateCA()`: Creates a new RSA key pair and a self-signed X.509 CA certificate. Saves them to disk (e.g., `/var/lib/kat/pki/ca.key`, `/var/lib/kat/pki/ca.crt`). The path comes from the parent directory of the `cluster.kat` `backupPath`, or a new `pkiPath` parameter. (A sketch follows this task.)
        *   `GenerateCertificateRequest(commonName, keyOutPath, csrOutPath)`: Used by the agent. Generates an RSA key and creates a CSR.
        *   `SignCertificateRequest(caKeyPath, caCertPath, csrData, certOutPath, duration)`: Used by the Leader. Loads the CA key/cert, parses the CSR, and issues a signed certificate.
        *   Helper functions to load keys and certs from disk.
    *   **Potential Challenges**: Handling cryptographic operations correctly and securely. Permissions for key storage.
    *   **Verification**: Unit tests for `GenerateCA`, `GenerateCertificateRequest`, and `SignCertificateRequest`. Generated certs should be verifiable against the CA.

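`GenerateCA` can be built entirely from the standard library. The sketch below uses illustrative defaults (4096-bit RSA key, 10-year validity, 0600/0644 file modes); the real signature is whatever `internal/pki` settles on:

```go
package pki

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"math/big"
	"os"
	"time"
)

// GenerateCA creates a self-signed CA key pair and writes both parts as PEM files.
func GenerateCA(keyPath, certPath string) error {
	key, err := rsa.GenerateKey(rand.Reader, 4096)
	if err != nil {
		return err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(time.Now().UnixNano()),
		Subject:               pkix.Name{CommonName: "kat-cluster-ca"},
		NotBefore:             time.Now().Add(-5 * time.Minute),
		NotAfter:              time.Now().AddDate(10, 0, 0), // 10-year CA lifetime
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageCRLSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return err
	}
	keyPEM := pem.EncodeToMemory(&pem.Block{Type: "RSA PRIVATE KEY", Bytes: x509.MarshalPKCS1PrivateKey(key)})
	certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	if err := os.WriteFile(keyPath, keyPEM, 0o600); err != nil { // private key: owner-only
		return err
	}
	return os.WriteFile(certPath, certPEM, 0o644)
}
```
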
2.  **Leader: Initialize CA & Its Own mTLS Certs on `init` (`cmd/kat-agent/main.go`)**
    *   **Purpose**: The first leader needs to establish the PKI and secure its own API endpoint.
    *   **Details**:
        *   During `kat-agent init`, after etcd is up and leadership is confirmed:
            *   Call `pki.GenerateCA()` if the CA files don't exist.
            *   Generate the leader's own server key and CSR (e.g., for `leader.kat.cluster.local`).
            *   Sign its own CSR using the CA to obtain its server certificate.
            *   Configure its (future) API HTTP server to use this server key/cert for TLS and to require client certs (mTLS).
    *   **Verification**: After `kat-agent init`, the CA key/cert and the leader's server key/cert exist in the configured PKI path.

3.  **Implement Basic API Server with mTLS on Leader (`internal/api/server.go`, `internal/api/router.go`)**
    *   **Purpose**: Provide the initial HTTP endpoints required for agent join, secured with mTLS. (A sketch follows this task.)
    *   **Details**:
        *   Set up `http.Server` with a `tls.Config`:
            *   `Certificates`: the Leader's server key/cert.
            *   `ClientAuth: tls.RequireAndVerifyClientCert`.
            *   `ClientCAs`: a pool containing the cluster CA certificate.
        *   Minimal router (e.g., `gorilla/mux` or `http.ServeMux`) for:
            *   `POST /internal/v1alpha1/join`: endpoint for the agent to submit its CSR. (Internal, as it is part of bootstrap.)
    *   **Verification**: An HTTPS client (e.g., `curl` with appropriate client certs) can connect to the leader's API port if it presents a cert signed by the cluster CA. The connection fails without a client cert or with a cert from a different CA.

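A minimal sketch of the mTLS server setup, assuming the PEM paths from Task 1; the `:9443` port and `handleJoin` handler are placeholders:

```go
package api

import (
	"crypto/tls"
	"crypto/x509"
	"net/http"
	"os"
)

// NewMTLSServer returns an *http.Server that presents the leader's certificate
// and only accepts clients whose certificates chain to the cluster CA.
func NewMTLSServer(addr, serverCert, serverKey, caCert string, handler http.Handler) (*http.Server, error) {
	cert, err := tls.LoadX509KeyPair(serverCert, serverKey)
	if err != nil {
		return nil, err
	}
	caPEM, err := os.ReadFile(caCert)
	if err != nil {
		return nil, err
	}
	caPool := x509.NewCertPool()
	caPool.AppendCertsFromPEM(caPEM)

	return &http.Server{
		Addr:    addr,
		Handler: handler,
		TLSConfig: &tls.Config{
			Certificates: []tls.Certificate{cert},
			ClientAuth:   tls.RequireAndVerifyClientCert, // enforce mTLS
			ClientCAs:    caPool,
			MinVersion:   tls.VersionTLS13,
		},
	}, nil
}

// Usage sketch:
//   mux := http.NewServeMux()
//   mux.HandleFunc("POST /internal/v1alpha1/join", handleJoin) // Go 1.22+ method pattern
//   srv, _ := NewMTLSServer(":9443", "leader.crt", "leader.key", "ca.crt", mux)
//   srv.ListenAndServeTLS("", "") // certs come from TLSConfig
```
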
4.  **Agent: `join` Command & CSR Submission (`cmd/kat-agent/main.go`, `internal/cli/join.go` or similar for agent logic)**
    *   **Purpose**: Allow a new agent to request to join the cluster and obtain its mTLS credentials.
    *   **Details**:
        *   `kat-agent join` subcommand:
            *   Flags: `--leader-api <ip:port>`, `--advertise-address <ip_or_interface_name>`, `--node-name <name>` (optional; the leader can generate one).
            *   Generate the agent's own key pair and CSR using `pki.GenerateCertificateRequest()`.
            *   Make an HTTP POST to the Leader's `/internal/v1alpha1/join` endpoint:
                *   Payload: CSR data, advertise address, requested node name, initial WireGuard public key (placeholder for now).
            *   Bootstrapping trust for this *initial* join needs care: the agent does not yet have a signed client certificate, so strict mTLS cannot apply to the first request. Options are to ship the cluster CA cert out of band (e.g., a `--leader-ca-cert` flag) so the agent can at least verify the leader, and/or to authorize the CSR signing with a pre-shared token. RFC 4.1.2 states that the "Leader, upon validating the join request (V1 has no strong token validation, relies on network trust)" signs the CSR, so for V1 we assume network trust: the agent connects, sends its CSR, and the leader signs it. This point should be revisited and clarified against the RFC.
            *   Receive the signed certificate and CA certificate from the Leader and store them locally.
    *   **Potential Challenges**: Securely bootstrapping trust for the very first communication with the leader to submit the CSR.
    *   **Verification**: The `kat-agent join` command:
        *   Generates a key/CSR.
        *   Successfully POSTs the CSR to the leader.
        *   Receives and saves its signed certificate and the CA certificate.

5.  **Leader: CSR Signing & Node Registration (Handler for `/internal/v1alpha1/join`)**
    *   **Purpose**: Validate the joining agent, sign its CSR, and record its registration.
    *   **Details**:
        *   Handler for `/internal/v1alpha1/join`:
            *   Parse the CSR, advertise address, and WG pubkey from the request.
            *   Validate (minimal for now).
            *   Generate a unique node name if none was provided. Assign a Node UID.
            *   Sign the CSR using `pki.SignCertificateRequest()`.
            *   Store the node registration data in etcd via `StateStore` (`/kat/nodes/registration/{nodeName}`: UID, advertise address, WG pubkey placeholder, join timestamp).
            *   Return the signed agent certificate and the cluster CA certificate to the agent.
    *   **Verification**:
        *   After an agent joins, its certificate is signed by the cluster CA.
        *   Node registration data appears correctly in etcd under `/kat/nodes/registration/{nodeName}`.

6.  **Agent: Establish mTLS Client for Subsequent Comms & Implement Heartbeating (`internal/agent/agent.go`)**
    *   **Purpose**: The agent uses its new mTLS certs to communicate status to the Leader. (A sketch follows this task.)
    *   **Details**:
        *   The agent configures its HTTP client to use its signed key/cert and the cluster CA cert for all future Leader communications.
        *   Periodic heartbeat (RFC 4.1.3):
            *   Ticker (e.g., every `agentTickSeconds` from `cluster.kat`, default 15s).
            *   On each tick, gather basic node status (node name, timestamp, initial resource capacity stubs).
            *   HTTP `POST` to the Leader's `/v1alpha1/nodes/{nodeName}/status` endpoint using the mTLS-configured client.
    *   **Verification**: The agent logs successful heartbeat POSTs.

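A sketch of the mTLS client plus heartbeat loop; the JSON payload fields are placeholders for whatever node status structure the protos end up defining:

```go
package agent

import (
	"bytes"
	"crypto/tls"
	"crypto/x509"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// newMTLSClient builds an HTTP client that presents the agent's certificate
// and trusts only the cluster CA.
func newMTLSClient(certPath, keyPath, caPath string) (*http.Client, error) {
	cert, err := tls.LoadX509KeyPair(certPath, keyPath)
	if err != nil {
		return nil, err
	}
	caPEM, err := os.ReadFile(caPath)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)
	return &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{Certificates: []tls.Certificate{cert}, RootCAs: pool},
		},
	}, nil
}

// heartbeatLoop POSTs a minimal status document every tick until stop closes.
func heartbeatLoop(client *http.Client, leaderAPI, nodeName string, tick time.Duration, stop <-chan struct{}) {
	url := fmt.Sprintf("%s/v1alpha1/nodes/%s/status", leaderAPI, nodeName)
	t := time.NewTicker(tick)
	defer t.Stop()
	for {
		select {
		case <-stop:
			return
		case <-t.C:
			body, _ := json.Marshal(map[string]any{
				"nodeName":  nodeName,
				"timestamp": time.Now().UTC().Format(time.RFC3339),
			})
			resp, err := client.Post(url, "application/json", bytes.NewReader(body))
			if err != nil {
				continue // transient errors: try again on the next tick
			}
			resp.Body.Close()
		}
	}
}
```
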
7.  **Leader: Receive Heartbeats & Basic Failure Detection (Handler for `/v1alpha1/nodes/{nodeName}/status`, `internal/leader/leader.go`)**
    *   **Purpose**: The Leader tracks agent status and detects failures.
    *   **Details**:
        *   API endpoint `/v1alpha1/nodes/{nodeName}/status` (mTLS required):
            *   Receives the status update from the agent.
            *   Updates the node's actual state in etcd (`/kat/nodes/status/{nodeName}/heartbeat`: timestamp, reported status). An etcd lease, renewed by agent heartbeats, could be used for this key.
        *   Failure detection (RFC 4.1.4):
            *   The Leader runs a reconciliation loop or periodic check.
            *   It scans `/kat/nodes/status/` in etcd.
            *   If a node's last heartbeat timestamp is older than `nodeLossTimeoutSeconds` (from `cluster.kat`), it updates the node's status in etcd to `NotReady` (e.g., `/kat/nodes/status/{nodeName}/condition: NotReady`).
    *   **Potential Challenges**: Efficiently scanning for dead nodes without excessive etcd load.
    *   **Milestone Verification**:
        *   `kat-agent init` runs as Leader, the CA is created, and its API is up with mTLS.
        *   A second `kat-agent join ...` process successfully:
            *   Generates a CSR and gets it signed by the Leader.
            *   Saves its cert and the CA cert.
            *   Starts sending heartbeats to the Leader using mTLS.
        *   The Leader logs receipt of heartbeats from the joined Agent.
        *   Node status (last heartbeat time) is updated in etcd by the Leader.
        *   If the joined Agent process is stopped, after `nodeLossTimeoutSeconds` the Leader updates the node's status in etcd to `NotReady`. This can be verified using `etcdctl` or a `StateStore.Get` call.

**docs/plan/phase3.md** (new file, 102 lines)

# **Phase 3: Container Runtime Interface & Local Podman Management**

*   **Goal**: Abstract container management operations behind a `ContainerRuntime` interface and implement it using the Podman CLI, enabling an agent to manage containers rootlessly based on (mocked) instructions.
*   **RFC Sections Primarily Used**: 6.1 (Runtime Interface Definition), 6.2 (Default Implementation: Podman), 6.3 (Rootless Execution Strategy).

**Tasks & Sub-Tasks:**

1.  **Define `ContainerRuntime` Go Interface (`internal/runtime/interface.go`)**
    *   **Purpose**: Abstract all container operations (build, pull, run, stop, inspect, logs, etc.).
    *   **Details**: Transcribe the Go interface from RFC 6.1 precisely. Include all specified structs (`ImageSummary`, `ContainerStatus`, `BuildOptions`, `PortMapping`, `VolumeMount`, `ResourceSpec`, `ContainerCreateOptions`, `ContainerHealthCheck`) and enums (`ContainerState`, `HealthState`).
    *   **Verification**: Code compiles. Interface and type definitions match the RFC.

2.  **Implement Podman Backend for `ContainerRuntime` (`internal/runtime/podman.go`) - Core Lifecycle Methods**
    *   **Purpose**: Translate `ContainerRuntime` calls into `podman` CLI commands. (A sketch of the command-execution helper follows this task.)
    *   **Details** (for each method, focus on these first):
        *   `PullImage(ctx, imageName, platform)`:
            *   Cmd: `podman pull {imageName}` (add `--platform` if specified).
            *   Parse the output to get the image ID (e.g., from `podman inspect {imageName} --format '{{.Id}}'`).
        *   `CreateContainer(ctx, opts ContainerCreateOptions)`:
            *   Cmd: `podman create ...`
            *   Translate `ContainerCreateOptions` into `podman create` flags:
                *   `--name {opts.InstanceID}` (KAT's unique ID for the instance).
                *   `--hostname {opts.Hostname}`.
                *   `--env` for `opts.Env`.
                *   `--label` for `opts.Labels` (include KAT ownership labels such as `kat.dws.rip/workload-name`, `kat.dws.rip/namespace`, `kat.dws.rip/instance-id`).
                *   `--restart {opts.RestartPolicy}` (map to Podman's "no", "on-failure", "always").
                *   Resource mapping: `--cpus` (for quota), `--cpu-shares`, `--memory`.
                *   `--publish` for `opts.Ports`.
                *   `--volume` for `opts.Volumes` (source is a host path, destination is the container path).
                *   `--network {opts.NetworkName}` and `--ip {opts.IPAddress}` if specified.
                *   `--user {opts.User}`.
                *   `--cap-add`, `--cap-drop`, `--security-opt`.
                *   Podman native healthcheck flags from `opts.HealthCheck`.
                *   `--systemd={opts.Systemd}`.
            *   Parse the output for the container ID.
        *   `StartContainer(ctx, containerID)`: Cmd: `podman start {containerID}`.
        *   `StopContainer(ctx, containerID, timeoutSeconds)`: Cmd: `podman stop -t {timeoutSeconds} {containerID}`.
        *   `RemoveContainer(ctx, containerID, force, removeVolumes)`: Cmd: `podman rm {containerID}` (add `--force`, `--volumes`).
        *   `GetContainerStatus(ctx, containerOrName)`:
            *   Cmd: `podman inspect {containerOrName}`.
            *   Parse the JSON output to populate the `ContainerStatus` struct (State, ExitCode, StartedAt, FinishedAt, Health, ImageID, ImageName, OverlayIP if available from inspect).
            *   The Podman health status needs to be mapped to `HealthState`.
        *   `StreamContainerLogs(ctx, containerID, follow, since, stdout, stderr)`:
            *   Cmd: `podman logs {containerID}` (add `--follow`, `--since`).
            *   Stream `os/exec.Cmd.Stdout` and `os/exec.Cmd.Stderr` to the provided `io.Writer`s.
    *   **Helper**: A utility function to run `podman` commands as a specific rootless user (see Rootless Execution below).
    *   **Potential Challenges**: Correctly mapping all `ContainerCreateOptions` to Podman flags. Parsing varied `podman inspect` output. Managing `os/exec` for logs. Robust error handling from CLI output.
    *   **Verification**:
        *   Unit tests for each implemented method, mocking `os/exec` calls to verify command construction and output parsing.
        *   *Requires Podman installed for integration-style unit tests*: tests that actually execute `podman` commands (e.g., pull alpine, create, start, inspect, stop, rm) and verify state changes.

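The helper could be a thin wrapper around `os/exec`, with the inspect parsing kept deliberately partial here; the `State.Status`/`State.ExitCode` fields are standard in `podman inspect` output, but the surrounding struct and error wrapping are illustrative:

```go
package runtime

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"os/exec"
)

// runPodman executes `podman <args...>` and returns stdout.
// The real implementation would also switch to the workload's unprivileged
// user (see the rootless execution task below).
func runPodman(ctx context.Context, args ...string) (string, error) {
	cmd := exec.CommandContext(ctx, "podman", args...)
	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return "", fmt.Errorf("podman %v: %w: %s", args, err, stderr.String())
	}
	return stdout.String(), nil
}

// inspectState extracts a few fields from `podman inspect` as an example of
// mapping inspect JSON onto a status struct; the field set here is partial.
func inspectState(ctx context.Context, containerID string) (status string, exitCode int, err error) {
	out, err := runPodman(ctx, "inspect", containerID)
	if err != nil {
		return "", 0, err
	}
	// podman inspect returns a JSON array with one element per container.
	var parsed []struct {
		State struct {
			Status   string `json:"Status"`
			ExitCode int    `json:"ExitCode"`
		} `json:"State"`
	}
	if err := json.Unmarshal([]byte(out), &parsed); err != nil || len(parsed) == 0 {
		return "", 0, fmt.Errorf("unexpected inspect output for %s", containerID)
	}
	return parsed[0].State.Status, parsed[0].State.ExitCode, nil
}
```
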
3.  **Implement Rootless Execution Strategy (`internal/runtime/podman.go` helpers, `internal/agent/runtime.go`)**
    *   **Purpose**: Ensure containers are run by unprivileged users, using systemd for supervision.
    *   **Details**:
        *   **User assumption**: For Phase 3, *assume* the dedicated user (e.g., `kat_wl_mywebapp`) already exists on the system and `loginctl enable-linger <username>` has been run manually. The username could be passed in `ContainerCreateOptions.User` or derived.
        *   **Podman command execution context**:
            *   The `kat-agent` process itself might run as root or as a privileged user.
            *   When executing `podman` commands for a workload, it MUST run them as the target unprivileged user.
            *   This can be achieved with `sudo -u {username} podman ...`, more directly via `nsenter`/`setuid` if the agent has the necessary capabilities, or by setting `XDG_RUNTIME_DIR` and `DBUS_SESSION_BUS_ADDRESS` appropriately for the target user when invoking `podman` via the systemd user session D-Bus API. *The simplest option for now is probably `sudo -u {username} podman ...` if the agent runs as root, or ensuring the agent itself runs as a user that can switch to the `kat_wl_*` users.*
            *   The RFC prefers "systemd user sessions", which usually means `systemctl --user ...`. To control another user's systemd session, a root agent can use `machinectl shell {username}@.host /bin/bash -c "systemctl --user ..."` or `systemd-run --user --machine={username}@.host ...`. If the agent is not root, it cannot directly control other users' systemd sessions. *This is a critical design point: how does the (potentially root) agent interact with user-level systemd?*
            *   RFC: "Agent uses `systemctl --user --machine={username}@.host ...`". This implies the agent has permission to do so (it likely runs as root or with specific polkit rules). A sketch of building such commands follows this task.
        *   **Systemd unit generation & management**:
            *   After `podman create ...` (or instead of a direct create, if `podman generate systemd` is used to create the definition), generate the systemd unit:
                `podman generate systemd --new --name {opts.InstanceID} --files --time 10 {imageNameUsedInCreate}`. This creates a `{opts.InstanceID}.service` file.
            *   The `ContainerRuntime` implementation needs to:
                1.  Execute `podman create` to establish the container definition (this lets Podman manage its internal state for the container ID).
                2.  Execute `podman generate systemd --name {containerID}` (using the ID from create) to get the unit file content.
                3.  Place this unit file in the target user's systemd path (e.g., `/home/{username}/.config/systemd/user/{opts.InstanceID}.service`, or `/etc/systemd/user/{opts.InstanceID}.service` if the agent is root and wants to enable it for any user).
                4.  Run `systemctl --user --machine={username}@.host daemon-reload`.
                5.  Start/enable: `systemctl --user --machine={username}@.host enable --now {opts.InstanceID}.service`.
            *   To stop: `systemctl --user --machine={username}@.host stop {opts.InstanceID}.service`.
            *   To remove: `systemctl --user --machine={username}@.host disable {opts.InstanceID}.service`, then `podman rm {opts.InstanceID}`, then remove the unit file.
            *   Status: `systemctl --user --machine={username}@.host status {opts.InstanceID}.service` (parse the output), or rely on `podman inspect`, which should reflect the systemd-managed state.
    *   **Potential Challenges**: Managing permissions for interacting with other users' systemd sessions. Correctly placing and cleaning up systemd unit files. Ensuring `XDG_RUNTIME_DIR` is set correctly for rootless Podman if not using systemd units for direct `podman run`. Systemd unit generation nuances.
    *   **Verification**:
        *   A test in `internal/agent/runtime_test.go` (or similar) can take mock `ContainerCreateOptions`.
        *   It calls the (mocked or real) `ContainerRuntime` implementation.
        *   Verify:
            *   Podman commands are constructed to run as the target unprivileged user.
            *   A systemd unit file is generated for the container.
            *   `systemctl --user --machine...` commands are invoked correctly to manage the service.
            *   The container is actually started (verify with `podman ps -a --filter label=kat.dws.rip/instance-id={instanceID}` as the target user).
            *   Logs can be retrieved.
            *   The container can be stopped and removed, including its systemd unit.

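A small helper for driving another user's systemd session, assuming the agent runs with sufficient privileges (root or suitable polkit rules); the username and unit name in the usage comments are placeholders:

```go
package runtime

import (
	"context"
	"fmt"
	"os/exec"
)

// userSystemctl runs `systemctl --user --machine={username}@.host <args...>`,
// i.e. it drives the target user's systemd session from the agent, as the RFC
// suggests for managing rootless workload units.
func userSystemctl(ctx context.Context, username string, args ...string) error {
	full := append([]string{
		"--user",
		fmt.Sprintf("--machine=%s@.host", username),
	}, args...)
	cmd := exec.CommandContext(ctx, "systemctl", full...)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("systemctl %v: %w: %s", full, err, out)
	}
	return nil
}

// Usage sketch for one instance lifecycle (unit file placement not shown):
//   _ = userSystemctl(ctx, "kat_wl_mywebapp", "daemon-reload")
//   _ = userSystemctl(ctx, "kat_wl_mywebapp", "enable", "--now", "myinstance.service")
//   _ = userSystemctl(ctx, "kat_wl_mywebapp", "stop", "myinstance.service")
```
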
*   **Milestone Verification**:
    *   The `ContainerRuntime` Go interface is fully defined as per RFC 6.1.
    *   The Podman implementation of the core lifecycle methods (`PullImage`, `CreateContainer` (leading to systemd unit generation), `StartContainer` (via systemd enable/start), `StopContainer` (via systemd stop), `RemoveContainer` (via systemd disable + `podman rm` + unit file removal), `GetContainerStatus`, `StreamContainerLogs`) is functional.
    *   An `internal/agent` test (or a temporary `main.go` test harness) can:
        1.  Define `ContainerCreateOptions` for a simple image like `docker.io/library/alpine` with a command like `sleep 30`.
        2.  Specify a (manually pre-created and linger-enabled) unprivileged username.
        3.  Call the `ContainerRuntime` methods.
        4.  **Result**:
            *   The alpine image is pulled (if not present).
            *   A systemd user service unit is generated and placed correctly for the specified user.
            *   The service is started using `systemctl --user --machine...`.
            *   `podman ps --all --filter label=kat.dws.rip/instance-id=...` (run as the target user, or by root seeing all containers) shows the container running or having run.
            *   Logs can be retrieved using `StreamContainerLogs`.
            *   The container can be stopped and removed (including its systemd unit file).
            *   All container operations are verifiably performed by the specified unprivileged user.

This detailed plan should provide a clearer path for implementing these initial crucial phases. Remember to keep testing iterative and focused on the RFC specifications.