# Implementation Plan

This plan breaks down the implementation into manageable phases, each with a testable milestone.

**Phase 0: Project Setup & Core Types**
*   **Goal**: Basic project structure, version control, build system, and core data type definitions.
*   **Tasks**:
    1.  Initialize Git repository, `go.mod`.
    2.  Create initial directory structure (as above).
    3.  Define core Proto3 messages in `api/v1alpha1/kat.proto` for: `Workload`, `VirtualLoadBalancer`, `JobDefinition`, `BuildDefinition`, `Namespace`, `Node` (internal representation), `ClusterConfiguration`.
    4.  Set up `scripts/gen-proto.sh` and generate initial Go types.
    5.  Implement parsing and basic validation for `cluster.kat` (`internal/config/parse.go`).
    6.  Implement parsing and basic validation for Quadlet files (`workload.kat`, etc.) and their `tar.gz` packing/unpacking.
*   **Milestone**:
    *   `make generate` successfully creates Go types from protos.
    *   Unit tests pass for parsing `cluster.kat` and a sample Quadlet directory (as `tar.gz`) into their respective Go structs.
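
The `tar.gz` round trip from task 6 can be sketched with the Go standard library alone. This is a minimal illustration, not the project's actual code: the function names `packQuadlets` and `unpackQuadlets` are hypothetical, and the file content used in `main` is a placeholder rather than real Quadlet syntax.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// packQuadlets (hypothetical name) writes the given files into an
// in-memory tar.gz archive.
func packQuadlets(files map[string][]byte) ([]byte, error) {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	tw := tar.NewWriter(gz)
	for name, data := range files {
		hdr := &tar.Header{
			Name:     name,
			Typeflag: tar.TypeReg,
			Mode:     0o644,
			Size:     int64(len(data)),
		}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write(data); err != nil {
			return nil, err
		}
	}
	// Close the tar writer first, then gzip, so both trailers are flushed.
	if err := tw.Close(); err != nil {
		return nil, err
	}
	if err := gz.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// unpackQuadlets (hypothetical name) reads a tar.gz archive and returns the
// contents of its regular files keyed by name.
func unpackQuadlets(archive []byte) (map[string][]byte, error) {
	gz, err := gzip.NewReader(bytes.NewReader(archive))
	if err != nil {
		return nil, err
	}
	defer gz.Close()

	tr := tar.NewReader(gz)
	files := make(map[string][]byte)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return files, nil
		}
		if err != nil {
			return nil, err
		}
		if hdr.Typeflag != tar.TypeReg {
			continue // skip directories, symlinks, and other entry types
		}
		data, err := io.ReadAll(tr)
		if err != nil {
			return nil, err
		}
		files[hdr.Name] = data
	}
}

func main() {
	// Placeholder content; the real Quadlet format is defined elsewhere.
	in := map[string][]byte{"workload.kat": []byte("placeholder")}
	archive, err := packQuadlets(in)
	if err != nil {
		panic(err)
	}
	out, err := unpackQuadlets(archive)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d file(s), workload.kat = %q\n", len(out), out["workload.kat"])
}
```

A production version would likely also cap the archive and per-file size before reading, since the archive arrives over the API.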

**Phase 1: State Management & Leader Election**
*   **Goal**: A functional embedded etcd and leader election mechanism.
*   **Tasks**:
    1.  Implement the `StateStore` interface (RFC 5.1) with an etcd backend (`internal/store/etcd.go`).
    2.  Integrate embedded etcd server into `kat-agent` (RFC 2.2, 5.2), configurable via `cluster.kat` parameters.
    3.  Implement leader election using `go.etcd.io/etcd/client/v3/concurrency` (RFC 5.3).
    4.  Basic `kat-agent init` functionality:
        *   Parse `cluster.kat`.
        *   Start single-node embedded etcd.
        *   Campaign for and become leader.
        *   Store initial cluster configuration (UID, CIDRs from `cluster.kat`) in etcd.
*   **Milestone**:
    *   A single `kat-agent init --config cluster.kat` process starts, initializes etcd, and logs that it has become the leader.
    *   The cluster configuration from `cluster.kat` can be verified in etcd using an etcd client.
    *   `StateStore` interface methods (`Put`, `Get`, `Delete`, `List`) are testable against the embedded etcd.

**Phase 2: Basic Agent & Node Lifecycle (Init, Join, PKI)**
*   **Goal**: Initial Leader setup, a second Agent joining with mTLS, and heartbeating.
*   **Tasks**:
    1.  Implement Internal PKI (RFC 10.6) in `internal/pki/`:
        *   CA key/cert generation on `kat-agent init`.
        *   CSR generation by agent on join.
        *   CSR signing by Leader.
    2.  Implement initial Node Communication Protocol (RFC 2.3) for join:
        *   Agent (`kat-agent join --leader-api <...> --advertise-address <...>`) sends CSR to Leader.
        *   Leader validates, signs, returns certs & CA. Stores node registration (name, UID, advertise addr, WG pubkey placeholder) in etcd.
    3.  Implement basic mTLS for this join communication.
    4.  Implement Node Heartbeat (`POST /v1alpha1/nodes/{nodeName}/status`) from Agent to Leader (RFC 4.1.3). Leader updates node status in etcd.
    5.  Leader implements basic failure detection (marks Node `NotReady` in etcd if heartbeats cease) (RFC 4.1.4).
*   **Milestone**:
    *   `kat-agent init` establishes a Leader with a CA.
    *   `kat-agent join` allows a second agent to securely register with the Leader, obtain certificates, and store its info in etcd.
    *   Leader's API receives heartbeats from the joined Agent.
    *   If a joined Agent is stopped, the Leader marks its status as `NotReady` in etcd after `nodeLossTimeoutSeconds`.
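
The CSR flow in task 1 (CA created at init, CSR generated by the joining agent, CSR signed by the Leader) can be sketched with `crypto/x509`. This is an illustration under assumptions: the key type (ECDSA P-256), validity periods, subject names, and the function names `newCA`, `newCSR`, and `signCSR` are all placeholders, and RFC 10.6 remains the authority on the real parameters.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newCA creates a self-signed CA certificate and key, as `kat-agent init` might.
func newCA() (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "kat-ca"}, // placeholder name
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(24 * time.Hour), // placeholder lifetime
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
	}
	// Self-signed: the template is its own parent.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// newCSR is what a joining agent would do: generate a key and a DER-encoded CSR.
func newCSR(nodeName string) ([]byte, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	req := &x509.CertificateRequest{Subject: pkix.Name{CommonName: nodeName}}
	der, err := x509.CreateCertificateRequest(rand.Reader, req, key)
	return der, key, err
}

// signCSR is the Leader side: validate the CSR's self-signature, then issue
// a leaf certificate usable for both ends of an mTLS connection.
func signCSR(ca *x509.Certificate, caKey *ecdsa.PrivateKey, csrDER []byte, serial int64) (*x509.Certificate, error) {
	csr, err := x509.ParseCertificateRequest(csrDER)
	if err != nil {
		return nil, err
	}
	if err := csr.CheckSignature(); err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(serial),
		Subject:      csr.Subject,
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(12 * time.Hour), // placeholder lifetime
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth, x509.ExtKeyUsageServerAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, csr.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	ca, caKey, err := newCA()
	if err != nil {
		panic(err)
	}
	csr, _, err := newCSR("node-2")
	if err != nil {
		panic(err)
	}
	cert, err := signCSR(ca, caKey, csr, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println("issued for:", cert.Subject.CommonName, "verified:", cert.CheckSignatureFrom(ca) == nil)
}
```

A real Leader would also need to decide which CSRs it is willing to sign (for example via a join token) and would generate unique serial numbers; both are outside this sketch.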

**Phase 3: Container Runtime Interface & Local Podman Management**
*   **Goal**: The Agent can manage containers locally with Podman, exclusively through the `ContainerRuntime` interface (CRI).
*   **Tasks**:
    1.  Define `ContainerRuntime` interface in `internal/runtime/interface.go` (RFC 6.1).
    2.  Implement the Podman backend for `ContainerRuntime` in `internal/runtime/podman.go` (RFC 6.2). Focus on: `CreateContainer`, `StartContainer`, `StopContainer`, `RemoveContainer`, `GetContainerStatus`, `PullImage`, `StreamContainerLogs`.
    3.  Implement rootless execution strategy (RFC 6.3):
        *   Mechanism to ensure dedicated user accounts (initially, assume pre-existing or manual creation for tests).
        *   Podman systemd unit generation (`podman generate systemd`).
        *   Managing units via `systemctl --user`.
*   **Milestone**:
    *   Agent process (upon a mocked internal command) can pull a specified image (e.g., `nginx`) and run it rootlessly using Podman and systemd user services.
    *   Agent can stop, remove, and get the status/logs of this container.
    *   All operations are performed via the `ContainerRuntime` interface.

**Phase 4: Basic Workload Deployment (Single Node, Image Source Only, No Networking)**
*   **Goal**: The Leader can instruct an Agent to run a simple `Service` workload (single container, image source), either on the Leader's own node (if it also runs as an Agent) or on a single joined Agent.
*   **Tasks**:
    1.  Implement basic API endpoints on Leader for Workload CRUD (`POST/PUT /v1alpha1/n/{ns}/workloads` accepting `tar.gz`) (RFC 8.3, 4.2). Leader stores Quadlet files in etcd.
    2.  Simplistic scheduling (RFC 4.4): If only one agent node, assign workload to it. Leader creates an "assignment" or "task" for the agent in etcd.
    3.  Agent watches for assigned tasks from etcd.
    4.  On receiving a task, Agent uses `ContainerRuntime` to deploy the container (image from `workload.kat`).
    5.  Agent reports container instance status in its heartbeat. Leader updates overall workload status in etcd.
    6.  Basic `katcall apply -f <dir>` and `katcall get workload <name>` functionality.
*   **Milestone**:
    *   User can deploy a simple single-container `Service` (e.g., `nginx`) using `katcall apply`.
    *   The container runs on the designated Agent node.
    *   `katcall get workload my-service` shows its status as running.
    *   `katcall logs <instanceID>` streams container logs.

**Phase 5: Overlay Networking (WireGuard) & IPAM**
*   **Goal**: Nodes establish a WireGuard overlay network. Leader allocates IPs for containers.
*   **Tasks**:
    1.  Implement WireGuard setup on Agents (`internal/network/wireguard.go`) (RFC 7.1):
        *   Key generation, public key reporting to Leader during join/heartbeat.
        *   Leader stores Node WireGuard public keys and advertise endpoints in etcd.
        *   Agent configures its `kat0` interface and peers by watching etcd.
    2.  Implement IPAM in Leader (`internal/leader/ipam.go`) (RFC 7.2):
        *   Node subnet allocation from `clusterCIDR` (from `cluster.kat`).
        *   Container IP allocation from the node's subnet when a workload instance is scheduled.
    3.  Agent uses the Leader-assigned IP when creating the container network/container with Podman.
*   **Milestone**:
    *   All joined KAT nodes form a WireGuard mesh; `wg show` on nodes confirms peer connections.
    *   Leader allocates a unique overlay IP for each container instance.
    *   Containers on different nodes can ping each other using their overlay IPs.

**Phase 6: Distributed Agent DNS & Service Discovery**
*   **Goal**: Basic service discovery using agent-local DNS for deployed services.
*   **Tasks**:
    1.  Implement Agent-local DNS server (`internal/agent/dns_resolver.go`) using `miekg/dns` (RFC 7.3).
    2.  Leader writes DNS `A` records to etcd (e.g., `<workloadName>.<namespace>.<clusterDomain> -> <containerOverlayIP>`) when service instances become healthy/active.
    3.  Agent DNS server watches etcd for DNS records and updates its local zones.
    4.  Agent configures `/etc/resolv.conf` in managed containers to use its `kat0` IP as nameserver.
*   **Milestone**:
    *   A service (`service-a`) deployed on one node can be resolved by its DNS name (e.g., `service-a.default.kat.cluster.local`) by a container on another node.
    *   DNS resolution provides the correct overlay IP(s) of `service-a` instances.

**Phase 7: Advanced Workload Features & Full Scheduling**
*   **Goal**: Implement `Job`, `DaemonService`, richer scheduling, health checks, volumes, and restart policies.
*   **Tasks**:
    1.  Implement `Job` type (RFC 3.4, 4.8): scheduling, completion tracking, backoff.
    2.  Implement `DaemonService` type (RFC 3.2): ensures one instance per eligible node.
    3.  Implement full scheduling logic in Leader (RFC 4.4): resource requests (`cpu`, `memory`), `nodeSelector`, Taint/Toleration, GPU (basic), "most empty" scoring.
    4.  Implement `VirtualLoadBalancer.kat` parsing and Agent-side health checks (RFC 3.3, 4.6.3). Leader uses health status for service readiness and DNS.
    5.  Implement container `restartPolicy` (RFC 3.2, 4.6.4) via systemd unit configuration.
    6.  Implement `volumeMounts` and `volumes` (RFC 3.2, 4.7): `HostMount`, `SimpleClusterStorage`. Agent ensures paths are set up.
*   **Milestone**:
    *   `Job`s run to completion and their status is tracked.
    *   `DaemonService`s run exactly one instance on each eligible node.
    *   Services are scheduled according to resource requests, selectors, and taints.
    *   Unhealthy service instances are identified by health checks and reflected in status.
    *   Containers restart based on their policy.
    *   Workloads can mount host paths and simple cluster storage.
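
The filter-then-score structure of task 3 can be sketched as follows. The filter covers resource requests and `nodeSelector` only; taints, tolerations, and GPUs are left out for brevity. The scoring formula (average fraction of CPU and memory that would remain free after placement) is one plausible reading of "most empty" and is an assumption, as are all type and field names and the `disk=ssd` label.

```go
package main

import "fmt"

// node and request are illustrative types, not the project's real structs.
type node struct {
	Name                         string
	Labels                       map[string]string
	TotalCPUMilli, FreeCPUMilli  int64
	TotalMemBytes, FreeMemBytes  int64
}

type request struct {
	CPUMilli, MemBytes int64
	NodeSelector       map[string]string
}

// fits reports whether the node matches every selector label and has
// enough free CPU and memory for the request.
func fits(n node, r request) bool {
	for k, want := range r.NodeSelector {
		if got, ok := n.Labels[k]; !ok || got != want {
			return false
		}
	}
	return n.TotalCPUMilli > 0 && n.TotalMemBytes > 0 &&
		n.FreeCPUMilli >= r.CPUMilli && n.FreeMemBytes >= r.MemBytes
}

// score is higher for nodes that would stay emptier after placement.
// Callers must check fits first, which also guarantees non-zero totals.
func score(n node, r request) float64 {
	cpu := float64(n.FreeCPUMilli-r.CPUMilli) / float64(n.TotalCPUMilli)
	mem := float64(n.FreeMemBytes-r.MemBytes) / float64(n.TotalMemBytes)
	return (cpu + mem) / 2
}

// pick returns the best-scoring node that fits; ties go to the earlier node.
func pick(nodes []node, r request) (string, bool) {
	best, bestScore, found := "", -1.0, false
	for _, n := range nodes {
		if !fits(n, r) {
			continue
		}
		if s := score(n, r); !found || s > bestScore {
			best, bestScore, found = n.Name, s, true
		}
	}
	return best, found
}

func main() {
	const gib = int64(1) << 30
	ssd := map[string]string{"disk": "ssd"} // example label only
	nodes := []node{
		{Name: "a", Labels: ssd, TotalCPUMilli: 4000, FreeCPUMilli: 1000, TotalMemBytes: 8 * gib, FreeMemBytes: 4 * gib},
		{Name: "b", Labels: ssd, TotalCPUMilli: 4000, FreeCPUMilli: 3000, TotalMemBytes: 8 * gib, FreeMemBytes: 6 * gib},
		{Name: "c", TotalCPUMilli: 4000, FreeCPUMilli: 4000, TotalMemBytes: 8 * gib, FreeMemBytes: 8 * gib},
	}
	name, ok := pick(nodes, request{CPUMilli: 500, MemBytes: gib, NodeSelector: ssd})
	fmt.Println(name, ok) // prints "b true": c is emptier but lacks the label
}
```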

**Phase 8: Git-Native Builds & Workload Updates/Rollbacks**
*   **Goal**: Enable on-agent builds from Git sources and implement workload update strategies.
*   **Tasks**:
    1.  Implement `BuildDefinition.kat` parsing (RFC 3.5).
    2.  Implement Git-native build process on Agent (`internal/agent/build.go`) using Podman (RFC 4.3).
    3.  Implement `cacheImage` pull/push for build caching (Agent needs registry credentials configured locally).
    4.  Implement workload update strategies in Leader (RFC 4.5): `Simultaneous`, `Rolling` (with `maxSurge`).
    5.  Implement manual rollback mechanism (`katcall rollback workload <name>`) (RFC 4.5).
*   **Milestone**:
    *   A workload can be successfully deployed from a Git repository source, with the image built on the agent.
    *   A deployed service can be updated using the `Rolling` strategy with observable incremental instance replacement.
    *   A workload can be rolled back to its previous version.
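
The `Rolling` strategy with `maxSurge` can be reasoned about as a sequence of batches: start up to `maxSurge` new instances, wait for them to become healthy, stop the same number of old instances, and repeat. The sketch below computes only that batch plan. It assumes `maxSurge` means "extra instances allowed above the desired replica count", which should be checked against RFC 4.5; `step` and `rollingPlan` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// step describes one batch: Start new instances, then, once they are
// healthy, Stop the same number of old ones.
type step struct {
	Start, Stop int
}

// rollingPlan splits the replacement of `replicas` instances into batches
// of at most maxSurge, so the instance count never exceeds
// replicas+maxSurge and never drops below replicas.
func rollingPlan(replicas, maxSurge int) ([]step, error) {
	if replicas < 0 {
		return nil, errors.New("replicas must not be negative")
	}
	if maxSurge < 1 {
		return nil, errors.New("maxSurge must be at least 1")
	}
	var plan []step
	for remaining := replicas; remaining > 0; {
		n := maxSurge
		if remaining < n {
			n = remaining
		}
		plan = append(plan, step{Start: n, Stop: n})
		remaining -= n
	}
	return plan, nil
}

func main() {
	plan, err := rollingPlan(5, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(plan) // prints "[{2 2} {2 2} {1 1}]"
}
```

Under the same model, `Simultaneous` corresponds to stopping all old instances and starting all new ones in one step, without the surge bound.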

**Phase 9: Full API Implementation & CLI (`katcall`) Polish**
*   **Goal**: A robust and comprehensive HTTP API and `katcall` CLI.
*   **Tasks**:
    1.  Implement all remaining API endpoints and features as per RFC Section 8. Ensure Proto3/JSON contracts are met.
    2.  Implement API authentication: bearer token for `katcall` (RFC 8.1, 10.1).
    3.  Flesh out `katcall` with all necessary commands and options (see the `katcall` entry in RFC 1.5, Terminology, and the hints in RFC 8.3):
        *   `drain <nodeName>`, `get nodes/namespaces`, `describe <resource>`, etc.
    4.  Improve error reporting and user feedback in CLI and API.
*   **Milestone**:
    *   All functionalities defined in the RFC can be managed and introspected via the `katcall` CLI interacting with the secure KAT API.
    *   API documentation (e.g., Swagger/OpenAPI generated from protos or code) is available.

**Phase 10: Observability, Backup/Restore, Advanced Features & Security**
*   **Goal**: Implement observability features, state backup/restore, and other advanced functionalities.
*   **Tasks**:
    1.  Implement Agent and Leader logging to the systemd journal or to files (RFC 9.1). The API for streaming container logs is already delivered by the Phase 4 milestone.
    2.  Implement basic Metrics exposure (`/metrics` JSON endpoint on Leader/Agent) (RFC 9.2).
    3.  Implement Events system: Leader records significant events in etcd, API to query events (RFC 9.3).
    4.  Implement Leader-driven etcd state backup (`etcdctl snapshot save`) (RFC 5.4).
    5.  Document and test the etcd state restore procedure (RFC 5.5).
    6.  Implement Detached Node Operation and Rejoin (RFC 4.9).
    7.  Provide standard Quadlet files and documentation for the Traefik Ingress recipe (RFC 7.4).
    8.  Review and harden security aspects: API security, build security, network security, secrets handling (document current limitations as per RFC 10.5).
*   **Milestone**:
    *   Container logs are streamable via `katcall logs`. Agent/Leader logs are accessible.
    *   Basic metrics are available via API. Cluster events can be listed.
    *   Automated etcd backups are created by the Leader. Restore procedure is tested.
    *   Detached node can operate locally and rejoin the main cluster.
    *   Traefik can be deployed using provided Quadlets to achieve ingress.

**Phase 11: Testing, Documentation, and Release Preparation**
*   **Goal**: Ensure KAT v1.0 is robust, well-documented, and ready for release.
*   **Tasks**:
    1.  Write comprehensive unit tests for all core logic.
    2.  Develop integration tests for component interactions (e.g., Leader-Agent, Agent-Podman).
    3.  Create an E2E test suite using `katcall` to simulate real user scenarios.
    4.  Write detailed user documentation: installation, configuration, tutorials for all features, troubleshooting.
    5.  Perform performance testing on key operations (e.g., deployment speed, agent density).
    6.  Conduct a thorough security review/audit against RFC security considerations.
    7.  Establish a release process: versioning, changelog, building release artifacts.
*   **Milestone**:
    *   The unit, integration, and E2E suites together provide high test coverage of core logic.
    *   Comprehensive user and API documentation is complete.
    *   Known critical bugs are fixed.
    *   KAT v1.0 is packaged and ready for its first official release.