# Implementation Plan

This plan breaks down the implementation into manageable phases, each with a testable milestone.

**Phase 0: Project Setup & Core Types**

* **Goal**: Basic project structure, version control, build system, and core data type definitions.
* **Tasks**:
    1. Initialize Git repository and `go.mod`.
    2. Create initial directory structure (as above).
    3. Define core Proto3 messages in `api/v1alpha1/kat.proto` for: `Workload`, `VirtualLoadBalancer`, `JobDefinition`, `BuildDefinition`, `Namespace`, `Node` (internal representation), and `ClusterConfiguration`.
    4. Set up `scripts/gen-proto.sh` and generate initial Go types.
    5. Implement parsing and basic validation for `cluster.kat` (`internal/config/parse.go`).
    6. Implement parsing and basic validation for Quadlet files (`workload.kat`, etc.) and their `tar.gz` packaging/unpacking (see the sketch below).
* **Milestone**:
    * `make generate` successfully creates Go types from protos.
    * Unit tests pass for parsing `cluster.kat` and a sample Quadlet directory (as `tar.gz`) into their respective Go structs.
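
To illustrate the `tar.gz` handling in task 6, a minimal sketch that unpacks a Quadlet archive into memory using only the Go standard library. The flat archive layout, the `.kat` suffix filter, and the `UnpackQuadlets` name are illustrative assumptions, not RFC requirements:

```go
package quadlet

import (
	"archive/tar"
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

// UnpackQuadlets extracts *.kat files from a workload tar.gz archive into an
// in-memory map keyed by file name, ready for parsing and validation.
func UnpackQuadlets(r io.Reader) (map[string][]byte, error) {
	gz, err := gzip.NewReader(r)
	if err != nil {
		return nil, fmt.Errorf("opening gzip stream: %w", err)
	}
	defer gz.Close()

	files := make(map[string][]byte)
	tr := tar.NewReader(gz)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, fmt.Errorf("reading tar entry: %w", err)
		}
		// Only regular *.kat files are of interest; skip directories etc.
		if hdr.Typeflag != tar.TypeReg || !strings.HasSuffix(hdr.Name, ".kat") {
			continue
		}
		data, err := io.ReadAll(tr)
		if err != nil {
			return nil, fmt.Errorf("reading %s: %w", hdr.Name, err)
		}
		files[hdr.Name] = data
	}
	return files, nil
}
```

Keeping the unpacked files in memory makes validation independent of the filesystem and easy to unit-test against fixture archives.
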
**Phase 1: State Management & Leader Election**

* **Goal**: A functional embedded etcd and leader election mechanism.
* **Tasks**:
    1. Implement the `StateStore` interface (RFC 5.1) with an etcd backend (`internal/store/etcd.go`).
    2. Integrate the embedded etcd server into `kat-agent` (RFC 2.2, 5.2), configurable via `cluster.kat` parameters.
    3. Implement leader election using `go.etcd.io/etcd/client/v3/concurrency` (RFC 5.3); see the sketch below.
    4. Basic `kat-agent init` functionality:
        * Parse `cluster.kat`.
        * Start a single-node embedded etcd.
        * Campaign for and become leader.
        * Store the initial cluster configuration (UID, CIDRs from `cluster.kat`) in etcd.
* **Milestone**:
    * A single `kat-agent init --config cluster.kat` process starts, initializes etcd, and logs that it has become the leader.
    * The cluster configuration from `cluster.kat` can be verified in etcd using an etcd client.
    * `StateStore` interface methods (`Put`, `Get`, `Delete`, `List`) are testable against the embedded etcd.
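
A minimal sketch of the task 3 leader election, assuming the embedded etcd is reachable on `localhost:2379` and a `/kat/leader` election prefix (both placeholders). The session's lease is what provides failover: if the leader process dies, its lease expires and another candidate's `Campaign` call succeeds:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // embedded etcd endpoint (assumed)
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// The session's lease keeps the leader key alive; a 15s TTL means a dead
	// leader is superseded within roughly that window.
	sess, err := concurrency.NewSession(cli, concurrency.WithTTL(15))
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	election := concurrency.NewElection(sess, "/kat/leader") // assumed key prefix
	if err := election.Campaign(context.Background(), "node-1"); err != nil {
		log.Fatal(err)
	}
	log.Println("became leader")
}
```
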
**Phase 2: Basic Agent & Node Lifecycle (Init, Join, PKI)**

* **Goal**: Initial Leader setup, a second Agent joining with mTLS, and heartbeating.
* **Tasks**:
    1. Implement the internal PKI (RFC 10.6) in `internal/pki/`:
        * CA key/cert generation on `kat-agent init`.
        * CSR generation by the Agent on join (see the sketch below).
        * CSR signing by the Leader.
    2. Implement the initial Node Communication Protocol (RFC 2.3) for join:
        * Agent (`kat-agent join --leader-api <...> --advertise-address <...>`) sends a CSR to the Leader.
        * Leader validates, signs, and returns the certificates and CA, then stores the node registration (name, UID, advertise address, WireGuard pubkey placeholder) in etcd.
    3. Implement basic mTLS for this join communication.
    4. Implement the Node Heartbeat (`POST /v1alpha1/nodes/{nodeName}/status`) from Agent to Leader (RFC 4.1.3). The Leader updates node status in etcd.
    5. Leader implements basic failure detection: it marks a Node `NotReady` in etcd if heartbeats cease (RFC 4.1.4).
* **Milestone**:
    * `kat-agent init` establishes a Leader with a CA.
    * `kat-agent join` allows a second agent to securely register with the Leader, obtain certificates, and store its info in etcd.
    * The Leader's API receives heartbeats from the joined Agent.
    * If a joined Agent is stopped, the Leader marks its status as `NotReady` in etcd after `nodeLossTimeoutSeconds`.
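
A sketch of the agent-side CSR generation from task 1, using only the standard library; the P-256 key type and the CommonName-based subject are assumptions, with RFC 10.6 governing the real details:

```go
package pki

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
)

// NewNodeCSR returns a PEM-encoded certificate signing request for a joining
// node, plus the node's private key. The Leader validates and signs the CSR.
func NewNodeCSR(nodeName string) (csrPEM []byte, key *ecdsa.PrivateKey, err error) {
	key, err = ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.CertificateRequest{
		Subject: pkix.Name{CommonName: nodeName}, // assumed naming scheme
	}
	der, err := x509.CreateCertificateRequest(rand.Reader, tmpl, key)
	if err != nil {
		return nil, nil, err
	}
	csrPEM = pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE REQUEST", Bytes: der})
	return csrPEM, key, nil
}
```
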
**Phase 3: Container Runtime Interface & Local Podman Management**

* **Goal**: Agent can manage containers locally via Podman using the CRI.
* **Tasks**:
    1. Define the `ContainerRuntime` interface in `internal/runtime/interface.go` (RFC 6.1); a rough sketch follows below.
    2. Implement the Podman backend for `ContainerRuntime` in `internal/runtime/podman.go` (RFC 6.2). Focus on: `CreateContainer`, `StartContainer`, `StopContainer`, `RemoveContainer`, `GetContainerStatus`, `PullImage`, and `StreamContainerLogs`.
    3. Implement the rootless execution strategy (RFC 6.3):
        * Mechanism to ensure dedicated user accounts exist (initially, assume they are pre-existing or created manually for tests).
        * Podman systemd unit generation (`podman generate systemd`).
        * Managing units via `systemctl --user`.
* **Milestone**:
    * The Agent process (upon a mocked internal command) can pull a specified image (e.g., `nginx`) and run it rootlessly using Podman and systemd user services.
    * The Agent can stop, remove, and get the status/logs of this container.
    * All operations are performed via the `ContainerRuntime` interface.
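
The `ContainerRuntime` interface is defined authoritatively in RFC 6.1; the sketch below only suggests the rough shape implied by the method list in task 2, and all signatures and helper types here are assumptions:

```go
package runtime

import (
	"context"
	"io"
)

// ContainerStatus is an assumed minimal status representation.
type ContainerStatus struct {
	State    string // e.g. "running", "exited"
	ExitCode int
}

// ContainerSpec carries what the agent needs to create a container.
type ContainerSpec struct {
	Name  string
	Image string
	Env   map[string]string
	// ... resource limits, mounts, and network config would go here.
}

// ContainerRuntime abstracts the container engine so the agent never calls
// Podman directly; internal/runtime/podman.go provides the implementation.
type ContainerRuntime interface {
	PullImage(ctx context.Context, image string) error
	CreateContainer(ctx context.Context, spec ContainerSpec) (id string, err error)
	StartContainer(ctx context.Context, id string) error
	StopContainer(ctx context.Context, id string, timeoutSeconds int) error
	RemoveContainer(ctx context.Context, id string) error
	GetContainerStatus(ctx context.Context, id string) (ContainerStatus, error)
	StreamContainerLogs(ctx context.Context, id string, w io.Writer) error
}
```

Keeping the interface this narrow makes it straightforward to swap in a mock implementation for the Phase 3 milestone tests.
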
**Phase 4: Basic Workload Deployment (Single Node, Image Source Only, No Networking)**

* **Goal**: The Leader can instruct an Agent to run a simple `Service` workload (single container, image source) on itself (if the Leader is also an Agent) or on a single joined Agent.
* **Tasks**:
    1. Implement basic API endpoints on the Leader for Workload CRUD (`POST/PUT /v1alpha1/n/{ns}/workloads` accepting `tar.gz`) (RFC 8.3, 4.2). The Leader stores Quadlet files in etcd.
    2. Simplistic scheduling (RFC 4.4): if there is only one agent node, assign the workload to it. The Leader creates an "assignment" or "task" for the agent in etcd.
    3. Agent watches etcd for assigned tasks (see the sketch below).
    4. On receiving a task, the Agent uses `ContainerRuntime` to deploy the container (image from `workload.kat`).
    5. Agent reports container instance status in its heartbeat; the Leader updates the overall workload status in etcd.
    6. Basic `katcall apply -f <dir>` and `katcall get workload <name>` functionality.
* **Milestone**:
    * A user can deploy a simple single-container `Service` (e.g., `nginx`) using `katcall apply`.
    * The container runs on the designated Agent node.
    * `katcall get workload my-service` shows its status as running.
    * `katcall logs <instanceID>` streams container logs.
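
A sketch of the task 3 watch loop, using the etcd client's prefix watch; the `/kat/nodes/<node>/tasks/` key layout is an assumption:

```go
package agent

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// watchAssignments observes this node's assignment prefix in etcd and reacts
// to new or deleted tasks; deployment itself goes through ContainerRuntime.
func watchAssignments(ctx context.Context, cli *clientv3.Client, nodeName string) {
	prefix := "/kat/nodes/" + nodeName + "/tasks/" // assumed key layout
	for resp := range cli.Watch(ctx, prefix, clientv3.WithPrefix()) {
		for _, ev := range resp.Events {
			switch ev.Type {
			case clientv3.EventTypePut:
				// New or updated assignment: deploy via ContainerRuntime.
				log.Printf("task assigned: %s", ev.Kv.Key)
			case clientv3.EventTypeDelete:
				// Assignment removed: stop and remove the container.
				log.Printf("task removed: %s", ev.Kv.Key)
			}
		}
	}
}
```
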
**Phase 5: Overlay Networking (WireGuard) & IPAM**

* **Goal**: Nodes establish a WireGuard overlay network, and the Leader allocates IPs for containers.
* **Tasks**:
    1. Implement WireGuard setup on Agents (`internal/network/wireguard.go`) (RFC 7.1):
        * Key generation, with public key reporting to the Leader during join/heartbeat.
        * Leader stores node WireGuard public keys and advertised endpoints in etcd.
        * Agent configures its `kat0` interface and peers by watching etcd.
    2. Implement IPAM in the Leader (`internal/leader/ipam.go`) (RFC 7.2); see the sketch below:
        * Node subnet allocation from `clusterCIDR` (from `cluster.kat`).
        * Container IP allocation from the node's subnet when a workload instance is scheduled.
    3. Agent uses the Leader-assigned IP when creating the container network/container with Podman.
* **Milestone**:
    * All joined KAT nodes form a WireGuard mesh; `wg show` on each node confirms peer connections.
    * The Leader allocates a unique overlay IP for each container instance.
    * Containers on different nodes can ping each other using their overlay IPs.
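
A small sketch of the node-subnet half of task 2, carving sequential /24 subnets out of `clusterCIDR`; the /24 size and the sequential allocation policy are assumptions:

```go
package ipam

import (
	"fmt"
	"net/netip"
)

// NodeSubnet returns the idx-th /24 inside clusterCIDR (IPv4 only).
func NodeSubnet(clusterCIDR string, idx int) (netip.Prefix, error) {
	base, err := netip.ParsePrefix(clusterCIDR)
	if err != nil {
		return netip.Prefix{}, err
	}
	base = base.Masked() // normalize, e.g. 10.100.1.5/16 -> 10.100.0.0/16
	if !base.Addr().Is4() || base.Bits() > 24 {
		return netip.Prefix{}, fmt.Errorf("need an IPv4 prefix of /24 or wider, got %s", clusterCIDR)
	}
	if idx < 0 || idx >= 1<<(24-base.Bits()) {
		return netip.Prefix{}, fmt.Errorf("subnet index %d out of range for %s", idx, clusterCIDR)
	}
	a := base.Addr().As4()
	// Treat the last three octets as one integer and advance by idx*256, so
	// each successive index yields the next /24 block.
	off := (int(a[1])<<16 | int(a[2])<<8 | int(a[3])) + idx<<8
	a[1], a[2], a[3] = byte(off>>16), byte(off>>8), byte(off)
	return netip.PrefixFrom(netip.AddrFrom4(a), 24), nil
}
```

For example, `NodeSubnet("10.100.0.0/16", 3)` yields `10.100.3.0/24`; container IPs would then be handed out from within each node's block.
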
**Phase 6: Distributed Agent DNS & Service Discovery**

* **Goal**: Basic service discovery for deployed services using agent-local DNS.
* **Tasks**:
    1. Implement the agent-local DNS server (`internal/agent/dns_resolver.go`) using `miekg/dns` (RFC 7.3); see the sketch below.
    2. Leader writes DNS `A` records to etcd (e.g., `<workloadName>.<namespace>.<clusterDomain> -> <containerOverlayIP>`) when service instances become healthy/active.
    3. The agent DNS server watches etcd for DNS records and updates its local zones.
    4. Agent configures `/etc/resolv.conf` in managed containers to use its `kat0` IP as the nameserver.
* **Milestone**:
    * A service (`service-a`) deployed on one node can be resolved by its DNS name (e.g., `service-a.default.kat.cluster.local`) from a container on another node.
    * DNS resolution returns the correct overlay IP(s) of `service-a` instances.
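
A minimal `miekg/dns` handler as a starting point for task 1; the in-memory `records` map stands in for the etcd-backed zone data, and the listen address and sample name/IP are placeholders:

```go
package main

import (
	"log"
	"net"

	"github.com/miekg/dns"
)

// records maps FQDNs (note the trailing dot) to overlay IPs; in the real
// resolver this data comes from watching etcd.
var records = map[string]string{
	"service-a.default.kat.cluster.local.": "10.100.3.7", // assumed overlay IP
}

func handle(w dns.ResponseWriter, r *dns.Msg) {
	m := new(dns.Msg)
	m.SetReply(r)
	for _, q := range r.Question {
		if q.Qtype != dns.TypeA {
			continue
		}
		if ip, ok := records[q.Name]; ok {
			m.Answer = append(m.Answer, &dns.A{
				Hdr: dns.RR_Header{Name: q.Name, Rrtype: dns.TypeA, Class: dns.ClassINET, Ttl: 30},
				A:   net.ParseIP(ip),
			})
		}
	}
	_ = w.WriteMsg(m)
}

func main() {
	dns.HandleFunc("kat.cluster.local.", handle)
	srv := &dns.Server{Addr: "10.100.3.1:53", Net: "udp"} // agent's kat0 IP, assumed
	log.Fatal(srv.ListenAndServe())
}
```

A short TTL (30s here) keeps containers from caching records for instances that have been rescheduled.
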
**Phase 7: Advanced Workload Features & Full Scheduling**

* **Goal**: Implement `Job`, `DaemonService`, richer scheduling, health checks, volumes, and restart policies.
* **Tasks**:
    1. Implement the `Job` type (RFC 3.4, 4.8): scheduling, completion tracking, backoff.
    2. Implement the `DaemonService` type (RFC 3.2): ensures one instance per eligible node.
    3. Implement the full scheduling logic in the Leader (RFC 4.4): resource requests (`cpu`, `memory`), `nodeSelector`, taints/tolerations, GPU (basic), and "most empty" scoring (see the sketch below).
    4. Implement `VirtualLoadBalancer.kat` parsing and Agent-side health checks (RFC 3.3, 4.6.3). The Leader uses health status for service readiness and DNS.
    5. Implement the container `restartPolicy` (RFC 3.2, 4.6.4) via systemd unit configuration.
    6. Implement `volumeMounts` and `volumes` (RFC 3.2, 4.7): `HostMount`, `SimpleClusterStorage`. The Agent ensures paths are set up.
* **Milestone**:
    * `Job`s run to completion and their status is tracked.
    * `DaemonService`s run one instance on every eligible node.
    * Services are scheduled according to resource requests, selectors, and taints.
    * Unhealthy service instances are identified by health checks and reflected in status.
    * Containers restart based on their policy.
    * Workloads can mount host paths and simple cluster storage.
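
One plausible reading of the "most empty" scoring in task 3: after filtering out infeasible nodes (selectors, taints, resource fit), score the remainder by free capacity and place the instance on the highest scorer. The equal CPU/memory weighting is an assumption:

```go
package scheduler

// NodeCapacity summarizes a node's allocatable resources and what has
// already been requested by placed instances (both assumed nonzero).
type NodeCapacity struct {
	CPUMillis, MemBytes         int64
	UsedCPUMillis, UsedMemBytes int64
}

// Score returns a value in [0,1]; higher means emptier, so the scheduler
// picks the highest-scoring feasible node.
func Score(n NodeCapacity) float64 {
	freeCPU := float64(n.CPUMillis-n.UsedCPUMillis) / float64(n.CPUMillis)
	freeMem := float64(n.MemBytes-n.UsedMemBytes) / float64(n.MemBytes)
	return 0.5*freeCPU + 0.5*freeMem // equal weighting is an assumption
}
```
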
**Phase 8: Git-Native Builds & Workload Updates/Rollbacks**

* **Goal**: Enable on-agent builds from Git sources and implement workload update strategies.
* **Tasks**:
    1. Implement `BuildDefinition.kat` parsing (RFC 3.5).
    2. Implement the Git-native build process on the Agent (`internal/agent/build.go`) using Podman (RFC 4.3); see the sketch below.
    3. Implement `cacheImage` pull/push for build caching (the Agent needs registry credentials configured locally).
    4. Implement workload update strategies in the Leader (RFC 4.5): `Simultaneous` and `Rolling` (with `maxSurge`).
    5. Implement the manual rollback mechanism (`katcall rollback workload <name>`) (RFC 4.5).
* **Milestone**:
    * A workload can be successfully deployed from a Git repository source, with the image built on the agent.
    * A deployed service can be updated using the `Rolling` strategy with observable incremental instance replacement.
    * A workload can be rolled back to its previous version.
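
A sketch of the task 2 build flow, shelling out to the real `git` and `podman` CLIs; registry auth, cache-image handling, and cleanup are deliberately elided, and the helper names are assumptions. Note that `--branch` resolves branches and tags but not raw commit SHAs:

```go
package build

import (
	"context"
	"os"
	"os/exec"
)

// run executes a command in dir, streaming its output to the agent log.
func run(ctx context.Context, dir string, name string, args ...string) error {
	cmd := exec.CommandContext(ctx, name, args...)
	cmd.Dir = dir
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

// BuildFromGit clones repo at ref into workDir and builds an image with Podman.
func BuildFromGit(ctx context.Context, repo, ref, workDir, tag string) error {
	if err := run(ctx, "", "git", "clone", "--depth=1", "--branch", ref, repo, workDir); err != nil {
		return err
	}
	return run(ctx, workDir, "podman", "build", "-t", tag, ".")
}
```
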
**Phase 9: Full API Implementation & CLI (`katcall`) Polish**

* **Goal**: A robust and comprehensive HTTP API and `katcall` CLI.
* **Tasks**:
    1. Implement all remaining API endpoints and features as per RFC Section 8. Ensure Proto3/JSON contracts are met.
    2. Implement API authentication: a bearer token for `katcall` (RFC 8.1, 10.1); see the sketch below.
    3. Flesh out `katcall` with all necessary commands and options (RFC 1.5, RFC 8.3):
        * `drain <nodeName>`, `get nodes/namespaces`, `describe <resource>`, etc.
    4. Improve error reporting and user feedback in the CLI and API.
* **Milestone**:
    * All functionality defined in the RFC can be managed and introspected via the `katcall` CLI interacting with the secure KAT API.
    * API documentation (e.g., Swagger/OpenAPI generated from protos or code) is available.
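
A sketch of the bearer-token check from task 2 as standard `net/http` middleware; the single static token model follows this plan's reading of RFC 8.1/10.1, while the header parsing and constant-time comparison are implementation choices:

```go
package api

import (
	"crypto/subtle"
	"net/http"
	"strings"
)

// RequireBearer rejects requests whose Authorization header does not carry
// the expected bearer token. ConstantTimeCompare avoids timing leaks.
func RequireBearer(token string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		got, ok := strings.CutPrefix(r.Header.Get("Authorization"), "Bearer ")
		if !ok || subtle.ConstantTimeCompare([]byte(got), []byte(token)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```
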
**Phase 10: Observability, Backup/Restore, Advanced Features & Security**

* **Goal**: Implement observability features, state backup/restore, and other advanced functionality.
* **Tasks**:
    1. Implement Agent and Leader logging to the systemd journal/files; the API for streaming container logs was already delivered in Phase 4 (RFC 9.1).
    2. Implement basic metrics exposure (a `/metrics` JSON endpoint on Leader/Agent) (RFC 9.2).
    3. Implement the events system: the Leader records significant events in etcd, with an API to query them (RFC 9.3).
    4. Implement Leader-driven etcd state backup (`etcdctl snapshot save`) (RFC 5.4); see the sketch below.
    5. Document and test the etcd state restore procedure (RFC 5.5).
    6. Implement detached node operation and rejoin (RFC 4.9).
    7. Provide standard Quadlet files and documentation for the Traefik ingress recipe (RFC 7.4).
    8. Review and harden security aspects: API security, build security, network security, and secrets handling (document current limitations as per RFC 10.5).
* **Milestone**:
    * Container logs are streamable via `katcall logs`, and Agent/Leader logs are accessible.
    * Basic metrics are available via the API, and cluster events can be listed.
    * Automated etcd backups are created by the Leader, and the restore procedure is tested.
    * A detached node can operate locally and rejoin the main cluster.
    * Traefik can be deployed using the provided Quadlets to achieve ingress.
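
A sketch of the task 4 backup, wrapping the `etcdctl snapshot save` invocation the plan names; the snapshot path layout is an assumption, and a production version would also rotate old snapshots:

```go
package backup

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// SnapshotEtcd writes a timestamped etcd snapshot into dir by shelling out
// to etcdctl against the given endpoint.
func SnapshotEtcd(ctx context.Context, endpoint, dir string) error {
	path := fmt.Sprintf("%s/etcd-%s.db", dir, time.Now().UTC().Format("20060102-150405"))
	cmd := exec.CommandContext(ctx, "etcdctl",
		"--endpoints="+endpoint, "snapshot", "save", path)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("etcdctl snapshot save: %v: %s", err, out)
	}
	return nil
}
```

The Leader would call this on a timer; restore (task 5) is the documented manual procedure from RFC 5.5 rather than an automated code path.
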
**Phase 11: Testing, Documentation, and Release Preparation**

* **Goal**: Ensure KAT v1.0 is robust, well-documented, and ready for release.
* **Tasks**:
    1. Write comprehensive unit tests for all core logic.
    2. Develop integration tests for component interactions (e.g., Leader-Agent, Agent-Podman).
    3. Create an E2E test suite that uses `katcall` to simulate real user scenarios (see the sketch below).
    4. Write detailed user documentation: installation, configuration, tutorials for all features, and troubleshooting.
    5. Perform performance testing on key operations (e.g., deployment speed, agent density).
    6. Conduct a thorough security review/audit against the RFC's security considerations.
    7. Establish a release process: versioning, changelog, and building release artifacts.
* **Milestone**:
    * High test coverage across unit, integration, and E2E suites.
    * Comprehensive user and API documentation is complete.
    * Known critical bugs are fixed.
    * KAT v1.0 is packaged and ready for its first official release.
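
A sketch of how an E2E case from task 3 might drive the real `katcall` binary against a running test cluster; the `KAT_E2E` guard, the fixture path, the expected status string, and the cleanup command are all assumptions about how the suite would be organized:

```go
package e2e

import (
	"os"
	"os/exec"
	"strings"
	"testing"
)

// katcall runs the CLI and fails the test on any error, returning its output.
func katcall(t *testing.T, args ...string) string {
	t.Helper()
	out, err := exec.Command("katcall", args...).CombinedOutput()
	if err != nil {
		t.Fatalf("katcall %s: %v\n%s", strings.Join(args, " "), err, out)
	}
	return string(out)
}

func TestDeploySimpleService(t *testing.T) {
	if os.Getenv("KAT_E2E") == "" {
		t.Skip("set KAT_E2E=1 to run end-to-end tests against a live cluster")
	}
	katcall(t, "apply", "-f", "testdata/nginx-service")
	defer katcall(t, "delete", "workload", "my-service") // hypothetical cleanup command

	out := katcall(t, "get", "workload", "my-service")
	if !strings.Contains(out, "Running") { // assumed status string
		t.Fatalf("expected workload to be running, got:\n%s", out)
	}
}
```
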