Implementation Plan
This plan breaks down the implementation into manageable phases, each with a testable milestone.
Phase 0: Project Setup & Core Types
- Goal: Basic project structure, version control, build system, and core data type definitions.
- Tasks:
  - Initialize Git repository, `go.mod`.
  - Create initial directory structure (as above).
  - Define core Proto3 messages in `api/v1alpha1/kat.proto` for: `Workload`, `VirtualLoadBalancer`, `JobDefinition`, `BuildDefinition`, `Namespace`, `Node` (internal representation), `ClusterConfiguration`.
  - Set up `scripts/gen-proto.sh` and generate initial Go types.
  - Implement parsing and basic validation for `cluster.kat` (`internal/config/parse.go`).
  - Implement parsing and basic validation for Quadlet files (`workload.kat`, etc.) and their `tar.gz` packaging/unpackaging.
- Milestone:
  - `make generate` successfully creates Go types from protos.
  - Unit tests pass for parsing `cluster.kat` and a sample Quadlet directory (as `tar.gz`) into their respective Go structs.
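
The `tar.gz` packaging/unpackaging task in Phase 0 could start from a round-trip helper like the one below. This is a minimal sketch using only the Go standard library; the function names `packQuadlets` and `unpackQuadlets` and the sample file content are hypothetical, and validation of the Quadlet contents is left out.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// packQuadlets writes files (name -> content) into an in-memory tar.gz archive.
func packQuadlets(files map[string][]byte) ([]byte, error) {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	tw := tar.NewWriter(gz)
	for name, data := range files {
		hdr := &tar.Header{Name: name, Typeflag: tar.TypeReg, Mode: 0o644, Size: int64(len(data))}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write(data); err != nil {
			return nil, err
		}
	}
	// Close the tar writer first, then gzip, so both trailers are flushed.
	if err := tw.Close(); err != nil {
		return nil, err
	}
	if err := gz.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// unpackQuadlets reads a tar.gz archive and returns its regular files by name.
func unpackQuadlets(archive []byte) (map[string][]byte, error) {
	gz, err := gzip.NewReader(bytes.NewReader(archive))
	if err != nil {
		return nil, err
	}
	defer gz.Close()
	tr := tar.NewReader(gz)
	out := make(map[string][]byte)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		if hdr.Typeflag != tar.TypeReg {
			continue // skip directories, symlinks, and other non-file entries
		}
		data, err := io.ReadAll(tr)
		if err != nil {
			return nil, err
		}
		out[hdr.Name] = data
	}
	return out, nil
}

func main() {
	// The file content here is a placeholder, not the real workload.kat schema.
	archive, err := packQuadlets(map[string][]byte{"workload.kat": []byte("placeholder\n")})
	if err != nil {
		panic(err)
	}
	files, err := unpackQuadlets(archive)
	if err != nil {
		panic(err)
	}
	fmt.Printf("unpacked %d file(s)\n", len(files))
}
```

A production version would probably also cap the archive and per-file size before calling `io.ReadAll`, since the archive arrives over the API.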
Phase 1: State Management & Leader Election
- Goal: A functional embedded etcd and leader election mechanism.
- Tasks:
  - Implement the `StateStore` interface (RFC 5.1) with an etcd backend (`internal/store/etcd.go`).
  - Integrate embedded etcd server into `kat-agent` (RFC 2.2, 5.2), configurable via `cluster.kat` parameters.
  - Implement leader election using `go.etcd.io/etcd/client/v3/concurrency` (RFC 5.3).
  - Basic `kat-agent init` functionality:
    - Parse `cluster.kat`.
    - Start single-node embedded etcd.
    - Campaign for and become leader.
    - Store initial cluster configuration (UID, CIDRs from `cluster.kat`) in etcd.
- Milestone:
  - A single `kat-agent init --config cluster.kat` process starts, initializes etcd, and logs that it has become the leader.
  - The cluster configuration from `cluster.kat` can be verified in etcd using an etcd client.
  - `StateStore` interface methods (`Put`, `Get`, `Delete`, `List`) are testable against the embedded etcd.
Phase 2: Basic Agent & Node Lifecycle (Init, Join, PKI)
- Goal: Initial Leader setup, a second Agent joining with mTLS, and heartbeating.
- Tasks:
  - Implement Internal PKI (RFC 10.6) in `internal/pki/`:
    - CA key/cert generation on `kat-agent init`.
    - CSR generation by agent on join.
    - CSR signing by Leader.
  - Implement initial Node Communication Protocol (RFC 2.3) for join:
    - Agent (`kat-agent join --leader-api <...> --advertise-address <...>`) sends CSR to Leader.
    - Leader validates, signs, returns certs & CA. Stores node registration (name, UID, advertise addr, WG pubkey placeholder) in etcd.
  - Implement basic mTLS for this join communication.
  - Implement Node Heartbeat (`POST /v1alpha1/nodes/{nodeName}/status`) from Agent to Leader (RFC 4.1.3). Leader updates node status in etcd.
  - Leader implements basic failure detection (marks Node `NotReady` in etcd if heartbeats cease) (RFC 4.1.4).
- Milestone:
  - `kat-agent init` establishes a Leader with a CA.
  - `kat-agent join` allows a second agent to securely register with the Leader, obtain certificates, and store its info in etcd.
  - Leader's API receives heartbeats from the joined Agent.
  - If a joined Agent is stopped, the Leader marks its status as `NotReady` in etcd after `nodeLossTimeoutSeconds`.
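
The three PKI steps in Phase 2 (CA generation, CSR generation, CSR signing) can be sketched with `crypto/x509` alone. The helper names, the ECDSA P-256 key type, the `kat-ca` common name, and the validity periods are all assumptions made for illustration; RFC 10.6 governs the real choices.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newCA creates a self-signed CA certificate, as the Leader would on init.
func newCA() (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "kat-ca"}, // hypothetical CA name
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// newCSR creates a key pair and a DER-encoded CSR, as a joining Agent would.
func newCSR(nodeName string) ([]byte, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	req := &x509.CertificateRequest{Subject: pkix.Name{CommonName: nodeName}}
	der, err := x509.CreateCertificateRequest(rand.Reader, req, key)
	return der, key, err
}

// signCSR checks the CSR's self-signature and issues a node certificate from the CA.
func signCSR(csrDER []byte, ca *x509.Certificate, caKey *ecdsa.PrivateKey) (*x509.Certificate, error) {
	csr, err := x509.ParseCertificateRequest(csrDER)
	if err != nil {
		return nil, err
	}
	if err := csr.CheckSignature(); err != nil {
		return nil, fmt.Errorf("CSR signature: %w", err)
	}
	serial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 62))
	if err != nil {
		return nil, err
	}
	serial.Add(serial, big.NewInt(2)) // keep it positive and distinct from the CA's serial
	tmpl := &x509.Certificate{
		SerialNumber: serial,
		Subject:      csr.Subject,
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(12 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth, x509.ExtKeyUsageServerAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, csr.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// verifyAgainstCA checks that cert chains to ca and is usable for client authentication.
func verifyAgainstCA(cert, ca *x509.Certificate) error {
	pool := x509.NewCertPool()
	pool.AddCert(ca)
	_, err := cert.Verify(x509.VerifyOptions{
		Roots:     pool,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	})
	return err
}

func main() {
	ca, caKey, err := newCA()
	if err != nil {
		panic(err)
	}
	csr, _, err := newCSR("node-2")
	if err != nil {
		panic(err)
	}
	cert, err := signCSR(csr, ca, caKey)
	if err != nil {
		panic(err)
	}
	fmt.Println(cert.Subject.CommonName, verifyAgainstCA(cert, ca))
}
```

The sketch trusts whatever subject the CSR carries. The "Leader validates" step in the plan would need an additional check that ties the requested node name to the join request before signing.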
Phase 3: Container Runtime Interface & Local Podman Management
- Goal: Agent can manage containers locally via Podman using the CRI.
- Tasks:
  - Define `ContainerRuntime` interface in `internal/runtime/interface.go` (RFC 6.1).
  - Implement the Podman backend for `ContainerRuntime` in `internal/runtime/podman.go` (RFC 6.2). Focus on: `CreateContainer`, `StartContainer`, `StopContainer`, `RemoveContainer`, `GetContainerStatus`, `PullImage`, `StreamContainerLogs`.
  - Implement rootless execution strategy (RFC 6.3):
    - Mechanism to ensure dedicated user accounts (initially, assume pre-existing or manual creation for tests).
    - Podman systemd unit generation (`podman generate systemd`).
    - Managing units via `systemctl --user`.
- Milestone:
  - Agent process (upon a mocked internal command) can pull a specified image (e.g., `nginx`) and run it rootlessly using Podman and systemd user services.
  - Agent can stop, remove, and get the status/logs of this container.
  - All operations are performed via the `ContainerRuntime` interface.
Phase 4: Basic Workload Deployment (Single Node, Image Source Only, No Networking)
- Goal: Leader can instruct an Agent to run a simple `Service` workload (single container, image source) on itself (if leader is also an agent) or a single joined agent.
- Tasks:
  - Implement basic API endpoints on Leader for Workload CRUD (`POST/PUT /v1alpha1/n/{ns}/workloads` accepting `tar.gz`) (RFC 8.3, 4.2). Leader stores Quadlet files in etcd.
  - Simplistic scheduling (RFC 4.4): If only one agent node, assign workload to it. Leader creates an "assignment" or "task" for the agent in etcd.
  - Agent watches for assigned tasks from etcd.
  - On receiving a task, Agent uses `ContainerRuntime` to deploy the container (image from `workload.kat`).
  - Agent reports container instance status in its heartbeat. Leader updates overall workload status in etcd.
  - Basic `katcall apply -f <dir>` and `katcall get workload <name>` functionality.
- Milestone:
  - User can deploy a simple single-container `Service` (e.g., `nginx`) using `katcall apply`.
  - The container runs on the designated Agent node.
  - `katcall get workload my-service` shows its status as running.
  - `katcall logs <instanceID>` streams container logs.
Phase 5: Overlay Networking (WireGuard) & IPAM
- Goal: Nodes establish a WireGuard overlay network. Leader allocates IPs for containers.
- Tasks:
  - Implement WireGuard setup on Agents (`internal/network/wireguard.go`) (RFC 7.1):
    - Key generation, public key reporting to Leader during join/heartbeat.
    - Leader stores Node WireGuard public keys and advertise endpoints in etcd.
    - Agent configures its `kat0` interface and peers by watching etcd.
  - Implement IPAM in Leader (`internal/leader/ipam.go`) (RFC 7.2):
    - Node subnet allocation from `clusterCIDR` (from `cluster.kat`).
    - Container IP allocation from the node's subnet when a workload instance is scheduled.
  - Agent uses the Leader-assigned IP when creating the container network/container with Podman.
- Milestone:
  - All joined KAT nodes form a WireGuard mesh; `wg show` on nodes confirms peer connections.
  - Leader allocates a unique overlay IP for each container instance.
  - Containers on different nodes can ping each other using their overlay IPs.
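
The two IPAM steps in Phase 5 (node subnet allocation from `clusterCIDR`, then container IP allocation inside a node's subnet) come down to address arithmetic. The following is a rough IPv4-only sketch; the function and type names, the fixed per-node prefix length, and the sequential allocation order are assumptions, and released addresses are not reused.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"net/netip"
)

// nodeSubnet carves the index-th subnet with prefix length bits out of clusterCIDR.
func nodeSubnet(clusterCIDR string, bits, index int) (netip.Prefix, error) {
	cluster, err := netip.ParsePrefix(clusterCIDR)
	if err != nil {
		return netip.Prefix{}, err
	}
	if !cluster.Addr().Is4() || bits < cluster.Bits() || bits > 30 {
		return netip.Prefix{}, errors.New("unsupported cluster CIDR or subnet size")
	}
	if index < 0 || index >= 1<<(bits-cluster.Bits()) {
		return netip.Prefix{}, errors.New("subnet index out of range")
	}
	base := cluster.Masked().Addr().As4()
	// Add the subnet index in the bit positions just above the host bits.
	n := binary.BigEndian.Uint32(base[:]) + uint32(index)<<(32-bits)
	var out [4]byte
	binary.BigEndian.PutUint32(out[:], n)
	return netip.PrefixFrom(netip.AddrFrom4(out), bits), nil
}

// subnetAllocator hands out host addresses from one node subnet in order.
type subnetAllocator struct {
	subnet netip.Prefix
	next   netip.Addr
}

func newSubnetAllocator(subnet netip.Prefix) *subnetAllocator {
	// Start one past the network address.
	return &subnetAllocator{subnet: subnet, next: subnet.Addr().Next()}
}

// Allocate returns the next unused address and stops before the broadcast address.
func (a *subnetAllocator) Allocate() (netip.Addr, error) {
	ip := a.next
	// ip is the broadcast address exactly when its successor leaves the subnet.
	if !a.subnet.Contains(ip) || !a.subnet.Contains(ip.Next()) {
		return netip.Addr{}, errors.New("subnet exhausted")
	}
	a.next = ip.Next()
	return ip, nil
}

func main() {
	// 10.100.0.0/16 is an example value, not a default taken from the RFC.
	subnet, err := nodeSubnet("10.100.0.0/16", 24, 2)
	if err != nil {
		panic(err)
	}
	alloc := newSubnetAllocator(subnet)
	ip, err := alloc.Allocate()
	if err != nil {
		panic(err)
	}
	fmt.Println(subnet, ip)
}
```

Because the Leader is the only allocator, the real implementation would persist the per-node index and the next free address in etcd so that a new Leader continues from the same state.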
Phase 6: Distributed Agent DNS & Service Discovery
- Goal: Basic service discovery using agent-local DNS for deployed services.
- Tasks:
  - Implement Agent-local DNS server (`internal/agent/dns_resolver.go`) using `miekg/dns` (RFC 7.3).
  - Leader writes DNS `A` records to etcd (e.g., `<workloadName>.<namespace>.<clusterDomain> -> <containerOverlayIP>`) when service instances become healthy/active.
  - Agent DNS server watches etcd for DNS records and updates its local zones.
  - Agent configures `/etc/resolv.conf` in managed containers to use its `kat0` IP as nameserver.
- Milestone:
  - A service (`service-a`) deployed on one node can be resolved by its DNS name (e.g., `service-a.default.kat.cluster.local`) by a container on another node.
  - DNS resolution provides the correct overlay IP(s) of `service-a` instances.
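
The record format `<workloadName>.<namespace>.<clusterDomain> -> <containerOverlayIP>` suggests a small lookup table behind the Agent's DNS server. The sketch below shows only that table, using the standard library rather than `miekg/dns`; the `zone` type and its methods are hypothetical, and wiring it to etcd watches and to a DNS handler is left out.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
	"sync"
)

// fqdn normalizes a name to lower case with exactly one trailing dot.
func fqdn(name string) string {
	return strings.ToLower(strings.TrimSuffix(name, ".")) + "."
}

// recordName builds <workloadName>.<namespace>.<clusterDomain> as an FQDN.
func recordName(workload, namespace, clusterDomain string) string {
	return fqdn(workload + "." + namespace + "." + clusterDomain)
}

// zone maps FQDNs to their A records and is safe for concurrent use.
type zone struct {
	mu      sync.RWMutex
	records map[string][]netip.Addr
}

func newZone() *zone { return &zone{records: make(map[string][]netip.Addr)} }

// SetA replaces the A records for name; calling it with no IPs removes the name.
func (z *zone) SetA(name string, ips ...string) error {
	addrs := make([]netip.Addr, 0, len(ips))
	for _, s := range ips {
		ip, err := netip.ParseAddr(s)
		if err != nil {
			return err
		}
		if !ip.Is4() {
			return fmt.Errorf("%s is not an IPv4 address", s)
		}
		addrs = append(addrs, ip)
	}
	z.mu.Lock()
	defer z.mu.Unlock()
	if len(addrs) == 0 {
		delete(z.records, fqdn(name))
		return nil
	}
	z.records[fqdn(name)] = addrs
	return nil
}

// Lookup returns a copy of the A records for query, or an empty slice if none exist.
func (z *zone) Lookup(query string) []netip.Addr {
	z.mu.RLock()
	defer z.mu.RUnlock()
	return append([]netip.Addr(nil), z.records[fqdn(query)]...)
}

func main() {
	z := newZone()
	name := recordName("service-a", "default", "kat.cluster.local")
	// The overlay IPs are example values.
	if err := z.SetA(name, "10.100.2.5", "10.100.3.7"); err != nil {
		panic(err)
	}
	fmt.Println(name, z.Lookup("service-a.default.kat.cluster.local"))
}
```

Replacing the whole record set per name, rather than adding and removing single IPs, matches a watch-driven update where each etcd event carries the current state of a service.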
Phase 7: Advanced Workload Features & Full Scheduling
- Goal: Implement `Job`, `DaemonService`, richer scheduling, health checks, volumes, and restart policies.
- Tasks:
  - Implement `Job` type (RFC 3.4, 4.8): scheduling, completion tracking, backoff.
  - Implement `DaemonService` type (RFC 3.2): ensures one instance per eligible node.
  - Implement full scheduling logic in Leader (RFC 4.4): resource requests (`cpu`, `memory`), `nodeSelector`, Taint/Toleration, GPU (basic), "most empty" scoring.
  - Implement `VirtualLoadBalancer.kat` parsing and Agent-side health checks (RFC 3.3, 4.6.3). Leader uses health status for service readiness and DNS.
  - Implement container `restartPolicy` (RFC 3.2, 4.6.4) via systemd unit configuration.
  - Implement `volumeMounts` and `volumes` (RFC 3.2, 4.7): `HostMount`, `SimpleClusterStorage`. Agent ensures paths are set up.
- Milestone:
  - `Job`s run to completion and their status is tracked.
  - `DaemonService`s run one instance on all eligible nodes.
  - Services are scheduled according to resource requests, selectors, and taints.
  - Unhealthy service instances are identified by health checks and reflected in status.
  - Containers restart based on their policy.
  - Workloads can mount host paths and simple cluster storage.
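
The scheduling task in Phase 7 combines a filter step (resource requests, `nodeSelector`) with "most empty" scoring. The plan does not define the score, so the sketch below assumes one plausible reading: prefer the node with the largest mean fraction of CPU and memory left free after placement. The types, field names, and units are hypothetical, and taints, tolerations, and GPUs are omitted.

```go
package main

import "fmt"

// Node describes capacity and currently free resources (units are illustrative).
type Node struct {
	Name        string
	Labels      map[string]string
	CPUCapacity int64 // millicores
	CPUFree     int64
	MemCapacity int64 // bytes
	MemFree     int64
}

// Request is what one workload instance asks for.
type Request struct {
	CPU          int64
	Mem          int64
	NodeSelector map[string]string
}

// fits reports whether the node matches the selector and has room for the request.
func fits(n Node, r Request) bool {
	for k, v := range r.NodeSelector {
		if n.Labels[k] != v {
			return false
		}
	}
	return n.CPUCapacity > 0 && n.MemCapacity > 0 && n.CPUFree >= r.CPU && n.MemFree >= r.Mem
}

// score is the mean fraction of CPU and memory left free after placing r on n.
func score(n Node, r Request) float64 {
	cpu := float64(n.CPUFree-r.CPU) / float64(n.CPUCapacity)
	mem := float64(n.MemFree-r.Mem) / float64(n.MemCapacity)
	return (cpu + mem) / 2
}

// pickNode returns the best-scoring node that fits; ties go to the earlier node.
func pickNode(nodes []Node, r Request) (Node, bool) {
	var best Node
	found := false
	bestScore := 0.0
	for _, n := range nodes {
		if !fits(n, r) {
			continue
		}
		if s := score(n, r); !found || s > bestScore {
			best, bestScore, found = n, s, true
		}
	}
	return best, found
}

func main() {
	ssd := map[string]string{"disk": "ssd"}
	nodes := []Node{
		{Name: "a", Labels: ssd, CPUCapacity: 4000, CPUFree: 1000, MemCapacity: 8, MemFree: 4},
		{Name: "b", Labels: ssd, CPUCapacity: 4000, CPUFree: 3000, MemCapacity: 8, MemFree: 6},
		{Name: "c", Labels: map[string]string{"disk": "hdd"}, CPUCapacity: 4000, CPUFree: 4000, MemCapacity: 8, MemFree: 8},
	}
	n, ok := pickNode(nodes, Request{CPU: 500, Mem: 1, NodeSelector: ssd})
	fmt.Println(n.Name, ok)
}
```

Node `c` is the emptiest but fails the selector, so the choice falls to `b`. Keeping filtering and scoring as separate functions leaves room to add taint and GPU checks to `fits` without touching the score.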
Phase 8: Git-Native Builds & Workload Updates/Rollbacks
- Goal: Enable on-agent builds from Git sources and implement workload update strategies.
- Tasks:
  - Implement `BuildDefinition.kat` parsing (RFC 3.5).
  - Implement Git-native build process on Agent (`internal/agent/build.go`) using Podman (RFC 4.3).
  - Implement `cacheImage` pull/push for build caching (Agent needs registry credentials configured locally).
  - Implement workload update strategies in Leader (RFC 4.5): `Simultaneous`, `Rolling` (with `maxSurge`).
  - Implement manual rollback mechanism (`katcall rollback workload <name>`) (RFC 4.5).
- Milestone:
  - A workload can be successfully deployed from a Git repository source, with the image built on the agent.
  - A deployed service can be updated using the `Rolling` strategy with observable incremental instance replacement.
  - A workload can be rolled back to its previous version.
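
The "observable incremental instance replacement" in the Phase 8 milestone can be reasoned about with a small planner. This sketch assumes that `maxSurge` caps how many extra instances may exist above the desired replica count at any moment; RFC 4.5 holds the actual semantics, and `rollingPlan` and `step` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// step records how many old-version and new-version instances exist after an action.
type step struct {
	Old int
	New int
}

// rollingPlan alternates between starting up to maxSurge new instances and
// retiring the same number of old ones until no old instances remain.
func rollingPlan(replicas, maxSurge int) ([]step, error) {
	if replicas < 0 || maxSurge < 1 {
		return nil, errors.New("replicas must be >= 0 and maxSurge >= 1")
	}
	old, fresh := replicas, 0
	var steps []step
	for old > 0 {
		batch := maxSurge
		if old < batch {
			batch = old
		}
		fresh += batch // surge: start new instances first
		steps = append(steps, step{Old: old, New: fresh})
		old -= batch // then retire the same number of old instances
		steps = append(steps, step{Old: old, New: fresh})
	}
	return steps, nil
}

func main() {
	steps, err := rollingPlan(3, 1)
	if err != nil {
		panic(err)
	}
	for _, s := range steps {
		fmt.Printf("old=%d new=%d total=%d\n", s.Old, s.New, s.Old+s.New)
	}
}
```

In a real rollout the Leader would wait for each new batch to pass its health checks before retiring old instances; the planner only captures the counting.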
Phase 9: Full API Implementation & CLI (`katcall`) Polish
- Goal: A robust and comprehensive HTTP API and `katcall` CLI.
- Tasks:
  - Implement all remaining API endpoints and features as per RFC Section 8. Ensure Proto3/JSON contracts are met.
  - Implement API authentication: bearer token for `katcall` (RFC 8.1, 10.1).
  - Flesh out `katcall` with all necessary commands and options (RFC 1.5 Terminology - katcall, RFC 8.3 hints): `drain <nodeName>`, `get nodes/namespaces`, `describe <resource>`, etc.
  - Improve error reporting and user feedback in CLI and API.
- Milestone:
  - All functionalities defined in the RFC can be managed and introspected via the `katcall` CLI interacting with the secure KAT API.
  - API documentation (e.g., Swagger/OpenAPI generated from protos or code) is available.
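
Bearer-token authentication for `katcall` fits naturally into an `net/http` middleware. The sketch below assumes a single static token and a hypothetical request path; how tokens are issued, stored, and rotated is defined by RFC 8.1 and 10.1, not here.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// validBearer reports whether an Authorization header carries the expected token.
func validBearer(header, token string) bool {
	const prefix = "Bearer "
	if token == "" || !strings.HasPrefix(header, prefix) {
		return false
	}
	got := strings.TrimPrefix(header, prefix)
	// Constant-time comparison avoids leaking how many leading bytes matched.
	return subtle.ConstantTimeCompare([]byte(got), []byte(token)) == 1
}

// requireBearer rejects requests without the expected token before calling next.
func requireBearer(token string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !validBearer(r.Header.Get("Authorization"), token) {
			w.Header().Set("WWW-Authenticate", "Bearer")
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	api := requireBearer("s3cret", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	}))
	for _, h := range []string{"", "Bearer wrong", "Bearer s3cret"} {
		// The path is a placeholder, not a documented KAT endpoint.
		req := httptest.NewRequest(http.MethodGet, "/v1alpha1/example", nil)
		if h != "" {
			req.Header.Set("Authorization", h)
		}
		rec := httptest.NewRecorder()
		api.ServeHTTP(rec, req)
		fmt.Printf("%q -> %d\n", h, rec.Code)
	}
}
```

Agent-to-Leader traffic is covered by mTLS in Phase 2, so a middleware like this would wrap only the routes that `katcall` uses.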
Phase 10: Observability, Backup/Restore, Advanced Features & Security
- Goal: Implement observability features, state backup/restore, and other advanced functionalities.
- Tasks:
  - Implement Agent & Leader logging to systemd journal/files; the API for streaming container logs is already covered by the Phase 4 milestone (RFC 9.1).
  - Implement basic Metrics exposure (`/metrics` JSON endpoint on Leader/Agent) (RFC 9.2).
  - Implement Events system: Leader records significant events in etcd, API to query events (RFC 9.3).
  - Implement Leader-driven etcd state backup (`etcdctl snapshot save`) (RFC 5.4).
  - Document and test the etcd state restore procedure (RFC 5.5).
  - Implement Detached Node Operation and Rejoin (RFC 4.9).
  - Provide standard Quadlet files and documentation for the Traefik Ingress recipe (RFC 7.4).
  - Review and harden security aspects: API security, build security, network security, secrets handling (document current limitations as per RFC 10.5).
- Milestone:
  - Container logs are streamable via `katcall logs`. Agent/Leader logs are accessible.
  - Basic metrics are available via API. Cluster events can be listed.
  - Automated etcd backups are created by the Leader. Restore procedure is tested.
  - Detached node can operate locally and rejoin the main cluster.
  - Traefik can be deployed using provided Quadlets to achieve ingress.
Phase 11: Testing, Documentation, and Release Preparation
- Goal: Ensure KAT v1.0 is robust, well-documented, and ready for release.
- Tasks:
  - Write comprehensive unit tests for all core logic.
  - Develop integration tests for component interactions (e.g., Leader-Agent, Agent-Podman).
  - Create an E2E test suite using `katcall` to simulate real user scenarios.
  - Write detailed user documentation: installation, configuration, tutorials for all features, troubleshooting.
  - Perform performance testing on key operations (e.g., deployment speed, agent density).
  - Conduct a thorough security review/audit against RFC security considerations.
  - Establish a release process: versioning, changelog, building release artifacts.
- Milestone:
  - High test coverage.
  - Comprehensive user and API documentation is complete.
  - Known critical bugs are fixed.
  - KAT v1.0 is packaged and ready for its first official release.