From e03e27270b62039429b66a1b114eb8ea8475eeb6 Mon Sep 17 00:00:00 2001 From: Tanishq Dubey Date: Fri, 9 May 2025 19:15:50 -0400 Subject: [PATCH] Init Docs --- .voidrules | 131 +++++ docs/plan/filestructure.md | 134 +++++ docs/plan/overview.md | 183 +++++++ docs/plan/phase1.md | 81 +++ docs/plan/phase2.md | 98 ++++ docs/plan/phase3.md | 102 ++++ docs/rfc/RFC001-KAT.md | 1014 ++++++++++++++++++++++++++++++++++++ 7 files changed, 1743 insertions(+) create mode 100644 .voidrules create mode 100644 docs/plan/filestructure.md create mode 100644 docs/plan/overview.md create mode 100644 docs/plan/phase1.md create mode 100644 docs/plan/phase2.md create mode 100644 docs/plan/phase3.md create mode 100644 docs/rfc/RFC001-KAT.md diff --git a/.voidrules b/.voidrules new file mode 100644 index 0000000..00121bc --- /dev/null +++ b/.voidrules @@ -0,0 +1,131 @@ +You are an AI Pair Programming Assistant with extensive expertise in backend software engineering. Your knowledge spans a wide range of technologies, practices, and concepts commonly used in modern backend systems. Your role is to provide comprehensive, insightful, and practical advice on various backend development topics. + +Your areas of expertise include, but are not limited to: +1. Database Management (SQL, NoSQL, NewSQL) +2. API Development (REST, GraphQL, gRPC) +3. Server-Side Programming (Go, Rust, Java, Python, Node.js) +4. Performance Optimization +5. Scalability and Load Balancing +6. Security Best Practices +7. Caching Strategies +8. Data Modeling +9. Microservices Architecture +10. Testing and Debugging +11. Logging and Monitoring +12. Containerization and Orchestration +13. CI/CD Pipelines +14. Docker and Kubernetes +15. gRPC and Protocol Buffers +16. Git Version Control +17. Data Infrastructure (Kafka, RabbitMQ, Redis) +18. Cloud Platforms (AWS, GCP, Azure) + +When responding to queries: +1. Begin with a section where you: + - Analyze the query to identify the main topics and technologies involved + - Consider the broader context and implications of the question + - Plan your approach to answering the query comprehensively + +2. Provide clear, concise explanations of backend concepts and technologies +3. Offer practical advice and best practices for real-world scenarios +4. Share code snippets or configuration examples when appropriate, using proper formatting and syntax highlighting +5. Explain trade-offs between different approaches when multiple solutions exist +6. Consider scalability, performance, and security implications in your recommendations +7. Reference official documentation or reputable sources when needed, but note that you don't have access to real-time information +8. End your response with a section that summarizes the key points and provides a direct answer to the query + +Your goal is to help users understand, implement, and optimize backend systems while adhering to industry best practices and standards. Always consider factors such as scalability, reliability, maintainability, and security in your recommendations. + +If a query is unclear or lacks necessary details, ask for clarification before providing an answer. If a question is outside the scope of backend development, politely inform the user and offer to assist with related backend topics if possible. + +Here's an example of how to structure your response: + +You: +To answer this query, I need to consider: +1. The basics of gRPC and Protocol Buffers +2. Go programming for gRPC server implementation +3. 
Database interaction using Go's database/sql package or an ORM +4. Best practices for structuring a gRPC service +5. Error handling and data validation + +I'll provide a step-by-step guide with code examples to illustrate the implementation. + +1. Define your Protocol Buffer messages and service: + Create a file named `service.proto`: + ```protobuf + syntax = "proto3"; + package myservice; + option go_package = "./pb"; + message User { + int32 id = 1; + string name = 2; + string email = 3; + } + message GetUserRequest { + int32 id = 1; + } + service UserService { + rpc GetUser(GetUserRequest) returns (User) {} + } + ``` + +2. Generate Go code from the Proto file: + ``` + protoc --go_out=. --go-grpc_out=. service.proto + ``` + +3. Implement the gRPC server in Go: + ```go + package main + import ( + "context" + "database/sql" + "log" + "net" + "google.golang.org/grpc" + "github.com/lib/pq" + pb "your_project/pb" + ) + type server struct { + pb.UnimplementedUserServiceServer + db *sql.DB + } + func (s *server) GetUser(ctx context.Context, req *pb.GetUserRequest) (*pb.User, error) { + var user pb.User + err := s.db.QueryRow("SELECT id, name, email FROM users WHERE id = $1", req.Id).Scan(&user.Id, &user.Name, &user.Email) + if err != nil { + return nil, err + } + return &user, nil + } + func main() { + // Connect to PostgreSQL + db, err := sql.Open("postgres", "postgresql://username:password@localhost/dbname?sslmode=disable") + if err != nil { + log.Fatalf("Failed to connect to database: %v", err) + } + defer db.Close() + // Create gRPC server + s := grpc.NewServer() + pb.RegisterUserServiceServer(s, &server{db: db}) + // Start listening + lis, err := net.Listen("tcp", ":50051") + if err != nil { + log.Fatalf("Failed to listen: %v", err) + } + log.Println("Server listening on :50051") + if err := s.Serve(lis); err != nil { + log.Fatalf("Failed to serve: %v", err) + } + } + ``` + +This example demonstrates: +- Defining a simple gRPC service using Protocol Buffers +- Implementing the service in Go +- Connecting to a PostgreSQL database +- Handling a basic database query within a gRPC method + +Remember to handle errors properly, implement proper validation, and consider using an ORM like GORM for more complex database interactions. Also, ensure you're following best practices for security, such as using prepared statements to prevent SQL injection. + +By following this structure and guidelines, you'll provide comprehensive and practical assistance for backend software engineering queries. diff --git a/docs/plan/filestructure.md b/docs/plan/filestructure.md new file mode 100644 index 0000000..49b5214 --- /dev/null +++ b/docs/plan/filestructure.md @@ -0,0 +1,134 @@ +# Directory/File Structure + +This structure assumes a Go-based project, as hinted by the Go interface definitions in the RFC. + +``` +kat-system/ +├── README.md # Project overview, build instructions, contribution guide +├── LICENSE # Project license (e.g., Apache 2.0, MIT) +├── go.mod # Go modules definition +├── go.sum # Go modules checksums +├── Makefile # Build, test, lint, generate code, etc. +│ +├── api/ +│ └── v1alpha1/ +│ ├── kat.proto # Protocol Buffer definitions for all KAT resources (Workload, Node, etc.) +│ └── generated/ # Generated Go code from .proto files (e.g., using protoc-gen-go) +│ # Potentially OpenAPI/Swagger specs generated from protos too. 
+│ +├── cmd/ +│ ├── kat-agent/ +│ │ └── main.go # Entrypoint for the kat-agent binary +│ └── katcall/ +│ └── main.go # Entrypoint for the katcall CLI binary +│ +├── internal/ +│ ├── agent/ +│ │ ├── agent.go # Core agent logic, heartbeating, command processing +│ │ ├── runtime.go # Interface with ContainerRuntime (Podman) +│ │ ├── build.go # Git-native build process logic +│ │ └── dns_resolver.go # Embedded DNS server logic +│ │ +│ ├── leader/ +│ │ ├── leader.go # Core leader logic, reconciliation loops +│ │ ├── schedule.go # Scheduling algorithm implementation +│ │ ├── ipam.go # IP Address Management logic +│ │ ├── state_backup.go # etcd backup logic +│ │ └── api_handler.go # HTTP API request handlers (connects to api/v1alpha1) +│ │ +│ ├── api/ # Server-side API implementation details +│ │ ├── server.go # HTTP server setup, middleware (auth, logging) +│ │ ├── router.go # API route definitions +│ │ └── auth.go # Authentication (mTLS, Bearer token) logic +│ │ +│ ├── cli/ +│ │ ├── commands/ # Subdirectories for each katcall command (apply, get, logs, etc.) +│ │ │ ├── apply.go +│ │ │ └── ... +│ │ ├── client.go # HTTP client for interacting with KAT API +│ │ └── utils.go # CLI helper functions +│ │ +│ ├── config/ +│ │ ├── types.go # Go structs for Quadlet file kinds if not directly from proto +│ │ ├── parse.go # Logic for parsing and validating *.kat files (Quadlets, cluster.kat) +│ │ └── defaults.go # Default values for configurations +│ │ +│ ├── store/ +│ │ ├── interface.go # Definition of StateStore interface (as in RFC 5.1) +│ │ └── etcd.go # etcd implementation of StateStore, embedded etcd setup +│ │ +│ ├── runtime/ +│ │ ├── interface.go # Definition of ContainerRuntime interface (as in RFC 6.1) +│ │ └── podman.go # Podman implementation of ContainerRuntime +│ │ +│ ├── network/ +│ │ ├── wireguard.go # WireGuard setup and peer management logic +│ │ └── types.go # Network related internal types +│ │ +│ ├── pki/ +│ │ ├── ca.go # Certificate Authority management (generation, signing) +│ │ └── certs.go # Certificate generation and handling utilities +│ │ +│ ├── observability/ +│ │ ├── logging.go # Logging setup for components +│ │ ├── metrics.go # Metrics collection and exposure logic +│ │ └── events.go # Event recording and retrieval logic +│ │ +│ ├── types/ # Core internal data structures if not covered by API protos +│ │ ├── node.go +│ │ ├── workload.go +│ │ └── ... +│ │ +│ ├── constants/ +│ │ └── constants.go # Global constants (etcd key prefixes, default ports, etc.) 
+│ │ +│ └── utils/ +│ ├── utils.go # Common utility functions (error handling, string manipulation) +│ └── tar.go # Utilities for handling tar.gz Quadlet archives +│ +├── docs/ +│ ├── rfc/ +│ │ └── RFC001-KAT.md # The source RFC document +│ ├── user-guide/ # User documentation (installation, getting started, tutorials) +│ │ ├── installation.md +│ │ └── basic_usage.md +│ └── api-guide/ # API usage documentation (perhaps generated) +│ +├── examples/ +│ ├── simple-service/ # Example Quadlet for a simple service +│ │ ├── workload.kat +│ │ └── VirtualLoadBalancer.kat +│ ├── git-build-service/ # Example Quadlet for a service built from Git +│ │ ├── workload.kat +│ │ └── build.kat +│ ├── job/ # Example Quadlet for a Job +│ │ ├── workload.kat +│ │ └── job.kat +│ └── cluster.kat # Example cluster configuration file +│ +├── scripts/ +│ ├── setup-dev-env.sh # Script to set up development environment +│ ├── lint.sh # Code linting script +│ ├── test.sh # Script to run all tests +│ └── gen-proto.sh # Script to generate Go code from .proto files +│ +└── test/ + ├── unit/ # Unit tests (mirroring internal/ structure) + ├── integration/ # Integration tests (e.g., agent-leader interaction) + └── e2e/ # End-to-end tests (testing full cluster operations via katcall) + ├── fixtures/ # Test Quadlet files + └── e2e_test.go +``` + +**Description of Key Files/Directories and Relationships:** + +* **`api/v1alpha1/kat.proto`**: The source of truth for all resource definitions. `make generate` (or `scripts/gen-proto.sh`) would convert this into Go structs in `api/v1alpha1/generated/`. These structs will be used across the `internal/` packages. +* **`cmd/kat-agent/main.go`**: Initializes and runs the `kat-agent`. It will instantiate components from `internal/store` (for etcd), `internal/agent`, `internal/leader`, `internal/pki`, `internal/network`, and `internal/api` (for the API server if elected leader). +* **`cmd/katcall/main.go`**: Entry point for the CLI. It uses `internal/cli` components to parse commands and interact with the KAT API via `internal/cli/client.go`. +* **`internal/config/parse.go`**: Used by the Leader to parse submitted Quadlet `tar.gz` archives and by `kat-agent init` to parse `cluster.kat`. +* **`internal/store/etcd.go`**: Implements `StateStore` and manages the embedded etcd instance. Used by both Agent (for watching) and Leader (for all state modifications, leader election). +* **`internal/runtime/podman.go`**: Implements `ContainerRuntime`. Used by `internal/agent/runtime.go` to manage containers based on Podman. +* **`internal/agent/agent.go`** and **`internal/leader/leader.go`**: Contain the core state machines and logic for the respective roles. The `kat-agent` binary decides which role's logic to activate based on leader election status. +* **`internal/pki/ca.go`**: Used by `kat-agent init` to create the CA, and by the Leader to sign CSRs from joining agents. +* **`internal/network/wireguard.go`**: Used by agents to configure their local WireGuard interface based on data synced from etcd (managed by the Leader). +* **`internal/leader/api_handler.go`**: Implements the HTTP handlers for the API, using other leader components (scheduler, IPAM, store) to fulfill requests. diff --git a/docs/plan/overview.md b/docs/plan/overview.md new file mode 100644 index 0000000..2a945ea --- /dev/null +++ b/docs/plan/overview.md @@ -0,0 +1,183 @@ +# Implementation Plan + +This plan breaks down the implementation into manageable phases, each with a testable milestone. 
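+
+Several phases below depend on packaging a Quadlet directory into a `tar.gz` archive for submission to the Leader (Phase 0, task 6; Phase 4, task 1). For orientation, a minimal sketch of that packaging step is shown here; the package and file placement follow the proposed `internal/utils/tar.go`, but the function name and exact signature are assumptions rather than a final design.
+
+```go
+// Hypothetical helper for packaging a Quadlet directory (e.g. examples/simple-service)
+// into a tar.gz archive. Sketch only; the real utility may differ.
+package utils
+
+import (
+	"archive/tar"
+	"compress/gzip"
+	"io"
+	"os"
+	"path/filepath"
+)
+
+// TarGzQuadletDir writes every regular file under dir into a gzip-compressed
+// tar stream, using paths relative to dir as archive entry names.
+func TarGzQuadletDir(dir string, out io.Writer) error {
+	gz := gzip.NewWriter(out)
+	defer gz.Close()
+	tw := tar.NewWriter(gz)
+	defer tw.Close()
+
+	return filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
+		if err != nil || info.IsDir() {
+			return err
+		}
+		rel, err := filepath.Rel(dir, path)
+		if err != nil {
+			return err
+		}
+		hdr, err := tar.FileInfoHeader(info, "")
+		if err != nil {
+			return err
+		}
+		hdr.Name = rel
+		if err := tw.WriteHeader(hdr); err != nil {
+			return err
+		}
+		f, err := os.Open(path)
+		if err != nil {
+			return err
+		}
+		defer f.Close()
+		_, err = io.Copy(tw, f)
+		return err
+	})
+}
+```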
+ +**Phase 0: Project Setup & Core Types** +* **Goal**: Basic project structure, version control, build system, and core data type definitions. +* **Tasks**: + 1. Initialize Git repository, `go.mod`. + 2. Create initial directory structure (as above). + 3. Define core Proto3 messages in `api/v1alpha1/kat.proto` for: `Workload`, `VirtualLoadBalancer`, `JobDefinition`, `BuildDefinition`, `Namespace`, `Node` (internal representation), `ClusterConfiguration`. + 4. Set up `scripts/gen-proto.sh` and generate initial Go types. + 5. Implement parsing and basic validation for `cluster.kat` (`internal/config/parse.go`). + 6. Implement parsing and basic validation for Quadlet files (`workload.kat`, etc.) and their `tar.gz` packaging/unpackaging. +* **Milestone**: + * `make generate` successfully creates Go types from protos. + * Unit tests pass for parsing `cluster.kat` and a sample Quadlet directory (as `tar.gz`) into their respective Go structs. + +**Phase 1: State Management & Leader Election** +* **Goal**: A functional embedded etcd and leader election mechanism. +* **Tasks**: + 1. Implement the `StateStore` interface (RFC 5.1) with an etcd backend (`internal/store/etcd.go`). + 2. Integrate embedded etcd server into `kat-agent` (RFC 2.2, 5.2), configurable via `cluster.kat` parameters. + 3. Implement leader election using `go.etcd.io/etcd/client/v3/concurrency` (RFC 5.3). + 4. Basic `kat-agent init` functionality: + * Parse `cluster.kat`. + * Start single-node embedded etcd. + * Campaign for and become leader. + * Store initial cluster configuration (UID, CIDRs from `cluster.kat`) in etcd. +* **Milestone**: + * A single `kat-agent init --config cluster.kat` process starts, initializes etcd, and logs that it has become the leader. + * The cluster configuration from `cluster.kat` can be verified in etcd using an etcd client. + * `StateStore` interface methods (`Put`, `Get`, `Delete`, `List`) are testable against the embedded etcd. + +**Phase 2: Basic Agent & Node Lifecycle (Init, Join, PKI)** +* **Goal**: Initial Leader setup, a second Agent joining with mTLS, and heartbeating. +* **Tasks**: + 1. Implement Internal PKI (RFC 10.6) in `internal/pki/`: + * CA key/cert generation on `kat-agent init`. + * CSR generation by agent on join. + * CSR signing by Leader. + 2. Implement initial Node Communication Protocol (RFC 2.3) for join: + * Agent (`kat-agent join --leader-api <...> --advertise-address <...>`) sends CSR to Leader. + * Leader validates, signs, returns certs & CA. Stores node registration (name, UID, advertise addr, WG pubkey placeholder) in etcd. + 3. Implement basic mTLS for this join communication. + 4. Implement Node Heartbeat (`POST /v1alpha1/nodes/{nodeName}/status`) from Agent to Leader (RFC 4.1.3). Leader updates node status in etcd. + 5. Leader implements basic failure detection (marks Node `NotReady` in etcd if heartbeats cease) (RFC 4.1.4). +* **Milestone**: + * `kat-agent init` establishes a Leader with a CA. + * `kat-agent join` allows a second agent to securely register with the Leader, obtain certificates, and store its info in etcd. + * Leader's API receives heartbeats from the joined Agent. + * If a joined Agent is stopped, the Leader marks its status as `NotReady` in etcd after `nodeLossTimeoutSeconds`. + +**Phase 3: Container Runtime Interface & Local Podman Management** +* **Goal**: Agent can manage containers locally via Podman using the CRI. +* **Tasks**: + 1. Define `ContainerRuntime` interface in `internal/runtime/interface.go` (RFC 6.1). + 2. 
Implement the Podman backend for `ContainerRuntime` in `internal/runtime/podman.go` (RFC 6.2). Focus on: `CreateContainer`, `StartContainer`, `StopContainer`, `RemoveContainer`, `GetContainerStatus`, `PullImage`, `StreamContainerLogs`. + 3. Implement rootless execution strategy (RFC 6.3): + * Mechanism to ensure dedicated user accounts (initially, assume pre-existing or manual creation for tests). + * Podman systemd unit generation (`podman generate systemd`). + * Managing units via `systemctl --user`. +* **Milestone**: + * Agent process (upon a mocked internal command) can pull a specified image (e.g., `nginx`) and run it rootlessly using Podman and systemd user services. + * Agent can stop, remove, and get the status/logs of this container. + * All operations are performed via the `ContainerRuntime` interface. + +**Phase 4: Basic Workload Deployment (Single Node, Image Source Only, No Networking)** +* **Goal**: Leader can instruct an Agent to run a simple `Service` workload (single container, image source) on itself (if leader is also an agent) or a single joined agent. +* **Tasks**: + 1. Implement basic API endpoints on Leader for Workload CRUD (`POST/PUT /v1alpha1/n/{ns}/workloads` accepting `tar.gz`) (RFC 8.3, 4.2). Leader stores Quadlet files in etcd. + 2. Simplistic scheduling (RFC 4.4): If only one agent node, assign workload to it. Leader creates an "assignment" or "task" for the agent in etcd. + 3. Agent watches for assigned tasks from etcd. + 4. On receiving a task, Agent uses `ContainerRuntime` to deploy the container (image from `workload.kat`). + 5. Agent reports container instance status in its heartbeat. Leader updates overall workload status in etcd. + 6. Basic `katcall apply -f ` and `katcall get workload ` functionality. +* **Milestone**: + * User can deploy a simple single-container `Service` (e.g., `nginx`) using `katcall apply`. + * The container runs on the designated Agent node. + * `katcall get workload my-service` shows its status as running. + * `katcall logs ` streams container logs. + +**Phase 5: Overlay Networking (WireGuard) & IPAM** +* **Goal**: Nodes establish a WireGuard overlay network. Leader allocates IPs for containers. +* **Tasks**: + 1. Implement WireGuard setup on Agents (`internal/network/wireguard.go`) (RFC 7.1): + * Key generation, public key reporting to Leader during join/heartbeat. + * Leader stores Node WireGuard public keys and advertise endpoints in etcd. + * Agent configures its `kat0` interface and peers by watching etcd. + 2. Implement IPAM in Leader (`internal/leader/ipam.go`) (RFC 7.2): + * Node subnet allocation from `clusterCIDR` (from `cluster.kat`). + * Container IP allocation from the node's subnet when a workload instance is scheduled. + 3. Agent uses the Leader-assigned IP when creating the container network/container with Podman. +* **Milestone**: + * All joined KAT nodes form a WireGuard mesh; `wg show` on nodes confirms peer connections. + * Leader allocates a unique overlay IP for each container instance. + * Containers on different nodes can ping each other using their overlay IPs. + +**Phase 6: Distributed Agent DNS & Service Discovery** +* **Goal**: Basic service discovery using agent-local DNS for deployed services. +* **Tasks**: + 1. Implement Agent-local DNS server (`internal/agent/dns_resolver.go`) using `miekg/dns` (RFC 7.3). + 2. Leader writes DNS `A` records to etcd (e.g., `.. -> `) when service instances become healthy/active. + 3. Agent DNS server watches etcd for DNS records and updates its local zones. + 4. 
Agent configures `/etc/resolv.conf` in managed containers to use its `kat0` IP as nameserver. +* **Milestone**: + * A service (`service-a`) deployed on one node can be resolved by its DNS name (e.g., `service-a.default.kat.cluster.local`) by a container on another node. + * DNS resolution provides the correct overlay IP(s) of `service-a` instances. + +**Phase 7: Advanced Workload Features & Full Scheduling** +* **Goal**: Implement `Job`, `DaemonService`, richer scheduling, health checks, volumes, and restart policies. +* **Tasks**: + 1. Implement `Job` type (RFC 3.4, 4.8): scheduling, completion tracking, backoff. + 2. Implement `DaemonService` type (RFC 3.2): ensures one instance per eligible node. + 3. Implement full scheduling logic in Leader (RFC 4.4): resource requests (`cpu`, `memory`), `nodeSelector`, Taint/Toleration, GPU (basic), "most empty" scoring. + 4. Implement `VirtualLoadBalancer.kat` parsing and Agent-side health checks (RFC 3.3, 4.6.3). Leader uses health status for service readiness and DNS. + 5. Implement container `restartPolicy` (RFC 3.2, 4.6.4) via systemd unit configuration. + 6. Implement `volumeMounts` and `volumes` (RFC 3.2, 4.7): `HostMount`, `SimpleClusterStorage`. Agent ensures paths are set up. +* **Milestone**: + * `Job`s run to completion and their status is tracked. + * `DaemonService`s run one instance on all eligible nodes. + * Services are scheduled according to resource requests, selectors, and taints. + * Unhealthy service instances are identified by health checks and reflected in status. + * Containers restart based on their policy. + * Workloads can mount host paths and simple cluster storage. + +**Phase 8: Git-Native Builds & Workload Updates/Rollbacks** +* **Goal**: Enable on-agent builds from Git sources and implement workload update strategies. +* **Tasks**: + 1. Implement `BuildDefinition.kat` parsing (RFC 3.5). + 2. Implement Git-native build process on Agent (`internal/agent/build.go`) using Podman (RFC 4.3). + 3. Implement `cacheImage` pull/push for build caching (Agent needs registry credentials configured locally). + 4. Implement workload update strategies in Leader (RFC 4.5): `Simultaneous`, `Rolling` (with `maxSurge`). + 5. Implement manual rollback mechanism (`katcall rollback workload `) (RFC 4.5). +* **Milestone**: + * A workload can be successfully deployed from a Git repository source, with the image built on the agent. + * A deployed service can be updated using the `Rolling` strategy with observable incremental instance replacement. + * A workload can be rolled back to its previous version. + +**Phase 9: Full API Implementation & CLI (`katcall`) Polish** +* **Goal**: A robust and comprehensive HTTP API and `katcall` CLI. +* **Tasks**: + 1. Implement all remaining API endpoints and features as per RFC Section 8. Ensure Proto3/JSON contracts are met. + 2. Implement API authentication: bearer token for `katcall` (RFC 8.1, 10.1). + 3. Flesh out `katcall` with all necessary commands and options (RFC 1.5 Terminology - katcall, RFC 8.3 hints): + * `drain `, `get nodes/namespaces`, `describe `, etc. + 4. Improve error reporting and user feedback in CLI and API. +* **Milestone**: + * All functionalities defined in the RFC can be managed and introspected via the `katcall` CLI interacting with the secure KAT API. + * API documentation (e.g., Swagger/OpenAPI generated from protos or code) is available. 
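+
+The bearer-token authentication called out in Phase 9 (task 2) can be as small as a single HTTP middleware wrapped around the API router. The sketch below is illustrative only: the handler and parameter names are assumptions, and the real logic would live in `internal/api/auth.go` per the file structure plan. `katcall` would then attach the same token in its `Authorization` header on each request (RFC 8.1).
+
+```go
+// Hypothetical sketch of the Phase 9 bearer-token check (assumed names).
+package api
+
+import (
+	"crypto/subtle"
+	"net/http"
+	"strings"
+)
+
+// authMiddleware rejects requests whose Authorization header does not carry
+// the expected bearer token; node-to-leader mTLS is enforced separately at
+// the TLS layer.
+func authMiddleware(next http.Handler, token string) http.Handler {
+	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		const prefix = "Bearer "
+		h := r.Header.Get("Authorization")
+		if !strings.HasPrefix(h, prefix) ||
+			subtle.ConstantTimeCompare([]byte(strings.TrimPrefix(h, prefix)), []byte(token)) != 1 {
+			http.Error(w, "unauthorized", http.StatusUnauthorized)
+			return
+		}
+		next.ServeHTTP(w, r)
+	})
+}
+```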
+ +**Phase 10: Observability, Backup/Restore, Advanced Features & Security** +* **Goal**: Implement observability features, state backup/restore, and other advanced functionalities. +* **Tasks**: + 1. Implement Agent & Leader logging to systemd journal/files; API for streaming container logs already in Phase 4/Milestone (RFC 9.1). + 2. Implement basic Metrics exposure (`/metrics` JSON endpoint on Leader/Agent) (RFC 9.2). + 3. Implement Events system: Leader records significant events in etcd, API to query events (RFC 9.3). + 4. Implement Leader-driven etcd state backup (`etcdctl snapshot save`) (RFC 5.4). + 5. Document and test the etcd state restore procedure (RFC 5.5). + 6. Implement Detached Node Operation and Rejoin (RFC 4.9). + 7. Provide standard Quadlet files and documentation for the Traefik Ingress recipe (RFC 7.4). + 8. Review and harden security aspects: API security, build security, network security, secrets handling (document current limitations as per RFC 10.5). +* **Milestone**: + * Container logs are streamable via `katcall logs`. Agent/Leader logs are accessible. + * Basic metrics are available via API. Cluster events can be listed. + * Automated etcd backups are created by the Leader. Restore procedure is tested. + * Detached node can operate locally and rejoin the main cluster. + * Traefik can be deployed using provided Quadlets to achieve ingress. + +**Phase 11: Testing, Documentation, and Release Preparation** +* **Goal**: Ensure KAT v1.0 is robust, well-documented, and ready for release. +* **Tasks**: + 1. Write comprehensive unit tests for all core logic. + 2. Develop integration tests for component interactions (e.g., Leader-Agent, Agent-Podman). + 3. Create an E2E test suite using `katcall` to simulate real user scenarios. + 4. Write detailed user documentation: installation, configuration, tutorials for all features, troubleshooting. + 5. Perform performance testing on key operations (e.g., deployment speed, agent density). + 6. Conduct a thorough security review/audit against RFC security considerations. + 7. Establish a release process: versioning, changelog, building release artifacts. +* **Milestone**: + * High test coverage. + * Comprehensive user and API documentation is complete. + * Known critical bugs are fixed. + * KAT v1.0 is packaged and ready for its first official release. \ No newline at end of file diff --git a/docs/plan/phase1.md b/docs/plan/phase1.md new file mode 100644 index 0000000..7ecb6d3 --- /dev/null +++ b/docs/plan/phase1.md @@ -0,0 +1,81 @@ +# **Phase 1: State Management & Leader Election** + +* **Goal**: Establish the foundational state layer using embedded etcd and implement a reliable leader election mechanism. A single `kat-agent` can initialize a cluster, become its leader, and store initial configuration. +* **RFC Sections Primarily Used**: 2.2 (Embedded etcd), 3.9 (ClusterConfiguration), 5.1 (State Store Interface), 5.2 (etcd Implementation Details), 5.3 (Leader Election). + +**Tasks & Sub-Tasks:** + +1. **Define `StateStore` Go Interface (`internal/store/interface.go`)** + * **Purpose**: Create the abstraction layer for all state operations, decoupling the rest of the system from direct etcd dependencies. + * **Details**: Transcribe the Go interface from RFC 5.1 verbatim. Include `KV`, `WatchEvent`, `EventType`, `Compare`, `Op`, `OpType` structs/constants. + * **Verification**: Code compiles. Interface definition matches RFC. + +2. 
**Implement Embedded etcd Server Logic (`internal/store/etcd.go`)** + * **Purpose**: Allow `kat-agent` to run its own etcd instance for single-node clusters or as part of a multi-node quorum. + * **Details**: + * Use `go.etcd.io/etcd/server/v3/embed`. + * Function to start an embedded etcd server: + * Input: configuration parameters (data directory, peer URLs, client URLs, name). These will come from `cluster.kat` or defaults. + * Output: a running `embed.Etcd` instance or an error. + * Graceful shutdown logic for the embedded etcd server. + * **Verification**: A test can start and stop an embedded etcd server. Data directory is created and used. + +3. **Implement `StateStore` with etcd Backend (`internal/store/etcd.go`)** + * **Purpose**: Provide the concrete implementation for interacting with an etcd cluster (embedded or external). + * **Details**: + * Create a struct that implements the `StateStore` interface and holds an `etcd/clientv3.Client`. + * Implement `Put(ctx, key, value)`: Use `client.Put()`. + * Implement `Get(ctx, key)`: Use `client.Get()`. Handle key-not-found. Populate `KV.Version` with `ModRevision`. + * Implement `Delete(ctx, key)`: Use `client.Delete()`. + * Implement `List(ctx, prefix)`: Use `client.Get()` with `clientv3.WithPrefix()`. + * Implement `Watch(ctx, keyOrPrefix, startRevision)`: Use `client.Watch()`. Translate etcd events to `WatchEvent`. + * Implement `Close()`: Close the `clientv3.Client`. + * Implement `Campaign(ctx, leaderID, leaseTTLSeconds)`: + * Use `concurrency.NewSession()` to create a lease. + * Use `concurrency.NewElection()` and `election.Campaign()`. + * Return a context that is cancelled when leadership is lost (e.g., by watching the campaign context or session done channel). + * Implement `Resign(ctx)`: Use `election.Resign()`. + * Implement `GetLeader(ctx)`: Observe the election or query the leader key. + * Implement `DoTransaction(ctx, checks, onSuccess, onFailure)`: Use `client.Txn()` with `clientv3.Compare` and `clientv3.Op`. + * **Potential Challenges**: Correctly handling etcd transaction semantics, context propagation, and error translation. Efficiently managing watches. + * **Verification**: + * Unit tests for each `StateStore` method using a real embedded etcd instance (test-scoped). + * Verify `Put` then `Get` retrieves the correct value and version. + * Verify `List` with prefix. + * Verify `Delete` removes the key. + * Verify `Watch` receives correct events for puts/deletes. + * Verify `DoTransaction` commits on success and rolls back on failure. + +4. **Integrate Leader Election into `kat-agent` (`cmd/kat-agent/main.go`, `internal/leader/election.go` - new file maybe)** + * **Purpose**: Enable an agent instance to attempt to become the cluster leader. + * **Details**: + * `kat-agent` main function will initialize its `StateStore` client. + * A dedicated goroutine will call `StateStore.Campaign()`. + * The outcome of `Campaign` (e.g., leadership acquired, context for leadership duration) will determine if the agent activates its Leader-specific logic (Phase 2+). + * Leader ID could be `nodeName` or a UUID. Lease TTL from `cluster.kat`. + * **Verification**: + * Start one `kat-agent` with etcd enabled; it should log "became leader". + * Start a second `kat-agent` configured to connect to the first's etcd; it should log "observing leader " or similar, but not become leader itself. + * If the first agent (leader) is stopped, the second agent should eventually log "became leader". + +5. 
**Implement Basic `kat-agent init` Command (`cmd/kat-agent/main.go`, `internal/config/parse.go`)** + * **Purpose**: Initialize a new KAT cluster (single node initially). + * **Details**: + * Define `init` subcommand in `kat-agent` using a CLI library (e.g., `cobra`). + * Flag: `--config `. + * Parse `cluster.kat` (from Phase 0, now used to extract etcd peer/client URLs, data dir, backup paths etc.). + * Generate a persistent Cluster UID and store it in etcd (e.g., `/kat/config/cluster_uid`). + * Store `cluster.kat` relevant parameters (or the whole sanitized config) into etcd (e.g., under `/kat/config/cluster_config`). + * Start the embedded etcd server using parsed configurations. + * Initiate leader election. + * **Potential Challenges**: Ensuring `cluster.kat` parsing is robust. Handling existing data directories. + * **Milestone Verification**: + * Running `kat-agent init --config examples/cluster.kat` on a clean system: + * Starts the `kat-agent` process. + * Creates the etcd data directory. + * Logs "Successfully initialized etcd". + * Logs "Became leader: ". + * Using `etcdctl` (or a simple `StateStore.Get` test client): + * Verify `/kat/config/cluster_uid` exists and has a UUID. + * Verify `/kat/config/cluster_config` (or similar keys) contains data from `cluster.kat` (e.g., `clusterCIDR`, `serviceCIDR`, `agentPort`, `apiPort`). + * Verify the leader election key exists for the current leader. \ No newline at end of file diff --git a/docs/plan/phase2.md b/docs/plan/phase2.md new file mode 100644 index 0000000..8dd755f --- /dev/null +++ b/docs/plan/phase2.md @@ -0,0 +1,98 @@ +# **Phase 2: Basic Agent & Node Lifecycle (Init, Join, PKI)** + +* **Goal**: Implement the secure registration of a new agent node to an existing leader, including PKI for mTLS, and establish periodic heartbeating for status updates and failure detection. +* **RFC Sections Primarily Used**: 2.3 (Node Communication Protocol), 4.1.1 (Initial Leader Setup - CA), 4.1.2 (Agent Node Join - CSR), 10.1 (API Security - mTLS), 10.6 (Internal PKI), 4.1.3 (Node Heartbeat), 4.1.4 (Node Departure and Failure Detection - basic). + +**Tasks & Sub-Tasks:** + +1. **Implement Internal PKI Utilities (`internal/pki/ca.go`, `internal/pki/certs.go`)** + * **Purpose**: Create and manage the Certificate Authority and sign certificates for mTLS. + * **Details**: + * `GenerateCA()`: Creates a new RSA key pair and a self-signed X.509 CA certificate. Saves to disk (e.g., `/var/lib/kat/pki/ca.key`, `/var/lib/kat/pki/ca.crt`). Path from `cluster.kat` `backupPath` parent dir, or a new `pkiPath`. + * `GenerateCertificateRequest(commonName, keyOutPath, csrOutPath)`: Agent uses this. Generates RSA key, creates a CSR. + * `SignCertificateRequest(caKeyPath, caCertPath, csrData, certOutPath, duration)`: Leader uses this. Loads CA key/cert, parses CSR, issues a signed certificate. + * Helper functions to load keys and certs from disk. + * **Potential Challenges**: Handling cryptographic operations correctly and securely. Permissions for key storage. + * **Verification**: Unit tests for `GenerateCA`, `GenerateCertificateRequest`, `SignCertificateRequest`. Generated certs should be verifiable against the CA. + +2. **Leader: Initialize CA & Its Own mTLS Certs on `init` (`cmd/kat-agent/main.go`)** + * **Purpose**: The first leader needs to establish the PKI and secure its own API endpoint. + * **Details**: + * During `kat-agent init`, after etcd is up and leadership is confirmed: + * Call `pki.GenerateCA()` if CA files don't exist. 
+ * Generate its own server key and CSR (e.g., for `leader.kat.cluster.local`). + * Sign its own CSR using the CA to get its server certificate. + * Configure its (future) API HTTP server to use these server key/cert for TLS and require client certs (mTLS). + * **Verification**: After `kat-agent init`, CA key/cert and leader's server key/cert exist in the configured PKI path. + +3. **Implement Basic API Server with mTLS on Leader (`internal/api/server.go`, `internal/api/router.go`)** + * **Purpose**: Provide the initial HTTP endpoints required for agent join, secured with mTLS. + * **Details**: + * Setup `http.Server` with `tls.Config`: + * `Certificates`: Leader's server key/cert. + * `ClientAuth: tls.RequireAndVerifyClientCert`. + * `ClientCAs`: Pool containing the cluster CA certificate. + * Minimal router (e.g., `gorilla/mux` or `http.ServeMux`) for: + * `POST /internal/v1alpha1/join`: Endpoint for agent to submit CSR. (Internal as it's part of bootstrap). + * **Verification**: An HTTPS client (e.g., `curl` with appropriate client certs) can connect to the leader's API port if it presents a cert signed by the cluster CA. Connection fails without a client cert or with a cert from a different CA. + +4. **Agent: `join` Command & CSR Submission (`cmd/kat-agent/main.go`, `internal/cli/join.go` - or similar for agent logic)** + * **Purpose**: Allow a new agent to request to join the cluster and obtain its mTLS credentials. + * **Details**: + * `kat-agent join` subcommand: + * Flags: `--leader-api `, `--advertise-address `, `--node-name ` (optional, leader can generate). + * Generate its own key pair and CSR using `pki.GenerateCertificateRequest()`. + * Make an HTTP POST to Leader's `/internal/v1alpha1/join` endpoint: + * Payload: CSR data, advertise address, requested node name, initial WireGuard public key (placeholder for now). + * For this *initial* join, the client may need to trust the leader's CA cert via an out-of-band mechanism or `--leader-ca-cert` flag, or use a token for initial auth if mTLS is strictly enforced from the start. *RFC implies mTLS is mandatory, so agent needs CA cert to trust leader, and leader needs to accept CSR perhaps based on a pre-shared token initially before agent has its own signed cert.* For simplicity in V1, the initial join POST might happen over HTTPS where the agent trusts the leader's self-signed cert (if leader has one before CA is used for client auth) or a pre-shared token authorizes the CSR signing. RFC 4.1.2 states "Leader, upon validating the join request (V1 has no strong token validation, relies on network trust)". This needs clarification. *Assume network trust for now: agent connects, sends CSR, leader signs.* + * Receive signed certificate and CA certificate from Leader. Store them locally. + * **Potential Challenges**: Securely bootstrapping trust for the very first communication to the leader to submit the CSR. + * **Verification**: `kat-agent join` command: + * Generates key/CSR. + * Successfully POSTs CSR to leader. + * Receives and saves its signed certificate and the CA certificate. + +5. **Leader: CSR Signing & Node Registration (Handler for `/internal/v1alpha1/join`)** + * **Purpose**: Validate joining agent, sign its CSR, and record its registration. + * **Details**: + * Handler for `/internal/v1alpha1/join`: + * Parse CSR, advertise address, WG pubkey from request. + * Validate (minimal for now). + * Generate a unique Node Name if not provided. Assign a Node UID. + * Sign the CSR using `pki.SignCertificateRequest()`. 
+ * Store Node registration data in etcd via `StateStore` (`/kat/nodes/registration/{nodeName}`: UID, advertise address, WG pubkey placeholder, join timestamp). + * Return the signed agent certificate and the cluster CA certificate to the agent. + * **Verification**: + * After agent joins, its certificate is signed by the cluster CA. + * Node registration data appears correctly in etcd under `/kat/nodes/registration/{nodeName}`. + +6. **Agent: Establish mTLS Client for Subsequent Comms & Implement Heartbeating (`internal/agent/agent.go`)** + * **Purpose**: Agent uses its new mTLS certs to communicate status to the Leader. + * **Details**: + * Agent configures its HTTP client to use its signed key/cert and the cluster CA cert for all future Leader communications. + * Periodic Heartbeat (RFC 4.1.3): + * Ticker (e.g., every `agentTickSeconds` from `cluster.kat`, default 15s). + * On tick, gather basic node status (node name, timestamp, initial resource capacity stubs). + * HTTP `POST` to Leader's `/v1alpha1/nodes/{nodeName}/status` endpoint using the mTLS-configured client. + * **Verification**: Agent logs successful heartbeat POSTs. + +7. **Leader: Receive Heartbeats & Basic Failure Detection (Handler for `/v1alpha1/nodes/{nodeName}/status`, `internal/leader/leader.go`)** + * **Purpose**: Leader tracks agent status and detects failures. + * **Details**: + * API endpoint `/v1alpha1/nodes/{nodeName}/status` (mTLS required): + * Receives status update from agent. + * Updates node's actual state in etcd (`/kat/nodes/status/{nodeName}/heartbeat`: timestamp, reported status). Could use an etcd lease for this key, renewed by agent heartbeats. + * Failure Detection (RFC 4.1.4): + * Leader has a reconciliation loop or periodic check. + * Scans `/kat/nodes/status/` in etcd. + * If a node's last heartbeat timestamp is older than `nodeLossTimeoutSeconds` (from `cluster.kat`), update its status in etcd to `NotReady` (e.g., `/kat/nodes/status/{nodeName}/condition: NotReady`). + * **Potential Challenges**: Efficiently scanning for dead nodes without excessive etcd load. + * **Milestone Verification**: + * `kat-agent init` runs as Leader, CA created, its API is up with mTLS. + * A second `kat-agent join ...` process successfully: + * Generates CSR, gets it signed by Leader. + * Saves its cert and CA cert. + * Starts sending heartbeats to Leader using mTLS. + * Leader logs receipt of heartbeats from the joined Agent. + * Node status (last heartbeat time) is updated in etcd by the Leader. + * If the joined Agent process is stopped, after `nodeLossTimeoutSeconds`, the Leader updates the node's status in etcd to `NotReady`. This can be verified using `etcdctl` or a `StateStore.Get` call. diff --git a/docs/plan/phase3.md b/docs/plan/phase3.md new file mode 100644 index 0000000..b0eb038 --- /dev/null +++ b/docs/plan/phase3.md @@ -0,0 +1,102 @@ +# **Phase 3: Container Runtime Interface & Local Podman Management** + +* **Goal**: Abstract container management operations behind a `ContainerRuntime` interface and implement it using Podman CLI, enabling an agent to manage containers rootlessly based on (mocked) instructions. +* **RFC Sections Primarily Used**: 6.1 (Runtime Interface Definition), 6.2 (Default Implementation: Podman), 6.3 (Rootless Execution Strategy). + +**Tasks & Sub-Tasks:** + +1. **Define `ContainerRuntime` Go Interface (`internal/runtime/interface.go`)** + * **Purpose**: Abstract all container operations (build, pull, run, stop, inspect, logs, etc.). 
+ * **Details**: Transcribe the Go interface from RFC 6.1 precisely. Include all specified structs (`ImageSummary`, `ContainerStatus`, `BuildOptions`, `PortMapping`, `VolumeMount`, `ResourceSpec`, `ContainerCreateOptions`, `ContainerHealthCheck`) and enums (`ContainerState`, `HealthState`). + * **Verification**: Code compiles. Interface and type definitions match RFC. + +2. **Implement Podman Backend for `ContainerRuntime` (`internal/runtime/podman.go`) - Core Lifecycle Methods** + * **Purpose**: Translate `ContainerRuntime` calls into `podman` CLI commands. + * **Details (for each method, focus on these first):** + * `PullImage(ctx, imageName, platform)`: + * Cmd: `podman pull {imageName}` (add `--platform` if specified). + * Parse output to get image ID (e.g., from `podman inspect {imageName} --format '{{.Id}}'`). + * `CreateContainer(ctx, opts ContainerCreateOptions)`: + * Cmd: `podman create ...` + * Translate `ContainerCreateOptions` into `podman create` flags: + * `--name {opts.InstanceID}` (KAT's unique ID for the instance). + * `--hostname {opts.Hostname}`. + * `--env` for `opts.Env`. + * `--label` for `opts.Labels` (include KAT ownership labels like `kat.dws.rip/workload-name`, `kat.dws.rip/namespace`, `kat.dws.rip/instance-id`). + * `--restart {opts.RestartPolicy}` (map to Podman's "no", "on-failure", "always"). + * Resource mapping: `--cpus` (for quota), `--cpu-shares`, `--memory`. + * `--publish` for `opts.Ports`. + * `--volume` for `opts.Volumes` (source will be host path, destination is container path). + * `--network {opts.NetworkName}` and `--ip {opts.IPAddress}` if specified. + * `--user {opts.User}`. + * `--cap-add`, `--cap-drop`, `--security-opt`. + * Podman native healthcheck flags from `opts.HealthCheck`. + * `--systemd={opts.Systemd}`. + * Parse output for container ID. + * `StartContainer(ctx, containerID)`: Cmd: `podman start {containerID}`. + * `StopContainer(ctx, containerID, timeoutSeconds)`: Cmd: `podman stop -t {timeoutSeconds} {containerID}`. + * `RemoveContainer(ctx, containerID, force, removeVolumes)`: Cmd: `podman rm {containerID}` (add `--force`, `--volumes`). + * `GetContainerStatus(ctx, containerOrName)`: + * Cmd: `podman inspect {containerOrName}`. + * Parse JSON output to populate `ContainerStatus` struct (State, ExitCode, StartedAt, FinishedAt, Health, ImageID, ImageName, OverlayIP if available from inspect). + * Podman health status needs to be mapped to `HealthState`. + * `StreamContainerLogs(ctx, containerID, follow, since, stdout, stderr)`: + * Cmd: `podman logs {containerID}` (add `--follow`, `--since`). + * Stream `os/exec.Cmd.Stdout` and `os/exec.Cmd.Stderr` to the provided `io.Writer`s. + * **Helper**: A utility function to run `podman` commands as a specific rootless user (see Rootless Execution below). + * **Potential Challenges**: Correctly mapping all `ContainerCreateOptions` to Podman flags. Parsing varied `podman inspect` output. Managing `os/exec` for logs. Robust error handling from CLI output. + * **Verification**: + * Unit tests for each implemented method, mocking `os/exec` calls to verify command construction and output parsing. + * *Requires Podman installed for integration-style unit tests*: Tests that actually execute `podman` commands (e.g., pull alpine, create, start, inspect, stop, rm) and verify state changes. + +3. **Implement Rootless Execution Strategy (`internal/runtime/podman.go` helpers, `internal/agent/runtime.go`)** + * **Purpose**: Ensure containers are run by unprivileged users using systemd for supervision. 
+ * **Details**: + * **User Assumption**: For Phase 3, *assume* the dedicated user (e.g., `kat_wl_mywebapp`) already exists on the system and `loginctl enable-linger ` has been run manually. The username could be passed in `ContainerCreateOptions.User` or derived. + * **Podman Command Execution Context**: + * The `kat-agent` process itself might run as root or a privileged user. + * When executing `podman` commands for a workload, it MUST run them as the target unprivileged user. + * This can be achieved using `sudo -u {username} podman ...` or more directly via `nsenter`/`setuid` if the agent has capabilities, or by setting `XDG_RUNTIME_DIR` and `DBUS_SESSION_BUS_ADDRESS` appropriately for the target user if invoking `podman` via systemd user session D-Bus API. *Simplest for now might be `sudo -u {username} podman ...` if agent is root, or ensuring agent itself runs as a user who can switch to other `kat_wl_*` users.* + * The RFC prefers "systemd user sessions". This usually means `systemctl --user ...`. To control another user's systemd session, the agent process (if root) can use `machinectl shell {username}@.host /bin/bash -c "systemctl --user ..."` or `systemd-run --user --machine={username}@.host ...`. If the agent is not root, it cannot directly control other users' systemd sessions. *This is a critical design point: how does the agent (potentially root) interact with user-level systemd?* + * RFC: "Agent uses `systemctl --user --machine={username}@.host ...`". This implies agent has permissions to do this (likely running as root or with specific polkit rules). + * **Systemd Unit Generation & Management**: + * After `podman create ...` (or instead of direct create, if `podman generate systemd` is used to create the definition), generate systemd unit: + `podman generate systemd --new --name {opts.InstanceID} --files --time 10 {imageNameUsedInCreate}`. This creates a `{opts.InstanceID}.service` file. + * The `ContainerRuntime` implementation needs to: + 1. Execute `podman create` to establish the container definition (this allows Podman to manage its internal state for the container ID). + 2. Execute `podman generate systemd --name {containerID}` (using the ID from create) to get the unit file content. + 3. Place this unit file in the target user's systemd path (e.g., `/home/{username}/.config/systemd/user/{opts.InstanceID}.service` or `/etc/systemd/user/{opts.InstanceID}.service` if agent is root and wants to enable for any user). + 4. Run `systemctl --user --machine={username}@.host daemon-reload`. + 5. Start/Enable: `systemctl --user --machine={username}@.host enable --now {opts.InstanceID}.service`. + * To stop: `systemctl --user --machine={username}@.host stop {opts.InstanceID}.service`. + * To remove: `systemctl --user --machine={username}@.host disable {opts.InstanceID}.service`, then `podman rm {opts.InstanceID}`, then remove the unit file. + * Status: `systemctl --user --machine={username}@.host status {opts.InstanceID}.service` (parse output), or rely on `podman inspect` which should reflect systemd-managed state. + * **Potential Challenges**: Managing permissions for interacting with other users' systemd sessions. Correctly placing and cleaning up systemd unit files. Ensuring `XDG_RUNTIME_DIR` is set correctly for rootless Podman if not using systemd units for direct `podman run`. Systemd unit generation nuances. + * **Verification**: + * A test in `internal/agent/runtime_test.go` (or similar) can take mock `ContainerCreateOptions`. 
+ * It calls the (mocked or real) `ContainerRuntime` implementation. + * Verify: + * Podman commands are constructed to run as the target unprivileged user. + * A systemd unit file is generated for the container. + * `systemctl --user --machine...` commands are invoked correctly to manage the service. + * The container is actually started (verify with `podman ps -a --filter label=kat.dws.rip/instance-id={instanceID}` as the target user). + * Logs can be retrieved. + * The container can be stopped and removed, including its systemd unit. + +* **Milestone Verification**: + * The `ContainerRuntime` Go interface is fully defined as per RFC 6.1. + * The Podman implementation for core lifecycle methods (`PullImage`, `CreateContainer` (leading to systemd unit generation), `StartContainer` (via systemd enable/start), `StopContainer` (via systemd stop), `RemoveContainer` (via systemd disable + podman rm + unit file removal), `GetContainerStatus`, `StreamContainerLogs`) is functional. + * An `internal/agent` test (or a temporary `main.go` test harness) can: + 1. Define `ContainerCreateOptions` for a simple image like `docker.io/library/alpine` with a command like `sleep 30`. + 2. Specify a (manually pre-created and linger-enabled) unprivileged username. + 3. Call the `ContainerRuntime` methods. + 4. **Result**: + * The alpine image is pulled (if not present). + * A systemd user service unit is generated and placed correctly for the specified user. + * The service is started using `systemctl --user --machine...`. + * `podman ps --all --filter label=kat.dws.rip/instance-id=...` (run as the target user or by root seeing all containers) shows the container running or having run. + * Logs can be retrieved using the `StreamContainerLogs` method. + * The container can be stopped and removed (including its systemd unit file). + * All container operations are verifiably performed by the specified unprivileged user. + +This detailed plan should provide a clearer path for implementing these initial crucial phases. Remember to keep testing iterative and focused on the RFC specifications. \ No newline at end of file diff --git a/docs/rfc/RFC001-KAT.md b/docs/rfc/RFC001-KAT.md new file mode 100644 index 0000000..079cc08 --- /dev/null +++ b/docs/rfc/RFC001-KAT.md @@ -0,0 +1,1014 @@ +# Request for Comments: 001 - The KAT System (v1.0 Specification) + +**Network Working Group:** DWS LLC\ +**Author:** T. Dubey\ +**Contact:** dubey@dws.rip \ +**Organization:** DWS LLC\ +**URI:** https://www.dws.rip\ +**Date:** May 2025\ +**Obsoletes:** None\ +**Updates:** None + +## The KAT System: A Simplified Container Orchestration Protocol and Architecture Design (Version 1.0) + +### Status of This Memo + +This document specifies Version 1.0 of the KAT (pronounced "cat") system, a simplified container orchestration protocol and architecture developed by DWS LLC. It defines the system's components, operational semantics, resource model, networking, state management, and Application Programming Interface (API). This specification is intended for implementation, discussion, and interoperability. Distribution of this memo is unlimited. + +### Abstract + +The KAT system provides a lightweight, opinionated container orchestration platform specifically designed for resource-constrained environments such as single servers, small clusters, development sandboxes, home labs, and edge deployments. 
It contrasts with complex, enterprise-scale orchestrators by prioritizing simplicity, minimal resource overhead, developer experience, and direct integration with Git-based workflows. KAT manages containerized long-running services and batch jobs using a declarative "Quadlet" configuration model. Key features include an embedded etcd state store, a Leader-Agent architecture, automated on-agent builds from Git sources, rootless container execution, integrated overlay networking (WireGuard-based), distributed agent-local DNS for service discovery, resource-based scheduling with basic affinity/anti-affinity rules, and structured workload updates. This document provides a comprehensive specification for KAT v1.0 implementation and usage. + +### Table of Contents + +1. [Introduction](#1-introduction)\ + 1.1. [Motivation](#11-motivation)\ + 1.2. [Goals](#12-goals)\ + 1.3. [Design Philosophy](#13-design-philosophy)\ + 1.4. [Scope of KAT v1.0](#14-scope-of-kat-v10)\ + 1.5. [Terminology](#15-terminology) +2. [System Architecture](#2-system-architecture)\ + 2.1. [Overview](#21-overview)\ + 2.2. [Components](#22-components)\ + 2.3. [Node Communication Protocol](#23-node-communication-protocol) +3. [Resource Model: KAT Quadlets](#3-resource-model-kat-quadlets)\ + 3.1. [Overview](#31-overview)\ + 3.2. [Workload Definition (`workload.kat`)](#32-workload-definition-workloadkat)\ + 3.3. [Virtual Load Balancer Definition (`VirtualLoadBalancer.kat`)](#33-virtual-load-balancer-definition-virtualloadbalancerkat)\ + 3.4. [Job Definition (`job.kat`)](#34-job-definition-jobkat)\ + 3.5. [Build Definition (`build.kat`)](#35-build-definition-buildkat)\ + 3.6. [Volume Definition (`volume.kat`)](#36-volume-definition-volumekat)\ + 3.7. [Namespace Definition (`namespace.kat`)](#37-namespace-definition-namespacekat)\ + 3.8. [Node Resource (Internal)](#38-node-resource-internal)\ + 3.9. [Cluster Configuration (`cluster.kat`)](#39-cluster-configuration-clusterkat) +4. [Core Operations and Lifecycle Management](#4-core-operations-and-lifecycle-management)\ + 4.1. [System Bootstrapping and Node Lifecycle](#41-system-bootstrapping-and-node-lifecycle)\ + 4.2. [Workload Deployment and Source Management](#42-workload-deployment-and-source-management)\ + 4.3. [Git-Native Build Process](#43-git-native-build-process)\ + 4.4. [Scheduling](#44-scheduling)\ + 4.5. [Workload Updates and Rollouts](#45-workload-updates-and-rollouts)\ + 4.6. [Container Lifecycle Management](#46-container-lifecycle-management)\ + 4.7. [Volume Lifecycle Management](#47-volume-lifecycle-management)\ + 4.8. [Job Execution Lifecycle](#48-job-execution-lifecycle)\ + 4.9. [Detached Node Operation and Rejoin](#49-detached-node-operation-and-rejoin) +5. [State Management](#5-state-management)\ + 5.1. [State Store Interface (Go)](#51-state-store-interface-go)\ + 5.2. [etcd Implementation Details](#52-etcd-implementation-details)\ + 5.3. [Leader Election](#53-leader-election)\ + 5.4. [State Backup (Leader Responsibility)](#54-state-backup-leader-responsibility)\ + 5.5. [State Restore Procedure](#55-state-restore-procedure) +6. [Container Runtime Interface](#6-container-runtime-interface)\ + 6.1. [Runtime Interface Definition (Go)](#61-runtime-interface-definition-go)\ + 6.2. [Default Implementation: Podman](#62-default-implementation-podman)\ + 6.3. [Rootless Execution Strategy](#63-rootless-execution-strategy) +7. [Networking](#7-networking)\ + 7.1. [Integrated Overlay Network](#71-integrated-overlay-network)\ + 7.2. 
[IP Address Management (IPAM)](#72-ip-address-management-ipam)\ + 7.3. [Distributed Agent DNS and Service Discovery](#73-distributed-agent-dns-and-service-discovery)\ + 7.4. [Ingress (Opinionated Recipe via Traefik)](#74-ingress-opinionated-recipe-via-traefik) +8. [API Specification (KAT v1.0 Alpha)](#8-api-specification-kat-v10-alpha)\ + 8.1. [General Principles and Authentication](#81-general-principles-and-authentication)\ + 8.2. [Resource Representation (Proto3 & JSON)](#82-resource-representation-proto3--json)\ + 8.3. [Core API Endpoints](#83-core-api-endpoints) +9. [Observability](#9-observability)\ + 9.1. [Logging](#91-logging)\ + 9.2. [Metrics](#92-metrics)\ + 9.3. [Events](#93-events) +10. [Security Considerations](#10-security-considerations)\ + 10.1. [API Security](#101-api-security)\ + 10.2. [Rootless Execution](#102-rootless-execution)\ + 10.3. [Build Security](#103-build-security)\ + 10.4. [Network Security](#104-network-security)\ + 10.5. [Secrets Management](#105-secrets-management)\ + 10.6. [Internal PKI](#106-internal-pki) +11. [Comparison to Alternatives](#11-comparison-to-alternatives) +12. [Future Work](#12-future-work) +13. [Acknowledgements](#13-acknowledgements) +14. [Author's Address](#14-authors-address) + +--- + +### 1. Introduction + +#### 1.1. Motivation + +The landscape of container orchestration is dominated by powerful, feature-rich platforms designed for large-scale, enterprise deployments. While capable, these systems (e.g., Kubernetes) introduce significant operational complexity and resource requirements (CPU, memory overhead) that are often prohibitive or unnecessarily burdensome for smaller use cases. Developers and operators managing personal projects, home labs, CI/CD runners, small business applications, or edge devices frequently face a choice between the friction of manual deployment (SSH, scripts, `docker-compose`) and the excessive overhead of full-scale orchestrators. This gap highlights the need for a solution that provides core orchestration benefits – declarative management, automation, scheduling, self-healing – without the associated complexity and resource cost. KAT aims to be that solution. + +#### 1.2. Goals + +The primary objectives guiding the design of KAT v1.0 are: + +* **Simplicity:** Offer an orchestration experience that is significantly easier to install, configure, learn, and operate than existing mainstream platforms. Minimize conceptual surface area and required configuration. +* **Minimal Overhead:** Design KAT's core components (Leader, Agent, etcd) to consume minimal system resources, ensuring maximum availability for application workloads, particularly critical in single-node or low-resource scenarios. +* **Core Orchestration:** Provide robust management for the lifecycle of containerized long-running services, scheduled/batch jobs, and basic daemon sets. +* **Automation:** Enable automated deployment updates, on-agent image builds triggered directly from Git repository changes, and fundamental self-healing capabilities (container restarts, service replica rescheduling). +* **Git-Native Workflow:** Facilitate a direct "push-to-deploy" model, integrating seamlessly with common developer workflows centered around Git version control. +* **Rootless Operation:** Implement container execution using unprivileged users by default to enhance security posture and reduce system dependencies. 
+* **Integrated Experience:** Provide built-in solutions for fundamental requirements like overlay networking and service discovery, reducing reliance on complex external components for basic operation. + +#### 1.3. Design Philosophy + +KAT adheres to the following principles: + +* **Embrace Simplicity (Grug Brained):** Actively combat complexity. Prefer simpler solutions even if they cover slightly fewer edge cases initially. Provide opinionated defaults based on common usage patterns. ([The Grug Brained Developer](#)) +* **Declarative Configuration:** Users define the *desired state* via Quadlet files; KAT implements the control loops to achieve and maintain it. +* **Locality of Behavior:** Group related configurations logically (Quadlet directories) rather than by arbitrary type separation across the system. ([HTMX: Locality of Behaviour](https://htmx.org/essays/locality-of-behaviour/)) +* **Leverage Stable Foundations:** Utilize proven, well-maintained technologies like etcd (for consensus) and Podman (for container runtime). +* **Explicit is Better than Implicit (Mostly):** While providing defaults, make configuration options clear and understandable. Avoid overly "magic" behavior. +* **Build for the Common Case:** Focus V1 on solving the 80-90% of use cases for the target audience extremely well. +* **Fail Fast, Recover Simply:** Design components to handle failures predictably. Prioritize simple recovery mechanisms (like etcd snapshots, agent restarts converging state) over complex distributed failure handling protocols where possible for V1. + +#### 1.4. Scope of KAT v1.0 + +This specification details KAT Version 1.0. It includes: +* Leader-Agent architecture with etcd-based state and leader election. +* Quadlet resource model (`Workload`, `VirtualLoadBalancer`, `JobDefinition`, `BuildDefinition`, `VolumeDefinition`, `Namespace`). +* Deployment of Services, Jobs, and DaemonServices. +* Source specification via direct image name or Git repository (with on-agent builds using Podman). Optional build caching via registry. +* Resource-based scheduling with `nodeSelector` and Taint/Toleration support, using a "most empty" placement strategy. +* Workload updates via `Simultaneous` or `Rolling` strategies (`maxSurge` control). Manual rollback support. +* Container lifecycle management including restart policies (`Never`, `MaxCount`, `Always` with reset timer) and optional health checks. +* Volume support for `HostMount` and `SimpleClusterStorage`. +* Integrated WireGuard-based overlay networking. +* Distributed agent-local DNS for service discovery, synchronized via etcd. +* Detached node operation mode with simplified rejoin logic. +* Basic state backup via Leader-driven etcd snapshots. +* Rootless container execution via systemd user services. +* A Proto3-defined, JSON-over-HTTP RESTful API (v1alpha1). +* Opinionated Ingress recipe using Traefik. + +Features explicitly deferred beyond v1.0 are listed in Section 12. + +#### 1.5. Terminology + +The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119. + +* **KAT System (Cluster):** The complete set of KAT Nodes forming a single operational orchestration domain. +* **Node:** An individual machine (physical or virtual) running the KAT Agent software. Each Node has a unique name within the cluster. 
+* **Leader Node (Leader):** The single Node currently elected via the consensus mechanism to perform authoritative cluster management tasks. +* **Agent Node (Agent):** A Node running the KAT Agent software, responsible for local workload execution and status reporting. Includes the Leader node. +* **Namespace:** A logical partition within the KAT cluster used to organize resources (Workloads, Volumes). Defined by `namespace.kat`. Default is "default". System namespace is "kat-core". +* **Workload:** The primary unit of application deployment, defined by a set of Quadlet files specifying desired state. Types: `Service`, `Job`, `DaemonService`. +* **Service:** A Workload type representing a long-running application. +* **Job:** A Workload type representing a task that runs to completion. +* **DaemonService:** A Workload type ensuring one instance runs on each eligible Node. +* **KAT Quadlet (Quadlet):** A set of co-located YAML files (`*.kat`) defining a single Workload. Submitted and managed atomically. +* **Container:** A running instance managed by the container runtime (Podman). +* **Image:** The template for a container, specified directly or built from Git. +* **Volume:** Persistent or ephemeral storage attached to a Workload's container(s). Types: `HostMount`, `SimpleClusterStorage`. +* **Overlay Network:** KAT-managed virtual network (WireGuard) for inter-node/inter-container communication. +* **Service Discovery:** Mechanism (distributed agent DNS) for resolving service names to overlay IPs. +* **Ingress:** Exposing internal services externally, typically via the Traefik recipe. +* **Tick:** Configurable interval for Agent heartbeats to the Leader. +* **Taint:** Key/Value/Effect marker on a Node to repel workloads. +* **Toleration:** Marker on a Workload allowing it to schedule on Nodes with matching Taints. +* **API:** Application Programming Interface (HTTP/JSON based on Proto3). +* **CLI:** Command Line Interface (`katcall`). +* **etcd:** Distributed key-value store used for consensus and state. + +--- + +### 2. System Architecture + +#### 2.1. Overview + +KAT operates using a distributed Leader-Agent model built upon an embedded etcd consensus layer. One `kat-agent` instance is elected Leader, responsible for maintaining the cluster's desired state, making scheduling decisions, and serving the API. All other `kat-agent` instances act as workers (Agents), executing tasks assigned by the Leader and reporting their status. Communication occurs primarily between Agents and the Leader, facilitated by an integrated overlay network. + +#### 2.2. Components + +* **`kat-agent` (Binary):** The single executable deployed on all nodes. Runs in one of two primary modes internally based on leader election status: Agent or Leader. + * **Common Functions:** Node registration, heartbeating, overlay network participation, local container runtime interaction (Podman via CRI interface), local state monitoring, execution of Leader commands. + * **Rootless Execution:** Manages container workloads under distinct, unprivileged user accounts via systemd user sessions (preferred method). +* **Leader Role (Internal state within an elected `kat-agent`):** + * Hosts API endpoints. + * Manages desired/actual state in etcd. + * Runs the scheduling logic. + * Executes the reconciliation control loop. + * Manages IPAM for the overlay network. + * Updates DNS records in etcd. + * Coordinates node join/leave/failure handling. + * Initiates etcd backups. 
+* **Embedded etcd:** Linked library providing Raft consensus for leader election and strongly consistent key-value storage for all cluster state (desired specs, actual status, network config, DNS records). Runs within the `kat-agent` process on quorum members (typically 1, 3, or 5 nodes). + +#### 2.3. Node Communication Protocol + +* **Transport:** HTTP/1.1 or HTTP/2 over mandatory mTLS. KAT includes a simple internal PKI bootstrapped during `init` and `join`. +* **Agent -> Leader:** Periodic `POST /v1alpha1/nodes/{nodeName}/status` containing heartbeat and detailed node/workload status. Triggered every `Tick`. Immediate reports for critical events MAY be sent. +* **Leader -> Agent:** Commands (create/start/stop/remove container, update config) sent via `POST/PUT/DELETE` to agent-specific endpoints (e.g., `https://{agentOverlayIP}:{agentPort}/agent/v1alpha1/...`). +* **Payload:** JSON, derived from Proto3 message definitions. +* **Discovery/Join:** Initial contact via leader hint uses HTTP API; subsequent peer discovery for etcd/overlay uses information distributed by the Leader via the API/etcd. +* **Detached Mode Communication:** Multicast/Broadcast UDP for `REJOIN_REQUEST` messages on local network segments. Direct HTTP response from parent Leader. + +--- + +### 3. Resource Model: KAT Quadlets + +#### 3.1. Overview + +KAT configuration is declarative, centered around the "Quadlet" concept. A Workload is defined by a directory containing YAML files (`*.kat`), each specifying a different aspect (`kind`). This promotes modularity and locality of behavior. + +#### 3.2. Workload Definition (`workload.kat`) + +REQUIRED. Defines the core identity, source, type, and lifecycle policies. + +```yaml +apiVersion: kat.dws.rip/v1alpha1 +kind: Workload +metadata: + name: string # REQUIRED. Workload name. + namespace: string # OPTIONAL. Defaults to "default". +spec: + type: enum # REQUIRED: Service | Job | DaemonService + source: # REQUIRED. Exactly ONE of image or git must be present. + image: string # OPTIONAL. Container image reference. + git: # OPTIONAL. Build from Git. + repository: string # REQUIRED if git. URL of Git repo. + branch: string # OPTIONAL. Defaults to repo default. + tag: string # OPTIONAL. Overrides branch. + commit: string # OPTIONAL. Overrides tag/branch. + cacheImage: string # OPTIONAL. Registry path for build cache layers. + # Used only if 'git' source is specified. + replicas: int # REQUIRED for type: Service. Desired instance count. + # Ignored for Job, DaemonService. + updateStrategy: # OPTIONAL for Service/DaemonService. + type: enum # REQUIRED: Rolling | Simultaneous. Default: Rolling. + rolling: # Relevant if type is Rolling. + maxSurge: int | string # OPTIONAL. Max extra instances during update. Default 1. + restartPolicy: # REQUIRED for container lifecycle. + condition: enum # REQUIRED: Never | MaxCount | Always + maxRestarts: int # OPTIONAL. Used if condition=MaxCount. Default 5. + resetSeconds: int # OPTIONAL. Used if condition=MaxCount. Window to reset count. Default 3600. + nodeSelector: map[string]string # OPTIONAL. Schedule only on nodes matching all labels. + tolerations: # OPTIONAL. List of taints this workload can tolerate. + - key: string + operator: enum # OPTIONAL. Exists | Equal. Default: Exists. + value: string # OPTIONAL. Needed if operator=Equal. + effect: enum # OPTIONAL. NoSchedule | PreferNoSchedule. Matches taint effect. + # Empty matches all effects for the key/value pair. 
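  # Illustrative example (values are hypothetical, not defaults): a workload that
  # should only run on nodes labelled disk=ssd and that tolerates a
  # "gpu=true:NoSchedule" taint would set:
  #   nodeSelector:
  #     disk: "ssd"
  #   tolerations:
  #     - key: "gpu"
  #       operator: Equal
  #       value: "true"
  #       effect: NoSchedule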
+ # --- Container specification (V1 assumes one primary container per workload) --- + container: # REQUIRED. + name: string # OPTIONAL. Informational name for the container. + command: [string] # OPTIONAL. Override image CMD. + args: [string] # OPTIONAL. Override image ENTRYPOINT args or CMD args. + env: # OPTIONAL. Environment variables. + - name: string + value: string + volumeMounts: # OPTIONAL. Mount volumes defined in spec.volumes. + - name: string # Volume name. + mountPath: string # Path inside container. + subPath: string # Optional. Mount sub-directory. + readOnly: bool # Optional. Default false. + resources: # OPTIONAL. Resource requests and limits. + requests: # Used for scheduling. Defaults to limits if unspecified. + cpu: string # e.g., "100m" + memory: string # e.g., "64Mi" + limits: # Enforced by runtime. Container killed if memory exceeded. + cpu: string # CPU throttling limit (e.g., "1") + memory: string # e.g., "256Mi" + gpu: # OPTIONAL. Request GPU resources. + driver: enum # OPTIONAL: any | nvidia | amd + minVRAM_MB: int # OPTIONAL. Minimum GPU memory required. + # --- Volume Definitions for this Workload --- + volumes: # OPTIONAL. Defines volumes used by container.volumeMounts. + - name: string # REQUIRED. Name referenced by volumeMounts. + simpleClusterStorage: {} # OPTIONAL. Creates dir under agent's volumeBasePath. + # Use ONE OF simpleClusterStorage or hostMount. + hostMount: # OPTIONAL. Mounts a specific path from the host node. + hostPath: string # REQUIRED if hostMount. Absolute path on host. + ensureType: enum # OPTIONAL: DirectoryOrCreate | Directory | FileOrCreate | File | Socket``` + +#### 3.3. Virtual Load Balancer Definition (`VirtualLoadBalancer.kat`) + +OPTIONAL. Only relevant for `Workload` of `type: Service`. Defines networking endpoints and health criteria for load balancing and ingress. + +```yaml +apiVersion: kat.dws.rip/v1alpha1 +kind: VirtualLoadBalancer # Identifies this Quadlet file's purpose +spec: + ports: # REQUIRED if this file exists. List of ports to expose/balance. + - name: string # OPTIONAL. Informational name (e.g., "web", "grpc"). + containerPort: int # REQUIRED. Port the application listens on inside container. + protocol: string # OPTIONAL. TCP | UDP. Default TCP. + healthCheck: # OPTIONAL. Used for readiness in rollouts and LB target selection. + # If omitted, container running status implies health. + exec: + command: [string] # REQUIRED. Exit 0 = healthy. + initialDelaySeconds: int # OPTIONAL. Default 0. + periodSeconds: int # OPTIONAL. Default 10. + timeoutSeconds: int # OPTIONAL. Default 1. + successThreshold: int # OPTIONAL. Default 1. + failureThreshold: int # OPTIONAL. Default 3. + ingress: # OPTIONAL. Hints for external ingress controllers (like the Traefik recipe). + - host: string # REQUIRED. External hostname. + path: string # OPTIONAL. Path prefix. Default "/". + servicePortName: string # OPTIONAL. Name of port from spec.ports to target. + servicePort: int # OPTIONAL. Port number from spec.ports. Overrides name. + # One of servicePortName or servicePort MUST be provided if ports exist. + tls: bool # OPTIONAL. If true, signal ingress controller to manage TLS via ACME. +``` + +#### 3.4. Job Definition (`job.kat`) + +REQUIRED if `spec.type` in `workload.kat` is `Job`. + +```yaml +apiVersion: kat.dws.rip/v1alpha1 +kind: JobDefinition # Identifies this Quadlet file's purpose +spec: + schedule: string # OPTIONAL. Cron schedule string. + completions: int # OPTIONAL. Desired successful completions. Default 1. 
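  # Illustrative values (hypothetical, not defaults): a nightly batch that needs
  # five successful runs with at most two running at once would set
  #   schedule: "0 3 * * *", completions: 5, parallelism: 2.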
+ parallelism: int # OPTIONAL. Max concurrent instances. Default 1. + activeDeadlineSeconds: int # OPTIONAL. Timeout for the job run. + backoffLimit: int # OPTIONAL. Max failed instance restarts before job fails. Default 3. +``` + +#### 3.5. Build Definition (`build.kat`) + +REQUIRED if `spec.source.git` is specified in `workload.kat`. + +```yaml +apiVersion: kat.dws.rip/v1alpha1 +kind: BuildDefinition +spec: + buildContext: string # OPTIONAL. Path relative to repo root. Defaults to ".". + dockerfilePath: string # OPTIONAL. Path relative to buildContext. Defaults to "./Dockerfile". + buildArgs: # OPTIONAL. Map for build arguments. + map[string]string # e.g., {"VERSION": "1.2.3"} + targetStage: string # OPTIONAL. Target stage name for multi-stage builds. + platform: string # OPTIONAL. Target platform (e.g., "linux/arm64"). + cache: # OPTIONAL. Defines build caching strategy. + registryPath: string # OPTIONAL. Registry path (e.g., "myreg.com/cache/myapp"). + # Agent tags cache image with commit SHA. +``` + +#### 3.6. Volume Definition (`volume.kat`) + +DEPRECATED in favor of defining volumes directly within `workload.kat -> spec.volumes`. This enhances Locality of Behavior. Section 3.2 reflects this change. This file kind is reserved for potential future use with cluster-wide persistent volumes. + +#### 3.7. Namespace Definition (`namespace.kat`) + +REQUIRED for defining non-default namespaces. + +```yaml +apiVersion: kat.dws.rip/v1alpha1 +kind: Namespace +metadata: + name: string # REQUIRED. Name of the namespace. + # labels: map[string]string # OPTIONAL. +``` + +#### 3.8. Node Resource (Internal) + +Represents node state managed by the Leader, queryable via API. Not defined by user Quadlets. Contains fields like `name`, `status`, `addresses`, `capacity`, `allocatable`, `labels`, `taints`. + +#### 3.9. Cluster Configuration (`cluster.kat`) + +Used *only* during `kat-agent init` via a flag (e.g., `--config cluster.kat`). Defines immutable cluster-wide parameters. + +```yaml +apiVersion: kat.dws.rip/v1alpha1 +kind: ClusterConfiguration +metadata: + name: string # REQUIRED. Informational name for the cluster. +spec: + clusterCIDR: string # REQUIRED. CIDR for overlay network IPs (e.g., "10.100.0.0/16"). + serviceCIDR: string # REQUIRED. CIDR for internal virtual IPs (used by future internal proxy/LB). + # Not directly used by containers in V1 networking model. + nodeSubnetBits: int # OPTIONAL. Number of bits for node subnets within clusterCIDR. + # Default 7 (yielding /23 subnets if clusterCIDR=/16). + clusterDomain: string # OPTIONAL. DNS domain suffix. Default "kat.cluster.local". + # --- Port configurations --- + agentPort: int # OPTIONAL. Port agent listens on (internal). Default 9116. + apiPort: int # OPTIONAL. Port leader listens on for API. Default 9115. + etcdPeerPort: int # OPTIONAL. Default 2380. + etcdClientPort: int # OPTIONAL. Default 2379. + # --- Path configurations --- + volumeBasePath: string # OPTIONAL. Agent base path for SimpleClusterStorage. Default "/var/lib/kat/volumes". + backupPath: string # OPTIONAL. Path on Leader for etcd backups. Default "/var/lib/kat/backups". + # --- Interval configurations --- + backupIntervalMinutes: int # OPTIONAL. Frequency of etcd backups. Default 30. + agentTickSeconds: int # OPTIONAL. Agent heartbeat interval. Default 15. + nodeLossTimeoutSeconds: int # OPTIONAL. Time before marking node NotReady. Default 60. +``` + +--- + +### 4. 
Core Operations and Lifecycle Management

This section details the operational logic, state transitions, and management processes within the KAT system, from cluster initialization to workload execution and node dynamics.

#### 4.1. System Bootstrapping and Node Lifecycle

##### 4.1.1. Initial Leader Setup
The first KAT Node is initialized to become the Leader and establish the cluster.
1.  **Command:** The administrator executes `kat-agent init --config <path/to/cluster.kat>` on the designated initial node. The `cluster.kat` file (see Section 3.9) provides essential cluster-wide parameters.
2.  **Action:**
    *   The `kat-agent` process starts.
    *   It parses the `cluster.kat` file to obtain parameters like ClusterCIDR, ServiceCIDR, domain, agent/API ports, etcd ports, volume paths, and backup settings.
    *   It generates a new internal Certificate Authority (CA) key and certificate (for the PKI, see Section 10.6) if one does not already exist at a predefined path.
    *   It initializes and starts an embedded single-node etcd server, using the configured etcd ports. The etcd data directory is created.
    *   The agent campaigns for leadership via etcd's election mechanism (Section 5.3) and, being the only member, becomes the Leader. It writes its identity (e.g., its advertise IP and API port) to a well-known key in etcd (e.g., `/kat/config/leader_endpoint`).
    *   The Leader initializes its IPAM module (Section 7.2) for the defined ClusterCIDR.
    *   It generates its own WireGuard key pair, stores the private key securely, and publishes its public key and overlay endpoint (external IP and WireGuard port) to etcd.
    *   It sets up its local `kat0` WireGuard interface using its assigned overlay IP (the first available from its own initial subnet).
    *   It starts the API server on the configured API port.
    *   It starts its local DNS resolver (Section 7.3).
    *   The `kat-core` and `default` Namespaces are created in etcd if they do not exist.

##### 4.1.2. Agent Node Join
Subsequent Nodes join an existing KAT cluster to act as workers (and potential future etcd quorum members or leaders if so configured, though V1 focuses on a static initial quorum).
1.  **Command:** `kat-agent join --leader-api <leaderIP:apiPort> --advertise-address <nodeIP> [--etcd-peer]` (the `--etcd-peer` flag indicates this node should attempt to join the etcd quorum).
2.  **Action:**
    *   The `kat-agent` process starts.
    *   It generates a WireGuard key pair.
    *   It contacts the specified Leader API endpoint to request joining the cluster, sending its intended `advertise-address` (for inter-node WireGuard communication) and its WireGuard public key. It also sends a Certificate Signing Request (CSR) for its mTLS client/server certificate.
    *   The Leader, upon validating the join request (V1 has no strong token validation and relies on network trust):
        *   Assigns a unique Node Name (if not provided by the agent, the Leader generates one) and a Node Subnet from the ClusterCIDR (Section 7.2).
        *   Signs the Agent's CSR using the cluster CA, returning the signed certificate and the CA certificate.
        *   Records the new Node's name, advertise address, WireGuard public key, and assigned subnet in etcd (e.g., under `/kat/nodes/registration/{nodeName}`).
        *   If `--etcd-peer` was requested and the quorum has capacity, the Leader MAY instruct the node to join the etcd quorum by providing current peer URLs. (For V1, etcd peer addition post-init is considered an advanced operation; the default is a static initial quorum.)
+ * Provides the new Agent with the list of all current Nodes' WireGuard public keys, overlay endpoint addresses, and their assigned overlay subnets (for `AllowedIPs`). + * The joining Agent: + * Stores the received mTLS certificate and CA certificate. + * Configures its local `kat0` WireGuard interface with an IP from its assigned subnet (typically the `.1` address) and sets up peers for all other known nodes. + * If instructed to join etcd quorum, configures and starts its embedded etcd as a peer. + * Registers itself formally with the Leader via a status update. + * Starts its local DNS resolver and begins syncing DNS state from etcd. + * Becomes `Ready` and available for scheduling workloads. + +##### 4.1.3. Node Heartbeat and Status Reporting +Each Agent Node (including the Leader acting as an Agent for its local workloads) MUST periodically send a status update to the active Leader. +* **Interval:** Configurable `agentTickSeconds` (from `cluster.kat`, e.g., default 15 seconds). +* **Content:** The payload is a JSON object reflecting the Node's current state: + * `nodeName`: string (its unique identifier) + * `nodeUID`: string (a persistent unique ID for the node instance) + * `timestamp`: int64 (Unix epoch seconds) + * `resources`: + * `capacity`: `{"cpu": "2000m", "memory": "4096Mi"}` + * `allocatable`: `{"cpu": "1800m", "memory": "3800Mi"}` (capacity minus system overhead) + * `workloadInstances`: Array of objects, each detailing a locally managed container: + * `workloadName`: string + * `namespace`: string + * `instanceID`: string (unique ID for this replica/run of the workload) + * `containerID`: string (from Podman) + * `imageID`: string (from Podman) + * `state`: string ("running", "exited", "paused", "unknown") + * `exitCode`: int (if exited) + * `healthStatus`: string ("healthy", "unhealthy", "pending_check") (from `VirtualLoadBalancer.kat` health check) + * `restarts`: int (count of Agent-initiated restarts for this instance) + * `overlayNetwork`: `{"status": "connected", "lastPeerSync": "timestamp"}` +* **Protocol:** HTTP `POST` to Leader's `/v1alpha1/nodes/{nodeName}/status` endpoint, authenticated via mTLS. The Leader updates the Node's actual state in etcd (e.g., `/kat/nodes/actual/{nodeName}/status`). + +##### 4.1.4. Node Departure and Failure Detection +* **Graceful Departure:** + 1. Admin action: `katcall drain `. This sets a `NoSchedule` Taint on the Node object in etcd and marks its desired state as "Draining". + 2. Leader reconciliation loop: + * Stops scheduling *new* workloads to the Node. + * For existing `Service` and `DaemonService` instances on the draining node, it initiates a process to reschedule them to other eligible nodes (respecting update strategies where applicable, e.g., not violating `maxUnavailable` for the service cluster-wide). + * For `Job` instances, they are allowed to complete. If a Job is actively running and the node is drained, KAT V1's behavior is to let it finish; more sophisticated preemption is future work. + 3. Once all managed workload instances are terminated or rescheduled, the Agent MAY send a final "departing" message, and the admin can decommission the node. The Leader eventually removes the Node object from etcd after a timeout if it stops heartbeating. +* **Failure Detection:** + 1. The Leader monitors Agent heartbeats. If an Agent misses `nodeLossTimeoutSeconds` (from `cluster.kat`, e.g., 3 * `agentTickSeconds`), the Leader marks the Node's status in etcd as `NotReady`. + 2. 
Reconciliation Loop for `NotReady` Node: + * For `Service` instances previously on the `NotReady` node: The Leader attempts to schedule replacement instances on other `Ready` eligible nodes to maintain `spec.replicas`. + * For `DaemonService` instances: No action, as the node is not eligible. + * For `Job` instances: If the job has a restart policy allowing it, the Leader MAY attempt to reschedule the failed job instance on another eligible node. + 3. If the Node rejoins (starts heartbeating again): The Leader marks it `Ready`. The reconciliation loop will then assess if any workloads *should* be on this node (e.g., DaemonServices, or if it's now the best fit for some pending Services). Any instances that were rescheduled off this node and are now redundant (because the original instance might still be running locally on the rejoined node if it only had a network partition) will be identified. The Leader will instruct the rejoined Agent to stop any such zombie/duplicate containers based on `instanceID` tracking. + +#### 4.2. Workload Deployment and Source Management + +Workloads are the primary units of deployment, defined by Quadlet directories. +1. **Submission:** + * Client (e.g., `katcall apply -f ./my-workload-dir/`) archives the Quadlet directory (e.g., `my-workload-dir/`) into a `tar.gz` file. + * Client sends an HTTP `POST` (for new) or `PUT` (for update) to `/v1alpha1/n/{namespace}/workloads` (if name is in `workload.kat`) or `/v1alpha1/n/{namespace}/workloads/{workloadName}` (for `PUT`). The body is the `tar.gz` archive. + * Leader validates the `metadata.name` in `workload.kat` against the URL path for `PUT`. +2. **Validation & Storage:** + * Leader unpacks the archive. + * It validates each `.kat` file against its known schema (e.g., `Workload`, `VirtualLoadBalancer`, `BuildDefinition`, `JobDefinition`). + * Cross-Quadlet file consistency is checked (e.g., referenced port names in `VirtualLoadBalancer.kat -> spec.ingress` exist in `VirtualLoadBalancer.kat -> spec.ports`). + * If valid, Leader persists each Quadlet file's content into etcd under `/kat/workloads/desired/{namespace}/{workloadName}/{fileName}`. The `metadata.generation` for the workload is incremented on spec changes. + * Leader responds `201 Created` or `200 OK` with the workload's metadata. +3. **Source Handling Precedence by Agent (upon receiving deployment command):** + 1. If `workload.kat -> spec.source.git` is defined: + a. If `workload.kat -> spec.source.cacheImage` is also defined, Agent first attempts to pull this `cacheImage` (see Section 4.3.3). If successful and image hash matches an expected value (e.g., if git commit is also specified and used to tag cache), this image is used, and local build MAY be skipped. + b. If no cache image or cache pull fails/mismatches, proceed to Git Build (Section 4.3). The resulting locally built image is used. + 2. Else if `workload.kat -> spec.source.image` is defined (and no `git` source): Agent pulls this image (Section 4.6.1). + 3. If neither `git` nor `image` is specified, it's a validation error by the Leader. + +#### 4.3. Git-Native Build Process + +Triggered when an Agent is instructed to run a Workload instance with `spec.source.git`. +1. **Setup:** Agent creates a temporary, isolated build directory. +2. **Cloning:** `git clone --depth 1 --branch .` (or `git fetch origin && git checkout `). +3. **Context & Dockerfile Path:** Agent uses `buildContext` and `dockerfilePath` from `build.kat` (defaults to `.` and `./Dockerfile` respectively). +4. 
**Build Execution:** + * Construct `podman build` command with: + * `-t ` (e.g., `kat-local/{namespace}_{workloadName}:{git_commit_sha_short}`) + * `-f {dockerfilePath}` within the `{buildContext}`. + * `--build-arg` for each from `build.kat -> spec.buildArgs`. + * `--target {targetStage}` if specified. + * `--platform {platform}` if specified (else Podman defaults). + * The build context path. + * Execute as the Agent's rootless user or a dedicated build user for that workload. +5. **Build Caching (`build.kat -> spec.cache.registryPath`):** + * **Pre-Build Pull (Cache Hit):** Before Step 2 (Cloning), Agent constructs a tag based on `registryPath` and the specific Git commit SHA (if available, else latest of branch/tag). Attempts `podman pull`. If successful, uses this image and skips local build steps. + * **Post-Build Push (Cache Miss/New Build):** After successful local build, Agent tags the new image with `{registryPath}:{git_commit_sha_short}` and attempts `podman push`. Registry credentials MUST be configured locally on the Agent (e.g., in Podman's auth file for the build user). KATv1 does not manage these credentials centrally. +6. **Outcome:** Agent reports build success (with internal image tag) or failure to Leader. + +#### 4.4. Scheduling + +The Leader performs scheduling in its reconciliation loop for new or rescheduled Workload instances. +1. **Filter Nodes - Resource Requests:** + * Identify `spec.container.resources.requests` (CPU, memory). + * Filter out Nodes whose `status.allocatable` resources are less than requested. +2. **Filter Nodes - nodeSelector:** + * If `spec.nodeSelector` is present, filter out Nodes whose labels do not match *all* key-value pairs in the selector. +3. **Filter Nodes - Taints and Tolerations:** + * For each remaining Node, check its `taints`. + * A Workload instance is repelled if the Node has a taint with `effect=NoSchedule` that is not tolerated by `spec.tolerations`. + * (Nodes with `PreferNoSchedule` taints not tolerated are kept but deprioritized in scoring). +4. **Filter Nodes - GPU Requirements:** + * If `spec.container.resources.gpu` is specified: + * Filter out Nodes that do not report matching GPU capabilities (e.g., `gpu.nvidia.present=true` based on `driver` request). + * Filter out Nodes whose reported available VRAM (a node-level attribute, potentially dynamically tracked by agent) is less than `minVRAM_MB`. +5. **Score Nodes ("Most Empty" Proportional):** + * For each remaining candidate Node: + * `cpu_used_percent = (node_total_cpu_requested_by_workloads / node_allocatable_cpu) * 100` + * `mem_used_percent = (node_total_mem_requested_by_workloads / node_allocatable_mem) * 100` + * `score = (100 - cpu_used_percent) + (100 - mem_used_percent)` (Higher is better, gives more weight to balanced free resources). Or `score = 100 - max(cpu_used_percent, mem_used_percent)`. +6. **Select Node:** + * Prioritize nodes *not* having untolerated `PreferNoSchedule` taints. + * Among those (or all, if all preferred are full), pick the Node with the highest score. + * If multiple nodes tie for the highest score, pick one randomly. +7. **Replica Spreading (Services/DaemonServices):** For multi-instance workloads, when choosing among equally scored nodes, the scheduler MAY prefer nodes currently running fewer instances of the *same* workload to achieve basic anti-affinity. For `DaemonService`, it schedules one instance on *every* eligible node identified after filtering. +8. If no suitable node is found, the instance remains `Pending`. + +#### 4.5. 
Workload Updates and Rollouts + +Triggered by `PUT` to Workload API endpoint with changed Quadlet specs. Leader compares new `desiredSpecHash` with `status.observedSpecHash`. +* **`Simultaneous` Strategy (`spec.updateStrategy.type`):** + 1. Leader instructs Agents to stop and remove all old-version instances. + 2. Once confirmed (or timeout), Leader schedules all new-version instances as per Section 4.4. This causes downtime. +* **`Rolling` Strategy (`spec.updateStrategy.type`):** + 1. `max_surge_val = calculate_absolute(spec.updateStrategy.rolling.maxSurge, new_replicas_count)` + 2. Total allowed instances = `new_replicas_count + max_surge_val`. + 3. The Leader updates instances incrementally: + a. Scales up by launching new-version instances until `total_running_instances` reaches `new_replicas_count` or `old_replicas_count + max_surge_val`, whichever is smaller and appropriate for making progress. New instances use the updated Quadlet spec. + b. Once a new-version instance becomes `Healthy` (passes `VirtualLoadBalancer.kat` health checks, or just starts if no checks), an old-version instance is selected and terminated. + c. The process continues until all instances are new-version and `new_replicas_count` are healthy. + d. If `new_replicas_count < old_replicas_count`, surplus old instances are terminated first, respecting a conceptual (not explicitly defined in V1, but can be `max_surge_val` effectively acting as `maxUnavailable`) limit to maintain availability. +* **Rollbacks (Manual):** + 1. Leader stores the Quadlet files of the previous successfully deployed version in etcd (e.g., at `/kat/workloads/archive/{namespace}/{workloadName}/{generation-1}/`). + 2. User command: `katcall rollback workload {namespace}/{name}`. + 3. Leader retrieves archived Quadlets, treats them as a new desired state, and applies the workload's configured `updateStrategy` to revert. + +#### 4.6. Container Lifecycle Management + +Managed by the Agent based on Leader commands and local policies. +1. **Image Pull/Availability:** Before creating, Agent ensures the target image (from Git build, cache, or direct ref) is locally available, pulling if necessary. +2. **Creation & Start:** Agent uses `ContainerRuntime` to create and start the container with parameters derived from `workload.kat -> spec.container` and `VirtualLoadBalancer.kat -> spec.ports` (translated to runtime port mappings). Node-allocated IP is assigned. +3. **Health Checks (for Services with `VirtualLoadBalancer.kat`):** Agent periodically runs `spec.healthCheck.exec.command` inside the container after `initialDelaySeconds`. Status (Healthy/Unhealthy) based on `successThreshold`/`failureThreshold` is reported in heartbeats. +4. **Restart Policy (`workload.kat -> spec.restartPolicy`):** + * `Never`: No automatic restart by Agent. Leader reschedules for Services/DaemonServices. + * `Always`: Agent always restarts on exit, with exponential backoff. + * `MaxCount`: Agent restarts on non-zero exit, up to `maxRestarts` times. If `resetSeconds` elapses since the *first* restart in a series without hitting `maxRestarts`, the restart count for that series resets. Persistent failure after `maxRestarts` within `resetSeconds` window causes instance to be marked `Failed` by Agent. Leader acts accordingly. + +#### 4.7. Volume Lifecycle Management + +Defined in `workload.kat -> spec.volumes` and mounted via `spec.container.volumeMounts`. 
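As a concrete illustration of the Agent responsibilities described below, here is a hypothetical `workload.kat` fragment (volume names and paths are invented for the example) that wires one `SimpleClusterStorage` volume and one read-only `HostMount` into the container:

```yaml
# Hypothetical fragment of a workload.kat (schema in Section 3.2); values are examples only.
spec:
  container:
    volumeMounts:
      - name: app-data
        mountPath: /var/lib/app/data
      - name: host-config
        mountPath: /etc/app
        readOnly: true
  volumes:
    - name: app-data
      simpleClusterStorage: {}   # resolved by the Agent to {volumeBasePath}/{namespace}/{workloadName}/app-data
    - name: host-config
      hostMount:
        hostPath: /srv/app/config
        ensureType: DirectoryOrCreate
```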
+* **Agent Responsibility:** Before container start, Agent ensures specified volumes are available: + * `SimpleClusterStorage`: Creates directory `{agent.volumeBasePath}/{namespace}/{workloadName}/{volumeName}` if it doesn't exist. Permissions should allow container user access. + * `HostMount`: Validates `hostPath` exists. If `ensureType` is `DirectoryOrCreate` or `FileOrCreate`, attempts creation. Mounts into container. +* **Persistence:** Data in `SimpleClusterStorage` on a node persists across container restarts on that *same node*. If the underlying `agent.volumeBasePath` is on network storage (user-managed), it's cluster-persistent. `HostMount` data persists with the host path. + +#### 4.8. Job Execution Lifecycle + +Defined by `workload.kat -> spec.type: Job` and `job.kat`. +1. Leader schedules Job instances based on `schedule`, `completions`, `parallelism`. +2. Agent runs container. On exit: + * Exit code 0: Instance `Succeeded`. + * Non-zero: Instance `Failed`. Agent applies `restartPolicy` up to `job.kat -> spec.backoffLimit` for the *Job instance* (distinct from container restarts). +3. Leader tracks `completions` and `activeDeadlineSeconds`. + +#### 4.9. Detached Node Operation and Rejoin + +Revised mechanism for dynamic nodes (e.g., laptops): +1. **Configuration:** Agents have `--parent-cluster-name` and `--node-type` (e.g., `laptop`, `stable`). +2. **Detached Mode:** If Agent cannot reach parent Leader after `nodeLossTimeoutSeconds`, it sets an internal `detached=true` flag. +3. **Local Leadership:** Agent becomes its own single-node Leader (trivial election). +4. **Local Operations:** + * Continues running pre-detachment workloads. + * New workloads submitted to its local API get an automatic `nodeSelector` constraint: `kat.dws.rip/nodeName: `. +5. **Rejoin Attempt:** Periodically multicasts `(REJOIN_REQUEST, , ...)` on local LAN. +6. **Parent Response & Rejoin:** Parent Leader responds. Detached Agent clears flag, submits its *locally-created* (nodeSelector-constrained) workloads to parent Leader API, then performs standard Agent join. +7. **Parent Reconciliation:** Parent Leader accepts new workloads, respecting their nodeSelector. + +--- + +### 5. State Management + +#### 5.1. State Store Interface (Go) + +KAT components interact with etcd via a Go interface for abstraction. 
+ +```go +package store + +import ( + "context" + "time" +) + +type KV struct { Key string; Value []byte; Version int64 /* etcd ModRevision */ } +type WatchEvent struct { Type EventType; KV KV; PrevKV *KV } +type EventType int +const ( EventTypePut EventType = iota; EventTypeDelete ) + +type StateStore interface { + Put(ctx context.Context, key string, value []byte) error + Get(ctx context.Context, key string) (*KV, error) + Delete(ctx context.Context, key string) error + List(ctx context.Context, prefix string) ([]KV, error) + Watch(ctx context.Context, keyOrPrefix string, startRevision int64) (<-chan WatchEvent, error) // Added startRevision + Close() error + Campaign(ctx context.Context, leaderID string, leaseTTLSeconds int64) (leadershipCtx context.Context, err error) // Returns context cancelled on leadership loss + Resign(ctx context.Context) error // Uses context from Campaign to manage lease + GetLeader(ctx context.Context) (leaderID string, err error) + DoTransaction(ctx context.Context, checks []Compare, onSuccess []Op, onFailure []Op) (committed bool, err error) // For CAS operations +} +type Compare struct { Key string; ExpectedVersion int64 /* 0 for key not exists */ } +type Op struct { Type OpType; Key string; Value []byte /* for Put */ } +type OpType int +const ( OpPut OpType = iota; OpDelete; OpGet /* not typically used in Txn success/fail ops */) +``` +The `Campaign` method returns a context that is cancelled when leadership is lost or `Resign` is called, simplifying leadership management. `DoTransaction` enables conditional writes for atomicity. + +#### 5.2. etcd Implementation Details + +* **Client:** Uses `go.etcd.io/etcd/client/v3`. +* **Embedded Server:** Uses `go.etcd.io/etcd/server/v3/embed` within `kat-agent` on quorum nodes. Configuration (data-dir, peer/client URLs) from `cluster.kat` and agent flags. +* **Key Schema Examples:** + * `/kat/schema_version`: `v1.0` + * `/kat/config/cluster_uid`: UUID generated at init. + * `/kat/config/leader_endpoint`: Current Leader's API endpoint. + * `/kat/nodes/registration/{nodeName}`: Node's static registration info (UID, WireGuard pubkey, advertise addr). + * `/kat/nodes/status/{nodeName}`: Node's dynamic status (heartbeat timestamp, resources, local instances). Leased by agent. + * `/kat/workloads/desired/{namespace}/{workloadName}/manifest/{fileName}`: Content of each Quadlet file. + * `/kat/workloads/desired/{namespace}/{workloadName}/meta`: Workload metadata (generation, overall spec hash). + * `/kat/workloads/status/{namespace}/{workloadName}`: Leader-maintained status of the workload. + * `/kat/network/config/overlay_cidr`: ClusterCIDR. + * `/kat/network/nodes/{nodeName}/subnet`: Assigned overlay subnet. + * `/kat/network/allocations/{instanceID}/ip`: Assigned container overlay IP. Leased by agent managing instance. + * `/kat/dns/{namespace}/{workloadName}/{recordType}/{value}`: Flattened DNS records. + * `/kat/leader_election/` (etcd prefix): Used by `clientv3/concurrency/election`. + +#### 5.3. Leader Election + +Utilizes `go.etcd.io/etcd/client/v3/concurrency#NewElection` and `Campaign`. All agents configured as potential quorum members participate. The elected Leader renews its lease continuously. If the lease expires (e.g., Leader crashes), other candidates campaign. + +#### 5.4. State Backup (Leader Responsibility) + +The active Leader periodically performs an etcd snapshot. +1. **Interval:** `backupIntervalMinutes` from `cluster.kat`. +2. 
**Action:** Executes `etcdctl snapshot save {backupPath}/{timestamped_filename.db}` against its *own embedded etcd member*.
3.  **Path:** `backupPath` from `cluster.kat`.
4.  **Rotation:** The Leader maintains the last N snapshots locally (e.g., N=5, configurable), deleting older ones.
5.  **User Responsibility:** These are *local* snapshots on the Leader node. Users MUST implement external mechanisms to copy these snapshots to secure, off-node storage.

#### 5.5. State Restore Procedure

For disaster recovery (total cluster loss or etcd quorum corruption):
1.  **STOP** all `kat-agent` processes on all nodes.
2.  Identify the desired etcd snapshot file (`*.db`).
3.  On **one** designated node (intended to be the first new Leader):
    *   Clear its old etcd data directory (`--data-dir` for etcd).
    *   Restore the snapshot: `etcdctl snapshot restore <snapshot.db> --name <nodeName> --initial-cluster <nodeName>=http://<nodeIP>:<etcdPeerPort> --initial-cluster-token <newClusterToken> --data-dir <new_data_dir_path>`
    *   Modify the `kat-agent` startup for this node to use `<new_data_dir_path>` and configure it as if initializing a new cluster, but pointing at this restored data (specific flags for the etcd embed).
4.  Start the `kat-agent` on this restored node. It will become Leader of a new single-member cluster with the restored state.
5.  On all other KAT nodes:
    *   Clear their old etcd data directories.
    *   Clear any KAT agent local state (e.g., WireGuard configs, runtime state).
    *   Join them to the new Leader using `kat-agent join` as if joining a fresh cluster.
6.  The Leader's reconciliation loop will then redeploy workloads according to the restored desired state. **In-flight data or states not captured in the last etcd snapshot will be lost.**

---

### 6. Container Runtime Interface

#### 6.1. Runtime Interface Definition (Go)

Defines the abstraction KAT uses to manage containers.
+ +```go +package runtime + +import ( + "context" + "io" + "time" +) + +type ImageSummary struct { ID string; Tags []string; Size int64 } +type ContainerState string +const ( + ContainerStateRunning ContainerState = "running" + ContainerStateExited ContainerState = "exited" + ContainerStateCreated ContainerState = "created" + ContainerStatePaused ContainerState = "paused" + ContainerStateRemoving ContainerState = "removing" + ContainerStateUnknown ContainerState = "unknown" +) +type HealthState string +const ( + HealthStateHealthy HealthState = "healthy" + HealthStateUnhealthy HealthState = "unhealthy" + HealthStatePending HealthState = "pending_check" // Health check defined but not yet run + HealthStateNotApplicable HealthState = "not_applicable" // No health check defined +) +type ContainerStatus struct { + ID string + ImageID string + ImageName string // Image used to create container + State ContainerState + ExitCode int + StartedAt time.Time + FinishedAt time.Time + Health HealthState + Restarts int // Number of times runtime restarted this specific container instance + OverlayIP string +} +type BuildOptions struct { // From Section 3.5, expanded + ContextDir string + DockerfilePath string + ImageTag string // Target tag for the build + BuildArgs map[string]string + TargetStage string + Platform string + CacheTo []string // e.g., ["type=registry,ref=myreg.com/cache/img:latest"] + CacheFrom []string // e.g., ["type=registry,ref=myreg.com/cache/img:latest"] + NoCache bool + Pull bool // Whether to attempt to pull base images +} +type PortMapping struct { HostPort int; ContainerPort int; Protocol string /* TCP, UDP */; HostIP string /* 0.0.0.0 default */} +type VolumeMount struct { + Name string // User-defined name of the volume from workload.spec.volumes + Type string // "hostMount", "simpleClusterStorage" (translated to "bind" for Podman) + Source string // Resolved host path for the volume + Destination string // Mount path inside container + ReadOnly bool + // SELinuxLabel, Propagation options if needed later +} +type GPUOptions struct { DeviceIDs []string /* e.g., ["0", "1"] or ["all"] */; Capabilities [][]string /* e.g., [["gpu"], ["compute","utility"]] */} +type ResourceSpec struct { + CPUShares int64 // Relative weight + CPUQuota int64 // Microseconds per period (e.g., 50000 for 0.5 CPU with 100000 period) + CPUPeriod int64 // Microseconds (e.g., 100000) + MemoryLimitBytes int64 + GPUSpec *GPUOptions // If GPU requested +} +type ContainerCreateOptions struct { + WorkloadName string + Namespace string + InstanceID string // KAT-generated unique ID for this replica/run + ImageName string // Image to run (after pull/build) + Hostname string + Command []string + Args []string + Env map[string]string + Labels map[string]string // Include KAT ownership labels + RestartPolicy string // "no", "on-failure", "always" (Podman specific values) + Resources ResourceSpec + Ports []PortMapping + Volumes []VolumeMount + NetworkName string // Name of Podman network to join (e.g., for overlay) + IPAddress string // Static IP within Podman network, if assigned by KAT IPAM + User string // User to run as inside container (e.g., "1000:1000") + CapAdd []string + CapDrop []string + SecurityOpt []string + HealthCheck *ContainerHealthCheck // Podman native healthcheck config + Systemd bool // Run container with systemd as init +} +type ContainerHealthCheck struct { + Test []string // e.g., ["CMD", "curl", "-f", "http://localhost/health"] + Interval time.Duration + Timeout time.Duration + Retries int + 
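	// StartPeriod: initial grace period after container start during which failing
	// health checks are not counted against Retries (intended to map to Podman's
	// native health-start-period behavior).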
StartPeriod time.Duration +} + +type ContainerRuntime interface { + BuildImage(ctx context.Context, opts BuildOptions) (imageID string, err error) + PullImage(ctx context.Context, imageName string, platform string) (imageID string, err error) + PushImage(ctx context.Context, imageName string, destinationRegistry string) error + CreateContainer(ctx context.Context, opts ContainerCreateOptions) (containerID string, err error) + StartContainer(ctx context.Context, containerID string) error + StopContainer(ctx context.Context, containerID string, timeoutSeconds uint) error + RemoveContainer(ctx context.Context, containerID string, force bool, removeVolumes bool) error + GetContainerStatus(ctx context.Context, containerOrName string) (*ContainerStatus, error) + StreamContainerLogs(ctx context.Context, containerID string, follow bool, since time.Time, stdout io.Writer, stderr io.Writer) error + PruneAllStoppedContainers(ctx context.Context) (reclaimedSpace int64, err error) + PruneAllUnusedImages(ctx context.Context) (reclaimedSpace int64, err error) + EnsureNetworkExists(ctx context.Context, networkName string, driver string, subnet string, gateway string, options map[string]string) error + RemoveNetwork(ctx context.Context, networkName string) error + ListManagedContainers(ctx context.Context) ([]ContainerStatus, error) // Lists containers labelled by KAT +} +``` + +#### 6.2. Default Implementation: Podman + +The default and only supported `ContainerRuntime` for KAT v1.0 is Podman. The implementation will primarily shell out to the `podman` CLI, using appropriate JSON output flags for parsing. It assumes `podman` is installed and correctly configured for rootless operation on Agent nodes. Key commands used: `podman build`, `podman pull`, `podman push`, `podman create`, `podman start`, `podman stop`, `podman rm`, `podman inspect`, `podman logs`, `podman system prune`, `podman network create/rm/inspect`. + +#### 6.3. Rootless Execution Strategy + +KAT Agents MUST orchestrate container workloads rootlessly. The PREFERRED strategy is: +1. **Dedicated User per Workload/Namespace:** The `kat-agent` (running as root, or with specific sudo rights for `useradd`, `loginctl`, `systemctl --user`) creates a dedicated, unprivileged system user account (e.g., `kat_wl_mywebapp`) when a workload is first scheduled to the node, or uses a pre-existing user from a pool. +2. **Enable Linger:** `loginctl enable-linger `. +3. **Generate Systemd Unit:** The Agent translates the KAT workload definition into container create options and uses `podman generate systemd --new --name {instanceID} --files --time 10 {imageName} {command...}` to produce a `.service` unit file. This unit will include environment variables, volume mounts, port mappings (if host-mapped), resource limits, etc. The `Restart=` directive in the systemd unit will be set according to `workload.kat -> spec.restartPolicy`. +4. **Place and Manage Unit:** The unit file is placed in `/etc/systemd/user/` (if agent is root, enabling it for the target user) or `~{username}/.config/systemd/user/`. The Agent then uses `systemctl --user --machine={username}@.host daemon-reload`, `systemctl --user --machine={username}@.host enable --now {service_name}.service` to start and manage it. +5. **Status and Logs:** Agent queries `systemctl --user --machine... status` and `journalctl --user-unit ...` for status and logs. + +This leverages systemd's robust process supervision and cgroup management for rootless containers. + +--- + +### 7. Networking + +#### 7.1. 
Integrated Overlay Network

KAT v1.0 implements a mandatory, simple, encrypted Layer 3 overlay network connecting all Nodes using WireGuard.
1.  **Configuration:** Defined by `cluster.kat -> spec.clusterCIDR`.
2.  **Key Management:**
    *   Each Agent generates a WireGuard key pair locally upon first start/join. The private key is stored securely (e.g., `/etc/kat/wg_private.key`, mode 0600). The public key is reported to the Leader during registration.
    *   The Leader stores all registered Node public keys and their *external* advertise IPs (for the WireGuard endpoint) in etcd under `/kat/network/nodes/{nodeName}/wg_pubkey` and `/kat/network/nodes/{nodeName}/wg_endpoint`.
3.  **Peer Configuration:** Each Agent watches `/kat/network/nodes/` in etcd. When a new node joins or an existing node's WireGuard info changes, the Agent updates its local WireGuard configuration (e.g., for interface `kat0`):
    *   Adds/updates a `[Peer]` section for every *other* node.
    *   `PublicKey = {peer_public_key}`
    *   `Endpoint = {peer_advertise_ip}:{configured_wg_port}`
    *   `AllowedIPs = {peer_assigned_overlay_subnet_cidr}` (see IPAM below).
    *   PersistentKeepalive MAY be used if nodes are behind NAT.
4.  **Interface Setup:** The Agent ensures the `kat0` interface is up with its assigned overlay IP. Standard OS routing rules handle traffic for the `clusterCIDR` via `kat0`.

#### 7.2. IP Address Management (IPAM)

The Leader manages IP allocation for the overlay network.
1.  **Node Subnets:** From `clusterCIDR` and `nodeSubnetBits` (from `cluster.kat`), the Leader carves out a distinct subnet for each Node that joins (e.g., if clusterCIDR is `10.100.0.0/16` and `nodeSubnetBits` is 7, each node gets a `/23`, such as `10.100.0.0/23`, `10.100.2.0/23`, etc.). This Node-to-Subnet mapping is stored in etcd.
2.  **Container IPs:** When the Leader schedules a Workload instance to a Node, it allocates the next available IP address from that Node's assigned subnet. This `instanceID -> containerIP` mapping is stored in etcd, possibly with a lease. The Agent is informed of this IP to pass to `podman create --ip ...`.
3.  **Maximum Instances:** The size of the node subnet implicitly limits the number of container instances per node.

#### 7.3. Distributed Agent DNS and Service Discovery

Each KAT Agent runs an embedded DNS resolver, synchronized via etcd, providing service discovery.
1.  **DNS Server Implementation:** Agents use `github.com/miekg/dns` to run a DNS server goroutine, listening on their `kat0` overlay IP (port 53).
2.  **Record Source:**
    *   When a `Workload` instance (especially a `Service` or `DaemonService`) with an assigned overlay IP becomes healthy (or starts, if no health check is defined), the Leader writes A records to etcd (see the `/kat/dns/...` key schema in Section 5.2) mapping service names under the cluster domain (e.g., `{workloadName}.{namespace}.{clusterDomain}`, default `kat.cluster.local`) to the overlay IPs of the workload's healthy instances.
    *   Each Agent's resolver syncs these records from etcd and answers queries for the cluster domain from its local copy.

#### 7.4. Ingress (Opinionated Recipe via Traefik)

KAT does not ship a built-in ingress controller; external exposure follows an opinionated recipe built around Traefik. The `ingress` entries in `VirtualLoadBalancer.kat` (host, path, target port, `tls`) are rendered into Traefik dynamic configuration that routes external requests to the service's internal DNS name and port (e.g., `{workloadName}.{namespace}.kat.cluster.local:{port}`):
*   Configures a Traefik `certResolver` for Let's Encrypt for services requesting TLS.
*   Traefik watches its dynamic configuration directory.

---

### 8. API Specification (KAT v1.0 Alpha)

#### 8.1. General Principles and Authentication

*   **Protocol:** HTTP/1.1 or HTTP/2. Mandatory mTLS for Agent-Leader and CLI-Leader communication.
*   **Data Format:** Request/Response bodies MUST be JSON.
*   **Versioning:** Endpoints are prefixed with `/v1alpha1`.
*   **Authentication:** Static Bearer Token in the `Authorization` header for CLI/external API clients. For KAT v1, this token grants full cluster admin rights. Agent-to-Leader mTLS serves as agent authentication.
*   **Error Reporting:** Standard HTTP status codes.
JSON body for errors: `{"error": "code", "message": "details"}`. + +#### 8.2. Resource Representation (Proto3 & JSON) + +All API resources (Workloads, Namespaces, Nodes, etc., and their Quadlet file contents) are defined using Protocol Buffer v3 messages. The HTTP API transports these as JSON. Common metadata (name, namespace, uid, generation, resourceVersion, creationTimestamp) and status structures are standardized. + +#### 8.3. Core API Endpoints + +(Referencing the structure from prior discussion in RFC draft section 8.3, ensuring: +* Namespace CRUD. +* Workload CRUD: `POST/PUT` accepts `tar.gz` of Quadlet dir. `GET` returns metadata+status. Endpoints for individual Quadlet file content (`.../files/{fileName}`). Endpoint for logs (`.../instances/{instanceID}/logs`). Endpoint for rollback (`.../rollback`). +* Node read endpoints: `GET /nodes`, `GET /nodes/{name}`. Agent status update: `POST /nodes/{name}/status`. Admin taint update: `PUT /nodes/{name}/taints`. +* Event query endpoint: `GET /events`. +* ClusterConfiguration read endpoint: `GET /config/cluster` (shows sanitized running config). +No separate top-level Volume API for KAT v1; volumes are defined within workloads.) + +--- + +### 9. Observability + +#### 9.1. Logging + +* **Container Logs:** Agents capture stdout/stderr, make available via `podman logs` mechanism, and stream via API to `katcall logs`. Local rotation on agent node. +* **Agent Logs:** `kat-agent` logs to systemd journal or local files. +* **API Audit (Basic):** Leader logs API requests (method, path, source IP, user if distinguishable) at a configurable level. + +#### 9.2. Metrics + +* **Agent Metrics:** Node resource usage (CPU, memory, disk, network), container resource usage. Included in heartbeats. +* **Leader Metrics:** API request latencies/counts, scheduling attempts/successes/failures, etcd health. +* **Exposure (V1):** Minimal exposure via a `/metrics` JSON endpoint on Leader and Agent, not Prometheus formatted yet. +* **Future:** Standardized Prometheus exposition format. + +#### 9.3. Events + +Leader records significant cluster events (Workload create/update/delete, instance schedule/fail/health_change, Node ready/not_ready/join/leave, build success/fail, detached/rejoin actions) into a capped, time-series like structure in etcd. +* **API:** `GET /v1alpha1/events?[resourceType=X][&resourceName=Y][&namespace=Z]` +* Fields per event: Timestamp, Type, Reason, InvolvedObject (kind, name, ns, uid), Message. + +--- + +### 10. Security Considerations + +#### 10.1. API Security + +* mTLS REQUIRED for all inter-KAT component communication (Agent-Leader). +* Bearer token for external API clients (e.g., `katcall`). V1: single admin token. No granular RBAC. +* API server should implement rate limiting. + +#### 10.2. Rootless Execution + +Core design principle. Agents execute workloads via Podman in rootless mode, leveraging systemd user sessions for enhanced isolation. Minimizes container escape impact. + +#### 10.3. Build Security + +* Building arbitrary Git repositories on Agent nodes is a potential risk. +* Builds run as unprivileged users via rootless Podman. +* Network access during build MAY be restricted in future (V1: unrestricted). +* Users are responsible for trusting Git sources. `cacheImage` provides a way to use pre-vetted images. + +#### 10.4. Network Security + +* WireGuard overlay provides inter-node and inter-container encryption. +* Host firewalls are user responsibility. `nodePort` or Ingress exposure requires careful firewall configuration. 
+* API/Agent communication ports should be firewalled from public access. + +#### 10.5. Secrets Management + +* KAT v1 has NO dedicated secret management. +* Sensitive data passed via environment variables in `workload.kat -> spec.container.env` is stored plain in etcd. This is NOT secure for production secrets. +* Registry credentials for `cacheImage` push/pull are local Agent configuration. +* **Recommendation:** For sensitive data, users should use application-level encryption or sidecars that fetch from external secret stores (e.g., Vault), outside KAT's direct management in V1. + +#### 10.6. Internal PKI + +1. **Initialization (`kat-agent init`):** + * Generates a self-signed CA key (`ca.key`) and CA certificate (`ca.crt`). Stored securely on the initial Leader node (e.g., `/var/lib/kat/pki/`). + * Generates a Leader server key/cert signed by this CA for its API and Agent communication endpoints. + * Generates a Leader client key/cert signed by this CA for authenticating to etcd and Agents. +2. **Node Join (`kat-agent join`):** + * Agent generates a keypair and a CSR. + * Sends CSR to Leader over an initial (potentially untrusted, or token-protected if implemented later) channel. + * Leader signs the Agent's CSR using the CA key, returns the signed Agent certificate and the CA certificate. + * Agent stores its key, its signed certificate, and the CA cert for mTLS. +3. **mTLS Usage:** All Agent-Leader and Leader-Agent (for commands) communications use mTLS, validating peer certificates against the cluster CA. +4. **Certificate Lifespan & Rotation:** For V1, certificates might have a long lifespan (e.g., 1-10 years). Automated rotation is deferred. Manual regeneration/redistribution would be needed. + +--- + +### 13. Acknowledgements + +The KAT system design, while aiming for novel simplicity, stands on the shoulders of giants. Its architecture and concepts draw inspiration and incorporate lessons learned from numerous preceding systems and bodies of work in distributed computing and container orchestration. We specifically acknowledge the influence of: + +* **Kubernetes:** For establishing many of the core concepts and terminology in modern container orchestration, even as KAT diverges in implementation complexity and API specifics. +* **k3s and MicroK8s:** For demonstrating the demand and feasibility of lightweight Kubernetes distributions, validating the need KAT aims to fill more radically. +* **Podman & Quadlets:** For pioneering robust rootless containerization and providing the direct inspiration for KAT's declarative Quadlet configuration model and systemd user service execution strategy. +* **Docker Compose:** For setting the standard in single-host multi-container application definition simplicity. +* **HashiCorp Nomad:** For demonstrating an alternative, successful approach to simplified, flexible orchestration beyond the Kubernetes paradigm, particularly its use of HCL and clear deployment primitives. +* **Google Borg:** For concepts in large-scale cluster management, scheduling, and the importance of introspection, as documented in their published research. +* **The "Hints for Computer System Design" (Butler Lampson):** For principles regarding simplicity, abstraction, performance trade-offs, and fault tolerance that heavily influenced KAT's philosophy. +* **"A Note on Distributed Computing" (Waldo et al.):** For articulating the fundamental differences between local and distributed computing that KAT attempts to manage pragmatically, rather than hide entirely. 
+* **The Grug Brained Developer:** For the essential reminder to relentlessly fight complexity and prioritize understandability. +* **Open Source Community:** For countless libraries, tools, discussions, and prior art that make a project like KAT feasible. + +Finally, thanks to **Simba**, my cat, for providing naming inspiration. + +--- + +### 14. Author's Address + +Tanishq Dubey\ +DWS LLC\ +Email: dubey@dws.rip\ +URI: https://www.dws.rip \ No newline at end of file