# Request for Comments: 001 - The KAT System (v1.0 Specification)

**Network Working Group:** DWS LLC\
**Author:** T. Dubey\
**Contact:** dubey@dws.rip\
**Organization:** DWS LLC\
**URI:** https://www.dws.rip\
**Date:** May 2025\
**Obsoletes:** None\
**Updates:** None

## The KAT System: A Simplified Container Orchestration Protocol and Architecture Design (Version 1.0)

### Status of This Memo

This document specifies Version 1.0 of the KAT (pronounced "cat") system, a simplified container orchestration protocol and architecture developed by DWS LLC. It defines the system's components, operational semantics, resource model, networking, state management, and Application Programming Interface (API). This specification is intended for implementation, discussion, and interoperability. Distribution of this memo is unlimited.

### Abstract

The KAT system provides a lightweight, opinionated container orchestration platform specifically designed for resource-constrained environments such as single servers, small clusters, development sandboxes, home labs, and edge deployments. It contrasts with complex, enterprise-scale orchestrators by prioritizing simplicity, minimal resource overhead, developer experience, and direct integration with Git-based workflows. KAT manages containerized long-running services and batch jobs using a declarative "Quadlet" configuration model. Key features include an embedded etcd state store, a Leader-Agent architecture, automated on-agent builds from Git sources, rootless container execution, integrated overlay networking (WireGuard-based), distributed agent-local DNS for service discovery, resource-based scheduling with basic affinity/anti-affinity rules, and structured workload updates. This document provides a comprehensive specification for KAT v1.0 implementation and usage.

### Table of Contents

1. [Introduction](#1-introduction)\
   1.1. [Motivation](#11-motivation)\
   1.2. [Goals](#12-goals)\
   1.3. [Design Philosophy](#13-design-philosophy)\
   1.4. [Scope of KAT v1.0](#14-scope-of-kat-v10)\
   1.5. [Terminology](#15-terminology)
2. [System Architecture](#2-system-architecture)\
   2.1. [Overview](#21-overview)\
   2.2. [Components](#22-components)\
   2.3. [Node Communication Protocol](#23-node-communication-protocol)
3. [Resource Model: KAT Quadlets](#3-resource-model-kat-quadlets)\
   3.1. [Overview](#31-overview)\
   3.2. [Workload Definition (`workload.kat`)](#32-workload-definition-workloadkat)\
   3.3. [Virtual Load Balancer Definition (`VirtualLoadBalancer.kat`)](#33-virtual-load-balancer-definition-virtualloadbalancerkat)\
   3.4. [Job Definition (`job.kat`)](#34-job-definition-jobkat)\
   3.5. [Build Definition (`build.kat`)](#35-build-definition-buildkat)\
   3.6. [Volume Definition (`volume.kat`)](#36-volume-definition-volumekat)\
   3.7. [Namespace Definition (`namespace.kat`)](#37-namespace-definition-namespacekat)\
   3.8. [Node Resource (Internal)](#38-node-resource-internal)\
   3.9. [Cluster Configuration (`cluster.kat`)](#39-cluster-configuration-clusterkat)
4. [Core Operations and Lifecycle Management](#4-core-operations-and-lifecycle-management)\
   4.1. [System Bootstrapping and Node Lifecycle](#41-system-bootstrapping-and-node-lifecycle)\
   4.2. [Workload Deployment and Source Management](#42-workload-deployment-and-source-management)\
   4.3. [Git-Native Build Process](#43-git-native-build-process)\
   4.4. [Scheduling](#44-scheduling)\
   4.5. [Workload Updates and Rollouts](#45-workload-updates-and-rollouts)\
   4.6. [Container Lifecycle Management](#46-container-lifecycle-management)\
   4.7. [Volume Lifecycle Management](#47-volume-lifecycle-management)\
   4.8. [Job Execution Lifecycle](#48-job-execution-lifecycle)\
   4.9. [Detached Node Operation and Rejoin](#49-detached-node-operation-and-rejoin)
5. [State Management](#5-state-management)\
   5.1. [State Store Interface (Go)](#51-state-store-interface-go)\
   5.2. [etcd Implementation Details](#52-etcd-implementation-details)\
   5.3. [Leader Election](#53-leader-election)\
   5.4. [State Backup (Leader Responsibility)](#54-state-backup-leader-responsibility)\
   5.5. [State Restore Procedure](#55-state-restore-procedure)
6. [Container Runtime Interface](#6-container-runtime-interface)\
   6.1. [Runtime Interface Definition (Go)](#61-runtime-interface-definition-go)\
   6.2. [Default Implementation: Podman](#62-default-implementation-podman)\
   6.3. [Rootless Execution Strategy](#63-rootless-execution-strategy)
7. [Networking](#7-networking)\
   7.1. [Integrated Overlay Network](#71-integrated-overlay-network)\
   7.2. [IP Address Management (IPAM)](#72-ip-address-management-ipam)\
   7.3. [Distributed Agent DNS and Service Discovery](#73-distributed-agent-dns-and-service-discovery)\
   7.4. [Ingress (Opinionated Recipe via Traefik)](#74-ingress-opinionated-recipe-via-traefik)
8. [API Specification (KAT v1.0 Alpha)](#8-api-specification-kat-v10-alpha)\
   8.1. [General Principles and Authentication](#81-general-principles-and-authentication)\
   8.2. [Resource Representation (Proto3 & JSON)](#82-resource-representation-proto3--json)\
   8.3. [Core API Endpoints](#83-core-api-endpoints)
9. [Observability](#9-observability)\
   9.1. [Logging](#91-logging)\
   9.2. [Metrics](#92-metrics)\
   9.3. [Events](#93-events)
10. [Security Considerations](#10-security-considerations)\
    10.1. [API Security](#101-api-security)\
    10.2. [Rootless Execution](#102-rootless-execution)\
    10.3. [Build Security](#103-build-security)\
    10.4. [Network Security](#104-network-security)\
    10.5. [Secrets Management](#105-secrets-management)\
    10.6. [Internal PKI](#106-internal-pki)
11. [Comparison to Alternatives](#11-comparison-to-alternatives)
12. [Future Work](#12-future-work)
13. [Acknowledgements](#13-acknowledgements)
14. [Author's Address](#14-authors-address)

---

### 1. Introduction

#### 1.1. Motivation

The landscape of container orchestration is dominated by powerful, feature-rich platforms designed for large-scale, enterprise deployments. While capable, these systems (e.g., Kubernetes) introduce significant operational complexity and resource requirements (CPU, memory overhead) that are often prohibitive or unnecessarily burdensome for smaller use cases. Developers and operators managing personal projects, home labs, CI/CD runners, small business applications, or edge devices frequently face a choice between the friction of manual deployment (SSH, scripts, `docker-compose`) and the excessive overhead of full-scale orchestrators. This gap highlights the need for a solution that provides core orchestration benefits – declarative management, automation, scheduling, self-healing – without the associated complexity and resource cost. KAT aims to be that solution.

#### 1.2. Goals

The primary objectives guiding the design of KAT v1.0 are:

* **Simplicity:** Offer an orchestration experience that is significantly easier to install, configure, learn, and operate than existing mainstream platforms. Minimize conceptual surface area and required configuration.
* **Minimal Overhead:** Design KAT's core components (Leader, Agent, etcd) to consume minimal system resources, ensuring maximum availability for application workloads, particularly critical in single-node or low-resource scenarios.
* **Core Orchestration:** Provide robust management for the lifecycle of containerized long-running services, scheduled/batch jobs, and basic daemon sets.
* **Automation:** Enable automated deployment updates, on-agent image builds triggered directly from Git repository changes, and fundamental self-healing capabilities (container restarts, service replica rescheduling).
* **Git-Native Workflow:** Facilitate a direct "push-to-deploy" model, integrating seamlessly with common developer workflows centered around Git version control.
* **Rootless Operation:** Implement container execution using unprivileged users by default to enhance security posture and reduce system dependencies.
* **Integrated Experience:** Provide built-in solutions for fundamental requirements like overlay networking and service discovery, reducing reliance on complex external components for basic operation.

#### 1.3. Design Philosophy

KAT adheres to the following principles:

* **Embrace Simplicity (Grug Brained):** Actively combat complexity. Prefer simpler solutions even if they cover slightly fewer edge cases initially. Provide opinionated defaults based on common usage patterns. ([The Grug Brained Developer](#))
* **Declarative Configuration:** Users define the *desired state* via Quadlet files; KAT implements the control loops to achieve and maintain it.
* **Locality of Behavior:** Group related configurations logically (Quadlet directories) rather than by arbitrary type separation across the system. ([HTMX: Locality of Behaviour](https://htmx.org/essays/locality-of-behaviour/))
* **Leverage Stable Foundations:** Utilize proven, well-maintained technologies like etcd (for consensus) and Podman (for container runtime).
* **Explicit is Better than Implicit (Mostly):** While providing defaults, make configuration options clear and understandable. Avoid overly "magic" behavior.
* **Build for the Common Case:** Focus V1 on solving the 80-90% of use cases for the target audience extremely well.
* **Fail Fast, Recover Simply:** Design components to handle failures predictably. Prioritize simple recovery mechanisms (like etcd snapshots, agent restarts converging state) over complex distributed failure handling protocols where possible for V1.

#### 1.4. Scope of KAT v1.0

This specification details KAT Version 1.0. It includes:

* Leader-Agent architecture with etcd-based state and leader election.
* Quadlet resource model (`Workload`, `VirtualLoadBalancer`, `JobDefinition`, `BuildDefinition`, `VolumeDefinition`, `Namespace`).
* Deployment of Services, Jobs, and DaemonServices.
* Source specification via direct image name or Git repository (with on-agent builds using Podman). Optional build caching via registry.
* Resource-based scheduling with `nodeSelector` and Taint/Toleration support, using a "most empty" placement strategy.
* Workload updates via `Simultaneous` or `Rolling` strategies (`maxSurge` control). Manual rollback support.
* Container lifecycle management including restart policies (`Never`, `MaxCount`, `Always` with reset timer) and optional health checks.
* Volume support for `HostMount` and `SimpleClusterStorage`.
* Integrated WireGuard-based overlay networking.
* Distributed agent-local DNS for service discovery, synchronized via etcd.
* Detached node operation mode with simplified rejoin logic.
* Basic state backup via Leader-driven etcd snapshots.
* Rootless container execution via systemd user services.
* A Proto3-defined, JSON-over-HTTP RESTful API (v1alpha1).
* Opinionated Ingress recipe using Traefik.

Features explicitly deferred beyond v1.0 are listed in Section 12.

#### 1.5. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

* **KAT System (Cluster):** The complete set of KAT Nodes forming a single operational orchestration domain.
* **Node:** An individual machine (physical or virtual) running the KAT Agent software. Each Node has a unique name within the cluster.
* **Leader Node (Leader):** The single Node currently elected via the consensus mechanism to perform authoritative cluster management tasks.
* **Agent Node (Agent):** A Node running the KAT Agent software, responsible for local workload execution and status reporting. Includes the Leader node.
* **Namespace:** A logical partition within the KAT cluster used to organize resources (Workloads, Volumes). Defined by `namespace.kat`. Default is "default". System namespace is "kat-core".
* **Workload:** The primary unit of application deployment, defined by a set of Quadlet files specifying desired state. Types: `Service`, `Job`, `DaemonService`.
* **Service:** A Workload type representing a long-running application.
* **Job:** A Workload type representing a task that runs to completion.
* **DaemonService:** A Workload type ensuring one instance runs on each eligible Node.
* **KAT Quadlet (Quadlet):** A set of co-located YAML files (`*.kat`) defining a single Workload. Submitted and managed atomically.
* **Container:** A running instance managed by the container runtime (Podman).
* **Image:** The template for a container, specified directly or built from Git.
* **Volume:** Persistent or ephemeral storage attached to a Workload's container(s). Types: `HostMount`, `SimpleClusterStorage`.
* **Overlay Network:** KAT-managed virtual network (WireGuard) for inter-node/inter-container communication.
* **Service Discovery:** Mechanism (distributed agent DNS) for resolving service names to overlay IPs.
* **Ingress:** Exposing internal services externally, typically via the Traefik recipe.
* **Tick:** Configurable interval for Agent heartbeats to the Leader.
* **Taint:** Key/Value/Effect marker on a Node to repel workloads.
* **Toleration:** Marker on a Workload allowing it to schedule on Nodes with matching Taints.
* **API:** Application Programming Interface (HTTP/JSON based on Proto3).
* **CLI:** Command Line Interface (`katcall`).
* **etcd:** Distributed key-value store used for consensus and state.

---

### 2. System Architecture

#### 2.1. Overview

KAT operates using a distributed Leader-Agent model built upon an embedded etcd consensus layer. One `kat-agent` instance is elected Leader, responsible for maintaining the cluster's desired state, making scheduling decisions, and serving the API. All other `kat-agent` instances act as workers (Agents), executing tasks assigned by the Leader and reporting their status. Communication occurs primarily between Agents and the Leader, facilitated by an integrated overlay network.

#### 2.2. Components

* **`kat-agent` (Binary):** The single executable deployed on all nodes. Runs in one of two primary modes internally based on leader election status: Agent or Leader.
    * **Common Functions:** Node registration, heartbeating, overlay network participation, local container runtime interaction (Podman via the ContainerRuntime interface, Section 6), local state monitoring, execution of Leader commands.
    * **Rootless Execution:** Manages container workloads under distinct, unprivileged user accounts via systemd user sessions (preferred method).
* **Leader Role (Internal state within an elected `kat-agent`):**
    * Hosts API endpoints.
    * Manages desired/actual state in etcd.
    * Runs the scheduling logic.
    * Executes the reconciliation control loop.
    * Manages IPAM for the overlay network.
    * Updates DNS records in etcd.
    * Coordinates node join/leave/failure handling.
    * Initiates etcd backups.
* **Embedded etcd:** Linked library providing Raft consensus for leader election and strongly consistent key-value storage for all cluster state (desired specs, actual status, network config, DNS records). Runs within the `kat-agent` process on quorum members (typically 1, 3, or 5 nodes).

#### 2.3. Node Communication Protocol

* **Transport:** HTTP/1.1 or HTTP/2 over mandatory mTLS. KAT includes a simple internal PKI bootstrapped during `init` and `join`.
* **Agent -> Leader:** Periodic `POST /v1alpha1/nodes/{nodeName}/status` containing heartbeat and detailed node/workload status. Triggered every `Tick`. Immediate reports for critical events MAY be sent.
* **Leader -> Agent:** Commands (create/start/stop/remove container, update config) sent via `POST/PUT/DELETE` to agent-specific endpoints (e.g., `https://{agentOverlayIP}:{agentPort}/agent/v1alpha1/...`).
* **Payload:** JSON, derived from Proto3 message definitions.
* **Discovery/Join:** Initial contact via leader hint uses the HTTP API; subsequent peer discovery for etcd/overlay uses information distributed by the Leader via the API/etcd.
* **Detached Mode Communication:** Multicast/Broadcast UDP for `REJOIN_REQUEST` messages on local network segments. Direct HTTP response from the parent Leader.

---

### 3. Resource Model: KAT Quadlets

#### 3.1. Overview

KAT configuration is declarative, centered around the "Quadlet" concept. A Workload is defined by a directory containing YAML files (`*.kat`), each specifying a different aspect (`kind`). This promotes modularity and locality of behavior.

#### 3.2. Workload Definition (`workload.kat`)

REQUIRED. Defines the core identity, source, type, and lifecycle policies.

```yaml
apiVersion: kat.dws.rip/v1alpha1
kind: Workload
metadata:
  name: string                  # REQUIRED. Workload name.
  namespace: string             # OPTIONAL. Defaults to "default".
spec:
  type: enum                    # REQUIRED: Service | Job | DaemonService
  source:                       # REQUIRED. Exactly ONE of image or git must be present.
    image: string               # OPTIONAL. Container image reference.
    git:                        # OPTIONAL. Build from Git.
      repository: string        # REQUIRED if git. URL of Git repo.
      branch: string            # OPTIONAL. Defaults to repo default.
      tag: string               # OPTIONAL. Overrides branch.
      commit: string            # OPTIONAL. Overrides tag/branch.
    cacheImage: string          # OPTIONAL. Registry path for build cache layers.
                                # Used only if 'git' source is specified.
  replicas: int                 # REQUIRED for type: Service. Desired instance count.
                                # Ignored for Job, DaemonService.
  updateStrategy:               # OPTIONAL for Service/DaemonService.
    type: enum                  # REQUIRED: Rolling | Simultaneous. Default: Rolling.
    rolling:                    # Relevant if type is Rolling.
      maxSurge: int | string    # OPTIONAL. Max extra instances during update. Default 1.
  restartPolicy:                # REQUIRED for container lifecycle.
    condition: enum             # REQUIRED: Never | MaxCount | Always
    maxRestarts: int            # OPTIONAL. Used if condition=MaxCount. Default 5.
    resetSeconds: int           # OPTIONAL. Used if condition=MaxCount. Window to reset count. Default 3600.
  nodeSelector: map[string]string   # OPTIONAL. Schedule only on nodes matching all labels.
  tolerations:                  # OPTIONAL. List of taints this workload can tolerate.
    - key: string
      operator: enum            # OPTIONAL. Exists | Equal. Default: Exists.
      value: string             # OPTIONAL. Needed if operator=Equal.
      effect: enum              # OPTIONAL. NoSchedule | PreferNoSchedule. Matches taint effect.
                                # Empty matches all effects for the key/value pair.
  # --- Container specification (V1 assumes one primary container per workload) ---
  container:                    # REQUIRED.
    name: string                # OPTIONAL. Informational name for the container.
    command: [string]           # OPTIONAL. Override image CMD.
    args: [string]              # OPTIONAL. Override image ENTRYPOINT args or CMD args.
    env:                        # OPTIONAL. Environment variables.
      - name: string
        value: string
    volumeMounts:               # OPTIONAL. Mount volumes defined in spec.volumes.
      - name: string            # Volume name.
        mountPath: string       # Path inside container.
        subPath: string         # OPTIONAL. Mount sub-directory.
        readOnly: bool          # OPTIONAL. Default false.
    resources:                  # OPTIONAL. Resource requests and limits.
      requests:                 # Used for scheduling. Defaults to limits if unspecified.
        cpu: string             # e.g., "100m"
        memory: string          # e.g., "64Mi"
      limits:                   # Enforced by runtime. Container killed if memory exceeded.
        cpu: string             # CPU throttling limit (e.g., "1")
        memory: string          # e.g., "256Mi"
      gpu:                      # OPTIONAL. Request GPU resources.
        driver: enum            # OPTIONAL: any | nvidia | amd
        minVRAM_MB: int         # OPTIONAL. Minimum GPU memory required.
  # --- Volume Definitions for this Workload ---
  volumes:                      # OPTIONAL. Defines volumes used by container.volumeMounts.
    - name: string              # REQUIRED. Name referenced by volumeMounts.
      simpleClusterStorage: {}  # OPTIONAL. Creates dir under agent's volumeBasePath.
                                # Use ONE OF simpleClusterStorage or hostMount.
      hostMount:                # OPTIONAL. Mounts a specific path from the host node.
        hostPath: string        # REQUIRED if hostMount. Absolute path on host.
        ensureType: enum        # OPTIONAL: DirectoryOrCreate | Directory | FileOrCreate | File | Socket
```
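
For concreteness, a minimal `workload.kat` following this schema might look as follows; the names, image reference, and resource values are illustrative only and are not defined by this specification.

```yaml
apiVersion: kat.dws.rip/v1alpha1
kind: Workload
metadata:
  name: my-web-app                        # illustrative name
  namespace: default
spec:
  type: Service
  source:
    image: docker.io/library/nginx:1.27   # placeholder image reference
  replicas: 2
  updateStrategy:
    type: Rolling
    rolling:
      maxSurge: 1
  restartPolicy:
    condition: MaxCount
    maxRestarts: 5
    resetSeconds: 3600
  container:
    name: web
    resources:
      requests:
        cpu: "100m"
        memory: "64Mi"
      limits:
        cpu: "500m"
        memory: "128Mi"
```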

#### 3.3. Virtual Load Balancer Definition (`VirtualLoadBalancer.kat`)

OPTIONAL. Only relevant for `Workload` of `type: Service`. Defines networking endpoints and health criteria for load balancing and ingress.

```yaml
apiVersion: kat.dws.rip/v1alpha1
kind: VirtualLoadBalancer        # Identifies this Quadlet file's purpose
spec:
  ports:                         # REQUIRED if this file exists. List of ports to expose/balance.
    - name: string               # OPTIONAL. Informational name (e.g., "web", "grpc").
      containerPort: int         # REQUIRED. Port the application listens on inside container.
      protocol: string           # OPTIONAL. TCP | UDP. Default TCP.
  healthCheck:                   # OPTIONAL. Used for readiness in rollouts and LB target selection.
                                 # If omitted, container running status implies health.
    exec:
      command: [string]          # REQUIRED. Exit 0 = healthy.
    initialDelaySeconds: int     # OPTIONAL. Default 0.
    periodSeconds: int           # OPTIONAL. Default 10.
    timeoutSeconds: int          # OPTIONAL. Default 1.
    successThreshold: int        # OPTIONAL. Default 1.
    failureThreshold: int        # OPTIONAL. Default 3.
  ingress:                       # OPTIONAL. Hints for external ingress controllers (like the Traefik recipe).
    - host: string               # REQUIRED. External hostname.
      path: string               # OPTIONAL. Path prefix. Default "/".
      servicePortName: string    # OPTIONAL. Name of port from spec.ports to target.
      servicePort: int           # OPTIONAL. Port number from spec.ports. Overrides name.
                                 # One of servicePortName or servicePort MUST be provided if ports exist.
      tls: bool                  # OPTIONAL. If true, signal ingress controller to manage TLS via ACME.
```
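
A matching `VirtualLoadBalancer.kat` for the illustrative service shown in Section 3.2 could look like the following; the hostname and health-check command are placeholders, not defined by this specification.

```yaml
apiVersion: kat.dws.rip/v1alpha1
kind: VirtualLoadBalancer
spec:
  ports:
    - name: web
      containerPort: 80
      protocol: TCP
  healthCheck:
    exec:
      command: ["curl", "-f", "http://localhost:80/"]
    initialDelaySeconds: 5
    periodSeconds: 10
    failureThreshold: 3
  ingress:
    - host: app.example.com      # placeholder hostname
      path: /
      servicePortName: web
      tls: true
```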

#### 3.4. Job Definition (`job.kat`)

REQUIRED if `spec.type` in `workload.kat` is `Job`.

```yaml
apiVersion: kat.dws.rip/v1alpha1
kind: JobDefinition              # Identifies this Quadlet file's purpose
spec:
  schedule: string               # OPTIONAL. Cron schedule string.
  completions: int               # OPTIONAL. Desired successful completions. Default 1.
  parallelism: int               # OPTIONAL. Max concurrent instances. Default 1.
  activeDeadlineSeconds: int     # OPTIONAL. Timeout for the job run.
  backoffLimit: int              # OPTIONAL. Max failed instance restarts before job fails. Default 3.
```
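
As an illustration only, a nightly maintenance job could be declared like this; the schedule and timeout values are arbitrary examples.

```yaml
apiVersion: kat.dws.rip/v1alpha1
kind: JobDefinition
spec:
  schedule: "0 3 * * *"          # example: run daily at 03:00
  completions: 1
  parallelism: 1
  activeDeadlineSeconds: 600
  backoffLimit: 3
```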

#### 3.5. Build Definition (`build.kat`)

REQUIRED if `spec.source.git` is specified in `workload.kat`.

```yaml
apiVersion: kat.dws.rip/v1alpha1
kind: BuildDefinition
spec:
  buildContext: string           # OPTIONAL. Path relative to repo root. Defaults to ".".
  dockerfilePath: string         # OPTIONAL. Path relative to buildContext. Defaults to "./Dockerfile".
  buildArgs:                     # OPTIONAL. map[string]string of build arguments,
                                 # e.g., VERSION: "1.2.3"
  targetStage: string            # OPTIONAL. Target stage name for multi-stage builds.
  platform: string               # OPTIONAL. Target platform (e.g., "linux/arm64").
  cache:                         # OPTIONAL. Defines build caching strategy.
    registryPath: string         # OPTIONAL. Registry path (e.g., "myreg.com/cache/myapp").
                                 # Agent tags cache image with commit SHA.
```

#### 3.6. Volume Definition (`volume.kat`)

DEPRECATED in favor of defining volumes directly within `workload.kat -> spec.volumes`. This enhances Locality of Behavior. Section 3.2 reflects this change. This file kind is reserved for potential future use with cluster-wide persistent volumes.

#### 3.7. Namespace Definition (`namespace.kat`)

REQUIRED for defining non-default namespaces.

```yaml
apiVersion: kat.dws.rip/v1alpha1
kind: Namespace
metadata:
  name: string                   # REQUIRED. Name of the namespace.
  # labels: map[string]string    # OPTIONAL.
```

#### 3.8. Node Resource (Internal)

Represents node state managed by the Leader, queryable via API. Not defined by user Quadlets. Contains fields like `name`, `status`, `addresses`, `capacity`, `allocatable`, `labels`, `taints`.

#### 3.9. Cluster Configuration (`cluster.kat`)

Used *only* during `kat-agent init` via a flag (e.g., `--config cluster.kat`). Defines immutable cluster-wide parameters.

```yaml
apiVersion: kat.dws.rip/v1alpha1
kind: ClusterConfiguration
metadata:
  name: string                   # REQUIRED. Informational name for the cluster.
spec:
  clusterCIDR: string            # REQUIRED. CIDR for overlay network IPs (e.g., "10.100.0.0/16").
  serviceCIDR: string            # REQUIRED. CIDR for internal virtual IPs (used by future internal proxy/LB).
                                 # Not directly used by containers in V1 networking model.
  nodeSubnetBits: int            # OPTIONAL. Number of bits for node subnets within clusterCIDR.
                                 # Default 7 (yielding /23 subnets if clusterCIDR=/16).
  clusterDomain: string          # OPTIONAL. DNS domain suffix. Default "kat.cluster.local".
  # --- Port configurations ---
  agentPort: int                 # OPTIONAL. Port agent listens on (internal). Default 9116.
  apiPort: int                   # OPTIONAL. Port leader listens on for API. Default 9115.
  etcdPeerPort: int              # OPTIONAL. Default 2380.
  etcdClientPort: int            # OPTIONAL. Default 2379.
  # --- Path configurations ---
  volumeBasePath: string         # OPTIONAL. Agent base path for SimpleClusterStorage. Default "/var/lib/kat/volumes".
  backupPath: string             # OPTIONAL. Path on Leader for etcd backups. Default "/var/lib/kat/backups".
  # --- Interval configurations ---
  backupIntervalMinutes: int     # OPTIONAL. Frequency of etcd backups. Default 30.
  agentTickSeconds: int          # OPTIONAL. Agent heartbeat interval. Default 15.
  nodeLossTimeoutSeconds: int    # OPTIONAL. Time before marking node NotReady. Default 60.
```
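
For illustration, a small single-node or home-lab cluster might be initialized with a `cluster.kat` such as the following; every value shown is an example, and omitted fields fall back to the defaults listed above.

```yaml
apiVersion: kat.dws.rip/v1alpha1
kind: ClusterConfiguration
metadata:
  name: home-lab                 # example cluster name
spec:
  clusterCIDR: "10.100.0.0/16"
  serviceCIDR: "10.101.0.0/16"
  clusterDomain: "kat.cluster.local"
  agentTickSeconds: 15
  nodeLossTimeoutSeconds: 60
```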

---

### 4. Core Operations and Lifecycle Management

This section details the operational logic, state transitions, and management processes within the KAT system, from cluster initialization to workload execution and node dynamics.

#### 4.1. System Bootstrapping and Node Lifecycle

##### 4.1.1. Initial Leader Setup

The first KAT Node is initialized to become the Leader and establish the cluster.

1. **Command:** The administrator executes `kat-agent init --config <path_to_cluster.kat>` on the designated initial node. The `cluster.kat` file (see Section 3.9) provides essential cluster-wide parameters.
2. **Action:**
    * The `kat-agent` process starts.
    * It parses the `cluster.kat` file to obtain parameters like ClusterCIDR, ServiceCIDR, domain, agent/API ports, etcd ports, volume paths, and backup settings.
    * It generates a new internal Certificate Authority (CA) key and certificate (for the PKI, see Section 10.6) if one doesn't already exist at a predefined path.
    * It initializes and starts an embedded single-node etcd server, using the configured etcd ports. The etcd data directory is created.
    * The agent campaigns for leadership via etcd's election mechanism (Section 5.3) and, being the only member, becomes the Leader. It writes its identity (e.g., its advertise IP and API port) to a well-known key in etcd (e.g., `/kat/config/leader_endpoint`).
    * The Leader initializes its IPAM module (Section 7.2) for the defined ClusterCIDR.
    * It generates its own WireGuard key pair, stores the private key securely, and publishes its public key and overlay endpoint (external IP and WireGuard port) to etcd.
    * It sets up its local `kat0` WireGuard interface using its assigned overlay IP (the first available from its own initial subnet).
    * It starts the API server on the configured API port.
    * It starts its local DNS resolver (Section 7.3).
    * The `kat-core` and `default` Namespaces are created in etcd if they do not exist.

##### 4.1.2. Agent Node Join

Subsequent Nodes join an existing KAT cluster to act as workers (and potential future etcd quorum members or leaders if so configured, though V1 focuses on a static initial quorum).

1. **Command:** `kat-agent join --leader-api <leader_api_ip:port> --advertise-address <ip_or_interface_name> [--etcd-peer]` (the `--etcd-peer` flag indicates this node should attempt to join the etcd quorum).
2. **Action:**
    * The `kat-agent` process starts.
    * It generates a WireGuard key pair.
    * It contacts the specified Leader API endpoint to request joining the cluster, sending its intended `advertise-address` (for inter-node WireGuard communication) and its WireGuard public key. It also sends a Certificate Signing Request (CSR) for its mTLS client/server certificate.
    * The Leader, upon validating the join request (V1 has no strong token validation; it relies on network trust):
        * Assigns a unique Node Name (if not provided by the agent, the Leader generates one) and a Node Subnet from the ClusterCIDR (Section 7.2).
        * Signs the Agent's CSR using the cluster CA, returning the signed certificate and the CA certificate.
        * Records the new Node's name, advertise address, WireGuard public key, and assigned subnet in etcd (e.g., under `/kat/nodes/registration/{nodeName}`).
        * If `--etcd-peer` was requested and the quorum has capacity, the Leader MAY instruct the node to join the etcd quorum by providing current peer URLs. (For V1, etcd peer addition post-init is considered an advanced operation; the default is a static initial quorum.)
        * Provides the new Agent with the list of all current Nodes' WireGuard public keys, overlay endpoint addresses, and their assigned overlay subnets (for `AllowedIPs`).
    * The joining Agent:
        * Stores the received mTLS certificate and CA certificate.
        * Configures its local `kat0` WireGuard interface with an IP from its assigned subnet (typically the `.1` address) and sets up peers for all other known nodes.
        * If instructed to join the etcd quorum, configures and starts its embedded etcd as a peer.
        * Registers itself formally with the Leader via a status update.
        * Starts its local DNS resolver and begins syncing DNS state from etcd.
        * Becomes `Ready` and available for scheduling workloads.

##### 4.1.3. Node Heartbeat and Status Reporting

Each Agent Node (including the Leader acting as an Agent for its local workloads) MUST periodically send a status update to the active Leader.

* **Interval:** Configurable `agentTickSeconds` (from `cluster.kat`, default 15 seconds).
* **Content:** The payload is a JSON object reflecting the Node's current state:
    * `nodeName`: string (its unique identifier)
    * `nodeUID`: string (a persistent unique ID for the node instance)
    * `timestamp`: int64 (Unix epoch seconds)
    * `resources`:
        * `capacity`: `{"cpu": "2000m", "memory": "4096Mi"}`
        * `allocatable`: `{"cpu": "1800m", "memory": "3800Mi"}` (capacity minus system overhead)
    * `workloadInstances`: array of objects, each detailing a locally managed container:
        * `workloadName`: string
        * `namespace`: string
        * `instanceID`: string (unique ID for this replica/run of the workload)
        * `containerID`: string (from Podman)
        * `imageID`: string (from Podman)
        * `state`: string ("running", "exited", "paused", "unknown")
        * `exitCode`: int (if exited)
        * `healthStatus`: string ("healthy", "unhealthy", "pending_check") (from the `VirtualLoadBalancer.kat` health check)
        * `restarts`: int (count of Agent-initiated restarts for this instance)
    * `overlayNetwork`: `{"status": "connected", "lastPeerSync": "timestamp"}`
* **Protocol:** HTTP `POST` to the Leader's `/v1alpha1/nodes/{nodeName}/status` endpoint, authenticated via mTLS. The Leader updates the Node's actual state in etcd (e.g., `/kat/nodes/actual/{nodeName}/status`). A sketch of this exchange appears after this list.
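
The following Go sketch illustrates the Agent side of this exchange; the struct and function names are hypothetical, and only the endpoint path and JSON field names come from this specification. A real Agent would use an `http.Client` configured with the mTLS certificates issued at join time (Section 2.3).

```go
package agent

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// WorkloadInstanceStatus mirrors the per-container fields listed above.
type WorkloadInstanceStatus struct {
	WorkloadName string `json:"workloadName"`
	Namespace    string `json:"namespace"`
	InstanceID   string `json:"instanceID"`
	ContainerID  string `json:"containerID"`
	ImageID      string `json:"imageID"`
	State        string `json:"state"`
	ExitCode     int    `json:"exitCode"`
	HealthStatus string `json:"healthStatus"`
	Restarts     int    `json:"restarts"`
}

// NodeStatus is the heartbeat payload POSTed every agentTickSeconds.
type NodeStatus struct {
	NodeName          string                       `json:"nodeName"`
	NodeUID           string                       `json:"nodeUID"`
	Timestamp         int64                        `json:"timestamp"`
	Resources         map[string]map[string]string `json:"resources"`
	WorkloadInstances []WorkloadInstanceStatus     `json:"workloadInstances"`
	OverlayNetwork    map[string]string            `json:"overlayNetwork"`
}

// sendHeartbeat marshals the status and posts it to the Leader.
func sendHeartbeat(client *http.Client, leaderAPI string, status NodeStatus) error {
	status.Timestamp = time.Now().Unix()
	body, err := json.Marshal(status)
	if err != nil {
		return err
	}
	url := fmt.Sprintf("%s/v1alpha1/nodes/%s/status", leaderAPI, status.NodeName)
	resp, err := client.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("leader rejected heartbeat: %s", resp.Status)
	}
	return nil
}
```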

##### 4.1.4. Node Departure and Failure Detection

* **Graceful Departure:**
    1. Admin action: `katcall drain <nodeName>`. This sets a `NoSchedule` Taint on the Node object in etcd and marks its desired state as "Draining".
    2. Leader reconciliation loop:
        * Stops scheduling *new* workloads to the Node.
        * For existing `Service` and `DaemonService` instances on the draining node, it initiates a process to reschedule them to other eligible nodes (respecting update strategies where applicable, e.g., not violating `maxUnavailable` for the service cluster-wide).
        * `Job` instances are allowed to complete. If a Job is actively running and the node is drained, KAT V1's behavior is to let it finish; more sophisticated preemption is future work.
    3. Once all managed workload instances are terminated or rescheduled, the Agent MAY send a final "departing" message, and the admin can decommission the node. The Leader eventually removes the Node object from etcd after a timeout if it stops heartbeating.
* **Failure Detection:**
    1. The Leader monitors Agent heartbeats. If an Agent has not been heard from for `nodeLossTimeoutSeconds` (from `cluster.kat`, e.g., 3 * `agentTickSeconds`), the Leader marks the Node's status in etcd as `NotReady`.
    2. Reconciliation loop for a `NotReady` Node:
        * For `Service` instances previously on the `NotReady` node: the Leader attempts to schedule replacement instances on other `Ready` eligible nodes to maintain `spec.replicas`.
        * For `DaemonService` instances: no action; the instance is bound to that node and is not rescheduled elsewhere.
        * For `Job` instances: if the job's restart policy allows it, the Leader MAY attempt to reschedule the failed job instance on another eligible node.
    3. If the Node rejoins (starts heartbeating again), the Leader marks it `Ready`. The reconciliation loop then assesses which workloads *should* be on this node (e.g., DaemonServices, or Services for which it is now the best fit). Instances that were rescheduled elsewhere and are now redundant are identified (the original instance may still be running locally if the node only suffered a network partition), and the Leader instructs the rejoined Agent to stop such zombie/duplicate containers based on `instanceID` tracking.

#### 4.2. Workload Deployment and Source Management

Workloads are the primary units of deployment, defined by Quadlet directories.

1. **Submission:**
    * The client (e.g., `katcall apply -f ./my-workload-dir/`) archives the Quadlet directory (e.g., `my-workload-dir/`) into a `tar.gz` file.
    * The client sends an HTTP `POST` (for new) or `PUT` (for update) to `/v1alpha1/n/{namespace}/workloads` (if the name is in `workload.kat`) or `/v1alpha1/n/{namespace}/workloads/{workloadName}` (for `PUT`). The body is the `tar.gz` archive.
    * The Leader validates the `metadata.name` in `workload.kat` against the URL path for `PUT`.
2. **Validation & Storage:**
    * The Leader unpacks the archive.
    * It validates each `.kat` file against its known schema (e.g., `Workload`, `VirtualLoadBalancer`, `BuildDefinition`, `JobDefinition`).
    * Cross-Quadlet file consistency is checked (e.g., referenced port names in `VirtualLoadBalancer.kat -> spec.ingress` exist in `VirtualLoadBalancer.kat -> spec.ports`).
    * If valid, the Leader persists each Quadlet file's content into etcd under `/kat/workloads/desired/{namespace}/{workloadName}/{fileName}`. The `metadata.generation` for the workload is incremented on spec changes.
    * The Leader responds `201 Created` or `200 OK` with the workload's metadata.
3. **Source Handling Precedence by Agent (upon receiving a deployment command):**
    1. If `workload.kat -> spec.source.git` is defined:
        a. If `workload.kat -> spec.source.cacheImage` is also defined, the Agent first attempts to pull this `cacheImage` (see Section 4.3, step 5). If successful and the image hash matches an expected value (e.g., if a git commit is also specified and used to tag the cache), this image is used and the local build MAY be skipped.
        b. If no cache image is configured, or the cache pull fails or mismatches, proceed to the Git build (Section 4.3). The resulting locally built image is used.
    2. Else, if `workload.kat -> spec.source.image` is defined (and no `git` source), the Agent pulls this image (Section 4.6.1).
    3. If neither `git` nor `image` is specified, the Leader rejects the workload as a validation error.

#### 4.3. Git-Native Build Process

Triggered when an Agent is instructed to run a Workload instance with `spec.source.git`.

1. **Setup:** The Agent creates a temporary, isolated build directory.
2. **Cloning:** `git clone --depth 1 --branch <branch_or_tag_or_default> <repository_url> .` (or `git fetch origin <commit> && git checkout <commit>`).
3. **Context & Dockerfile Path:** The Agent uses `buildContext` and `dockerfilePath` from `build.kat` (defaulting to `.` and `./Dockerfile` respectively).
4. **Build Execution:**
    * Construct the `podman build` command (see the sketch after this list) with:
        * `-t <internal_image_tag>` (e.g., `kat-local/{namespace}_{workloadName}:{git_commit_sha_short}`)
        * `-f {dockerfilePath}` within the `{buildContext}`.
        * `--build-arg` for each entry in `build.kat -> spec.buildArgs`.
        * `--target {targetStage}` if specified.
        * `--platform {platform}` if specified (else Podman defaults).
        * The build context path.
    * Execute as the Agent's rootless user or a dedicated build user for that workload.
5. **Build Caching (`build.kat -> spec.cache.registryPath`):**
    * **Pre-Build Pull (Cache Hit):** Before step 2 (Cloning), the Agent constructs a tag based on `registryPath` and the specific Git commit SHA (if available, else the latest of the branch/tag) and attempts `podman pull`. If successful, it uses this image and skips the local build steps.
    * **Post-Build Push (Cache Miss/New Build):** After a successful local build, the Agent tags the new image with `{registryPath}:{git_commit_sha_short}` and attempts `podman push`. Registry credentials MUST be configured locally on the Agent (e.g., in Podman's auth file for the build user). KAT v1 does not manage these credentials centrally.
6. **Outcome:** The Agent reports build success (with the internal image tag) or failure to the Leader.
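
A minimal sketch of the command assembly in step 4, assuming a simplified, hypothetical representation of the parsed `build.kat`:

```go
package builder

import "os/exec"

// BuildDefinition is a simplified stand-in for the parsed build.kat spec.
type BuildDefinition struct {
	BuildContext   string
	DockerfilePath string
	BuildArgs      map[string]string
	TargetStage    string
	Platform       string
}

// buildCommand assembles the podman build invocation described in step 4.
func buildCommand(def BuildDefinition, imageTag string) *exec.Cmd {
	args := []string{"build", "-t", imageTag, "-f", def.DockerfilePath}
	for k, v := range def.BuildArgs {
		args = append(args, "--build-arg", k+"="+v)
	}
	if def.TargetStage != "" {
		args = append(args, "--target", def.TargetStage)
	}
	if def.Platform != "" {
		args = append(args, "--platform", def.Platform)
	}
	args = append(args, def.BuildContext) // build context path comes last
	return exec.Command("podman", args...)
}
```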

#### 4.4. Scheduling

The Leader performs scheduling in its reconciliation loop for new or rescheduled Workload instances.

1. **Filter Nodes - Resource Requests:**
    * Identify `spec.container.resources.requests` (CPU, memory).
    * Filter out Nodes whose `status.allocatable` resources are less than requested.
2. **Filter Nodes - nodeSelector:**
    * If `spec.nodeSelector` is present, filter out Nodes whose labels do not match *all* key-value pairs in the selector.
3. **Filter Nodes - Taints and Tolerations:**
    * For each remaining Node, check its `taints`.
    * A Workload instance is repelled if the Node has a taint with `effect=NoSchedule` that is not tolerated by `spec.tolerations`.
    * (Nodes with untolerated `PreferNoSchedule` taints are kept but deprioritized in scoring.)
4. **Filter Nodes - GPU Requirements:**
    * If `spec.container.resources.gpu` is specified:
        * Filter out Nodes that do not report matching GPU capabilities (e.g., `gpu.nvidia.present=true` based on the `driver` request).
        * Filter out Nodes whose reported available VRAM (a node-level attribute, potentially tracked dynamically by the agent) is less than `minVRAM_MB`.
5. **Score Nodes ("Most Empty" Proportional):**
    * For each remaining candidate Node:
        * `cpu_used_percent = (node_total_cpu_requested_by_workloads / node_allocatable_cpu) * 100`
        * `mem_used_percent = (node_total_mem_requested_by_workloads / node_allocatable_mem) * 100`
        * `score = (100 - cpu_used_percent) + (100 - mem_used_percent)` (higher is better and rewards balanced free resources). An implementation MAY instead use `score = 100 - max(cpu_used_percent, mem_used_percent)`. A sketch of this step appears after this list.
6. **Select Node:**
    * Prioritize nodes *not* having untolerated `PreferNoSchedule` taints.
    * Among those (or all, if all preferred nodes are full), pick the Node with the highest score.
    * If multiple nodes tie for the highest score, pick one randomly.
7. **Replica Spreading (Services/DaemonServices):** For multi-instance workloads, when choosing among equally scored nodes, the scheduler MAY prefer nodes currently running fewer instances of the *same* workload to achieve basic anti-affinity. For a `DaemonService`, it schedules one instance on *every* eligible node identified after filtering.
8. If no suitable node is found, the instance remains `Pending`.
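
A minimal sketch of the scoring formula from step 5, assuming requested and allocatable values have already been summed and normalized to integer units (milli-CPU and bytes); the type and function names are illustrative, not part of the specification.

```go
package scheduler

// nodeUsage aggregates the values needed for the "most empty" score.
// Allocatable values are assumed to be non-zero.
type nodeUsage struct {
	AllocatableCPU, RequestedCPU int64 // milli-CPU
	AllocatableMem, RequestedMem int64 // bytes
}

// score returns a higher value for emptier nodes, mirroring
// (100 - cpu_used_percent) + (100 - mem_used_percent) from step 5.
func score(n nodeUsage) float64 {
	cpuUsed := float64(n.RequestedCPU) / float64(n.AllocatableCPU) * 100
	memUsed := float64(n.RequestedMem) / float64(n.AllocatableMem) * 100
	return (100 - cpuUsed) + (100 - memUsed)
}
```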

#### 4.5. Workload Updates and Rollouts

Triggered by a `PUT` to the Workload API endpoint with changed Quadlet specs. The Leader compares the new `desiredSpecHash` with `status.observedSpecHash`.

* **`Simultaneous` Strategy (`spec.updateStrategy.type`):**
    1. The Leader instructs Agents to stop and remove all old-version instances.
    2. Once confirmed (or on timeout), the Leader schedules all new-version instances as per Section 4.4. This causes downtime.
* **`Rolling` Strategy (`spec.updateStrategy.type`):**
    1. `max_surge_val = calculate_absolute(spec.updateStrategy.rolling.maxSurge, new_replicas_count)` (see the sketch after this list).
    2. Total allowed instances = `new_replicas_count + max_surge_val`.
    3. The Leader updates instances incrementally:
        a. It scales up by launching new-version instances until `total_running_instances` reaches `new_replicas_count` or `old_replicas_count + max_surge_val`, whichever is smaller while still making progress. New instances use the updated Quadlet spec.
        b. Once a new-version instance becomes `Healthy` (passes the `VirtualLoadBalancer.kat` health checks, or simply starts if no checks are defined), an old-version instance is selected and terminated.
        c. The process continues until all instances are new-version and `new_replicas_count` of them are healthy.
        d. If `new_replicas_count < old_replicas_count`, surplus old instances are terminated first, respecting a limit that preserves availability (not explicitly defined in V1; `max_surge_val` effectively acts as `maxUnavailable`).
* **Rollbacks (Manual):**
    1. The Leader stores the Quadlet files of the previous successfully deployed version in etcd (e.g., at `/kat/workloads/archive/{namespace}/{workloadName}/{generation-1}/`).
    2. User command: `katcall rollback workload {namespace}/{name}`.
    3. The Leader retrieves the archived Quadlets, treats them as a new desired state, and applies the workload's configured `updateStrategy` to revert.
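
A sketch of how `calculate_absolute` from the Rolling strategy might resolve `maxSurge`, assuming the value arrives as a string that is either an integer (e.g., `"1"`) or a percentage (e.g., `"25%"`); the exact accepted forms are an implementation detail not fixed by this specification.

```go
package rollout

import (
	"math"
	"strconv"
	"strings"
)

// calculateAbsolute resolves a maxSurge value to an absolute instance count.
func calculateAbsolute(maxSurge string, replicas int) (int, error) {
	if strings.HasSuffix(maxSurge, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(maxSurge, "%"))
		if err != nil {
			return 0, err
		}
		// Round up so any non-zero percentage allows at least one surge instance.
		return int(math.Ceil(float64(replicas) * float64(pct) / 100)), nil
	}
	return strconv.Atoi(maxSurge)
}
```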

#### 4.6. Container Lifecycle Management

Managed by the Agent based on Leader commands and local policies.

1. **Image Pull/Availability:** Before creating a container, the Agent ensures the target image (from a Git build, cache, or direct reference) is locally available, pulling it if necessary.
2. **Creation & Start:** The Agent uses the `ContainerRuntime` to create and start the container with parameters derived from `workload.kat -> spec.container` and `VirtualLoadBalancer.kat -> spec.ports` (translated to runtime port mappings). The node-allocated IP is assigned.
3. **Health Checks (for Services with `VirtualLoadBalancer.kat`):** The Agent periodically runs `spec.healthCheck.exec.command` inside the container after `initialDelaySeconds`. Status (Healthy/Unhealthy), determined by `successThreshold`/`failureThreshold`, is reported in heartbeats.
4. **Restart Policy (`workload.kat -> spec.restartPolicy`):**
    * `Never`: No automatic restart by the Agent. The Leader reschedules for Services/DaemonServices.
    * `Always`: The Agent always restarts the container on exit, with exponential backoff.
    * `MaxCount`: The Agent restarts the container on non-zero exit, up to `maxRestarts` times. If `resetSeconds` elapses since the *first* restart in a series without hitting `maxRestarts`, the restart count for that series resets. Hitting `maxRestarts` within the `resetSeconds` window causes the instance to be marked `Failed` by the Agent; the Leader acts accordingly. A sketch of this accounting appears after this list.
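
A minimal sketch of the `MaxCount` accounting described above; the tracker type and method names are illustrative, not part of the specification.

```go
package agent

import "time"

// restartTracker implements the MaxCount window described in Section 4.6.
type restartTracker struct {
	maxRestarts  int
	resetWindow  time.Duration // resetSeconds
	firstRestart time.Time     // start of the current restart series
	count        int
}

// shouldRestart is consulted when a container exits non-zero. It returns
// false once the cap has been hit inside the window, i.e., the instance
// should be marked Failed.
func (t *restartTracker) shouldRestart(now time.Time) bool {
	// Start a new series, or reset it if the window elapsed without hitting the cap.
	if t.count == 0 || now.Sub(t.firstRestart) > t.resetWindow {
		t.firstRestart = now
		t.count = 0
	}
	if t.count >= t.maxRestarts {
		return false
	}
	t.count++
	return true
}
```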

#### 4.7. Volume Lifecycle Management

Defined in `workload.kat -> spec.volumes` and mounted via `spec.container.volumeMounts`.

* **Agent Responsibility:** Before container start, the Agent ensures specified volumes are available:
    * `SimpleClusterStorage`: Creates the directory `{agent.volumeBasePath}/{namespace}/{workloadName}/{volumeName}` if it doesn't exist. Permissions should allow container user access.
    * `HostMount`: Validates that `hostPath` exists. If `ensureType` is `DirectoryOrCreate` or `FileOrCreate`, attempts creation. Mounts into the container.
* **Persistence:** Data in `SimpleClusterStorage` on a node persists across container restarts on that *same node*. If the underlying `agent.volumeBasePath` is on network storage (user-managed), it is cluster-persistent. `HostMount` data persists with the host path.

#### 4.8. Job Execution Lifecycle

Defined by `workload.kat -> spec.type: Job` and `job.kat`.

1. The Leader schedules Job instances based on `schedule`, `completions`, and `parallelism`.
2. The Agent runs the container. On exit:
    * Exit code 0: the instance is marked `Succeeded`.
    * Non-zero: the instance is marked `Failed`. The Agent applies the `restartPolicy` up to `job.kat -> spec.backoffLimit` for the *Job instance* (distinct from container restarts).
3. The Leader tracks `completions` and `activeDeadlineSeconds`.

#### 4.9. Detached Node Operation and Rejoin

Revised mechanism for dynamic nodes (e.g., laptops):

1. **Configuration:** Agents have `--parent-cluster-name` and `--node-type` (e.g., `laptop`, `stable`).
2. **Detached Mode:** If the Agent cannot reach the parent Leader after `nodeLossTimeoutSeconds`, it sets an internal `detached=true` flag.
3. **Local Leadership:** The Agent becomes its own single-node Leader (trivial election).
4. **Local Operations:**
    * It continues running pre-detachment workloads.
    * New workloads submitted to its local API get an automatic `nodeSelector` constraint: `kat.dws.rip/nodeName: <current_node_name>`.
5. **Rejoin Attempt:** Periodically multicasts `(REJOIN_REQUEST, <parent_cluster_name>, ...)` on the local LAN.
6. **Parent Response & Rejoin:** The parent Leader responds. The detached Agent clears its flag, submits its *locally created* (nodeSelector-constrained) workloads to the parent Leader API, then performs a standard Agent join.
7. **Parent Reconciliation:** The parent Leader accepts the new workloads, respecting their nodeSelector.

---

### 5. State Management

#### 5.1. State Store Interface (Go)

KAT components interact with etcd via a Go interface for abstraction.

```go
package store

import (
	"context"
)

type KV struct { Key string; Value []byte; Version int64 /* etcd ModRevision */ }
type WatchEvent struct { Type EventType; KV KV; PrevKV *KV }
type EventType int

const ( EventTypePut EventType = iota; EventTypeDelete )

type StateStore interface {
	Put(ctx context.Context, key string, value []byte) error
	Get(ctx context.Context, key string) (*KV, error)
	Delete(ctx context.Context, key string) error
	List(ctx context.Context, prefix string) ([]KV, error)
	Watch(ctx context.Context, keyOrPrefix string, startRevision int64) (<-chan WatchEvent, error) // Added startRevision
	Close() error
	Campaign(ctx context.Context, leaderID string, leaseTTLSeconds int64) (leadershipCtx context.Context, err error) // Returns context cancelled on leadership loss
	Resign(ctx context.Context) error // Uses context from Campaign to manage lease
	GetLeader(ctx context.Context) (leaderID string, err error)
	DoTransaction(ctx context.Context, checks []Compare, onSuccess []Op, onFailure []Op) (committed bool, err error) // For CAS operations
}

type Compare struct { Key string; ExpectedVersion int64 /* 0 for key not exists */ }
type Op struct { Type OpType; Key string; Value []byte /* for Put */ }
type OpType int

const ( OpPut OpType = iota; OpDelete; OpGet /* not typically used in Txn success/fail ops */ )
```

The `Campaign` method returns a context that is cancelled when leadership is lost or `Resign` is called, simplifying leadership management. `DoTransaction` enables conditional writes for atomicity.
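
A hedged usage sketch of this interface: campaign for leadership, then publish the leader endpoint with a compare-and-swap transaction. The import path and surrounding code are hypothetical; only the `StateStore` interface itself is defined by this section.

```go
package leader

import (
	"context"
	"log"

	"kat/internal/store" // hypothetical import path for the store package above
)

func runLeaderLoop(ctx context.Context, s store.StateStore, nodeID, endpoint string) {
	// Block until this node wins the election; the returned context is
	// cancelled if leadership is later lost or Resign is called.
	leadershipCtx, err := s.Campaign(ctx, nodeID, 15)
	if err != nil {
		log.Fatalf("campaign failed: %v", err)
	}

	// Publish the leader endpoint only if the key does not exist yet (version 0).
	committed, err := s.DoTransaction(leadershipCtx,
		[]store.Compare{{Key: "/kat/config/leader_endpoint", ExpectedVersion: 0}},
		[]store.Op{{Type: store.OpPut, Key: "/kat/config/leader_endpoint", Value: []byte(endpoint)}},
		nil)
	if err != nil || !committed {
		log.Printf("leader endpoint already set or transaction failed: %v", err)
	}

	<-leadershipCtx.Done() // step down when leadership is lost
}
```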

#### 5.2. etcd Implementation Details

* **Client:** Uses `go.etcd.io/etcd/client/v3`.
* **Embedded Server:** Uses `go.etcd.io/etcd/server/v3/embed` within `kat-agent` on quorum nodes. Configuration (data-dir, peer/client URLs) from `cluster.kat` and agent flags.
* **Key Schema Examples:**
    * `/kat/schema_version`: `v1.0`
    * `/kat/config/cluster_uid`: UUID generated at init.
    * `/kat/config/leader_endpoint`: Current Leader's API endpoint.
    * `/kat/nodes/registration/{nodeName}`: Node's static registration info (UID, WireGuard pubkey, advertise addr).
    * `/kat/nodes/status/{nodeName}`: Node's dynamic status (heartbeat timestamp, resources, local instances). Leased by the agent.
    * `/kat/workloads/desired/{namespace}/{workloadName}/manifest/{fileName}`: Content of each Quadlet file.
    * `/kat/workloads/desired/{namespace}/{workloadName}/meta`: Workload metadata (generation, overall spec hash).
    * `/kat/workloads/status/{namespace}/{workloadName}`: Leader-maintained status of the workload.
    * `/kat/network/config/overlay_cidr`: ClusterCIDR.
    * `/kat/network/nodes/{nodeName}/subnet`: Assigned overlay subnet.
    * `/kat/network/allocations/{instanceID}/ip`: Assigned container overlay IP. Leased by the agent managing the instance.
    * `/kat/dns/{namespace}/{workloadName}/{recordType}/{value}`: Flattened DNS records.
    * `/kat/leader_election/` (etcd prefix): Used by `clientv3/concurrency/election`.

#### 5.3. Leader Election

Utilizes `go.etcd.io/etcd/client/v3/concurrency#NewElection` and `Campaign`. All agents configured as potential quorum members participate. The elected Leader renews its lease continuously. If the lease expires (e.g., the Leader crashes), other candidates campaign.

#### 5.4. State Backup (Leader Responsibility)

The active Leader periodically performs an etcd snapshot.

1. **Interval:** `backupIntervalMinutes` from `cluster.kat`.
2. **Action:** Executes `etcdctl snapshot save {backupPath}/{timestamped_filename.db}` against its *own embedded etcd member*.
3. **Path:** `backupPath` from `cluster.kat`.
4. **Rotation:** The Leader maintains the last N snapshots locally (e.g., N=5, configurable), deleting older ones.
5. **User Responsibility:** These are *local* snapshots on the Leader node. Users MUST implement external mechanisms to copy these snapshots to secure, off-node storage.

#### 5.5. State Restore Procedure

For disaster recovery (total cluster loss or etcd quorum corruption):

1. **STOP** all `kat-agent` processes on all nodes.
2. Identify the desired etcd snapshot file (`.db`).
3. On **one** designated node (intended to be the first new Leader):
    * Clear its old etcd data directory (`--data-dir` for etcd).
    * Restore the snapshot: `etcdctl snapshot restore <snapshot.db> --name <member_name> --initial-cluster <member_name>=http://<node_ip>:<etcdPeerPort> --initial-cluster-token <new_token> --data-dir <new_data_dir_path>`
    * Modify the `kat-agent` startup for this node to use the `new_data_dir_path` and configure it as if initializing a new cluster but pointing to this restored data (specific flags for etcd embed).
4. Start the `kat-agent` on this restored node. It will become Leader of a new single-member cluster with the restored state.
5. On all other KAT nodes:
    * Clear their old etcd data directories.
    * Clear any KAT agent local state (e.g., WireGuard configs, runtime state).
    * Join them to the new Leader using `kat-agent join` as if joining a fresh cluster.
6. The Leader's reconciliation loop will then redeploy workloads according to the restored desired state. **In-flight data or states not captured in the last etcd snapshot will be lost.**

---

### 6. Container Runtime Interface

#### 6.1. Runtime Interface Definition (Go)

Defines the abstraction KAT uses to manage containers.

```go
package runtime

import (
	"context"
	"io"
	"time"
)

type ImageSummary struct { ID string; Tags []string; Size int64 }
type ContainerState string

const (
	ContainerStateRunning  ContainerState = "running"
	ContainerStateExited   ContainerState = "exited"
	ContainerStateCreated  ContainerState = "created"
	ContainerStatePaused   ContainerState = "paused"
	ContainerStateRemoving ContainerState = "removing"
	ContainerStateUnknown  ContainerState = "unknown"
)

type HealthState string

const (
	HealthStateHealthy       HealthState = "healthy"
	HealthStateUnhealthy     HealthState = "unhealthy"
	HealthStatePending       HealthState = "pending_check"  // Health check defined but not yet run
	HealthStateNotApplicable HealthState = "not_applicable" // No health check defined
)

type ContainerStatus struct {
	ID         string
	ImageID    string
	ImageName  string // Image used to create container
	State      ContainerState
	ExitCode   int
	StartedAt  time.Time
	FinishedAt time.Time
	Health     HealthState
	Restarts   int // Number of times runtime restarted this specific container instance
	OverlayIP  string
}

type BuildOptions struct { // From Section 3.5, expanded
	ContextDir     string
	DockerfilePath string
	ImageTag       string // Target tag for the build
	BuildArgs      map[string]string
	TargetStage    string
	Platform       string
	CacheTo        []string // e.g., ["type=registry,ref=myreg.com/cache/img:latest"]
	CacheFrom      []string // e.g., ["type=registry,ref=myreg.com/cache/img:latest"]
	NoCache        bool
	Pull           bool // Whether to attempt to pull base images
}

type PortMapping struct { HostPort int; ContainerPort int; Protocol string /* TCP, UDP */; HostIP string /* 0.0.0.0 default */ }

type VolumeMount struct {
	Name        string // User-defined name of the volume from workload.spec.volumes
	Type        string // "hostMount", "simpleClusterStorage" (translated to "bind" for Podman)
	Source      string // Resolved host path for the volume
	Destination string // Mount path inside container
	ReadOnly    bool
	// SELinuxLabel, Propagation options if needed later
}

type GPUOptions struct { DeviceIDs []string /* e.g., ["0", "1"] or ["all"] */; Capabilities [][]string /* e.g., [["gpu"], ["compute","utility"]] */ }

type ResourceSpec struct {
	CPUShares        int64 // Relative weight
	CPUQuota         int64 // Microseconds per period (e.g., 50000 for 0.5 CPU with 100000 period)
	CPUPeriod        int64 // Microseconds (e.g., 100000)
	MemoryLimitBytes int64
	GPUSpec          *GPUOptions // If GPU requested
}

type ContainerCreateOptions struct {
	WorkloadName  string
	Namespace     string
	InstanceID    string // KAT-generated unique ID for this replica/run
	ImageName     string // Image to run (after pull/build)
	Hostname      string
	Command       []string
	Args          []string
	Env           map[string]string
	Labels        map[string]string // Include KAT ownership labels
	RestartPolicy string            // "no", "on-failure", "always" (Podman specific values)
	Resources     ResourceSpec
	Ports         []PortMapping
	Volumes       []VolumeMount
	NetworkName   string // Name of Podman network to join (e.g., for overlay)
	IPAddress     string // Static IP within Podman network, if assigned by KAT IPAM
	User          string // User to run as inside container (e.g., "1000:1000")
	CapAdd        []string
	CapDrop       []string
	SecurityOpt   []string
	HealthCheck   *ContainerHealthCheck // Podman native healthcheck config
	Systemd       bool                  // Run container with systemd as init
}

type ContainerHealthCheck struct {
	Test        []string // e.g., ["CMD", "curl", "-f", "http://localhost/health"]
	Interval    time.Duration
	Timeout     time.Duration
	Retries     int
	StartPeriod time.Duration
}

type ContainerRuntime interface {
	BuildImage(ctx context.Context, opts BuildOptions) (imageID string, err error)
	PullImage(ctx context.Context, imageName string, platform string) (imageID string, err error)
	PushImage(ctx context.Context, imageName string, destinationRegistry string) error
	CreateContainer(ctx context.Context, opts ContainerCreateOptions) (containerID string, err error)
	StartContainer(ctx context.Context, containerID string) error
	StopContainer(ctx context.Context, containerID string, timeoutSeconds uint) error
	RemoveContainer(ctx context.Context, containerID string, force bool, removeVolumes bool) error
	GetContainerStatus(ctx context.Context, containerOrName string) (*ContainerStatus, error)
	StreamContainerLogs(ctx context.Context, containerID string, follow bool, since time.Time, stdout io.Writer, stderr io.Writer) error
	PruneAllStoppedContainers(ctx context.Context) (reclaimedSpace int64, err error)
	PruneAllUnusedImages(ctx context.Context) (reclaimedSpace int64, err error)
	EnsureNetworkExists(ctx context.Context, networkName string, driver string, subnet string, gateway string, options map[string]string) error
	RemoveNetwork(ctx context.Context, networkName string) error
	ListManagedContainers(ctx context.Context) ([]ContainerStatus, error) // Lists containers labelled by KAT
}
```

#### 6.2. Default Implementation: Podman

The default and only supported `ContainerRuntime` for KAT v1.0 is Podman. The implementation will primarily shell out to the `podman` CLI, using appropriate JSON output flags for parsing. It assumes `podman` is installed and correctly configured for rootless operation on Agent nodes. Key commands used: `podman build`, `podman pull`, `podman push`, `podman create`, `podman start`, `podman stop`, `podman rm`, `podman inspect`, `podman logs`, `podman system prune`, and `podman network create/rm/inspect`.
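
As a non-normative illustration, a Podman-backed runtime can wrap the CLI roughly as follows. This is a minimal sketch: the `PodmanRuntime` and `podmanInspectInfo` names are assumptions, the types from Section 6.1 are assumed to live in the same package, and real `podman inspect` output contains many more fields than shown.

```go
package runtime

import (
	"context"
	"encoding/json"
	"fmt"
	"os/exec"
)

// PodmanRuntime implements ContainerRuntime by shelling out to the podman CLI.
type PodmanRuntime struct{}

// podmanInspectInfo captures only the fields of `podman inspect` output used here.
type podmanInspectInfo struct {
	Id    string `json:"Id"`
	State struct {
		Status   string `json:"Status"`
		ExitCode int    `json:"ExitCode"`
	} `json:"State"`
}

// GetContainerStatus runs `podman inspect` and parses its JSON output.
func (r *PodmanRuntime) GetContainerStatus(ctx context.Context, containerOrName string) (*ContainerStatus, error) {
	out, err := exec.CommandContext(ctx, "podman", "inspect", containerOrName).Output()
	if err != nil {
		return nil, fmt.Errorf("podman inspect %s: %w", containerOrName, err)
	}
	var info []podmanInspectInfo // podman inspect emits a JSON array
	if err := json.Unmarshal(out, &info); err != nil {
		return nil, fmt.Errorf("parsing podman inspect output: %w", err)
	}
	if len(info) == 0 {
		return nil, fmt.Errorf("container %s not found", containerOrName)
	}
	return &ContainerStatus{
		ID:       info[0].Id,
		State:    ContainerState(info[0].State.Status),
		ExitCode: info[0].State.ExitCode,
	}, nil
}
```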

#### 6.3. Rootless Execution Strategy

KAT Agents MUST orchestrate container workloads rootlessly. The PREFERRED strategy is:

1. **Dedicated User per Workload/Namespace:** The `kat-agent` (running as root, or with specific sudo rights for `useradd`, `loginctl`, and `systemctl --user`) creates a dedicated, unprivileged system user account (e.g., `kat_wl_mywebapp`) when a workload is first scheduled to the node, or uses a pre-existing user from a pool.
2. **Enable Linger:** `loginctl enable-linger <username>`.
3. **Generate Systemd Unit:** The Agent translates the KAT workload definition into container create options, creates the container (named with the instance ID), and runs `podman generate systemd --new --name --files --time 10` against it to produce a `.service` unit file. This unit includes environment variables, volume mounts, port mappings (if host-mapped), resource limits, etc. The `Restart=` directive in the systemd unit is set according to `workload.kat -> spec.restartPolicy`.
4. **Place and Manage Unit:** The unit file is placed in `/etc/systemd/user/` (if the agent is root, enabling it for the target user) or `~{username}/.config/systemd/user/`. The Agent then uses `systemctl --user --machine={username}@.host daemon-reload` and `systemctl --user --machine={username}@.host enable --now {service_name}.service` to start and manage it.
5. **Status and Logs:** The Agent queries `systemctl --user --machine... status` and `journalctl --user-unit ...` for status and logs.

This leverages systemd's robust process supervision and cgroup management for rootless containers. A sketch of the resulting command sequence is shown below.
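
For concreteness, the sequence the Agent issues for a hypothetical instance `mywebapp-0` owned by user `kat_wl_mywebapp` might resemble the sketch below. The user name, generated unit name, and the plumbing for running the podman steps inside the target user's session are illustrative assumptions; a real agent derives these from the workload definition and must handle idempotency, failures, and cleanup.

```go
package agent

import (
	"context"
	"fmt"
	"os/exec"
)

// launchRootless sketches the Section 6.3 sequence for one workload instance.
// NOTE: in a real agent the podman steps must execute inside the target
// user's session (not as root), and the generated unit file must be moved
// into the user's systemd directory; both are elided here for brevity.
func launchRootless(ctx context.Context, username, instanceID, image string) error {
	machine := username + "@.host"
	steps := [][]string{
		{"useradd", "--system", "--create-home", username}, // 1. dedicated unprivileged user
		{"loginctl", "enable-linger", username},            // 2. keep the user manager running
		{"podman", "create", "--name", instanceID, image},  // 3a. create the container
		{"podman", "generate", "systemd", "--new", "--name", "--files", "--time", "10", instanceID}, // 3b. emit unit file
		{"systemctl", "--user", "--machine", machine, "daemon-reload"},                              // 4a. reload user manager
		{"systemctl", "--user", "--machine", machine, "enable", "--now", "container-" + instanceID + ".service"}, // 4b. start unit
	}
	for _, s := range steps {
		if out, err := exec.CommandContext(ctx, s[0], s[1:]...).CombinedOutput(); err != nil {
			return fmt.Errorf("%v failed: %w: %s", s, err, out)
		}
	}
	return nil
}
```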

---

### 7. Networking

#### 7.1. Integrated Overlay Network

KAT v1.0 implements a mandatory, simple, encrypted Layer 3 overlay network connecting all Nodes using WireGuard.

1. **Configuration:** Defined by `cluster.kat -> spec.clusterCIDR`.
2. **Key Management:**
    * Each Agent generates a WireGuard key pair locally upon first start/join. The private key is stored securely (e.g., `/etc/kat/wg_private.key`, mode 0600). The public key is reported to the Leader during registration.
    * The Leader stores all registered Node public keys and their *external* advertise IPs (used as WireGuard endpoints) in etcd under `/kat/network/nodes/{nodeName}/wg_pubkey` and `/kat/network/nodes/{nodeName}/wg_endpoint`.
3. **Peer Configuration:** Each Agent watches `/kat/network/nodes/` in etcd. When a new node joins or an existing node's WireGuard info changes, the Agent updates its local WireGuard configuration (e.g., for interface `kat0`):
    * Adds/updates a `[Peer]` section for every *other* node; an illustrative stanza is shown after this list.
    * `PublicKey = {peer_public_key}`
    * `Endpoint = {peer_advertise_ip}:{configured_wg_port}`
    * `AllowedIPs = {peer_assigned_overlay_subnet_cidr}` (see IPAM below).
    * `PersistentKeepalive` MAY be used if nodes are behind NAT.
4. **Interface Setup:** The Agent ensures the `kat0` interface is up with its assigned overlay IP. Standard OS routing rules handle traffic for the `clusterCIDR` via `kat0`.
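
For illustration, the peer entry described in item 3 might render into the `kat0` configuration as follows. The key, endpoint address, port, and subnet are placeholders, not values prescribed by this specification.

```ini
# Illustrative peer entry for one other node (placeholder values)
[Peer]
PublicKey = q1XGZ3mJ7cD9vY5tR2wP8kL0aB4nE6sF1hU3oI9xT2Y=
Endpoint = 203.0.113.12:51820
AllowedIPs = 10.100.2.0/23
PersistentKeepalive = 25
```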

#### 7.2. IP Address Management (IPAM)

The Leader manages IP allocation for the overlay network.

1. **Node Subnets:** From `clusterCIDR` and `nodeSubnetBits` (from `cluster.kat`), the Leader carves out a distinct subnet for each Node that joins (e.g., if `clusterCIDR` is `10.100.0.0/16` and `nodeSubnetBits` is 7, each node gets a `/23`, such as `10.100.0.0/23`, `10.100.2.0/23`, and so on). This Node-to-Subnet mapping is stored in etcd. A sketch of the carving arithmetic follows this list.
2. **Container IPs:** When the Leader schedules a Workload instance to a Node, it allocates the next available IP address from that Node's assigned subnet. This `instanceID -> containerIP` mapping is stored in etcd, possibly with a lease. The Agent is informed of this IP to pass to `podman create --ip ...`.
3. **Maximum Instances:** The size of the node subnet implicitly limits the number of container instances per node.
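
A minimal sketch of that carving using only the Go standard library (the etcd bookkeeping and the per-node allocation of container IPs are omitted; the function name is illustrative, and IPv4 is assumed):

```go
package ipam

import (
	"fmt"
	"net/netip"
)

// nodeSubnet returns the index-th node subnet carved out of clusterCIDR, where
// nodeSubnetBits extra prefix bits enumerate node subnets (e.g. 10.100.0.0/16
// with 7 extra bits yields /23 subnets). clusterCIDR must be an IPv4 network address.
func nodeSubnet(clusterCIDR string, nodeSubnetBits, index int) (netip.Prefix, error) {
	base, err := netip.ParsePrefix(clusterCIDR)
	if err != nil {
		return netip.Prefix{}, err
	}
	newLen := base.Bits() + nodeSubnetBits
	if !base.Addr().Is4() || newLen > 32 || index >= 1<<nodeSubnetBits {
		return netip.Prefix{}, fmt.Errorf("index %d out of range for %d subnet bits", index, nodeSubnetBits)
	}
	// Offset the base address by index * subnet size.
	a := base.Addr().As4()
	ip := uint32(a[0])<<24 | uint32(a[1])<<16 | uint32(a[2])<<8 | uint32(a[3])
	ip += uint32(index) << (32 - newLen)
	addr := netip.AddrFrom4([4]byte{byte(ip >> 24), byte(ip >> 16), byte(ip >> 8), byte(ip)})
	return netip.PrefixFrom(addr, newLen), nil
}

// Example: nodeSubnet("10.100.0.0/16", 7, 1) yields 10.100.2.0/23.
```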

#### 7.3. Distributed Agent DNS and Service Discovery

Each KAT Agent runs an embedded DNS resolver, synchronized via etcd, providing service discovery.

1. **DNS Server Implementation:** Agents use `github.com/miekg/dns` to run a DNS server goroutine, listening on their `kat0` overlay IP (port 53); a minimal sketch appears after this list.
2. **Record Source:**
    * When a `Workload` instance (especially a `Service` or `DaemonService`) with an assigned overlay IP becomes healthy (or starts, if no health check is defined), the Leader writes DNS A records to etcd:
        * `A <instanceID>.<workloadName>.<namespace>.<clusterDomain>` -> `<containerOverlayIP>`
    * For Services with `VirtualLoadBalancer.kat -> spec.ports`:
        * `A <workloadName>.<namespace>.<clusterDomain>` -> `<containerOverlayIP>` (multiple A records are created, one per healthy instance).
    * The etcd key structure might be `/kat/dns/{clusterDomain}/{namespace}/{workloadName}/{instanceID_or_service_A}`.
3. **Agent DNS Sync:** Each Agent's DNS server watches the `/kat/dns/` prefix in etcd. On changes, it updates its in-memory DNS zone data.
4. **Container Configuration:** Agents configure the `/etc/resolv.conf` of all managed containers to use the Agent's own `kat0` overlay IP as the *sole* nameserver.
5. **Query Handling:**
    * The local Agent DNS resolver first attempts to resolve queries relative to the source container's namespace (e.g., `app` from `ns-foo` resolves as `app.ns-foo.kat.cluster.local`).
    * If not found, it tries the fully qualified name as-is.
    * It implements basic negative caching (NXDOMAIN with a short TTL) to reduce load.
    * It does NOT forward KAT domain names to upstream resolvers. For external names, the agent DNS does no upstream forwarding in V1 (for simplicity), so containers needing external resolution must have a secondary upstream resolver configured.
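
As referenced in item 1, a minimal sketch of the agent-side resolver using `github.com/miekg/dns` (the etcd watch wiring, namespace-relative search, and negative-cache TTL handling are omitted; names are illustrative):

```go
package dnsserver

import (
	"net"
	"sync"

	"github.com/miekg/dns"
)

// zone holds A records synced from the /kat/dns/ prefix in etcd.
var (
	mu   sync.RWMutex
	zone = map[string][]net.IP{} // FQDN (with trailing dot) -> healthy instance IPs
)

func handle(w dns.ResponseWriter, r *dns.Msg) {
	m := new(dns.Msg)
	m.SetReply(r)
	for _, q := range r.Question {
		if q.Qtype != dns.TypeA {
			continue
		}
		mu.RLock()
		ips := zone[q.Name]
		mu.RUnlock()
		for _, ip := range ips {
			m.Answer = append(m.Answer, &dns.A{
				Hdr: dns.RR_Header{Name: q.Name, Rrtype: dns.TypeA, Class: dns.ClassINET, Ttl: 30},
				A:   ip,
			})
		}
	}
	if len(m.Answer) == 0 {
		m.Rcode = dns.RcodeNameError // NXDOMAIN; candidate for short-TTL negative caching
	}
	_ = w.WriteMsg(m)
}

// Serve listens on the agent's kat0 overlay IP, port 53 (UDP only in this sketch).
func Serve(overlayIP string) error {
	dns.HandleFunc(".", handle)
	srv := &dns.Server{Addr: overlayIP + ":53", Net: "udp"}
	return srv.ListenAndServe()
}
```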

#### 7.4. Ingress (Opinionated Recipe via Traefik)

KAT provides a standardized way to deploy Traefik for ingress.

1. **Ingress Node Designation:** Admins label Nodes intended for ingress with `kat.dws.rip/role=ingress`.
2. **`kat-traefik-ingress` Quadlet:** DWS LLC provides standard Quadlet files:
    * `workload.kat`: Deploys Traefik as a `DaemonService` with a `nodeSelector` for `kat.dws.rip/role=ingress`. Includes the `kat-ingress-updater` container.
    * `VirtualLoadBalancer.kat`: Exposes Traefik's ports (80, 443) via `HostPort` on the ingress Nodes. Specifies health checks for Traefik itself.
    * `volume.kat`: Mounts host paths for `/etc/traefik/traefik.yaml` (static config), `/data/traefik/dynamic_conf/` (for `kat-ingress-updater`), and `/data/traefik/acme/` (for Let's Encrypt certs).
3. **`kat-ingress-updater` Container:**
    * Runs alongside Traefik. Watches the KAT API for `VirtualLoadBalancer` Quadlets with `spec.ingress` stanzas.
    * Generates Traefik dynamic configuration files (routers, services) mapping external host/path to internal KAT service FQDNs (e.g., `<service>.<namespace>.kat.cluster.local:<port>`); an illustrative example follows this list.
    * Configures the Traefik `certResolver` for Let's Encrypt for services requesting TLS.
    * Traefik watches its dynamic configuration directory.
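
As an illustration of item 3, the updater might emit a dynamic configuration file along these lines for a service `web` in namespace `prod` exposing port 8080 with TLS. The host name, entry point name, and resolver name are placeholders, and the exact file layout produced by `kat-ingress-updater` is not normative.

```yaml
# Illustrative kat-ingress-updater output (Traefik dynamic file provider format)
http:
  routers:
    prod-web:
      rule: "Host(`web.example.com`)"
      entryPoints: ["websecure"]
      service: prod-web
      tls:
        certResolver: letsencrypt
  services:
    prod-web:
      loadBalancer:
        servers:
          - url: "http://web.prod.kat.cluster.local:8080"
```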

---

### 8. API Specification (KAT v1.0 Alpha)

#### 8.1. General Principles and Authentication

* **Protocol:** HTTP/1.1 or HTTP/2. mTLS is mandatory for Agent-Leader and CLI-Leader communication.
* **Data Format:** Request and response bodies MUST be JSON.
* **Versioning:** Endpoints are prefixed with `/v1alpha1`.
* **Authentication:** Static Bearer token in the `Authorization` header for CLI/external API clients. For KAT v1, this token grants full cluster admin rights. Agent-to-Leader mTLS serves as agent authentication.
* **Error Reporting:** Standard HTTP status codes. JSON body for errors: `{"error": "code", "message": "details"}`.

#### 8.2. Resource Representation (Proto3 & JSON)

All API resources (Workloads, Namespaces, Nodes, etc., and their Quadlet file contents) are defined using Protocol Buffer v3 messages. The HTTP API transports these as JSON. Common metadata (name, namespace, uid, generation, resourceVersion, creationTimestamp) and status structures are standardized.

#### 8.3. Core API Endpoints

Following the structure discussed earlier in this specification, the core endpoint groups are:

* Namespace CRUD.
* Workload CRUD: `POST`/`PUT` accepts a `tar.gz` of the Quadlet directory; `GET` returns metadata and status. Endpoints exist for individual Quadlet file content (`.../files/{fileName}`), instance logs (`.../instances/{instanceID}/logs`), and rollback (`.../rollback`).
* Node read endpoints: `GET /nodes`, `GET /nodes/{name}`. Agent status update: `POST /nodes/{name}/status`. Admin taint update: `PUT /nodes/{name}/taints`.
* Event query endpoint: `GET /events`.
* ClusterConfiguration read endpoint: `GET /config/cluster` (shows the sanitized running configuration).

There is no separate top-level Volume API in KAT v1; volumes are defined within workloads. A sketch of a client call against these endpoints follows.
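
A minimal sketch of an external client call consistent with the conventions above: the path follows 8.3, the bearer token and error body follow 8.1. The Leader address and port are placeholders, the decoded response shape is illustrative, and TLS configuration against the cluster CA is omitted.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// apiError mirrors the error body convention from Section 8.1.
type apiError struct {
	Error   string `json:"error"`
	Message string `json:"message"`
}

func main() {
	// Placeholder leader address; real clients also configure TLS with the cluster CA.
	req, err := http.NewRequest("GET", "https://leader.kat.example:8443/v1alpha1/nodes", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("KAT_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		var e apiError
		_ = json.NewDecoder(resp.Body).Decode(&e)
		fmt.Printf("API error %d: %s (%s)\n", resp.StatusCode, e.Message, e.Error)
		return
	}
	var nodes []map[string]any // illustrative; real clients use the proto3-derived JSON types
	_ = json.NewDecoder(resp.Body).Decode(&nodes)
	fmt.Printf("cluster has %d nodes\n", len(nodes))
}
```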

---

### 9. Observability

#### 9.1. Logging

* **Container Logs:** Agents capture stdout/stderr, make logs available via the `podman logs` mechanism, and stream them via the API to `katcall logs`. Logs are rotated locally on the agent node.
* **Agent Logs:** `kat-agent` logs to the systemd journal or local files.
* **API Audit (Basic):** The Leader logs API requests (method, path, source IP, and user where distinguishable) at a configurable level.

#### 9.2. Metrics

* **Agent Metrics:** Node resource usage (CPU, memory, disk, network) and container resource usage, included in heartbeats.
* **Leader Metrics:** API request latencies/counts, scheduling attempts/successes/failures, and etcd health.
* **Exposure (V1):** Minimal exposure via a `/metrics` JSON endpoint on the Leader and Agents; not yet Prometheus-formatted.
* **Future:** Standardized Prometheus exposition format.

#### 9.3. Events

The Leader records significant cluster events (Workload create/update/delete, instance schedule/fail/health_change, Node ready/not_ready/join/leave, build success/fail, and detach/rejoin actions) in a capped, time-series-like structure in etcd.

* **API:** `GET /v1alpha1/events?[resourceType=X][&resourceName=Y][&namespace=Z]`
* **Fields per event:** Timestamp, Type, Reason, InvolvedObject (kind, name, namespace, uid), Message. An illustrative event is shown below.
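
A single event returned by this endpoint might look like the following. The JSON field casing and the concrete values are illustrative, not normative; only the field set listed above is specified.

```json
{
  "timestamp": "2025-05-20T14:03:07Z",
  "type": "Warning",
  "reason": "InstanceFailed",
  "involvedObject": {
    "kind": "Workload",
    "name": "my-web-app",
    "namespace": "prod",
    "uid": "4f6c2a9e-0000-0000-0000-000000000000"
  },
  "message": "Instance my-web-app-1 exited with code 137; rescheduling"
}
```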

---

### 10. Security Considerations

#### 10.1. API Security

* mTLS is REQUIRED for all inter-KAT component communication (Agent-Leader).
* Bearer token for external API clients (e.g., `katcall`). V1 ships a single admin token and no granular RBAC.
* The API server should implement rate limiting.

#### 10.2. Rootless Execution

Rootless execution is a core design principle. Agents execute workloads via Podman in rootless mode, leveraging systemd user sessions for enhanced isolation. This minimizes the impact of a container escape.

#### 10.3. Build Security

* Building arbitrary Git repositories on Agent nodes is a potential risk.
* Builds run as unprivileged users via rootless Podman.
* Network access during builds MAY be restricted in the future (V1: unrestricted).
* Users are responsible for trusting Git sources. `cacheImage` provides a way to use pre-vetted images.

#### 10.4. Network Security

* The WireGuard overlay provides inter-node and inter-container encryption.
* Host firewalls are the user's responsibility. `nodePort` or Ingress exposure requires careful firewall configuration.
* API and Agent communication ports should be firewalled from public access.

#### 10.5. Secrets Management

* KAT v1 has NO dedicated secret management.
* Sensitive data passed via environment variables in `workload.kat -> spec.container.env` is stored in plaintext in etcd. This is NOT secure for production secrets.
* Registry credentials for `cacheImage` push/pull are local Agent configuration.
* **Recommendation:** For sensitive data, use application-level encryption or sidecars that fetch from external secret stores (e.g., Vault), outside KAT's direct management in V1.

#### 10.6. Internal PKI

1. **Initialization (`kat-agent init`):**
    * Generates a self-signed CA key (`ca.key`) and CA certificate (`ca.crt`), stored securely on the initial Leader node (e.g., under `/var/lib/kat/pki/`).
    * Generates a Leader server key/cert signed by this CA for its API and Agent communication endpoints.
    * Generates a Leader client key/cert signed by this CA for authenticating to etcd and Agents.
2. **Node Join (`kat-agent join`):**
    * The Agent generates a key pair and a CSR.
    * It sends the CSR to the Leader over an initial (potentially untrusted, or token-protected if implemented later) channel.
    * The Leader signs the Agent's CSR using the CA key and returns the signed Agent certificate along with the CA certificate.
    * The Agent stores its key, its signed certificate, and the CA certificate for mTLS (a sketch of this CSR signing flow follows this list).
3. **mTLS Usage:** All Agent-Leader and Leader-Agent (for commands) communications use mTLS, validating peer certificates against the cluster CA.
4. **Certificate Lifespan & Rotation:** For V1, certificates may have a long lifespan (e.g., 1-10 years). Automated rotation is deferred; manual regeneration and redistribution would be needed.
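
A minimal sketch of the Leader-side CSR signing described in item 2, using only the Go standard library. Subject handling, serial-number persistence, SANs, and key-usage policy are simplified and illustrative.

```go
package pki

import (
	"crypto"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// SignAgentCSR signs a PEM-encoded CSR from a joining agent with the cluster
// CA and returns the agent certificate in PEM form.
func SignAgentCSR(csrPEM []byte, caCert *x509.Certificate, caKey crypto.Signer) ([]byte, error) {
	block, _ := pem.Decode(csrPEM)
	if block == nil || block.Type != "CERTIFICATE REQUEST" {
		return nil, fmt.Errorf("invalid CSR PEM")
	}
	csr, err := x509.ParseCertificateRequest(block.Bytes)
	if err != nil {
		return nil, err
	}
	if err := csr.CheckSignature(); err != nil {
		return nil, fmt.Errorf("CSR signature check failed: %w", err)
	}
	serial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 128))
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: serial,
		Subject:      pkix.Name{CommonName: csr.Subject.CommonName}, // e.g., the node name
		NotBefore:    time.Now().Add(-5 * time.Minute),
		NotAfter:     time.Now().AddDate(10, 0, 0), // long-lived, per item 4 above
		KeyUsage:     x509.KeyUsageDigitalSignature | x509.KeyUsageKeyEncipherment,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth, x509.ExtKeyUsageServerAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, caCert, csr.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
}
```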

---

### 13. Acknowledgements

The KAT system design, while aiming for novel simplicity, stands on the shoulders of giants. Its architecture and concepts draw inspiration and incorporate lessons learned from numerous preceding systems and bodies of work in distributed computing and container orchestration. We specifically acknowledge the influence of:

* **Kubernetes:** For establishing many of the core concepts and terminology in modern container orchestration, even as KAT diverges in implementation complexity and API specifics.
* **k3s and MicroK8s:** For demonstrating the demand for and feasibility of lightweight Kubernetes distributions, validating the need KAT aims to fill more radically.
* **Podman & Quadlets:** For pioneering robust rootless containerization and providing the direct inspiration for KAT's declarative Quadlet configuration model and systemd user service execution strategy.
* **Docker Compose:** For setting the standard in single-host multi-container application definition simplicity.
* **HashiCorp Nomad:** For demonstrating an alternative, successful approach to simplified, flexible orchestration beyond the Kubernetes paradigm, particularly its use of HCL and clear deployment primitives.
* **Google Borg:** For concepts in large-scale cluster management, scheduling, and the importance of introspection, as documented in their published research.
* **"Hints for Computer System Design" (Butler Lampson):** For principles regarding simplicity, abstraction, performance trade-offs, and fault tolerance that heavily influenced KAT's philosophy.
* **"A Note on Distributed Computing" (Waldo et al.):** For articulating the fundamental differences between local and distributed computing that KAT attempts to manage pragmatically, rather than hide entirely.
* **The Grug Brained Developer:** For the essential reminder to relentlessly fight complexity and prioritize understandability.
* **The Open Source Community:** For countless libraries, tools, discussions, and prior art that make a project like KAT feasible.

Finally, thanks to **Simba**, my cat, for providing naming inspiration.

---

### 14. Author's Address

Tanishq Dubey\
DWS LLC\
Email: dubey@dws.rip\
URI: https://www.dws.rip