# **Phase 1: State Management & Leader Election**

* **Goal**: Establish the foundational state layer using embedded etcd and implement a reliable leader election mechanism. A single `kat-agent` can initialize a cluster, become its leader, and store initial configuration.
* **RFC Sections Primarily Used**: 2.2 (Embedded etcd), 3.9 (ClusterConfiguration), 5.1 (State Store Interface), 5.2 (etcd Implementation Details), 5.3 (Leader Election).

**Tasks & Sub-Tasks:**

1. **Define `StateStore` Go Interface (`internal/store/interface.go`)**
    * **Purpose**: Create the abstraction layer for all state operations, decoupling the rest of the system from direct etcd dependencies.
    * **Details**: Transcribe the Go interface from RFC 5.1 verbatim. Include the `KV`, `WatchEvent`, `EventType`, `Compare`, `Op`, and `OpType` structs/constants.
    * **Verification**: Code compiles; the interface definition matches the RFC.
2. **Implement Embedded etcd Server Logic (`internal/store/etcd.go`)**
    * **Purpose**: Allow `kat-agent` to run its own etcd instance for single-node clusters or as part of a multi-node quorum.
    * **Details**:
        * Use `go.etcd.io/etcd/server/v3/embed`.
        * A function that starts an embedded etcd server:
            * Input: configuration parameters (data directory, peer URLs, client URLs, name), taken from `cluster.kat` or defaults.
            * Output: a running `embed.Etcd` instance, or an error.
        * Graceful shutdown logic for the embedded etcd server.
    * **Verification**: A test can start and stop an embedded etcd server; the data directory is created and used.
3. **Implement `StateStore` with etcd Backend (`internal/store/etcd.go`)**
    * **Purpose**: Provide the concrete implementation for interacting with an etcd cluster (embedded or external).
    * **Details**:
        * Create a struct that implements the `StateStore` interface and holds an `etcd/clientv3.Client`.
        * Implement `Put(ctx, key, value)`: use `client.Put()`.
        * Implement `Get(ctx, key)`: use `client.Get()`; handle key-not-found.
        * Populate `KV.Version` with `ModRevision` on returned entries.
        * Implement `Delete(ctx, key)`: use `client.Delete()`.
        * Implement `List(ctx, prefix)`: use `client.Get()` with `clientv3.WithPrefix()`.
        * Implement `Watch(ctx, keyOrPrefix, startRevision)`: use `client.Watch()`; translate etcd events to `WatchEvent`.
        * Implement `Close()`: close the `clientv3.Client`.
        * Implement `Campaign(ctx, leaderID, leaseTTLSeconds)`:
            * Use `concurrency.NewSession()` to create a lease.
            * Use `concurrency.NewElection()` and `election.Campaign()`.
            * Return a context that is cancelled when leadership is lost (e.g., by watching the campaign context or the session's done channel).
        * Implement `Resign(ctx)`: use `election.Resign()`.
        * Implement `GetLeader(ctx)`: observe the election or query the leader key.
        * Implement `DoTransaction(ctx, checks, onSuccess, onFailure)`: use `client.Txn()` with `clientv3.Compare` and `clientv3.Op`.
    * **Potential Challenges**: Correctly handling etcd transaction semantics, context propagation, and error translation. Efficiently managing watches.
    * **Verification**: Unit tests for each `StateStore` method against a real, test-scoped embedded etcd instance:
        * `Put` then `Get` retrieves the correct value and version.
        * `List` with a prefix returns the expected keys.
        * `Delete` removes the key.
        * `Watch` receives correct events for puts and deletes.
        * `DoTransaction` commits on success and rolls back on failure.
4. **Integrate Leader Election into `kat-agent` (`cmd/kat-agent/main.go`; possibly a new `internal/leader/election.go`)**
    * **Purpose**: Enable an agent instance to attempt to become the cluster leader.
    * **Details**:
        * The `kat-agent` main function initializes its `StateStore` client.
        * A dedicated goroutine calls `StateStore.Campaign()`.
        * The outcome of `Campaign` (e.g., leadership acquired, a context scoping the leadership duration) determines whether the agent activates its leader-specific logic (Phase 2+).
        * The leader ID could be `nodeName` or a UUID.
        * The lease TTL comes from `cluster.kat`.
    * **Verification**:
        * Start one `kat-agent` with etcd enabled; it should log "became leader".
        * Start a second `kat-agent` configured to connect to the first's etcd; it should log that it is observing the current leader, but not become leader itself.
        * If the first agent (the leader) is stopped, the second agent should eventually log "became leader".
5. **Implement Basic `kat-agent init` Command (`cmd/kat-agent/main.go`, `internal/config/parse.go`)**
    * **Purpose**: Initialize a new KAT cluster (single node initially).
    * **Details**:
        * Define the `init` subcommand in `kat-agent` using a CLI library (e.g., `cobra`).
        * Flag: `--config`, pointing at the `cluster.kat` file.
        * Parse `cluster.kat` (from Phase 0, now used to extract etcd peer/client URLs, the data directory, backup paths, etc.).
        * Generate a persistent cluster UID and store it in etcd (e.g., at `/kat/config/cluster_uid`).
        * Store the relevant `cluster.kat` parameters (or the whole sanitized config) in etcd (e.g., under `/kat/config/cluster_config`).
        * Start the embedded etcd server using the parsed configuration.
        * Initiate leader election.
    * **Potential Challenges**: Ensuring `cluster.kat` parsing is robust. Handling existing data directories.
    * **Milestone Verification**:
        * Running `kat-agent init --config examples/cluster.kat` on a clean system:
            * Starts the `kat-agent` process.
            * Creates the etcd data directory.
            * Logs "Successfully initialized etcd".
            * Logs "Became leader: " followed by the leader ID.
        * Using `etcdctl` (or a simple `StateStore.Get` test client):
            * Verify `/kat/config/cluster_uid` exists and holds a UUID.
            * Verify `/kat/config/cluster_config` (or similar keys) contains data from `cluster.kat` (e.g., `clusterCIDR`, `serviceCIDR`, `agentPort`, `apiPort`).
            * Verify the leader election key exists for the current leader.