# **Phase 1: State Management & Leader Election**
* **Goal**: Establish the foundational state layer using embedded etcd and implement a reliable leader election mechanism. A single `kat-agent` can initialize a cluster, become its leader, and store initial configuration.
* **RFC Sections Primarily Used**: 2.2 (Embedded etcd), 3.9 (ClusterConfiguration), 5.1 (State Store Interface), 5.2 (etcd Implementation Details), 5.3 (Leader Election).
**Tasks & Sub-Tasks:**
1. **Define `StateStore` Go Interface (`internal/store/interface.go`)**
* **Purpose**: Create the abstraction layer for all state operations, decoupling the rest of the system from direct etcd dependencies.
* **Details**: Transcribe the Go interface from RFC 5.1 verbatim. Include `KV`, `WatchEvent`, `EventType`, `Compare`, `Op`, `OpType` structs/constants. (A hypothetical sketch of the shape follows this task.)
* **Verification**: Code compiles. Interface definition matches RFC.
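
The RFC is the authoritative source and must be transcribed verbatim; the sketch below is only a plausible shape for orientation, and every type and method signature in it is an assumption until checked against RFC 5.1.

```go
package store

import "context"

// KV pairs a key with its value and version (backed by etcd ModRevision).
type KV struct {
	Key     string
	Value   []byte
	Version int64
}

// EventType distinguishes watch event kinds.
type EventType int

const (
	EventTypePut EventType = iota
	EventTypeDelete
)

// WatchEvent is delivered for each change observed by Watch.
type WatchEvent struct {
	Type EventType
	KV   KV
}

// OpType enumerates transaction operation kinds.
type OpType int

const (
	OpPut OpType = iota
	OpDelete
	OpGet
)

// Compare is a transaction precondition; ExpectedVersion == 0 is read
// here as "key must not exist" (an assumption, pending RFC 5.1).
type Compare struct {
	Key             string
	ExpectedVersion int64
}

// Op is a transaction operation.
type Op struct {
	Type  OpType
	Key   string
	Value []byte
}

// StateStore abstracts all state operations away from etcd.
type StateStore interface {
	Put(ctx context.Context, key string, value []byte) error
	Get(ctx context.Context, key string) (*KV, error)
	Delete(ctx context.Context, key string) error
	List(ctx context.Context, prefix string) ([]KV, error)
	Watch(ctx context.Context, keyOrPrefix string, startRevision int64) (<-chan WatchEvent, error)
	Close() error
	Campaign(ctx context.Context, leaderID string, leaseTTLSeconds int64) (context.Context, error)
	Resign(ctx context.Context) error
	GetLeader(ctx context.Context) (string, error)
	DoTransaction(ctx context.Context, checks []Compare, onSuccess, onFailure []Op) (bool, error)
}
```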
2. **Implement Embedded etcd Server Logic (`internal/store/etcd.go`)**
* **Purpose**: Allow `kat-agent` to run its own etcd instance for single-node clusters or as part of a multi-node quorum.
* **Details**:
* Use `go.etcd.io/etcd/server/v3/embed`.
* Function to start an embedded etcd server:
* Input: configuration parameters (data directory, peer URLs, client URLs, name). These will come from `cluster.kat` or defaults.
* Output: a running `embed.Etcd` instance or an error.
* Graceful shutdown logic for the embedded etcd server (start and stop are sketched after this task's verification).
* **Verification**: A test can start and stop an embedded etcd server. Data directory is created and used.
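
A minimal start/stop sketch following the standard `embed` package flow (`NewConfig`, `StartEtcd`, `ReadyNotify`, `Close`); the peer/client URL fields are omitted here because their names vary across etcd releases, so treat this as an outline rather than a drop-in implementation.

```go
package store

import (
	"fmt"
	"time"

	"go.etcd.io/etcd/server/v3/embed"
)

// StartEmbeddedEtcd starts an embedded etcd server with parameters taken
// from cluster.kat (name, data directory) and blocks until the server is
// ready to serve clients or fails to come up.
func StartEmbeddedEtcd(name, dataDir string) (*embed.Etcd, error) {
	cfg := embed.NewConfig() // sensible single-node defaults
	cfg.Name = name
	cfg.Dir = dataDir
	// Peer/client URLs from cluster.kat would be applied here; the exact
	// config field names differ between etcd releases, so they are omitted.

	e, err := embed.StartEtcd(cfg)
	if err != nil {
		return nil, fmt.Errorf("starting embedded etcd: %w", err)
	}
	select {
	case <-e.Server.ReadyNotify():
		return e, nil // ready to serve client requests
	case <-time.After(60 * time.Second):
		e.Close() // abort startup and release resources
		return nil, fmt.Errorf("embedded etcd did not become ready in time")
	}
}

// StopEmbeddedEtcd shuts the server down gracefully and closes listeners.
func StopEmbeddedEtcd(e *embed.Etcd) {
	e.Close()
}
```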
3. **Implement `StateStore` with etcd Backend (`internal/store/etcd.go`)**
* **Purpose**: Provide the concrete implementation for interacting with an etcd cluster (embedded or external).
* **Details**:
* Create a struct that implements the `StateStore` interface and holds an `etcd/clientv3.Client`.
* Implement `Put(ctx, key, value)`: Use `client.Put()`.
* Implement `Get(ctx, key)`: Use `client.Get()`. Handle key-not-found. Populate `KV.Version` with `ModRevision`.
* Implement `Delete(ctx, key)`: Use `client.Delete()`.
* Implement `List(ctx, prefix)`: Use `client.Get()` with `clientv3.WithPrefix()`.
* Implement `Watch(ctx, keyOrPrefix, startRevision)`: Use `client.Watch()`. Translate etcd events to `WatchEvent`.
* Implement `Close()`: Close the `clientv3.Client`.
* Implement `Campaign(ctx, leaderID, leaseTTLSeconds)`:
* Use `concurrency.NewSession()` to create a lease.
* Use `concurrency.NewElection()` and `election.Campaign()`.
* Return a context that is cancelled when leadership is lost (e.g., by watching the campaign context or session done channel).
* Implement `Resign(ctx)`: Use `election.Resign()`.
* Implement `GetLeader(ctx)`: Observe the election or query the leader key.
* Implement `DoTransaction(ctx, checks, onSuccess, onFailure)`: Use `client.Txn()` with `clientv3.Compare` and `clientv3.Op`.
* **Potential Challenges**: Correctly handling etcd transaction semantics, context propagation, and error translation. Efficiently managing watches. (The `Get` and `DoTransaction` mappings are sketched after the verification list below.)
* **Verification**:
* Unit tests for each `StateStore` method using a real embedded etcd instance (test-scoped).
* Verify `Put` then `Get` retrieves the correct value and version.
* Verify `List` with prefix.
* Verify `Delete` removes the key.
* Verify `Watch` receives correct events for puts/deletes.
* Verify `DoTransaction` commits on success and rolls back on failure.
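
A sketch of two representative methods, `Get` and `DoTransaction`, using `clientv3`; the other methods follow the same pattern. The `KV`/`Compare`/`Op` types are the assumed shapes from Task 1, and the "compare `ModRevision`, where 0 means the key must not exist" convention is an assumption pending the RFC.

```go
package store

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// etcdStore implements StateStore on top of an etcd client.
type etcdStore struct {
	client *clientv3.Client
}

// Get returns the value and ModRevision-backed version for key.
func (s *etcdStore) Get(ctx context.Context, key string) (*KV, error) {
	resp, err := s.client.Get(ctx, key)
	if err != nil {
		return nil, err
	}
	if len(resp.Kvs) == 0 {
		return nil, fmt.Errorf("key not found: %s", key) // or a sentinel error
	}
	kv := resp.Kvs[0]
	return &KV{Key: string(kv.Key), Value: kv.Value, Version: kv.ModRevision}, nil
}

// DoTransaction maps checks/ops onto a single etcd transaction:
// If(checks...) Then(onSuccess...) Else(onFailure...).
func (s *etcdStore) DoTransaction(ctx context.Context, checks []Compare, onSuccess, onFailure []Op) (bool, error) {
	cmps := make([]clientv3.Cmp, 0, len(checks))
	for _, c := range checks {
		// ExpectedVersion == 0 doubles as "key must not exist", since a
		// missing key compares with ModRevision 0 (assumed convention).
		cmps = append(cmps, clientv3.Compare(clientv3.ModRevision(c.Key), "=", c.ExpectedVersion))
	}
	toOps := func(ops []Op) []clientv3.Op {
		out := make([]clientv3.Op, 0, len(ops))
		for _, o := range ops {
			switch o.Type {
			case OpPut:
				out = append(out, clientv3.OpPut(o.Key, string(o.Value)))
			case OpDelete:
				out = append(out, clientv3.OpDelete(o.Key))
			}
		}
		return out
	}
	resp, err := s.client.Txn(ctx).If(cmps...).Then(toOps(onSuccess)...).Else(toOps(onFailure)...).Commit()
	if err != nil {
		return false, err
	}
	return resp.Succeeded, nil // true means the Then branch was applied
}
```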
4. **Integrate Leader Election into `kat-agent` (`cmd/kat-agent/main.go`, `internal/leader/election.go` - likely a new file)**
* **Purpose**: Enable an agent instance to attempt to become the cluster leader.
* **Details**:
* `kat-agent` main function will initialize its `StateStore` client.
* A dedicated goroutine will call `StateStore.Campaign()` (see the election-loop sketch after this task).
* The outcome of `Campaign` (e.g., leadership acquired, a context scoped to the leadership term) determines whether the agent activates its Leader-specific logic (Phase 2+).
* The leader ID could be the node's `nodeName` or a UUID; the lease TTL comes from `cluster.kat`.
* **Verification**:
* Start one `kat-agent` with etcd enabled; it should log "became leader".
* Start a second `kat-agent` configured to connect to the first's etcd; it should log "observing leader <leaderID>" or similar, but not become leader itself.
* If the first agent (leader) is stopped, the second agent should eventually log "became leader".
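
A fragment of what the election loop in `cmd/kat-agent/main.go` might look like, assuming the `Campaign` signature sketched in Task 3; `runLeaderTasks` is a placeholder for the Phase 2+ leader logic and the retry backoff is illustrative.

```go
package main

import (
	"context"
	"log"
	"time"
)

// campaigner is the slice of the StateStore interface this loop needs.
type campaigner interface {
	Campaign(ctx context.Context, leaderID string, leaseTTLSeconds int64) (context.Context, error)
}

// runLeaderTasks hosts Phase 2+ leader-only logic; for now it simply
// blocks until leadership is lost.
func runLeaderTasks(ctx context.Context) { <-ctx.Done() }

// runElection campaigns in a loop: win, act as leader until the returned
// context is cancelled, then campaign again.
func runElection(ctx context.Context, s campaigner, leaderID string, leaseTTL int64) {
	for ctx.Err() == nil {
		// Campaign blocks until leadership is won; leaderCtx is
		// cancelled when leadership is lost.
		leaderCtx, err := s.Campaign(ctx, leaderID, leaseTTL)
		if err != nil {
			log.Printf("campaign failed, retrying: %v", err)
			time.Sleep(2 * time.Second) // hypothetical fixed backoff
			continue
		}
		log.Printf("became leader: %s", leaderID)
		runLeaderTasks(leaderCtx)
		log.Printf("lost leadership: %s", leaderID)
	}
}
```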
5. **Implement Basic `kat-agent init` Command (`cmd/kat-agent/main.go`, `internal/config/parse.go`)**
* **Purpose**: Initialize a new KAT cluster (single node initially).
* **Details**:
* Define the `init` subcommand in `kat-agent` using a CLI library (e.g., `cobra`); a CLI skeleton is sketched after the milestone checklist.
* Flag: `--config <path_to_cluster.kat>`.
* Parse `cluster.kat` (from Phase 0; now used to extract etcd peer/client URLs, the data directory, backup paths, etc.).
* Generate a persistent Cluster UID and store it in etcd (e.g., `/kat/config/cluster_uid`).
* Store `cluster.kat` relevant parameters (or the whole sanitized config) into etcd (e.g., under `/kat/config/cluster_config`).
* Start the embedded etcd server using parsed configurations.
* Initiate leader election.
* **Potential Challenges**: Ensuring `cluster.kat` parsing is robust. Handling existing data directories.
* **Milestone Verification**:
* Running `kat-agent init --config examples/cluster.kat` on a clean system:
* Starts the `kat-agent` process.
* Creates the etcd data directory.
* Logs "Successfully initialized etcd".
* Logs "Became leader: <nodeName>".
* Using `etcdctl` (or a simple `StateStore.Get` test client):
* Verify `/kat/config/cluster_uid` exists and has a UUID.
* Verify `/kat/config/cluster_config` (or similar keys) contains data from `cluster.kat` (e.g., `clusterCIDR`, `serviceCIDR`, `agentPort`, `apiPort`).
* Verify the leader election key exists for the current leader.
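
A `cobra` skeleton for the `init` subcommand described in Task 5; the numbered comments map to the Details above, the default flag value is illustrative, and the wiring steps are placeholders rather than final APIs.

```go
// cmd/kat-agent/main.go — CLI skeleton sketch.
package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func main() {
	var configPath string

	initCmd := &cobra.Command{
		Use:   "init",
		Short: "Initialize a new KAT cluster on this node",
		RunE: func(cmd *cobra.Command, args []string) error {
			// 1. Parse cluster.kat (Phase 0 parser).
			// 2. Start the embedded etcd server with the parsed parameters.
			// 3. Generate /kat/config/cluster_uid once; reuse it thereafter.
			// 4. Store sanitized config under /kat/config/cluster_config.
			// 5. Begin campaigning for leadership.
			fmt.Printf("initializing cluster from %s\n", configPath)
			return nil
		},
	}
	initCmd.Flags().StringVar(&configPath, "config", "cluster.kat", "path to the cluster.kat file")

	rootCmd := &cobra.Command{Use: "kat-agent"}
	rootCmd.AddCommand(initCmd)
	if err := rootCmd.Execute(); err != nil {
		os.Exit(1)
	}
}
```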