# **Phase 1: State Management & Leader Election**
* **Goal**: Establish the foundational state layer using embedded etcd and implement a reliable leader election mechanism. A single `kat-agent` can initialize a cluster, become its leader, and store initial configuration.
* **RFC Sections Primarily Used**: 2.2 (Embedded etcd), 3.9 (ClusterConfiguration), 5.1 (State Store Interface), 5.2 (etcd Implementation Details), 5.3 (Leader Election).

**Tasks & Sub-Tasks:**
1. **Define `StateStore` Go Interface (`internal/store/interface.go`)**
* **Purpose**: Create the abstraction layer for all state operations, decoupling the rest of the system from direct etcd dependencies.
* **Details**: Transcribe the Go interface from RFC 5.1 verbatim. Include `KV`, `WatchEvent`, `EventType`, `Compare`, `Op`, `OpType` structs/constants. See the sketch after this task for the rough shape.
* **Verification**: Code compiles. Interface definition matches RFC.
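
Since RFC 5.1 is the authoritative source, the snippet below is only an illustrative approximation of the interface's shape; every signature, and the `DoTransaction` return value in particular, is an assumption to be replaced by the verbatim RFC text.

```go
// internal/store/interface.go (approximate shape only; RFC 5.1 is authoritative)
package store

import "context"

// KV is a key-value pair; Version carries the etcd ModRevision.
type KV struct {
	Key     string
	Value   []byte
	Version int64
}

// EventType labels watch events.
type EventType int

const (
	EventTypePut EventType = iota
	EventTypeDelete
)

// WatchEvent is delivered on the channel returned by Watch.
type WatchEvent struct {
	Type EventType
	KV   KV
}

// Compare and Op mirror etcd transaction primitives in a backend-neutral form.
type Compare struct {
	Key             string
	ExpectedVersion int64 // 0 can be used to assert "key must not exist"
}

type OpType int

const (
	OpPut OpType = iota
	OpDelete
	OpGet
)

type Op struct {
	Type  OpType
	Key   string
	Value []byte
}

// StateStore abstracts all cluster state operations away from etcd specifics.
type StateStore interface {
	Put(ctx context.Context, key string, value []byte) error
	Get(ctx context.Context, key string) (*KV, error)
	Delete(ctx context.Context, key string) error
	List(ctx context.Context, prefix string) ([]KV, error)
	Watch(ctx context.Context, keyOrPrefix string, startRevision int64) (<-chan WatchEvent, error)
	Close() error

	// Campaign blocks until leadership is won; the returned context is
	// cancelled when leadership is lost.
	Campaign(ctx context.Context, leaderID string, leaseTTLSeconds int64) (context.Context, error)
	Resign(ctx context.Context) error
	GetLeader(ctx context.Context) (string, error)

	DoTransaction(ctx context.Context, checks []Compare, onSuccess, onFailure []Op) (bool, error)
}
```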
2. **Implement Embedded etcd Server Logic (`internal/store/etcd.go`)**
* **Purpose**: Allow `kat-agent` to run its own etcd instance for single-node clusters or as part of a multi-node quorum.
* **Details**:
* Use `go.etcd.io/etcd/server/v3/embed`.
* Function to start an embedded etcd server (see the sketch after this task):
* Input: configuration parameters (data directory, peer URLs, client URLs, name). These will come from `cluster.kat` or defaults.
* Output: a running `embed.Etcd` instance or an error.
* Graceful shutdown logic for the embedded etcd server.
* **Verification**: A test can start and stop an embedded etcd server. Data directory is created and used.
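
A minimal sketch of the start/stop logic, assuming `go.etcd.io/etcd/server/v3/embed`. `EtcdConfig` is a hypothetical carrier for the values parsed from `cluster.kat`; peer/client URL wiring is omitted because the relevant `embed.Config` field names differ between etcd minor versions.

```go
// internal/store/etcd.go (embedded-server portion, sketch)
package store

import (
	"fmt"
	"time"

	"go.etcd.io/etcd/server/v3/embed"
)

// EtcdConfig is a hypothetical struct holding parameters from cluster.kat.
type EtcdConfig struct {
	Name    string // member name
	DataDir string // etcd data directory
	// Peer/client listen and advertise URLs would also live here and be
	// copied onto embed.Config before starting the server.
}

// StartEmbeddedEtcd starts an embedded etcd server and waits until it is
// ready to serve requests (or fails / times out).
func StartEmbeddedEtcd(c EtcdConfig) (*embed.Etcd, error) {
	cfg := embed.NewConfig()
	cfg.Name = c.Name
	cfg.Dir = c.DataDir

	e, err := embed.StartEtcd(cfg)
	if err != nil {
		return nil, fmt.Errorf("starting embedded etcd: %w", err)
	}

	select {
	case <-e.Server.ReadyNotify():
		return e, nil // server is ready
	case err := <-e.Err():
		e.Close()
		return nil, err
	case <-time.After(60 * time.Second):
		e.Close()
		return nil, fmt.Errorf("embedded etcd did not become ready in time")
	}
}

// StopEmbeddedEtcd shuts the server down and releases its resources.
func StopEmbeddedEtcd(e *embed.Etcd) {
	e.Close()
}
```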
3. **Implement `StateStore` with etcd Backend (`internal/store/etcd.go`)**
* **Purpose**: Provide the concrete implementation for interacting with an etcd cluster (embedded or external).
* **Details** (a partial sketch of `Get` and `Campaign` follows this task):
* Create a struct that implements the `StateStore` interface and holds an `etcd/clientv3.Client`.
* Implement `Put(ctx, key, value)`: Use `client.Put()`.
* Implement `Get(ctx, key)`: Use `client.Get()`. Handle key-not-found. Populate `KV.Version` with `ModRevision`.
* Implement `Delete(ctx, key)`: Use `client.Delete()`.
* Implement `List(ctx, prefix)`: Use `client.Get()` with `clientv3.WithPrefix()`.
* Implement `Watch(ctx, keyOrPrefix, startRevision)`: Use `client.Watch()`. Translate etcd events to `WatchEvent`.
* Implement `Close()`: Close the `clientv3.Client`.
* Implement `Campaign(ctx, leaderID, leaseTTLSeconds)`:
* Use `concurrency.NewSession()` to create a lease.
* Use `concurrency.NewElection()` and `election.Campaign()`.
* Return a context that is cancelled when leadership is lost (e.g., by watching the campaign context or session done channel).
* Implement `Resign(ctx)`: Use `election.Resign()`.
* Implement `GetLeader(ctx)`: Observe the election or query the leader key.
* Implement `DoTransaction(ctx, checks, onSuccess, onFailure)`: Use `client.Txn()` with `clientv3.Compare` and `clientv3.Op`.
* **Potential Challenges**: Correctly handling etcd transaction semantics, context propagation, and error translation. Efficiently managing watches.
* **Verification**:
* Unit tests for each `StateStore` method using a real embedded etcd instance (test-scoped).
* Verify `Put` then `Get` retrieves the correct value and version.
* Verify `List` with prefix.
* Verify `Delete` removes the key.
* Verify `Watch` receives correct events for puts/deletes.
* Verify `DoTransaction` commits on success and rolls back on failure.
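
A partial sketch of the etcd-backed store covering `Get` and `Campaign`, the two methods most likely to run into the challenges above. The signatures follow the interface sketch in Task 1; the election prefix `/kat/leader_election` and the use of `session.Done()` to signal lost leadership are assumptions, not requirements from the RFC.

```go
// internal/store/etcd.go (client portion, partial sketch)
package store

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

type etcdStore struct {
	client   *clientv3.Client
	session  *concurrency.Session
	election *concurrency.Election
}

// Get returns the value and ModRevision for key, or an error if it is absent.
func (s *etcdStore) Get(ctx context.Context, key string) (*KV, error) {
	resp, err := s.client.Get(ctx, key)
	if err != nil {
		return nil, err
	}
	if len(resp.Kvs) == 0 {
		return nil, fmt.Errorf("key %q not found", key)
	}
	kv := resp.Kvs[0]
	return &KV{Key: string(kv.Key), Value: kv.Value, Version: kv.ModRevision}, nil
}

// Campaign blocks until this candidate wins the election, then returns a
// context that is cancelled when the lease/session expires (leadership lost).
func (s *etcdStore) Campaign(ctx context.Context, leaderID string, leaseTTLSeconds int64) (context.Context, error) {
	session, err := concurrency.NewSession(s.client, concurrency.WithTTL(int(leaseTTLSeconds)))
	if err != nil {
		return nil, err
	}
	election := concurrency.NewElection(session, "/kat/leader_election") // assumed key prefix
	if err := election.Campaign(ctx, leaderID); err != nil {
		session.Close()
		return nil, err
	}
	s.session, s.election = session, election

	leadershipCtx, cancel := context.WithCancel(context.Background())
	go func() {
		<-session.Done() // lease expired or session closed
		cancel()
	}()
	return leadershipCtx, nil
}
```

`Resign` and `GetLeader` would then be thin wrappers over `election.Resign()` and `election.Leader()`, and `DoTransaction` over `client.Txn()`.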
4. **Integrate Leader Election into `kat-agent` (`cmd/kat-agent/main.go` and a new `internal/leader/election.go`)**
* **Purpose**: Enable an agent instance to attempt to become the cluster leader.
* **Details**:
* `kat-agent` main function will initialize its `StateStore` client.
* A dedicated goroutine will call `StateStore.Campaign()` (see the sketch after this task).
* The outcome of `Campaign` (e.g., leadership acquired, context for leadership duration) will determine if the agent activates its Leader-specific logic (Phase 2+).
* The leader ID could be the agent's `nodeName` or a UUID; the lease TTL comes from `cluster.kat`.
* **Verification**:
* Start one `kat-agent` with etcd enabled; it should log "became leader".
* Start a second `kat-agent` configured to connect to the first's etcd; it should log "observing leader <leaderID>" or similar, but not become leader itself.
* If the first agent (leader) is stopped, the second agent should eventually log "became leader".
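
One way the agent might drive the election, sketched under the assumption that `Campaign` behaves as in the Task 3 sketch; the module path and the `runLeaderTasks` callback are placeholders, not decided names.

```go
// cmd/kat-agent/main.go (election wiring, sketch)
package main

import (
	"context"
	"log"
	"time"

	"kat/internal/store" // placeholder module path
)

// runElection keeps campaigning for leadership until ctx is cancelled.
// runLeaderTasks stands in for the leader-only logic arriving in Phase 2+.
func runElection(ctx context.Context, st store.StateStore, nodeName string, leaseTTL int64, runLeaderTasks func(context.Context)) {
	for ctx.Err() == nil {
		// Blocks until this agent becomes leader or ctx is cancelled.
		leadershipCtx, err := st.Campaign(ctx, nodeName, leaseTTL)
		if err != nil {
			log.Printf("campaign failed: %v; retrying", err)
			time.Sleep(2 * time.Second)
			continue
		}
		log.Printf("became leader: %s", nodeName)

		// Leader-only logic runs for as long as leadershipCtx stays valid.
		runLeaderTasks(leadershipCtx)

		<-leadershipCtx.Done()
		log.Printf("lost leadership, re-entering campaign")
	}
}
```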
5. **Implement Basic `kat-agent init` Command (`cmd/kat-agent/main.go`, `internal/config/parse.go`)**
* **Purpose**: Initialize a new KAT cluster (single node initially).
* **Details**:
* Define the `init` subcommand in `kat-agent` using a CLI library (e.g., `cobra`); see the sketch after this task.
* Flag: `--config <path_to_cluster.kat>`.
* Parse `cluster.kat` with the Phase 0 parser to extract the etcd peer/client URLs, data directory, backup paths, and related settings.
* Generate a persistent Cluster UID and store it in etcd (e.g., `/kat/config/cluster_uid`).
* Store `cluster.kat` relevant parameters (or the whole sanitized config) into etcd (e.g., under `/kat/config/cluster_config`).
* Start the embedded etcd server using parsed configurations.
* Initiate leader election.
* **Potential Challenges**: Ensuring `cluster.kat` parsing is robust. Handling existing data directories.
* **Milestone Verification**:
* Running `kat-agent init --config examples/cluster.kat` on a clean system:
* Starts the `kat-agent` process.
* Creates the etcd data directory.
* Logs "Successfully initialized etcd".
* Logs "Became leader: <nodeName>".
* Using `etcdctl` (or a simple `StateStore.Get` test client):
* Verify `/kat/config/cluster_uid` exists and has a UUID.
* Verify `/kat/config/cluster_config` (or similar keys) contains data from `cluster.kat` (e.g., `clusterCIDR`, `serviceCIDR`, `agentPort`, `apiPort`).
* Verify the leader election key exists for the current leader.
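
A sketch of how the `init` subcommand could be wired, assuming `cobra` and `github.com/google/uuid` as dependencies; `config.ParseClusterKat`, `store.NewEtcdStore`, and the fields on the parsed config are hypothetical names standing in for the Phase 0 parser and the helpers sketched earlier.

```go
// cmd/kat-agent/main.go (init subcommand, sketch)
package main

import (
	"context"
	"fmt"

	"github.com/google/uuid"
	"github.com/spf13/cobra"

	"kat/internal/config" // placeholder module paths
	"kat/internal/store"
)

func newInitCmd() *cobra.Command {
	var configPath string

	cmd := &cobra.Command{
		Use:   "init",
		Short: "Initialize a new KAT cluster on this node",
		RunE: func(cmd *cobra.Command, args []string) error {
			// Parse cluster.kat with the Phase 0 parser (hypothetical helper).
			clusterCfg, err := config.ParseClusterKat(configPath)
			if err != nil {
				return fmt.Errorf("parsing %s: %w", configPath, err)
			}

			// Start embedded etcd with the parsed parameters.
			etcdServer, err := store.StartEmbeddedEtcd(clusterCfg.Etcd)
			if err != nil {
				return err
			}
			defer store.StopEmbeddedEtcd(etcdServer)

			// Connect a StateStore client and persist the cluster identity.
			st, err := store.NewEtcdStore(clusterCfg.Etcd.ClientURLs) // hypothetical constructor
			if err != nil {
				return err
			}
			defer st.Close()

			ctx := context.Background()
			if err := st.Put(ctx, "/kat/config/cluster_uid", []byte(uuid.NewString())); err != nil {
				return err
			}
			// Sanitized cluster configuration is stored alongside the UID.
			if err := st.Put(ctx, "/kat/config/cluster_config", clusterCfg.Sanitized()); err != nil {
				return err
			}

			// Enter leader election (see the runElection sketch in Task 4).
			runElection(ctx, st, clusterCfg.NodeName, clusterCfg.LeaderLeaseTTLSeconds, func(context.Context) {
				// Leader-specific logic arrives in Phase 2+.
			})
			return nil
		},
	}
	cmd.Flags().StringVar(&configPath, "config", "", "path to cluster.kat")
	return cmd
}
```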