Phase 1: State Management & Leader Election

  • Goal: Establish the foundational state layer using embedded etcd and implement a reliable leader election mechanism. A single kat-agent can initialize a cluster, become its leader, and store initial configuration.
  • RFC Sections Primarily Used: 2.2 (Embedded etcd), 3.9 (ClusterConfiguration), 5.1 (State Store Interface), 5.2 (etcd Implementation Details), 5.3 (Leader Election).

Tasks & Sub-Tasks:

  1. Define StateStore Go Interface (internal/store/interface.go)

    • Purpose: Create the abstraction layer for all state operations, decoupling the rest of the system from direct etcd dependencies.
    • Details: Transcribe the Go interface from RFC 5.1 verbatim. Include the KV, WatchEvent, EventType, Compare, Op, and OpType structs/constants. An illustrative sketch of the expected shape follows this task.
    • Verification: Code compiles. Interface definition matches RFC.
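
The RFC text is the source of truth here; the sketch below only illustrates one plausible shape for the interface, derived from the type and method names listed above. The signatures (especially Campaign and DoTransaction) are assumptions and should be replaced by the RFC 5.1 definition verbatim.

```go
package store

import "context"

// KV is a single key/value pair; Version carries etcd's ModRevision.
type KV struct {
	Key     string
	Value   []byte
	Version int64
}

// EventType distinguishes the kinds of changes a Watch can report.
type EventType int

const (
	EventTypePut EventType = iota
	EventTypeDelete
)

// WatchEvent is one change event delivered on a Watch channel.
type WatchEvent struct {
	Type EventType
	KV   KV
}

// OpType names the write operations usable inside a transaction branch.
type OpType int

const (
	OpPut OpType = iota
	OpDelete
)

// Compare is a transaction guard: proceed only if the key's Version
// (ModRevision) matches ExpectedVersion.
type Compare struct {
	Key             string
	ExpectedVersion int64
}

// Op is a single write executed in a transaction branch.
type Op struct {
	Type  OpType
	Key   string
	Value []byte
}

// StateStore abstracts all persistent state operations so the rest of the
// system never imports etcd packages directly.
type StateStore interface {
	Put(ctx context.Context, key string, value []byte) error
	Get(ctx context.Context, key string) (*KV, error)
	Delete(ctx context.Context, key string) error
	List(ctx context.Context, prefix string) ([]KV, error)
	Watch(ctx context.Context, keyOrPrefix string, startRevision int64) (<-chan WatchEvent, error)
	Close() error

	// Campaign blocks until leadership is acquired (or ctx is cancelled) and
	// returns a context that is cancelled when leadership is later lost.
	Campaign(ctx context.Context, leaderID string, leaseTTLSeconds int64) (context.Context, error)
	Resign(ctx context.Context) error
	GetLeader(ctx context.Context) (string, error)

	// DoTransaction applies onSuccess if every check passes, else onFailure;
	// the bool reports which branch ran.
	DoTransaction(ctx context.Context, checks []Compare, onSuccess, onFailure []Op) (bool, error)
}
```
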
  2. Implement Embedded etcd Server Logic (internal/store/etcd.go)

    • Purpose: Allow kat-agent to run its own etcd instance for single-node clusters or as part of a multi-node quorum.
    • Details:
      • Use go.etcd.io/etcd/server/v3/embed.
      • Function to start an embedded etcd server:
        • Input: configuration parameters (data directory, peer URLs, client URLs, name). These will come from cluster.kat or defaults.
        • Output: a running embed.Etcd instance or an error.
      • Graceful shutdown logic for the embedded etcd server (a start/stop sketch follows this task).
    • Verification: A test can start and stop an embedded etcd server. Data directory is created and used.
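
A minimal start/ready/stop sketch using the embed package. The 60-second ready timeout is an arbitrary choice, and the URL field names (LPUrls/LCUrls/APUrls/ACUrls) are those of etcd v3.5; v3.6 renames them to ListenPeerUrls, ListenClientUrls, AdvertisePeerUrls, and AdvertiseClientUrls.

```go
package store

import (
	"fmt"
	"net/url"
	"time"

	"go.etcd.io/etcd/server/v3/embed"
)

// StartEmbeddedEtcd boots a single-member embedded etcd server and blocks
// until it is ready to serve (or times out). The caller owns the returned
// instance and must Close() it on shutdown.
func StartEmbeddedEtcd(name, dataDir string, peerURL, clientURL url.URL) (*embed.Etcd, error) {
	cfg := embed.NewConfig()
	cfg.Name = name
	cfg.Dir = dataDir // etcd creates this data directory if it does not exist
	// NOTE: field names are from etcd v3.5; v3.6 renamed them (see lead-in).
	cfg.LPUrls = []url.URL{peerURL}
	cfg.APUrls = []url.URL{peerURL}
	cfg.LCUrls = []url.URL{clientURL}
	cfg.ACUrls = []url.URL{clientURL}
	cfg.InitialCluster = fmt.Sprintf("%s=%s", name, peerURL.String())

	e, err := embed.StartEtcd(cfg)
	if err != nil {
		return nil, err
	}
	select {
	case <-e.Server.ReadyNotify():
		return e, nil // server is ready to serve client requests
	case <-time.After(60 * time.Second):
		e.Close() // graceful shutdown: stops the server and closes listeners
		return nil, fmt.Errorf("embedded etcd %q did not become ready in time", name)
	}
}
```
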
  3. Implement StateStore with etcd Backend (internal/store/etcd.go)

    • Purpose: Provide the concrete implementation for interacting with an etcd cluster (embedded or external).
    • Details:
      • Create a struct that implements the StateStore interface and holds an etcd/clientv3.Client.
      • Implement Put(ctx, key, value): Use client.Put().
      • Implement Get(ctx, key): Use client.Get(). Handle key-not-found. Populate KV.Version with ModRevision.
      • Implement Delete(ctx, key): Use client.Delete().
      • Implement List(ctx, prefix): Use client.Get() with clientv3.WithPrefix().
      • Implement Watch(ctx, keyOrPrefix, startRevision): Use client.Watch(). Translate etcd events to WatchEvent.
      • Implement Close(): Close the clientv3.Client.
      • Implement Campaign(ctx, leaderID, leaseTTLSeconds):
        • Use concurrency.NewSession() to create a lease.
        • Use concurrency.NewElection() and election.Campaign().
        • Return a context that is cancelled when leadership is lost (e.g., by watching the campaign context or the session's Done() channel); see the sketch after this task.
      • Implement Resign(ctx): Use election.Resign().
      • Implement GetLeader(ctx): Observe the election or query the leader key.
      • Implement DoTransaction(ctx, checks, onSuccess, onFailure): Use client.Txn() with clientv3.Compare and clientv3.Op.
    • Potential Challenges: Correctly handling etcd transaction semantics, context propagation, and error translation. Efficiently managing watches.
    • Verification:
      • Unit tests for each StateStore method using a real embedded etcd instance (test-scoped).
      • Verify Put then Get retrieves the correct value and version.
      • Verify List with prefix.
      • Verify Delete removes the key.
      • Verify Watch receives correct events for puts/deletes.
      • Verify DoTransaction commits on success and rolls back on failure.
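
The two least obvious pieces are mapping ModRevision into KV.Version and turning a concurrency session into a cancellable leadership context. Below is a hedged sketch of Get and Campaign, assuming the interface shape sketched above and a hypothetical election prefix /kat/leader_election; the nil-KV convention for missing keys is also an assumption.

```go
package store

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

const electionPrefix = "/kat/leader_election" // hypothetical key prefix

type etcdStore struct {
	client   *clientv3.Client
	session  *concurrency.Session
	election *concurrency.Election
}

// Get returns the KV at key, mapping etcd's ModRevision to KV.Version;
// a missing key is reported as a nil KV (the error policy may differ).
func (s *etcdStore) Get(ctx context.Context, key string) (*KV, error) {
	resp, err := s.client.Get(ctx, key)
	if err != nil {
		return nil, err
	}
	if len(resp.Kvs) == 0 {
		return nil, nil // key not found
	}
	kv := resp.Kvs[0]
	return &KV{Key: string(kv.Key), Value: kv.Value, Version: kv.ModRevision}, nil
}

// Campaign blocks until this node wins the election, then returns a context
// that is cancelled when the lease session ends (i.e., leadership is lost).
func (s *etcdStore) Campaign(ctx context.Context, leaderID string, leaseTTLSeconds int64) (context.Context, error) {
	session, err := concurrency.NewSession(s.client, concurrency.WithTTL(int(leaseTTLSeconds)))
	if err != nil {
		return nil, err
	}
	election := concurrency.NewElection(session, electionPrefix)
	if err := election.Campaign(ctx, leaderID); err != nil {
		session.Close()
		return nil, err
	}
	leaderCtx, cancel := context.WithCancel(ctx)
	go func() {
		<-session.Done() // fires when the lease expires or the session closes
		cancel()
	}()
	s.session, s.election = session, election
	return leaderCtx, nil
}
```
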
  4. Integrate Leader Election into kat-agent (cmd/kat-agent/main.go; possibly a new internal/leader/election.go)

    • Purpose: Enable an agent instance to attempt to become the cluster leader.
    • Details:
      • kat-agent main function will initialize its StateStore client.
      • A dedicated goroutine will call StateStore.Campaign().
      • The outcome of Campaign (leadership acquired, plus a context that stays live for the duration of leadership) determines whether the agent activates its Leader-specific logic (Phase 2+); a campaign-loop sketch follows this task.
      • Leader ID could be nodeName or a UUID. Lease TTL from cluster.kat.
    • Verification:
      • Start one kat-agent with etcd enabled; it should log "became leader".
      • Start a second kat-agent configured to connect to the first's etcd; it should log something like "observing leader <id>", but not become leader itself.
      • If the first agent (leader) is stopped, the second agent should eventually log "became leader".
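
A sketch of the agent-side campaign loop, assuming the Campaign signature sketched above. The import path, runLeaderLogic, the log wording, and the backoff duration are all illustrative placeholders, not prescribed by the RFC.

```go
package main

import (
	"context"
	"log"
	"time"

	"kat/internal/store" // hypothetical module path
)

// runLeaderElection campaigns in a loop: block until leadership is won, run
// leader-only logic until the leadership context is cancelled, then re-campaign.
func runLeaderElection(ctx context.Context, st store.StateStore, nodeName string, leaseTTL int64) {
	for ctx.Err() == nil {
		leaderCtx, err := st.Campaign(ctx, nodeName, leaseTTL)
		if err != nil {
			log.Printf("campaign failed: %v; retrying", err)
			time.Sleep(2 * time.Second) // illustrative backoff
			continue
		}
		log.Printf("became leader: %s", nodeName)
		runLeaderLogic(leaderCtx)
		log.Printf("lost leadership: %s", nodeName)
	}
}

// runLeaderLogic is a placeholder for the Phase 2+ leader-only control loop.
func runLeaderLogic(leaderCtx context.Context) {
	<-leaderCtx.Done() // block until leadership is lost
}
```
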
  5. Implement Basic kat-agent init Command (cmd/kat-agent/main.go, internal/config/parse.go)

    • Purpose: Initialize a new KAT cluster (single node initially).
    • Details:
      • Define an init subcommand in kat-agent using a CLI library (e.g., cobra); a wiring sketch follows this task.
      • Flag: --config <path_to_cluster.kat>.
      • Parse cluster.kat (the parser from Phase 0, now used to extract etcd peer/client URLs, the data directory, backup paths, etc.).
      • Generate a persistent Cluster UID and store it in etcd (e.g., /kat/config/cluster_uid).
      • Store cluster.kat relevant parameters (or the whole sanitized config) into etcd (e.g., under /kat/config/cluster_config).
      • Start the embedded etcd server using parsed configurations.
      • Initiate leader election.
    • Potential Challenges: Ensuring cluster.kat parsing is robust. Handling existing data directories.
    • Milestone Verification:
      • Running kat-agent init --config examples/cluster.kat on a clean system:
        • Starts the kat-agent process.
        • Creates the etcd data directory.
        • Logs "Successfully initialized etcd".
        • Logs "Became leader: ".
        • Using etcdctl (or a simple StateStore.Get test client):
          • Verify /kat/config/cluster_uid exists and has a UUID.
          • Verify /kat/config/cluster_config (or similar keys) contains data from cluster.kat (e.g., clusterCIDR, serviceCIDR, agentPort, apiPort).
          • Verify the leader election key exists for the current leader.
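
A sketch of the init wiring, assuming cobra is the chosen CLI library. config.ParseClusterKat, store.NewEtcdStore, and the cfg fields are hypothetical stand-ins for the Phase 0 parser and the constructors sketched above; real code should also guard the cluster UID write with DoTransaction so re-running init never overwrites an existing UID.

```go
package main

import (
	"fmt"

	"github.com/google/uuid"
	"github.com/spf13/cobra"

	"kat/internal/config" // hypothetical module paths
	"kat/internal/store"
)

const clusterUIDKey = "/kat/config/cluster_uid"

func newInitCmd() *cobra.Command {
	var configPath string
	cmd := &cobra.Command{
		Use:   "init",
		Short: "Initialize a new KAT cluster on this node",
		RunE: func(cmd *cobra.Command, args []string) error {
			// ParseClusterKat is the hypothetical Phase 0 cluster.kat parser.
			cfg, err := config.ParseClusterKat(configPath)
			if err != nil {
				return fmt.Errorf("parsing %s: %w", configPath, err)
			}
			etcd, err := store.StartEmbeddedEtcd(cfg.NodeName, cfg.DataDir, cfg.PeerURL, cfg.ClientURL)
			if err != nil {
				return fmt.Errorf("starting embedded etcd: %w", err)
			}
			defer etcd.Close()

			st, err := store.NewEtcdStore(cfg.ClientURL.String()) // hypothetical constructor
			if err != nil {
				return err
			}
			defer st.Close()

			// Persist a generated cluster UID. Real code should first check for
			// an existing UID (e.g., via DoTransaction) to handle re-init safely.
			if err := st.Put(cmd.Context(), clusterUIDKey, []byte(uuid.NewString())); err != nil {
				return err
			}
			runLeaderElection(cmd.Context(), st, cfg.NodeName, cfg.LeaseTTLSeconds)
			return nil
		},
	}
	cmd.Flags().StringVar(&configPath, "config", "", "path to cluster.kat")
	_ = cmd.MarkFlagRequired("config")
	return cmd
}
```
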