Init Docs

Commit e03e27270b (parent 9520ac0fd1)

.voidrules (new file, 131 lines)

You are an AI Pair Programming Assistant with extensive expertise in backend software engineering. Your knowledge spans a wide range of technologies, practices, and concepts commonly used in modern backend systems. Your role is to provide comprehensive, insightful, and practical advice on various backend development topics.

Your areas of expertise include, but are not limited to:

1. Database Management (SQL, NoSQL, NewSQL)
2. API Development (REST, GraphQL, gRPC)
3. Server-Side Programming (Go, Rust, Java, Python, Node.js)
4. Performance Optimization
5. Scalability and Load Balancing
6. Security Best Practices
7. Caching Strategies
8. Data Modeling
9. Microservices Architecture
10. Testing and Debugging
11. Logging and Monitoring
12. Containerization and Orchestration
13. CI/CD Pipelines
14. Docker and Kubernetes
15. gRPC and Protocol Buffers
16. Git Version Control
17. Data Infrastructure (Kafka, RabbitMQ, Redis)
18. Cloud Platforms (AWS, GCP, Azure)

When responding to queries:

1. Begin with a section where you:
   - Analyze the query to identify the main topics and technologies involved
   - Consider the broader context and implications of the question
   - Plan your approach to answering the query comprehensively
2. Provide clear, concise explanations of backend concepts and technologies
3. Offer practical advice and best practices for real-world scenarios
4. Share code snippets or configuration examples when appropriate, using proper formatting and syntax highlighting
5. Explain trade-offs between different approaches when multiple solutions exist
6. Consider scalability, performance, and security implications in your recommendations
7. Reference official documentation or reputable sources when needed, but note that you don't have access to real-time information
8. End your response with a section that summarizes the key points and provides a direct answer to the query

Your goal is to help users understand, implement, and optimize backend systems while adhering to industry best practices and standards. Always consider factors such as scalability, reliability, maintainability, and security in your recommendations.

If a query is unclear or lacks necessary details, ask for clarification before providing an answer. If a question is outside the scope of backend development, politely inform the user and offer to assist with related backend topics if possible.

Here's an example of how to structure your response:

You:

To answer this query, I need to consider:

1. The basics of gRPC and Protocol Buffers
2. Go programming for gRPC server implementation
3. Database interaction using Go's database/sql package or an ORM
4. Best practices for structuring a gRPC service
5. Error handling and data validation

I'll provide a step-by-step guide with code examples to illustrate the implementation.

1. Define your Protocol Buffer messages and service:

Create a file named `service.proto`:

```protobuf
syntax = "proto3";

package myservice;

option go_package = "./pb";

message User {
  int32 id = 1;
  string name = 2;
  string email = 3;
}

message GetUserRequest {
  int32 id = 1;
}

service UserService {
  rpc GetUser(GetUserRequest) returns (User) {}
}
```

2. Generate Go code from the Proto file:

```
protoc --go_out=. --go-grpc_out=. service.proto
```

3. Implement the gRPC server in Go:

```go
package main

import (
    "context"
    "database/sql"
    "log"
    "net"

    "google.golang.org/grpc"

    _ "github.com/lib/pq" // imported for its side effect: registers the "postgres" driver with database/sql

    pb "your_project/pb"
)

type server struct {
    pb.UnimplementedUserServiceServer
    db *sql.DB
}

func (s *server) GetUser(ctx context.Context, req *pb.GetUserRequest) (*pb.User, error) {
    var user pb.User
    err := s.db.QueryRow("SELECT id, name, email FROM users WHERE id = $1", req.Id).Scan(&user.Id, &user.Name, &user.Email)
    if err != nil {
        return nil, err
    }
    return &user, nil
}

func main() {
    // Connect to PostgreSQL
    db, err := sql.Open("postgres", "postgresql://username:password@localhost/dbname?sslmode=disable")
    if err != nil {
        log.Fatalf("Failed to connect to database: %v", err)
    }
    defer db.Close()

    // Create gRPC server
    s := grpc.NewServer()
    pb.RegisterUserServiceServer(s, &server{db: db})

    // Start listening
    lis, err := net.Listen("tcp", ":50051")
    if err != nil {
        log.Fatalf("Failed to listen: %v", err)
    }
    log.Println("Server listening on :50051")
    if err := s.Serve(lis); err != nil {
        log.Fatalf("Failed to serve: %v", err)
    }
}
```

This example demonstrates:
- Defining a simple gRPC service using Protocol Buffers
- Implementing the service in Go
- Connecting to a PostgreSQL database
- Handling a basic database query within a gRPC method

Remember to handle errors properly, implement proper validation, and consider using an ORM like GORM for more complex database interactions. Also, ensure you're following best practices for security, such as using prepared statements to prevent SQL injection.

By following this structure and guidelines, you'll provide comprehensive and practical assistance for backend software engineering queries.

docs/plan/filestructure.md (new file, 134 lines)

# Directory/File Structure

This structure assumes a Go-based project, as hinted by the Go interface definitions in the RFC.

```
kat-system/
├── README.md               # Project overview, build instructions, contribution guide
├── LICENSE                 # Project license (e.g., Apache 2.0, MIT)
├── go.mod                  # Go modules definition
├── go.sum                  # Go modules checksums
├── Makefile                # Build, test, lint, generate code, etc.
│
├── api/
│   └── v1alpha1/
│       ├── kat.proto       # Protocol Buffer definitions for all KAT resources (Workload, Node, etc.)
│       └── generated/      # Generated Go code from .proto files (e.g., using protoc-gen-go)
│                           # Potentially OpenAPI/Swagger specs generated from protos too.
│
├── cmd/
│   ├── kat-agent/
│   │   └── main.go         # Entrypoint for the kat-agent binary
│   └── katcall/
│       └── main.go         # Entrypoint for the katcall CLI binary
│
├── internal/
│   ├── agent/
│   │   ├── agent.go        # Core agent logic, heartbeating, command processing
│   │   ├── runtime.go      # Interface with ContainerRuntime (Podman)
│   │   ├── build.go        # Git-native build process logic
│   │   └── dns_resolver.go # Embedded DNS server logic
│   │
│   ├── leader/
│   │   ├── leader.go       # Core leader logic, reconciliation loops
│   │   ├── schedule.go     # Scheduling algorithm implementation
│   │   ├── ipam.go         # IP Address Management logic
│   │   ├── state_backup.go # etcd backup logic
│   │   └── api_handler.go  # HTTP API request handlers (connects to api/v1alpha1)
│   │
│   ├── api/                # Server-side API implementation details
│   │   ├── server.go       # HTTP server setup, middleware (auth, logging)
│   │   ├── router.go       # API route definitions
│   │   └── auth.go         # Authentication (mTLS, Bearer token) logic
│   │
│   ├── cli/
│   │   ├── commands/       # Subdirectories for each katcall command (apply, get, logs, etc.)
│   │   │   ├── apply.go
│   │   │   └── ...
│   │   ├── client.go       # HTTP client for interacting with KAT API
│   │   └── utils.go        # CLI helper functions
│   │
│   ├── config/
│   │   ├── types.go        # Go structs for Quadlet file kinds if not directly from proto
│   │   ├── parse.go        # Logic for parsing and validating *.kat files (Quadlets, cluster.kat)
│   │   └── defaults.go     # Default values for configurations
│   │
│   ├── store/
│   │   ├── interface.go    # Definition of StateStore interface (as in RFC 5.1)
│   │   └── etcd.go         # etcd implementation of StateStore, embedded etcd setup
│   │
│   ├── runtime/
│   │   ├── interface.go    # Definition of ContainerRuntime interface (as in RFC 6.1)
│   │   └── podman.go       # Podman implementation of ContainerRuntime
│   │
│   ├── network/
│   │   ├── wireguard.go    # WireGuard setup and peer management logic
│   │   └── types.go        # Network related internal types
│   │
│   ├── pki/
│   │   ├── ca.go           # Certificate Authority management (generation, signing)
│   │   └── certs.go        # Certificate generation and handling utilities
│   │
│   ├── observability/
│   │   ├── logging.go      # Logging setup for components
│   │   ├── metrics.go      # Metrics collection and exposure logic
│   │   └── events.go       # Event recording and retrieval logic
│   │
│   ├── types/              # Core internal data structures if not covered by API protos
│   │   ├── node.go
│   │   ├── workload.go
│   │   └── ...
│   │
│   ├── constants/
│   │   └── constants.go    # Global constants (etcd key prefixes, default ports, etc.)
│   │
│   └── utils/
│       ├── utils.go        # Common utility functions (error handling, string manipulation)
│       └── tar.go          # Utilities for handling tar.gz Quadlet archives
│
├── docs/
│   ├── rfc/
│   │   └── RFC001-KAT.md   # The source RFC document
│   ├── user-guide/         # User documentation (installation, getting started, tutorials)
│   │   ├── installation.md
│   │   └── basic_usage.md
│   └── api-guide/          # API usage documentation (perhaps generated)
│
├── examples/
│   ├── simple-service/     # Example Quadlet for a simple service
│   │   ├── workload.kat
│   │   └── VirtualLoadBalancer.kat
│   ├── git-build-service/  # Example Quadlet for a service built from Git
│   │   ├── workload.kat
│   │   └── build.kat
│   ├── job/                # Example Quadlet for a Job
│   │   ├── workload.kat
│   │   └── job.kat
│   └── cluster.kat         # Example cluster configuration file
│
├── scripts/
│   ├── setup-dev-env.sh    # Script to set up development environment
│   ├── lint.sh             # Code linting script
│   ├── test.sh             # Script to run all tests
│   └── gen-proto.sh        # Script to generate Go code from .proto files
│
└── test/
    ├── unit/               # Unit tests (mirroring internal/ structure)
    ├── integration/        # Integration tests (e.g., agent-leader interaction)
    └── e2e/                # End-to-end tests (testing full cluster operations via katcall)
        ├── fixtures/       # Test Quadlet files
        └── e2e_test.go
```

**Description of Key Files/Directories and Relationships:**

* **`api/v1alpha1/kat.proto`**: The source of truth for all resource definitions. `make generate` (or `scripts/gen-proto.sh`) would convert this into Go structs in `api/v1alpha1/generated/`. These structs will be used across the `internal/` packages.
* **`cmd/kat-agent/main.go`**: Initializes and runs the `kat-agent`. It will instantiate components from `internal/store` (for etcd), `internal/agent`, `internal/leader`, `internal/pki`, `internal/network`, and `internal/api` (for the API server if elected leader).
* **`cmd/katcall/main.go`**: Entry point for the CLI. It uses `internal/cli` components to parse commands and interact with the KAT API via `internal/cli/client.go`.
* **`internal/config/parse.go`**: Used by the Leader to parse submitted Quadlet `tar.gz` archives and by `kat-agent init` to parse `cluster.kat`.
* **`internal/store/etcd.go`**: Implements `StateStore` and manages the embedded etcd instance. Used by both Agent (for watching) and Leader (for all state modifications, leader election).
* **`internal/runtime/podman.go`**: Implements `ContainerRuntime`. Used by `internal/agent/runtime.go` to manage containers based on Podman.
* **`internal/agent/agent.go`** and **`internal/leader/leader.go`**: Contain the core state machines and logic for the respective roles. The `kat-agent` binary decides which role's logic to activate based on leader election status.
* **`internal/pki/ca.go`**: Used by `kat-agent init` to create the CA, and by the Leader to sign CSRs from joining agents.
* **`internal/network/wireguard.go`**: Used by agents to configure their local WireGuard interface based on data synced from etcd (managed by the Leader).
* **`internal/leader/api_handler.go`**: Implements the HTTP handlers for the API, using other leader components (scheduler, IPAM, store) to fulfill requests.

docs/plan/overview.md (new file, 183 lines)

# Implementation Plan

This plan breaks down the implementation into manageable phases, each with a testable milestone.

**Phase 0: Project Setup & Core Types**

* **Goal**: Basic project structure, version control, build system, and core data type definitions.
* **Tasks**:
    1. Initialize Git repository, `go.mod`.
    2. Create initial directory structure (as above).
    3. Define core Proto3 messages in `api/v1alpha1/kat.proto` for: `Workload`, `VirtualLoadBalancer`, `JobDefinition`, `BuildDefinition`, `Namespace`, `Node` (internal representation), `ClusterConfiguration`.
    4. Set up `scripts/gen-proto.sh` and generate initial Go types.
    5. Implement parsing and basic validation for `cluster.kat` (`internal/config/parse.go`).
    6. Implement parsing and basic validation for Quadlet files (`workload.kat`, etc.) and their `tar.gz` packaging/unpackaging (see the sketch after this phase).
* **Milestone**:
    * `make generate` successfully creates Go types from protos.
    * Unit tests pass for parsing `cluster.kat` and a sample Quadlet directory (as `tar.gz`) into their respective Go structs.

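As a companion to task 6, here is a minimal sketch of unpacking a Quadlet `tar.gz` archive with only the standard library; the function name, package placement, and the choice to key files by their archive path are illustrative, not RFC requirements.

```go
package utils

import (
    "archive/tar"
    "bytes"
    "compress/gzip"
    "fmt"
    "io"
    "strings"
)

// UnpackQuadletArchive reads a tar.gz stream and returns the contained
// *.kat files keyed by their path inside the archive.
func UnpackQuadletArchive(r io.Reader) (map[string][]byte, error) {
    gz, err := gzip.NewReader(r)
    if err != nil {
        return nil, fmt.Errorf("not a gzip stream: %w", err)
    }
    defer gz.Close()

    files := make(map[string][]byte)
    tr := tar.NewReader(gz)
    for {
        hdr, err := tr.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            return nil, fmt.Errorf("reading tar: %w", err)
        }
        if hdr.Typeflag != tar.TypeReg || !strings.HasSuffix(hdr.Name, ".kat") {
            continue // skip directories and non-Quadlet entries
        }
        var buf bytes.Buffer
        if _, err := io.Copy(&buf, tr); err != nil {
            return nil, fmt.Errorf("reading %s: %w", hdr.Name, err)
        }
        files[hdr.Name] = buf.Bytes()
    }
    return files, nil
}
```
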
**Phase 1: State Management & Leader Election**

* **Goal**: A functional embedded etcd and leader election mechanism.
* **Tasks**:
    1. Implement the `StateStore` interface (RFC 5.1) with an etcd backend (`internal/store/etcd.go`).
    2. Integrate embedded etcd server into `kat-agent` (RFC 2.2, 5.2), configurable via `cluster.kat` parameters.
    3. Implement leader election using `go.etcd.io/etcd/client/v3/concurrency` (RFC 5.3).
    4. Basic `kat-agent init` functionality:
        * Parse `cluster.kat`.
        * Start single-node embedded etcd.
        * Campaign for and become leader.
        * Store initial cluster configuration (UID, CIDRs from `cluster.kat`) in etcd.
* **Milestone**:
    * A single `kat-agent init --config cluster.kat` process starts, initializes etcd, and logs that it has become the leader.
    * The cluster configuration from `cluster.kat` can be verified in etcd using an etcd client.
    * `StateStore` interface methods (`Put`, `Get`, `Delete`, `List`) are testable against the embedded etcd.

**Phase 2: Basic Agent & Node Lifecycle (Init, Join, PKI)**

* **Goal**: Initial Leader setup, a second Agent joining with mTLS, and heartbeating.
* **Tasks**:
    1. Implement Internal PKI (RFC 10.6) in `internal/pki/`:
        * CA key/cert generation on `kat-agent init`.
        * CSR generation by agent on join.
        * CSR signing by Leader.
    2. Implement initial Node Communication Protocol (RFC 2.3) for join:
        * Agent (`kat-agent join --leader-api <...> --advertise-address <...>`) sends CSR to Leader.
        * Leader validates, signs, returns certs & CA. Stores node registration (name, UID, advertise addr, WG pubkey placeholder) in etcd.
    3. Implement basic mTLS for this join communication.
    4. Implement Node Heartbeat (`POST /v1alpha1/nodes/{nodeName}/status`) from Agent to Leader (RFC 4.1.3). Leader updates node status in etcd.
    5. Leader implements basic failure detection (marks Node `NotReady` in etcd if heartbeats cease) (RFC 4.1.4).
* **Milestone**:
    * `kat-agent init` establishes a Leader with a CA.
    * `kat-agent join` allows a second agent to securely register with the Leader, obtain certificates, and store its info in etcd.
    * Leader's API receives heartbeats from the joined Agent.
    * If a joined Agent is stopped, the Leader marks its status as `NotReady` in etcd after `nodeLossTimeoutSeconds`.

**Phase 3: Container Runtime Interface & Local Podman Management**

* **Goal**: Agent can manage containers locally via Podman using the CRI.
* **Tasks**:
    1. Define `ContainerRuntime` interface in `internal/runtime/interface.go` (RFC 6.1).
    2. Implement the Podman backend for `ContainerRuntime` in `internal/runtime/podman.go` (RFC 6.2). Focus on: `CreateContainer`, `StartContainer`, `StopContainer`, `RemoveContainer`, `GetContainerStatus`, `PullImage`, `StreamContainerLogs`.
    3. Implement rootless execution strategy (RFC 6.3):
        * Mechanism to ensure dedicated user accounts (initially, assume pre-existing or manual creation for tests).
        * Podman systemd unit generation (`podman generate systemd`).
        * Managing units via `systemctl --user`.
* **Milestone**:
    * Agent process (upon a mocked internal command) can pull a specified image (e.g., `nginx`) and run it rootlessly using Podman and systemd user services.
    * Agent can stop, remove, and get the status/logs of this container.
    * All operations are performed via the `ContainerRuntime` interface.

**Phase 4: Basic Workload Deployment (Single Node, Image Source Only, No Networking)**

* **Goal**: Leader can instruct an Agent to run a simple `Service` workload (single container, image source) on itself (if the leader is also an agent) or on a single joined agent.
* **Tasks**:
    1. Implement basic API endpoints on Leader for Workload CRUD (`POST/PUT /v1alpha1/n/{ns}/workloads` accepting `tar.gz`) (RFC 8.3, 4.2). Leader stores Quadlet files in etcd.
    2. Simplistic scheduling (RFC 4.4): if only one agent node exists, assign the workload to it. Leader creates an "assignment" or "task" for the agent in etcd.
    3. Agent watches for assigned tasks from etcd.
    4. On receiving a task, Agent uses `ContainerRuntime` to deploy the container (image from `workload.kat`).
    5. Agent reports container instance status in its heartbeat. Leader updates overall workload status in etcd.
    6. Basic `katcall apply -f <dir>` and `katcall get workload <name>` functionality.
* **Milestone**:
    * User can deploy a simple single-container `Service` (e.g., `nginx`) using `katcall apply`.
    * The container runs on the designated Agent node.
    * `katcall get workload my-service` shows its status as running.
    * `katcall logs <instanceID>` streams container logs.

**Phase 5: Overlay Networking (WireGuard) & IPAM**

* **Goal**: Nodes establish a WireGuard overlay network. Leader allocates IPs for containers.
* **Tasks**:
    1. Implement WireGuard setup on Agents (`internal/network/wireguard.go`) (RFC 7.1):
        * Key generation, public key reporting to Leader during join/heartbeat.
        * Leader stores Node WireGuard public keys and advertise endpoints in etcd.
        * Agent configures its `kat0` interface and peers by watching etcd.
    2. Implement IPAM in Leader (`internal/leader/ipam.go`) (RFC 7.2):
        * Node subnet allocation from `clusterCIDR` (from `cluster.kat`).
        * Container IP allocation from the node's subnet when a workload instance is scheduled.
    3. Agent uses the Leader-assigned IP when creating the container network/container with Podman.
* **Milestone**:
    * All joined KAT nodes form a WireGuard mesh; `wg show` on nodes confirms peer connections.
    * Leader allocates a unique overlay IP for each container instance.
    * Containers on different nodes can ping each other using their overlay IPs.

**Phase 6: Distributed Agent DNS & Service Discovery**

* **Goal**: Basic service discovery using agent-local DNS for deployed services.
* **Tasks**:
    1. Implement Agent-local DNS server (`internal/agent/dns_resolver.go`) using `miekg/dns` (RFC 7.3) (see the sketch after this phase).
    2. Leader writes DNS `A` records to etcd (e.g., `<workloadName>.<namespace>.<clusterDomain> -> <containerOverlayIP>`) when service instances become healthy/active.
    3. Agent DNS server watches etcd for DNS records and updates its local zones.
    4. Agent configures `/etc/resolv.conf` in managed containers to use its `kat0` IP as nameserver.
* **Milestone**:
    * A service (`service-a`) deployed on one node can be resolved by its DNS name (e.g., `service-a.default.kat.cluster.local`) by a container on another node.
    * DNS resolution provides the correct overlay IP(s) of `service-a` instances.

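A minimal sketch of the agent-local DNS responder from Phase 6, task 1, using `github.com/miekg/dns`. The in-memory `records` map stands in for the etcd-backed zone data the agent would actually watch, and the addresses, domain, and TTL are placeholders.

```go
package main

import (
    "log"
    "net"
    "sync"

    "github.com/miekg/dns"
)

// records maps FQDNs (with trailing dot) to overlay IPs; the real agent would
// keep this in sync with the DNS records the Leader writes to etcd.
var (
    mu      sync.RWMutex
    records = map[string]string{"service-a.default.kat.cluster.local.": "10.100.1.5"}
)

func handle(w dns.ResponseWriter, req *dns.Msg) {
    resp := new(dns.Msg)
    resp.SetReply(req)
    for _, q := range req.Question {
        if q.Qtype != dns.TypeA {
            continue
        }
        mu.RLock()
        ip, ok := records[q.Name]
        mu.RUnlock()
        if !ok {
            continue
        }
        resp.Answer = append(resp.Answer, &dns.A{
            Hdr: dns.RR_Header{Name: q.Name, Rrtype: dns.TypeA, Class: dns.ClassINET, Ttl: 30},
            A:   net.ParseIP(ip),
        })
    }
    w.WriteMsg(resp)
}

func main() {
    dns.HandleFunc("kat.cluster.local.", handle) // answer only for the cluster domain
    srv := &dns.Server{Addr: "10.100.0.1:53", Net: "udp"} // kat0 IP is a placeholder
    log.Fatal(srv.ListenAndServe())
}
```
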
**Phase 7: Advanced Workload Features & Full Scheduling**

* **Goal**: Implement `Job`, `DaemonService`, richer scheduling, health checks, volumes, and restart policies.
* **Tasks**:
    1. Implement `Job` type (RFC 3.4, 4.8): scheduling, completion tracking, backoff.
    2. Implement `DaemonService` type (RFC 3.2): ensures one instance per eligible node.
    3. Implement full scheduling logic in Leader (RFC 4.4): resource requests (`cpu`, `memory`), `nodeSelector`, Taint/Toleration, GPU (basic), "most empty" scoring (see the sketch after this phase).
    4. Implement `VirtualLoadBalancer.kat` parsing and Agent-side health checks (RFC 3.3, 4.6.3). Leader uses health status for service readiness and DNS.
    5. Implement container `restartPolicy` (RFC 3.2, 4.6.4) via systemd unit configuration.
    6. Implement `volumeMounts` and `volumes` (RFC 3.2, 4.7): `HostMount`, `SimpleClusterStorage`. Agent ensures paths are set up.
* **Milestone**:
    * `Job`s run to completion and their status is tracked.
    * `DaemonService`s run one instance on all eligible nodes.
    * Services are scheduled according to resource requests, selectors, and taints.
    * Unhealthy service instances are identified by health checks and reflected in status.
    * Containers restart based on their policy.
    * Workloads can mount host paths and simple cluster storage.

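The "most empty" scoring from Phase 7, task 3 could look roughly like the following; the `nodeInfo` fields and the equal weighting of CPU and memory headroom are assumptions of this sketch, since the RFC's exact formula is not reproduced here.

```go
package scheduler

// nodeInfo is an illustrative summary of a node's capacity and what is
// already requested on it (units: millicores and bytes).
type nodeInfo struct {
    Name                      string
    CPUCapacity, CPURequested int64
    MemCapacity, MemRequested int64
}

// score returns a value in [0,1]; higher means "more empty", so the scheduler
// prefers the highest-scoring node among those that passed filtering.
func score(n nodeInfo) float64 {
    if n.CPUCapacity == 0 || n.MemCapacity == 0 {
        return 0
    }
    cpuFree := float64(n.CPUCapacity-n.CPURequested) / float64(n.CPUCapacity)
    memFree := float64(n.MemCapacity-n.MemRequested) / float64(n.MemCapacity)
    return (cpuFree + memFree) / 2 // equal weighting is an assumption
}

// pickNode returns the eligible node with the most free capacity.
func pickNode(eligible []nodeInfo) (best nodeInfo, ok bool) {
    bestScore := -1.0
    for _, n := range eligible {
        if s := score(n); s > bestScore {
            bestScore, best, ok = s, n, true
        }
    }
    return best, ok
}
```
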
**Phase 8: Git-Native Builds & Workload Updates/Rollbacks**

* **Goal**: Enable on-agent builds from Git sources and implement workload update strategies.
* **Tasks**:
    1. Implement `BuildDefinition.kat` parsing (RFC 3.5).
    2. Implement Git-native build process on Agent (`internal/agent/build.go`) using Podman (RFC 4.3).
    3. Implement `cacheImage` pull/push for build caching (Agent needs registry credentials configured locally).
    4. Implement workload update strategies in Leader (RFC 4.5): `Simultaneous`, `Rolling` (with `maxSurge`).
    5. Implement manual rollback mechanism (`katcall rollback workload <name>`) (RFC 4.5).
* **Milestone**:
    * A workload can be successfully deployed from a Git repository source, with the image built on the agent.
    * A deployed service can be updated using the `Rolling` strategy with observable incremental instance replacement.
    * A workload can be rolled back to its previous version.

**Phase 9: Full API Implementation & CLI (`katcall`) Polish**

* **Goal**: A robust and comprehensive HTTP API and `katcall` CLI.
* **Tasks**:
    1. Implement all remaining API endpoints and features as per RFC Section 8. Ensure Proto3/JSON contracts are met.
    2. Implement API authentication: bearer token for `katcall` (RFC 8.1, 10.1).
    3. Flesh out `katcall` with all necessary commands and options (RFC 1.5 Terminology - katcall, RFC 8.3 hints):
        * `drain <nodeName>`, `get nodes/namespaces`, `describe <resource>`, etc.
    4. Improve error reporting and user feedback in CLI and API.
* **Milestone**:
    * All functionalities defined in the RFC can be managed and introspected via the `katcall` CLI interacting with the secure KAT API.
    * API documentation (e.g., Swagger/OpenAPI generated from protos or code) is available.

**Phase 10: Observability, Backup/Restore, Advanced Features & Security**

* **Goal**: Implement observability features, state backup/restore, and other advanced functionalities.
* **Tasks**:
    1. Implement Agent & Leader logging to systemd journal/files; the API for streaming container logs was already delivered in the Phase 4 milestone (RFC 9.1).
    2. Implement basic Metrics exposure (`/metrics` JSON endpoint on Leader/Agent) (RFC 9.2).
    3. Implement Events system: Leader records significant events in etcd, with an API to query them (RFC 9.3).
    4. Implement Leader-driven etcd state backup (`etcdctl snapshot save`) (RFC 5.4).
    5. Document and test the etcd state restore procedure (RFC 5.5).
    6. Implement Detached Node Operation and Rejoin (RFC 4.9).
    7. Provide standard Quadlet files and documentation for the Traefik Ingress recipe (RFC 7.4).
    8. Review and harden security aspects: API security, build security, network security, secrets handling (document current limitations as per RFC 10.5).
* **Milestone**:
    * Container logs are streamable via `katcall logs`. Agent/Leader logs are accessible.
    * Basic metrics are available via API. Cluster events can be listed.
    * Automated etcd backups are created by the Leader. Restore procedure is tested.
    * Detached node can operate locally and rejoin the main cluster.
    * Traefik can be deployed using provided Quadlets to achieve ingress.

**Phase 11: Testing, Documentation, and Release Preparation**

* **Goal**: Ensure KAT v1.0 is robust, well-documented, and ready for release.
* **Tasks**:
    1. Write comprehensive unit tests for all core logic.
    2. Develop integration tests for component interactions (e.g., Leader-Agent, Agent-Podman).
    3. Create an E2E test suite using `katcall` to simulate real user scenarios.
    4. Write detailed user documentation: installation, configuration, tutorials for all features, troubleshooting.
    5. Perform performance testing on key operations (e.g., deployment speed, agent density).
    6. Conduct a thorough security review/audit against RFC security considerations.
    7. Establish a release process: versioning, changelog, building release artifacts.
* **Milestone**:
    * High test coverage.
    * Comprehensive user and API documentation is complete.
    * Known critical bugs are fixed.
    * KAT v1.0 is packaged and ready for its first official release.

docs/plan/phase1.md (new file, 81 lines)

# **Phase 1: State Management & Leader Election**

* **Goal**: Establish the foundational state layer using embedded etcd and implement a reliable leader election mechanism. A single `kat-agent` can initialize a cluster, become its leader, and store initial configuration.
* **RFC Sections Primarily Used**: 2.2 (Embedded etcd), 3.9 (ClusterConfiguration), 5.1 (State Store Interface), 5.2 (etcd Implementation Details), 5.3 (Leader Election).

**Tasks & Sub-Tasks:**

1. **Define `StateStore` Go Interface (`internal/store/interface.go`)**
    * **Purpose**: Create the abstraction layer for all state operations, decoupling the rest of the system from direct etcd dependencies.
    * **Details**: Transcribe the Go interface from RFC 5.1 verbatim. Include `KV`, `WatchEvent`, `EventType`, `Compare`, `Op`, `OpType` structs/constants. (An illustrative approximation of the interface's shape follows this task.)
    * **Verification**: Code compiles. Interface definition matches RFC.

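The RFC 5.1 definition is authoritative and should be transcribed verbatim; the sketch below is only an approximation of its shape, inferred from the method and type names referenced later in this phase. Every field name and signature here is an assumption.

```go
package store

import "context"

// KV is a stored key/value pair plus revision metadata (illustrative shape only).
type KV struct {
    Key     string
    Value   []byte
    Version int64 // e.g. etcd ModRevision
}

type EventType string

const (
    EventTypePut    EventType = "PUT"
    EventTypeDelete EventType = "DELETE"
)

type WatchEvent struct {
    Type EventType
    KV   KV
}

// Compare and Op mirror etcd-style transaction primitives (illustrative).
type Compare struct {
    Key             string
    ExpectedVersion int64
}

type OpType string

const (
    OpPut    OpType = "PUT"
    OpDelete OpType = "DELETE"
)

type Op struct {
    Type  OpType
    Key   string
    Value []byte
}

// StateStore abstracts all persistent cluster state operations so the rest of
// the system never talks to etcd directly.
type StateStore interface {
    Put(ctx context.Context, key string, value []byte) error
    Get(ctx context.Context, key string) (*KV, error)
    Delete(ctx context.Context, key string) error
    List(ctx context.Context, prefix string) ([]KV, error)
    Watch(ctx context.Context, keyOrPrefix string, startRevision int64) (<-chan WatchEvent, error)
    Close() error

    // Leader election helpers.
    Campaign(ctx context.Context, leaderID string, leaseTTLSeconds int64) (context.Context, error)
    Resign(ctx context.Context) error
    GetLeader(ctx context.Context) (string, error)

    // DoTransaction applies ops atomically, guarded by the given comparisons.
    DoTransaction(ctx context.Context, checks []Compare, onSuccess, onFailure []Op) (bool, error)
}
```
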
2. **Implement Embedded etcd Server Logic (`internal/store/etcd.go`)**
    * **Purpose**: Allow `kat-agent` to run its own etcd instance for single-node clusters or as part of a multi-node quorum.
    * **Details**:
        * Use `go.etcd.io/etcd/server/v3/embed`.
        * Function to start an embedded etcd server (see the sketch after this task):
            * Input: configuration parameters (data directory, peer URLs, client URLs, name). These will come from `cluster.kat` or defaults.
            * Output: a running `embed.Etcd` instance or an error.
        * Graceful shutdown logic for the embedded etcd server.
    * **Verification**: A test can start and stop an embedded etcd server. Data directory is created and used.

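A minimal sketch of starting the embedded etcd server with `go.etcd.io/etcd/server/v3/embed`. The function name is illustrative; it leaves the listen URLs at the package defaults to stay small, whereas the real implementation would apply the peer/client URLs and other parameters from `cluster.kat`.

```go
package store

import (
    "fmt"
    "time"

    "go.etcd.io/etcd/server/v3/embed"
)

// StartEmbeddedEtcd starts a single-member embedded etcd and blocks until it
// is ready (or times out). The caller owns the returned instance and must
// call Close() on shutdown.
func StartEmbeddedEtcd(name, dataDir string) (*embed.Etcd, error) {
    cfg := embed.NewConfig()
    cfg.Name = name
    cfg.Dir = dataDir // etcd data directory, e.g. /var/lib/kat/etcd
    // Peer/client listen and advertise URLs from cluster.kat would be set on
    // cfg here as well; omitted in this sketch.

    e, err := embed.StartEtcd(cfg)
    if err != nil {
        return nil, err
    }
    select {
    case <-e.Server.ReadyNotify():
        return e, nil // ready to serve client requests
    case <-time.After(60 * time.Second):
        e.Close()
        return nil, fmt.Errorf("embedded etcd did not become ready in time")
    }
}
```
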
3. **Implement `StateStore` with etcd Backend (`internal/store/etcd.go`)**
    * **Purpose**: Provide the concrete implementation for interacting with an etcd cluster (embedded or external).
    * **Details**:
        * Create a struct that implements the `StateStore` interface and holds an `etcd/clientv3.Client`.
        * Implement `Put(ctx, key, value)`: Use `client.Put()`.
        * Implement `Get(ctx, key)`: Use `client.Get()`. Handle key-not-found. Populate `KV.Version` with `ModRevision`.
        * Implement `Delete(ctx, key)`: Use `client.Delete()`.
        * Implement `List(ctx, prefix)`: Use `client.Get()` with `clientv3.WithPrefix()`.
        * Implement `Watch(ctx, keyOrPrefix, startRevision)`: Use `client.Watch()`. Translate etcd events to `WatchEvent`.
        * Implement `Close()`: Close the `clientv3.Client`.
        * Implement `Campaign(ctx, leaderID, leaseTTLSeconds)` (see the sketch after this task):
            * Use `concurrency.NewSession()` to create a lease.
            * Use `concurrency.NewElection()` and `election.Campaign()`.
            * Return a context that is cancelled when leadership is lost (e.g., by watching the campaign context or session done channel).
        * Implement `Resign(ctx)`: Use `election.Resign()`.
        * Implement `GetLeader(ctx)`: Observe the election or query the leader key.
        * Implement `DoTransaction(ctx, checks, onSuccess, onFailure)`: Use `client.Txn()` with `clientv3.Compare` and `clientv3.Op`.
    * **Potential Challenges**: Correctly handling etcd transaction semantics, context propagation, and error translation. Efficiently managing watches.
    * **Verification**:
        * Unit tests for each `StateStore` method using a real embedded etcd instance (test-scoped).
        * Verify `Put` then `Get` retrieves the correct value and version.
        * Verify `List` with prefix.
        * Verify `Delete` removes the key.
        * Verify `Watch` receives correct events for puts/deletes.
        * Verify `DoTransaction` commits on success and rolls back on failure.

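A sketch of the `Campaign`/`Resign` pair built on `go.etcd.io/etcd/client/v3/concurrency`. The election key prefix `/kat/leader` and the way leadership loss is signalled (cancelling a derived context when the session ends) are assumptions of this sketch.

```go
package store

import (
    "context"

    clientv3 "go.etcd.io/etcd/client/v3"
    "go.etcd.io/etcd/client/v3/concurrency"
)

type etcdStore struct {
    client   *clientv3.Client
    session  *concurrency.Session
    election *concurrency.Election
}

// Campaign blocks until this node becomes leader, then returns a context that
// is cancelled when leadership (the underlying lease/session) is lost.
func (s *etcdStore) Campaign(ctx context.Context, leaderID string, leaseTTLSeconds int64) (context.Context, error) {
    session, err := concurrency.NewSession(s.client, concurrency.WithTTL(int(leaseTTLSeconds)))
    if err != nil {
        return nil, err
    }
    election := concurrency.NewElection(session, "/kat/leader")
    if err := election.Campaign(ctx, leaderID); err != nil {
        session.Close()
        return nil, err
    }
    s.session, s.election = session, election

    leadershipCtx, cancel := context.WithCancel(ctx)
    go func() {
        <-session.Done() // lease expired or session closed: no longer leader
        cancel()
    }()
    return leadershipCtx, nil
}

func (s *etcdStore) Resign(ctx context.Context) error {
    if s.election == nil {
        return nil
    }
    if err := s.election.Resign(ctx); err != nil {
        return err
    }
    return s.session.Close()
}
```
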
4. **Integrate Leader Election into `kat-agent` (`cmd/kat-agent/main.go`, possibly a new `internal/leader/election.go`)**
    * **Purpose**: Enable an agent instance to attempt to become the cluster leader.
    * **Details**:
        * The `kat-agent` main function will initialize its `StateStore` client.
        * A dedicated goroutine will call `StateStore.Campaign()`.
        * The outcome of `Campaign` (e.g., leadership acquired, context for leadership duration) will determine whether the agent activates its Leader-specific logic (Phase 2+).
        * Leader ID could be `nodeName` or a UUID. Lease TTL comes from `cluster.kat`.
    * **Verification**:
        * Start one `kat-agent` with etcd enabled; it should log "became leader".
        * Start a second `kat-agent` configured to connect to the first's etcd; it should log "observing leader <leaderID>" or similar, but not become leader itself.
        * If the first agent (the leader) is stopped, the second agent should eventually log "became leader".

5. **Implement Basic `kat-agent init` Command (`cmd/kat-agent/main.go`, `internal/config/parse.go`)**
    * **Purpose**: Initialize a new KAT cluster (single node initially).
    * **Details**:
        * Define the `init` subcommand in `kat-agent` using a CLI library (e.g., `cobra`); a wiring sketch follows this task.
        * Flag: `--config <path_to_cluster.kat>`.
        * Parse `cluster.kat` (from Phase 0, now used to extract etcd peer/client URLs, data dir, backup paths, etc.).
        * Generate a persistent Cluster UID and store it in etcd (e.g., `/kat/config/cluster_uid`).
        * Store `cluster.kat` relevant parameters (or the whole sanitized config) into etcd (e.g., under `/kat/config/cluster_config`).
        * Start the embedded etcd server using parsed configurations.
        * Initiate leader election.
    * **Potential Challenges**: Ensuring `cluster.kat` parsing is robust. Handling existing data directories.
    * **Milestone Verification**:
        * Running `kat-agent init --config examples/cluster.kat` on a clean system:
            * Starts the `kat-agent` process.
            * Creates the etcd data directory.
            * Logs "Successfully initialized etcd".
            * Logs "Became leader: <nodeName>".
        * Using `etcdctl` (or a simple `StateStore.Get` test client):
            * Verify `/kat/config/cluster_uid` exists and has a UUID.
            * Verify `/kat/config/cluster_config` (or similar keys) contains data from `cluster.kat` (e.g., `clusterCIDR`, `serviceCIDR`, `agentPort`, `apiPort`).
            * Verify the leader election key exists for the current leader.

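An illustrative skeleton of the `init` subcommand wiring with `cobra`. The `--config` flag matches the plan above; `runInit` is a hypothetical placeholder for the parse/start-etcd/campaign sequence and does nothing real here.

```go
package main

import (
    "context"
    "fmt"
    "os"

    "github.com/spf13/cobra"
)

// runInit is a placeholder for the real sequence: parse cluster.kat, start the
// embedded etcd server, campaign for leadership, and store the cluster config.
func runInit(ctx context.Context, configPath string) error {
    fmt.Printf("initializing cluster from %s\n", configPath)
    return nil
}

func newInitCmd() *cobra.Command {
    var configPath string
    cmd := &cobra.Command{
        Use:   "init",
        Short: "Initialize a new KAT cluster on this node",
        RunE: func(cmd *cobra.Command, args []string) error {
            return runInit(cmd.Context(), configPath)
        },
    }
    cmd.Flags().StringVar(&configPath, "config", "cluster.kat", "path to cluster.kat")
    _ = cmd.MarkFlagRequired("config")
    return cmd
}

func main() {
    root := &cobra.Command{Use: "kat-agent"}
    root.AddCommand(newInitCmd())
    if err := root.Execute(); err != nil {
        os.Exit(1)
    }
}
```
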
docs/plan/phase2.md (new file, 98 lines)

# **Phase 2: Basic Agent & Node Lifecycle (Init, Join, PKI)**

* **Goal**: Implement the secure registration of a new agent node to an existing leader, including PKI for mTLS, and establish periodic heartbeating for status updates and failure detection.
* **RFC Sections Primarily Used**: 2.3 (Node Communication Protocol), 4.1.1 (Initial Leader Setup - CA), 4.1.2 (Agent Node Join - CSR), 10.1 (API Security - mTLS), 10.6 (Internal PKI), 4.1.3 (Node Heartbeat), 4.1.4 (Node Departure and Failure Detection - basic).

**Tasks & Sub-Tasks:**

1. **Implement Internal PKI Utilities (`internal/pki/ca.go`, `internal/pki/certs.go`)**
    * **Purpose**: Create and manage the Certificate Authority and sign certificates for mTLS.
    * **Details**:
        * `GenerateCA()`: Creates a new RSA key pair and a self-signed X.509 CA certificate. Saves to disk (e.g., `/var/lib/kat/pki/ca.key`, `/var/lib/kat/pki/ca.crt`). Path from `cluster.kat` `backupPath` parent dir, or a new `pkiPath`. (A condensed sketch follows this task.)
        * `GenerateCertificateRequest(commonName, keyOutPath, csrOutPath)`: Agent uses this. Generates RSA key, creates a CSR.
        * `SignCertificateRequest(caKeyPath, caCertPath, csrData, certOutPath, duration)`: Leader uses this. Loads CA key/cert, parses CSR, issues a signed certificate.
        * Helper functions to load keys and certs from disk.
    * **Potential Challenges**: Handling cryptographic operations correctly and securely. Permissions for key storage.
    * **Verification**: Unit tests for `GenerateCA`, `GenerateCertificateRequest`, `SignCertificateRequest`. Generated certs should be verifiable against the CA.

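A condensed sketch of `GenerateCA()` using only the standard library. The key size, validity period, subject, and file permissions are illustrative choices, not RFC requirements.

```go
package pki

import (
    "crypto/rand"
    "crypto/rsa"
    "crypto/x509"
    "crypto/x509/pkix"
    "encoding/pem"
    "math/big"
    "os"
    "time"
)

// GenerateCA creates a self-signed CA key pair and writes ca.key / ca.crt in
// PEM form. 2048-bit RSA and a 10-year validity are illustrative defaults.
func GenerateCA(keyOut, certOut string) error {
    key, err := rsa.GenerateKey(rand.Reader, 2048)
    if err != nil {
        return err
    }
    tmpl := &x509.Certificate{
        SerialNumber:          big.NewInt(1),
        Subject:               pkix.Name{CommonName: "kat-cluster-ca"},
        NotBefore:             time.Now(),
        NotAfter:              time.Now().AddDate(10, 0, 0),
        IsCA:                  true,
        KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
        BasicConstraintsValid: true,
    }
    der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
    if err != nil {
        return err
    }
    keyPEM := pem.EncodeToMemory(&pem.Block{Type: "RSA PRIVATE KEY", Bytes: x509.MarshalPKCS1PrivateKey(key)})
    certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
    if err := os.WriteFile(keyOut, keyPEM, 0o600); err != nil { // key readable by the agent only
        return err
    }
    return os.WriteFile(certOut, certPEM, 0o644)
}
```
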
2. **Leader: Initialize CA & Its Own mTLS Certs on `init` (`cmd/kat-agent/main.go`)**
    * **Purpose**: The first leader needs to establish the PKI and secure its own API endpoint.
    * **Details**:
        * During `kat-agent init`, after etcd is up and leadership is confirmed:
            * Call `pki.GenerateCA()` if CA files don't exist.
            * Generate its own server key and CSR (e.g., for `leader.kat.cluster.local`).
            * Sign its own CSR using the CA to get its server certificate.
            * Configure its (future) API HTTP server to use this server key/cert for TLS and require client certs (mTLS).
    * **Verification**: After `kat-agent init`, CA key/cert and leader's server key/cert exist in the configured PKI path.

3. **Implement Basic API Server with mTLS on Leader (`internal/api/server.go`, `internal/api/router.go`)**
    * **Purpose**: Provide the initial HTTP endpoints required for agent join, secured with mTLS.
    * **Details**:
        * Setup `http.Server` with `tls.Config` (see the sketch after this task):
            * `Certificates`: Leader's server key/cert.
            * `ClientAuth: tls.RequireAndVerifyClientCert`.
            * `ClientCAs`: Pool containing the cluster CA certificate.
        * Minimal router (e.g., `gorilla/mux` or `http.ServeMux`) for:
            * `POST /internal/v1alpha1/join`: Endpoint for agent to submit CSR. (Internal as it's part of bootstrap.)
    * **Verification**: An HTTPS client (e.g., `curl` with appropriate client certs) can connect to the leader's API port if it presents a cert signed by the cluster CA. Connection fails without a client cert or with a cert from a different CA.

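A sketch of the mTLS listener described above. File paths, the address, and the constructor name are placeholders; the join handler itself is not shown.

```go
package api

import (
    "crypto/tls"
    "crypto/x509"
    "net/http"
    "os"
)

// NewMTLSServer returns an HTTPS server that presents the leader's certificate
// and requires clients to present a certificate signed by the cluster CA.
func NewMTLSServer(addr, serverCert, serverKey, caCert string, mux *http.ServeMux) (*http.Server, error) {
    cert, err := tls.LoadX509KeyPair(serverCert, serverKey)
    if err != nil {
        return nil, err
    }
    caPEM, err := os.ReadFile(caCert)
    if err != nil {
        return nil, err
    }
    pool := x509.NewCertPool()
    pool.AppendCertsFromPEM(caPEM)

    return &http.Server{
        Addr:    addr, // the real apiPort comes from cluster.kat
        Handler: mux,  // e.g. mux.HandleFunc("/internal/v1alpha1/join", handleJoin)
        TLSConfig: &tls.Config{
            Certificates: []tls.Certificate{cert},
            ClientAuth:   tls.RequireAndVerifyClientCert,
            ClientCAs:    pool,
            MinVersion:   tls.VersionTLS12,
        },
    }, nil
}

// Usage note: start with srv.ListenAndServeTLS("", "") since the certificate
// is already supplied via TLSConfig.
```
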
4. **Agent: `join` Command & CSR Submission (`cmd/kat-agent/main.go`, `internal/cli/join.go` - or similar for agent logic)**
    * **Purpose**: Allow a new agent to request to join the cluster and obtain its mTLS credentials.
    * **Details**:
        * `kat-agent join` subcommand:
            * Flags: `--leader-api <ip:port>`, `--advertise-address <ip_or_interface_name>`, `--node-name <name>` (optional, leader can generate).
            * Generate its own key pair and CSR using `pki.GenerateCertificateRequest()`.
            * Make an HTTP POST to Leader's `/internal/v1alpha1/join` endpoint:
                * Payload: CSR data, advertise address, requested node name, initial WireGuard public key (placeholder for now).
            * Bootstrapping trust for this *initial* join needs care: the agent may need to obtain the leader's CA cert via an out-of-band mechanism or a `--leader-ca-cert` flag, or a token could be used for initial auth if mTLS is strictly enforced from the start. The RFC implies mTLS is mandatory, so the agent needs the CA cert to trust the leader, and the leader would have to accept the CSR based on something like a pre-shared token before the agent has its own signed cert. For simplicity in V1, the initial join POST might happen over HTTPS where the agent trusts the leader's self-signed cert (if the leader has one before the CA is used for client auth), or a pre-shared token authorizes the CSR signing. RFC 4.1.2 states the "Leader, upon validating the join request (V1 has no strong token validation, relies on network trust)" signs it; this needs clarification. *Assume network trust for now: the agent connects, sends its CSR, and the leader signs it.*
            * Receive signed certificate and CA certificate from Leader. Store them locally.
    * **Potential Challenges**: Securely bootstrapping trust for the very first communication to the leader to submit the CSR.
    * **Verification**: `kat-agent join` command:
        * Generates key/CSR.
        * Successfully POSTs CSR to leader.
        * Receives and saves its signed certificate and the CA certificate.

5. **Leader: CSR Signing & Node Registration (Handler for `/internal/v1alpha1/join`)**
    * **Purpose**: Validate joining agent, sign its CSR, and record its registration.
    * **Details**:
        * Handler for `/internal/v1alpha1/join`:
            * Parse CSR, advertise address, WG pubkey from request.
            * Validate (minimal for now).
            * Generate a unique Node Name if not provided. Assign a Node UID.
            * Sign the CSR using `pki.SignCertificateRequest()`.
            * Store Node registration data in etcd via `StateStore` (`/kat/nodes/registration/{nodeName}`: UID, advertise address, WG pubkey placeholder, join timestamp).
            * Return the signed agent certificate and the cluster CA certificate to the agent.
    * **Verification**:
        * After agent joins, its certificate is signed by the cluster CA.
        * Node registration data appears correctly in etcd under `/kat/nodes/registration/{nodeName}`.

6. **Agent: Establish mTLS Client for Subsequent Comms & Implement Heartbeating (`internal/agent/agent.go`)**
    * **Purpose**: Agent uses its new mTLS certs to communicate status to the Leader.
    * **Details**:
        * Agent configures its HTTP client to use its signed key/cert and the cluster CA cert for all future Leader communications.
        * Periodic Heartbeat (RFC 4.1.3) (see the sketch after this task):
            * Ticker (e.g., every `agentTickSeconds` from `cluster.kat`, default 15s).
            * On tick, gather basic node status (node name, timestamp, initial resource capacity stubs).
            * HTTP `POST` to Leader's `/v1alpha1/nodes/{nodeName}/status` endpoint using the mTLS-configured client.
    * **Verification**: Agent logs successful heartbeat POSTs.

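A sketch of the heartbeat loop from this task. The endpoint path and tick source follow the plan text; the payload fields, helper names, and file paths are assumptions.

```go
package agent

import (
    "bytes"
    "context"
    "crypto/tls"
    "crypto/x509"
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "os"
    "time"
)

// heartbeat is a minimal status document; the real payload would also carry
// resource capacity and per-instance container status.
type heartbeat struct {
    NodeName  string    `json:"nodeName"`
    Timestamp time.Time `json:"timestamp"`
}

// newMTLSClient builds an HTTP client that presents the agent's signed cert
// and trusts only the cluster CA.
func newMTLSClient(certFile, keyFile, caFile string) (*http.Client, error) {
    cert, err := tls.LoadX509KeyPair(certFile, keyFile)
    if err != nil {
        return nil, err
    }
    caPEM, err := os.ReadFile(caFile)
    if err != nil {
        return nil, err
    }
    pool := x509.NewCertPool()
    pool.AppendCertsFromPEM(caPEM)
    return &http.Client{
        Timeout: 10 * time.Second,
        Transport: &http.Transport{TLSClientConfig: &tls.Config{
            Certificates: []tls.Certificate{cert},
            RootCAs:      pool,
        }},
    }, nil
}

// runHeartbeat POSTs node status every tick until ctx is cancelled.
func runHeartbeat(ctx context.Context, client *http.Client, leaderAPI, nodeName string, tick time.Duration) {
    t := time.NewTicker(tick) // e.g. agentTickSeconds from cluster.kat
    defer t.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-t.C:
            body, _ := json.Marshal(heartbeat{NodeName: nodeName, Timestamp: time.Now()})
            url := fmt.Sprintf("%s/v1alpha1/nodes/%s/status", leaderAPI, nodeName)
            resp, err := client.Post(url, "application/json", bytes.NewReader(body))
            if err != nil {
                log.Printf("heartbeat failed: %v", err)
                continue
            }
            resp.Body.Close()
        }
    }
}
```
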
7. **Leader: Receive Heartbeats & Basic Failure Detection (Handler for `/v1alpha1/nodes/{nodeName}/status`, `internal/leader/leader.go`)**
    * **Purpose**: Leader tracks agent status and detects failures.
    * **Details**:
        * API endpoint `/v1alpha1/nodes/{nodeName}/status` (mTLS required):
            * Receives status update from agent.
            * Updates node's actual state in etcd (`/kat/nodes/status/{nodeName}/heartbeat`: timestamp, reported status). Could use an etcd lease for this key, renewed by agent heartbeats.
        * Failure Detection (RFC 4.1.4) (see the sketch after this task):
            * Leader has a reconciliation loop or periodic check.
            * Scans `/kat/nodes/status/` in etcd.
            * If a node's last heartbeat timestamp is older than `nodeLossTimeoutSeconds` (from `cluster.kat`), update its status in etcd to `NotReady` (e.g., `/kat/nodes/status/{nodeName}/condition: NotReady`).
    * **Potential Challenges**: Efficiently scanning for dead nodes without excessive etcd load.
    * **Milestone Verification**:
        * `kat-agent init` runs as Leader, CA created, its API is up with mTLS.
        * A second `kat-agent join ...` process successfully:
            * Generates CSR, gets it signed by Leader.
            * Saves its cert and CA cert.
            * Starts sending heartbeats to Leader using mTLS.
        * Leader logs receipt of heartbeats from the joined Agent.
        * Node status (last heartbeat time) is updated in etcd by the Leader.
        * If the joined Agent process is stopped, after `nodeLossTimeoutSeconds`, the Leader updates the node's status in etcd to `NotReady`. This can be verified using `etcdctl` or a `StateStore.Get` call.

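The failure-detection check could be a periodic scan along these lines. The etcd key layout follows the text above, while the local `statusLister` interface and JSON payload shape are stand-ins so the sketch stays self-contained.

```go
package leader

import (
    "context"
    "encoding/json"
    "log"
    "path"
    "time"
)

// statusKV and statusLister are minimal local stand-ins for the Phase 1
// StateStore types.
type statusKV struct {
    Key   string
    Value []byte
}

type statusLister interface {
    List(ctx context.Context, prefix string) ([]statusKV, error)
    Put(ctx context.Context, key string, value []byte) error
}

type heartbeatDoc struct {
    Timestamp time.Time `json:"timestamp"`
}

// checkNodeLiveness marks nodes NotReady when their last heartbeat is older
// than nodeLossTimeout; it is meant to run on a ticker in the Leader.
func checkNodeLiveness(ctx context.Context, store statusLister, nodeLossTimeout time.Duration) {
    kvs, err := store.List(ctx, "/kat/nodes/status/")
    if err != nil {
        log.Printf("listing node status: %v", err)
        return
    }
    for _, kv := range kvs {
        if path.Base(kv.Key) != "heartbeat" {
            continue
        }
        var hb heartbeatDoc
        if err := json.Unmarshal(kv.Value, &hb); err != nil {
            continue
        }
        if time.Since(hb.Timestamp) > nodeLossTimeout {
            nodeName := path.Base(path.Dir(kv.Key)) // /kat/nodes/status/{nodeName}/heartbeat
            conditionKey := path.Dir(kv.Key) + "/condition"
            if err := store.Put(ctx, conditionKey, []byte("NotReady")); err != nil {
                log.Printf("marking %s NotReady: %v", nodeName, err)
            }
        }
    }
}
```
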
docs/plan/phase3.md (new file, 102 lines)

# **Phase 3: Container Runtime Interface & Local Podman Management**

* **Goal**: Abstract container management operations behind a `ContainerRuntime` interface and implement it using the Podman CLI, enabling an agent to manage containers rootlessly based on (mocked) instructions.
* **RFC Sections Primarily Used**: 6.1 (Runtime Interface Definition), 6.2 (Default Implementation: Podman), 6.3 (Rootless Execution Strategy).

**Tasks & Sub-Tasks:**

1. **Define `ContainerRuntime` Go Interface (`internal/runtime/interface.go`)**
    * **Purpose**: Abstract all container operations (build, pull, run, stop, inspect, logs, etc.).
    * **Details**: Transcribe the Go interface from RFC 6.1 precisely. Include all specified structs (`ImageSummary`, `ContainerStatus`, `BuildOptions`, `PortMapping`, `VolumeMount`, `ResourceSpec`, `ContainerCreateOptions`, `ContainerHealthCheck`) and enums (`ContainerState`, `HealthState`).
    * **Verification**: Code compiles. Interface and type definitions match RFC.

2. **Implement Podman Backend for `ContainerRuntime` (`internal/runtime/podman.go`) - Core Lifecycle Methods**
    * **Purpose**: Translate `ContainerRuntime` calls into `podman` CLI commands.
    * **Details (for each method, focus on these first):**
        * `PullImage(ctx, imageName, platform)`:
            * Cmd: `podman pull {imageName}` (add `--platform` if specified).
            * Parse output to get image ID (e.g., from `podman inspect {imageName} --format '{{.Id}}'`).
        * `CreateContainer(ctx, opts ContainerCreateOptions)`:
            * Cmd: `podman create ...`
            * Translate `ContainerCreateOptions` into `podman create` flags:
                * `--name {opts.InstanceID}` (KAT's unique ID for the instance).
                * `--hostname {opts.Hostname}`.
                * `--env` for `opts.Env`.
                * `--label` for `opts.Labels` (include KAT ownership labels like `kat.dws.rip/workload-name`, `kat.dws.rip/namespace`, `kat.dws.rip/instance-id`).
                * `--restart {opts.RestartPolicy}` (map to Podman's "no", "on-failure", "always").
                * Resource mapping: `--cpus` (for quota), `--cpu-shares`, `--memory`.
                * `--publish` for `opts.Ports`.
                * `--volume` for `opts.Volumes` (source will be host path, destination is container path).
                * `--network {opts.NetworkName}` and `--ip {opts.IPAddress}` if specified.
                * `--user {opts.User}`.
                * `--cap-add`, `--cap-drop`, `--security-opt`.
                * Podman native healthcheck flags from `opts.HealthCheck`.
                * `--systemd={opts.Systemd}`.
            * Parse output for container ID.
        * `StartContainer(ctx, containerID)`: Cmd: `podman start {containerID}`.
        * `StopContainer(ctx, containerID, timeoutSeconds)`: Cmd: `podman stop -t {timeoutSeconds} {containerID}`.
        * `RemoveContainer(ctx, containerID, force, removeVolumes)`: Cmd: `podman rm {containerID}` (add `--force`, `--volumes`).
        * `GetContainerStatus(ctx, containerOrName)`:
            * Cmd: `podman inspect {containerOrName}`.
            * Parse JSON output to populate the `ContainerStatus` struct (State, ExitCode, StartedAt, FinishedAt, Health, ImageID, ImageName, OverlayIP if available from inspect).
            * Podman health status needs to be mapped to `HealthState`.
        * `StreamContainerLogs(ctx, containerID, follow, since, stdout, stderr)`:
            * Cmd: `podman logs {containerID}` (add `--follow`, `--since`).
            * Stream `os/exec.Cmd.Stdout` and `os/exec.Cmd.Stderr` to the provided `io.Writer`s.
        * **Helper**: A utility function to run `podman` commands as a specific rootless user (see Rootless Execution below and the sketch after this task).
    * **Potential Challenges**: Correctly mapping all `ContainerCreateOptions` to Podman flags. Parsing varied `podman inspect` output. Managing `os/exec` for logs. Robust error handling from CLI output.
    * **Verification**:
        * Unit tests for each implemented method, mocking `os/exec` calls to verify command construction and output parsing.
        * *Requires Podman installed for integration-style unit tests*: Tests that actually execute `podman` commands (e.g., pull alpine, create, start, inspect, stop, rm) and verify state changes.

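A sketch of the CLI-wrapping helper plus two lifecycle methods via `os/exec`. Using `sudo -u` to run as the workload user is just one of the options discussed in the next task, and the type and method shapes are simplified relative to the full `ContainerRuntime` interface.

```go
package runtime

import (
    "bytes"
    "context"
    "fmt"
    "os/exec"
    "strings"
)

// podmanRuntime shells out to the podman CLI; runAsUser is the dedicated
// unprivileged account that owns the workload's containers.
type podmanRuntime struct {
    runAsUser string
}

// run executes podman as the workload user and returns trimmed stdout.
func (p *podmanRuntime) run(ctx context.Context, args ...string) (string, error) {
    // sudo -u is one possible mechanism; see the rootless execution task below.
    full := append([]string{"-u", p.runAsUser, "podman"}, args...)
    cmd := exec.CommandContext(ctx, "sudo", full...)
    var stdout, stderr bytes.Buffer
    cmd.Stdout, cmd.Stderr = &stdout, &stderr
    if err := cmd.Run(); err != nil {
        return "", fmt.Errorf("podman %s: %w: %s", strings.Join(args, " "), err, stderr.String())
    }
    return strings.TrimSpace(stdout.String()), nil
}

// PullImage pulls the image and returns its ID via podman inspect.
func (p *podmanRuntime) PullImage(ctx context.Context, imageName string) (string, error) {
    if _, err := p.run(ctx, "pull", imageName); err != nil {
        return "", err
    }
    return p.run(ctx, "image", "inspect", imageName, "--format", "{{.Id}}")
}

// StartContainer starts a previously created container by ID or name.
func (p *podmanRuntime) StartContainer(ctx context.Context, containerID string) error {
    _, err := p.run(ctx, "start", containerID)
    return err
}
```
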
3. **Implement Rootless Execution Strategy (`internal/runtime/podman.go` helpers, `internal/agent/runtime.go`)**
    * **Purpose**: Ensure containers are run by unprivileged users using systemd for supervision.
    * **Details**:
        * **User Assumption**: For Phase 3, *assume* the dedicated user (e.g., `kat_wl_mywebapp`) already exists on the system and `loginctl enable-linger <username>` has been run manually. The username could be passed in `ContainerCreateOptions.User` or derived.
        * **Podman Command Execution Context**:
            * The `kat-agent` process itself might run as root or a privileged user.
            * When executing `podman` commands for a workload, it MUST run them as the target unprivileged user.
            * This can be achieved with `sudo -u {username} podman ...`, more directly via `nsenter`/`setuid` if the agent has the capabilities, or by setting `XDG_RUNTIME_DIR` and `DBUS_SESSION_BUS_ADDRESS` appropriately for the target user when invoking `podman` via the systemd user session D-Bus API. *The simplest option for now is probably `sudo -u {username} podman ...` if the agent runs as root, or ensuring the agent itself runs as a user that can switch to the `kat_wl_*` users.*
            * The RFC prefers "systemd user sessions", which usually means `systemctl --user ...`. To control another user's systemd session, the agent process (if root) can use `machinectl shell {username}@.host /bin/bash -c "systemctl --user ..."` or `systemd-run --user --machine={username}@.host ...`. If the agent is not root, it cannot directly control other users' systemd sessions. *This is a critical design point: how does the agent (potentially root) interact with user-level systemd?*
            * RFC: "Agent uses `systemctl --user --machine={username}@.host ...`". This implies the agent has permission to do this (likely running as root or with specific polkit rules). A sketch of this helper follows this task.
        * **Systemd Unit Generation & Management**:
            * After `podman create ...` (or instead of a direct create, if `podman generate systemd` is used to create the definition), generate the systemd unit:
              `podman generate systemd --new --name --files --time 10 {opts.InstanceID}` (the container created in the previous step). This creates a `{opts.InstanceID}.service` unit file.
            * The `ContainerRuntime` implementation needs to:
                1. Execute `podman create` to establish the container definition (this allows Podman to manage its internal state for the container ID).
                2. Execute `podman generate systemd --name {containerID}` (using the ID from create) to get the unit file content.
                3. Place this unit file in the target user's systemd path (e.g., `/home/{username}/.config/systemd/user/{opts.InstanceID}.service`, or `/etc/systemd/user/{opts.InstanceID}.service` if the agent is root and wants to enable it for any user).
                4. Run `systemctl --user --machine={username}@.host daemon-reload`.
                5. Start/Enable: `systemctl --user --machine={username}@.host enable --now {opts.InstanceID}.service`.
            * To stop: `systemctl --user --machine={username}@.host stop {opts.InstanceID}.service`.
            * To remove: `systemctl --user --machine={username}@.host disable {opts.InstanceID}.service`, then `podman rm {opts.InstanceID}`, then remove the unit file.
            * Status: `systemctl --user --machine={username}@.host status {opts.InstanceID}.service` (parse output), or rely on `podman inspect`, which should reflect systemd-managed state.
    * **Potential Challenges**: Managing permissions for interacting with other users' systemd sessions. Correctly placing and cleaning up systemd unit files. Ensuring `XDG_RUNTIME_DIR` is set correctly for rootless Podman if not using systemd units for direct `podman run`. Systemd unit generation nuances.
    * **Verification**:
        * A test in `internal/agent/runtime_test.go` (or similar) can take mock `ContainerCreateOptions`.
        * It calls the (mocked or real) `ContainerRuntime` implementation.
        * Verify:
            * Podman commands are constructed to run as the target unprivileged user.
            * A systemd unit file is generated for the container.
            * `systemctl --user --machine...` commands are invoked correctly to manage the service.
            * The container is actually started (verify with `podman ps -a --filter label=kat.dws.rip/instance-id={instanceID}` as the target user).
            * Logs can be retrieved.
            * The container can be stopped and removed, including its systemd unit.

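A sketch of the systemd-side helpers, assuming the agent has the privileges discussed above to drive `systemctl --user --machine={username}@.host`. Unit file location and naming follow the text; ownership handling (chown to the target user) and failure cleanup are omitted.

```go
package runtime

import (
    "context"
    "fmt"
    "os"
    "os/exec"
    "path/filepath"
)

// userSystemctl runs `systemctl --user --machine={username}@.host <args...>`,
// which requires the agent to have the privileges discussed above.
func userSystemctl(ctx context.Context, username string, args ...string) error {
    full := append([]string{"--user", fmt.Sprintf("--machine=%s@.host", username)}, args...)
    out, err := exec.CommandContext(ctx, "systemctl", full...).CombinedOutput()
    if err != nil {
        return fmt.Errorf("systemctl %v: %w: %s", args, err, out)
    }
    return nil
}

// installAndStartUnit writes the generated unit file into the user's systemd
// directory, reloads the user manager, and enables/starts the service.
func installAndStartUnit(ctx context.Context, username, homeDir, instanceID, unitContent string) error {
    unitDir := filepath.Join(homeDir, ".config", "systemd", "user")
    if err := os.MkdirAll(unitDir, 0o755); err != nil {
        return err
    }
    unitPath := filepath.Join(unitDir, instanceID+".service")
    if err := os.WriteFile(unitPath, []byte(unitContent), 0o644); err != nil {
        return err
    }
    if err := userSystemctl(ctx, username, "daemon-reload"); err != nil {
        return err
    }
    return userSystemctl(ctx, username, "enable", "--now", instanceID+".service")
}
```
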
* **Milestone Verification**:
    * The `ContainerRuntime` Go interface is fully defined as per RFC 6.1.
    * The Podman implementation for core lifecycle methods (`PullImage`, `CreateContainer` (leading to systemd unit generation), `StartContainer` (via systemd enable/start), `StopContainer` (via systemd stop), `RemoveContainer` (via systemd disable + `podman rm` + unit file removal), `GetContainerStatus`, `StreamContainerLogs`) is functional.
    * An `internal/agent` test (or a temporary `main.go` test harness) can:
        1. Define `ContainerCreateOptions` for a simple image like `docker.io/library/alpine` with a command like `sleep 30`.
        2. Specify a (manually pre-created and linger-enabled) unprivileged username.
        3. Call the `ContainerRuntime` methods.
        4. **Result**:
            * The alpine image is pulled (if not present).
            * A systemd user service unit is generated and placed correctly for the specified user.
            * The service is started using `systemctl --user --machine...`.
            * `podman ps --all --filter label=kat.dws.rip/instance-id=...` (run as the target user, or by root seeing all containers) shows the container running or having run.
            * Logs can be retrieved using the `StreamContainerLogs` method.
            * The container can be stopped and removed (including its systemd unit file).
            * All container operations are verifiably performed by the specified unprivileged user.

This detailed plan should provide a clearer path for implementing these initial crucial phases. Remember to keep testing iterative and focused on the RFC specifications.

docs/rfc/RFC001-KAT.md (new file, 1014 lines; diff suppressed because it is too large)