kat/docs/plan/phase3.md
2025-05-09 19:15:50 -04:00

# **Phase 3: Container Runtime Interface & Local Podman Management**
* **Goal**: Abstract container management operations behind a `ContainerRuntime` interface and implement it using Podman CLI, enabling an agent to manage containers rootlessly based on (mocked) instructions.
* **RFC Sections Primarily Used**: 6.1 (Runtime Interface Definition), 6.2 (Default Implementation: Podman), 6.3 (Rootless Execution Strategy).
**Tasks & Sub-Tasks:**
1. **Define `ContainerRuntime` Go Interface (`internal/runtime/interface.go`)**
* **Purpose**: Abstract all container operations (build, pull, run, stop, inspect, logs, etc.).
* **Details**: Transcribe the Go interface from RFC 6.1 precisely. Include all specified structs (`ImageSummary`, `ContainerStatus`, `BuildOptions`, `PortMapping`, `VolumeMount`, `ResourceSpec`, `ContainerCreateOptions`, `ContainerHealthCheck`) and enums (`ContainerState`, `HealthState`).
* **Verification**: Code compiles. Interface and type definitions match RFC.
2. **Implement Podman Backend for `ContainerRuntime` (`internal/runtime/podman.go`) - Core Lifecycle Methods**
* **Purpose**: Translate `ContainerRuntime` calls into `podman` CLI commands.
* **Details (for each method, focus on these first):**
* `PullImage(ctx, imageName, platform)`:
* Cmd: `podman pull {imageName}` (add `--platform` if specified).
* Resolve the image ID once the pull succeeds (e.g., from `podman image inspect {imageName} --format '{{.Id}}'`; `podman pull` also prints the ID as the last line of its output).
* `CreateContainer(ctx, opts ContainerCreateOptions)`:
* Cmd: `podman create ...`
* Translate `ContainerCreateOptions` into `podman create` flags:
* `--name {opts.InstanceID}` (KAT's unique ID for the instance).
* `--hostname {opts.Hostname}`.
* `--env` for `opts.Env`.
* `--label` for `opts.Labels` (include KAT ownership labels like `kat.dws.rip/workload-name`, `kat.dws.rip/namespace`, `kat.dws.rip/instance-id`).
* `--restart {opts.RestartPolicy}` (map to Podman's "no", "on-failure", "always").
* Resource mapping: `--cpus` (for quota), `--cpu-shares`, `--memory`.
* `--publish` for `opts.Ports`.
* `--volume` for `opts.Volumes` (source will be host path, destination is container path).
* `--network {opts.NetworkName}` and `--ip {opts.IPAddress}` if specified.
* `--user {opts.User}`.
* `--cap-add`, `--cap-drop`, `--security-opt`.
* Podman native healthcheck flags from `opts.HealthCheck`.
* `--systemd={opts.Systemd}`.
* Parse output for container ID.
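
The flag translation above can be sketched as a pure function that turns options into an argument list, which keeps it easy to unit test. This is a minimal sketch under assumptions: the field names and types on `ContainerCreateOptions`, `PortMapping`, and `VolumeMount` (including `Image` and `Command`) are guesses, not the RFC 6.1 definitions, and only a subset of the flags is covered.

```go
package main

import (
	"fmt"
	"sort"
)

// The struct shapes below are assumptions for illustration only.
type PortMapping struct {
	HostPort      int
	ContainerPort int
	Protocol      string // defaults to "tcp" when empty
}

type VolumeMount struct {
	HostPath      string
	ContainerPath string
	ReadOnly      bool
}

type ContainerCreateOptions struct {
	InstanceID    string
	Image         string
	Command       []string
	Hostname      string
	RestartPolicy string
	Env           map[string]string
	Labels        map[string]string
	Ports         []PortMapping
	Volumes       []VolumeMount
}

// sortedPairs emits "flag key=value" pairs in key order so the generated
// command line is deterministic (map iteration order is random in Go).
func sortedPairs(flag string, m map[string]string) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	out := make([]string, 0, 2*len(keys))
	for _, k := range keys {
		out = append(out, flag, k+"="+m[k])
	}
	return out
}

// buildCreateArgs returns the arguments that follow the "podman" binary name.
func buildCreateArgs(o ContainerCreateOptions) []string {
	args := []string{"create", "--name", o.InstanceID}
	if o.Hostname != "" {
		args = append(args, "--hostname", o.Hostname)
	}
	args = append(args, sortedPairs("--env", o.Env)...)
	args = append(args, sortedPairs("--label", o.Labels)...)
	if o.RestartPolicy != "" {
		args = append(args, "--restart", o.RestartPolicy)
	}
	for _, p := range o.Ports {
		proto := p.Protocol
		if proto == "" {
			proto = "tcp"
		}
		args = append(args, "--publish", fmt.Sprintf("%d:%d/%s", p.HostPort, p.ContainerPort, proto))
	}
	for _, v := range o.Volumes {
		spec := v.HostPath + ":" + v.ContainerPath
		if v.ReadOnly {
			spec += ":ro"
		}
		args = append(args, "--volume", spec)
	}
	// The image, then the optional command, always come last.
	args = append(args, o.Image)
	return append(args, o.Command...)
}

func main() {
	fmt.Println(buildCreateArgs(ContainerCreateOptions{
		InstanceID: "web-1",
		Image:      "docker.io/library/alpine",
		Command:    []string{"sleep", "30"},
		Labels:     map[string]string{"kat.dws.rip/instance-id": "web-1"},
	}))
}
```

Sorting the map keys is a deliberate choice: without it, tests that compare the generated command line would be flaky.
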
* `StartContainer(ctx, containerID)`: Cmd: `podman start {containerID}`.
* `StopContainer(ctx, containerID, timeoutSeconds)`: Cmd: `podman stop -t {timeoutSeconds} {containerID}`.
* `RemoveContainer(ctx, containerID, force, removeVolumes)`: Cmd: `podman rm {containerID}` (add `--force`, `--volumes`).
* `GetContainerStatus(ctx, containerOrName)`:
* Cmd: `podman inspect {containerOrName}`.
* Parse JSON output to populate `ContainerStatus` struct (State, ExitCode, StartedAt, FinishedAt, Health, ImageID, ImageName, OverlayIP if available from inspect).
* Podman health status needs to be mapped to `HealthState`.
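
Parsing the inspect output could look roughly like the sketch below. It assumes `podman inspect` returns a JSON array whose entries carry `Id`, `Image` (the image ID), `ImageName`, and a `State` object. The health key has been seen as both `Health` and `Healthcheck` across Podman versions, so the sketch accepts either. The `ContainerStatus` fields, the `HealthState` values, and the sample JSON are illustrative assumptions, not the RFC definitions or captured output.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

type ContainerState string
type HealthState string

const (
	HealthUnknown   HealthState = "unknown"
	HealthStarting  HealthState = "starting"
	HealthHealthy   HealthState = "healthy"
	HealthUnhealthy HealthState = "unhealthy"
)

// ContainerStatus is a reduced, assumed shape of the RFC struct.
type ContainerStatus struct {
	ID, ImageID, ImageName string
	State                  ContainerState
	ExitCode               int
	StartedAt, FinishedAt  time.Time
	Health                 HealthState
}

type inspectHealth struct {
	Status string `json:"Status"`
}

// inspectEntry decodes only the fields this sketch needs.
type inspectEntry struct {
	ID        string `json:"Id"`
	Image     string `json:"Image"`
	ImageName string `json:"ImageName"`
	State     struct {
		Status      string         `json:"Status"`
		ExitCode    int            `json:"ExitCode"`
		StartedAt   time.Time      `json:"StartedAt"`
		FinishedAt  time.Time      `json:"FinishedAt"`
		Health      *inspectHealth `json:"Health"`
		Healthcheck *inspectHealth `json:"Healthcheck"`
	} `json:"State"`
}

// mapHealth converts Podman's health string into HealthState.
func mapHealth(s string) HealthState {
	switch s {
	case "healthy":
		return HealthHealthy
	case "unhealthy":
		return HealthUnhealthy
	case "starting":
		return HealthStarting
	default:
		return HealthUnknown
	}
}

func parseInspect(data []byte) (ContainerStatus, error) {
	var entries []inspectEntry
	if err := json.Unmarshal(data, &entries); err != nil {
		return ContainerStatus{}, fmt.Errorf("decode podman inspect output: %w", err)
	}
	if len(entries) == 0 {
		return ContainerStatus{}, errors.New("podman inspect returned no entries")
	}
	e := entries[0]
	st := ContainerStatus{
		ID: e.ID, ImageID: e.Image, ImageName: e.ImageName,
		State:    ContainerState(e.State.Status),
		ExitCode: e.State.ExitCode, StartedAt: e.State.StartedAt,
		FinishedAt: e.State.FinishedAt, Health: HealthUnknown,
	}
	h := e.State.Health
	if h == nil {
		h = e.State.Healthcheck
	}
	if h != nil {
		st.Health = mapHealth(h.Status)
	}
	return st, nil
}

// sampleInspect is hand-written example input, not captured output.
const sampleInspect = `[{"Id":"abc123","Image":"deadbeef","ImageName":"docker.io/library/alpine:latest",
"State":{"Status":"running","ExitCode":0,"StartedAt":"2025-05-09T19:15:50.5-04:00",
"FinishedAt":"0001-01-01T00:00:00Z","Health":{"Status":"healthy"}}}]`

func main() {
	st, err := parseInspect([]byte(sampleInspect))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("state=%s health=%s exit=%d\n", st.State, st.Health, st.ExitCode)
}
```

Decoding into a narrow struct rather than a generic map means unknown or version-specific fields are simply ignored, which helps with the varied inspect output noted under the challenges.
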
* `StreamContainerLogs(ctx, containerID, follow, since, stdout, stderr)`:
* Cmd: `podman logs {containerID}` (add `--follow`, `--since`).
* Assign the provided `io.Writer`s to the `exec.Cmd`'s `Stdout` and `Stderr` fields so output is streamed as it is produced; cancelling `ctx` ends a `--follow` stream.
* **Helper**: A utility function to run `podman` commands as a specific rootless user (see Rootless Execution below).
* **Potential Challenges**: Correctly mapping all `ContainerCreateOptions` to Podman flags. Parsing varied `podman inspect` output. Managing `os/exec` for logs. Robust error handling from CLI output.
* **Verification**:
* Unit tests for each implemented method, mocking `os/exec` calls to verify command construction and output parsing.
* *Integration-style tests (require Podman installed)*: Tests that actually execute `podman` commands (e.g., pull alpine, create, start, inspect, stop, rm) and verify the resulting state changes.
3. **Implement Rootless Execution Strategy (`internal/runtime/podman.go` helpers, `internal/agent/runtime.go`)**
* **Purpose**: Ensure containers are run by unprivileged users using systemd for supervision.
* **Details**:
* **User Assumption**: For Phase 3, *assume* the dedicated user (e.g., `kat_wl_mywebapp`) already exists on the system and `loginctl enable-linger <username>` has been run manually. The username could be passed in `ContainerCreateOptions.User` or derived.
* **Podman Command Execution Context**:
* The `kat-agent` process itself might run as root or a privileged user.
* When executing `podman` commands for a workload, it MUST run them as the target unprivileged user.
* There are several ways to achieve this: (a) `sudo -u {username} podman ...`; (b) more directly, via `nsenter`/`setuid`, if the agent has the required capabilities; or (c) when invoking `podman` through the systemd user session D-Bus API, by setting `XDG_RUNTIME_DIR` and `DBUS_SESSION_BUS_ADDRESS` appropriately for the target user. *The simplest option for now is probably `sudo -u {username} podman ...` if the agent is root, or ensuring the agent itself runs as a user who can switch to other `kat_wl_*` users.*
* The RFC prefers "systemd user sessions". This usually means `systemctl --user ...`. To control another user's systemd session, the agent process (if root) can use `machinectl shell {username}@.host /bin/bash -c "systemctl --user ..."` or `systemd-run --user --machine={username}@.host ...`. If the agent is not root, it cannot directly control other users' systemd sessions. *This is a critical design point: how does the agent (potentially root) interact with user-level systemd?*
* RFC: "Agent uses `systemctl --user --machine={username}@.host ...`". This implies agent has permissions to do this (likely running as root or with specific polkit rules).
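
The `sudo -u` option can be sketched as a small argv builder. This is only an illustration, assuming the agent runs as root with `sudo` available and that the target user's runtime directory is `/run/user/{uid}` (the usual location once linger is enabled). `podmanAsUser` and its parameters are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// podmanAsUser returns a full argv that runs podman as the given
// unprivileged user. -H sets HOME to the target user's home directory, and
// XDG_RUNTIME_DIR is passed explicitly because rootless Podman depends on
// it and sudo does not set it.
func podmanAsUser(username string, uid int, podmanArgs ...string) []string {
	argv := []string{
		"sudo", "-H", "-u", username, "--",
		"env", "XDG_RUNTIME_DIR=/run/user/" + strconv.Itoa(uid),
		"podman",
	}
	return append(argv, podmanArgs...)
}

func main() {
	fmt.Println(podmanAsUser("kat_wl_mywebapp", 1001, "ps", "--all"))
}
```

Returning an argv slice rather than a shell string avoids quoting problems when option values contain spaces.
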
* **Systemd Unit Generation & Management**:
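* After `podman create ...` (or instead of direct create, if `podman generate systemd` is used to create the definition), generate the systemd unit:
`podman generate systemd --new --name --files --time 10 {opts.InstanceID}`. Here `--name` is a boolean flag (use the container's name rather than its ID in the unit name), and the positional argument is the container name or ID, not the image. By default Podman prefixes the file name with `container-`, so the output must be renamed (or the prefix adjusted with `--container-prefix`) to obtain a `{opts.InstanceID}.service` file.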
* The `ContainerRuntime` implementation needs to:
1. Execute `podman create` to establish the container definition (this allows Podman to manage its internal state for the container ID).
2. Execute `podman generate systemd --name {containerID}` (using the ID from create) to get the unit file content.
3. Place this unit file in the target user's systemd path (e.g., `/home/{username}/.config/systemd/user/{opts.InstanceID}.service` or `/etc/systemd/user/{opts.InstanceID}.service` if agent is root and wants to enable for any user).
4. Run `systemctl --user --machine={username}@.host daemon-reload`.
5. Start/Enable: `systemctl --user --machine={username}@.host enable --now {opts.InstanceID}.service`.
* To stop: `systemctl --user --machine={username}@.host stop {opts.InstanceID}.service`.
* To remove: `systemctl --user --machine={username}@.host disable --now {opts.InstanceID}.service` (`--now` also stops the unit if it is still running), then `podman rm {opts.InstanceID}`, then remove the unit file and run `daemon-reload` again.
* Status: `systemctl --user --machine={username}@.host is-active {opts.InstanceID}.service` (or `show --property=ActiveState,SubState`, both of which are easier to parse than `status` output), or rely on `podman inspect`, which should reflect the systemd-managed state.
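
The unit placement and `systemctl` steps above can be sketched as pure helpers that return paths and command lines for the agent to execute. The helper names are hypothetical, the home directory is assumed to be known to the caller, and only the per-user unit path is shown.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// unitPath returns where the unit file goes in the user's systemd directory.
func unitPath(homeDir, instanceID string) string {
	return filepath.Join(homeDir, ".config", "systemd", "user", instanceID+".service")
}

// systemctlUser builds a systemctl invocation that targets another user's
// systemd instance from a privileged agent.
func systemctlUser(username string, args ...string) []string {
	base := []string{"systemctl", "--user", "--machine=" + username + "@.host"}
	return append(base, args...)
}

// deploySteps lists the commands to run after the unit file is written.
func deploySteps(username, instanceID string) [][]string {
	unit := instanceID + ".service"
	return [][]string{
		systemctlUser(username, "daemon-reload"),
		systemctlUser(username, "enable", "--now", unit),
	}
}

// stopStep is the command that stops the service.
func stopStep(username, instanceID string) []string {
	return systemctlUser(username, "stop", instanceID+".service")
}

func main() {
	fmt.Println(unitPath("/home/kat_wl_mywebapp", "web-1"))
	for _, step := range deploySteps("kat_wl_mywebapp", "web-1") {
		fmt.Println(step)
	}
	fmt.Println(stopStep("kat_wl_mywebapp", "web-1"))
}
```

Keeping these as data rather than executing them inline lets the verification tests assert on the exact `systemctl --user --machine...` invocations without a running systemd.
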
* **Potential Challenges**: Managing permissions for interacting with other users' systemd sessions. Correctly placing and cleaning up systemd unit files. Ensuring `XDG_RUNTIME_DIR` is set correctly for rootless Podman when `podman` is invoked directly rather than through systemd units. Systemd unit generation nuances.
* **Verification**:
* A test in `internal/agent/runtime_test.go` (or similar) can take mock `ContainerCreateOptions`.
* It calls the (mocked or real) `ContainerRuntime` implementation.
* Verify:
* Podman commands are constructed to run as the target unprivileged user.
* A systemd unit file is generated for the container.
* `systemctl --user --machine...` commands are invoked correctly to manage the service.
* The container is actually started (verify with `podman ps -a --filter label=kat.dws.rip/instance-id={instanceID}` as the target user).
* Logs can be retrieved.
* The container can be stopped and removed, including its systemd unit.
* **Milestone Verification**:
* The `ContainerRuntime` Go interface is fully defined as per RFC 6.1.
* The Podman implementation for core lifecycle methods (`PullImage`, `CreateContainer` (leading to systemd unit generation), `StartContainer` (via systemd enable/start), `StopContainer` (via systemd stop), `RemoveContainer` (via systemd disable + podman rm + unit file removal), `GetContainerStatus`, `StreamContainerLogs`) is functional.
* An `internal/agent` test (or a temporary `main.go` test harness) can:
1. Define `ContainerCreateOptions` for a simple image like `docker.io/library/alpine` with a command like `sleep 30`.
2. Specify a (manually pre-created and linger-enabled) unprivileged username.
3. Call the `ContainerRuntime` methods.
4. **Result**:
* The alpine image is pulled (if not present).
* A systemd user service unit is generated and placed correctly for the specified user.
* The service is started using `systemctl --user --machine...`.
* `podman ps --all --filter label=kat.dws.rip/instance-id=...` (run as the target user or by root seeing all containers) shows the container running or having run.
* Logs can be retrieved using the `StreamContainerLogs` method.
* The container can be stopped and removed (including its systemd unit file).
* All container operations are verifiably performed by the specified unprivileged user.
This detailed plan should provide a clearer path for implementing these initial crucial phases. Remember to keep testing iterative and focused on the RFC specifications.