kat/docs/plan/phase3.md

# **Phase 3: Container Runtime Interface & Local Podman Management**

*   **Goal**: Abstract container management operations behind a `ContainerRuntime` interface and implement it using the Podman CLI, enabling an agent to manage containers rootlessly based on (mocked) instructions.
*   **RFC Sections Primarily Used**: 6.1 (Runtime Interface Definition), 6.2 (Default Implementation: Podman), 6.3 (Rootless Execution Strategy).

**Tasks & Sub-Tasks:**

1.  **Define `ContainerRuntime` Go Interface (`internal/runtime/interface.go`)**
    *   **Purpose**: Abstract all container operations (build, pull, run, stop, inspect, logs, etc.).
    *   **Details**: Transcribe the Go interface from RFC 6.1 precisely. Include all specified structs (`ImageSummary`, `ContainerStatus`, `BuildOptions`, `PortMapping`, `VolumeMount`, `ResourceSpec`, `ContainerCreateOptions`, `ContainerHealthCheck`) and enums (`ContainerState`, `HealthState`).
    *   **Verification**: Code compiles. Interface and type definitions match RFC.
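
A minimal sketch of the shape `interface.go` might take, assuming the method names and parameter lists used in Task 2 of this plan. RFC 6.1 remains authoritative: the parameter and return types, the `Image` field, and the enum representation and values below are illustrative assumptions, and most structs are omitted.

```go
package main

import (
	"context"
	"io"
	"time"
)

// ContainerState and HealthState are enums in RFC 6.1. Modelling them as
// strings, and the two values shown, are assumptions for illustration.
type (
	ContainerState string
	HealthState    string
)

const (
	ContainerStateRunning ContainerState = "running" // assumed value
	HealthStateHealthy    HealthState    = "healthy" // assumed value
)

// ContainerStatus carries the fields this plan names; their types are assumed.
type ContainerStatus struct {
	State      ContainerState
	ExitCode   int
	StartedAt  time.Time
	FinishedAt time.Time
	Health     HealthState
	ImageID    string
	ImageName  string
	OverlayIP  string
}

// ContainerCreateOptions is trimmed to two fields here; the RFC struct also
// covers env, labels, ports, volumes, resources, health checks and more.
type ContainerCreateOptions struct {
	InstanceID string
	Image      string // assumed field name
}

// ContainerRuntime lists only the methods covered in this phase.
type ContainerRuntime interface {
	PullImage(ctx context.Context, imageName, platform string) (imageID string, err error)
	CreateContainer(ctx context.Context, opts ContainerCreateOptions) (containerID string, err error)
	StartContainer(ctx context.Context, containerID string) error
	StopContainer(ctx context.Context, containerID string, timeoutSeconds int) error
	RemoveContainer(ctx context.Context, containerID string, force, removeVolumes bool) error
	GetContainerStatus(ctx context.Context, containerOrName string) (*ContainerStatus, error)
	StreamContainerLogs(ctx context.Context, containerID string, follow bool, since time.Time, stdout, stderr io.Writer) error
}
```

Once the Podman backend exists, a declaration such as `var _ ContainerRuntime = (*PodmanRuntime)(nil)` (with `PodmanRuntime` as a hypothetical type name) would make the compiler enforce that the backend satisfies the interface.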

2.  **Implement Podman Backend for `ContainerRuntime` (`internal/runtime/podman.go`) - Core Lifecycle Methods**
    *   **Purpose**: Translate `ContainerRuntime` calls into `podman` CLI commands.
    *   **Details (for each method, focus on these first):**
        *   `PullImage(ctx, imageName, platform)`:
            *   Cmd: `podman pull {imageName}` (add `--platform` if specified).
            *   Obtain the image ID after the pull, either from the last line that `podman pull` prints or from a follow-up `podman image inspect {imageName} --format '{{.Id}}'`.
        *   `CreateContainer(ctx, opts ContainerCreateOptions)`:
            *   Cmd: `podman create ...`
            *   Translate `ContainerCreateOptions` into `podman create` flags:
                *   `--name {opts.InstanceID}` (KAT's unique ID for the instance).
                *   `--hostname {opts.Hostname}`.
                *   `--env` for `opts.Env`.
                *   `--label` for `opts.Labels` (include KAT ownership labels like `kat.dws.rip/workload-name`, `kat.dws.rip/namespace`, `kat.dws.rip/instance-id`).
                *   `--restart {opts.RestartPolicy}` (map to Podman's "no", "on-failure", "always").
                *   Resource mapping: `--cpus` (for quota), `--cpu-shares`, `--memory`.
                *   `--publish` for `opts.Ports`.
                *   `--volume` for `opts.Volumes` (source will be host path, destination is container path).
                *   `--network {opts.NetworkName}` and `--ip {opts.IPAddress}` if specified.
                *   `--user {opts.User}`.
                *   `--cap-add`, `--cap-drop`, `--security-opt`.
                *   Podman native healthcheck flags from `opts.HealthCheck`.
                *   `--systemd={opts.Systemd}`.
            *   Parse output for container ID.
        *   `StartContainer(ctx, containerID)`: Cmd: `podman start {containerID}`.
        *   `StopContainer(ctx, containerID, timeoutSeconds)`: Cmd: `podman stop -t {timeoutSeconds} {containerID}`.
        *   `RemoveContainer(ctx, containerID, force, removeVolumes)`: Cmd: `podman rm {containerID}` (add `--force` when `force` is set and `--volumes` when `removeVolumes` is set).
        *   `GetContainerStatus(ctx, containerOrName)`:
            *   Cmd: `podman inspect {containerOrName}`.
            *   Parse JSON output to populate `ContainerStatus` struct (State, ExitCode, StartedAt, FinishedAt, Health, ImageID, ImageName, OverlayIP if available from inspect).
            *   Podman health status needs to be mapped to `HealthState`.
        *   `StreamContainerLogs(ctx, containerID, follow, since, stdout, stderr)`:
            *   Cmd: `podman logs {containerID}` (add `--follow`, `--since`).
            *   Assign the provided `io.Writer`s to the `Stdout` and `Stderr` fields of the `exec.Cmd` so output is streamed as it arrives; cancelling `ctx` ends a `--follow` stream.
    *   **Helper**: A utility function to run `podman` commands as a specific rootless user (see Rootless Execution below).
    *   **Potential Challenges**: Correctly mapping all `ContainerCreateOptions` to Podman flags. Parsing varied `podman inspect` output. Managing `os/exec` for logs. Robust error handling from CLI output.
    *   **Verification**:
        *   Unit tests for each implemented method, mocking `os/exec` calls to verify command construction and output parsing.
        *   Integration tests (*require Podman to be installed*): tests that actually execute `podman` commands (e.g., pull alpine, create, start, inspect, stop, rm) and verify state changes.

3.  **Implement Rootless Execution Strategy (`internal/runtime/podman.go` helpers, `internal/agent/runtime.go`)**
    *   **Purpose**: Ensure containers are run by unprivileged users using systemd for supervision.
    *   **Details**:
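        *   **User Assumption**: For Phase 3, *assume* the dedicated user (e.g., `kat_wl_mywebapp`) already exists on the system and `loginctl enable-linger <username>` has been run manually. The username could be passed in `ContainerCreateOptions.User` or derived. Note that Task 2 maps `opts.User` to `--user`, which sets the user inside the container, so the host account may need its own field or a derivation rule.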
        *   **Podman Command Execution Context**:
            *   The `kat-agent` process itself might run as root or a privileged user.
            *   When executing `podman` commands for a workload, it MUST run them as the target unprivileged user.
            *   Options include `sudo -u {username} podman ...`; switching user more directly via `nsenter` or `setuid` if the agent holds the necessary capabilities; or, if `podman` is invoked through the target user's systemd session (D-Bus API), setting `XDG_RUNTIME_DIR` and `DBUS_SESSION_BUS_ADDRESS` for that user. *The simplest choice for now is probably `sudo -u {username} podman ...` if the agent runs as root, or otherwise running the agent as a user permitted to switch to the `kat_wl_*` users.*
            *   The RFC prefers "systemd user sessions". This usually means `systemctl --user ...`. To control another user's systemd session, the agent process (if root) can use `machinectl shell {username}@.host /bin/bash -c "systemctl --user ..."` or `systemd-run --user --machine={username}@.host ...`. If the agent is not root, it cannot directly control other users' systemd sessions. *This is a critical design point: how does the agent (potentially root) interact with user-level systemd?*
            *   RFC: "Agent uses `systemctl --user --machine={username}@.host ...`". This implies agent has permissions to do this (likely running as root or with specific polkit rules).
        *   **Systemd Unit Generation & Management**:
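            *   After `podman create ...` (or instead of keeping a directly created container, if `podman generate systemd --new` is used so that the unit recreates it), generate the systemd unit:
                `podman generate systemd --new --name --files --time 10 {opts.InstanceID}`. The positional argument is the container's name or ID, not the image name; `--name` is a boolean flag that makes the unit refer to the container by name, and newer Podman releases document `--time` as `--stop-timeout`. With `--files`, Podman by default writes `container-{opts.InstanceID}.service` to the current directory, so the agent must either rename that file or omit `--files`, capture the unit from stdout, and write it as `{opts.InstanceID}.service`.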
            *   The `ContainerRuntime` implementation needs to:
                1.  Execute `podman create` to establish the container definition (this allows Podman to manage its internal state for the container ID).
                2.  Execute `podman generate systemd --name {containerID}` (using the ID from create) to get the unit file content.
                3.  Place this unit file in the target user's systemd path (e.g., `/home/{username}/.config/systemd/user/{opts.InstanceID}.service` or `/etc/systemd/user/{opts.InstanceID}.service` if agent is root and wants to enable for any user).
                4.  Run `systemctl --user --machine={username}@.host daemon-reload`.
                5.  Start/Enable: `systemctl --user --machine={username}@.host enable --now {opts.InstanceID}.service`.
            *   To stop: `systemctl --user --machine={username}@.host stop {opts.InstanceID}.service`.
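            *   To remove: `systemctl --user --machine={username}@.host disable --now {opts.InstanceID}.service` (which also stops the unit if it is still running), then `podman rm {opts.InstanceID}`, then delete the unit file and run `daemon-reload` again so systemd drops the unit.
            *   Status: `systemctl --user --machine={username}@.host is-active {opts.InstanceID}.service` or `show --property=ActiveState,SubState` produces output that is easier to parse than `status`; alternatively, rely on `podman inspect`, which should reflect systemd-managed state.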
    *   **Potential Challenges**: Managing permissions for interacting with other users' systemd sessions. Correctly placing and cleaning up systemd unit files. Ensuring `XDG_RUNTIME_DIR` is set correctly for rootless Podman whenever `podman` is invoked directly rather than through a systemd unit. Systemd unit generation nuances.
    *   **Verification**:
        *   A test in `internal/agent/runtime_test.go` (or similar) can take mock `ContainerCreateOptions`.
        *   It calls the (mocked or real) `ContainerRuntime` implementation.
        *   Verify:
            *   Podman commands are constructed to run as the target unprivileged user.
            *   A systemd unit file is generated for the container.
            *   `systemctl --user --machine...` commands are invoked correctly to manage the service.
            *   The container is actually started (verify with `podman ps -a --filter label=kat.dws.rip/instance-id={instanceID}` as the target user).
            *   Logs can be retrieved.
            *   The container can be stopped and removed, including its systemd unit.

*   **Milestone Verification**:
    *   The `ContainerRuntime` Go interface is fully defined as per RFC 6.1.
    *   The Podman implementation of the core lifecycle methods is functional:
        *   `PullImage`
        *   `CreateContainer` (leading to systemd unit generation)
        *   `StartContainer` (via systemd enable/start)
        *   `StopContainer` (via systemd stop)
        *   `RemoveContainer` (via systemd disable + `podman rm` + unit file removal)
        *   `GetContainerStatus`
        *   `StreamContainerLogs`
    *   An `internal/agent` test (or a temporary `main.go` test harness) can:
        1.  Define `ContainerCreateOptions` for a simple image like `docker.io/library/alpine` with a command like `sleep 30`.
        2.  Specify a (manually pre-created and linger-enabled) unprivileged username.
        3.  Call the `ContainerRuntime` methods.
        4.  **Result**:
            *   The alpine image is pulled (if not present).
            *   A systemd user service unit is generated and placed correctly for the specified user.
            *   The service is started using `systemctl --user --machine...`.
            *   `podman ps --all --filter label=kat.dws.rip/instance-id=...` (run as the target user or by root seeing all containers) shows the container running or having run.
            *   Logs can be retrieved using the `StreamContainerLogs` method.
            *   The container can be stopped and removed (including its systemd unit file).
    *   All container operations are verifiably performed by the specified unprivileged user.
