kat/docs/plan/phase3.md

Phase 3: Container Runtime Interface & Local Podman Management

  • Goal: Abstract container management operations behind a ContainerRuntime interface and implement it using Podman CLI, enabling an agent to manage containers rootlessly based on (mocked) instructions.
  • RFC Sections Primarily Used: 6.1 (Runtime Interface Definition), 6.2 (Default Implementation: Podman), 6.3 (Rootless Execution Strategy).

Tasks & Sub-Tasks:

  1. Define ContainerRuntime Go Interface (internal/runtime/interface.go)

    • Purpose: Abstract all container operations (build, pull, run, stop, inspect, logs, etc.).
    • Details: Transcribe the Go interface from RFC 6.1 precisely. Include all specified structs (ImageSummary, ContainerStatus, BuildOptions, PortMapping, VolumeMount, ResourceSpec, ContainerCreateOptions, ContainerHealthCheck) and enums (ContainerState, HealthState).
    • Verification: Code compiles. Interface and type definitions match RFC.
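The RFC 6.1 text is authoritative for the transcription; as a non-normative sketch of the interface's likely shape (the method set and struct fields below are abbreviated assumptions for illustration, not the RFC's exact definitions):

```go
package main

import (
	"context"
	"io"
	"time"
)

// ContainerState and HealthState are illustrative enums; the RFC's exact
// values take precedence.
type ContainerState string

const (
	StateCreated ContainerState = "created"
	StateRunning ContainerState = "running"
	StateExited  ContainerState = "exited"
)

type HealthState string

const (
	HealthHealthy   HealthState = "healthy"
	HealthUnhealthy HealthState = "unhealthy"
)

// ContainerStatus mirrors the fields this plan parses out of podman inspect.
type ContainerStatus struct {
	State      ContainerState
	ExitCode   int
	StartedAt  time.Time
	FinishedAt time.Time
	Health     HealthState
	ImageID    string
	ImageName  string
	OverlayIP  string
}

// ContainerCreateOptions is heavily abbreviated here; the RFC specifies the
// full field set (ports, volumes, resources, health checks, etc.).
type ContainerCreateOptions struct {
	InstanceID string
	Hostname   string
	Env        map[string]string
	Labels     map[string]string
}

// ContainerRuntime abstracts container operations (subset shown).
type ContainerRuntime interface {
	PullImage(ctx context.Context, imageName, platform string) (imageID string, err error)
	CreateContainer(ctx context.Context, opts ContainerCreateOptions) (containerID string, err error)
	StartContainer(ctx context.Context, containerID string) error
	StopContainer(ctx context.Context, containerID string, timeoutSeconds int) error
	RemoveContainer(ctx context.Context, containerID string, force, removeVolumes bool) error
	GetContainerStatus(ctx context.Context, containerOrName string) (ContainerStatus, error)
	StreamContainerLogs(ctx context.Context, containerID string, follow bool, since time.Time, stdout, stderr io.Writer) error
}
```

Keeping all Podman-specific behavior behind this interface is what makes the agent testable against a mock in later phases.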
  2. Implement Podman Backend for ContainerRuntime (internal/runtime/podman.go) - Core Lifecycle Methods

    • Purpose: Translate ContainerRuntime calls into podman CLI commands.
    • Details (for each method, focus on these first):
      • PullImage(ctx, imageName, platform):
        • Cmd: podman pull {imageName} (add --platform if specified).
        • Parse output to get image ID (e.g., from podman inspect {imageName} --format '{{.Id}}').
      • CreateContainer(ctx, opts ContainerCreateOptions):
        • Cmd: podman create ...
        • Translate ContainerCreateOptions into podman create flags:
          • --name {opts.InstanceID} (KAT's unique ID for the instance).
          • --hostname {opts.Hostname}.
          • --env for opts.Env.
          • --label for opts.Labels (include KAT ownership labels like kat.dws.rip/workload-name, kat.dws.rip/namespace, kat.dws.rip/instance-id).
          • --restart {opts.RestartPolicy} (map to Podman's "no", "on-failure", "always").
          • Resource mapping: --cpus (for quota), --cpu-shares, --memory.
          • --publish for opts.Ports.
          • --volume for opts.Volumes (source will be host path, destination is container path).
          • --network {opts.NetworkName} and --ip {opts.IPAddress} if specified.
          • --user {opts.User}.
          • --cap-add, --cap-drop, --security-opt.
          • Podman native healthcheck flags from opts.HealthCheck.
          • --systemd={opts.Systemd}.
        • Parse output for container ID.
      • StartContainer(ctx, containerID): Cmd: podman start {containerID}.
      • StopContainer(ctx, containerID, timeoutSeconds): Cmd: podman stop -t {timeoutSeconds} {containerID}.
      • RemoveContainer(ctx, containerID, force, removeVolumes): Cmd: podman rm {containerID} (add --force, --volumes).
      • GetContainerStatus(ctx, containerOrName):
        • Cmd: podman inspect {containerOrName}.
        • Parse JSON output to populate ContainerStatus struct (State, ExitCode, StartedAt, FinishedAt, Health, ImageID, ImageName, OverlayIP if available from inspect).
        • Podman health status needs to be mapped to HealthState.
      • StreamContainerLogs(ctx, containerID, follow, since, stdout, stderr):
        • Cmd: podman logs {containerID} (add --follow, --since).
        • Stream output by wiring the os/exec.Cmd's Stdout and Stderr to the provided io.Writers.
    • Helper: A utility function to run podman commands as a specific rootless user (see Rootless Execution below).
    • Potential Challenges: Correctly mapping all ContainerCreateOptions to Podman flags. Parsing varied podman inspect output. Managing os/exec for logs. Robust error handling from CLI output.
    • Verification:
      • Unit tests for each implemented method, mocking os/exec calls to verify command construction and output parsing.
      • Integration-style tests (requiring Podman installed) that actually execute podman commands (e.g., pull alpine, create, start, inspect, stop, rm) and verify state changes.
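The flag translation can be sketched as pure argument construction, which keeps it unit-testable without mocking os/exec at all. The option struct below is an assumed, abbreviated shape (as is the hypothetical mapHealth helper); the flags themselves are real podman create flags:

```go
package main

import "fmt"

// CreateOpts is an assumed, minimal subset of ContainerCreateOptions
// used only to illustrate the flag mapping.
type CreateOpts struct {
	InstanceID    string
	Hostname      string
	Image         string
	Env           map[string]string
	Labels        map[string]string
	RestartPolicy string // "no" | "on-failure" | "always"
	MemoryLimit   string // e.g. "512m"
}

// buildCreateArgs translates options into podman create arguments.
func buildCreateArgs(o CreateOpts) []string {
	args := []string{"create", "--name", o.InstanceID}
	if o.Hostname != "" {
		args = append(args, "--hostname", o.Hostname)
	}
	for k, v := range o.Env {
		args = append(args, "--env", fmt.Sprintf("%s=%s", k, v))
	}
	for k, v := range o.Labels {
		args = append(args, "--label", fmt.Sprintf("%s=%s", k, v))
	}
	if o.RestartPolicy != "" {
		args = append(args, "--restart", o.RestartPolicy)
	}
	if o.MemoryLimit != "" {
		args = append(args, "--memory", o.MemoryLimit)
	}
	// The image is the final positional argument to podman create.
	return append(args, o.Image)
}

// mapHealth maps podman inspect's health status strings
// ("healthy", "unhealthy", "starting") onto KAT's HealthState names.
func mapHealth(s string) string {
	switch s {
	case "healthy":
		return "Healthy"
	case "unhealthy":
		return "Unhealthy"
	case "starting":
		return "HealthStarting"
	default:
		return "HealthUnknown"
	}
}
```

Building an argument slice rather than a shell string and handing it to os/exec directly also sidesteps quoting and injection problems with user-supplied env values and labels.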
  3. Implement Rootless Execution Strategy (internal/runtime/podman.go helpers, internal/agent/runtime.go)

    • Purpose: Ensure containers are run by unprivileged users using systemd for supervision.
    • Details:
      • User Assumption: For Phase 3, assume the dedicated user (e.g., kat_wl_mywebapp) already exists on the system and loginctl enable-linger <username> has been run manually. The username could be passed in ContainerCreateOptions.User or derived.
      • Podman Command Execution Context:
        • The kat-agent process itself might run as root or a privileged user.
        • When executing podman commands for a workload, it MUST run them as the target unprivileged user.
        • This can be achieved with sudo -u {username} podman ... if the agent runs as root, more directly via nsenter/setuid if the agent holds the needed capabilities, or by setting XDG_RUNTIME_DIR and DBUS_SESSION_BUS_ADDRESS for the target user when driving Podman through the systemd user-session D-Bus API. sudo -u {username} podman ... is the simplest option for now, provided the agent is root or runs as a user who can switch to the kat_wl_* users.
        • The RFC prefers "systemd user sessions", which usually means systemctl --user .... To control another user's systemd session, a root agent can use machinectl shell {username}@.host /bin/bash -c "systemctl --user ..." or systemd-run --user --machine={username}@.host .... A non-root agent cannot directly control other users' systemd sessions, so how the (potentially root) agent interacts with user-level systemd is a critical design point.
        • RFC: "Agent uses systemctl --user --machine={username}@.host ...". This implies agent has permissions to do this (likely running as root or with specific polkit rules).
      • Systemd Unit Generation & Management:
        • After podman create ... (or instead of a direct create, if podman generate systemd is used to create the definition), generate the systemd unit: podman generate systemd --new --files --time 10 {opts.InstanceID}. Note the positional argument is the container created above, not the image, and --name is a boolean flag (naming the unit after the container instead of its ID) rather than one taking a value. This produces a unit file for the container (with the default container- prefix, e.g. container-{opts.InstanceID}.service).
        • The ContainerRuntime implementation needs to:
          1. Execute podman create to establish the container definition (this allows Podman to manage its internal state for the container ID).
          2. Execute podman generate systemd --new {containerID} (using the ID from create; without --files the unit is printed to stdout) to get the unit file content.
          3. Place this unit file in the target user's systemd path (e.g., /home/{username}/.config/systemd/user/{opts.InstanceID}.service or /etc/systemd/user/{opts.InstanceID}.service if agent is root and wants to enable for any user).
          4. Run systemctl --user --machine={username}@.host daemon-reload.
          5. Start/Enable: systemctl --user --machine={username}@.host enable --now {opts.InstanceID}.service.
        • To stop: systemctl --user --machine={username}@.host stop {opts.InstanceID}.service.
        • To remove: systemctl --user --machine={username}@.host disable {opts.InstanceID}.service, then podman rm {opts.InstanceID}, then remove the unit file.
        • Status: systemctl --user --machine={username}@.host status {opts.InstanceID}.service (parse output), or rely on podman inspect which should reflect systemd-managed state.
    • Potential Challenges: Managing permissions for interacting with other users' systemd sessions. Correctly placing and cleaning up systemd unit files. Ensuring XDG_RUNTIME_DIR is set correctly for rootless Podman if not using systemd units for direct podman run. Systemd unit generation nuances.
    • Verification:
      • A test in internal/agent/runtime_test.go (or similar) can take mock ContainerCreateOptions.
      • It calls the (mocked or real) ContainerRuntime implementation.
      • Verify:
        • Podman commands are constructed to run as the target unprivileged user.
        • A systemd unit file is generated for the container.
        • systemctl --user --machine... commands are invoked correctly to manage the service.
        • The container is actually started (verify with podman ps -a --filter label=kat.dws.rip/instance-id={instanceID} as the target user).
        • Logs can be retrieved.
        • The container can be stopped and removed, including its systemd unit.
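Like the create flags, the systemctl plumbing reduces to argument construction that can be unit-tested without a live systemd session. A minimal sketch (the helper names are hypothetical; the flags are the ones the RFC cites):

```go
package main

// userSystemctlArgs builds the argument vector for controlling a unit in
// another user's systemd session, following the RFC's
// `systemctl --user --machine={username}@.host ...` form.
func userSystemctlArgs(username, verb, unit string) []string {
	args := []string{"--user", "--machine", username + "@.host", verb}
	if unit != "" {
		args = append(args, unit)
	}
	return args
}

// userUnitPath returns where to place a generated unit file in the target
// user's systemd search path.
func userUnitPath(home, instanceID string) string {
	return home + "/.config/systemd/user/" + instanceID + ".service"
}
```

The slice is then passed straight to exec.CommandContext(ctx, "systemctl", userSystemctlArgs(user, "enable", unit)...), so the privileged-execution concern stays in one place.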
  • Milestone Verification:
    • The ContainerRuntime Go interface is fully defined as per RFC 6.1.
    • The Podman implementation for core lifecycle methods (PullImage, CreateContainer (leading to systemd unit generation), StartContainer (via systemd enable/start), StopContainer (via systemd stop), RemoveContainer (via systemd disable + podman rm + unit file removal), GetContainerStatus, StreamContainerLogs) is functional.
    • An internal/agent test (or a temporary main.go test harness) can:
      1. Define ContainerCreateOptions for a simple image like docker.io/library/alpine with a command like sleep 30.
      2. Specify a (manually pre-created and linger-enabled) unprivileged username.
      3. Call the ContainerRuntime methods.
      4. Result:
        • The alpine image is pulled (if not present).
        • A systemd user service unit is generated and placed correctly for the specified user.
        • The service is started using systemctl --user --machine....
        • podman ps --all --filter label=kat.dws.rip/instance-id=... (run as the target user or by root seeing all containers) shows the container running or having run.
        • Logs can be retrieved using the StreamContainerLogs method.
        • The container can be stopped and removed (including its systemd unit file).
    • All container operations are verifiably performed by the specified unprivileged user.

This detailed plan should provide a clearer path for implementing these initial crucial phases. Remember to keep testing iterative and focused on the RFC specifications.