The Necklace That Failed Five Times: Building a Self-Critiquing Multi-Agent Portrait Pipeline
Introduction
Most AI image generation workflows are open-loop: you write a prompt, you get an image, you decide if it's good. If it's not, you tweak the prompt and try again. The feedback loop runs through the human.
This project closes that loop. It's a multi-agent pipeline built on Google's Agent Development Kit (ADK) that generates portrait images and self-critiques them — iterating autonomously until every visual attribute passes a strict per-attribute evaluation, or the iteration limit is reached.
The necklace failed all five times. That failure is the most interesting result.
Architecture overview
The pipeline has four components: a SequentialAgent runs an IntakeAgent and then a LoopAgent, and the LoopAgent cycles a writer agent and a critic agent:
User message
     │
     ▼
IntakeAgent (LlmAgent)
     │  Parses description → state["person_description"]
     ▼
PortraitRefinementLoop (LoopAgent, max 5 iterations)
     │
     ├── PortraitWriterAgent (LlmAgent)
     │       └── generate_image tool
     │               • gemini-3.1-flash-image-preview
     │               • saves PNG to disk + ADK artifact
     │
     └── PortraitCriticAgent (LlmAgent)
             ├── critique_image tool
             │       • gemini-3.5-flash (vision)
             │       • per-attribute evaluation
             └── exit_loop tool
                     • sets actions.escalate = True
Each agent has a single, clearly scoped responsibility. The writer generates. The critic evaluates. The loop orchestrates. No agent does more than one thing.
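The control flow in the diagram can be modelled in a few lines of plain Python. This is a minimal sketch, not ADK code: `run_loop`, `intake`, `writer`, `critic`, the `attempt` counter, and the file names are all hypothetical stand-ins, and the critic's pass rule is invented purely so the loop has something to stop on.

```python
from typing import Callable

State = dict[str, str]
Step = Callable[[State], bool]  # a step returns True to request early exit


def run_loop(steps: list[Step], state: State, max_iterations: int = 5) -> int:
    """Run the steps in order, repeating until one escalates or the cap is hit."""
    for iteration in range(1, max_iterations + 1):
        for step in steps:
            if step(state):
                return iteration
    return max_iterations


def intake(state: State, user_message: str) -> None:
    """Stand-in for the intake step: copy the chat message into shared state."""
    state["person_description"] = user_message.strip()


def writer(state: State) -> bool:
    """Stand-in for the writer: record a new image path, never escalate."""
    attempt = int(state.get("attempt", "0")) + 1
    state["attempt"] = str(attempt)
    state["image_path"] = f"portrait_v{attempt}.png"  # hypothetical file name
    return False


def critic(state: State) -> bool:
    """Stand-in for the critic: pretend the third attempt passes every check."""
    ok = state["attempt"] == "3"
    state["critique"] = "PASS" if ok else "FAIL: necklace below collar"
    return ok


state: State = {}
intake(state, "  woman in a turtleneck, gold chain necklace above the collar ")
iterations = run_loop([writer, critic], state)
print(iterations, state["image_path"], state["critique"])
```

Only strings move through `state` here, which mirrors how the real pipeline passes a file path and a critique between agents.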
Decision 1: Per-attribute evaluation instead of holistic scoring
The critic doesn't ask "does this image match the description?" It operates in three steps:
1. Parse every distinct visual attribute from the description into an explicit sub-checklist: hair colour, iris detail, clothing texture, lighting direction, necklace position, composition.
2. Verify each attribute individually. A vague resemblance does not pass. The match must be exact and unambiguous.
3. Apply general photographic standards regardless of description: face fully visible and sharply rendered, professional lighting, portrait-standard composition.
A single failing attribute triggers another iteration with feedback naming exactly what is wrong and why.
This is more expensive than a holistic score. It's also dramatically more precise. The example run demonstrates why.
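The aggregation rule described above (any single failing attribute fails the image, and the feedback names each failure) can be sketched as follows. The `AttributeCheck` structure and `verdict` function are hypothetical illustrations, not the project's actual critic, which produces its checklist through a vision-model prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeCheck:
    """One line of the critic's sub-checklist (hypothetical structure)."""
    attribute: str
    passed: bool
    reason: str = ""


def verdict(checks: list[AttributeCheck]) -> tuple[bool, str]:
    """Pass only if every attribute passes; otherwise name each failure."""
    failures = [c for c in checks if not c.passed]
    if not failures:
        return True, "All attributes pass."
    feedback = "; ".join(f"{c.attribute}: {c.reason}" for c in failures)
    return False, feedback


checks = [
    AttributeCheck("hair colour", True),
    AttributeCheck("lighting direction", True),
    AttributeCheck("necklace position", False, "rendered below the collar, not above it"),
]
passed, feedback = verdict(checks)
print(passed, "|", feedback)
```

A holistic score would fold these three results into one number, and two passes out of three could easily read as "good enough".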
The necklace problem
The test description included: "gold chain necklace visible above the collar."
The necklace failed all five iterations. Every generated version rendered it draped on the sweater body below the fold — not above the collar as specified.
This is a physically precise spatial constraint. The necklace is present in every image. It's gold. It's near the collar. A holistic "does this match the description?" check would likely pass it.
But "above the collar" is an exact positional requirement. The image model consistently defaulted to what is presumably the more common visual (necklace on chest) regardless of what the prompt said. The per-attribute critic caught this every time. A single holistic pass/fail prompt likely would not have.
Iteration 1 — Fail: Gold chain necklace positioned below the turtleneck collar rather than above it
Iteration 2 — Fail: Necklace still below collar; iris shows golden-brown inner ring instead of green
Iteration 3 — Fail: All other attributes pass; necklace remains draped onto the sweater body below the fold
Iteration 4 — Fail: Necklace below collar; shadow cast on wrong cheek; iris lacks defined green inner ring
Iteration 5 — Fail: Necklace below collar; eyes appear fully green rather than hazel with a green inner ring
The iris detail showed a similar pattern. The description specified "hazel eyes with a distinct green inner ring." The model often collapsed this toward a single colour or the wrong ring colour (iterations 2, 4 and 5 in the log above). Iteration 3 passed the iris check, but the model could not hold the two-tone structure consistently at portrait scale.
These aren't prompt engineering failures. They're the edges of what current image generation models can reliably produce, surfaced by evaluation that's precise enough to find them.
Decision 2: Tools for image bytes, strings for inter-agent communication
ADK session state is meant to hold JSON-serialisable values. Raw image bytes are not JSON-serialisable. This constraint shapes the tool design.
The generate_image tool handles the google.genai API call, saves the PNG to disk, and saves an ADK artifact for inline UI rendering. It returns a file path string to the agent — not image bytes.
The critique_image tool reads the image from disk, sends it to Gemini's vision model alongside the structured evaluation prompt, and returns a text critique string.
All inter-agent communication flows as plain strings: a file path and a critique. The tools own the binary I/O; the agents never touch it. This keeps the session state clean and the agent logic simple.
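A minimal sketch of that split, using only the standard library: the helper names `save_image` and `load_image` are hypothetical, the bytes are a placeholder rather than a real PNG, and the model calls are left out entirely. It shows that a state dict holding a path serialises cleanly, while one holding raw bytes does not.

```python
import json
import tempfile
from pathlib import Path


def save_image(image_bytes: bytes, out_dir: Path, name: str) -> str:
    """Write the bytes to disk and hand back only the path as a string."""
    path = out_dir / name
    path.write_bytes(image_bytes)
    return str(path)


def load_image(path: str) -> bytes:
    """What a critique-side tool would do before calling a vision model."""
    return Path(path).read_bytes()


fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16  # placeholder bytes, not a real image

with tempfile.TemporaryDirectory() as tmp:
    state = {"image_path": save_image(fake_png, Path(tmp), "portrait_v1.png")}
    serialised = json.dumps(state)  # works: state holds only strings
    round_trip = load_image(json.loads(serialised)["image_path"])

try:
    json.dumps({"image": fake_png})  # raw bytes are rejected by the encoder
    bytes_serialisable = True
except TypeError:
    bytes_serialisable = False

print(round_trip == fake_png, bytes_serialisable)
```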
Decision 3: Async tools
adk web runs on a FastAPI/uvicorn event loop. Image generation via the Gemini API can take several seconds per call.
Making generate_image and critique_image async — using client.aio.models.generate_content — prevents these calls from blocking the server. tool_context.save_artifact() is also an async ADK method, which requires the tool function itself to be async.
This is a small implementation detail that becomes a significant reliability issue at scale. A synchronous image tool in a web-served agent will stall the event loop under any real load.
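The stall is easy to reproduce without any Gemini call. In this sketch, `time.sleep` and `asyncio.sleep` stand in for a synchronous and an asynchronous API request, and a heartbeat task stands in for every other request the server should be handling; all function names are hypothetical.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def heartbeat(ticks: list[int], interval: float = 0.01) -> None:
    """Represents other work on the event loop; counts how often it gets to run."""
    while True:
        await asyncio.sleep(interval)
        ticks[0] += 1


async def blocking_tool(duration: float) -> str:
    time.sleep(duration)  # synchronous wait: the event loop cannot run anything else
    return "done"


async def async_tool(duration: float) -> str:
    await asyncio.sleep(duration)  # cooperative wait: control returns to the loop
    return "done"


async def count_ticks_during(
    tool: Callable[[float], Awaitable[str]], duration: float = 0.2
) -> int:
    """Return how many heartbeat ticks happened while the tool was running."""
    ticks = [0]
    beat = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # give the heartbeat a chance to start
    await tool(duration)
    seen = ticks[0]
    beat.cancel()
    try:
        await beat
    except asyncio.CancelledError:
        pass
    return seen


blocked = asyncio.run(count_ticks_during(blocking_tool))
cooperative = asyncio.run(count_ticks_during(async_tool))
print(blocked, cooperative)
```

The blocking variant records zero heartbeat ticks because nothing else runs until it returns, while the awaited variant lets the heartbeat keep ticking throughout.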
Decision 4: IntakeAgent as a state bridge
When running via adk web, the user's input arrives as a chat message in the conversation history — not pre-loaded into session state. Downstream agents that expect state["person_description"] would fail to find it.
The IntakeAgent solves this by reading the conversation and writing a clean, normalised value into session state. Downstream agents consume it via {person_description} template substitution in their system prompts.
This is a pattern worth noting for any ADK pipeline built for adk web: the first agent in the sequence often needs to be a state initialisation agent, bridging between the chat interface and the structured state that the rest of the pipeline expects.
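To make the dependency concrete, here is a small stand-in for placeholder substitution. `render_instruction` is a hypothetical helper written for this illustration and is not ADK's implementation; it only shows why a downstream prompt cannot be built until something has written `person_description` into state.

```python
import re


def render_instruction(template: str, state: dict[str, str]) -> str:
    """Replace each {key} placeholder with state[key]; fail loudly if absent."""
    def lookup(match: re.Match) -> str:
        key = match.group(1)
        if key not in state:
            raise KeyError(f"state is missing '{key}'")
        return state[key]
    return re.sub(r"\{(\w+)\}", lookup, template)


template = "Generate a portrait of: {person_description}"

# Without an intake step the key is absent and rendering fails.
try:
    render_instruction(template, {})
    missing_detected = False
except KeyError:
    missing_detected = True

# The intake step normalises the chat message and writes it into state.
state = {"person_description": " ".join("  hazel eyes,\n gold necklace ".split())}
prompt = render_instruction(template, state)
print(missing_detected, "|", prompt)
```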
Decision 5: Early exit with escalate
The LoopAgent runs all 5 iterations unless told to stop. When every attribute passes, the critic calls exit_loop, which sets tool_context.actions.escalate = True. ADK treats this as a termination signal and breaks the loop.
The practical effect: if the image passes on iteration 2, you pay for 2 iterations' worth of Gemini API calls, not 5. Cost and latency scale with the quality of the generation, not the iteration ceiling. For a pipeline that makes multiple API calls per iteration, this matters.
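The saving is simple arithmetic. The sketch below assumes two tool-level model calls per iteration (one image generation, one vision critique) and ignores the agents' own LLM turns, which add further calls; `tool_calls` and its default of 2 are illustrative assumptions, not measured figures.

```python
from typing import Optional


def tool_calls(
    passing_iteration: Optional[int],
    max_iterations: int = 5,
    calls_per_iteration: int = 2,  # assumed: one generation + one critique
) -> int:
    """Count tool-level model calls; None means the image never passed."""
    if passing_iteration is None:
        iterations = max_iterations
    else:
        iterations = min(passing_iteration, max_iterations)
    return iterations * calls_per_iteration


print(tool_calls(2))     # image passes on iteration 2
print(tool_calls(None))  # image never passes, as in the necklace run
```

Under these assumptions an early pass on iteration 2 costs 4 calls, against 10 for a run that exhausts the ceiling.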
What I'd change at larger scale
Parallelise evaluation where possible. The current loop is strictly sequential, and the critique necessarily waits for the image it evaluates. On a pipeline with more agents, independent evaluation steps could run in parallel to reduce wall-clock time per iteration.
Add a confidence threshold to the critic. Currently it's binary — pass or fail. A confidence score per attribute would allow the writer to focus on low-confidence attributes rather than regenerating from scratch on each iteration.
Persist iteration history for analysis. The current run saves each portrait to disk but doesn't log the full critique history in a queryable format. A structured log of attribute-level pass/fail per iteration would let you analyse which attributes current models struggle with most — useful data for prompt engineering research.
HTTP SSE transport for remote deployment. The current setup runs locally. Exposing the pipeline as a network-accessible service requires switching the ADK transport and adding authentication.
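As an illustration of the structured iteration log suggested in this list, the sketch below records the failing attributes from the example run as JSON Lines and counts failures per attribute. The attribute labels and record shape are hypothetical; the per-iteration failures are transcribed from the run summary earlier in this article.

```python
import json
from collections import Counter

# Failing attributes per iteration, transcribed from the example run.
# The short labels are illustrative names, not the critic's actual output.
run = {
    1: ["necklace position"],
    2: ["necklace position", "iris detail"],
    3: ["necklace position"],
    4: ["necklace position", "shadow direction", "iris detail"],
    5: ["necklace position", "iris detail"],
}

# One JSON object per line (JSON Lines) keeps the log appendable and queryable.
log_lines = [
    json.dumps({"iteration": i, "failed": failed}) for i, failed in run.items()
]

# Query: which attributes fail most often across the run?
failure_counts = Counter(
    attribute
    for line in log_lines
    for attribute in json.loads(line)["failed"]
)
print(failure_counts.most_common())
```

Across many runs and descriptions, the same count would show which attribute types the image model misses most often.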
Conclusion
Closing the generation-evaluation loop with a per-attribute critic surfaces failure modes that holistic evaluation misses. The necklace-above-the-collar failure isn't a prompt engineering problem — it's the image model's spatial reasoning hitting its limit, made visible by an evaluator precise enough to find it.
The architecture pattern — sequential agents, loop with early exit, tools owning binary I/O — is reusable for any iterative generation task where quality criteria can be made explicit.
Full source + example run images: https://github.com/f2015537/iterative-portrait-agent

