Paper deep dive
Teaching an Agent to Sketch One Part at a Time
Xiaodan Du, Ruize Xu, David Yunis, Yael Vinker, Greg Shakhnarovich
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/23/2026, 12:05:26 PM
Summary
The paper introduces a method for progressive, part-by-part vector sketch generation using a multi-modal language model-based agent. The authors propose a scalable, automated annotation pipeline to create 'ControlSketch-Part', a dataset with semantic part-level annotations. The agent is trained using a two-stage framework: supervised fine-tuning (SFT) followed by a novel multi-turn process-reward Group Relative Policy Optimization (GRPO) algorithm, which enables interpretable, controllable, and locally editable text-to-vector sketch generation.
Entities (5)
Relation Signals (3)
GRPO → optimizes → VLM
confidence 95% · we further train our agent with a reinforcement learning algorithm, Group Relative Policy Optimization (GRPO)
DreamSim → provides reward for → GRPO
confidence 95% · We use two rewards to supervise GRPO training: DreamSim reward
ControlSketch-Part → trains → VLM
confidence 95% · we use it to train a VLM on the text-guided part-by-part generation task.
Cypher Suggestions (2)
Find all models trained on the ControlSketch-Part dataset · confidence 90% · unvalidated
MATCH (d:Dataset {name: 'ControlSketch-Part'})<-[:TRAINS]-(m:Model) RETURN m.name
Identify algorithms used to optimize the VLM agent · confidence 90% · unvalidated
MATCH (a:Algorithm)-[:OPTIMIZES]->(m:Model {name: 'VLM'}) RETURN a.name
Abstract
Abstract: We develop a method for producing vector sketches one part at a time. To do this, we train a multi-modal language model-based agent using a novel multi-turn process-reward reinforcement learning algorithm following supervised fine-tuning. Our approach is enabled by a new dataset we call ControlSketch-Part, containing rich part-level annotations for sketches, obtained using a novel, generic automatic annotation pipeline that segments vector sketches into semantic parts and assigns paths to parts with a structured multi-stage labeling process. Our results indicate that incorporating structured part-level data and providing the agent with visual feedback throughout the process enables interpretable, controllable, and locally editable text-to-vector sketch generation.
Tags
Links
- Source: https://arxiv.org/abs/2603.19500v1
- Canonical: https://arxiv.org/abs/2603.19500v1
Full Text
63,714 characters extracted from source content.
Teaching an Agent to Sketch One Part at a Time

Xiaodan Du*1, Ruize Xu*2, David Yunis*1, Yael Vinker3, and Greg Shakhnarovich1
1 TTI-Chicago  2 University of Chicago  3 MIT CSAIL
{xdu, dyunis, greg}@ttic.edu, richard1xur@uchicago.edu, yaelvink@mit.edu

Fig. 1: Progressive vector sketch generation using our VLM agent. Trained on our new dataset via SFT + RL training, our agent generates sketches part-by-part, conditioned on text instructions and the evolving canvas. It produces diverse, structurally plausible sketches and supports localized editing via arbitrary stroke removal and replacement.

Abstract. We develop a method for producing vector sketches one part at a time. To do this, we train a multi-modal language model-based agent using a novel multi-turn process-reward reinforcement learning algorithm following supervised fine-tuning. Our approach is enabled by a new dataset we call ControlSketch-Part, containing rich part-level annotations for sketches, obtained using a novel, generic automatic annotation pipeline that segments vector sketches into semantic parts and assigns paths to parts with a structured multi-stage labeling process. Our results indicate that incorporating structured part-level data and providing the agent with visual feedback throughout the process enables interpretable, controllable, and locally editable text-to-vector sketch generation.

1 Introduction

Sketching provides a structured abstraction of visual content and enables rapid ideation and concept exploration in domains from industrial design to digital art. Sketches represented as vector graphics offer many advantages over rasterized canvases, making them useful in creative workflows: infinite scalability, structured visual elements, support for precise and localized modifications, and more. Automatic text-to-vector sketch generation has been widely explored [6–10, 13, 25, 36, 39, 42].

* Equal contribution.
arXiv:2603.19500v1 [cs.AI] 19 Mar 2026
However, the majority of existing works generate the full sketch at once, overlooking the progressive, step-by-step nature of the sketching process. Having the strokes in a sketch grouped into meaningful parts makes the sketch more easily editable: parts can be removed, replaced, or modified in isolation from the rest of the sketch more efficiently than by modifying individual strokes. It also makes the sketch more interpretable to a human user. Finally, one-shot generation from a long, compositional prompt may lead to failures that are localized but difficult to mitigate. In contrast, incorporating the notion of parts into the generation pipeline gives the designer fine-grained control: if a generated sketch of a part is not right, it can be replaced, and multiple choices can be explored at any intermediate stage before proceeding to other parts.

Most prior work on text-to-vector sketch generation does not allow for part-by-part generation. The one exception is SketchAgent [36], which relies on a closed-source vision-language model (VLM) as a backend (and thus is not easily adaptable to a desired domain or style) and produces simplistic, icon-style outputs. This leaves a gap: existing methods cannot achieve free-text-guided, part-by-part generation of highly detailed vector sketches with a unified model, and struggle to support human-friendly workflows that include branching possibilities and creative exploration. In contrast, our method makes these possible, as illustrated in Fig. 1.

We believe that a necessary element for closing this gap is the right training data. Work like SketchAgent has demonstrated the potential of VLMs in iteratively generating sketches conditioned on text. However, large language models (LLMs) are known to be extremely data-hungry [16, 33], and collecting a large amount of high-quality, part-annotated vector sketches created by professionals can be costly and difficult to scale.
To overcome this obstacle, we propose a scalable pipeline for annotating parts in vector sketches. Our pipeline relies on a multi-stage labeling process for part decomposition and path assignment. It includes proposal, critique, and revision stages. This pipeline is generic and can be applied to any vector sketch data.

We apply the aforementioned data collection pipeline to the ControlSketch dataset [1]. We call the resulting part-annotated dataset (which we will release as one of our contributions) ControlSketch-Part, and use it to train a VLM on the text-guided part-by-part generation task. The training uses a two-stage supervised fine-tuning (SFT) and reinforcement learning (RL) framework, where the SFT stage teaches format and initializes the sketching policy for a single turn, and an innovative multi-turn process-reward GRPO RL training stage aligns multi-turn rollouts using intermediate-state rewards [29, 30].

Our automatic data pipeline and the proposed training strategy enable free-text-guided, multi-turn interactive sketch generation. We show, quantitatively using automated metrics and user studies, and qualitatively in the paper and the supplementary, that our results significantly improve on prior work. In summary, our contributions are:

– A generic, scalable pipeline for automated VLM-based part annotation of vector sketches, yielding a short overall caption, a set of semantic part descriptions, and a complete path-to-part assignment for any vector sketch.
– A high-quality part-annotated sketch dataset, ControlSketch-Part, and an associated new benchmark for multi-turn text-to-vector sketch generation.
– A novel multi-turn process-reward GRPO algorithm for training, enabling us to train a sketching agent with novel capabilities: multi-turn vector sketch generation and progressive editing of sketches with text guidance.
The qualitative and quantitative experiment results show the potential of our data pipeline combined with a VLM in the field of text-to-vector sketch synthesis.

2 Related Works

2.1 Text-to-Vector Sketch Synthesis

Previous works on text-to-vector sketch generation fall into two main categories: learning-based approaches and test-time optimization-based approaches.

Learning-based approaches Sketch-RNN [13] is one of the first works to explore this task by learning to generate polylines autoregressively. BézierSketch [7] improves upon Sketch-RNN by replacing polylines with Bézier curves for better smoothness. SketchODE [8] further extends autoregressive stroke generation by modeling sketches as continuous-time functions via Neural ODEs. More recently, inspired by the success of score-based generative modeling [15, 31], methods including ChiroDiff [9] and StrokeFusion [42] apply diffusion models to sketch synthesis and denoise all strokes simultaneously, offering no progressive, part-level control over the generation process. These methods are conditioned on pre-defined discrete class/attribute labels rather than free-form natural language, greatly limiting their real-world applicability. They also fall short of producing complex, high-fidelity sketches.

Test-time optimization-based approaches These methods take longer to produce an output but offer greater flexibility and higher visual quality. CLIPDraw [10] pioneers text-guided vector sketch synthesis by optimizing SVG paths with a CLIP-based [26] objective. Later works utilize CLIP-based optimization for versatile image-to-sketch generation [34, 35]. DiffSketcher [39], AutoSketch [6], and SketchDreamer [25] leverage a wider range of supervision, such as LPIPS [41] loss and score distillation sampling (SDS) [24] loss, to achieve higher visual quality.
However, these methods optimize all strokes jointly for a single text input, producing sketches without meaningful stroke ordering or semantic part structure.

The most directly relevant work to part-aware, text-guided sketch synthesis is SketchAgent [36], which uses a closed-source Claude Sonnet model in a zero-shot prompting framework to perform text-guided sequential sketching.

Fig. 2: An illustration of our automated part annotation pipeline. The same VLM is used to produce part designations and assignments in some stages and to critique these assignments and suggest improvements in other stages. Green check marks indicate outputs retained in the final dataset.
SketchAgent's zero-shot nature constrains it to doodle-style outputs that cannot be adapted to higher visual fidelity or specific domains. It also exhibits low spatial grounding accuracy.

2.2 Reinforcement Learning for Large Language Models

Reinforcement learning (RL) has long provided a principled framework for optimizing sequential decision-makers in Markov decision processes (MDPs). This MDP perspective is increasingly relevant for modern LLMs, since autoregressive token generation itself can be interpreted as a sequential decision process and, more broadly, many agentic applications expose the model to multi-step environments where errors can accumulate over time. Recently, DeepSeekMath introduced Group Relative Policy Optimization (GRPO), which has been efficient and successful in various reasoning tasks [30].

Multimodal RL and dense credit assignment Extensions of GRPO training go beyond text-only reasoning. In the domain of vector graphics, for example, Reason-SVG [38] proposes a two-stage scheme (SFT followed by GRPO) to generate SVG sketches via a hybrid reward combining programmatic correctness and visual similarity. A complementary line, Rendering-Aware Reinforcement Learning [27], uses rendering feedback to compute a visual reward: the similarity between a rendered SVG output and a target image guides policy improvement through GRPO. However, these methods do not use intermediate states of the generation process, whereas we leverage intermediate-state representations (i.e., partial sketches) to provide dense credit assignment.

3 Automated Part Annotation

We start with an existing dataset of sketches. Our goal is to enrich each sketch, given as a vector graphics (e.g., SVG) file, with detailed part information:

– A short caption describing the entire sketch;
– A set of part descriptions, on a semantic level related to the sketch content;
– A path-to-part assignment for each path.
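Concretely, the enriched record for one sketch can be held in a small structure like the following. This is an illustrative sketch only; the field names (`caption`, `parts`, `path_to_part`) and the helper are our own invention, not the released dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class PartAnnotatedSketch:
    """Hypothetical container for the three annotation layers listed above."""
    caption: str                 # short caption for the whole sketch
    parts: dict[str, str]        # e.g. {"Part1": "A looped head ..."}
    path_to_part: dict[str, str] # e.g. {"Path1": "Part3"}

    def paths_of(self, part: str) -> list[str]:
        """All path indices assigned to a given part."""
        return [p for p, q in self.path_to_part.items() if q == part]
```

Each part having at least one path (a constraint the pipeline's schema enforces) corresponds to `paths_of` being non-empty for every key in `parts`.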
3.1 Data Collection Pipeline

We design a multi-step automatic data annotation pipeline that progressively derives semantic structure from the raw SVG input. An overview of the pipeline is presented in Fig. 2.

(1) Initial part decomposition The input sketch is rendered into a raster image. Based on this rendering, a VLM proposes a semantic decomposition as a small set of parts. Each part is written as a concise textual description of a distinct object component. The VLM prompt (see Supplementary for all prompt details) instructs it to output non-overlapping yet collectively exhaustive parts.

(2) Part critique Like others [14, 20], we find that even current state-of-the-art VLMs struggle to follow all rules in a complicated task. Therefore, we run an improvement step: the VLM (acting now as a critic) audits the current set of parts against all the instructions from Step 1 (and the rendered sketch) and returns a structured list of issues, enforced by a schema [12]. Each issue contains "type of violation", "severity", "reasoning", and "suggested fix". The critique also contains an overall "summary" of the issues and a boolean "should revise" flag.

(3) Part refinement If the "should revise" flag is set, the VLM is instructed to revise the previous part decomposition using the critique from (2) and the sketch rendering. The output format is the same as that in (1).

(4) Initial path assignment Based on the refined parts, the sketch's SVG text, and the sketch rendering, we instruct the VLM to assign every path to one part. The output is schema-constrained so that:

– parts are assigned part labels "Part1", "Part2", ...;
– each path index ("Path1", "Path2", ...) is assigned to exactly one part;
– each part contains at least one path.

(5) Path assignment critique with diagnostic visualization We critique the path assignment similarly to (2), with the addition of a diagnostic visualization (shown in Fig. 2) as input to the VLM critic.
First, we assign each part label a unique color from a pre-defined color palette, and build two panels. In the left panel of the diagnostic visualization, we render a color marker, the part label, and the part description text for each part, in the corresponding color. Thus each part description has an unambiguous visual identity. In the right panel, we recolor the sketch by rendering each path in the color of its assigned part. The two color-coded panels (descriptions and sketch) are concatenated side-by-side, making it easier for the VLM to capture the correspondence between the part descriptions and the path assignment. The VLM receives the original sketch image, the diagnostic image, the previous path assignment, and the task instructions for (4), and is asked to identify incorrect path assignments and provide concrete correction suggestions. The output schema is exactly the same as that of (2).

Fig. 3: Examples from the ControlSketch-Part dataset. We show part decompositions for 4 sketches with various objects and numbers of parts. The actual caption and part descriptions are shown for the rightmost sketch (Caption: "A horse stands facing right with an arched neck, straight front legs, rounded hindquarters, and a long tail." Part 1: torso - chest, back, and rounded hindquarters. Part 2: two angled rear legs. Part 3: long tail flowing down from the rear. Part 4: two straight front legs. Part 5: head with pointed ears and arched neck with mane.) The black text is the overall caption. The color-coded part descriptions and stroke groups demonstrate the part-level semantic annotations.

(6) Path assignment refinement During this step, a refinement pass receives the sketch rendering, the sketch paths, the refined parts from (3), the initial path assignment from (4) along with the step (4) instructions, and the path assignment critique from (5). It updates the path assignment with the necessary edits under the same schema constraints as (4).
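Steps (1)-(3) and (4)-(6) share the same propose-critique-refine pattern, which can be sketched as a small control loop. This is a hypothetical skeleton: `vlm` stands in for any client that returns structured output under the paper's schemas, and the prompt arguments are placeholders, not the actual prompts.

```python
def refine_with_critique(vlm, image, propose_prompt, critique_prompt,
                         revise_prompt, max_rounds=1):
    """Propose an annotation, then critique it and revise if needed.

    `vlm(prompt, *context)` is assumed to return the proposal (a list of
    parts, or a path assignment) for propose/revise prompts, and a dict
    with a boolean "should_revise" flag for the critique prompt.
    """
    proposal = vlm(propose_prompt, image)  # step (1)/(4): initial proposal
    for _ in range(max_rounds):
        critique = vlm(critique_prompt, image, proposal)  # step (2)/(5)
        if not critique.get("should_revise", False):
            break  # the proposal already satisfies all instructions
        proposal = vlm(revise_prompt, image, proposal, critique)  # step (3)/(6)
    return proposal
```

The same loop serves both the part decomposition and the path assignment; only the prompts and the attached context (e.g., the diagnostic visualization in step (5)) differ.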
(7) Caption generation Finally, we use the VLM to generate a short general caption that summarizes the object based solely on the refined parts. This ensures the overall text caption remains consistent with the part-level semantics.

3.2 Our Dataset: ControlSketch-Part

The procedure described above is designed to generalize to any sketch dataset with SVG (or vector-convertible) sketches. To reduce the data gap discussed in Sec. 1, we apply it to a complex, realistic-looking sketch dataset: ControlSketch. ControlSketch is a professional-quality dataset that consists of 35,000 image-sketch pairs [1] generated by SDXL [23] and the SDS [24] loss-based optimization algorithm. It contains sketches for 15 object categories; we do not use or refer to the category labels in any way in training, and only mention them for reference when organizing examples in this paper. We construct a schema so that the number of parts of a sketch is between 2 and 5, and apply our pipeline using Gemini 3.0 Pro as the VLM. We call the resulting dataset, with the newly added captions, part descriptions, and path-to-part assignments, ControlSketch-Part. An illustration of examples of ControlSketch-Part data can be found in Fig. 3.

4 Method

We aim to have a VLM agent generate a vector sketch iteratively: draw a part → look and reason → draw the next part. An overview of our method's pipeline can be found in Fig. 4. At each turn, the VLM receives: (1) the rendering of the current canvas, (2) an overall short caption of the object it is drawing, (3) a description of the next part, (4) descriptions of previously drawn parts of the sketch along with their corresponding vector paths, and (5) the number of parts left to sketch after the current turn. The output is a sequence of paths (strokes), each coded as a curve. Since all strokes share the same set of SVG attributes (width, opacity, etc.), we instruct the model to output only the eight coordinates that define a cubic Bézier curve, along with the SVG command letters M and C. The paths are separated by newlines.

Fig. 4: The visualization of the training pipeline. The task of generating vector sketches based on text prompts is split into multiple turns. Blue arrows: sequential computation; red arrows: loss. The cross-entropy loss and the DreamSim reward are used as training signals at the SFT and RL stages, respectively. $\pi_\theta$ is the policy model, i.e., our VLM.
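Under the output format described above, each line carries one move command plus one cubic Bézier segment, i.e. "M x0 y0 C x1 y1 x2 y2 x3 y3" (eight coordinates). A minimal parser for that format might look like the following sketch (our own illustration, not the authors' code):

```python
def parse_paths(response: str):
    """Parse newline-separated 'M x0 y0 C x1 y1 x2 y2 x3 y3' lines into
    8-tuples of integer coordinates; reject anything else."""
    paths = []
    for line in response.strip().splitlines():
        tokens = line.split()
        # Expect exactly: M, two start coords, C, six control/end coords.
        if len(tokens) != 10 or tokens[0] != "M" or tokens[3] != "C":
            raise ValueError(f"malformed path line: {line!r}")
        coords = [int(t) for t in tokens[1:3] + tokens[4:]]
        paths.append(tuple(coords))
    return paths
```

A strict parser like this could also serve as the validity verifier mentioned in the RL stage: any response it rejects is treated as malformed.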
For example, a sequence of two paths will be presented as

M 212 146 C 6 89 303 88 322 14
M 213 17 C 213 269 18 157 218 32    (1)

Our method consists of two training stages: (1) a supervised fine-tuning stage, in which the model learns the correct output format and a sketching policy for a single turn, and (2) multi-turn process-reward GRPO training to improve the visual quality of the output.

4.1 Stage 1: Supervised Fine-Tuning

We conduct SFT on the VLM agent using the standard cross-entropy loss (next-token prediction) on input/output pairs. We augment our dataset by randomly sampling a maximum of 20 part permutations per sketch, yielding for each permutation the corresponding sequence of part descriptions, incomplete sketches (as strokes), and the associated incomplete renderings. For instance, suppose a sketch has parts A, B, C, D, and E. A permutation of these might be C, B, D, E, A. The corresponding set of input/output pairs will include: empty canvas + description of C (with the output being the ground-truth strokes for C); canvas with C rendered + description of B (with the output consisting of the strokes for B); canvas with C+B rendered + description of D (with the output consisting of the strokes for D); etc. See Fig. 4 for a visualization. All permutations for a given sketch share the same "global" caption. This approach provides the agent with examples of completing a sketch under arbitrary orderings of parts/turns. The main purpose of the SFT stage is to train the agent to produce valid paths, and to generate a single part extending an existing ground-truth partial sketch (which prepares it for the second stage, in which it learns multi-turn generation).

4.2 Stage 2: RL with Multi-turn Process-Reward GRPO

After the SFT stage, the agent is capable of progressive generation when applied autoregressively (generate the first part, then generate the second part conditioned on observing the just-generated first part, etc.).
However, this creates a gap between the SFT training regime, in which the agent has only seen "oracle" intermediate states sampled from the ground truth, and inference time, when it is given its own generations from previous steps. Indeed, we observe a resulting deterioration in visual quality as the generation progresses. To bridge this gap, we further train our agent with a reinforcement learning algorithm, Group Relative Policy Optimization (GRPO) [30]. GRPO computes the mean reward over multiple sampled trajectories (a group) as the baseline, replacing the need for an additional value function approximation model, which is usually of comparable size to the policy model [30]. This makes GRPO more efficient than its predecessors like [22, 29].

GRPO preliminary We call a sampled sequence of responses $o^1, \ldots, o^T$ for a given input $q \sim P(Q)$ a trajectory. In our case, the trajectory is a sequence of sketch parts, each adding to the previously generated parts. Assuming the group size (number of sampled trajectories for a given problem) is $G$ and the number of steps for trajectory $g$ is $T_g$, the collection of all the rewards for the group $\{\{o_g^t\}_{t=1}^{T_g}\}_{g=1}^{G}$ can be expressed as:

$$\mathcal{R} = \{\{r_g^t\}_{t=1}^{T_g}\}_{g=1}^{G}. \quad (2)$$

Standard GRPO normalizes the rewards with the mean and standard deviation of the entire $\mathcal{R}$, i.e., $\tilde{r}_g^t = (r_g^t - \mathrm{mean}(\mathcal{R})) / \mathrm{std}(\mathcal{R})$. The advantage $\hat{A}_g^t$ of the current step is calculated as the sum of the normalized rewards from the current and the following steps:

$$\hat{A}_g^t = \sum_{t' = t}^{T_g} \tilde{r}_g^{t'}. \quad (3)$$

Process-reward calculation In iterative sketch generation, the number of steps (parts) is fixed for a given sketch. Moreover, the ground truth of any intermediate state in a trajectory is also available to the reward model, by simply assembling the ground-truth paths of all previous parts together. Therefore, we can estimate intermediate rewards more precisely.
Since all trajectories in a group have identical lengths (the number of parts), let us denote it by $T = T_1 = T_2 = \cdots = T_G$. Thus, the reward collection in (2) becomes

$$\mathcal{R} = \{\{r_g^t\}_{t=1}^{T}\}_{g=1}^{G}. \quad (4)$$

Instead of estimating a unified baseline with all rewards in $\mathcal{R}$, we compute normalized rewards and advantages within each step. Let $\mathcal{R}_t = \{r_{g'}^t\}_{g'=1}^{G}$; then

$$\tilde{r}_g^t = \frac{r_g^t - \mathrm{mean}(\mathcal{R}_t)}{\mathrm{std}(\mathcal{R}_t)}, \quad (5)$$

and

$$\hat{A}_g^t = \tilde{r}_g^t. \quad (6)$$

We use two rewards to supervise GRPO training: a DreamSim reward intended to capture visual quality, and a path count reward encouraging appropriate brevity.

DreamSim reward In each step, we render the current canvas with CairoSVG [19], a lightweight rendering engine, and measure its (image-to-image) similarity to the ground-truth rendering at the same step. For this we use the DreamSim [11] pre-trained ensemble model to compute the cosine similarity between the two images in embedding space. DreamSim is a learned perceptual similarity metric for images that aligns better with how humans judge visual similarity compared to CLIP [26], DINO [4], and LPIPS [41]. Let $\mathrm{dreamsim}(I)$ be the embedding of an image $I$. The DreamSim reward is

$$r_{\mathrm{dreamsim}} = \cos\big(\mathrm{dreamsim}(I_{\mathrm{gen}}),\ \mathrm{dreamsim}(I_{\mathrm{gt}})\big), \quad (7)$$

where $I_{\mathrm{gen}}$ is the current generated rendering and $I_{\mathrm{gt}}$ is the current ground-truth rendering for the same set of parts.

Path count reward As identified by Liu et al. [21], the GRPO objective induces a bias towards longer trajectories. To keep the response length close to the distribution of the training data, we introduce the path count reward:

$$r_{\mathrm{pc}} = \max\big(0,\ 1 - |N_{\mathrm{gt}} - N| / N_{\mathrm{gt}}\big), \quad (8)$$

where $N$ is the number of paths in the final output and $N_{\mathrm{gt}}$ is the number of paths in the final ground truth. We only regularize the agent on the final number of paths, rather than the number of paths for each individual part, because empirically we find the per-part path count signal to be too noisy.
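The per-step normalization of Eqs. (5)-(6) and the path count reward of Eq. (8) can be sketched in a few lines of plain Python. This is an illustrative re-implementation under the stated assumptions (all trajectories share length T, and rewards at each step are not all identical, so the standard deviation is non-zero), not the authors' code.

```python
from statistics import mean, pstdev

def stepwise_advantages(rewards):
    """Eqs. (5)-(6): normalize rewards within each step index t across the
    group. rewards[g][t] is the reward of trajectory g at step t; all G
    trajectories have the same length T. Returns A[g][t] = r~[g][t]."""
    G, T = len(rewards), len(rewards[0])
    adv = [[0.0] * T for _ in range(G)]
    for t in range(T):
        step = [rewards[g][t] for g in range(G)]  # R_t across the group
        mu, sigma = mean(step), pstdev(step)      # assumes sigma > 0
        for g in range(G):
            adv[g][t] = (rewards[g][t] - mu) / sigma
    return adv

def path_count_reward(n, n_gt):
    """Eq. (8): r_pc = max(0, 1 - |N_gt - N| / N_gt)."""
    return max(0.0, 1.0 - abs(n_gt - n) / n_gt)
```

Unlike the standard suffix-sum advantage of Eq. (3), each step's advantage here is just its own normalized reward, which is what gives dense, per-step credit assignment.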
The combined reward is a weighted combination of the two rewards:

$$r_g^t = r_{\mathrm{dreamsim}} + \lambda r_{\mathrm{pc}}. \quad (9)$$

Before computing the rewards, we run the responses through a validity verifier. Any response that does not conform to the format will be assigned $\min(r_g^t)$, and its trajectory will be terminated at the current step; in such a case, $N$ and $N_{\mathrm{gt}}$ are the cumulative path counts up to the last successful step.

Fig. 5: The Long-CLIP cosine similarity across all tested models. The Ground Truth (GT) value and the Random value are the cosine similarity scores of text to the ground-truth sketches from ControlSketch-Part and to sketches of randomly sampled paths, respectively. The plotted values are:

| Method | Cosine similarity (Long-CLIP) |
|---|---|
| GT | 0.312 |
| Ours (SFT + RL) | 0.307 |
| Ours (SFT) | 0.301 |
| SketchAgent | 0.288 |
| Gemini 3.1 Pro | 0.283 |
| SDXL + SwiftSketch | 0.281 |
| Random | 0.186 |

Learning algorithm Our multi-turn process-reward GRPO learning objective builds on DeepSeekMath [30]. Let

$$\rho_{g,k}^t(\theta) = \frac{\pi_\theta(o_{g,k}^t \mid q, o_{g,<k}^t)}{\pi_{\theta_{\mathrm{old}}}(o_{g,k}^t \mid q, o_{g,<k}^t)}$$

be the token-level ratios, where $\pi_\theta, \pi_{\theta_{\mathrm{old}}}$ are the current and the old policy models during the policy update (our VLM agent), and $q, o$ are questions and outputs sampled from the question dataset and the old policy $\pi_{\theta_{\mathrm{old}}}$. $k$ indicates the token position within the response, and $o_{g,<k}^t$ denotes the first $k-1$ tokens of the $g$-th trajectory's $t$-th step response, on which generation of token $k$ is conditioned. The learning objective, multi-turn and thus different from [30], is

$$J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{\{o_g^t\}_{t=1}^{T}\}_{g=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)} \frac{1}{GT} \sum_{g=1}^{G} \sum_{t=1}^{T} \frac{1}{|o_g^t|} \sum_{k=1}^{|o_g^t|} \Big\{ \min\big[\rho_{g,k}^t(\theta)\, \hat{A}_{g,k}^t,\ \mathrm{clip}\big(\rho_{g,k}^t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\, \hat{A}_{g,k}^t\big] - \beta\, D_{\mathrm{KL}}[\pi_\theta \,\|\, \pi_{\mathrm{ref}}] \Big\}, \quad (10)$$

where $\varepsilon$ and $\beta$ are hyper-parameters, and $\hat{A}_{g,k}^t$ is the $k$-th token-level advantage.
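The per-token quantities inside Eq. (10), and the KL estimator it relies on, can be sketched as follows, working from log-probabilities. This is a single-token illustration under the paper's definitions, not the authors' implementation.

```python
import math

def clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for one token,
    where rho = pi_theta / pi_theta_old (computed from log-probs)."""
    rho = math.exp(logp_new - logp_old)
    clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
    return min(rho * advantage, clipped * advantage)

def kl_k3(logp_theta, logp_ref):
    """Unbiased per-token KL estimator nu - log(nu) - 1 with
    nu = pi_ref / pi_theta; non-negative, zero when the policies agree."""
    nu = math.exp(logp_ref - logp_theta)
    return nu - math.log(nu) - 1.0
```

Taking the min with the clipped ratio caps how far a single update can push the policy for any one token, which is the same trust-region intuition as PPO's clipped surrogate.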
We estimate the KL divergence with the following unbiased estimator [28]:

$$D_{\mathrm{KL}}[\pi_\theta \,\|\, \pi_{\mathrm{ref}}] = \nu_{g,k}^t(\theta) - \log \nu_{g,k}^t(\theta) - 1, \quad (11)$$

where $\nu_{g,k}^t(\theta) = \frac{\pi_{\mathrm{ref}}(o_{g,k}^t \mid q, o_{g,<k}^t)}{\pi_\theta(o_{g,k}^t \mid q, o_{g,<k}^t)}$ and $\pi_{\mathrm{ref}}$ is the reference model. We present the pseudocode in the supplementary materials.

5 Experiments

We experimentally assess generation quality across both the step-by-step sketching procedure and the final output, comparing against state-of-the-art methods. We further validate the contribution of our multi-turn process-reward GRPO training through ablation studies. Evaluation is conducted using both automatic metrics and user studies.

5.1 Experimental Setup

Training data We follow an established practice in two-stage LLM training pipelines [5, 37] that uses separate data for SFT and RL to prevent imitation bias, which has been found [18] to reduce exploration potential at the RL stage. We reserve the high-quality (and relatively costly to create) ControlSketch-Part dataset for RL, and prepare an alternative dataset for the SFT stage. The SFT training dataset is obtained with the same pipeline described in Sec. 3.1 but annotated with Gemini 2.5 Flash, a model 6.7× cheaper than Gemini 3.0 Pro.

Implementation details We fine-tune Qwen3-VL-30B-A3B [2] as the backbone of our sketching agent. For both stages, LoRA [17] with rank 64 is used for fine-tuning. We run SFT training with a learning rate of 2e-4 and a batch size of 128 for 5400 steps. RL training takes an additional 1000 steps with a batch size of 8, a group size of 8, and a learning rate of 3e-6. We use the reward proposed in Eq. (9) with λ = 1.0 and turn off the KL divergence loss for RL training. Adam with β1 = 0.9, β2 = 0.95, and ε = 1e-8 is used throughout the entire training. We use Thinking Machines Lab's Tinker [32] for both training stages.
Bézier curve coordinates are rounded to the nearest ten for SFT training, while the original integer coordinates are retained for RL training.

Baseline methods We benchmark our method against three methods: SketchAgent [36], Gemini 3.1 Pro, and SDXL [23] + SwiftSketch [1]. SketchAgent is a Claude Sonnet-based sequential sketch generation method through zero-shot prompting. The original paper uses Claude Sonnet 3.5, which is no longer available, so we switch to the more recent Claude Sonnet 4.5. We also compare against Gemini 3.1 Pro, one of the latest general-purpose VLMs at the time of writing, used as a direct whole-sketch generator. SwiftSketch is an image-to-sketch diffusion model. Since it requires an image as input, we first use SDXL, a text-to-image diffusion model, to generate images from text, and then apply SwiftSketch to convert them into sketches. For methods that require a single text caption, we concatenate all part descriptions.

Evaluation metrics Both automatic metrics and user studies are used to assess the visual quality of the generated sketches. Because the lengths of the concatenated text captions often exceed the maximum input length of CLIP [26], we use Long-CLIP [40], with a maximum input length of 248 tokens, to evaluate how faithful the final sketch is to the text caption. For each sketch rendering, we compute the cosine similarity of the image embedding to the embedding of the concatenated part descriptions, as a measurement of faithfulness to the text input. Note that this uses different embeddings, and thus is a different metric, than the DreamSim used in our reward mechanism (Eq. (7)). We also conducted double-blind, forced-choice user preference studies between our method and the baselines. These include two questions. In the first question, we ask the user to pick one from a pair of (whole) sketches based on the overall visual quality according to the associated text caption.
In the second question, we present a looping animation showing part-by-part generation of a pair of sketches, and ask users to choose the sketch whose generation procedure better matches the part descriptions. We ask the first question for comparison with all methods and ask both questions for comparison with SketchAgent, the only baseline capable of part-by-part generation.

[Fig. 6: Pairwise preference studies conducted between our final model (SFT + RL) and the baselines. Preference rates for our method: step quality 83.1% vs. Ours (SFT) and 70% vs. SketchAgent; final quality 84.1% vs. Ours (SFT), 77.5% vs. SketchAgent, 66.1% vs. Gemini 3.1 Pro, and 91.1% vs. SDXL+SwiftSketch. The first column is the ablation between our final method (SFT + RL) and the SFT-only variant, which demonstrates the effectiveness of RL.]

5.2 Experiment Results and Analysis

Long-CLIP cosine similarity. Fig. 5 reports the Long-CLIP cosine similarity scores of all methods and reference baselines, including ground truth (GT) and randomly generated sketches (Random). The GT value is the mean Long-CLIP cosine similarity between concatenated part descriptions and the corresponding GT sketch in ControlSketch-Part, which can be viewed as an upper bound on performance. The Random baseline is the mean Long-CLIP cosine similarity between concatenated part descriptions and sketches with randomly sampled strokes, where the number of cubic Bézier curves is sampled uniformly from [0, 32] and curve coordinates are sampled uniformly from [0, 512]. It establishes a lower bound on the metric.

Our full model (SFT + RL) achieves the best performance across all methods, surpassing the SFT-only variant and validating the contribution of both training stages. Among prior methods, SketchAgent performs best, suggesting that progressive, part-by-part generation holds a meaningful advantage over holistic approaches.
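The Random reference baseline described above can be reproduced with a short script. The sketch below is a minimal illustration under the stated sampling ranges (curve count uniform in [0, 32], coordinates uniform in [0, 512]); the helper name and SVG styling are ours:

```python
import random

def random_sketch_svg(max_curves=32, canvas=512, seed=None):
    """Build an SVG of uniformly random cubic Bezier curves.

    The number of curves is sampled uniformly from [0, max_curves] and
    every control-point coordinate uniformly from [0, canvas], matching
    the Random lower-bound baseline for the Long-CLIP metric.
    """
    rng = random.Random(seed)
    n = rng.randint(0, max_curves)
    paths = []
    for _ in range(n):
        # 4 control points (x0..x3, y0..y3) per cubic Bezier curve
        x0, y0, x1, y1, x2, y2, x3, y3 = (rng.uniform(0, canvas) for _ in range(8))
        paths.append(
            f'<path d="M {x0:.1f} {y0:.1f} C {x1:.1f} {y1:.1f}, '
            f'{x2:.1f} {y2:.1f}, {x3:.1f} {y3:.1f}" '
            'stroke="black" fill="none"/>'
        )
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{canvas}" height="{canvas}">' + "".join(paths) + "</svg>")
```

Scoring many such renderings against the concatenated part descriptions then gives the Random floor that the reported Long-CLIP scores are compared against.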
Gemini 3.1 Pro, despite its general strength, falls short of specialist agents, highlighting that text-to-vector sketch generation remains a challenging domain where task-specific training still matters. SDXL + SwiftSketch trails all baselines, as errors from the text-to-image stage compound in the subsequent image-to-sketch generation. While the numerical range for all the non-random methods is fairly compressed, our user studies and qualitative results confirm that the metric differences correspond to meaningful visual quality distinctions.

User studies. We conduct user studies via Prolific, an online crowdsourcing platform. For step-quality preference, we collect 426 responses per baseline comparison from 142 participants, and for final-output-quality preference, 560 responses per baseline comparison from 146 participants, totalling 3,092 pairwise comparisons. Fig. 6 reports the percentage of comparisons in which our method was preferred. Across both evaluation settings and all baselines, participants consistently favored our results.

Qualitative results. Fig. 7 shows our generated sketches alongside those of the baselines. Our sketches tend to contain smooth paths and have a natural style with identifiable, meaningful parts. SketchAgent produces relatively clean part structure but favors simple geometric primitives and symmetric layouts, which limits visual quality. In a portion of outputs it produces misplaced or distorted components (e.g., car, dog, rabbit). Gemini 3.1 Pro shares a similar preference for simple geometries and symmetric layouts, and occasionally fails to produce a complete object (e.g., car). It also struggles to capture the distinguishing features of certain animals (e.g., bear, cat, and dog). SDXL + SwiftSketch can produce smooth, naturalistic sketches (e.g., bike and car), but is hindered by SDXL's limited ability to adhere to long text inputs, which often causes it to miss details or misinterpret compositional relationships. SwiftSketch further degrades when the image generated by SDXL is low quality or lacks a clear foreground subject. Note that class labels are shown purely for reference and are not used by our model or training process in any way. We include more sketch examples of Ours (SFT + RL) in Fig. 8. More progressive editing examples can be found in Fig. 9.

[Fig. 7: Qualitative comparison of Ours (SFT+RL), SketchAgent, Gemini 3.1 Pro, and SDXL+SwiftSketch across 15 categories (angel, astronaut, bear, bicycle, car, cat, chair, crab, dog, fish, horse, rabbit, robot, sculpture, woman). Part-by-part generated samples are color-coded to illustrate different parts; one-shot generations are rendered in black. Samples in each group are generated with the same text input. Our model and training process do not rely on the class labels in any way; we show them only for reference.]

[Fig. 8: Example outputs from Ours (SFT + RL) trained on ControlSketch-Part. Our model and training process do not rely on the class labels in any way; we show them only for reference.]

[Fig. 9: Additional progressive editing examples. Left: identical part descriptions with different initial canvases lead to different outputs. Right: changing the description for an early part but keeping the subsequent part descriptions the same produces two sketches with significant differences localized to the affected part.]

Table 1: Average Long-CLIP scores across different ablation configurations, using Qwen2.5-VL-3B.
Method                              | Long-CLIP ↑
Single-turn RL                      | 0.281
Multi-turn outcome-reward           | 0.286
Multi-turn process-reward (ours)    | 0.298

5.3 Ablation Study

The performance gap (Fig. 5, Fig. 6) between Ours (SFT + RL) and Ours (SFT) shows the benefit of RL training. We further investigate the usefulness of specific strategies in our RL training. We ablate our multi-turn process-reward GRPO formulation against a single-turn GRPO baseline and a multi-turn outcome-reward GRPO baseline. The single-turn baseline treats the entire sketching process as a single completion with only a terminal reward on the final rendering. The multi-turn outcome-reward GRPO baseline uses the reward of the final rendering to compute advantages for all the steps, whereas our multi-turn process-reward GRPO setup uses intermediate-state rewards at each step, enabling dense credit assignment over the evolving canvas. We run the controlled ablation on Qwen2.5-VL-3B [3] and observe that multi-turn GRPO setups outperform single-turn GRPO, while the process reward outperforms the outcome reward, as shown in Table 1. The ablation study suggests that both the multi-turn formulation and dense process-level rewards are important contributors to our model's final performance, each providing complementary benefits.
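The group-relative advantage estimation shared by all three GRPO variants can be sketched as follows. This is a minimal illustration of the standard GRPO normalization [30], assuming scalar rewards: in the process-reward setup each step's intermediate reward is normalized across the group, while the outcome-reward variant broadcasts the final reward to every step. The function name is ours.

```python
def group_relative_advantages(rewards):
    """Normalize rewards within a group of G sampled trajectories.

    rewards: list of G scalar rewards for the same prompt (and, in the
    process-reward setting, the same step). Returns (r - mean) / std,
    the group-relative advantage; no learned value function is needed.
    """
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    if std == 0.0:  # guard against a zero-variance group
        std = 1.0
    return [(r - mean) / std for r in rewards]
```

Normalizing within the group is what lets GRPO drop the critic: trajectories are ranked only against their siblings sampled from the same prompt.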
With this data in hand, we trained a VLM agent using a two-stage SFT+RL framework: SFT grounds the agent in the output format and initializes the sketching policy for a single turn, while a novel multi-turn process-reward GRPO stage optimizes visual quality via intermediate visual rewards, closing the distribution gap between oracle training states and free-form inference. The resulting agent can generate structured sketches one part at a time. It outperforms prior methods across automatic metrics and user studies, and naturally supports localized editing operations (such as removal and replacement of strokes) with high visual quality. We expect that ControlSketch-Part and the proposed training framework will serve as useful resources for future research on structured multi-turn processes that benefit from visual feedback.

References

1. Arar, E., Frenkel, Y., Cohen-Or, D., Shamir, A., Vinker, Y.: SwiftSketch: A diffusion model for image-to-vector sketch generation. In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference. SIGGRAPH Conference Papers '25, Association for Computing Machinery, New York, NY, USA (2025). https://doi.org/10.1145/3721238.3730612
2.
Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X.H., Cheng, Z., Deng, L., Ding, W., Fang, R., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, Q., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L.Y., Ren, X., yi Ren, X., Song, S., Sun, Y.C., Tang, J., Tu, J., Wan, J., Wang, P., Wang, P., Wang, Q., Wang, Y., Xie, T., Xu, Y., Xu, H., Xu, J., Yang, Z., Yang, M., Yang, J., Yang, A., Yu, B., Zhang, F., Zhang, H., Zhang, X., Zheng, B., Zhong, H., Zhou, J., Zhou, F., Zhou, J., Zhu, Y., Zhu, K.: Qwen3-VL technical report. arXiv:2511.21631 (2025), https://api.semanticscholar.org/CorpusID:283262018
3. Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-VL technical report (2025), https://arxiv.org/abs/2502.13923
4. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the International Conference on Computer Vision (ICCV) (2021)
5. Chen, H., Tu, H., Wang, F., Liu, H., Tang, X., Du, X., Zhou, Y., Xie, C.: SFT or RL? An early investigation into training R1-like reasoning large vision-language models (2025), https://arxiv.org/abs/2504.11468
6. Chin, H.Y., Shen, I.C., Chiu, Y.T., Shamir, A., Chen, B.Y.: AutoSketch: VLM-assisted style-aware vector sketch completion. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers. p. 1–11 (2025)
7. Das, A., Yang, Y., Hospedales, T., Xiang, T., Song, Y.Z.: BézierSketch: A generative model for scalable vector sketches. In: European Conference on Computer Vision. p. 632–647. Springer (2020)
8.
Das, A., Yang, Y., Hospedales, T., Xiang, T., Song, Y.Z.: SketchODE: Learning neural sketch representation in continuous time. In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=c-4HSDAWua5
9. Das, A., Yang, Y., Hospedales, T., Xiang, T., Song, Y.Z.: ChiroDiff: Modelling chirographic data with diffusion models. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=1ROAstc9jv
10. Frans, K., Soros, L., Witkowski, O.: CLIPDraw: Exploring text-to-drawing synthesis through language-image encoders. Advances in Neural Information Processing Systems 35, 5207–5218 (2022)
11. Fu, S., Tamir, N., Sundaram, S., Chai, L., Zhang, R., Dekel, T., Isola, P.: DreamSim: Learning new dimensions of human visual similarity using synthetic data. In: Advances in Neural Information Processing Systems. vol. 36, p. 50742–50768 (2023)
12. Geng, S., Cooper, H., Moskal, M., Jenkins, S., Berman, J., Ranchin, N., West, R., Horvitz, E., Nori, H.: JSONSchemaBench: A rigorous benchmark of structured outputs for language models. arXiv preprint arXiv:2501.10868 (2025)
13. Ha, D., Eck, D.: A neural representation of sketch drawings. In: International Conference on Learning Representations (2018)
14. Harada, K., Yamazaki, Y., Taniguchi, M., Kojima, T., Iwasawa, Y., Matsuo, Y.: Curse of instructions: Large language models cannot follow multiple instructions at once. OpenReview (2024), https://openreview.net/pdf?id=R6q67CDBCH
15. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020)
16. Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L.A., Welbl, J., Clark, A., et al.: Training compute-optimal large language models.
In: Proceedings of the 36th International Conference on Neural Information Processing Systems. p. 30016–30030 (2022)
17. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)
18. Kang, F., Kuchnik, M., Padthe, K., Vlastelica, M., Jia, R., Wu, C.J., Ardalani, N.: Quagmires in SFT-RL post-training: When high SFT scores mislead and what to use instead (2025), https://arxiv.org/abs/2510.01624
19. Kozea: CairoSVG: Convert your SVG files to PDF and PNG (2025), https://cairosvg.org/
20. Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., Liang, P.: Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics 12, 157–173 (2024)
21. Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., Lee, W.S., Lin, M.: Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783 (2025)
22. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., Lowe, R.: Training language models to follow instructions with human feedback (2022), https://arxiv.org/abs/2203.02155
23. Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: SDXL: Improving latent diffusion models for high-resolution image synthesis. In: The Twelfth International Conference on Learning Representations (2024)
24. Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988 (2022)
25. Qu, Z., Xiang, T., Song, Y.Z.: SketchDreamer: Interactive text-augmented creative sketch ideation. arXiv preprint arXiv:2308.14191 (2023)
26.
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. p. 8748–8763. PMLR (2021)
27. Rodriguez, J.A., Zhang, H., Puri, A., Feizi, A., Pramanik, R., Wichmann, P., Mondal, A., Samsami, M.R., Awal, R., Taslakian, P., Gella, S., Rajeswar, S., Vazquez, D., Pal, C., Pedersoli, M.: Rendering-aware reinforcement learning for vector graphics generation (2025), https://arxiv.org/abs/2505.20793
28. Schulman, J.: Approximating KL divergence (2020), http://joschu.net/blog/kl-approx.html
29. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
30. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)
31. Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)
32. Thinking Machines Lab: Tinker (2025), https://thinkingmachines.ai/tinker/
33. Villalobos, P., Ho, A., Sevilla, J., Besiroglu, T., Heim, L., Hobbhahn, M.: Position: Will we run out of data? Limits of LLM scaling based on human-generated data. In: Forty-first International Conference on Machine Learning (2024)
34. Vinker, Y., Alaluf, Y., Cohen-Or, D., Shamir, A.: CLIPascene: Scene sketching with different types and levels of abstraction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). p. 4146–4156 (October 2023)
35. Vinker, Y., Pajouheshgar, E., Bo, J.Y., Bachmann, R.C., Bermano, A.H., Cohen-Or, D., Zamir, A., Shamir, A.: CLIPasso: Semantically-aware object sketching. ACM Trans.
Graph. 41(4) (2022), https://doi.org/10.1145/3528223.3530068
36. Vinker, Y., Shaham, T.R., Zheng, K., Zhao, A., Fan, J.E., Torralba, A.: SketchAgent: Language-driven sequential sketch generation (2024), https://arxiv.org/abs/2411.17673
37. Wang, H., Unsal, M., Lin, X., Baksys, M., Liu, J., Santos, M.D., Sung, F., Vinyes, M., Ying, Z., Zhu, Z., Lu, J., de Saxcé, H., Bailey, B., Song, C., Xiao, C., Zhang, D., Zhang, E., Pu, F., Zhu, H., Liu, J., Bayer, J., Michel, J., Yu, L., Dreyfus-Schmidt, L., Tunstall, L., Pagani, L., Machado, M., Bourigault, P., Wang, R., Polu, S., Barroyer, T., Li, W.D., Niu, Y., Fleureau, Y., Hu, Y., Yu, Z., Wang, Z., Yang, Z., Liu, Z., Li, J.: Kimina-Prover preview: Towards large formal reasoning models with reinforcement learning (2025), https://arxiv.org/abs/2504.11354
38. Xing, X., Guan, Y., Zhang, J., Xu, D., Yu, Q.: Reason-SVG: Hybrid reward RL for aha-moments in vector graphics generation (2025), https://arxiv.org/abs/2505.24499
39. Xing, X., Wang, C., Zhou, H., Zhang, J., Yu, Q., Xu, D.: DiffSketcher: Text guided vector sketch synthesis through latent diffusion models. In: Thirty-seventh Conference on Neural Information Processing Systems (2023), https://openreview.net/forum?id=CY1xatvEQj
40. Zhang, B., Zhang, P., Dong, X., Zang, Y., Wang, J.: Long-CLIP: Unlocking the long-text capability of CLIP. In: European Conference on Computer Vision. p. 310–325. Springer (2024)
41. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)
42. Zhou, J., Zhou, Y., Yang, H., Xu, P., Huang, H.: StrokeFusion: Vector sketch generation via joint stroke-UDF encoding and latent sequence diffusion. arXiv preprint arXiv:2503.23752 (2025)

Appendix

Sec. A presents the pseudocode for our multi-turn process-reward GRPO training. Sec. B documents the prompt templates for our automatic data annotation pipeline.
Sec. C shows additional part-by-part sketching results of our model. Sec. D provides further examples from our ControlSketch-Part dataset. In Sec. E, we present failure cases and discuss limitations and future work.

A Pseudo Code

Algorithm 1 Multi-turn Process-reward GRPO
Input: initial policy model π_θinit; reward models r_φ; task prompts D; hyperparameters ε, β, μ
1:  policy model π_θ ← π_θinit
2:  for iteration = 1, ..., I do
3:    reference model π_ref ← π_θ
4:    for step = 1, ..., M do
5:      Sample a batch D_b from D
6:      Update the old policy model π_θold ← π_θ
7:      Sample G trajectories {{o_g^t}_{t=1}^T}_{g=1}^G ∼ π_θold(· | q) for each question q ∈ D_b
8:      Compute rewards {{r_g^t}_{t=1}^T}_{g=1}^G for each sampled output o_g^t by running r_φ
9:      Compute Â_{g,k}^t for the k-th token of o_g^t through group relative advantage estimation
10:     for GRPO iteration = 1, ..., μ do
11:       Update the policy model π_θ by maximizing the GRPO objective (Eq. (10))
Output: π_θ

B Prompt Templates for the Data Collection Pipeline

This section documents the seven prompt templates used in the automatic annotation pipeline described in the main paper.

Placeholders
• <min_parts> and <max_parts> are the minimum and maximum number of parts the agent can return, respectively.
• <rendering> represents the rendered image of the sketch.
• <diagnostic_vis> represents the rendered image of the diagnostic visualization.
• <svg_text> denotes the raw SVG source code.
• <joined_parts> expands to one line per part in the form Part1: ..., Part2: ...
• <old_parts_json> is the previous part decomposition result.
• <old_assignments_json> is the previous path assignment result.
• <critique_json> is the critique output from the previous step.
• <stepx_instruction> is the complete instruction of step x.

Step 1: Initial Part Decomposition

Prompt
<rendering>
You are given a black-and-white sketch image.
By examining the inputs, propose a set of parts that can effectively decompose the object into meaningful components.
1. Describe all visible details exhaustively in each part, with concise language. The set of parts must be collectively exhaustive and complementary.
2. When appropriate, prefer a finer-grained decomposition into meaningful parts, but do not split a single coherent part artificially.
3. Avoid high-level parts such as "a dog" or "a woman".
4. Avoid semantically meaningless parts such as "two lines" or "a curve".
5. You cannot have more than one part describing the same or overlapping component of the object. I.e. **Information about the same part MUST be a single part**.
6. Do not explicitly mention strokes, lines, dots, or marks as such, but rather what they represent in the real world.
7. Do not describe drawing marks (e.g., "lines indicating legs" or "strokes forming a wheel"); name the actual object parts directly (e.g., "legs", "wheel").
8. Ignore isolated, clearly unintended marks or strokes that do not contribute to the main object structure.
9. Do not mention colors, medium, art/style/linework, lighting/composition/camera, emotions, intent, or subjective qualities.
10. Use specific, concrete part names (e.g., "expanded wings", "long tail") and avoid vague descriptions such as "expanded structure" or "long object".
11. Do not merge two clearly separate structures into one part for brevity.
12. Be specific about quantities when clearly visible; use exact numbers (e.g., "four legs") instead of generic terms like "legs".
13. Include details about the object's orientation, posture and motion when they are clearly depicted and visually distinctive. Note: "facing left/right" should mean facing the viewer's left/right, not the object's own left/right.
The number of parts should be between <min_parts> and <max_parts>, inclusive.
Provide your output as a JSON array of strings, each string being one part description. Only return the JSON array, nothing else.

Response Schema
{
  "type": "array",
  "items": {"type": "string"},
  "minItems": <min_parts>,
  "maxItems": <max_parts>
}

Step 2: Part Critique

Prompt
<rendering>
You are auditing a previous decomposition answer.
Original task instruction: <step1_instruction>
Previous answer (JSON array of parts): <old_parts_json>
You are also provided with the original sketch image.
Please closely read the original task instruction and check whether the previous answer follows each numbered requirement in that instruction, one by one. For every violation, add an issue that explicitly references the violated requirement number(s), explains why it is violated, and suggests a concrete fix. If a requirement is satisfied, do not add an issue for it. If you believe any part is not correctly described, also add an issue for it. Focus on strict requirement compliance rather than style preference.
Return ONLY JSON matching schema.

Response Schema
{
  "type": "object",
  "properties": {
    "issues": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "type": {"type": "string"},
          "severity": {"type": "string", "enum": ["low", "medium", "high"]},
          "reason": {"type": "string"},
          "suggested_fix": {"type": "string"}
        },
        "required": ["type", "reason"]
      }
    },
    "summary": {"type": "string"},
    "should_revise": {"type": "boolean"}
  },
  "required": ["issues", "summary", "should_revise"]
}

Step 3: Part Refinement

Prompt
<rendering>
Revise the previous part decomposition using the critique.
Previous answer: <old_parts_json>
Critique JSON: <critique_json>
You are also provided with the original sketch image.
Revision rules:
- If current parts are already good, keep them unchanged.
- Otherwise edit to fix errors.
- Output <min_parts> to <max_parts> non-overlapping semantic parts.
- Use concise but descriptive phrases.
Return ONLY JSON matching schema.
Response Schema
{
  "type": "array",
  "items": {"type": "string"},
  "minItems": <min_parts>,
  "maxItems": <max_parts>
}

Step 4: Initial Path Assignment

Assume that the current sketch contains K paths.

Prompt
<rendering>
Here is the svg file of a sketch.
<svg_text>
You are also provided with the rendering image of this svg. The image contains K paths. By examining this svg code and its corresponding rendered raster image, assign each path to one of the parts provided below.
<joined_parts>
1. Return your answer in JSON format with keys Path1, Path2, ..., PathK and values being the part label (e.g., Part1).
2. Use only the provided part labels.
3. Each path must be assigned to exactly one part.
4. All K paths must be assigned, and every provided part must be used at least once.

Response Schema
{
  "type": "object",
  "properties": {
    f"Path{i}": {"type": "string", "enum": [f"Part{i}" for i in range(1, num_parts+1)]}
    for i in range(1, K+1)
  },
  "required": [f"Path{i}" for i in range(1, K+1)]
}

Step 5: Path Assignment Critique with Diagnostic Visualization

Prompt
<rendering>
<diagnostic_vis>
You are auditing a previous path-to-part assignment for an SVG sketch.
Original assignment task prompt: <step4_instruction>
Previous assignment JSON: <old_assignments_json>
Inputs provided:
- Original sketch rendering image.
- A color-coded paired image where the left panel is part descriptions and the right panel is the sketch with paths colored by assigned part.
Critique rules:
1. Check compliance against each numbered requirement in the task prompt.
2. Verify semantic correctness between part descriptions and assigned colored paths.
3. For each part, reason about whether there are any paths incorrectly assigned or missing.
4. For each issue, give concrete fix suggestions (what paths should move and why).
5. If no problems are found, return empty issues and should_revise=false.
Return ONLY JSON matching the schema.
Response Schema
{
  "type": "object",
  "properties": {
    "issues": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "type": {"type": "string"},
          "severity": {"type": "string", "enum": ["low", "medium", "high"]},
          "reason": {"type": "string"},
          "suggested_fix": {"type": "string"}
        },
        "required": ["type", "reason"]
      }
    },
    "summary": {"type": "string"},
    "should_revise": {"type": "boolean"}
  },
  "required": ["issues", "summary", "should_revise"]
}

Step 6: Path Assignment Refinement

Prompt
<rendering>
Revise the previous path-to-part assignment.
Original assignment task prompt: <step4_instruction>
Previous assignment JSON: <old_assignments_json>
Critique JSON: <critique_json>
You are also provided with the original sketch image.
Revision rules:
- Follow every requirement in the original task prompt.
- If the critique indicates no issue, keep the assignment unchanged.
- Otherwise apply minimal but sufficient edits.
- Output only a JSON object with keys Path1..PathK and values among the allowed Part labels.
Return ONLY the JSON object.

Response Schema
{
  "type": "object",
  "properties": {
    f"Path{i}": {"type": "string", "enum": [f"Part{i}" for i in range(1, num_parts+1)]}
    for i in range(1, K+1)
  },
  "required": [f"Path{i}" for i in range(1, K+1)]
}

Step 7: Caption Generation

Prompt
You are given a black-and-white sketch image. You are also provided with candidate object parts for your reference:
<joined_parts>
Write a short, strictly objective and literal caption describing the depicted objects.
1. Interpret visible marks, lines, and shapes as real-world object features, not as artistic or drawing elements.
2. Do not refer to the image as a sketch, drawing, or artwork.
3. Do not mention colors, materials, artistic style, linework, lighting, composition, camera, emotions, intent, or subjective qualities.
4. Do not add inferred, speculative, or imaginative details beyond what is directly visible.
5.
Ignore isolated, clearly unintended marks or strokes that do not contribute to the main object structure.
6. Include only essential, clearly visible, and iconic information.
7. Focus exclusively on the visual content of the image.
8. Include details about the object's orientation, posture and motion when they are clearly depicted and visually distinctive. Note: "facing left/right" should mean facing the viewer's left/right, not the object's own left/right.
9. Limit the caption to 25 words or fewer.

C Additional Part-by-Part Results

[Fig. A1: Additional part-by-part results of our model. Part descriptions and caption appear above each sketch's cumulative frames, with newly added parts color-coded to match corresponding part labels.]
[Figs. A2–A15: Additional part-by-part results of our model (continued).]

D Additional ControlSketch-Part Dataset Examples

[Tables A1–A3: Additional examples of the ControlSketch-Part dataset.]

E Failure Cases, Limitations and Future Work

E.1 Failure cases

Sketches in the ControlSketch dataset all contain a fixed number of paths. As a result, the path count reward incentivizes the agent to match the "ground-truth" path count, which may lead to premature stopping once this count is reached, even if the corresponding part is not fully drawn. This behavior can be found in the omitted right wheel in Fig. A16a. A second failure mode is erroneous topologies for unfamiliar semantic concepts. For example, the agent fails to correctly depict the "vertically oriented oval rear wheel" in Fig. A16b, a structure that is relatively rare in the dataset. In addition, while RL training substantially mitigates part-misalignment errors, occasional misplacements remain. In Fig. A16c, for example, the jacket is positioned too far to the right, creating an unnatural gap between the jacket and the upper legs.

[Fig. A16: Failure cases, panels (a)–(c).]

E.2 Limitations

The primary bottleneck of a more general sketching agent is the data. Our work is limited in its coverage of a wide variety of objects other than the ones present in the ControlSketch dataset. In addition, our agent is not yet capable of self-critique. Errors introduced in earlier stages can compound as the number of generation steps increases.
E.3 Future Work

The current pipeline is designed for generating one part at a time. In future work, a planning agent could coordinate multiple agents to generate different parts in parallel. Furthermore, as mentioned before, enabling the system to refine unsatisfactory intermediate outputs may further improve overall sketch quality. Another promising direction is to incorporate richer natural language reasoning into the generation process, for example by introducing chain-of-thought reasoning before generating each part. The sketching capabilities of our agent could also be leveraged to support visual reasoning tasks. One possible direction is to extend the agent's abilities to generate auxiliary figures for tasks such as geometry problems.