Paper deep dive
Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing
Seongrae Noh, SeungWon Seo, Gyeong-Moon Park, HyeongYeop Kang
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/22/2026, 5:55:29 AM
Summary
Edit-As-Act is a framework for open-vocabulary 3D indoor scene editing that treats the process as goal-regressive planning rather than direct generation. By utilizing EditLang, a PDDL-inspired symbolic language, the system decomposes natural language instructions into verifiable goal predicates. A planner-validator loop iteratively selects actions that satisfy these goals while ensuring physical feasibility, monotonicity, and semantic consistency, significantly outperforming existing generative and constraint-based editing paradigms.
Entities (6)
Relation Signals (3)
Edit-As-Act → evaluated_on → E2A-Bench
confidence 100% · On E2A-Bench, our benchmark of 63 editing tasks... Edit-As-Act significantly outperforms prior approaches
Edit-As-Act → uses → EditLang
confidence 100% · Edit-As-Act predicts symbolic goal predicates and plans in EditLang
Edit-As-Act → outperforms → LayoutGPT
confidence 95% · Edit-As-Act significantly outperforms prior approaches across all edit types
Cypher Suggestions (2)
List all benchmarks associated with the research. · confidence 95% · unvalidated
MATCH (e:Framework {name: 'Edit-As-Act'})-[:evaluated_on]->(b:Benchmark) RETURN b.name
Find all baseline methods compared against the Edit-As-Act framework. · confidence 90% · unvalidated
MATCH (e:Framework {name: 'Edit-As-Act'})-[:outperforms]->(b:BaselineMethod) RETURN b.name
Abstract
Editing a 3D indoor scene from natural language is conceptually straightforward but technically challenging. Existing open-vocabulary systems often regenerate large portions of a scene or rely on image-space edits that disrupt spatial structure, resulting in unintended global changes or physically inconsistent layouts. These limitations stem from treating editing primarily as a generative task. We take a different view. A user instruction defines a desired world state, and editing should be the minimal sequence of actions that makes this state true while preserving everything else. This perspective motivates Edit-As-Act, a framework that performs open-vocabulary scene editing as goal-regressive planning in 3D space. Given a source scene and free-form instruction, Edit-As-Act predicts symbolic goal predicates and plans in EditLang, a PDDL-inspired action language that we design with explicit preconditions and effects encoding support, contact, collision, and other geometric relations. A language-driven planner proposes actions, and a validator enforces goal-directedness, monotonicity, and physical feasibility, producing interpretable and physically coherent transformations. By separating reasoning from low-level generation, Edit-As-Act achieves instruction fidelity, semantic consistency, and physical plausibility, three criteria that existing paradigms cannot satisfy together. On E2A-Bench, our benchmark of 63 editing tasks across 9 indoor environments, Edit-As-Act significantly outperforms prior approaches across all edit types and scene categories.
Tags
Links
- Source: https://arxiv.org/abs/2603.17583v1
- Canonical: https://arxiv.org/abs/2603.17583v1
Full Text
95,600 characters extracted from source content.
Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing

Seongrae Noh, SeungWon Seo, Gyeong-Moon Park†, HyeongYeop Kang†
Korea University
{rhosunr99, ssw03270, gm-park, siamiz hkang}@korea.ac.kr
https://seongraenoh.github.io/edit-as-act/

Abstract

Editing a 3D indoor scene from natural language is conceptually straightforward but technically challenging. Existing open-vocabulary systems often regenerate large portions of a scene or rely on image-space edits that disrupt spatial structure, resulting in unintended global changes or physically inconsistent layouts. These limitations stem from treating editing primarily as a generative task. We take a different view. A user instruction defines a desired world state, and editing should be the minimal sequence of actions that makes this state true while preserving everything else. This perspective motivates Edit-As-Act, a framework that performs open-vocabulary scene editing as goal-regressive planning in 3D space. Given a source scene and free-form instruction, Edit-As-Act predicts symbolic goal predicates and plans in EditLang, a PDDL-inspired action language that we design with explicit preconditions and effects encoding support, contact, collision, and other geometric relations. A language-driven planner proposes actions, and a validator enforces goal-directedness, monotonicity, and physical feasibility, producing interpretable and physically coherent transformations. By separating reasoning from low-level generation, Edit-As-Act achieves instruction fidelity, semantic consistency, and physical plausibility, three criteria that existing paradigms cannot satisfy together. On E2A-Bench, our benchmark of 63 editing tasks across 9 indoor environments, Edit-As-Act significantly outperforms prior approaches across all edit types and scene categories.

1. Introduction

Training and evaluating embodied agents increasingly depends on environments that can be modified with precision.
† Corresponding authors.

While recent models can synthesize open-vocabulary indoor scenes with remarkable diversity, their usefulness is limited without an equally reliable way to edit those scenes in response to high-level goals. Tasks such as rearranging furniture, preparing simulation curricula, or adapting a space for downstream interaction require more than scene generation. They require transformations that are intentional, localized to the relevant region, and physically valid.

Achieving this level of control requires meeting three criteria at once. The edit must follow the instructions. It must preserve the rest of the scene. It must remain physically plausible. As shown in Tab. 1, representative systems from layout generation, constraint optimization, and image-based editing each satisfy only part of this triad. They often drift semantically, modify regions that should remain unchanged, or violate geometric constraints. This pattern indicates that these limitations arise from the underlying paradigms rather than from specific implementations.

Motivated by both embodied agents, which learn through sequential decision-making, and classical automated planning [9, 13, 16, 29], which frames tasks as goal satisfaction through structured actions, we view scene editing as an inherently goal-driven process. A natural-language instruction implicitly defines a desired world state. The correct edit is the minimal sequence of actions that achieves this state while remaining consistent with the geometry of the original scene. This joint perspective from embodied intelligence and symbolic planning naturally leads to a backward formulation, where reasoning starts from the desired outcome and works in reverse to identify the necessary actions and their preconditions.

We propose Edit-As-Act, a framework that performs open-vocabulary 3D scene editing through goal-regressive reasoning.
Natural-language instructions are mapped to symbolic goal predicates expressed in EditLang, a PDDL-inspired editing language that we design specifically for this task. It defines atomic edit actions through explicit preconditions and effects.

arXiv:2603.17583v1 [cs.CV] 18 Mar 2026

Table 1. Comparison of three essential requirements for different scene editing paradigms. LayoutGPT represents direct 3D layout editing. AnyHome represents constraint-based optimization. ArtiScene represents image-driven editing followed by 3D lifting.

| Method | Instruction Fidelity | Semantic Consistency | Physical Plausibility |
|---|---|---|---|
| LayoutGPT [8] | ✓ | ✗ | ✗ |
| AnyHome [12] | ✗ | ✗ | ✓ |
| ArtiScene [15] | ✗ | ✓ | ✗ |
| Edit-As-Act (Ours) | ✓ | ✓ | ✓ |

Two large language model-driven modules then carry out backward planning: a planner proposes actions that satisfy current goals, while a validator enforces goal-directedness, monotonicity, and geometric feasibility. Unmet preconditions recursively generate subgoals, and the process continues until all requirements hold in the source scene. We formulate 3D scene editing as goal satisfaction over state transitions rather than global re-generation, enabling executable and verifiable editing through source-aware regression.

By grounding edits in symbolic structure and verifiable geometry, Edit-As-Act produces interpretable, physically consistent, and semantically faithful transformations beyond what generative or optimization-based methods can ensure. To assess these capabilities, we introduce E2A-Bench, a set of 63 open-vocabulary editing tasks across 9 indoor environments that measure instruction alignment, semantic stability, and physical realism.

Our contributions can be summarized as follows:
• Reasoning-driven editing framework: Introducing Edit-As-Act, the first framework to cast open-vocabulary 3D scene editing as a sequential, goal-regressive reasoning problem.
• Verifiable symbolic action language: Developing EditLang, a symbolic editing language with explicit preconditions and effects that guarantee logical coherence and spatial validity.
• Open-vocabulary benchmark: Creating E2A-Bench, a suite of 63 editing tasks across 9 indoor environments for standardized evaluation of semantic alignment, edit reliability, and physical realism.

2. Related Work

2.1. Data-Driven Scene Layout Editing

Many approaches perform scene editing using data-driven generators trained on layout or scene-graph datasets such as 3D-FRONT [11]. Methods including DiffuScene [28], InstructScene [20], EditRoom [35], and EchoScene [34] diffuse over discrete scene structures to produce text-conditioned layouts. These models generate visually plausible edits when instructions remain within the training distribution, but their performance degrades for novel rooms or compositional queries. Because edits occur through a single forward generative step, local modifications can induce unintended global changes. Physical validity is also only implicit; for instance, EditRoom [35] reports that erroneous language model commands may produce colliding or unsupported objects.

2.2. Constraint-Based Scene Layout Synthesis

Other methods convert language into spatial relations or optimization objectives. Earlier work mapped descriptions to fixed scene graphs for retrieval [21]. Recent systems such as Holodeck [32], I-Design [4], LLPlace [31], AnyHome [12], and LayoutVLM [27] use language models to derive placement rules or constraints and then optimize layouts to satisfy them. These approaches are physically grounded but often re-optimize whole rooms, causing non-target objects to shift and reducing scene-level consistency [4, 32]. When constraints conflict, solvers may settle for partially satisfied solutions, lowering instruction fidelity.
LLM-based planners such as LayoutGPT [8] also struggle with 3D spatial reasoning, leading to misaligned placements.

2.3. 2D-to-3D Image-Based Editing

A third direction edits scenes by generating an instruction-consistent image and lifting it to 3D. ArtiScene [15], Text2Room [17], and SceneScape [10] reconstruct geometry from edited images, while ControlRoom3D [26] and Ctrl-Room [7] add geometric guidance. These approaches inherit strong 2D priors but lack explicit 3D reasoning, often producing structural artifacts or physically implausible layouts [15, 17]. Edits made in image space are also difficult to localize cleanly and can introduce unintended changes in unedited regions.

Generative diffusion, constraint-based optimization, and image-driven lifting each contribute valuable insight to language-conditioned scene editing, but all face challenges: reliance on training data manifolds, difficulty maintaining edit locality, and limited explicit reasoning about 3D structure and physics. Our work addresses these limitations by formulating scene editing as goal-regressive planning in a symbolic action space.

3. Problem Definition

We reformulate open-vocabulary indoor scene editing as a goal-regressive planning problem inspired by classical STRIPS reasoning [9, 13, 16]. This formulation turns free-form language instructions into verifiable symbolic goals, defining an editing problem through the symbolic vocabulary, available actions, initial scene, and desired post-edit conditions.

Figure 1. Overview of Edit-As-Act. Step 1: an LLM converts a source scene S_0 and instruction into symbolic goal predicates G_T in EditLang. Step 2: a planner–validator loop iteratively selects EditLang actions a_t that satisfy goals G_t and regresses remaining goals until all are grounded in S_0. Step 3: the resulting action sequence is applied to S_0 to obtain the edited scene S_T. (Example instruction: "Swap the positions of the sideboard on the right and the bookcase on the left.")

We use a PDDL-style [1] vocabulary of predicates and actions. Let F be the finite set of ground predicates describing facts in the 3D world (e.g., on(chair,floor) or facing(chair,table)). A symbolic state s is any subset of F, with the source scene encoded as s_0. Goal conditions G ⊆ F are derived from the instruction I. The available edits form a finite action set A, where each action a is a triplet ⟨pre(a), add(a), del(a)⟩. Preconditions specify when a is executable, add(a) describes predicates it establishes, and del(a) removes incompatible ones.
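The action triplet and the state machinery above can be sketched in a few lines of Python. This is a minimal illustration of the STRIPS-style formulation, not the paper's implementation; the predicate strings and the `move_onto` action are hypothetical examples.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    """A STRIPS-style action a = <pre(a), add(a), del(a)>."""
    name: str
    pre: frozenset   # predicates that must hold before execution
    add: frozenset   # predicates the action establishes
    dele: frozenset  # incompatible predicates it removes ("del" is reserved)

def apply(state: frozenset, a: Action) -> frozenset:
    """State update s' = (s \\ del(a)) ∪ add(a), legal only when pre(a) ⊆ s."""
    assert a.pre <= state, f"preconditions of {a.name} unmet"
    return (state - a.dele) | a.add

# Hypothetical source state and action
s0 = frozenset({"on(chair,floor)", "clear(table)"})
move = Action(
    name="move_onto(chair, table)",
    pre=frozenset({"on(chair,floor)", "clear(table)"}),
    add=frozenset({"on(chair,table)"}),
    dele=frozenset({"on(chair,floor)", "clear(table)"}),
)
s1 = apply(s0, move)  # {"on(chair,table)"}
```

Because states are plain sets of ground predicates, executing a plan is just a left fold of `apply` over the action sequence.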
Mutually exclusive relations are automatically removed (for example, removing on(x,y) before adding on(x,z)). Executing a updates the state as

s' = (s \ del(a)) ∪ add(a).    (1)

The goal is to find a sequence π = ⟨a_1, ..., a_T⟩ that transforms s_0 into a target state satisfying all goals in G.

Classical STRIPS regression [13, 16] works backward from the goal G to derive the minimal subgoals. However, directly applying it to 3D scenes causes redundant reasoning because many goals are already satisfied in s_0. We therefore introduce a source-aware regression operator that propagates only unsatisfied conditions:

Regress*(G, a; s_0) = (G \ add(a)) ∪ (pre(a) \ s_0).    (2)

This preserves the logical rigor of STRIPS while avoiding unnecessary reconstruction of scene aspects already satisfied in the source, forming the formal basis for our method.

4. Method

4.1. Framework Overview

Edit-As-Act performs open-vocabulary 3D scene editing as a process of goal-regressive reasoning rather than direct generation. Given a natural-language instruction and a source scene, the system constructs a verifiable sequence of edit actions that transforms the source into a target state satisfying all goal conditions. Two large language model-based modules, a planner and a validator, operate within our EditLang domain to propose actions, regress goals, and check logical and physical validity. Before planning, we evaluate all EditLang predicates on the source scene S_0 under a closed-world assumption to obtain the initial symbolic state s_0. Conceptually, actions update symbolic states via the STRIPS transition s' = (s \ del(a)) ∪ add(a) defined in Sec. 3, while in our implementation we recompute predicates from the updated geometry after each accepted edit to keep symbols and the 3D scene aligned. This backward, source-aware process yields minimal edits, preserves semantic context and physical plausibility, and avoids layout hallucination. Our framework is summarized in Fig. 1.

4.2. EditLang

EditLang provides the symbolic foundation of Edit-As-Act. It defines a PDDL-style domain tailored for open-vocabulary scene editing, consisting of predicates and actions that bridge language instructions and geometric reasoning. Predicates capture geometric, topological, and physical relations such as supported(x,y), contact(x,y), clear(x), stable(x), colliding(x,y), and reachable(x), all evaluated directly from s_0. Actions are atomic and deterministic, defined as ⟨pre, add, del⟩ as introduced in Sec. 3. EditLang uses the same symbolic vocabulary across tasks, but unlike traditional benchmark planning domains with a fixed, hand-designed object set, it is instantiated per scene by dynamically binding typed variables to concrete objects in the source scene. This instantiation allows the domain to reason over unseen object categories and layouts while keeping the set of ground predicates finite. All objects and their attributes are registered and accessible at plan time. Beyond geometric rearrangement, EditLang also supports two non-geometric edit primitives, Add and Stylize. Add inserts assets from a catalog associated with the source scene into collision-free, supported locations that respect room-specific constraints. Stylize modifies appearance-level attributes such as material or color by updating an object's description, while leaving its geometric configuration unchanged. Together with the rearrangement actions, these primitives cover position, existence, and appearance edits in a single symbolic language. See Supplementary Sec. 4 for a formalization of EditLang.

4.3. Planner Module

The planner P translates an instruction I and source scene S_0 into an initial goal predicate set G_T written in EditLang. At each step t, P receives G_t, s_0, and the partial plan ⟨a_1, ..., a_{t-1}⟩ and proposes a single action a_t = ⟨pre(a_t), add(a_t), del(a_t)⟩ that satisfies at least one goal in G_t.
We prompt P to produce minimal but sufficient preconditions pre(a_t) that guarantee physical executability across scenes. This is enforced through geometric checks for collision, support, and stability with fixed numeric tolerances in scene units. The planner submits a_t to the validator and only revises the proposal in response to validator feedback; in practice, we cap the number of revisions per step at three. Please see Supplementary Sec. 5 for implementation details and pseudo-code of the Planner module.

4.4. Validator Module

The validator V evaluates each proposed action a_t on four criteria that directly operationalize our desiderata in Sec. 1.

Goal directedness. The added predicates make progress on the current goals: add(a_t) must satisfy at least one element of G_t. This prevents tangential edits that do not contribute to the instruction.

Monotonicity. The action must not undo progress on goals that have already been achieved. Let G_sat(≤ t) denote the subset of goals satisfied before step t; V enforces del(a_t) ∩ G_sat(≤ t) = ∅. Together with the finite EditLang state space, this monotonicity constraint rules out cycles and guarantees that the regression loop terminates in finitely many steps.

Contextual consistency. When the first two criteria hold, V checks that the resulting configuration remains plausible with respect to the source scene and room-specific constraints (for example, seating around a table or clearance in front of doors), capturing semantic coherence beyond a pure checklist of predicates.

Formal validity. Finally, V verifies that the action conforms to the EditLang schema: predicate and argument types are well formed, mutually exclusive relations are updated consistently, and action names and parameters match the domain specification.
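The two purely set-theoretic validator checks (goal directedness and monotonicity) and the surrounding planner-validator regression loop can be sketched as below. This is a simplified illustration under stated assumptions: `propose_action` stands in for the LLM planner and is hypothetical, and the contextual-consistency and formal-validity checks (which need geometry and the EditLang schema) are omitted.

```python
from collections import namedtuple

def goal_directed(add: frozenset, goals: frozenset) -> bool:
    """add(a_t) must satisfy at least one current goal."""
    return bool(add & goals)

def monotone(dele: frozenset, satisfied: frozenset) -> bool:
    """del(a_t) must not undo achieved goals: del ∩ G_sat = ∅."""
    return not (dele & satisfied)

def plan(goals, s0, propose_action, max_steps=50):
    """Regress goals backward until all are grounded in s0,
    then return the plan in forward-executable order."""
    plan_rev, satisfied = [], frozenset()
    G = goals - s0                        # goals already true need no action
    for _ in range(max_steps):
        if not G:
            return list(reversed(plan_rev))
        a = propose_action(G, s0, plan_rev)
        if not (goal_directed(a.add, G) and monotone(a.dele, satisfied)):
            continue                      # refusal fed back to the planner
        satisfied |= a.add & G
        G = (G - a.add) | (a.pre - s0)    # source-aware regression, Eq. (2)
        plan_rev.append(a)
    raise RuntimeError("no plan found within step budget")

# One-step demo with a hypothetical action tuple
Act = namedtuple("Act", "pre add dele")
demo = Act(pre=frozenset({"clear(table)"}),
           add=frozenset({"on(lamp,table)"}),
           dele=frozenset())
result = plan(frozenset({"on(lamp,table)"}),
              s0=frozenset({"clear(table)"}),
              propose_action=lambda G, s0, partial: demo)
```

The loop terminates for the same reasons given in the paper: the ground predicate set is finite and rejected non-monotone actions can never shrink the satisfied-goal set.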
We maintain a set of domain invariants I that includes collision freedom, a single stable support per object, and wall or floor attachment rules; V rejects any action that violates I, even if it is otherwise goal-directed. On failure, V returns a refusal with a brief natural-language explanation, which is fed back to the planner. On success, V accepts a_t and passes it to the regression step. See Supplementary Sec. 6 for implementation details of the Validator module and additional failure cases.

4.5. Source-Aware Goal Regression

After acceptance, goals are updated by source-aware regression with the source state,

G_{t-1} = (G_t \ add(a_t)) ∪ (pre(a_t) \ s_0).

We assume disjoint effects add(a) ∩ del(a) = ∅ and consistent variable bindings, and remove mutually exclusive predicates via del before applying add. This preserves the logic of classical STRIPS regression while avoiding reconstruction of parts of the scene that are already satisfied in the source state. The planner then proceeds with the updated goal set G_{t-1}.

4.6. Execution and Termination

The planning process continues until G_t = ∅, indicating that all goal conditions have been regressed back to the source scene. The backward plan ⟨a_T, ..., a_1⟩ is then reversed and executed by a deterministic Python DSL runtime that invokes the corresponding implementation for each EditLang action. After each invocation, predicates are recomputed from geometry before proceeding to the next step, ensuring that the symbolic state and 3D scene remain consistent. Because EditLang induces a finite set of ground predicates over the objects of the source scene and the validator rules out cyclic or non-monotone updates, the overall planning loop always terminates in a finite number of steps. Examples of planning scenarios, the full EditLang specification, and the DSL implementations are provided in the supplementary material.

5. Experiments

5.1. Benchmark Setup

We evaluate Edit-As-Act on an open-vocabulary indoor editing benchmark designed to test reasoning rather than pattern matching. The benchmark contains 63 editing tasks across 9 diverse indoor environments (bathrooms, bedrooms, computer rooms, dining rooms, game rooms, kids' rooms, kitchens, living rooms, and offices). Each task provides a source 3D layout with geometry and object metadata (pose, category labels, front face orientation), along with a free-form language instruction. Instructions range from simple rearrangements (e.g., "move the table near the sofa") to multi-step compositional edits (e.g., "rotate the chair to face the window and place a lamp beside it").

Unlike prior evaluations that focus on reproducing target layouts, our benchmark is constructed to stress goal-conditioned reasoning, symbolic interpretability, and spatial generalization across unseen scene configurations. It tests whether a model can understand an instruction, identify the minimal required changes, and update the scene without disturbing unrelated context. See the supplementary material for details on the dataset generation pipeline.

5.2. Baseline Methods and Metrics

We compare Edit-As-Act against three state-of-the-art baselines, each representing a different editing paradigm. LayoutGPT-E [8] performs forward reasoning in layout space: we provide the source layout directly in the prompt, and the model outputs an edited layout. For add and stylize operations, it is further prompted to describe the inserted or modified objects so that the edited layout can be reconstructed consistently. AnyHome [12] is a constraint-based optimization framework that converts language instructions into symbolic spatial constraints and solves for a layout that satisfies them. Because AnyHome is originally designed for multi-room floorplan synthesis, we disable all modules that operate across rooms. ArtiScene-E [15] represents an image-first editing pipeline.
The source scene is rendered with fixed camera and lighting. A pretrained text-to-image model generates an edited image conditioned on the instruction. ArtiScene then lifts this edited image into a structured 3D scene.

We evaluate all methods with three complementary metrics that assess different aspects of editing quality:
1. Instruction Fidelity (IF): An LVLM-based score that measures how faithfully the edited scene satisfies the explicit instruction.
2. Semantic Consistency (SC): An LVLM-based score that evaluates whether the non-targeted regions of the scene remain unchanged.
3. Physical Plausibility (P): An LVLM-based score that rates the overall physical plausibility, including collisions, support relations (e.g., objects resting on appropriate supports such as floors, tables, or shelves), and stability (e.g., furniture not floating or tipping over in implausible ways).
Details of the prompts are provided in the supplementary material.

5.3. Implementation Details

We ensure a fair comparison by standardizing the model backbone, asset pipeline, and text-to-3D generation across all methods. LayoutGPT-E, AnyHome, and our planner–validator modules all rely on the OpenAI GPT-5 API [22] with identical decoding parameters. Whenever a method outputs a textual description for object addition or stylization, we generate the corresponding image with Gemini 2.5 Flash Image and convert it into consistent 3D geometry using Hyper3D Gen-2 V1.8 [19], identical to the pipeline used to construct the source scenes. For ArtiScene-E, the edited view is likewise produced with the same text-to-image interface before being lifted. This unified pipeline ensures that performance differences arise not from external generative modules but from each method's reasoning strategy, constraint handling, and editing mechanism.

5.4. Quantitative Evaluation

Across all 63 editing tasks, Edit-As-Act achieves the strongest overall performance in IF, SC, and P. Results are shown in Tab. 2.

Table 2. Quantitative comparison across nine scene categories. Edit-As-Act achieves the strongest and most consistent performance in instruction fidelity (IF), semantic consistency (SC), and physical plausibility (P), demonstrating robust generalization across diverse spatial configurations.

| Method | Bathroom (IF / SC / P) | Bedroom (IF / SC / P) | Computer Room (IF / SC / P) | Dining Room (IF / SC / P) | Game Room (IF / SC / P) |
|---|---|---|---|---|---|
| LayoutGPT-E | 50.1 / 62.5 / 78.3 | 35.7 / 30.2 / 53.8 | 53.8 / 32.1 / 84.7 | 50.5 / 31.4 / 66.2 | 52.1 / 48.9 / 83.5 |
| AnyHome | 69.7 / 56.7 / 81.7 | 64.0 / 62.6 / 82.7 | 59.0 / 56.3 / 90.7 | 57.1 / 66.9 / 76.6 | 58.3 / 67.1 / 91.4 |
| ArtiScene-E | 61.7 / 73.4 / 83.7 | 43.0 / 39.0 / 90.4 | 41.9 / 41.0 / 90.6 | 37.3 / 37.1 / 90.4 | 61.3 / 54.4 / 89.4 |
| Edit-As-Act (ours) | 58.7 / 88.9 / 89.1 | 45.7 / 73.1 / 91.9 | 73.6 / 88.0 / 94.1 | 89.7 / 95.3 / 92.7 | 58.7 / 79.9 / 91.1 |

| Method | Kids Room (IF / SC / P) | Kitchen (IF / SC / P) | Living Room (IF / SC / P) | Office (IF / SC / P) | Average (IF / SC / P) |
|---|---|---|---|---|---|
| LayoutGPT-E | 28.3 / 30.1 / 88.6 | 59.4 / 58.2 / 88.3 | 38.6 / 40.7 / 83.9 | 32.1 / 45.4 / 77.1 | 42.3 / 48.8 / 78.6 |
| AnyHome | 72.0 / 82.0 / 92.7 | 49.3 / 51.0 / 82.9 | 44.7 / 58.1 / 81.6 | 44.6 / 43.9 / 80.4 | 57.6 / 60.5 / 84.5 |
| ArtiScene-E | 35.0 / 37.1 / 92.4 | 65.6 / 67.6 / 94.1 | 48.9 / 68.1 / 89.1 | 40.0 / 42.9 / 92.9 | 48.3 / 51.2 / 90.3 |
| Edit-As-Act (ours) | 91.1 / 89.0 / 96.3 | 55.0 / 92.3 / 93.7 | 72.9 / 90.1 / 93.6 | 76.4 / 82.9 / 81.9 | 69.1 / 86.6 / 91.7 |

Instruction Fidelity. Tab. 2 reveals that editing instructions challenge all baselines. LayoutGPT-E struggles due to one-shot generation and often fails to react to multi-predicate instructions. AnyHome performs moderately but sometimes satisfies constraints in its abstract graph while failing to align all object placements in 3D. ArtiScene-E performs well on P due to strong image priors, but its weaker scores on IF and SC stem from the text-to-image stage, which often produces edits that are conservative and insufficiently responsive to the instruction, resulting in outputs that remain overly close to the input image.

Semantic Consistency. As shown in Tab. 2, our SC scores reach 85 to 95 across kitchens, living rooms, and bedrooms, while LayoutGPT-E and AnyHome degrade sharply in scenes with many small objects.
This pattern reflects a major limitation of forward generative and constraint-optimization approaches: small changes often propagate globally, leading to context drift. Edit-As-Act avoids this failure mode through localized symbolic reasoning, modifying only what the goal demands.

Physical Plausibility. Edit-As-Act consistently delivers physically plausible configurations even for large geometric edits. In Tab. 2, our P scores remain near or above 90 across most categories, matching or outperforming baselines. Since every step in our plan is validated by explicit geometric checks, collisions and unstable placements are rarely introduced.

ArtiScene-E reports competitive P because the generated 2D rendering often avoids visible collisions from the chosen viewpoint. However, these scores do not reflect full scene stability. When the edited render is lifted into 3D, many placements lack valid support or produce hidden interpenetrations that are not penalized by a single-view LVLM evaluator. As a result, ArtiScene's P deteriorates in cluttered or occluded regions, where 2D edits fail to encode full geometric constraints.

Overall, the quantitative results reveal a clear structural pattern: forward generation struggles with compositional reasoning, constraint solving loses context during layout optimization, and image-to-3D lifting inherits ambiguity from the 2D model. By contrast, Edit-As-Act's goal-regressive reasoning provides a principled path that integrates instruction alignment, locality, and physical validity in a way that existing paradigms cannot. We provide a fine-grained analysis by edit operation type in the supplementary material.

Table 3. Comparison with contemporary reasoning baselines.

| Method | IF↑ | SC↑ | P↑ | Latency (s) | Avg. Calls | Avg. Tok |
|---|---|---|---|---|---|---|
| GPT-5 | 49.6 | 52.3 | 73.3 | 18.9 | 1.0 | 3357 |
| Gemini-3-Pro-preview | 43.1 | 48.7 | 71.7 | 47.7 | 1.0 | 6959 |
| Claude-4.5-opus | 50.2 | 43.5 | 68.5 | 11.3 | 1.0 | 3514 |
| SceneWeaver | 68.7 | 78.3 | 82.1 | 102.5 | 7.3 | 18335 |
| Edit-As-Act (ours) | 69.1 | 86.6 | 91.7 | 87.2 | 5.9 | 19056 |
Comparison with Reasoning-based Baselines. Beyond the representative editing baselines above, we compare Edit-As-Act with two strong reasoning baselines. First, we evaluate direct reasoning with GPT-5 [22], Gemini-3-Pro-preview [14], and Claude-4.5-opus [3], each prompted with the task specification, source scene image, and scene metadata to predict an edit plan in a single pass. Second, we compare with SceneWeaver [30], an iterative action-and-reflection framework adapted to E2A-Bench. For a controlled comparison, both SceneWeaver and Edit-As-Act use GPT-5 as the underlying model. As shown in Tab. 3, Edit-As-Act substantially outperforms direct reasoning in instruction fidelity, semantic consistency, and physical plausibility. Compared with SceneWeaver, Edit-As-Act achieves higher semantic consistency and physical plausibility while using fewer model calls at a comparable token budget. These results further support our claim that goal-regressive planning provides a stronger editing prior than either direct reasoning or iterative forward editing.

Figure 2. Representative qualitative results. Baseline methods often introduce unintended global changes, fail to satisfy multi-step instructions, or generate incomplete edits. Edit-As-Act produces precise, instruction-aligned modifications that remain physically valid and preserve the overall scene identity. (Example instructions: "Rotate 90 degrees clockwise the conference table and all its conference chairs, change the color of the conference table to a dark wood finish, remove the coffee table and two lounge chairs." and "Translate the armchairs and nearby side tables closer to the sectional sofa, rearrange the floor lamp to the opposite side of the room, and scale the coffee table to twice its current size.")

5.5. Qualitative Evaluation

Fig. 2 shows the qualitative differences among the three paradigms.
The baselines often succeed on isolated subgoals but fail to keep edits localized, producing global shifts or incomplete transformations. LayoutGPT-E frequently reshapes nearby furniture because its one-shot generation lacks explicit constraints. AnyHome satisfies high-level spatial constraints but frequently alters unrelated regions during re-optimization. ArtiScene-E preserves the scene appearance but underreacts to multi-step instructions, so compositional edits are partly applied or entirely missed.

5.6. Ablation Study

Tab. 4 reveals how each component contributes to the full system. Removing the validator leads to the largest drop in semantic consistency and physical plausibility.

Table 4. Ablation study showing the impact of each component.
Variant              | IF   | SC   | P
Edit-As-Act (ours)   | 69.1 | 86.6 | 91.7
w/o Validator        | 55.3 | 75.1 | 86.0
w/o Source-Awareness | 58.2 | 75.1 | 89.2
w/o EditLang         | 55.4 | 73.6 | 88.3
Forward Planning     | 61.2 | 78.7 | 90.3
Coord. Prediction    | 52.8 | 68.1 | 85.5

Without explicit validation, the planner occasionally proposes actions that satisfy the instruction textually but break spatial invariants or undo earlier goals, confirming that symbolic checking is essential for stable planning. Disabling source-aware regression also lowers performance. Conventional regression adds unnecessary subgoals because it ignores which conditions already hold in the source scene. The source-aware operator avoids redundant work by regressing only unmet conditions. Replacing EditLang with a generic scene graph substantially reduces SC and P.

Figure 3. Effect of removing EditLang, which provides explicit preconditions that allow the chair to be rotated around the table. (Instruction: "Rotate the chair 45 degrees relative to the desk.")

As shown in Fig.
3, the system can no longer interpret relational constraints and rotates the chair in place rather than around the table, demonstrating that explicit preconditions and effects are essential for coherent and physically grounded edits.

We also compare against alternative planning strategies. A forward planning variant that searches directly in the space of actions, without backward goal regression, underperforms our full model, indicating that backward reasoning provides a stronger inductive bias for instruction-driven editing. Finally, the Coordinate Prediction setting, which bypasses symbolic reasoning and directly outputs 3D bounding box coordinates, yields the weakest results among all ablations. This suggests that purely geometric, unguided prediction is insufficient for reliable, instruction-aligned scene editing and that a source-aware, symbolic planning framework with validation is essential.

5.7. User Study

We conducted a user study with ten participants (three female, seven male; mean age 26.4). For a subset of benchmark scenes, participants viewed edited results from Edit-As-Act, ArtiScene, and AnyHome and rated each method on three seven-point scales measuring instruction fidelity, semantic consistency, and physical plausibility. As shown in Fig. 4, Edit-As-Act is consistently preferred, achieving average scores of 5.49, 5.65, and 5.92 across the three criteria, compared to 3.11, 2.93, and 3.93 for ArtiScene and 3.46, 4.04, and 4.36 for AnyHome.

Figure 4. User study results. Ten participants rated edited scenes produced by Edit-As-Act, ArtiScene, and AnyHome on three criteria. Edit-As-Act obtains the highest perceived instruction fidelity, semantic consistency, and physical plausibility.

5.8. Failure Modes

We observe three recurring failure modes.
First, highly ambiguous instructions can yield underspecified goal predicates, such as make the room messy or clean up the space. Second, some edits are geometrically valid but stylistically suboptimal, for example when add two bean bags facing the table leads to a literal yet unnatural arrangement. Third, rare planning deadlocks arise when competing subgoals cannot be resolved under strict monotonicity constraints. In practice, these cases are mitigated by bounded retries in the planner-validator loop. Representative examples are provided in the supplementary material.

6. Conclusion

Edit-As-Act reframes 3D indoor scene editing as a reasoning problem rather than a generative one. Our central insight is that editing is not merely placing objects but satisfying a desired world state. By grounding edits in EditLang and regressing goals through symbolic actions with explicit preconditions and effects, Edit-As-Act performs edits that are minimal and meaningful. The planner-validator loop further ensures that plans remain faithful to the instruction and consistent with the original scene.

7. Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grants funded by the Korean government (MSIT) (No. RS-2025-00518643 (30%), No. RS-2025-24802983 (30%)), by the ICT Creative Consilience Program through the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (IITP-2026-RS-2020-I201819 (20%)), and by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2025-02653113, High-Performance Research AI Computing Infrastructure Support at the 2 PFLOPS Scale (20%)).

References

[1] Constructions Aeronautiques, Adele Howe, Craig Knoblock, ISI Drew McDermott, Ashwin Ram, Manuela Veloso, Daniel Weld, David Wilkins Sri, Anthony Barrett, Dave Christianson, et al.
Pddl—the planning domain definition language. Technical Report, Tech. Rep., 1998.
[2] Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025.
[3] Anthropic. Claude opus 4.5 system card. https://www.anthropic.com/claude-opus-4-5-system-card, 2025. Accessed: 2026-03-09.
[4] Ata Çelen, Guo Han, Konrad Schindler, Luc Van Gool, Iro Armeni, Anton Obukhov, and Xi Wang. I-design: Personalized llm interior designer. In European Conference on Computer Vision, pages 217–234. Springer, 2024.
[5] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
[6] Wei Deng, Mengshi Qi, and Huadong Ma. Global-local tree search in vlms for 3d indoor scene generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8975–8984, 2025.
[7] Chuan Fang, Yuan Dong, Kunming Luo, Xiaotao Hu, Rakesh Shrestha, and Ping Tan. Ctrl-room: Controllable text-to-3d room meshes generation with layout constraints. In 2025 International Conference on 3D Vision (3DV), pages 692–701. IEEE, 2025.
[8] Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. Layoutgpt: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems, 36:18225–18250, 2023.
[9] Richard E Fikes and Nils J Nilsson. Strips: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2(3-4):189–208, 1971.
[10] Rafail Fridman, Amit Abecasis, Yoni Kasten, and Tali Dekel.
Scenescape: Text-driven consistent scene generation. Advances in Neural Information Processing Systems, 36:39897–39914, 2023.
[11] Huan Fu, Bowen Cai, Lin Gao, Ling-Xiao Zhang, Jiaming Wang, Cao Li, Qixun Zeng, Chengyue Sun, Rongfei Jia, Binqiang Zhao, et al. 3d-front: 3d furnished rooms with layouts and semantics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10933–10942, 2021.
[12] Rao Fu, Zehao Wen, Zichen Liu, and Srinath Sridhar. Anyhome: Open-vocabulary generation of structured and textured 3d homes. In European Conference on Computer Vision, pages 52–70. Springer, 2024.
[13] Hector Geffner and Blai Bonet. A concise introduction to models and methods for automated planning. Morgan & Claypool Publishers, 2013.
[14] Google DeepMind. A new era of intelligence with gemini 3. https://blog.google/products-and-platforms/products/gemini/gemini-3/, 2025. Accessed: 2026-03-09.
[15] Zeqi Gu, Yin Cui, Zhaoshuo Li, Fangyin Wei, Yunhao Ge, Jinwei Gu, Ming-Yu Liu, Abe Davis, and Yifan Ding. Artiscene: Language-driven artistic 3d scene generation through image intermediary. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2891–2901, 2025.
[16] Patrik Haslum, Nir Lipovetzky, Daniele Magazzeni, Christian Muise, Ronald Brachman, Francesca Rossi, and Peter Stone. An introduction to the planning domain definition language. Springer, 2019.
[17] Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson, and Matthias Nießner. Text2room: Extracting textured 3d meshes from 2d text-to-image models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7909–7920, 2023.
[18] Ian Huang, Yanan Bao, Karen Truong, Howard Zhou, Cordelia Schmid, Leonidas Guibas, and Alireza Fathi. Fireplace: Geometric refinements of llm common sense reasoning for 3d object placement.
In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13466–13476, 2025.
[19] hyper3D. Hyper3dai. https://hyper3d.ai, 2025.
[20] Chenguo Lin and Yadong Mu. Instructscene: Instruction-driven 3d indoor scene synthesis with semantic graph prior. arXiv preprint arXiv:2402.04717, 2024.
[21] Rui Ma, Akshay Gadi Patil, Matthew Fisher, Manyi Li, Sören Pirk, Binh-Son Hua, Sai-Kit Yeung, Xin Tong, Leonidas Guibas, and Hao Zhang. Language-driven synthesis of 3d scenes from scene databases. ACM Transactions on Graphics (TOG), 37(6):1–16, 2018.
[22] OpenAI. Gpt-5 system card. 2025-08-13.
[23] Zhenyu Pan and Han Liu. Metaspatial: Reinforcing 3d spatial reasoning in vlms for the metaverse. arXiv preprint arXiv:2503.18470, 2025.
[24] Despoina Paschalidou, Amlan Kar, Maria Shugrina, Karsten Kreis, Andreas Geiger, and Sanja Fidler. Atiss: Autoregressive transformers for indoor scene synthesis. Advances in Neural Information Processing Systems, 34:12013–12026, 2021.
[25] Xingjian Ran, Yixuan Li, Linning Xu, Mulin Yu, and Bo Dai. Direct numerical layout generation for 3d indoor scene synthesis via spatial reasoning. arXiv preprint arXiv:2506.05341, 2025.
[26] Jonas Schult, Sam Tsai, Lukas Höllein, Bichen Wu, Jialiang Wang, Chih-Yao Ma, Kunpeng Li, Xiaofang Wang, Felix Wimbauer, Zijian He, et al. Controlroom3d: Room generation using semantic proxy rooms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6201–6210, 2024.
[27] Fan-Yun Sun, Weiyu Liu, Siyi Gu, Dylan Lim, Goutam Bhat, Federico Tombari, Manling Li, Nick Haber, and Jiajun Wu. Layoutvlm: Differentiable optimization of 3d layout via vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29469–29478, 2025.
[28] Jiapeng Tang, Yinyu Nie, Lev Markhasin, Angela Dai, Justus Thies, and Matthias Nießner.
Diffuscene: Denoising diffusion models for generative indoor scene synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20507–20518, 2024.
[29] Pulkit Verma, Ngoc La, Anthony Favier, Swaroop Mishra, and Julie A Shah. Teaching llms to plan: Logical chain-of-thought instruction tuning for symbolic planning. arXiv preprint arXiv:2509.13351, 2025.
[30] Yandan Yang, Baoxiong Jia, Shujie Zhang, and Siyuan Huang. Sceneweaver: All-in-one 3d scene synthesis with an extensible and self-reflective agent. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
[31] Yixuan Yang, Junru Lu, Zixiang Zhao, Zhen Luo, James JQ Yu, Victor Sanchez, and Feng Zheng. Llplace: The 3d indoor scene layout generation and editing via large language model. arXiv preprint arXiv:2406.03866, 2024.
[32] Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, et al. Holodeck: Language guided generation of 3d embodied ai environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16227–16237, 2024.
[33] Guangyao Zhai, Evin Pınar Örnek, Shun-Cheng Wu, Yan Di, Federico Tombari, Nassir Navab, and Benjamin Busam. Commonscenes: Generating commonsense 3d indoor scenes with scene graph diffusion. Advances in Neural Information Processing Systems, 36:30026–30038, 2023.
[34] Guangyao Zhai, Evin Pınar Örnek, Dave Zhenyu Chen, Ruotong Liao, Yan Di, Nassir Navab, Federico Tombari, and Benjamin Busam. Echoscene: Indoor scene generation via information echo over scene graph diffusion. In European Conference on Computer Vision, pages 167–184. Springer, 2024.
[35] Kaizhi Zheng, Xiaotong Chen, Xuehai He, Jing Gu, Linjie Li, Zhengyuan Yang, Kevin Lin, Jianfeng Wang, Lijuan Wang, and Xin Eric Wang. Editroom: Llm-parameterized graph diffusion for composable 3d room layout editing.
arXiv preprint arXiv:2410.12836, 2024.

Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing
Supplementary Material

A. Overview

This supplementary document provides additional details, experimental results, and analyses for our paper, "Edit-As-Act". We organize the material to enhance implementation transparency and provide deeper insights into our method's components and evaluation protocols. The contents are as follows:
• Sec. B: Extended Intuition. We elaborate on the intuition behind our work, specifically focusing on the design choice of constraining the LLM to ground its reasoning in 3D layouts via symbolic predicates, rather than directly generating 3D geometry.
• Sec. C: Object Generation and Stylization Pipeline. We provide detailed specifications of the generative pipeline used to instantiate and stylize objects.
• Sec. D: EditLang Formalization. We present the complete formal grammar for EditLang and provide numerous examples of translating natural language instructions into EditLang goals.
• Sec. E: Goal-Regressive Planner Details. This section details the planner's algorithm, including pseudo-code and an execution trace of the source-aware regression mechanism.
• Sec. F: Validator Details. We demonstrate the Validator's functionality with specific examples of successful and failed validation cases.
• Sec. G: LVLM Metric Reliability. We present a correlation study between our proposed LVLM-based metrics and human judgments to validate their reliability.
• Sec. H: Extended Quantitative Analysis. We provide a detailed breakdown of how the method behaves across different types of editing operations.
• Sec. I: Failure Case Visualization. We visualize representative failure cases of our model.
• Sec. J: Additional Ablation Studies. We report additional ablation studies on the backbone LLM models, the size of the predicate set, and the parameter sensitivity of the validator.
• Sec.
K: Additional Quantitative Experiments. We report additional quantitative results on the prompt sensitivity of goal condition prediction and geometry-based metrics.
• Sec. L: Limitations and Discussion. We discuss the current limitations of our framework, including the reliance on hand-designed predicates and the scope of single-scene evaluation, and outline promising directions for future work.
• Sec. M: Source Scene Visualization. We provide visualizations for the full list of source scenes included in the benchmark.
• Sec. N: Additional Qualitative Results. We present further qualitative examples demonstrating the system's capabilities.
• Sec. O: Full Prompts for Model and Evaluation. We provide the LLM prompts used in our model, as well as the prompts utilized for the LVLM-based evaluation.

B. Extended Intuition

Core Intuition Summary. Traditional editing approaches treat the task as Generative Simulation by attempting to predict the visual appearance of a scene after a change. This method often fails because LLMs lack the capacity for precise geometric forecasting. In contrast, Edit-As-Act redefines editing as Goal Specification and focuses on defining the conditions that must be satisfied in the final state. Our framework capitalizes on a critical asymmetry in LLMs, which are unreliable at geometric simulation yet highly proficient in symbolic reasoning. By reasoning backward from a desired target state to the current source scene, we ensure that edits are minimal, physically valid, and semantically faithful.

A central challenge in LLM-based indoor scene editing is interpreting complex 3D environments with interacting objects [8, 32]. Conventional approaches, such as fine-tuning on scene datasets or using iterative generation, often fail in open-vocabulary settings because they rely heavily on predefined object distributions [24, 28, 33].
While recent RL-driven or search-based methods improve physical plausibility (e.g., avoiding collisions) [6, 23, 25, 30], they often address geometric validity at the expense of semantic intent and compositional grounding [18, 27].

Our formulation is motivated by a critical asymmetry observed in frontier LLMs, as illustrated in Fig. 5: they are poor at geometric simulation but excellent at symbolic specification [5, 22, 29]. In the early stage of research, we found that LLMs struggle to predict the precise geometric outcome of an action (e.g., "What will the scene look like after sliding the chair?"). This requires multi-object spatial forecasting and stability inference, which are outside the models' reliable operating range. Conversely, LLMs are remarkably robust when reasoning about the conditions a final scene must satisfy (e.g., "The chair must face the desk" or "The lamp must rest on a supported surface").

Figure 5. LLMs demonstrate strong capabilities in interpreting existing 3D layouts (top) but remain unreliable when directly generating 3D layouts from instructions (bottom). This asymmetry motivates our goal-regressive formulation. (The figure contrasts a "Describe given 3D layout" query, which yields an accurate description of a computer workstation, with a "Generate 3D layout given instruction" query, which leaves the monitor floating far off in a corner instead of on the desk and the chair positioned diagonally far away rather than in front of the desk.)

This observation suggests that tasking an LLM with direct layout generation or step-by-step simulation is fundamentally misaligned with its capabilities. Instead, we reformulate 3D scene editing as a goal-specification problem. Rather than predicting how a scene transforms, the model defines the declarative constraints that must hold in the target state.

To bridge these symbolic constraints with the 3D environment, we introduce EditLang, a PDDL [1, 16]-inspired domain that defines explicit preconditions and effects for geometric actions. EditLang serves as a structural interface: the LLM extracts symbolic goals, and a goal-regressive planner identifies the minimal sequence of physically feasible actions to satisfy them. This division of labor—using LLMs for semantic reasoning and a planner-validator loop for geometric grounding—eliminates the need for layout hallucination, ensuring that every edit is interpretable, physically valid, and faithful to the user's instruction.

C. Object Generation and Stylization Pipeline

To physically realize the editing operations proposed by our planner, specifically Add and Stylize actions, we employ a unified generative pipeline powered by Hyper3D Gen-2 (Rodin Gen-2) [19].
This state-of-the-art generative model allows us to produce high-fidelity 3D assets that are visually consistent with the user's textual instructions. As illustrated in Fig. 6, our pipeline handles two distinct workflows depending on the editing requirement. Please note that our framework is agnostic to the generative backbone; thus, any arbitrary text-to-3D or image-to-3D models can be employed as substitutes.

C.1. Object Generation

When the planner specifies an Add action (e.g., "Add a modern chair"), the system requires a completely new 3D asset. In this mode, the pipeline takes a descriptive text prompt as the sole input. Therefore, Gen-2 first synthesizes the corresponding image of the given text, then sequentially synthesizes the object's geometry and texture, outputting a 3D mesh that semantically aligns with the description.

C.2. Object Stylization

For Stylize actions (e.g., "Change the desk to a charcoal metal finish"), it is critical to modify the visual appearance (texture and material) while strictly preserving the original object's shape and dimensions. As highlighted by the blue dashed boxes in Fig. 6, we utilize the point cloud conditioning feature of Gen-2. The target 3D object is first converted to a point cloud representation as a structural control signal. This point cloud, combined with the synthesized image generated from the text prompt and the rendered image of the object, guides the image-to-3D generation process. In Fig. 6, by conditioning the generation process on the point cloud representation, the model updates the texture to match the "charcoal metal finish" while ensuring the output 3D object retains the exact pose and structure of the original input.

Figure 6. Overview of the object generation and stylization pipeline.
We utilize Hyper3D Gen-2 [19] to synthesize high-fidelity 3D assets. The pipeline operates in two modes: (1) Text-to-3D for generating new objects from scratch, and (2) Point Cloud-Guided Stylization (highlighted in blue dashed boxes), where the input 3D object is converted into a point cloud to condition the generation, ensuring the geometric structure remains preserved while the texture is updated according to the text prompt.

D. EditLang Formalization

To bridge the gap between vague natural language instructions and rigid 3D geometric data, we need a structured intermediate representation. EditLang serves as this bridge, defining the domain through two core components:
• Predicates (The "State"): A vocabulary to describe geometric and semantic relationships (e.g., on(lamp, table)). They act as the "eyes" of the planner, translating continuous 3D coordinates into discrete symbolic states to verify if a goal is met.
• Actions (The "Operators"): A set of atomic operations (e.g., move_to, stylize) defined with explicit preconditions and effects. These act as the "hands" of the planner, ensuring that every modification is physically grounded and logically sound before execution.

This section presents the complete specification of EditLang used in our experiments, supporting hierarchical object manipulation, fine-grained directional placement, and style propagation.

D.1. Syntax

EditLang uses a typed, PDDL-inspired syntax defined as follows:

D.2. Predicate Library

Our predicate set captures geometric state, topology, physical constraints, and semantic relations. We categorize them by the fundamental questions they answer:
• Existence (Is it there?): exists(o), removed(o).
• Spatial Relations (Where is it globally?): at(o, pos), on(o, surface), between(o, a, b), near(o, target, τ), aligned_with(o, target, axis).
• Directional & Relative (How is it oriented?): is_facing(o, target), left_of(o, ref, view), right_of(...), in_front_of(...), behind(...).
• Grouping & Constraints (How does it interact?): grouped_with(child, parent), locked(o).
• Physical & Functional (Is it valid?): supported(o, surface), contact(o, surface), clear(o), stable(o), colliding(o1, o2), visible(o, view), accessible(o).
• Attributes (What does it look like?): has_style(o, desc), matches_style(o1, o2), has_scale(o, sx, sy, sz).

D.3. Action Definitions

The planner employs atomic actions defined by ⟨pre, add, del⟩ conditions.
• Spatial Manipulation:
– move_to(o, pos): Relocates a single object o.
– move_group(parent, pos): Moves a parent object (e.g., dining table) and all associated children (e.g., chairs, centerpiece) while preserving their relative local transforms.
– place_relative(o, target, relation): Places o satisfying directional predicates (e.g., left_of) relative to the target.
– place_on(o, surface): Places o on a support surface.
– align_with(o, target, axis): Aligns bounding boxes along an axis.
• Orientation & Scale:
– rotate_towards(o, target): Updates yaw to face a target.
– rotate_by(o, degrees): Rotates object o by a specified angle relative to its current orientation.
– scale(o, sx, sy, sz): Modifies dimensions with collision checks.
• Creation, Deletion & Style:
– add_object(o, cat, support): Instantiates a new asset.
– remove_object(o): Deletes o and clears relations.
– stylize(o, desc): Updates texture based on a text description.

Tab. 5 demonstrates mapping instructions to these extended goals.

E. Goal-Regressive Planner Details

This section details the algorithmic implementation of the planner. We employ the LLM as a policy π_θ to propose transition operators, while the control flow is governed by a deterministic symbolic loop.

E.1. Planning Algorithm

Algorithm 1 outlines the core planning loop. The system maintains a stack of goals G_t. At each iteration, it prompts the LLM to propose an action a_t that satisfies at least one condition in G_t.
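The ⟨pre, add, del⟩ action structure from Sec. D.3 and the source-aware regression operator detailed in Sec. E.2 can be condensed into a brief sketch. This is an illustrative reconstruction rather than the actual implementation: predicates are plain strings, and the scene and action contents mirror the lamp-and-table trace of Sec. E.3.

```python
# Illustrative sketch (assumed types): predicates as strings, actions as
# <pre, add, del> triples, mirroring the EditLang operators in Sec. D.3.
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    pre: frozenset = frozenset()     # preconditions
    add: frozenset = frozenset()     # effects made true
    delete: frozenset = frozenset()  # effects made false

def source_aware_regress(goals, action, source_state):
    """Regress a goal set through an action, pruning preconditions
    that already hold in the source scene (steps 1-4 of Sec. E.2)."""
    remaining = goals - action.add     # 1. Satisfy: drop goals the action achieves
    unmet = action.pre - source_state  # 2-3. Propagate preconditions, prune satisfied ones
    return remaining | unmet           # 4. Update the goal set

# Mini trace from Sec. E.3: "Place the lamp on the side table."
S0 = frozenset({"exists(lamp)", "on(mug, table)"})  # table is cluttered
place_lamp = Action(
    name="place_on(lamp, table)",
    pre=frozenset({"clear(table)", "exists(lamp)"}),
    add=frozenset({"on(lamp, table)"}),
)
# exists(lamp) already holds in S0, so only clear(table) survives regression.
G1 = source_aware_regress(frozenset({"on(lamp, table)"}), place_lamp, S0)
```

The backward plan terminates once the regressed goal set is empty, at which point the accumulated action sequence is reversed for execution, matching the control flow of Algorithm 1.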
Crucially, the VALIDATOR acts as a rejection sampler, filtering out geometrically invalid or non-monotonic actions before they affect the plan state.

Algorithm 1 LLM-Driven Goal Regression Loop
Require: Goal predicates G_target, Source state S_0
Ensure: Plan Π = [a_1, ..., a_T]
1: G ← G_target
2: Π_back ← [] ▷ Backward plan sequence
3: while G ≠ ∅ do
4:   success ← False
5:   for k ← 1 to 3 do ▷ Max 3 retries per step
6:     a ← LLM_POLICY(G, S_0, Π_back)
7:     valid, msg ← VALIDATOR(a, G, S_0)
8:     if valid then
9:       success ← True
10:      break
11:    else
12:      Add msg to prompt history (Refinement)
13:    end if
14:  end for
15:  if not success then
16:    return Π_back
17:  end if
18:  Append a to Π_back
19:  G ← SOURCEAWAREREGRESS(G, a, S_0)
20: end while
21: return Reverse(Π_back)

E.2. Source-Aware Regression Logic

Unlike classical STRIPS, which regresses to an initial empty state, our SourceAwareRegress function filters preconditions against the actual 3D scene geometry S_0. This serves as a pruning mechanism. The implementation logic is as follows:
1. Satisfy: Remove goals satisfied by action effects (G \ add(a)).
2. Propagate: Identify preconditions required by a (P = pre(a)).
3. Prune: Filter out preconditions that are already true in the source scene (U = P \ S_0).
4. Update: The new goal set becomes the remaining goals plus the unsatisfied preconditions (G ← (G \ add(a)) ∪ U).

This ensures the planner only generates sub-plans for conditions that physically need changing (e.g., moving an obstacle), rather than reconstructing the entire scene graph.

E.3. Execution Trace Example

To demonstrate the regression logic, consider the instruction: "Place the lamp on the side table," where the table is currently cluttered.

Initial Goal (G_0). on(lamp, table)

Step 1.
• LLM Proposal: a_1 = place_on(lamp, table)
• Preconditions: clear(table), exists(lamp)
• Check S_0:
– exists(lamp) is True (in inventory).
– clear(table) is False (blocked by a mug).
• Regression: G_1 ← clear(table)

Step 2.
• LLM Proposal: a_2 = move_to(mug, shelf)
• Effects: Adds clear(table) (by removing the mug from it).
• Check S_0: move_to preconditions (shelf is valid) are met in S_0.
• Regression: G_2 ← ∅ (All conditions grounded in S_0)

Final Plan (Reversed).
1. move_to(mug, shelf) (Clears the table)
2. place_on(lamp, table) (Achieves goal)

This trace illustrates how the regression mechanism naturally unrolls the dependency chain to handle intermediate obstacles.

Table 5. Examples of Natural Language to EditLang Goal Translation.
Instruction                              | Goal Predicates
"Move the dining set to the window."     | near(table_1, window_1), grouped_with(chair_*, table_1)
"Make the chair match the sofa."         | matches_style(chair_1, sofa_1)
"Place the lamp to the left of the bed." | left_of(lamp_1, bed_1, cam_frame)

F. Validator Details

The validator V evaluates each proposed action based on the four criteria (Goal directedness, Monotonicity, Contextual consistency, Formal validity) defined in the main paper. In this section, we provide the technical implementation details specifically for the Geometric and Physical Feasibility checks used to enforce the domain invariants (I).

F.1. Geometric and Physical Implementation

To enforce physical plausibility and spatial constraints, we implement the following deterministic checks:

Geometric & Physical Checks. These checks implement the Domain Invariants (I) described in the main paper, utilizing the 3D scene state:
• Collision: We compute Oriented Bounding Box (OBB) intersections. An action is rejected if the target volume intersects with static scene elements (tolerance ε < 1 cm).
• Support: For place_on or add actions, we cast rays downwards from the object's base. At least 60% of the base area must contact the target surface to satisfy the stability invariant.

F.2. Pass/Fail Case Studies

Tab. 6 presents specific examples of actions rejected by the Validator and the corresponding feedback provided to the planner for refinement.

G.
LVLM Metric Reliability To validate the automated evaluation protocol used in our benchmark, we analyze the correlation between the LVLM- based metrics and human judgments. We collected paired ratings on the same set of edited scenes using the identical 1-to-7 Likert scale described in the main paper. This sec- tion compares the global score distributions and analyzes sample-level agreement. Table 6. Validator Decision Examples. Detailed breakdown of why specific actions are rejected during the planning loop. Check TypeProposed Action & Context Validator Decision & Feedback Physical (Collision) Action: move to(chair1, [1.2, 0, 1.5]) Context: Target coor- dinates overlap with table1. FAIL “Targetposition causes collision with table1.Please select a clear region or move the obstacle first.” Symbolic (Monotonicity) Action: moveto(lamp, floor) Context: Previous step satisfied on(lamp, table). FAIL “Actionundoesa previouslysatisfied goal: on(lamp, table).Do not moveobjectsthat are already correctly placed.” Symbolic (Relevance) Action: stylize(curtain, "blue") Context:Instruction is “Rotate the chair”. No goal relates to the curtain. FAIL “Actiondoesnot satisfyanycurrent goal.Focus only on the chair and its orientation.” G.1. Distribution Alignment Fig. 7, Fig. 8, and Fig. 9 illustrate the normalized frequency of scores for Instruction Fidelity (IF), Physical Plausibility (P), and Semantic Consistency (SC), respectively. The histograms reveal a strong alignment between hu- man and LVLM evaluations: • Matching Modes: For all three metrics, both human and LVLM distributions peak at the highest score bucket (7), reflecting the model’s high performance. • Similar Variance: The spread of scores across the 1–7 scale is comparable, indicating that the LVLM effectively 5 1234567 Score level (17) 0.0 0.2 0.4 0.6 0.8 1.0 Normalized frequency Score distribution: Ours / IF Human VLM (binned from 0100) Figure 7. Validation of Instruction Fidelity (IF) Metrics. 
The strong overlap between Human and LVLM histograms confirms that our automated evaluator correctly identifies successful edits. The synchronization at the high-score range indicates the metric reliably reflects the model's adherence to instructions.

[Figure 8: histogram of normalized score frequencies over the 1–7 scale, Human vs. VLM (binned from 0–100), Ours / P.]
Figure 8. Validation of Physical Plausibility (P) Metrics. Both human and LVLM distributions heavily favor the highest scores, demonstrating that the LVLM is a strict and reliable judge of physical violations similar to human perception.

• Optimism Bias: While the distributions are consistent, the LVLM tends to be slightly more generous (higher density at score 7) than human raters, particularly in Semantic Consistency (SC) and Physical Plausibility (P). However, the relative ranking trends remain preserved.

G.2. Sample-Level Disagreement Analysis

To assess granular agreement, we compute the absolute difference |S_Human − S_LVLM| for each sample. Fig. 10, Fig. 11, and Fig. 12 visualize these differences across scene types and editing operations. The majority of the heatmap regions are dark blue (< 1.5 difference), confirming that the LVLM approximates human judgment accurately for most tasks.

[Figure 9: histogram of normalized score frequencies over the 1–7 scale, Human vs. VLM (binned from 0–100), Ours / SC.]
Figure 9. Validation of Semantic Consistency (SC) Metrics. Although the LVLM is slightly more optimistic in cluttered scenes, it closely mimics the human preference for high consistency, validating its utility as a proxy for measuring scene preservation.
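The comparison above requires mapping the LVLM's 0–100 scores onto the 1–7 Likert scale before computing per-sample differences. A minimal sketch of that pipeline; the exact bin edges are not specified in the paper, so equal-width bins are our assumption:

```python
def bin_to_likert(score: float) -> int:
    """Map a 0-100 LVLM score to the 1-7 Likert scale (equal-width bins, assumed)."""
    if not 0.0 <= score <= 100.0:
        raise ValueError("score must lie in [0, 100]")
    # Bin width is 100/7; the top edge (score == 100) is clamped into bucket 7.
    return min(7, int(score // (100.0 / 7.0)) + 1)


def mean_abs_diff(human: list, lvlm_raw: list) -> float:
    """Mean |S_Human - S_LVLM| after binning the raw LVLM scores."""
    binned = [bin_to_likert(s) for s in lvlm_raw]
    return sum(abs(h - b) for h, b in zip(human, binned)) / len(human)
```

With this binning, a cell value below 1.5 in the heatmaps simply means the binned LVLM score lands within roughly one Likert step of the human rating on average.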
[Figure 10: heatmap of |Human − VLM| absolute score differences (0–6) across command types and room categories, Ours / IF.]
Figure 10. Absolute Score Difference (IF). Disagreements are localized to specific ambiguous scenarios.

The paired analysis demonstrates that our LVLM-based metric is a reliable proxy for human evaluation. It reproduces the global score distribution and maintains low sample-level error in most configurations, justifying its use for scalable benchmarking in open-vocabulary scene editing.

H. Extended Quantitative Analysis

We analyze how well each method handles specific types of editing operations. Indoor scene editing involves fundamentally different reasoning modes, such as addition or removal. Each operation stresses a different capability:
• ADD stresses open-vocabulary generalization and asset integration.
• REMOVE stresses boundary reasoning.
• TRANSLATE / ROTATE / SCALE stress geometric precision.
• STYLIZE stresses geometry-level consistency.
• MIXED stresses multi-step, compositional reasoning.

Table 7. Performance by editing operation type. Edit-As-Act achieves the strongest and most reliable performance across all edit categories, maintaining high instruction fidelity (IF), semantic consistency (SC), and physical plausibility (P).

                    ADD              REMOVE           TRANSLATE        ROTATE
Methods             IF   SC   P      IF   SC   P      IF   SC   P      IF   SC   P
LayoutGPT-E         40.5 42.1 82.4   60.3 71.5 88.1   52.6 50.3 80.7   41.4 32.5 78.1
AnyHome             47.2 65.0 82.1   61.2 58.7 79.8   66.0 76.6 85.7   52.7 48.4 85.0
ArtiScene-E         39.9 50.0 89.1   80.1 78.4 94.0   50.2 58.0 86.1   39.9 28.0 90.8
Edit-As-Act (ours)  82.7 90.2 88.3   73.9 80.1 95.9   97.1 95.7 89.6   53.1 86.3 95.0

                    SCALE            STYLIZE          MIXED            Average
Methods             IF   SC   P      IF   SC   P      IF   SC   P      IF   SC   P
LayoutGPT-E         58.1 59.2 81.5   35.3 58.7 77.2   27.8 27.5 64.6   42.3 48.8 78.6
AnyHome             91.0 90.7 86.8   53.4 54.8 92.1   31.9 29.4 80.2   57.6 60.5 84.5
ArtiScene-E         45.3 56.0 88.1   52.2 54.6 93.6   30.3 33.3 90.8   48.3 51.2 90.3
Edit-As-Act (ours)  85.2 96.1 87.3   50.6 89.7 97.7   41.1 68.1 87.4   69.1 86.6 91.7

[Figure 11: heatmap of |Human − VLM| absolute score differences (0–6) across command types and room categories, Ours / P.]
Figure 11. Absolute Score Difference (P). High agreement (low error) is observed across most categories.

[Figure 12: heatmap of |Human − VLM| absolute score differences (0–6) across command types and room categories, Ours / SC.]
Figure 12. Absolute Score Difference (SC).
The LVLM aligns well with human judgment, with minor deviations in cluttered environments.

For this reason, we further analyze performance by edit category in Tab. 7.

Analysis. Several observations emerge from the per-category breakdown. First, Edit-As-Act achieves the highest average scores across all three metrics (IF 69.1, SC 86.6, P 91.7), demonstrating that goal-regressive planning generalizes well across fundamentally different editing modes. Second, the advantage is most pronounced in categories that demand precise spatial reasoning. In Translate, Edit-As-Act attains 97.1 IF and 95.7 SC, outperforming the next-best method by over 30 points in IF. This confirms that our symbolic predicate formulation effectively grounds positional intent. Third, for Add and Stylize, which rely heavily on the quality of the generative backbone, Edit-As-Act still leads in SC and P, indicating that the planner–validator loop successfully constrains asset placement and appearance even when the underlying generation is imperfect. Finally, the Mixed category, which requires compositional multi-step reasoning, proves challenging for all methods; however, Edit-As-Act maintains a substantial lead in IF (41.1 vs. 31.9) and SC (68.1 vs. 33.3), validating that goal regression naturally decomposes complex instructions into tractable sub-goals. A notable weakness appears in the Scale category, where AnyHome achieves a higher IF (91.0 vs. 85.2). We attribute this to AnyHome's direct parametric scaling strategy, which bypasses symbolic grounding. Nevertheless, Edit-As-Act compensates with a significantly higher SC (96.1 vs. 90.7), preserving scene coherence more reliably.

I. Failure Case Visualization

Figure 13. Qualitative Failure Examples. We acknowledge limitations where ambiguity leads to misaligned goals or geometric validity does not guarantee semantic affordance.

Table 8.
Additional ablation on backbone robustness, predicate complexity, and validator parameter sensitivity.

Setting           IF   SC   P
w/ GPT-OSS-20b    62.5 72.8 77.8
Small Pred. Set   52.4 78.1 82.6
High Sensitivity  66.5 84.5 90.2
Low Sensitivity   68.9 85.0 88.0
Ours              69.1 86.6 91.7

We visualize representative failure cases discussed in the main paper to provide concrete insight into the current limitations of Edit-As-Act.

Ambiguous Instructions. The left example in Fig. 13 illustrates a case where the instruction "Clean up the room" is inherently under-specified. Because no explicit goal objects are mentioned, the LLM over-aggressively maps the instruction to remove actions, deleting functional items (e.g., desk accessories) that a human would consider essential. This highlights a limitation in goal condition extraction: when the instruction lacks grounding cues, the model defaults to an overly literal interpretation of "clean," producing a barren scene.

Semantic Affordance Errors. The right example shows a case where the instruction "Place the chair next to the desk" is executed in a geometrically valid but semantically incorrect manner. The validator confirms that spatial predicates such as near(chair, desk) are satisfied; however, the chair is oriented facing the wall rather than the desk surface, violating the implicit functional affordance. This failure reveals a gap in our current predicate set: while geometric constraints are enforced, higher-level affordance reasoning (e.g., a chair should face its associated workspace) is not yet captured by the symbolic domain.

J. Additional Ablation Studies

We conduct three additional ablation studies to examine the robustness of Edit-As-Act along axes not covered in the main paper. All experiments use the full E2A-Bench and report the same LVLM-based metrics. Results are summarized in Tab. 8.

J.1.
Backbone LLM

To assess whether our framework is tied to a specific frontier model, we replace the default backbone with GPT-OSS-20b [2], a smaller open-source LLM. As shown in the first row of Tab. 8, all three metrics drop noticeably (IF 62.5, SC 72.8, P 77.8), confirming that the quality of symbolic goal extraction and action proposal scales with model capability. Notably, one of the largest degradations occurs in SC (−13.8), suggesting that weaker models struggle most with preserving scene context during multi-step planning. Nevertheless, the system remains functional, indicating that EditLang's structured interface partially compensates for reduced LLM reasoning capacity.

J.2. Predicate Set Size

We evaluate a reduced predicate set (Small Pred. Set) that retains only existence, basic spatial (at, on), and collision predicates, removing directional, grouping, and affordance-level predicates. This ablation isolates the contribution of our rich predicate vocabulary. The results show a marked decline in IF (52.4 vs. 69.1), as the planner can no longer express fine-grained goals such as left_of or facing. Interestingly, P remains relatively high (82.6), because basic collision and support checks are preserved. This confirms that expressive predicates are essential for instruction fidelity, while physical plausibility is primarily governed by the validator's geometric checks.

J.3. Validator Geometric Sensitivity

We vary the collision tolerance threshold of the validator to study its effect on plan quality. High Sensitivity tightens the OBB intersection tolerance to ε < 0.5 cm, while Low Sensitivity relaxes it to ε < 3 cm. As shown in Tab. 8, the default setting (ε < 1 cm) achieves the best balance across all metrics. High sensitivity keeps P high (90.2) but reduces IF (66.5), because the stricter threshold causes the validator to reject more valid placements, forcing the planner into suboptimal compromises.
Conversely, low sensitivity slightly degrades P (88.0) by admitting near-collision configurations. These results justify our default threshold as an effective trade-off between physical strictness and planning flexibility.

K. Additional Quantitative Experiments

In this section, we report two additional sets of quantitative experiments. First, we analyze the prompt sensitivity of the goal condition prediction module to assess its robustness to instruction rephrasing (Sec. K.1). Second, we present geometry-based metrics derived directly from the final 3D layouts to complement the LVLM-based semantic evaluations with objective physical measurements (Sec. K.2).

K.1. Prompt Sensitivity of Goal Condition Prediction

A potential concern with LLM-based goal extraction is that minor rephrasing of the input instruction could lead to substantially different goal predicate sets, undermining the reliability of the entire pipeline. To quantify this, we design a prompt sensitivity experiment.

Setup. We select 50 editing instructions from E2A-Bench and generate three semantically equivalent rephrasings for each using an independent LLM (GPT-4o), yielding 150 rephrased variants (200 instructions in total). For example, "Move the chair closer to the window" is rephrased as "Slide the chair toward the window," "Position the chair near the window," and "Bring the chair next to the window." We then run our goal condition extraction module on all variants and measure consistency via two metrics: (1) Predicate Recall, defined as the fraction of predicates from the original instruction that also appear in the rephrased variant's goal set, and (2) Exact Match Rate, the percentage of cases where the original and rephrased variants produce identical goal predicate sets.

Results. Across all 150 rephrased variants, we observe a predicate recall of 92.4% and an exact match rate of 78.0%.
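Both consistency metrics are simple set comparisons. A sketch, with predicates represented as hashable (name, args) tuples; this representation is our assumption, not specified by the paper:

```python
# A goal predicate as a hashable tuple, e.g. ("near", ("chair1", "window1")).

def predicate_recall(original: frozenset, rephrased: frozenset) -> float:
    """Fraction of the original goal predicates recovered in the rephrased variant's set."""
    if not original:
        return 1.0  # vacuous: nothing to recover
    return len(original & rephrased) / len(original)


def exact_match(original: frozenset, rephrased: frozenset) -> bool:
    """True iff both variants yield identical goal predicate sets."""
    return original == rephrased
```

The reported numbers would then be the mean recall and the fraction of exact matches over all 150 original/rephrased pairs.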
The majority of mismatches involve stylistic differences rather than semantic divergence; for instance, one variant may produce near(chair, window) while another yields at(chair, pos_near_window), both of which lead to functionally equivalent plans. When we further measure downstream plan equivalence (i.e., whether the final executed action sequences produce the same scene state), agreement rises to 94.6%.

Discussion. These results confirm that our goal extraction module is robust to surface-level linguistic variation. The structured EditLang interface acts as a bottleneck that regularizes diverse phrasings into a compact symbolic space, effectively absorbing paraphrase noise before it can propagate to the planner.

K.2. Geometry-based Metrics

1. Out-of-Boundary (OOB) Rate. This metric identifies objects placed outside the valid room volume.
• Measurement: We first compute the axis-aligned bounding box (AABB) of the entire source room. An object is classified as OOB if its geometric center lies more than 10 cm outside the room's AABB.
• Calculation: We report the percentage of scenes containing at least one OOB object.

2. Floating Object Rate. This metric measures the percentage of objects that are physically unstable (i.e., levitating without support). An object is considered "grounded" if it satisfies one of two conditions:
• Floor Contact: Its bottom vertical coordinate (z_min) is within a tolerance of 10 cm from the floor height.
• Stacked Support: It rests on another object that is itself grounded.
Wall-mounted assets are excluded from this check. Objects failing both conditions are classified as "Floating."

Tab. 9 summarizes the performance of each method.

Table 9. Comparison of Geometry-Based Metrics. Lower is better. Edit-As-Act achieves the best physical validity.

Method              OOB Scene Ratio (%) ↓   Floating Object Rate (%) ↓
ArtiScene-E         88.89                   92.06
AnyHome             7.94                    57.14
Edit-As-Act (Ours)  6.16                    14.21

Analysis of Failures.
• ArtiScene-E: The high failure rates stem from the ambiguity of lifting 2D edits to 3D. Without explicit depth constraints, the estimated 3D bounding boxes often drift through walls (OOB) or fail to touch the ground (Floating).
• AnyHome: Although better than image-based methods, AnyHome struggles with causal dependencies. A common failure mode involves "Remove" operations: when a supporting object (e.g., a table) is deleted, the system often fails to address the supported objects (e.g., a laptop), leaving them floating in mid-air.
• Edit-As-Act: Our method explicitly models support relations (e.g., on(x, y)) and room boundaries in the symbolic domain. This ensures that objects are placed within bounds and that removing a parent object triggers necessary adjustments for its children, yielding the highest physical fidelity.

L. Limitations and Discussion

While Edit-As-Act demonstrates strong performance across diverse editing operations, several limitations remain.

Hand-Designed Predicate Set. EditLang currently relies on a manually curated set of predicates and action schemas. Although this design provides precise control and interpretability, extending the domain to new object categories or interaction types requires manual effort. Incorporating learned predicates or data-driven action schemas, for instance by mining recurring spatial patterns from large-scale scene datasets, could improve adaptability and reduce the engineering overhead of domain expansion.

Single-Scene, Static Setting. Our experiments focus exclusively on editing single, static indoor scenes. Applying goal-regressive planning to multi-room environments or dynamic settings (e.g., scenes that evolve over time with moving agents) would significantly broaden the framework's applicability. Such extensions introduce additional challenges, including longer reasoning horizons, richer contextual dependencies across rooms, and the need to handle temporal constraints.

Outlook.
Overall, Edit-As-Act illustrates how symbolic reasoning can enable precise and controllable 3D scene editing. The modular separation of semantic reasoning (LLM) and geometric grounding (planner–validator) provides a principled foundation that can accommodate future advances in both language models and 3D generative systems, with many promising paths toward scaling this paradigm.

M. Visualization of Source Scenes

We visualize the full set of source scenes included in our E2A-Bench. As shown in Fig. 14, the benchmark encompasses nine distinct indoor environments: Bathroom, Bedroom, Computer room, Dining room, Game room, Kids room, Kitchen, Living room, and Office.

N. Additional Qualitative Results

We present extended qualitative results to further demonstrate the capabilities of Edit-As-Act. Fig. 15 and Fig. 16 illustrate the model's performance on complex editing tasks, including multi-step spatial rearrangements and attribute stylization. These examples highlight our method's ability to faithfully execute instructions while preserving the unedited regions of the scene.

O. Full Prompts for Model and Evaluation

To facilitate reproducibility and transparency, we provide the prompts used in our framework.
• Model Prompts: Fig. 17 through Fig. 21 detail the system instructions for Goal Condition Extraction, Planning, and Validation. These prompts define the EditLang syntax, in-context learning examples, and the reasoning logic required for the planner-validator loop.
• Evaluation Prompts: Fig. 22, Fig. 23, and Fig. 24 display the prompts used for our LVLM-based metrics. These prompts establish the evaluation rubric for Instruction Fidelity (IF), Semantic Consistency (SC), and Physical Plausibility (P).

[Figure 14 panels: Bathroom, Bedroom, Computer room, Dining room, Game room, Kids room, Kitchen, Living room, Office.]
Figure 14. Diversity of E2A-Bench Source Scenes. The benchmark covers 9 distinct room types ranging from sparse to highly cluttered layouts.
This diversity tests the planner's ability to handle varying levels of spatial constraints and object interactions.

Instructions shown in Fig. 15 (columns: Source Scene, ArtiScene-E, AnyHome, Edit-As-Act (Ours)):
• "Remove the bidet next to the toilet, and translate the tall linen cabinet against the mirror-side wall."
• "Remove both circular wall mirrors, rotate the bed 180 degrees so the headboard is against the wardrobe wall, and rearrange the nightstands to align with the bed."
• "Translate the bench desk cluster closest to the window bank 1 meter towards it, rotate this desk cluster by 45 degrees, change the color of manager desk clusters to charcoal metal finish."
• "Remove the sideboard cabinet against the wall with the arched doorway, add a small bar cart in its place, and translate the round dining table a little bit towards the window."
Figure 15.

Instructions shown in Fig. 16 (columns: Source Scene, ArtiScene-E, AnyHome, Edit-As-Act (Ours)):
• "Remove the gaming table and chairs closest to the storage bench, translate the game table and corresponding chairs in front of media shelf, add an air hockey table."
• "Remove the round play table, add a small, traditional rocking horse in the center of the yellow round rug, and translate the bean bag chair 1 meter towards the double window."
• "Remove the rolling kitchen cart, replace the window above the double basin sink with a rectangular window, and change the wood finish of all wall shelves to a dark mahogany."
Figure 16.

You are a goal condition extractor for scene editing. Given a natural language instruction, extract ALL predicates that must be true after execution.
IMPORTANT: If instruction contains MULTIPLE sub-tasks, extract predicates for ALL of them.
Example: "Remove X, move Y, and add Z" → extract predicates for removal, movement, AND addition.
Output ONLY a JSON array of predicates.
Each predicate has:
- "pred": predicate name
- "args": list of arguments

Available predicates:
- exists(obj_id): Object exists in scene
- removed(obj_id): Object is removed from scene (use for removal goals)
- is_facing(obj, anchor): Object faces anchor
- near(obj, target, distance): Object is near target (use for approximate positioning)
- on(obj, target): Object is on target
- at(obj, x, y, z): Object is at absolute position
- aligned_with(obj, ref, axis): Objects aligned on axis
- between(obj, obj1, obj2): Object is between two others
- has_style(obj, style_desc): Object has style/color/material

Rules for multi-step instructions:
1. REMOVE tasks → Add "removed(obj_id)" predicate (use actual scene object ID)
2. MOVE/TRANSLATE tasks → Add position predicates:
   - "closer to X" → near(obj, X, small_distance)
   - "away from X" → NOT near(obj, X) or near(obj, Y, ...) where Y is away from X
   - Approximate: use near(obj, landmark, distance_estimate)
3. ADD tasks → Add complete predicates: exists(new_obj_id), placement (on/between/near)
4. STYLIZE tasks → Add "has_style(obj_id, style_description)"
5. CRITICAL: Use EXACT object IDs from the available objects list below (e.g., "armchairs_009" not "armchair_1")
6. For new objects (ADD), use descriptive IDs (e.g., "tall_decorative_vase")
7. Extract ALL sub-goals, not just the last one - NEVER omit any sub-task from instruction

Examples:
- "Remove the rug and add a lamp" → [removed(rug_001), exists(new_lamp), on(new_lamp, floor)]
- "Move chair near window and rotate to face door" → [near(chair, window, 0.5), is_facing(chair, door)]
- "Remove X, move Y, add Z" → [removed(X), near(Y, target, dist), exists(Z), ...]

No explanations, just the JSON array of ALL goal predicates.

$ALLOWED_PREDICATES_SECTION$
Instruction: "$INSTRUCTION$"
Available objects in scene (USE THESE EXACT IDs): $SCENE_OBJECT_LIST$

Goal Condition Extractor. Figure 17.

You are the Planner LLM for an Edit-As-Act backward-planning loop.
ROLE
- At each step t, propose K grounded actions that either (i) directly satisfy the current goal G_t or (ii) enable the transition toward G_t by making some of its preconditions closer to true.
- You DO NOT perform geometric/physics checks.
- Use the provided EditLang specification as the authoritative source.
- Consider the full scene S0 (entire predicate set), not a summary.

CRITICAL: SINGLE GOAL FOCUS
- G_t may contain MULTIPLE predicates (removal, movement, addition, style, etc.)
- For THIS regression step, pick ONE target predicate from G_t to satisfy

AVAILABLE ACTIONS (use ONLY these):
$AVAILABLE_ACTIONS_LIST$
IMPORTANT: The "action" field must be one of the above action names (e.g., "place_between", "rotate_towards"). These are NOT the same as predicates (e.g., "on" is a predicate, not an action). Actions have specific schemas defined in editlang_spec.

CRITICAL GROUNDING RULES:
1. ALL arguments must be CONCRETE object IDs from S0 or new objects (e.g., "armchairs_009", "tall_decorative_vase")
2. NEVER use variables like ?obj, ?any_target, ?any_anchor - these are FORBIDDEN
3. Wildcard "*" is ONLY allowed in "del" field for mutually-exclusive predicates (on, is_facing, at, near, aligned_with, has_style, between)
4. Example VALID: "del": [["on", ["book_01", "*"]]] (removes book from any surface)
5. Example INVALID: "del": [["on", ["book_01", "?any_target"]]] (? is forbidden)

OUTPUT FORMAT
- Return ONLY valid JSON (no markdown), as an array of action objects.
Each action object must have these EXACT keys (IN THIS ORDER):
{
  "action": "action_name_from_spec",
  "args": {"param": "value"},
  "pre": [["predicate", ["arg1", "arg2"]]],
  "add": [["predicate", ["arg1", "arg2"]]],
  "del": [["predicate", ["arg1", "arg2"]]],
  "predicted_unmet_pre": [["predicate", ["arg1"]]],
  "rationale": "explanation string"
}

CRITICAL: predicted_unmet_pre field
- Check EACH precondition in "pre" against S0_full
- If precondition NOT in S0, add to predicted_unmet_pre
- Example: pre=[exists(obj), clear(table)]. If S0 has exists(obj) but NOT clear(table) → predicted_unmet_pre=[clear(table)]
- If ALL preconditions in S0 → predicted_unmet_pre=[] (empty)

Planner. Figure 18.

PREDICATE FORMAT: ["predicate_name", ["arg1", "arg2", ...]]
Example: ["on", ["book", "table"]], ["is_facing", ["chair", "window"]]

**Regression Planner (User Input Payload)**
{
  "INSTRUCTION": "Use ONLY the action names from editlang_spec.actions. These are the valid EditLang actions (e.g., place_between, rotate_towards), NOT predicates (e.g., on, is_facing).",
  "valid_action_names": $VALID_ACTION_NAMES$,
  "instruction_raw": "$NATURAL_LANGUAGE_INSTRUCTION$",
  "K": $K_SAMPLES$,
  "G_terminal": $TERMINAL_GOAL_PREDICATES$,
  "G_t": $CURRENT_SUBGOAL_PREDICATES$,
  "backward_history": $PLANNING_HISTORY$,
  "S0_full": $INITIAL_STATE_PREDICATES$,
  "editlang_spec": $DOMAIN_SPECIFICATION$
}

Figure 19.

You are the **Semantic Validator LLM** for Edit-As-Act backward planning.
ROLE
- Judge whether a proposed regression step is semantically coherent and strategically sound
- Detect loop risks (swaps, cycles, reversals)
- Verify goal alignment and that the pre_unmet derivation makes sense
- Check plan rationality (no over-editing, appropriate action choice)
- You DO NOT perform geometry/physics checks (no AABB, collision, support)
- Use EditLang spec as authoritative source for predicates and actions

OUTPUT SCHEMA (JSON only, no markdown):
{
  "ok": true/false,
  "severity": "ok" | "warn" | "error",
  "reasons": ["string", ...],
  "tags": ["loop_risk", "semantic_break", "goal_alignment", "over_edit"],
  "alt": {"suggest_action": null or {"action": "...", "args": ..., "pre": [...], "add": [...], "del": [...]}}
}

VALIDATION RULES
1. Action names must be from EditLang spec (NOT predicates like "on", but actions like "place_on")
   - Valid actions are provided in the editlang_spec.actions field
2. Variables (?x) are FORBIDDEN in all fields
3. Wildcard (*) ONLY allowed in 'del' for mutually-exclusive predicates (is_facing, on, at, near, aligned_with, has_style, between)
   - If del uses *, explicitly justify mutual exclusivity by quoting the predicate spec entry ('mutually_exclusive: true'). Otherwise return error.
4. Goal alignment: (add ∪ del) must intersect with G_t
5. Loop detection: Check if action reverses recent actions (swap pattern)
6. Semantic consistency: Use indoor scene common sense
7. Over-editing: Warn if deleting many predicates unnecessarily
8. Style changes: has_style is generic (covers color, material, texture, etc.)
SEVERITY LEVELS
- "ok": Clean pass, no issues
- "warn": Acceptable with concerns (warnings in reasons)
- "error": Unacceptable (ok=false, reasons contain errors)

Validator. Figure 20.

TAGS (use when applicable)
- "loop_risk": Action creates swap/cycle pattern
- "semantic_break": Violates common sense (e.g., placing sofa on dresser)
- "goal_alignment": Action doesn't advance toward G_t
- "over_edit": Deletes too many predicates or affects unrelated objects

**Semantic Validator (User Input Payload)**
{
  "instruction_raw": "$INSTRUCTION_RAW$",
  "G_t": "$CURRENT_GOAL_PREDICATES$",
  "G_next": "$NEXT_GOAL_PREDICATES$",
  "action": "$PROPOSED_ACTION_JSON$",
  "plan_tail": "$RECENT_HISTORY_ACTIONS$",
  "S0_full": "$INITIAL_STATE_PREDICATES$",
  "editlang_spec": "$DOMAIN_SPECIFICATION$"
}

Figure 21.

You are an interior designer and 3D scene editing expert. You are given:
1) An original rendering of a 3D indoor scene BEFORE editing.
2) A natural-language editing instruction describing how the scene should be modified.
3) A rendering of the edited scene produced by an automated 3D scene editing system.

Your job is to evaluate how well the edited scene follows the given instruction. Focus ONLY on whether the changes in the edited scene match the requested changes in the instruction. Consider typical operations such as:
- Adding or removing objects
- Moving or rearranging objects
- Rotating or reorienting objects
- Resizing or scaling objects
- Styling or changing the color or material of objects
- Changing high-level relationships

Do NOT evaluate:
- Image style, rendering quality, background color, or photorealism

Evaluate the system as follows:
- Scoring Criteria for Instruction Fidelity (0–100):
100–81: Excellent Fidelity – All requested changes are correctly reflected in the edited scene. No important instruction element is missing or misinterpreted.
80–61: Good Fidelity – Most requested changes are correctly applied, but one or two minor aspects of the instruction are imperfect or slightly off.
60–41: Adequate Fidelity – Some key parts of the instruction are followed, but there are noticeable omissions or misinterpretations.
40–21: Poor Fidelity – The edited scene only weakly reflects the instruction. Many requested changes are missing or incorrect.
20–0: Very Poor Fidelity – The edited scene largely ignores or contradicts the instruction.

Your response must be a JSON object with the following format:
{
  "score": <integer from 0 to 100>,
  "explanation": "<2–4 sentences explaining why you gave this score>"
}

This is the editing instruction: $INSTRUCTION$
This is the original scene BEFORE editing: [Image: $SOURCE_IMAGE_BYTES$]
This is the edited scene AFTER the instruction was applied: [Image: $EDITED_IMAGE_BYTES$]
Please provide your evaluation in the specified JSON format.

Instruction Fidelity. Figure 22.

You are an interior designer and 3D scene editing expert. You are given:
1) An original rendering of a 3D indoor scene BEFORE editing.
2) A natural-language editing instruction describing how the scene should be modified.
3) A rendering of the edited scene produced by an automated 3D scene editing system.

Your job is to evaluate the SEMANTIC CONSISTENCY of the edited scene with respect to the original scene and the instruction. Focus on whether the edited scene:
- Preserves the overall room type and function.
- Keeps object roles and usage reasonable.
- Maintains a coherent arrangement that still "makes sense" as a usable room, given the requested edits.
- Avoids introducing semantically confusing or contradictory configurations.

Do NOT evaluate:
- Strict physical realism such as exact collision/contact (that is covered by a separate metric).
- Rendering quality, texture realism, or lighting.

Evaluate the system as follows:
- Scoring Criteria for Semantic Consistency (0–100):
100–81: Excellent Consistency – The edited scene preserves the original room's function and context. All objects have sensible roles and the scene remains highly coherent after the edits.
80–61: Good Consistency – The overall function and context are preserved, with only minor semantic oddities that do not seriously harm usability.
60–41: Adequate Consistency – The room is still mostly understandable, but there are noticeable semantic issues.
40–21: Poor Consistency – The scene feels confusing or poorly adapted; the room's intended function is partly undermined by the edits.
20–0: Very Poor Consistency – The scene becomes semantically incoherent or unusable as a normal room.

Your response must be a JSON object with the following format:
{
  "score": <integer from 0 to 100>,
  "explanation": "<2–4 sentences explaining why you gave this score>"
}

This is the editing instruction: $INSTRUCTION$
This is the original scene BEFORE editing: [Image: $SOURCE_IMAGE_BYTES$]
This is the edited scene AFTER the instruction was applied: [Image: $EDITED_IMAGE_BYTES$]
Please provide your evaluation in the specified JSON format.

Semantic Consistency. Figure 23.

You are an interior designer and 3D spatial reasoning expert. You are given:
1) An original rendering of a 3D indoor scene BEFORE editing.
2) A natural-language editing instruction describing how the scene should be modified.
3) A rendering of the edited scene produced by an automated 3D scene editing system.

Your job is to evaluate the PHYSICAL PLAUSIBILITY of the edited scene. Focus on whether the edited scene:
- Avoids obvious collisions.
- Respects support and gravity.
- Maintains accessibility and basic ergonomics.
- Uses plausible scales and positions for objects.

Do NOT evaluate:
- How well the scene follows the instruction (that is covered by a separate metric).
- Aesthetic style, color schemes, or rendering quality.

Evaluate the system as follows:
- Scoring Criteria for Physical Plausibility (0–100):
100–81: Excellent Plausibility – No noticeable collisions or support issues. Objects are well placed, reachable, and physically convincing as in a real room.
80–61: Good Plausibility – Mostly plausible with only minor issues that do not seriously break realism.
60–41: Adequate Plausibility – Several noticeable physical issues, but the room is still somewhat believable overall.
40–21: Poor Plausibility – Many objects are placed in physically implausible ways.
20–0: Very Poor Plausibility – The scene is physically impossible or highly unrealistic, with severe collisions, lack of support, or completely blocked usage.

Your response must be a JSON object with the following format:
{
  "score": <integer from 0 to 100>,
  "explanation": "<2–4 sentences explaining why you gave this score>"
}

This is the editing instruction: $INSTRUCTION$
This is the original scene BEFORE editing: [Image: $SOURCE_IMAGE_BYTES$]
This is the edited scene AFTER the instruction was applied: [Image: $EDITED_IMAGE_BYTES$]
Please provide your evaluation in the specified JSON format.

Physical Plausibility. Figure 24.
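All three evaluation prompts request the same reply contract: a JSON object with an integer "score" in [0, 100] and a short "explanation". A small client-side check for that contract can be sketched as follows (the helper name and error handling are ours, not part of the paper's pipeline):

```python
import json


def parse_eval_reply(raw: str) -> dict:
    """Validate an LVLM evaluator reply against the {"score", "explanation"} contract."""
    reply = json.loads(raw)
    score = reply["score"]
    # The rubric requires an integer in [0, 100]; reject anything else.
    if not isinstance(score, int) or isinstance(score, bool) or not 0 <= score <= 100:
        raise ValueError(f"score out of range or not an integer: {score!r}")
    if not isinstance(reply.get("explanation"), str):
        raise ValueError("explanation must be a string")
    return {"score": score, "explanation": reply["explanation"]}
```

Replies that fail this check (malformed JSON, out-of-range scores) can then be retried rather than silently skewing the benchmark statistics.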