
Paper deep dive

R&D: Balancing Reliability and Diversity in Synthetic Data Augmentation for Semantic Segmentation

Quang-Huy Che, Dinh-Duy Phan, Duc-Khai Lam

Year: 2026 · Venue: arXiv preprint · Area: cs.CV · Type: Preprint · Embeddings: 40

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/22/2026, 6:04:07 AM

Summary

The paper introduces a novel synthetic data augmentation pipeline for semantic segmentation that balances diversity and reliability. By integrating an Image-to-Image Controllable Diffusion Model and a Controllable Inpainting Diffusion Model, the method generates high-quality synthetic images that maintain distributional consistency with real datasets. Key innovations include class-aware prompting and visual prior blending to ensure precise alignment between generated images and segmentation labels. Experimental results on PASCAL VOC and BDD100K demonstrate significant performance improvements in data-scarce scenarios.

Entities (6)

BDD100K · dataset · 100%
PASCAL VOC · dataset · 100%
Controllable Inpainting Diffusion Model · method · 95%
DeepLabV3+ · model-architecture · 95%
Image-to-Image Controllable Diffusion Model · method · 95%
Mask2Former · model-architecture · 95%

Relation Signals (3)

Class-aware prompting enhances Image Quality

confidence 95% · These combined methods enhance the quality of the generated images

Controllable Inpainting Diffusion Model balances Data Diversity and Reliability

confidence 90% · aiming to balance data diversity and the reliability of the generated images

Image-to-Image Controllable Diffusion Model improves Semantic Segmentation

confidence 90% · Our method significantly enhances semantic segmentation performance

Cypher Suggestions (2)

List datasets used for evaluation · confidence 95% · unvalidated

MATCH (d:Dataset) WHERE (d)-[:EVALUATED_BY]->(:Method) RETURN d.name

Find all methods used for data augmentation in the paper · confidence 90% · unvalidated

MATCH (m:Method)-[:USED_FOR]->(t:Task {name: 'Data Augmentation'}) RETURN m.name

Abstract

Abstract: Collecting and annotating datasets for pixel-level semantic segmentation tasks are highly labor-intensive. Data augmentation provides a viable solution by enhancing model generalization without additional real-world data collection. Traditional augmentation techniques, such as translation, scaling, and color transformations, create geometric variations but fail to generate new structures. While generative models have been employed to extend the semantic information of datasets, they often struggle to maintain consistency between the original and generated images, particularly for pixel-level tasks. In this work, we propose a novel synthetic data augmentation pipeline that integrates controllable diffusion models. Our approach balances data diversity and reliability, effectively bridging the gap between synthetic and real data. We utilize class-aware prompting and visual prior blending to further improve image quality, ensuring precise alignment with segmentation labels. Evaluating on benchmark datasets such as PASCAL VOC and BDD100K, we demonstrate that our method significantly enhances semantic segmentation performance, especially in data-scarce scenarios, while improving model robustness in real-world applications. Our code is available at https://github.com/chequanghuy/Enhanced-Generative-Data-Augmentation-for-Semantic-Segmentation-via-Stronger-Guidance.

Tags

ai-safety (imported, 100%) · cscv (suggested, 92%) · preprint (suggested, 88%)


Full Text

39,954 characters extracted from source content.


University of Information Technology, Ho Chi Minh City, Vietnam · Vietnam National University, Ho Chi Minh City, Vietnam
Email: huycq@uit.edu.vn, duypd@uit.edu.vn, khaild@uit.edu.vn

R&D: Balancing Reliability and Diversity in Synthetic Data Augmentation for Semantic Segmentation

Quang-Huy Che, Dinh-Duy Phan (corresponding author), Duc-Khai Lam

Abstract: Collecting and annotating datasets for pixel-level semantic segmentation tasks are highly labor-intensive. Data augmentation provides a viable solution by enhancing model generalization without additional real-world data collection. Traditional augmentation techniques, such as translation, scaling, and color transformations, create geometric variations but fail to generate new structures. While generative models have been employed to extend the semantic information of datasets, they often struggle to maintain consistency between the original and generated images, particularly for pixel-level tasks. In this work, we propose a novel synthetic data augmentation pipeline that integrates controllable diffusion models. Our approach balances data diversity and reliability, effectively bridging the gap between synthetic and real data. We utilize class-aware prompting and visual prior blending to further improve image quality, ensuring precise alignment with segmentation labels. Evaluating on benchmark datasets such as PASCAL VOC and BDD100K, we demonstrate that our method significantly enhances semantic segmentation performance, especially in data-scarce scenarios, while improving model robustness in real-world applications. Our code is available at https://github.com/chequanghuy/Enhanced-Generative-Data-Augmentation-for-Semantic-Segmentation-via-Stronger-Guidance.

1 Introduction

Deep learning has transformed the field of computer vision, where model performance depends not only on methodological advancements but also significantly on the quality and quantity of training data.
Large-scale datasets, such as SA-1B [12] and ImageNet [7], have played a crucial role in driving progress across various computer vision tasks. However, collecting and annotating these datasets is labor-intensive, especially for complex and privacy-sensitive data. This challenge is particularly notable in semantic segmentation, where each pixel in an image must be accurately classified. While widely used datasets like PASCAL VOC [8] and BDD100K [25] provide a strong foundation for training segmentation models, expanding or creating new datasets of similar scale remains a significant bottleneck. Consequently, data augmentation has emerged as a critical approach to enhancing model generalization without requiring additional real-world data collection. This technique not only increases data diversity but also reduces annotation costs, offering an efficient alternative for addressing challenges in semantic segmentation.

Traditional data augmentation methods such as rotation, scaling, flipping, or pixel-level manipulations (e.g., blurring, adjusting brightness and contrast) enhance model accuracy by introducing geometric and color variations. However, these transformations do not generate new structural components, perspectives, or textures, thus limiting their ability to expand dataset diversity. More advanced techniques include partial image removal methods (e.g., Random Erasing [19], Cutout) and image mixing techniques (e.g., Mosaic [4], Mixup [26]). However, most of these methods primarily expand the visual representation space without introducing new semantic information, thereby reducing their effectiveness in improving the model's generalization capability.

Unlike previous data augmentation methods [22], generative models are trained directly on the target dataset to produce additional samples. However, since these models learn from the same data domain, the generated samples often lack diversity compared to the original data.
Without fine-tuning on the target dataset, synthesized images tend to follow the distribution of the pre-trained model rather than the desired distribution for data augmentation. Although generative models [16, 23, 1] can generate semantically diverse images, ensuring distributional alignment between the original and generated data remains a challenge. Additionally, semantic segmentation requires that generated samples preserve precise object shapes and structures, unlike classification [10, 20] or object detection tasks [9]. To address these limitations, we propose a synthetic data augmentation pipeline for semantic segmentation based on generative models. In summary, the contributions of our work are as follows:

• We propose a novel synthetic data augmentation pipeline that integrates two controllable diffusion models to generate synthetic datasets for semantic segmentation. This approach bridges the gap between synthetic and real datasets, ensuring both the diversity and reliability of synthetic images.

• We integrate the proposed pipeline with our class-aware prompting method and with visual prior blending [1]. Together, these methods enhance the quality of the generated images by ensuring that all relevant objects appear in the generated images and by improving the alignment of the generated images with segmentation labels, thereby ensuring high accuracy and reliability in the synthetic datasets.

• We demonstrate the effectiveness of our approach through extensive experiments on standard benchmarks, including PASCAL VOC [8] and BDD100K [25]. Our method consistently improves semantic segmentation performance, particularly in data-scarce scenarios.

2 Related work

2.1 Image Generation

Image generation is a significant research direction in computer vision and artificial intelligence, especially with the rapid advancement of deep learning models in recent years.
Generative Adversarial Networks (GANs) [11], as foundational models in image synthesis, have been widely used to generate high-resolution images. However, GANs often face optimization challenges, making it difficult for the model to fully capture the underlying data distribution. Recently, diffusion models (DMs) have emerged as a more advanced approach to image generation, approximating the data distribution more stably than GANs. Stable Diffusion (SD) [18, 17] is a variant of diffusion models that operates in a latent space instead of directly processing images in pixel space. Through the cross-attention mechanism, SD can generate images conditioned on various input modalities such as text, bounding boxes, or semantic maps. One key advancement that enhances controllability in image generation is the integration of SD with ControlNet [27] or T2I-Adapter [15]. These methods allow the model to incorporate additional structured guidance (visual priors), represented as edges, segmentation masks, line art, or depth maps, improving the consistency of shape and structure in the generated images.

2.2 Image Synthesis for Data Augmentation

Previous studies have utilized Generative Adversarial Networks (GANs) to generate synthetic data for semantic segmentation, focusing primarily on object-centered images. However, these methods face limitations when handling complex image layouts or interactions between multiple objects. With advancements in generative models, data augmentation techniques based on diffusion models have recently emerged. However, most of these methods are tailored for image classification [10, 20] or object detection [9] rather than semantic segmentation, which requires pixel-level precision. Semantic segmentation poses a significant challenge for generative image synthesis due to its strict accuracy requirements.
Synthetic-dataset approaches [16, 23] can generate synthetic images along with segmentation masks for specified classes. These methods generate synthetic datasets and pseudo-labels from text descriptions, enabling their use for pretraining segmentation models. Unlike synthetic-dataset approaches, generative-based augmentation uses existing images and masks to create additional training data. Inpainting-based methods [13] modify objects while preserving backgrounds but often limit data diversity. In contrast, Che et al. [1] introduced the Controllable Diffusion Model with strong guidance for image synthesis, demonstrating notable improvements in data augmentation. However, synthetic data generation faces two challenges: (1) mismatches between segmentation masks and synthesized images, and (2) domain shifts due to the generative model's training dataset constraints. In this work, we propose a synthetic data augmentation pipeline based on generative models. Our pipeline integrates advanced techniques to enhance the robustness of the generated data. Additionally, our method balances diversity against reliability and consistency with the original dataset, resulting in high-quality synthetic data suitable for training semantic segmentation models.

Figure 1: Our proposed synthetic data augmentation pipeline utilizes the real dataset D_0 to create two synthetic datasets, D_1^gen and D_2^gen. The annotations for the synthetic data are directly copied from the labels of the real dataset.

3 Methods

In this work, we propose a pipeline that integrates two SD models designed for controllable synthetic data generation: the Image-to-Image Controllable Diffusion Model (Sec. 3.2) and the Controllable Inpainting Diffusion Model (Sec. 3.3). This pipeline takes an image and its corresponding segmentation labels as input and generates two synthetic images for each input image.
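The two-branch generation just described can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: `gen_img2img` and `gen_inpaint` are hypothetical stand-ins for the two diffusion models, and the labels are copied unchanged to the synthetic samples, as the pipeline specifies.

```python
def build_synthetic_datasets(real_dataset, gen_img2img, gen_inpaint):
    """Sketch of the pipeline in Fig. 1: each real (image, label) pair
    yields one sample for D1_gen (Img2Img Controllable Diffusion, diverse)
    and one for D2_gen (Controllable Inpainting Diffusion, distribution-
    preserving). Annotations are copied verbatim from the real dataset."""
    d1_gen, d2_gen = [], []
    for image, label in real_dataset:
        d1_gen.append((gen_img2img(image, label), label))
        d2_gen.append((gen_inpaint(image, label), label))
    return d1_gen, d2_gen
```

Training then uses the union D_0 ∪ D_1^gen ∪ D_2^gen, so each real image contributes two synthetic variants with identical segmentation labels.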
The overall architecture of the proposed pipeline is illustrated in Fig. 1. Given a real dataset D_0, the proposed pipeline generates the synthetic dataset D_1^gen ∪ D_2^gen, where D_1^gen is generated by the Image-to-Image Controllable Diffusion Model, producing a highly diverse dataset by changing both labeled and unlabeled objects as well as the background, while D_2^gen is generated by the Controllable Inpainting Diffusion Model, ensuring data distribution consistency by modifying only the labeled objects while keeping the remaining parts unchanged. Merging the two datasets D_1^gen and D_2^gen yields a reliable synthetic dataset that simultaneously maximizes diversity and preserves data distribution fidelity. Furthermore, to enhance synthetic image precision, we propose novel methods for textual prompt refinement and visual priors in Section 3.1.

3.1 Robust conditions for Controllable Diffusion Models

3.1.1 Preparing the text prompt: To generate high-quality synthetic images for semantic segmentation, constructing an effective prompt is crucial to ensuring the presence of all relevant objects in the generated image. A straightforward approach is to list the annotated classes explicitly. Given an image I_i with a list of labeled classes C_i = [c_1, c_2, …], a simple prompt can be formulated as "A photograph of c_1, c_2, ...". While this approach ensures that all objects in the image are mentioned, it lacks contextual information, making it challenging for the generative model to produce a coherent and realistic image. Instead of using simple annotated class lists, another approach is to apply image captioning models to generate descriptions for the dataset. However, these captions do not guarantee the inclusion of all annotated classes, which may result in generated images that are either incomplete or contain incorrect objects.
To address these challenges, we propose a prompt formulation integrating general contextual information about the image with the list of annotated classes. Unlike previous works [16, 1], which merely concatenate the image caption with the annotated class list, often leading to poor linguistic coherence, we utilize BLIP [14] as a conditional image captioner. Specifically, BLIP generates a more comprehensive description that combines visual context with the labeled class list. To strengthen the focus on class tokens corresponding to target objects, we propose a re-weighting mechanism applied to the class token embeddings [6]. By assigning higher weights to class tokens, this approach emphasizes key objects, thereby improving segmentation accuracy. The adjusted class tokens are denoted as "[class]++". Our prompt generation method, called class-aware prompting, integrates class-specific information to produce more contextually rich and accurate prompts. Figure 2 illustrates an example of an image alongside its corresponding label, as well as the various types of prompts.

3.1.2 Visual priors for the controllable model: Controllable generative models are characterized by their ability to generate high-quality images guided by visual priors. Among these, edge-based visual priors are widely used for object representation because they generalize image structures well. However, relying solely on edge information may introduce limitations when target objects are not well-emphasized or edge maps lack sufficient detail. This limitation can lead to generated objects not aligning accurately with the segmentation labels. To address this issue, our proposed pipeline incorporates visual prior blending [1], a technique designed to enhance the representation of labeled objects, ensuring that generated content better aligns with segmentation labels.
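The class-aware prompting step of Sec. 3.1.1 can be sketched as plain string manipulation. This is an illustrative approximation: the paper's re-weighting acts on token embeddings [6], and the "++" marker here is only a textual stand-in for that mechanism.

```python
import re

def class_aware_prompt(caption, classes, marker="++"):
    """Sketch of class-aware prompting: ensure every labeled class
    appears in the caption, then tag class tokens with an emphasis
    marker (the paper denotes re-weighted tokens as "[class]++")."""
    prompt = caption
    # Append any labeled class the caption missed.
    for cls in classes:
        if cls.lower() not in prompt.lower():
            prompt += f", with {cls}"
    # Naive whole-string substitution; a real implementation would match
    # word boundaries and handle multi-word class names.
    for cls in classes:
        prompt = re.sub(re.escape(cls), cls + marker, prompt, flags=re.IGNORECASE)
    return prompt
```

For example, a BLIP caption that mentions only some labeled classes gains the missing ones before the emphasis markers are applied.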
Given V^I as the visual prior derived from the original image and V^S as the visual prior extracted from the segmentation mask, the blended visual prior V^* is formulated as:

V^* = α V^I + V^S    (1)

where α ∈ (0, 1) is a blending coefficient. Setting α < 1 reduces the influence of global image structures while emphasizing the information from labeled objects.

Figure 2: Some examples of text prompt selection for input images. Simple text prompts are often too simplistic, while generated captions may miss some labeled classes. Class-prompt appending addresses this but can lead to incoherent prompts. In contrast, conditional image captioning creates coherent prompts that accurately describe the image and include all labeled classes.

3.2 Image-to-Image Controllable Diffusion Model

Controllable Diffusion Models [15, 27] have demonstrated remarkable capabilities in generating highly diverse synthetic images [1, 9], offering significant advantages for data augmentation techniques. Image generation can be expressed as a function G_0 : V × P → I^gen, where V represents the visual prior of the input image, P denotes the textual prompt describing the image content, and I^gen is the output image. However, controllable models often overlook input image distributions due to gaps between training data and target domains, leading to distribution shifts in generated images. To address this, we integrate an Image-to-Image (Img2Img) mechanism into the Controllable Diffusion Model framework. This extends the function G_0 to G_1 : I × V × P → I^gen, where the additional input I represents the reference image. This approach ensures that the generated images not only maintain diversity at a moderate level but also exhibit improved similarity to the reference image, thereby achieving better alignment with the target data distribution.
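The blending rule of Eq. (1) and the extended generation function G_1 can be sketched as follows. This is a minimal sketch under stated assumptions: `diffusion_img2img` is a hypothetical stand-in for the actual SD-XL + T2I-Adapter call, and the priors are assumed to be float arrays in [0, 1] (e.g. line-art maps).

```python
import numpy as np

def blend_visual_priors(v_image, v_seg, alpha=0.8):
    """Visual prior blending (Eq. 1): V* = alpha * V^I + V^S.
    alpha < 1 damps global image structure so the segmentation-derived
    prior dominates; the clip keeps the blend in a valid range."""
    assert 0.0 < alpha < 1.0
    return np.clip(alpha * v_image + v_seg, 0.0, 1.0)

def generate_img2img(image, v_image, v_seg, prompt, diffusion_img2img, alpha=0.8):
    """I_gen = G*_1(I, V*, P*): condition generation on the reference
    image, the blended prior, and the class-aware prompt."""
    v_star = blend_visual_priors(v_image, v_seg, alpha)
    return diffusion_img2img(image=image, visual_prior=v_star, prompt=prompt)
```

The paper sets α = 0.8 in its experiments (Sec. 4.1.2), which keeps most of the global structure while still boosting the labeled-object regions.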
Furthermore, the Img2Img mechanism preserves the reference image's structural composition more effectively than traditional Text-to-Image (T2I) methods, as it uses the input image as a foundational guide during the diffusion process. As illustrated in Fig. 3(a), our pipeline incorporates the two proposed conditioning methods to generate the new image I^gen: class-aware prompting and visual prior blending, which produce P^* and V^*, respectively. These components are then combined with the input image, enabling the Img2Img Controllable Diffusion Model to produce highly diverse images while preserving fine-grained details and maintaining the distributional characteristics of the original data.

Figure 3: Image generation using the Img2Img Controllable Diffusion Model.

I^gen = G^*_1(I, V^*, P^*)    (2)

The overall process is depicted in Fig. 3(b).

3.3 Controllable Inpainting Diffusion Model

Figure 4: Image generation using the Controllable Inpainting Diffusion Model.

The issue of data generation drifting out of the original domain when using controllable diffusion models has been highlighted in previous research [1]. This phenomenon can lead to a decline in model training performance as the dataset size increases. Although Sec. 3.2 introduces the Image-to-Image Controllable Diffusion Model to mitigate this limitation, transforming the entire image makes it challenging to preserve the original data distribution. Therefore, we propose to maintain the original image characteristics by employing an Inpainting Diffusion Model that modifies specific regions of the image instead of transforming the entire image.
To combine the advantages of both methods for the data augmentation task, we pair the Inpainting Diffusion Model with the Img2Img Controllable Diffusion Model proposed in Sec. 3.2, aiming to balance data diversity and the reliability of the generated images. The inpainting process can be represented as a function G : I × M × P → I^gen, where I is the input image, M is the mask specifying the regions to be modified, P is the textual prompt guiding the inpainting process, and I^gen is the generated image after inpainting. It has been noted [13] that relying on a pre-existing Inpainting Diffusion Model G ensures neither that newly generated objects conform to the mask M nor that they resemble the original objects. To address this limitation, we integrate the Controllable Model with the Inpainting Diffusion Model, resulting in a novel framework termed the Controllable Inpainting Diffusion Model. Specifically, controllable models such as T2I-Adapter [15] and ControlNet [27] extract visually structured information and inject it into the U-Net architecture of the Diffusion Model. In this approach, the generation process is conditioned not only on the mask M but also on the visual prior V. We also utilize class-aware prompting and visual prior blending to reduce the risk of small objects being removed, replaced with background elements, or generated with inaccurate shapes; this process is illustrated in Fig. 4(a). These improvements address the above limitations and extend the function to G^*_2 : I × V^* × M × P^* → I^gen. However, unlike the Img2Img Controllable Diffusion Model, the Controllable Inpainting Diffusion Model generates objects sequentially, one object type at a time.
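A minimal sketch of this sequential per-class generation and the mask-guided merge that follows (formalized as Eqs. (3) and (4)): `generate` is a hypothetical stand-in for G^*_2(I, V^*, M_i, P^*_i), and images are 2D float arrays here for brevity.

```python
import numpy as np

def inpaint_per_class(image, label_map, class_ids, generate):
    """One pass of per-class generation and merging. `label_map` is an
    HxW integer segmentation label; class regions are disjoint, as in
    any segmentation label map, so pasting them one at a time is
    equivalent to the single merge of Eq. (4)."""
    out = image.astype(float).copy()
    for c in class_ids:
        mask = (label_map == c).astype(float)   # M_i derived from the labels
        gen = generate(image, mask, c)          # I_i^gen, one class at a time
        out = out * (1.0 - mask) + gen * mask   # paste only the class region
    # Background pixels keep the original image: the I ⊙ (1 − Σ M_i) term.
    return out
```

Because each class is generated in isolation, visually similar classes (e.g. similar shapes) cannot bleed into one another's masks during compositing.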
This approach allows for more accurate generation of objects, particularly when dealing with objects that have similar shapes. In this case, the image for each new object is generated as follows:

I_i^gen = G^*_2(I, V^*, M_i, P^*_i)    (3)

where M_i ∈ M is the segmentation mask for the i-th class c_i ∈ C, determined by leveraging the segmentation labels for each class, and P^*_i ∈ P^* is the prompt describing the data generation process for class c_i. After obtaining the list of images {I_i^gen} generated for each class, we perform an image merging operation to produce the final composite image I_all^gen. This process relies on the list of masks {M_i} to obtain an image with the labeled objects modified, as shown in Fig. 4(b):

I_all^gen = I ⊙ (1 − Σ_{i=1}^N M_i) + Σ_{i=1}^N (I_i^gen ⊙ M_i)    (4)

Table 1: Comparison of mIoU (%) on the validation set when training models on the original dataset (D_0) and when merging it with the synthetic dataset (D_0 ∪ [1] vs. D_0 ∪ ours), using the DeepLabV3+ and Mask2Former architectures. Column headers give the number of training images.

| Model | Backbone | Data | VOC7 (209) | VOC12 (92) | VOC12 (183) | VOC12 (366) | VOC12 (732) | VOC12 (1464) |
|---|---|---|---|---|---|---|---|---|
| DeepLabV3+ | ResNet50 | D_0 | 63.75 | 48.19 | 58.44 | 65.84 | 70.55 | 72.19 |
| DeepLabV3+ | ResNet50 | D_0 ∪ [1] | 64.02 | 51.83 | 59.37 | 65.98 | 69.14 | 72.16 |
| DeepLabV3+ | ResNet50 | D_0 ∪ ours | 64.47 | 53.67 | 59.98 | 67.52 | 71.06 | 72.96 |
| DeepLabV3+ | ResNet101 | D_0 | 67.61 | 54.06 | 62.88 | 67.85 | 73.06 | 76.19 |
| DeepLabV3+ | ResNet101 | D_0 ∪ [1] | 68.79 | 56.01 | 63.09 | 68.89 | 73.05 | 75.68 |
| DeepLabV3+ | ResNet101 | D_0 ∪ ours | 68.81 | 55.93 | 64.08 | 69.54 | 73.83 | 76.91 |
| Mask2Former | Swin-B | D_0 | 76.19 | 59.11 | 74.39 | 75.21 | 79.02 | 81.78 |
| Mask2Former | Swin-B | D_0 ∪ [1] | 77.01 | 65.01 | 76.67 | 77.10 | 79.87 | 81.86 |
| Mask2Former | Swin-B | D_0 ∪ ours | 78.52 | 64.51 | 76.45 | 77.84 | 81.08 | 82.88 |

4 Experiments

4.1 Datasets and implementation details

4.1.1 Datasets: We evaluate our synthetic data generation framework on two benchmark datasets: PASCAL VOC [8] (VOC07 and VOC12) and BDD100K [25].
To assess performance under data-limited scenarios, we conduct experiments on both the full VOC12 dataset (1,464 images) and its subsets [21]. Beyond standard object segmentation tasks, we further validate the model's capability to generate images under diverse environmental conditions (e.g., weather, scene) using BDD100K, a large-scale dataset capturing real-world driving scenarios for drivable area and lane segmentation tasks.

4.1.2 Implementation details: For object segmentation evaluation, we employ DeepLabV3+ [3] (with ResNet50/101 backbones) and Mask2Former [5] (Swin-B backbone) implemented in the MMSegmentation framework, training for 30K iterations at 512×512 resolution with batch size 16 using AdamW optimization and default augmentations. For weather-conditioned segmentation on BDD100K [25], we adopt TwinLiteNet [2] for simultaneous lane and drivable area segmentation. Our image generation pipeline leverages SD-XL [17] controlled via T2I-Adapter [15] with line art as the visual prior. The coefficient α in the visual prior blending method is set to 0.8.

Table 2: Performance of the Mask2Former (Swin-B) model when trained on (1) the real dataset, (2) a synthetic dataset, and (3) fine-tuned on the real dataset after pre-training on a synthetic dataset. The compared methods include generating synthetic data with pseudo-labels [23, 16, 24] and synthetic data based on the original dataset [1].

| Setting | Training data | mIoU (%) |
|---|---|---|
| (1) Real only | VOC (5k) | 83.4 |
| (1) Real only | VOC (1.5k) | 81.8 |
| (2) Synthetic only | DiffuMask [23] (60k) | 70.6 |
| (2) Synthetic only | [16] (40k) | 67.6 |
| (2) Synthetic only | Attn2Mask [24] | 71.0 |
| (2) Synthetic only | SG [1] (1.5k) | 73.0 |
| (2) Synthetic only | Ours (2.9k) | 76.3 |
| (3) Synthetic pre-training + real fine-tuning | DiffuMask [23] + VOC (5k) | 84.9 |
| (3) Synthetic pre-training + real fine-tuning | [16] + VOC | 82.4 |
| (3) Synthetic pre-training + real fine-tuning | Attn2Mask [24] + VOC | 82.8 |
| (3) Synthetic pre-training + real fine-tuning | Ours (2.9k) + VOC (1.5k) | 84.0 |

Table 3: Evaluation of multi-task segmentation model performance across different environmental conditions when integrating our method with real data via merging/fine-tuning.
| Condition | Ours | Images | Lane Line Accuracy (%) | Lane Line IoU (%) | Drivable Area mIoU (%) |
|---|---|---|---|---|---|
| Foggy | – | 130 | 56.9 | 4.5 | 72.6 |
| Foggy | ✓ | 390 | 67.8 / 67.5 | 8.3 / 8.7 | 79.9 / 80.4 |
| Tunnel | – | 129 | 80.5 | 9.3 | 73.4 |
| Tunnel | ✓ | 387 | 84.4 / 85.2 | 15.2 / 14.9 | 86.0 / 87.3 |
| Gas Station | – | 27 | 48.9 | 0.2 | 67.1 |
| Gas Station | ✓ | 81 | 62.6 / 63.7 | 0.4 / 0.5 | 70.5 / 72.1 |

(Values X / Y denote merging with real data / fine-tuning on real data.)

4.2 Semantic segmentation results on VOC

To evaluate our proposed data augmentation method, we compare models (DeepLabV3+ and Mask2Former) trained on the original dataset (D_0) and on our augmented dataset (D_0 ∪ D_1^gen ∪ D_2^gen). We also compare our method with Stronger Guidance [1], re-implemented using our training settings. Notably, we did not apply the object filter [9] or the class balancing algorithm [1], in order to focus solely on synthetic image quality. As shown in Tab. 1, our method consistently improves semantic segmentation performance across datasets and architectures. While our method outperforms the baseline (trained on D_0) in all configurations, it occasionally, though not significantly, underperforms [1] on smaller datasets (VOC12 with 92 images when training DeepLabV3+ ResNet101, and VOC12 with 92 or 183 images when training Mask2Former). However, as the dataset size increases (e.g., VOC12 with 732 or 1464 images), our method consistently outperforms [1], which sometimes underperforms even the baseline. This suggests that while highly diverse data improves accuracy on small datasets, it may cause distribution shifts that decrease performance as the number of samples grows. Our method addresses this by balancing diversity and data consistency. Additionally, following [23, 24], we first train on synthetic data and then fine-tune on real data (VOC12 with 1464 images), since synthetic data may not perfectly align with real data due to domain shifts. Tab. 2 shows that our method achieves the highest mIoU (76.3%) when trained solely on synthetic data, outperforming other approaches.
After fine-tuning, DiffuMask [23] achieves the best performance (84.9%), but it requires 60k synthetic images for pre-training and 5k real images for fine-tuning. In contrast, our method achieves a competitive 84.0% mIoU with only 2.9k synthetic images and 1.5k real images, improving on the baseline (81.8%) by 2.2%. This highlights the effectiveness of our approach.

4.3 Image generation based on environmental conditions

In addition to object segmentation, we evaluate the generation of images under different environmental conditions, such as fog, tunnel, and gas station scenarios. Using the TwinLiteNet model [2] for drivable area and lane segmentation on the BDD100K dataset, we observe poor performance with fewer than 200 samples in these conditions. However, applying our method, either by merging synthetic and real datasets or by fine-tuning on real data, significantly improves the model's performance. This highlights our method's ability to generate synthetic data tailored to specific environmental conditions, enhancing model performance in real-world scenarios.

Table 4: Quantitative comparison (FID / CLIP Score (ViT-B/32)).

| Data | CLIP ↑ | FID ↓ |
|---|---|---|
| [1] | 0.81 | 114.49 |
| D_1^gen | 0.84 | 101.92 |
| D_2^gen | 0.92 | 72.22 |

Figure 5: Some synthetic images generated using different methods.

4.4 Visualization and metrics for generated image quality

Qualitative results on the PASCAL VOC and BDD100K datasets, as illustrated in Figure 5, demonstrate that our method generates images highly similar to the originals (D_0). Specifically, images produced by the Img2Img Controllable Diffusion Model (D_1^gen) and the Controllable Inpainting Diffusion Model (D_2^gen) not only exhibit diversity but also maintain structural similarity to the original images. In contrast, the method proposed in [1] yields inferior results, failing to preserve the structure and distribution of the generated images. In addition, we conducted quantitative evaluations using two metrics: FID and CLIP Score.
The results in Tab. 4, evaluated on the VOC7 dataset, show that our method achieves better scores, confirming its ability to generate high-quality images that closely align with the distribution of the original data.

4.5 Ablation Study

This section presents a comprehensive ablation study evaluating our method's components on PASCAL VOC with Mask2Former.

Table 5: Comparison of semantic segmentation performance (mIoU %) using different training strategies. Results are reported for Mask2Former models. R and S denote the number of real and synthetic training images; X / Y values denote merging with real data / fine-tuning on real data after synthetic pre-training.

| D_0 | D_1^gen | D_2^gen | VOC7 data | VOC7 mIoU (%) | VOC12 data | VOC12 mIoU (%) |
|---|---|---|---|---|---|---|
| ✓ | | | R: 209 | 76.19 | R: 1464 | 81.8 |
| | ✓ | | S: 209 | 74.32 | S: 1464 | 74.22 |
| | | ✓ | S: 209 | 73.85 | S: 1464 | 75.18 |
| | ✓ | ✓ | S: 418 | 74.67 | S: 2928 | 76.27 |
| ✓ | ✓ | | R: 209 + S: 209 | 77.58 / 78.53 | R: 1464 + S: 1464 | 81.91 / 83.01 |
| ✓ | | ✓ | R: 209 + S: 209 | 77.91 / 78.58 | R: 1464 + S: 1464 | 81.97 / 82.87 |
| ✓ | ✓ | ✓ | R: 209 + S: 418 | 78.52 / 80.21 | R: 1464 + S: 2928 | 82.88 / 84.02 |

4.5.1 Data Ablation Study: We evaluate the impact of synthetic data on semantic segmentation by comparing three strategies: (1) training on real data only, (2) training on synthetic data only, and (3) combining both. Results in Tab. 5 show that while training solely on synthetic data achieves notable accuracy, it remains below training on real data. However, combining synthetic with real data, either by merging or by fine-tuning on real data after pre-training on synthetic data, significantly improves performance. The best results are achieved when fine-tuning on real data after synthetic pre-training, and using both synthetic datasets (D_1^gen and D_2^gen) further enhances performance, demonstrating that synthetic data is effective when combined with real data. These results confirm that while synthetic data cannot fully replace real data, it plays a key role in improving model performance, especially when merged or fine-tuned with real data.
4.5.2 Text prompt selection:

Model performance when selecting text prompts with different methods is detailed in Tab. 6. Our proposed class-aware prompting method outperforms previous methods, achieving 78.52% mIoU. These results indicate that our text prompt generation method helps the model focus more effectively on the classes that need to be segmented, thereby significantly improving the performance of the semantic segmentation model.

4.5.3 Effect of different numbers of generated images in the synthetic data:

In addition to generating two synthetic images per original image (via Controllable Inpainting Diffusion and Img2Img Controllable Diffusion), we conducted experiments with larger numbers of generated images to evaluate semantic segmentation performance. The results in Table 7 are presented in the format X/Y, where X denotes the performance from merging synthetic data with real data, and Y the performance after fine-tuning on real data following pre-training. The results show that merging synthetic data with real data degrades performance as the amount of synthetic data increases, whereas fine-tuning on real data after pre-training with synthetic data improves it. These findings indicate that pre-training with a larger number of synthetic images, followed by fine-tuning on real data, yields a more robust pre-trained model and better overall performance.

Table 6: Performance of different text prompt selections, evaluated on VOC7 with Mask2Former (Swin-B). Results are for model training using the merging mechanism.

Method                 | mIoU (%)
Simple text prompt     | 76.11
Generated caption      | 75.67
Class-prompt appending | 77.81
Class-aware prompting  | 78.52

Table 7: Effect of increasing the number of synthetic images during training on the Mask2Former (Swin-B) model. Nreal/Nsyn indicate the numbers of real and synthetic images, respectively.
VOC7                      | VOC12
Nreal/Nsyn | mIoU (%)     | Nreal/Nsyn | mIoU (%)
209/0      | 76.19        | 1464/0     | 81.8
209/418    | 78.52/80.21  | 1464/2928  | 82.88/84.02
209/836    | 79.21/81.53  | 1464/5856  | 82.01/85.33
209/1254   | 76.91/82.23  | 1464/8784  | 80.08/87.05

5 Discussion and Conclusion

5.1 Limitations

Although our method shows promising results, it has several limitations. First, the quality of the synthesized images depends on the pre-trained generative model (Stable Diffusion). Second, generating high-quality synthetic images with diffusion models can be computationally expensive and time-consuming. Finally, while our method has potential for privacy-sensitive applications, this study evaluates only general datasets, so further validation is needed to ensure its effectiveness in such scenarios.

5.2 Conclusion

In this work, we proposed a novel synthetic data augmentation pipeline that combines controllable diffusion models with advanced conditioning techniques to tackle the challenge of balancing diversity and reliability in semantic segmentation. Our method generates high-quality synthetic data that preserves the structure of labeled objects and aligns well with real-world data distributions, demonstrating significant performance improvements on benchmark datasets such as PASCAL VOC and BDD100K, particularly in data-scarce scenarios. Moreover, our approach effectively mitigates the domain shift commonly associated with synthetic data generation, enabling more robust training. Building on the success of image transformations guided by segmentation masks, we also see potential for privacy protection applications: sensitive regions identified by segmentation masks can be concealed using techniques such as inpainting, ensuring privacy while preserving the quality of synthetic datasets for training segmentation models. Further exploration of these methods could enhance their applicability in privacy-sensitive domains.
6 Acknowledgement

This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant C2023-26-10.

References

[1] Q. Che, D. Le, B. Pham, D. Lam, and V. Nguyen (2025). Enhanced generative data augmentation for semantic segmentation via stronger guidance. In Proceedings of the 14th International Conference on Pattern Recognition Applications and Methods (ICPRAM).
[2] Q. Che, D. Nguyen, M. Pham, and D. Lam (2023). TwinLiteNet: An efficient and lightweight model for drivable area and lane segmentation in self-driving cars. In 2023 International Conference on Multimedia Analysis and Pattern Recognition (MAPR).
[3] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. In Computer Vision – ECCV 2018.
[4] Y. Chen, P. Zhang, Z. Li, Y. Li, X. Zhang, L. Qi, J. Sun, and J. Jia (2021). Dynamic scale training for object detection.
[5] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar (2022). Masked-attention mask transformer for universal image segmentation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[6] Damian0815 (2023). Compel: A library for conditioning and weighting in prompt-based models.
[7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition.
[8] M. Everingham, L. Gool, C. K. Williams, J. Winn, and A. Zisserman (2010). The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision.
[9] H. Fang, B. Han, S. Zhang, S. Zhou, C. Hu, and W. Ye (2024). Data augmentation for object detection via controllable diffusion models. In 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).
[10] C. Feng, K. Yu, Y. Liu, S. A. Khan, and W. Zuo (2023). Diverse data augmentation with diffusions for effective test-time prompt tuning. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV).
[11] I. Goodfellow et al. (2020). Generative adversarial networks. Communications of the ACM.
[12] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W. Lo, P. Dollár, and R. Girshick (2023). Segment Anything.
[13] O. Kupyn and C. Rupprecht (2024). Dataset enhancement with instance-level augmentations. In European Conference on Computer Vision.
[14] J. Li, D. Li, C. Xiong, and S. Hoi (2022). BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML.
[15] C. Mou, X. Wang, L. Xie, Y. Wu, J. Zhang, Z. Qi, and Y. Shan (2024). T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Thirty-Eighth AAAI Conference on Artificial Intelligence.
[16] Q. H. Nguyen, T. T. Vu, A. T. Tran, and K. Nguyen (2023). Dataset Diffusion: Diffusion-based synthetic data generation for pixel-level semantic segmentation. In Thirty-Seventh Conference on Neural Information Processing Systems.
[17] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach (2024). SDXL: Improving latent diffusion models for high-resolution image synthesis. In International Conference on Learning Representations.
[18] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[19] M. Saran, F. Nar, and A. N. Saran (2021). Perlin random erasing for data augmentation. In 29th Signal Processing and Communications Applications Conference.
[20] B. Trabucco, K. Doherty, M. A. Gurinas, and R. Salakhutdinov (2024). Effective data augmentation with diffusion models. In The Twelfth International Conference on Learning Representations.
[21] Y. Wang, H. Wang, Y. Shen, J. Fei, W. Li, G. Jin, L. Wu, R. Zhao, and X. Le (2022). Semi-supervised semantic segmentation using unreliable pseudo-labels. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[22] W. Wu, Y. Zhao, H. Chen, Y. Gu, R. Zhao, Y. He, H. Zhou, M. Z. Shou, and C. Shen (2023). DatasetDM: Synthesizing data with perception annotations using diffusion models. In Conference on Neural Information Processing Systems.
[23] W. Wu, Y. Zhao, M. Z. Shou, H. Zhou, and C. Shen (2023). DiffuMask: Synthesizing images with pixel-level annotations for semantic segmentation using diffusion models. In Proceedings of the International Conference on Computer Vision (ICCV).
[24] R. Yoshihashi, Y. Otsuka, K. Doi, T. Tanaka, and H. Kataoka (2024). Exploring limits of diffusion-synthetic training with weakly supervised semantic segmentation. In 17th Asian Conference on Computer Vision.
[25] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, and T. Darrell (2020). BDD100K: A diverse driving dataset for heterogeneous multitask learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[26] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2018). mixup: Beyond empirical risk minimization. In International Conference on Learning Representations.
[27] L. Zhang, A. Rao, and M. Agrawala (2023). Adding conditional control to text-to-image diffusion models. In International Conference on Computer Vision (ICCV).