
Paper deep dive

SteelDefectX: A Coarse-to-Fine Vision-Language Dataset and Benchmark for Generalizable Steel Surface Defect Detection

Shuxian Zhao, Jie Gui, Baosheng Yu, Lu Dong, Zhipeng Gui

Year: 2026 · Venue: arXiv preprint · Area: cs.CV · Type: Preprint · Embeddings: 49

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 98%

Last extracted: 3/26/2026, 2:34:20 AM

Summary

SteelDefectX is a new vision-language dataset for steel surface defect detection, containing 7,778 images across 25 categories. It features a coarse-to-fine annotation structure, providing class-level semantic information (categories, visual attributes, industrial causes) and fine-grained sample-level descriptions (shape, size, depth, position, contrast). The paper establishes a benchmark for vision-only, vision-language, few/zero-shot, and zero-shot transfer tasks, demonstrating that these textual annotations improve model interpretability and generalization.

Entities (5)

GPT-4o · large-language-model · 100%
Shuxian Zhao · researcher · 100%
SteelDefectX · dataset · 100%
CLIP · model-architecture · 95%
NEU · dataset · 95%

Relation Signals (3)

SteelDefectX contains 7778 images

confidence 100% · a vision-language dataset containing 7,778 images across 25 defect categories

GPT-4o generates annotations for SteelDefectX

confidence 100% · an automated annotation pipeline powered by large language models (e.g., GPT-4o [11]) generates fine-grained textual descriptions

SteelDefectX integrates NEU

confidence 100% · The SteelDefectX dataset is constructed by integrating and reorganizing four publicly available steel surface defect datasets: NEU

Cypher Suggestions (2)

Identify researchers associated with the paper · confidence 95% · unvalidated

MATCH (r:Researcher)-[:AUTHORED]->(p:Paper {id: '4a47674b-9ba1-49ee-8d76-2713fac407eb'}) RETURN r.name

Find all datasets integrated into SteelDefectX · confidence 90% · unvalidated

MATCH (d:Dataset {name: 'SteelDefectX'})-[:INTEGRATES]->(source:Dataset) RETURN source.name

Abstract

Steel surface defect detection is essential for ensuring product quality and reliability in modern manufacturing. Current methods often rely on basic image classification models trained on label-only datasets, which limits their interpretability and generalization. To address these challenges, we introduce SteelDefectX, a vision-language dataset containing 7,778 images across 25 defect categories, annotated with coarse-to-fine textual descriptions. At the coarse-grained level, the dataset provides class-level information, including defect categories, representative visual attributes, and associated industrial causes. At the fine-grained level, it captures sample-specific attributes, such as shape, size, depth, position, and contrast, enabling models to learn richer and more detailed defect representations. We further establish a benchmark comprising four tasks (vision-only classification, vision-language classification, few/zero-shot recognition, and zero-shot transfer) to evaluate model performance and generalization. Experiments with several baseline models demonstrate that coarse-to-fine textual annotations significantly improve interpretability, generalization, and transferability. We hope that SteelDefectX will serve as a valuable resource for advancing research on explainable, generalizable steel surface defect detection. The data will be publicly available at https://github.com/Zhaosxian/SteelDefectX.

Tags

ai-safety (imported, 100%) · cscv (suggested, 92%) · preprint (suggested, 88%)

Links

Open PDF directly →

Full Text

49,108 characters extracted from source content.


arXiv:2603.21824v1 [cs.CV] 23 Mar 2026

SteelDefectX: A Coarse-to-Fine Vision-Language Dataset and Benchmark for Generalizable Steel Surface Defect Detection

Shuxian Zhao (1), Jie Gui (1,2,*), Baosheng Yu (3), Lu Dong (1), Zhipeng Gui (4)
(1) Southeast University, (2) Purple Mountain Laboratories, (3) Nanyang Technological University, (4) Wuhan University
{zhaosxian, guijie}@seu.edu.cn
* Corresponding author.

Abstract

Steel surface defect detection is essential for ensuring product quality and reliability in modern manufacturing. Current methods often rely on basic image classification models trained on label-only datasets, which limits their interpretability and generalization. To address these challenges, we introduce SteelDefectX, a vision-language dataset containing 7,778 images across 25 defect categories, annotated with coarse-to-fine textual descriptions. At the coarse-grained level, the dataset provides class-level information, including defect categories, representative visual attributes, and associated industrial causes. At the fine-grained level, it captures sample-specific attributes, such as shape, size, depth, position, and contrast, enabling models to learn richer and more detailed defect representations. We further establish a benchmark comprising four tasks (vision-only classification, vision-language classification, few/zero-shot recognition, and zero-shot transfer) to evaluate model performance and generalization. Experiments with several baseline models demonstrate that coarse-to-fine textual annotations significantly improve interpretability, generalization, and transferability. We hope that SteelDefectX will serve as a valuable resource for advancing research on explainable, generalizable steel surface defect detection. The data will be publicly available at https://github.com/Zhaosxian/SteelDefectX.

1. Introduction

Steel surface defect detection is critical for ensuring the quality and reliability of industrial steel products [26, 38]. Defects such as cracks, scratches, and oxides can weaken mechanical strength, reduce durability, and cause failures in applications ranging from construction to automotive and machinery. Undetected defects also lead to production losses, higher maintenance costs, and safety risks, underscoring the need for accurate and timely detection. Despite recent advances in visual recognition [4, 8], existing steel surface defect detection models primarily rely on basic image classification or object detection models [9, 22, 26, 38].

[Figure 1 examples]
- Sample-level: "The punching defect is a dark, oval-shaped mark on the steel surface. It has a smooth and well-defined edge. The mark contrasts sharply against the lighter surrounding area. It is located near the top-left section of the image. The size of the defect is relatively small compared to the visible steel surface."
- Class-level: "A photo of Punching, which shows circular holes caused by unintended punching due to equipment malfunction."
- Classname-only: "A photo of steel surface defect: Punching."

Figure 1. Illustration of different textual descriptions for steel surface defects. The figure shows the progression from simple classname templates to coarse class-level descriptions that capture the semantic characteristics of defect types, representative visual patterns, and potential causes, and finally to fine-grained sample-level descriptions that provide detailed visual and semantic information.
A key challenge is that current datasets of steel surface defects provide only category labels or numerical annotations, limiting the development of more explainable and generalizable approaches, such as vision-language models [16, 18, 19, 28], which excel in tasks like zero-shot classification [25] and retrieval [21]. Existing methods often attempt to convert class labels into simplified template descriptions [12, 17] for vision-language-based industrial defect detection. However, these approaches fail to capture the full complexity of steel surface defects, which exhibit high variability and uncertainty in both appearance and underlying causes. The same manufacturing operation can produce vastly different visual patterns on various materials, and defects are non-natural anomalies whose appearance varies with production processes and environmental conditions. To address these challenges, vision-language annotations provide richer, more interpretable, and more generalizable supervision, enabling models to understand defects in ways that closely align with human reasoning and industrial requirements. Effective surface defect detection, therefore, requires a semantic understanding of defect types, properties, and causes, rather than relying solely on category labels, whether numerical or simplified templates. Fig. 1 illustrates the contrast between simple classname templates and our richer textual annotations, which provide both visual and semantic information about steel surface defects.

In this paper, we introduce SteelDefectX, a new vision-language dataset with coarse-to-fine textual annotations for steel surface defect detection. The dataset is constructed by collecting steel surface defect images from four publicly available sources: NEU [32], GC10 [23], X-SDD [5], and S3D [3]. Similar defect categories across these datasets are merged, yielding a unified dataset of 7,778 images spanning 25 distinct defect categories. Beyond simple category labels, SteelDefectX provides textual annotations at both the class and sample levels. At the class level, each defect category is described using three semantic elements: the defect class name (e.g., "punching"), a representative visual attribute (e.g., "circular holes"), and a possible cause (e.g., "equipment malfunction"). A complete list of the class-level annotations for all 25 categories is provided in the Supplementary Material. At the sample level, an automated annotation pipeline powered by large language models (e.g., GPT-4o [11]) generates fine-grained textual descriptions. The process begins with open-ended prompts to produce diverse candidate descriptions with controlled randomness. A semantic filtering module removes redundant outputs to maintain diversity. Each retained description is evaluated for completeness across five key defect attributes: shape (e.g., "a dark, oval-shaped mark"), size (e.g., "relatively small compared to the visible steel surface"), depth (e.g., "a smooth and well-defined edge"), position (e.g., "located near the top-left section of the image"), and contrast (e.g., "the mark contrasts sharply against the lighter surrounding area"). Incomplete cases trigger structured regeneration to ensure comprehensive coverage. Finally, manual refinement ensures the accuracy, terminological consistency, and linguistic quality of the resulting annotations.
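As a rough illustration of how the three class-level elements compose into one description (the Punching strings below follow Fig. 1; the template function itself is a hypothetical sketch, not the authors' released code):

```python
# Hypothetical sketch of the class-level template; the exact wording the
# authors use per category may differ (see their Supplementary Material).
def class_level_description(name: str, attribute: str, cause: str) -> str:
    """Compose class name, representative visual attribute, and cause."""
    return f"A photo of {name}, which shows {attribute} caused by {cause}."

print(class_level_description(
    "Punching", "circular holes", "unintended punching due to equipment malfunction"))
# "A photo of Punching, which shows circular holes caused by
#  unintended punching due to equipment malfunction."
```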
Building upon the proposed dataset, we establish a benchmark comprising four tasks. The first task is vision-only classification using a linear-layer classification head [4, 8]. The second task is vision-language matching for classification using CLIP variants [2, 28, 34, 40, 41]. The third task is zero/few-shot recognition, which evaluates the model's ability to recognize unseen or scarcely seen defect categories by leveraging knowledge transferred from seen classes. The fourth task is zero-shot transfer [37], which assesses a model's ability to generalize to unseen datasets. Specifically, the model is trained on SteelDefectX and tested on ten aluminum surface defect categories from the MSD-Cls dataset [39] and five seamless steel tube defect categories from the CGFSDS-9 dataset [33].

The main contributions of this paper can be summarized as follows:
• We propose SteelDefectX, the first vision-language dataset with coarse-to-fine annotations for steel surface defect detection.
• We establish a benchmark comprising four tasks (vision-only classification, vision-language classification, few/zero-shot recognition, and zero-shot transfer) to evaluate the impact of the proposed dataset on steel surface defect detection.
• We conduct comprehensive experiments on the benchmark using various baseline models. The results demonstrate that high-quality textual annotations significantly enhance model interpretability, generalization, and transferability, providing a new research paradigm for intelligent industrial inspection.

2. Related Work

2.1. Steel Surface Defect Datasets

Research on steel surface defect recognition has relied mainly on several public datasets that differ in defect types, collection conditions, and scale. The NEU dataset [32] contains 1,800 balanced images of six common defects on hot-rolled steel. GC10 [23] focuses on cold-rolled steel, offering 2,312 high-resolution images spanning 10 complex and imbalanced defect categories. X-SDD [5] provides 1,360 images of seven defect types on hot-rolled steel, and S3D [3] includes 880 images covering five defect categories. While these datasets have driven significant progress in defect recognition, they remain single-modality and are limited to image-level annotations. This lack of semantic textual information restricts their utility for vision-language learning and multimodal research. To address these limitations, there is a clear need for a multimodal dataset of steel surface defects that integrates aligned image-text pairs, rich semantic descriptions, and a unified structure, enabling more comprehensive and intelligent defect analysis.

2.2. Vision-Language Models

Recent advances in multimodal pretraining have enabled a range of vision-language tasks, including image-text matching, captioning, and visual question answering. Large-scale datasets that pair natural images with textual descriptions [15, 31] have driven the development of models such as CLIP [28], BLIP [18], and LLaVA [20].
Domain-specific datasets have also emerged, e.g., CVLUE [36] for Chinese multimodal understanding, SkyScript [37] for remote sensing, Omnidrive [35] for autonomous driving, and MMAD [14] for industrial anomaly detection. However, textual content in these datasets is often limited to question-answer pairs, lacking the fine-grained, professional semantics required for industrial applications.

Integrating vision-language models into industrial anomaly detection has shown promise. CAM-CLIP [13] employs context-aware masking for pixel-level interpretation, AnomalyGPT [7] generates descriptive guidance for end-to-end detection, WinCLIP [12] utilizes window-based feature aggregation with compositional prompt ensembles, and MultiADS [30] enables multi-type zero-shot detection and segmentation via defect-aware supervision and visual-textual alignment. These approaches demonstrate that language-guided mechanisms improve semantic understanding, discriminative performance, and few-shot learning. Nevertheless, their progress is constrained by the scarcity of large-scale, professional industrial image-text datasets, emphasizing the need for multimodal resources with fine-grained structural distinctions.

[Figure 2 content]
- Prompt A: "Describe the steel surface defect using short, clear sentences. Focus on visual features such as appearance, shape, size, depth, position, and contrast. Avoid speculation or vague language."
- Prompt B: "What does the defect on the steel surface look like in the image?" "What is the shape of the defect?" "What is its approximate size relative to the image or surface?" "Does the defect appear shallow or deep?" "Where is it located within the image?" "How does the defect contrast with the background surface?"
- Class-level annotation example. Defect class name: Bright Scratch. Representative visual attribute: bright, elongated scratches that are highly reflective; the shape can be deep or shallow, varying in length; randomly distributed on the surface of the steel strip. Possible industrial causes: bright scratches are mostly caused by foreign objects or projections rubbing against the strip surface. Class-level description: "A photo of Bright scratch, which shows a bright, reflective scratch running along the steel surface caused by friction with foreign objects."
- Sample-level description example: "The bright scratch defect appears as a vertical, elongated streak. It is narrow and extends across the surface. The streak is lighter in contrast compared to the surrounding area. It starts from the top edge and continues downward. The width is consistent along its length. It has a slightly irregular, fuzzy edge. There is no visible depth or indentation."

Figure 2. Illustration of coarse-to-fine textual annotations in SteelDefectX. (a) Class-level: Each defect category is described by three semantic components: defect class name, representative visual attributes, and possible industrial causes, providing global contextual semantics. (b) Sample-level: Step 1: Candidate Generation using open-ended prompt P_a to generate diverse descriptions via GPT-4o. Step 2: Candidate Refinement applying diversity-based filtering and dimension-aware scoring across five semantic aspects (shape, size, depth, position, contrast). Step 3: Candidate Supplement using structured prompt P_b when dimensional coverage is insufficient. Step 4: Manual Correction for quality assurance.

3. Dataset
In this section, we describe the dataset construction, including data collection and coarse- and fine-grained annotations, along with key dataset statistics.

3.1. Dataset Construction

Data Collection. The SteelDefectX dataset is constructed by integrating and reorganizing four publicly available steel surface defect datasets: NEU [32], GC10 [23], X-SDD [5], and S3D [3]. To enhance coverage and representation, we also incorporate processed samples from FSC-20 [43] and ESDIs-SOD [3], as these datasets largely originate from the same sources. The integration process involves three key steps. First, all images are standardized to a resolution of 256 × 256, and redundant or low-quality samples are removed to ensure data uniformity. Second, defect annotations are cross-verified and consolidated across sources to eliminate inconsistencies. Lastly, visually and semantically similar subclasses are merged into a unified taxonomy of 25 categories, yielding a compact yet coherent label space that maintains consistency across datasets while preserving representational diversity. Details are provided in the supplementary materials.

Class-Level Annotation. Motivated by the question, "How can we effectively describe industrial defect images with natural language for multimodal understanding?", we design coarse-level descriptions for each of the 25 defect classes to provide global contextual semantics. Initial templates are manually crafted based on domain knowledge from steel manufacturing and subsequently refined using descriptions generated by CuPL [27]. Each class-level description consists of three components: (1) the defect class name; (2) representative visual attributes, including shape, texture, and color; and (3) possible industrial causes, such as rolling defects or material inclusions. These elements are combined into smooth, natural-language descriptions that capture shared semantic properties across samples within each category and provide consistent conceptual grounding for vision-language alignment. Serving as high-level supervision, these class-level semantics complement sample-specific descriptions and support hierarchical multimodal understanding.

Sample-Level Annotation. While class-level annotations provide consistent category semantics, individual defect samples exhibit significant visual variation, requiring fine-grained descriptions. To address this, we design an automated and structured pipeline that generates high-quality and dimension-aware descriptions. The pipeline operates within a predefined semantic space consisting of five defect-related dimensions: shape, size, depth, position, and contrast. These dimensions serve as explicit constraints to guide generation, filtering, and validation. As shown in Fig. 2 (b), the pipeline consists of four steps: candidate generation, candidate refinement, candidate supplement, and manual correction, leveraging large models, semantic embeddings, and a scoring-based selection mechanism.

Step 1: Candidate Generation. For each image I, an open-ended instruction prompt P_a guides GPT-4o [11] to produce multiple natural-language descriptions of visible defect characteristics. We adopt a relatively high sampling temperature to encourage linguistic diversity and reduce template bias. Multiple sampling rounds generate n candidate descriptions d_1, d_2, ..., d_n (n = 4, temperature = 0.9, top_p = 0.9, max_tokens = 80). This stage prioritizes semantic diversity so that different visual aspects of the defect may be expressed across candidate descriptions.
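A minimal sketch of Step 1, assuming the OpenAI Python SDK as the GPT-4o interface; Prompt A is quoted from Fig. 2, while the API plumbing and file handling are our assumptions rather than the authors' pipeline code:

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Prompt A, as given in Fig. 2.
PROMPT_A = ("Describe the steel surface defect using short, clear sentences. "
            "Focus on visual features such as appearance, shape, size, depth, "
            "position, and contrast. Avoid speculation or vague language.")

def generate_candidates(image_path: str, n: int = 4) -> list[str]:
    """Sample n diverse candidate descriptions d_1..d_n for one defect image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        n=n,               # n = 4 candidates per image
        temperature=0.9,   # high temperature to encourage linguistic diversity
        top_p=0.9,
        max_tokens=80,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT_A},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return [choice.message.content for choice in resp.choices]
```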
Step 2: Candidate Refinement. To eliminate redundancy while preserving semantic variety, we apply greedy selection based on Sentence-BERT [29] embeddings φ(d_i). Starting with the first candidate, each subsequent description is retained only if its maximum cosine similarity to selected descriptions is below 0.9. This process preserves up to three diverse candidates and prevents description collapse. To ensure that descriptions capture meaningful defect characteristics, we further evaluate semantic coverage across the predefined five-dimensional defect space. Each retained description d_i is encoded as a 5-bit binary vector b_i = [b_1, b_2, b_3, b_4, b_5] ∈ {0, 1}^5, where b_k = 1 if dimension k is mentioned in d_i, and 0 otherwise. Each dimension is represented by a predefined keyword set: shape (round, irregular, linear), size (small, large, span), depth (shallow, deep, raised), position (top, center, corner), and contrast (noticeable, faint, sharp). This encoding allows the annotation process to quantify semantic completeness and enforce structural consistency across samples. Each candidate is scored to balance dimensional coverage and semantic uniqueness. Semantic dissimilarity D(d_i) is computed relative to other candidates:

D(d_i) = 1 - \frac{1}{n-1} \sum_{j \neq i} \cos\big(\phi(d_i), \phi(d_j)\big),   (1)

and the final score is:

S(d_i) = \begin{cases} \lambda_1 \cdot \frac{\|b_i\|_1}{5} + \lambda_2 \cdot D(d_i), & n > 1, \\ \lambda_1 \cdot \frac{\|b_i\|_1}{5}, & n = 1, \end{cases}   (2)

with λ_1 = 0.6 and λ_2 = 0.4, ensuring the selected description is both comprehensive and non-redundant. The highest-scoring description with coherent sentence structure is selected as the final annotation.

Step 3: Candidate Supplement. If no candidate covers at least four of the five dimensions, a structured multi-question prompt P_b = {q_1, ..., q_6} is used to explicitly target each visual aspect. Unlike P_a, which emphasizes diversity and naturalness, P_b ensures completeness. The responses are concatenated to form a comprehensive description that balances diversity with dimensional coverage.

Step 4: Manual Correction. All sample-level descriptions undergo manual review and calibration. Two annotators conducted approximately 275 hours of manual cross-validation to standardize industry terminology, eliminate ambiguous expressions, and ensure consistency across categories. By combining class-level and sample-level annotations, the dataset provides a scalable, interpretable resource for high-quality supervision in vision-language learning for industrial defect inspection.
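The refinement logic above translates directly into code. A minimal sketch, assuming sentence-transformers provides φ(·); the SBERT checkpoint is a stand-in, since the paper does not name one, and we read "other candidates" in Eq. (1) as the other retained candidates:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Predefined keyword sets for the five dimensions (from the paper).
KEYWORDS = {
    "shape": ("round", "irregular", "linear"),
    "size": ("small", "large", "span"),
    "depth": ("shallow", "deep", "raised"),
    "position": ("top", "center", "corner"),
    "contrast": ("noticeable", "faint", "sharp"),
}

sbert = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in SBERT checkpoint

def _cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def coverage(text: str) -> int:
    """Number of dimensions mentioned, i.e. ||b_i||_1 in Eq. (2)."""
    t = text.lower()
    return sum(any(k in t for k in ks) for ks in KEYWORDS.values())

def refine(cands: list[str], lam1: float = 0.6, lam2: float = 0.4) -> str:
    emb = sbert.encode(cands)
    # Greedy de-duplication: keep a candidate only if its max cosine
    # similarity to already-kept ones is below 0.9; keep at most three.
    kept = [0]
    for i in range(1, len(cands)):
        if len(kept) == 3:
            break
        if max(_cos(emb[i], emb[j]) for j in kept) < 0.9:
            kept.append(i)
    # Score each kept candidate per Eq. (1)-(2) and return the best.
    scores = []
    for i in kept:
        sims = [_cos(emb[i], emb[j]) for j in kept if j != i]
        d = 1.0 - float(np.mean(sims)) if sims else 0.0  # D(d_i), Eq. (1)
        scores.append(lam1 * coverage(cands[i]) / 5 + (lam2 * d if sims else 0.0))
    return cands[kept[int(np.argmax(scores))]]
```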
3.2. Dataset Statistics

Dataset Overview. SteelDefectX comprises 7,778 images spanning 25 steel surface defect categories. Tab. 1 compares SteelDefectX with existing industrial surface defect datasets, highlighting its broader category coverage, richer annotation design, and more comprehensive sample-level descriptions. These characteristics make it a valuable benchmark for developing explainable, generalizable AI models for industrial inspection.

Table 1. Comparison of different industrial surface defect datasets.

| Dataset | Classes | Images | Modality | Types | Balance |
| Surface Defect-4i [1] | 12 | 561 | Vision | Steel, Rail, Aluminum, MT, Leather, Tile | × |
| MSD-Cls [39] | 20 | 872 | Vision | Steel, Aluminum | ✓/× |
| FSC-20 [43] | 20 | 1000 | Vision | Steel | ✓ |
| ESDIs-SOD [3] | 14 | 4800 | Vision | Steel | × |
| SteelDefectX | 25 | 7778 | Vision + Language | Steel | × |

Class Distribution and Visual Diversity. Each category contains between 50 and 795 samples, illustrating the natural class imbalance often observed in manufacturing environments. Fig. 3 visualizes the class distribution, showing a long-tailed pattern that aligns with the practical occurrence frequencies of steel defects. This imbalance provides an effective test bed for evaluating few-shot and long-tail learning approaches.

Figure 3 (bar chart of per-class sample counts; average = 311.1; categories ordered by frequency: Inclusion, Water spot, Silk spot, Welding line, Oil spot, Bright scratch, Punching, Red iron sheet, Dark scratches, Rolled-in scale, Pitted surface, Patches, Crazing, Crescent gap, Slag inclusion, Oxide scale of temperature system, Finishing roll printing, Iron scale compression, Waist folding, Secondary rust skin, White rust, Iron sheet ash, Crease, Oxide scale of plate system, Rolled pit). Class distribution of SteelDefectX. The dataset exhibits an imbalanced distribution across 25 defect categories, with sample counts following a log-normal trend. The average number of samples in the dataset is 311. Common defects such as inclusion and water spot dominate the dataset, whereas rare defects (e.g., crease and rolled pit) are underrepresented, reflecting real-world variability in steel surface inspection scenarios.

To assess visual diversity, Fig. 4 presents t-SNE embeddings of pixel-level features, revealing substantial intra-class variation and partial inter-class overlap. These characteristics underscore the complexity of defect appearance, which varies across steel types, production processes, and environmental conditions. Such visual heterogeneity poses challenges for reliable defect recognition and emphasizes the necessity for robust representation learning.

Figure 4. t-SNE visualization of pixel-level features, illustrating intra-class variation and inter-class overlap among the 25 defect categories.

Annotation Structure and Text Statistics. Each image is annotated with class-level descriptions specifying defect type, visual attributes, and possible causes, while sample-level annotations provide fine-grained details on shape, size, depth, position, and contrast. These coarse-to-fine annotations provide rich semantic supervision, facilitating interpretable and transferable development of vision-language models. The statistical characteristics of the textual annotations further demonstrate the consistency and informativeness of SteelDefectX. As shown in Fig. 5a, the text length distribution follows an approximately regular pattern with a mean of 54.83 words and moderate variance, indicating a balance between descriptiveness and conciseness. The controlled skewness and kurtosis suggest uniform annotation practices, ensuring most descriptions convey comparable detail. Such regularity supports stable visual-text alignment during model training, reducing potential linguistic bias.
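For reference, the reported length statistics can be reproduced with a few lines, assuming the descriptions are available as a list of strings (the file name and JSON layout below are illustrative stand-ins, not the dataset's actual release format):

```python
import json
import numpy as np
from scipy import stats

# Hypothetical export of the sample-level descriptions.
with open("steeldefectx_annotations.json") as f:
    descriptions = [item["sample_level"] for item in json.load(f)]

lengths = np.array([len(d.split()) for d in descriptions])
print(f"Mean: {lengths.mean():.2f}  Variance: {lengths.var():.2f}")
print(f"Skewness: {stats.skew(lengths):.2f}  Kurtosis: {stats.kurtosis(lengths):.2f}")
```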
Vocabulary Diversity Analysis. The vocabulary diversity analysis (Fig. 5b) evaluates the lexical variation across textual descriptions. Most samples exhibit moderate diversity with consistent distribution, suggesting stable linguistic structure alongside adequate variation in word choice to describe fine-grained defect characteristics. By estimating diversity using unique non-stop words from a TF-IDF representation, the analysis focuses on semantically meaningful content rather than standard terms. This balanced linguistic variety facilitates better generalization across textual expressions while minimizing noise. Consequently, the textual corpus in SteelDefectX complements the visual data effectively, strengthening multimodal learning and improving the interpretability of vision-language models.

Figure 5. (a) Text length distribution of fine-grained descriptions in SteelDefectX (mean 54.83 words, variance 128.11, skewness 0.50, kurtosis 1.64). The distribution centers around 55 words with moderate variance, indicating concise yet sufficiently detailed annotations. (b) Vocabulary diversity across samples, measured by counting unique non-stop words using a TF-IDF representation, reflecting the lexical richness and variation within the dataset.

3.3. Applications and Limitations

SteelDefectX enables multiple downstream applications beyond conventional classification. The fine-grained textual annotations support multimodal pretraining, enabling zero-shot and few-shot learning in data-scarce scenarios. Since surface defects across various metals share similar visual patterns, the dataset facilitates cross-material transfer for defect detection. The descriptive annotations enhance interpretability by bridging machine predictions and human understanding, which is crucial for transparent decision-making in industrial settings. Additionally, SteelDefectX serves as a benchmark for evaluating vision-language models under low-resource and imbalanced conditions.

Limitations. First, the dataset size is limited by the inherent difficulty of collecting industrial defect data. However, it remains the largest steel defect dataset to date with rich semantic information. Second, although the defect descriptions are generated through a structured process and manually refined, text annotations cannot fully capture the subtle visual differences present in certain defect categories. Finally, the current version focuses on image-level classification and visual-language alignment. Future work will incorporate structured attribute annotations and pixel-level segmentation annotations to support more tasks.

4. Benchmark and Experiments

Building on SteelDefectX, we establish a four-task benchmark to evaluate how vision-language supervision enhances steel surface defect detection in terms of accuracy, generalization, and transferability. The following subsections outline each task and summarize baseline performance. To ensure clarity, we define four levels of textual annotations used throughout the experiments: "T0" refers to the commonly used classname-only annotation ("A photo of steel surface defect: [classname]."). "T1" denotes class-level annotation. "T2" corresponds to fine-grained annotation generated by GPT-4o, while "T3" represents the final fine-grained annotation that was manually verified and refined at the sample level.

4.1. Vision-Only Classification (Task 1)

The SteelDefectX dataset is divided into training (70%) and testing (30%) sets. We then adopt a conventional image classification framework, where a visual backbone network (e.g., ResNet [8] or ViT [4]) extracts image features, followed by a linear classification head that maps these features to discrete category labels. We evaluate classical CNNs (ResNet-50, ResNet-101 [8]), lightweight CNNs (MobileNetV3 [10], ShuffleNetV2 [24]), and Vision Transformers (ViT-B/16, ViT-B/32 [4]). All models are trained for 100 epochs with a batch size of 32 using SGD (momentum 0.9, weight decay 1e-4) and an initial learning rate of 0.1, decayed by a factor of 10 every 30 epochs. Performance is reported using overall accuracy (Acc) and mean class accuracy (mAcc).

Table 2. Results of vision-only model classification.

| Model | Input Size | Acc | mAcc |
| ShuffleNetV2 [24] | 224 × 224 | 96.34 | 94.98 |
| MobileNetV3 [10] | 224 × 224 | 93.24 | 89.66 |
| ResNet-50 [8] | 224 × 224 | 92.69 | 89.23 |
| ResNet-101 [8] | 224 × 224 | 93.63 | 91.19 |
| ViT-B/16 [4] | 224 × 224 | 44.84 | 40.31 |
| ViT-B/32 [4] | 224 × 224 | 43.46 | 37.43 |
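A minimal PyTorch sketch of this Task 1 recipe with a ResNet-50 backbone; the synthetic tensors stand in for the real 70% training split, and whether backbones are ImageNet-pretrained is not stated in the paper:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

NUM_CLASSES = 25
model = models.resnet50(weights=None)  # pretraining choice not stated in the paper
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # linear classification head

# Hyperparameters from Sec. 4.1: SGD, momentum 0.9, weight decay 1e-4,
# lr 0.1 decayed by 10x every 30 epochs, 100 epochs, batch size 32.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
criterion = nn.CrossEntropyLoss()

# Synthetic stand-in for the real 224x224 training images and labels.
train_loader = DataLoader(
    TensorDataset(torch.randn(64, 3, 224, 224),
                  torch.randint(0, NUM_CLASSES, (64,))),
    batch_size=32, shuffle=True)

for epoch in range(100):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```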
As shown in Tab. 2, CNNs achieve strong performance, while ViTs perform poorly, likely due to the small size of SteelDefectX compared to large natural image datasets. CNNs benefit from inductive biases such as local connectivity and translation equivariance, which support effective feature extraction and stable training, whereas ViTs often underfit without large-scale pretraining. The gap between Acc and mAcc highlights the effect of class imbalance. Overall, SteelDefectX serves as a representative benchmark for steel surface defect classification, with moderate challenges posed by long-tailed distributions.

4.2. Vision-Language Classification (Task 2)

This task follows a vision-language matching or CLIP [28] paradigm, which employs a vision encoder to extract visual features and a text encoder to encode textual representations of all possible label templates. The model then computes the similarity between visual and textual features, and selects the label whose textual feature has the highest similarity to the visual feature as the prediction.

We evaluate five representative CLIP variants: CLIP [28], OpenCLIP [2], EVA-CLIP [34], Long-CLIP [41], and FG-CLIP [40]. In each experiment, the training text is "T3" and the evaluation text is "T0". All models are trained following CLIP-Adapter [6], which adapts a pretrained CLIP model for downstream image classification by introducing lightweight residual adapters into the visual encoders. We train for 20 epochs using the Adam optimizer with a learning rate of 1e-4 and bidirectional cross-entropy loss. The batch size is set to 16 for training and 32 for validation.

Table 3. Results of vision-language model classification.

| Model | Image Encoder | Input Size | Acc | mAcc |
| CLIP [28] | ViT-B/16 | 224 × 224 | 81.84 | 81.14 |
| EVA-CLIP [34] | ViT-B/16 | 224 × 224 | 79.60 | 77.61 |
| OpenCLIP [2] | ViT-B/16 | 224 × 224 | 87.87 | 85.04 |
| Long-CLIP [41] | ViT-B/16 | 224 × 224 | 88.25 | 85.41 |
| FG-CLIP [40] | ViT-B/16 | 224 × 224 | 85.03 | 82.45 |
| CLIP [28] | ViT-L/14 | 224 × 224 | 80.46 | 80.22 |
| EVA-CLIP [34] | ViT-L/14 | 224 × 224 | 84.55 | 84.41 |
| OpenCLIP [2] | ViT-L/14 | 224 × 224 | 88.21 | 87.54 |
| Long-CLIP [41] | ViT-L/14 | 224 × 224 | 93.63 | 92.56 |
| FG-CLIP [40] | ViT-L/14 | 336 × 336 | 91.87 | 90.56 |

Tab. 3 shows the classification results of the vision-language models. Among all variants, Long-CLIP [41] achieves the best performance, with an accuracy of 93.63% and an mAcc of 92.56%, which is close to vision-only models. Nevertheless, vision-language models specifically tailored for industrial defect detection remain scarce. Notably, the ViT-L/14-based vision-language models show a smaller gap between Acc and mAcc than vision-only models, indicating greater robustness to class imbalance and improved capture of minority-class characteristics.
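The CLIP-Adapter recipe reduces to a small residual module on top of frozen CLIP features. A sketch under stated assumptions: the adapter width, reduction factor, and blend ratio alpha below are illustrative choices, not values reported by the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Lightweight residual feature adapter in the CLIP-Adapter style."""
    def __init__(self, dim: int = 512, reduction: int = 4, alpha: float = 0.2):
        super().__init__()
        self.alpha = alpha  # blend ratio (illustrative value)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.ReLU(inplace=True))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Residual blend keeps the frozen, pretrained CLIP feature dominant.
        return self.alpha * self.mlp(feat) + (1 - self.alpha) * feat

# One training step on a batch of (image, T3 text) feature pairs; random
# tensors stand in for frozen CLIP encoder outputs. At test time, the
# prediction is the T0 classname prompt with the highest similarity.
adapter = Adapter()
img = F.normalize(adapter(torch.randn(16, 512)), dim=-1)  # adapted image features
txt = F.normalize(torch.randn(16, 512), dim=-1)           # paired T3 text features
logits = 100.0 * img @ txt.t()                            # scaled cosine similarities
targets = torch.arange(16)
loss = (F.cross_entropy(logits, targets) +                # image-to-text
        F.cross_entropy(logits.t(), targets)) / 2         # text-to-image
```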
4.3. Zero-/Few-Shot Recognition (Task 3)

Few-Shot Recognition. To evaluate the dataset under few-shot settings, we use the Long-CLIP-Adapter [41] and Tip-Adapter-F [42] models for comparative experiments. For each experiment, 1, 2, 4, or 8 samples per class are randomly selected from the training set for few-shot training, and evaluation is conducted on the test set. The remaining settings are kept consistent with those in Sec. 4.2. The Long-CLIP-Adapter is trained under both the "T0" and "T3" configurations, while Tip-Adapter-F is trained under the "T0" configuration. Both models are evaluated using the "T0" configuration.

Figure 6 (accuracy vs. number of training examples per class: Long-CLIP-Adapter with ViT-B/16 and ViT-L/14 under T0 and T3; Tip-Adapter-F with RN50, RN101, ViT-B/16, and ViT-B/32). Results of few-shot recognition.

As shown in Fig. 6, the evaluation performance of all methods improves as the number of shots increases. Among them, Long-CLIP-Adapter (ViT-L/14, T0) achieves the best results, whereas its performance under the "T3" setting is the lowest. These results suggest that while fine-grained textual supervision offers richer semantic cues, existing models still struggle to fully exploit its potential under few-shot conditions. Therefore, beyond serving as a benchmark for few-shot recognition, the proposed fine-grained vision-language dataset provides a valuable resource for developing more adaptive few-shot vision-language methods.

Zero-Shot Recognition. All baseline datasets are evaluated using contrastive learning on the "T0" descriptions. For the zero-shot setting, the original pre-trained models are directly evaluated on the test set without fine-tuning. As shown in Tab. 4, different vision-language models show varying performance, yet the overall accuracy remains relatively low. This demonstrates the challenges of directly applying large-scale pre-trained vision-language models to industrial defect data, as the texture and structural patterns in this domain differ significantly from those in open-domain image-text pairs. Among all datasets, SteelDefectX poses the most significant challenge, as it includes more categories, some of which are difficult for humans to distinguish. Nevertheless, using the description "T1" improves recognition performance over "T0", demonstrating the effectiveness of description optimization in enhancing zero-shot generalization. Therefore, SteelDefectX serves as a valuable benchmark for studying generalization and semantic understanding in industrial defect classification.

Table 4. Results of zero-shot recognition.

| Dataset | CLIP (ResNet-50) | CLIP (ViT-B/32) | Long-CLIP (ViT-L/14) | FG-CLIP (ViT-L/14) |
| NEU [32] | 32.59 | 22.41 | 27.59 | 30.00 |
| GC10 [23] | 16.28 | 15.42 | 15.71 | 25.36 |
| X-SDD [5] | 15.10 | 6.68 | 9.65 | 13.61 |
| S3D [3] | 13.74 | 16.41 | 20.61 | 25.19 |
| SteelDefectX-T0 | 5.25 | 8.26 | 7.57 | 8.30 |
| SteelDefectX-T1 | 9.38 | 12.56 | 11.27 | 14.80 |
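A hedged sketch of this zero-shot evaluation protocol, using Hugging Face's CLIP as a stand-in for the evaluated variants; the checkpoint, image path, and three-class subset below are illustrative, while the prompt string is the T0 template defined in Sec. 4:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

ckpt = "openai/clip-vit-base-patch16"  # stand-in checkpoint
model = CLIPModel.from_pretrained(ckpt)
processor = CLIPProcessor.from_pretrained(ckpt)

classnames = ["punching", "crazing", "water spot"]  # illustrative subset of the 25
prompts = [f"A photo of steel surface defect: {c}." for c in classnames]  # T0

image = Image.open("defect.jpg")  # illustrative path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # one similarity score per prompt
print(classnames[int(logits.argmax(dim=-1))])  # highest-similarity class wins
```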
4.4. Zero-Shot Transfer (Task 4)

Unlike traditional zero-shot classification, which focuses on unseen categories, zero-shot transfer evaluates a model's ability to generalize to entirely unseen datasets [37]. For the experiment, the original Long-CLIP [41] model is directly evaluated on two external datasets as a zero-shot baseline: ten aluminum surface defect categories from MSD-Cls [39] and five seamless steel tube defect categories from CGFSDS-9 [33]. The same datasets are then used to evaluate SteelDefectX-trained models for transfer performance. To investigate the impact of textual annotation, we compare four levels of textual annotations ("T0"-"T3"). To ensure consistent text semantics across different materials, a general textual template ("A photo of surface defect: [classname]") is adopted during evaluation. Training settings follow those described in Sec. 4.2.

Table 5. Results of zero-shot transfer using Long-CLIP under different textual annotation levels.

| Training annotation | Aluminum (ViT-B/16) | Aluminum (ViT-L/14) | Seamless Steel Tubes (ViT-B/16) | Seamless Steel Tubes (ViT-L/14) |
| Zero-shot | 10.22 | 8.60 | 31.05 | 25.11 |
| T0 | 15.59 | 12.90 | 31.51 | 28.31 |
| T1 | 18.01 | 20.43 | 33.33 | 33.79 |
| T2 | 19.35 | 25.27 | 37.44 | 34.25 |
| T3 | 22.58 | 29.03 | 43.38 | 40.18 |

As shown in Tab. 5, the model trained with "T3" achieves the best transfer performance across all datasets and backbone networks. From "T0" to "T2", performance consistently improves as textual annotations become more informative, indicating that incorporating class-level semantics and fine-grained sample descriptions effectively enhances transferability. The additional gains from "T2" to "T3" further demonstrate that high-quality, accurate sample-level annotations play a crucial role in improving cross-dataset generalization capabilities across different materials and defect types. Therefore, our dataset provides rich textual representations that facilitate cross-domain knowledge transfer and support subsequent research on multimodal generalization in industrial defect detection.

4.5. Discussion

Coarse-to-fine annotations provide different inductive biases. Fine-grained descriptions introduce intra-class variance and thus hurt performance in low-shot regimes, while benefiting interpretability and cross-domain analysis.

Visualization. As shown in Fig. 7, we compare similarity heatmaps generated from different textual descriptions for selected steel surface defect images. Dense image features [44] are extracted using the FG-CLIP [40] ViT-L/14 backbone, and cosine similarity is computed between each image patch and the normalized textual features. In the heatmaps, warmer colors (e.g., yellow) indicate higher similarity, whereas cooler colors (e.g., blue) indicate lower similarity. The results across multiple defect types provide an intuitive view of spatial alignment between textual prompts and defect features. Compared to the classname-only description (T0), the fine-grained textual description (T3) enables the model to capture fine-grained visual cues more effectively. This demonstrates that the designed textual prompts enhance alignment between text and images and improve interpretability for defect localization.

Figure 7 (per-image heatmaps under T0, T1, T2, and T3 prompts). Comparison of heatmap visualizations under different textual descriptions.
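A minimal sketch of the patch-text heatmap computation, substituting Hugging Face's CLIP ViT-L/14 for the FG-CLIP backbone actually used, so the projection path below is an approximation, not the paper's exact feature extractor:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

ckpt = "openai/clip-vit-large-patch14"  # stand-in for FG-CLIP ViT-L/14
model = CLIPModel.from_pretrained(ckpt)
processor = CLIPProcessor.from_pretrained(ckpt)

image = Image.open("defect.jpg")  # illustrative path
text = "The bright scratch defect appears as a vertical, elongated streak."
inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
    patches = vision_out.last_hidden_state[:, 1:]         # drop the CLS token
    patches = model.vision_model.post_layernorm(patches)  # normalize patch tokens
    patches = model.visual_projection(patches)            # map into the joint space
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])

# Cosine similarity between every patch token and the text feature.
sim = (F.normalize(patches, dim=-1) @ F.normalize(txt, dim=-1).t()).squeeze()
side = int(sim.numel() ** 0.5)        # 224/14 gives a 16x16 patch grid
heatmap = sim.reshape(side, side)     # warmer = stronger text-patch alignment
```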
5. Conclusion

We introduce SteelDefectX, the first vision-language dataset for steel surface defect detection with coarse-to-fine textual annotations. We establish a comprehensive benchmark spanning vision-only classification, vision-language classification, few/zero-shot recognition, and zero-shot transfer, enabling systematic evaluation of vision-language models in industrial inspection scenarios. Extensive experiments demonstrate that high-quality textual annotations significantly improve interpretability, generalization, and cross-dataset transferability, highlighting the importance of rich textual semantics for intelligent defect detection. We expect SteelDefectX to serve as a valuable resource for advancing research in multimodal industrial vision and enabling more generalizable and explainable inspection systems, while also supporting broader multimodal applications such as open-vocabulary detection, retrieval, and text-to-image synthesis.

References

[1] Yanqi Bao, Kechen Song, Jie Liu, Yanyan Wang, Yunhui Yan, Han Yu, and Xingjie Li. Triplet-graph reasoning network for few-shot metal generic surface defect segmentation. IEEE Transactions on Instrumentation and Measurement, 70:1-11, 2021.
[2] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818-2829, 2023.
[3] Wenqi Cui, Kechen Song, Xiujian Jia, Hongshu Chen, Yu Zhang, Yunhui Yan, and Wenying Jiang. An efficient targeted design for real-time defect detection of surface defects. Optics and Lasers in Engineering, 178:108174, 2024.
[4] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[5] Xinglong Feng, Xianwen Gao, and Ling Luo. X-SDD: A new benchmark for hot rolled steel strip surface defects detection. Symmetry, 13(4):706, 2021.
[6] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. CLIP-Adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, 132(2):581-595, 2024.
[7] Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. AnomalyGPT: Detecting industrial anomalies using large vision-language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1932-1940, 2024.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[9] Yu He, Kechen Song, Qinggang Meng, and Yunhui Yan. An end-to-end steel surface defect detection approach via fusing multiple hierarchical features. IEEE Transactions on Instrumentation and Measurement, 69(4):1493-1504, 2019.
[10] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1314-1324, 2019.
[11] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[12] Jongheon Jeong, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, and Onkar Dabeer. WinCLIP: Zero-/few-shot anomaly classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19606-19616, 2023.
[13] Xiaofeng Ji, Faming Gong, Nuanlai Wang, Yanpu Zhao, Yuhui Ma, and Zhuang Shi. Pixel-level semantic parsing in complex industrial scenarios using large vision-language models. Information Fusion, 116:102794, 2025.
[14] Xi Jiang, Jian Li, Hanqiu Deng, Yong Liu, Bin-Bin Gao, Yifeng Zhou, Jialin Li, Chengjie Wang, and Feng Zheng. MMAD: A comprehensive benchmark for multimodal large language models in industrial anomaly detection. In The Thirteenth International Conference on Learning Representations, 2025.
[15] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32-73, 2017.
[16] Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? Advances in Neural Information Processing Systems, 37:87874-87907, 2024.
[17] Kaiyan Lei, Zhiquan Qi, and Jin Song. Improving surface defect detection for trains based on visual-language knowledge guidance on tiny datasets. IEEE Transactions on Intelligent Transportation Systems, 2025.
[18] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888-12900, 2022.
[19] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892-34916, 2023.
[20] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296-26306, 2024.
[21] Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image retrieval on real-life images with pre-trained vision-and-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2125-2134, 2021.
[22] Qiwu Luo, Yichuang Sun, Pengcheng Li, Oluyomi Simpson, Lu Tian, and Yigang He. Generalized completed local binary patterns for time-efficient steel surface defect classification. IEEE Transactions on Instrumentation and Measurement, 68(3):667-679, 2018.
[23] Xiaoming Lv, Fajie Duan, Jia-jia Jiang, Xiao Fu, and Lin Gan. Deep metallic surface defect detection: The new benchmark and detection network. Sensors, 20(6):1562, 2020.
[24] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision, pages 116-131, 2018.
[25] Wenxin Ma, Xu Zhang, Qingsong Yao, Fenghe Tang, Chenxu Wu, Yingtai Li, Rui Yan, Zihang Jiang, and S Kevin Zhou. AA-CLIP: Enhancing zero-shot anomaly detection via anomaly-aware CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4744-4754, 2025.
[26] Yuxin Ma, Jiaxing Yin, Feng Huang, and Qipeng Li. Surface defect inspection of industrial products with object detection deep networks: A systematic review. Artificial Intelligence Review, 57(12):333, 2024.
[27] Sarah Pratt, Ian Covert, Rosanne Liu, and Ali Farhadi. What does a platypus look like? Generating customized prompts for zero-shot image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15691-15701, 2023.
[28] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748-8763, 2021.
[29] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019.
[30] Ylli Sadikaj, Hongkuan Zhou, Lavdim Halilaj, Stefan Schmid, Steffen Staab, and Claudia Plant. MultiADS: Defect-aware supervision for multi-type anomaly detection and segmentation in zero-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22978-22988, 2025.
[31] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278-25294, 2022.
[32] Kechen Song and Yunhui Yan. A noise robust method based on completed local binary patterns for hot-rolled steel strip surface defects. Applied Surface Science, 285:858-864, 2013.
[33] Kechen Song, Hu Feng, Tonglei Cao, Wenqi Cui, and Yunhui Yan. MFANet: Multifeature aggregation network for cross-granularity few-shot seamless steel tubes surface defect segmentation. IEEE Transactions on Industrial Informatics, 20(7):9725-9735, 2024.
[34] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023.
[35] Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. OmniDrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22442-22452, 2025.
[36] Yuxuan Wang, Yijun Liu, Fei Yu, Chen Huang, Kexin Li, Zhiguo Wan, Wanxiang Che, and Hongyang Chen. CVLUE: A new benchmark dataset for Chinese vision-language understanding evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8196-8204, 2025.
[37] Zhecheng Wang, Rajanie Prabha, Tianyuan Huang, Jiajun Wu, and Ram Rajagopal. SkyScript: A large and semantically diverse vision-language dataset for remote sensing. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5805-5813, 2024.
[38] Xin Wen, Jvran Shan, Yu He, and Kechen Song. Steel surface defect recognition: A survey. Coatings, 13(1):17, 2022.
[39] Weiwei Xiao, Kechen Song, Jie Liu, and Yunhui Yan. Graph embedding and optimal transport for few-shot classification of metal surface defect. IEEE Transactions on Instrumentation and Measurement, 71:1-10, 2022.
[40] Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, and Yuhui Yin. FG-CLIP: Fine-grained visual and textual alignment. In Forty-second International Conference on Machine Learning, 2025.
[41] Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, and Jiaqi Wang. Long-CLIP: Unlocking the long-text capability of CLIP. In Proceedings of the European Conference on Computer Vision, pages 310-325, 2024.
[42] Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-Adapter: Training-free adaption of CLIP for few-shot classification. In Proceedings of the European Conference on Computer Vision, pages 493-510, 2022.
[43] Wenli Zhao, Kechen Song, Yanyan Wang, Shubo Liang, and Yunhui Yan. FaNet: Feature-aware network for few shot classification of strip steel surface defects. Measurement, 208:112446, 2023.
[44] Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from CLIP. In Proceedings of the European Conference on Computer Vision, pages 696-712, 2022.