
A Framework and Prototype for a Navigable Map of Datasets in Engineering Design and Systems Engineering

H. Sinan Bank, Daniel R. Herber

Year: 2026 | Venue: arXiv preprint | Area: cs.SE | Type: Preprint

Abstract

The proliferation of data across the system lifecycle presents both a significant opportunity and a challenge for Engineering Design and Systems Engineering (EDSE). While this "digital thread" has the potential to drive innovation, the fragmented and inaccessible nature of existing datasets hinders method validation, limits reproducibility, and slows research progress. Unlike fields such as computer vision and natural language processing, which benefit from established benchmark ecosystems, engineering design research often relies on small, proprietary, or ad-hoc datasets. This paper addresses this challenge by proposing a systematic framework for a "Map of Datasets in EDSE." The framework is built upon a multi-dimensional taxonomy designed to classify engineering datasets by domain, lifecycle stage, data type, and format, enabling faceted discovery. An architecture for an interactive discovery tool is detailed and demonstrated through a working prototype, employing a knowledge graph data model to capture rich semantic relationships between datasets, tools, and publications. An analysis of the current data landscape reveals underrepresented areas ("data deserts") in early-stage design and system architecture, as well as relatively well-represented areas ("data oases") in predictive maintenance and autonomous systems. The paper identifies key challenges in curation and sustainability and proposes mitigation strategies, laying the groundwork for a dynamic, community-driven resource to accelerate data-centric engineering research.



Full Text


A FRAMEWORK AND PROTOTYPE FOR A NAVIGABLE MAP OF DATASETS IN ENGINEERING DESIGN AND SYSTEMS ENGINEERING

H. Sinan Bank∗, Department of Systems Engineering, Colorado State University, Fort Collins, CO 80523
Daniel R. Herber, Department of Systems Engineering, Colorado State University, Fort Collins, CO 80523
Keywords: Systems Engineering, Engineering informatics, Data/information Modeling, Design Representation, Knowledge services, AI/KBS, Data Exchange, Design Methodology

1 INTRODUCTION

The transition to a data-centric paradigm is reshaping engineering disciplines. In Engineering Design and Systems Engineering (EDSE), this shift has produced a significant volume of data spanning the entire system lifecycle, from requirements definition and conceptual design to manufacturing, operations, and disposal. This digital thread offers the potential to enable AI-driven design, optimize complex systems, and validate new engineering methodologies.

However, a primary bottleneck prevents the full realization of this potential: the fragmented, siloed, and often inaccessible nature of engineering datasets. Researchers and practitioners face significant difficulty in discovering, comparing, and reusing relevant data [1, 2]. Unlike fields such as computer vision and natural language processing, which benefit from large-scale benchmark datasets, engineering design research often proceeds with small, proprietary, or ad-hoc datasets that limit reproducibility and generalizability.

Research has identified a fundamental theory-practice gap that complicates AI system engineering across data quality assurance, model building, and deployment [3]. Data-driven design in the early phases of physical product development faces particular challenges [4]. The reliance on data availability represents a significant hindrance, often leading to underexplored design spaces [5]. Both Model-Based Systems Engineering (MBSE) and AI face challenges regarding successful deployment, with a key bottleneck being the lack of a coherent foundation to integrate data-driven methods [6].
Additional concerns include the reliability and interpretability of AI-driven results, particularly with "black-box" models that produce outputs difficult to validate [7], and barriers related to human trust in AI predictions despite achieving comparable accuracy to human experts [8]. This challenge, frequently highlighted within the research community, directly impedes scientific progress.

∗ Corresponding author: sinan.bank@colostate.edu
arXiv:2603.15722v2 [cs.SE] 18 Mar 2026

FIGURE 1: Proposed framework (top) and multi-dimensional taxonomy (bottom) for classifying and discovering engineering datasets.

To address this critical need, this paper presents the design of a "Map of Datasets in EDSE". The objective is to move beyond a static list of resources and define a dynamic, navigable ecosystem that makes engineering data Findable, Accessible, Interoperable, and Reusable (FAIR). An initial prototype is presented to demonstrate feasibility, while complete implementation and validation of this design are identified as future work. The contribution of this work is a concrete design comprising three main components, as illustrated in Fig. 1 (top):

1. A multi-dimensional taxonomy for structured classification of engineering datasets across four dimensions: engineering domain, system lifecycle stage, data type/modality, and data format
2. A knowledge graph architecture specifying node types, relationship types, and interaction modes for an interactive discovery tool that captures rich semantic relationships between datasets, tools, publications, and taxonomy terms
3. An analysis of the current dataset landscape to identify gaps ("data deserts") and relatively well-represented areas ("data oases"), with discussion of mitigation strategies including synthetic data

The remainder of this paper is organized as follows: Section 2 reviews background and related work on FAIR principles, existing dataset ecosystems, and knowledge graph technologies.
Section 3 presents the proposed design, including the multi-dimensional taxonomy and knowledge graph architecture. Section 4 applies the framework to analyze the current landscape and presents an exemplar dataset catalog. Section 5 concludes with future work toward an intelligent, agent-driven ecosystem.

2 BACKGROUND AND RELATED WORK

The problem of data discoverability is a well-recognized challenge across scientific domains, leading to the development of the FAIR (Findability, Accessibility, Interoperability, and Reusability) Guiding Principles for scientific data management [9]. While general-purpose data catalogs such as data.gov and Zenodo exist, they often lack the domain-specific structure required to navigate the complexities of engineering data. The engineering community has produced numerous high-value datasets, but they are scattered across institutional repositories, competition websites, and individual publications.

2.1 Existing Dataset Ecosystems

Benchmark datasets in prognostics and health management (PHM), such as the NASA C-MAPSS turbofan engine data [10] and the CWRU Bearing Data [11], have become foundational for validating new algorithms. Similarly, the computer vision community has demonstrated the power of shared data through benchmarks such as KITTI for autonomous driving [12]. In materials science, the Materials Project has revolutionized discovery by providing a massive, open database of computed properties [13].

However, these successes are isolated. There is no unifying framework to connect a dataset of manufacturing line performance from Bosch with a repository of public requirements documents or a collection of CAD models. This lack of a central, structured map makes cross-domain discovery difficult and hinders the development of holistic, lifecycle-aware engineering AI.

2.2 FAIR Principles and Implementation Barriers

The FAIR principles provide a framework for managing heterogeneous datasets across the product lifecycle.
However, assessments in domains such as lifecycle assessment reveal that although awareness of FAIR data sharing is increasing, implementing specific FAIR guidelines is rarely observed in practice [14]. The challenge of independent and siloed datasets limits transparency and interoperability. Recent work on metadata repositories using Linked Data principles demonstrates approaches for real-time access to heterogeneous source systems [15], while Digital Twin frameworks have been proposed to promote FAIR principles in automotive use-cases [16].

Poor interoperability among computer-aided engineering (CAE) software tools costs the industry billions of dollars [17]. Fundamental interoperability barriers (e.g., in Systems of Systems) include heterogeneous data and disparate APIs [18]. Industrial scenarios present largely unexplored applicability of FAIR principles compared to research data [19]. The aspirational nature of FAIR principles means they do not provide precise guidance for direct implementation into specific domains, complicating practical adoption [20].

2.3 Ontologies and Taxonomies for Engineering Artifacts

Prior work has established the importance of formal ontologies for organizing engineering design knowledge. Ontologies of engineering artifacts contribute to design knowledge modeling by providing structured taxonomies with rich relationships [21]. Foundational ontologies such as BFO, GFO, and DOLCE offer frameworks for characterizing and classifying artifacts [22]. Knowledge-Based Engineering Data Management (KBEDM) approaches organize knowledge-intensive activities in modern design processes [23]. Domain-specific ontologies have been developed for CAD/CAE integration, including simulation intent ontologies that formalize analysis parameters and idealization decisions [24].
Layered ontology architectures—comprising general domain, domain-specific, and application-specific ontologies—represent engineering design knowledge at multiple abstraction levels [25, 26]. Such approaches enable semantic-level information exchange between heterogeneous applications across the product lifecycle [27]. Ontological engineering has also addressed integration of CAD and GIS for infrastructure management [28]. Prior work on representation frameworks for engineering design provides classification based on vocabulary, structure, expression, purpose, and abstraction [29]. Efforts to synthesize product knowledge across the lifecycle using upper-tiered ontologies [30] inform the taxonomy proposed in this work.

2.4 Knowledge Graphs for Data Integration

Knowledge graphs offer advantages over relational databases for managing diverse, interconnected datasets. Knowledge graphs integrate heterogeneous data from various sources—unstructured, semi-structured, and structured—in a semantically rich way [31]. Graph databases are valuable for data integration because their unfixed structure allows flexibility, unlike relational databases that depend on rigid schemas [32]. Comparative studies show graph databases significantly outperform relational databases in search query response time [33], and they are particularly effective for handling large-scale data requiring semantic association and visualization [34]. The LinkClimate platform demonstrates knowledge graphs for integrating multi-source heterogeneous data with improved interoperability through ontologies [35]. Data-centric system design using Knowledge Graphs and Semantic Web technologies provides a framework for data interoperability [36]. The flexibility of graph-based data models means adding new data sources requires significantly less effort than altering relational schemas.
Schema-free graph databases can be high-performance replacements for relational databases when handling highly interconnected data [37]. Graph-based approaches have enabled practical applications such as railway infrastructure systems that merged disconnected relational databases into unified knowledge graphs [38]. This makes knowledge graphs particularly suitable for representing the complex relationships between datasets, tools, methods, and publications in engineering design.

3 METHODOLOGY: A FRAMEWORK FOR THE EDSE DATASET MAP

This section presents the proposed framework for a navigable "Map of Datasets in EDSE". The design comprises two main components: a multi-dimensional taxonomy for structured classification and a knowledge graph architecture for an interactive discovery tool.

3.1 The Multi-Dimensional Taxonomy

A foundational element of the proposed map is a multi-dimensional taxonomy that enables faceted search and flexible organization, as illustrated in Fig. 1 (bottom). Unlike traditional flat lists or single-category classification schemes, a multi-dimensional taxonomy allows datasets to be characterized along multiple independent dimensions simultaneously. This approach supports diverse user needs: a researcher seeking time-series data for prognostics can filter by data type, while another seeking aerospace applications can filter by domain, and both can combine criteria to narrow results. Four key dimensions are proposed, each structured hierarchically to support navigation from broad categories to specific sub-disciplines.

3.1.1 Dimension 1: Engineering Domain and Application Area. The first dimension categorizes datasets by their primary field of engineering application. This reflects the reality that engineering research and practice are organized around domain-specific communities, conferences, and publication venues. Researchers typically begin their search within their home domain before considering cross-domain resources.
The hierarchical structure of this dimension allows users to navigate from broad domains such as Aerospace Engineering, Automotive Engineering, Biomedical Engineering, Civil and Infrastructure Engineering, Manufacturing Engineering, and Energy Systems, down to increasingly specific sub-disciplines. For example, a user might navigate from Aerospace to Propulsion Systems to Turbofan Engines to find the NASA C-MAPSS dataset, or from Automotive to Autonomous Systems to Perception to discover the KITTI benchmark. This hierarchical organization also reveals structural similarities across domains—condition monitoring datasets in aerospace propulsion share methodological characteristics with those in wind turbine drivetrains, even though they originate from different engineering communities.

3.1.2 Dimension 2: System Lifecycle Stage. The second dimension classifies datasets according to the phase of the systems engineering lifecycle in which they are generated or primarily applicable. This dimension is particularly important for engineering design research because data characteristics, availability, and quality vary dramatically across lifecycle stages. The lifecycle stages follow the general structure of the systems engineering V-model, spanning from early conceptual phases through detailed design, manufacturing, operations, and eventual disposal.
The stages include System Requirements Definition, which encompasses stakeholder needs, functional requirements, and specifications; Conceptual Design and Trade Studies, covering architecture exploration and concept evaluation; Preliminary and Detailed Design, addressing component specifications, CAD models, and simulation results; Manufacturing and Production, including process data, quality control, and assembly information; Integration, Verification, and Validation, encompassing test data and certification evidence; Operations and Maintenance, covering field data, sensor streams, and maintenance records; and finally Disposal and End-of-Life, addressing decommissioning and material recovery data. This lifecycle perspective enables users to find data relevant to their specific engineering activity and reveals the digital thread connecting data across stages.

3.1.3 Dimension 3: Data Type and Modality. The third dimension describes the fundamental nature of the information contained within a dataset, which determines the analytical methods and tools required for its use. Engineering datasets span a remarkable diversity of modalities, from natural language text to three-dimensional geometry to high-frequency sensor signals. Understanding this diversity is essential for researchers seeking data compatible with their methods. The taxonomy distinguishes five primary data types. Textual and Semantic data includes requirements documents, specifications, design rationale, technical reports, and other natural language content that encodes engineering knowledge in human-readable form; these datasets are amenable to natural language processing and information extraction techniques. Geometric and Structural data encompasses CAD models, meshes, point clouds, and other representations of physical form; these require specialized geometric processing algorithms and domain-specific file formats.
Behavioral and Simulation data include time-series outputs from physics-based simulations, system models, and digital twins that capture how systems evolve over time; these datasets enable the development and validation of surrogate models and reduced-order approximations. Experimental and Test data comprises measurements from laboratory experiments, component tests, and controlled evaluations that provide ground truth for model validation. Finally, Operational and Field data include sensor streams, maintenance logs, and performance records from deployed systems operating in real-world conditions; these datasets are essential for prognostics, health management, and operational optimization but often come with challenges of noise, missing data, and proprietary restrictions. Importantly, a single dataset may span multiple data types—a digital twin dataset might include both geometric models and behavioral simulations—and the taxonomy supports multi-labeling to capture this richness.

3.1.4 Dimension 4: Data Format and Structure. The fourth dimension classifies datasets based on their file format and internal organization, which dictates the technical requirements for access and processing. This practical dimension is often overlooked in conceptual discussions but is critically important for researchers who must actually work with the data. The taxonomy distinguishes four categories of data structure. Structured data follows a predefined schema with explicit relationships, including tabular formats such as CSV and relational databases; these are readily ingested by standard data science tools and machine learning pipelines. Semi-Structured data has organizational properties but without rigid schemas, including hierarchical formats such as JSON and XML; these require parsing but offer flexibility for complex nested relationships.
Unstructured data lacks predefined organization, including PDF documents, images, and raw text; these require content extraction and interpretation before analysis. Finally, Domain-Specific Formats are specialized representations developed for particular engineering applications, such as STEP and IGES for CAD geometry, HDF5 for large scientific datasets, and SysML/XMI for system models; these require specialized software and domain expertise but preserve rich semantic information. By classifying format alongside content, the taxonomy helps users identify datasets they can realistically work with given their available tools and expertise.

3.2 Knowledge Graph Architecture for the Navigable Map

The multi-dimensional taxonomy provides a classification scheme, but realizing a truly navigable map requires an underlying data model and user interface that support flexible exploration. This section describes the proposed architecture for an interactive discovery tool.

3.2.1 Data Model: Knowledge Graph. A knowledge graph is proposed as the underlying data model, chosen over traditional relational databases for several compelling reasons. First, the graph model offers superior flexibility by representing datasets, taxonomy terms, tools, methods, publications, and organizations as nodes connected by typed relationships (edges). This structure natively supports the faceted classification scheme: a dataset node can connect to multiple taxonomy term nodes across all four dimensions simultaneously, without the awkward join tables required in relational schemas. Second, knowledge graphs enable rich semantic relationships beyond simple classification. Relationships such as:

◦ dataset_A used_in publication_X
◦ tool_Y compatible_with format_Z
◦ dataset_A derived_from dataset_B
◦ method_M validated_on dataset_D

can be explicitly represented and queried.
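As a minimal sketch (illustrative only, not the authors' implementation), typed relationships like those listed above can be stored as labeled edges and traversed in either direction, e.g. from a dataset to the publications that used it:

```python
# Typed edges of a tiny knowledge graph: (source, relation, target).
# Node and relation names mirror the examples given in the text.
edges = [
    ("dataset_A", "used_in", "publication_X"),
    ("tool_Y", "compatible_with", "format_Z"),
    ("dataset_A", "derived_from", "dataset_B"),
    ("method_M", "validated_on", "dataset_D"),
]

def forward(node, relation):
    """Follow `relation` edges out of `node`."""
    return [t for s, r, t in edges if s == node and r == relation]

def backward(node, relation):
    """Follow `relation` edges into `node` (reverse traversal)."""
    return [s for s, r, t in edges if t == node and r == relation]

# Which publications used dataset_A?
print(forward("dataset_A", "used_in"))        # ['publication_X']
# Which methods were validated on dataset_D?
print(backward("dataset_D", "validated_on"))  # ['method_M']
```

A production system would use a graph database with indexed lookups, but the point stands: the relation name is data, so new relationship types need no schema change.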
These relationships capture the intellectual connections within the research community that are invaluable for discovery but impossible to represent in flat catalogs. Third, knowledge graphs support extensibility: new entity types, relationship types, and attributes can be added incrementally without schema migrations that would disrupt existing queries. This is essential for a community resource that must evolve with the field. Fourth, adoption of standard vocabularies such as DCAT (Data Catalog Vocabulary) ensures interoperability with other data catalogs and the broader semantic web ecosystem.

The knowledge graph schema includes five primary node types. Dataset nodes carry metadata including title, description, source URL, license, DOI, size, temporal coverage, and quality metrics. TaxonomyTerm nodes represent each term in the four taxonomic dimensions, connected by hierarchical parent_of relationships that enable roll-up queries and faceted navigation. Publication nodes represent papers, reports, and other scholarly works that use or describe datasets. Tool nodes represent software packages and libraries compatible with specific data formats or analysis types. Organization nodes represent institutions that create, host, or maintain datasets.

3.2.2 User Interaction and Querying. The user interface must support both exploratory browsing—for users who do not yet know what they are looking for—and directed querying—for users with specific requirements. Faceted search allows users to progressively filter datasets by selecting terms from any combination of the four taxonomic dimensions; each selection narrows the result set and updates the available filter options based on remaining datasets.
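The faceted-search behavior just described can be sketched in a few lines (the records and field names below are hypothetical, chosen to echo the taxonomy dimensions): each selected facet term narrows the result set, and the remaining facet counts are recomputed from the surviving records to update the filter UI.

```python
from collections import Counter

# Hypothetical catalog records with multi-label facet fields.
catalog = [
    {"name": "C-MAPSS", "domain": ["Aerospace"], "data_type": ["Behavioral"]},
    {"name": "KITTI", "domain": ["Automotive"], "data_type": ["Operational"]},
    {"name": "ABC CAD", "domain": ["Cross-Domain"], "data_type": ["Geometric"]},
]

def filter_catalog(records, **selected):
    """Keep records matching every selected (facet, term) pair."""
    return [r for r in records
            if all(term in r[facet] for facet, term in selected.items())]

def facet_counts(records, facet):
    """Counts of facet options still available after filtering."""
    return Counter(term for r in records for term in r[facet])

hits = filter_catalog(catalog, domain="Aerospace")
print([r["name"] for r in hits])        # ['C-MAPSS']
print(facet_counts(hits, "data_type"))  # Counter({'Behavioral': 1})
```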
Relationship navigation allows users to traverse the graph by following edges: starting from a publication, users can find all datasets it cites; starting from a tool, users can find compatible datasets; starting from a dataset, users can find all publications that have used it. Natural language querying allows users to express information needs conversationally (e.g., "find me CAD datasets for aerospace components with open licenses"), with the system translating these requests into structured graph queries.

3.2.3 Visualization Paradigms. Effective visualization helps users understand the dataset landscape at a glance. A sunburst chart can visualize the hierarchical taxonomy structure, with ring segments representing taxonomy terms and segment size proportional to dataset counts, enabling users to immediately perceive where data is concentrated. A network graph can display relationships between datasets, publications, and tools as an interactive node-link diagram, revealing clusters of related resources and influential datasets that connect multiple communities. A lifecycle heatmap can cross-tabulate engineering domains against lifecycle stages, with cell color intensity indicating data availability, immediately revealing where data deserts and oases exist. A domain matrix can similarly cross-tabulate domains against data types or formats to identify underrepresented combinations.

3.3 Dataset Cataloging Process

To populate the map, a systematic cataloging process is employed. Identification involves discovering datasets through literature review, repository searches (e.g., Zenodo, Kaggle, NASA Open Data), competition archives (e.g., PHM Society Data Challenges), and community contributions. Profiling creates standardized metadata for each dataset, including source, description, access method, license, temporal and size characteristics, and known applications in the literature.
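A profiling record of the kind described above might look like the following sketch. The field names are an assumption informed by the metadata listed in the text and by DCAT-style catalogs, not the project's actual schema, and the example values are illustrative rather than authoritative.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetProfile:
    """One catalog entry produced by the profiling step (illustrative)."""
    title: str
    description: str
    source_url: str              # hypothetical placeholder, not an exact page
    license: str
    access_method: str           # e.g. direct download, request, API
    size: str                    # human-readable size characteristic
    temporal_coverage: str
    known_applications: list[str] = field(default_factory=list)

profile = DatasetProfile(
    title="NASA C-MAPSS",
    description="Simulated turbofan engine degradation data",
    source_url="https://data.nasa.gov",
    license="open (U.S. Government data; verify on the source page)",
    access_method="direct download",
    size="multiple sub-datasets of multivariate time series",
    temporal_coverage="run-to-failure trajectories",
    known_applications=["RUL prediction"],
)
print(profile.title)  # NASA C-MAPSS
```

Keeping the profile a flat, typed record makes it straightforward to serialize into the knowledge graph as a Dataset node's properties.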
Classification assigns datasets to terms along all four taxonomic dimensions, with multi-labeling where a dataset spans multiple categories. Relationship extraction identifies and encodes connections to publications that use the dataset, tools that process it, and related datasets (e.g., extended versions, derived subsets). Quality assessment records data quality indicators where available, including documentation completeness, FAIR compliance scores, and community feedback on usability.

4 RESULTS AND DISCUSSION

This section presents the results of applying the proposed framework, including an exemplar catalog of public datasets, a gap analysis of the current data landscape, and a discussion of synthetic data as a mitigation strategy.

4.1 Exemplar Catalog of Public Datasets

To illustrate the taxonomy's application, a curated catalog of diverse, publicly accessible datasets was compiled. Each entry was profiled with its source, description, classification across the four dimensions, and known applications. Table 1 presents three representative examples spanning different domains, lifecycle stages, and data types. These three examples demonstrate the taxonomy's ability to classify datasets across contrasting dimensions—from operational simulation data to geometric models to textual requirements—enabling cross-cutting discovery.

FIGURE 2: Screenshots of the Map of Datasets in EDSE prototype: about modal (top left), dashboard (top right), knowledge graph view (bottom left), table view with filters (bottom right), and dataset detail panel (side).
TABLE 1: Representative Examples from the EDSE Dataset Catalog

Dataset | Domain | Lifecycle Stage | Data Type | Format | Primary Use
NASA C-MAPSS [10] | Aerospace / Propulsion | Operations & Maintenance | Behavioral / Simulation | Structured (CSV) | RUL Prediction
ABC CAD Dataset [39] | Cross-Domain | Detailed Design | Geometric / Structural | Domain-Specific (STEP) | Shape Analysis
PURE Requirements [40] | Cross-Domain | Requirements Definition | Textual / Semantic | Semi-Structured (XML) | NLP for RE

4.2 Interactive Discovery Tool Prototype

To demonstrate the feasibility of the proposed framework, a web-based prototype was developed. The tool implements the multi-dimensional taxonomy as interactive faceted filters and provides multiple complementary views for dataset exploration. Figure 2 presents screenshots of the prototype [41].

A dashboard view (Fig. 2, top right) provides an at-a-glance summary of the catalog, showing total datasets, publications, tools, and year range alongside distribution charts for each taxonomic dimension. The table view (Fig. 2, bottom right) displays datasets in a filterable, sortable table with taxonomy classifications rendered as color-coded badges. Selecting a dataset highlights the corresponding row and opens a detail panel (Fig. 2, side panel) with full metadata, including source, license, tools, key publications, and a Google Scholar search link. The graph view (Fig. 2, bottom left) renders the dataset catalog as an interactive knowledge graph using a force-directed layout. Dataset nodes (rectangles) connect to taxonomy term nodes (ellipses) colored by dimension—teal for domain, orange for lifecycle, blue for data type, and purple for format—with tool nodes (diamonds) shown in amber. Layer toggles allow users to show or hide specific dimensions to reduce visual complexity. An about modal (Fig. 2, top left) presents the project context and provides links for community contribution, including dataset submission and issue reporting through structured GitHub issue templates.
4.3 Gap Analysis: Data Deserts and Data Oases

Analysis of the current dataset landscape using the proposed taxonomy reveals significant imbalances in data availability across the engineering design space. The literature documents clear contrasts between underrepresented and well-supported areas.

4.3.1 Data Deserts: Underrepresented Areas. A significant scarcity of public datasets exists in specific domains and lifecycle stages. The literature explicitly documents these gaps:

Early Lifecycle Stages: Data related to conceptual design, requirements engineering, and trade studies are exceptionally rare. A prominent challenge facing machine learning adoption in engineering design research is the scarcity of publicly available, high-quality datasets [1]. The lack of readily accessible, specialized engineering design data—in contrast to fields such as computer vision—hinders research efforts.

System Architecture and MBSE: Publicly available system models (e.g., SysML repositories) are scarce. Progress in data-driven engineering design has been significantly slowed by the lack of standardized simulation environments and diverse datasets [2].

Late Lifecycle Stages: The catalog reveals a near-complete absence of public datasets covering system disposal, retirement, and material recovery, suggesting these stages remain a critical data desert.

Specific Domains: Certain domains have limited open datasets due to proprietary, safety, or privacy concerns. A severe lack of datasets exists for mechanics, dynamics, and engineering design on platforms such as Hugging Face, contrasting sharply with natural language processing and computer vision [42]. The limited generalizability of current data-driven methods represents a major challenge, largely due to simplified models and constrained datasets [43]. Research in testing, validation, and verification of autonomous systems reveals fragmentation in evaluation methodologies and a lack of consolidated benchmarks [44].
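Operationally, this desert/oasis analysis amounts to cross-tabulating domains against lifecycle stages and looking for cells that stay empty. A minimal sketch with hypothetical counts (the real catalog is far larger; labels here only echo the taxonomy):

```python
from collections import Counter

# (domain, lifecycle_stage) labels for a hypothetical catalog sample.
classified = [
    ("Aerospace", "Operations & Maintenance"),
    ("Aerospace", "Operations & Maintenance"),
    ("Automotive", "Operations & Maintenance"),
    ("Cross-Domain", "Detailed Design"),
]
domains = ["Aerospace", "Automotive", "Cross-Domain"]
stages = ["Requirements Definition", "Conceptual Design",
          "Detailed Design", "Operations & Maintenance", "Disposal"]

counts = Counter(classified)
for domain in domains:
    row = [counts[(domain, s)] for s in stages]
    print(f"{domain:12s} {row}")

# Stages with zero datasets across every domain are "data deserts";
# heavily populated cells are "data oases".
deserts = [s for s in stages if all(counts[(d, s)] == 0 for d in domains)]
print(deserts)  # ['Requirements Definition', 'Conceptual Design', 'Disposal']
```

The lifecycle heatmap described in Section 3.2.3 is exactly this cross-tabulation rendered with color intensity instead of printed counts.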
4.3.2 Data Oases: Well-Represented Areas. Conversely, certain areas benefit from an abundance of high-quality public data, often propelled by community or institutional efforts, as illustrated in Fig. 3:

FIGURE 3: Example of the C-MAPSS dataset represented in the proposed knowledge graph schema.

Predictive Maintenance of Rotating Machinery: A wealth of datasets (e.g., CWRU [11], PRONOSTIA [45], NASA C-MAPSS [10]) has created a robust benchmarking environment. Figure 3 illustrates how C-MAPSS is represented in the proposed knowledge graph schema, with taxonomy classification across all four dimensions and semantic connections to publications, tools, and the maintaining organization. The current prototype implements dataset, taxonomy term, and tool nodes; publication and organization nodes are planned for future versions. The lack of common datasets was an impediment to progress in prognostics, which motivated the generation of the C-MAPSS datasets [46, 47]. Numerous publications have utilized C-MAPSS for data-driven algorithms, demonstrating significant research impact. The PHM Society and NASA continue to have a major impact on dataset provision [48]. Analysis of benchmarking datasets has established minimum sample sizes for effective data-driven modeling [49], while standardized evaluation metrics tailored for prognostics enable fair algorithm comparison [50]. Open-source tools such as PyPHM assist researchers in accessing and preprocessing common industrial datasets to facilitate reproducibility [51, 52]. C-MAPSS continues to be used for state-of-the-art aero-engine performance degradation models [53, 54].

Autonomous Driving and Vision: Large-scale, multi-sensor datasets (e.g., KITTI, Waymo Open, nuScenes) have fueled rapid advancements in perception and planning algorithms.

Materials Science: The Materials Project [13] has made millions of computed material properties openly available, revolutionizing materials discovery workflows.
Public Infrastructure: Government-mandated datasets, such as the National Bridge Inventory [55], have enabled large-scale statistical studies of infrastructure condition.

Having standard benchmark datasets enables researchers to reproduce and validate results, which is critical for data-driven method development [56]. Benchmark datasets such as C-MAPSS and XJTU-SY are crucial for enabling comparability and reproducibility [57].

4.4 The Role of Synthetic Data

For areas identified as data deserts, synthetic data generation offers a viable mitigation strategy. High-fidelity, physics-based simulations and generative AI models can produce datasets for training and testing new methods when real data is unavailable or proprietary. However, several challenges must be addressed:

◦ Reality Gap: Models trained on synthetic data may fail to generalize to real-world conditions if the simulation does not capture all relevant physics and variability
◦ Validation: The fidelity of synthetic data must be validated against real-world observations where possible
◦ Documentation: Synthetic datasets must be clearly labeled and their generation process documented to enable proper interpretation

The proposed knowledge graph architecture can accommodate synthetic datasets by including metadata about generation methods, simulation tools used, and validation status, enabling users to make informed decisions about dataset suitability.

4.5 Challenges and Mitigation Strategies

The development and maintenance of a sustainable dataset map face several challenges:

◦ Data Quality and Standardization: Datasets vary in quality and format. Mitigation strategies include enforcing a standardized metadata schema (e.g., DCAT), providing data quality metrics in dataset profiles, and encouraging community contributions for cleaning and annotation.
◦ Metadata and Discoverability: Creating rich, consistent metadata is labor-intensive.
This can be addressed through semi-automated curation, community-driven contributions with moderation workflows, and adoption of persistent identifiers such as DOIs.
◦ Intellectual Property and Accessibility: Licenses must be clearly tracked and displayed. The map should prioritize verifiably open datasets and provide clear guidance for those with restricted access.
◦ Sustainability: To prevent stagnation, the map must be a living resource. This requires a governance model that encourages community contributions, integration with publication workflows, and institutional support from professional societies or university consortia.

5 CONCLUSION AND FUTURE WORK

This paper has presented a concrete design for a “Map of Datasets in EDSE”, addressing the critical challenge of data fragmentation that impedes data-driven research and practice.

5.1 Summary of Contributions

The contributions of this work include:

1. A multi-dimensional taxonomy with four dimensions (Engineering Domain, Lifecycle Stage, Data Type, and Data Format) that enables faceted classification and discovery of engineering datasets
2. A knowledge graph architecture for an interactive discovery tool, specifying node types, relationship types, and interaction modes that support rich semantic relationships between datasets, publications, tools, and taxonomy terms
3. An exemplar catalog demonstrating the taxonomy's application to diverse public datasets from prognostics, manufacturing, autonomous systems, and materials science
4. A gap analysis revealing significant “data deserts” in early-stage design and system architecture, contrasted with “data oases” in predictive maintenance and autonomous driving
5.
Identification of challenges and mitigation strategies for sustainable curation, including the role of synthetic data for underrepresented areas

The analysis confirms that data scarcity is a documented, widespread challenge limiting validation of data-driven methods in engineering design, while also demonstrating that coordinated community efforts (as in PHM) can create transformative benchmark ecosystems.

5.2 Future Work: Toward an Intelligent, Agent-Driven Ecosystem

Future work should focus on evolving the current prototype into a complete implementation of the map and on cultivating a community of contributors. The evolution of this platform is envisioned as an intelligent, agent-driven ecosystem.

Recent breakthroughs in AI, including self-supervised learning and geometric deep learning, are enabling new approaches to scientific discovery [58]. AI agents demonstrate capabilities including user-centered interaction, semantic knowledge extraction, intelligent reasoning, automation, and explainability [59]. AI is already used to automate archival workflows around the capture and organization of collections [60].
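One concrete agentic capability for such an ecosystem is translating a user's natural-language request into structured facet filters over the map. A production agent would use an LLM or NLP pipeline; the keyword vocabulary, facet names, and dataset entries below are purely illustrative assumptions:

```python
# Minimal rule-based sketch: map recognized keywords to (facet, value) filters,
# then retain only datasets matching every extracted filter.
FACET_VOCAB = {
    "cad": ("data_type", "CAD geometry"),
    "aerospace": ("domain", "Aerospace"),
    "permissive": ("license", "permissive"),
    "sensor": ("data_type", "time series"),
}

def parse_query(text):
    """Extract facet filters from recognized keywords in the query."""
    filters = {}
    for word in text.lower().split():
        word = word.strip(",.?!")
        if word in FACET_VOCAB:
            facet, value = FACET_VOCAB[word]
            filters[facet] = value
    return filters

def search(datasets, filters):
    """Return datasets matching every extracted facet filter."""
    return [d for d in datasets if all(d.get(f) == v for f, v in filters.items())]

# Illustrative entries (the second name is hypothetical).
datasets = [
    {"name": "ABC", "data_type": "CAD geometry",
     "domain": "Mechanical", "license": "permissive"},
    {"name": "HypAero", "data_type": "CAD geometry",
     "domain": "Aerospace", "license": "permissive"},
]

q = parse_query("find CAD datasets for aerospace components with permissive licenses")
print([d["name"] for d in search(datasets, q)])  # ['HypAero']
```

In a knowledge graph backend, the extracted filters would be compiled into graph queries rather than dictionary comparisons, but the facet-extraction step is the same.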
In a future state, the dataset map could leverage agentic AI in several ways:

◦ Automated Discovery: AI agents could proactively crawl repositories, publications, and institutional websites to identify new datasets, extracting metadata and proposing classifications
◦ Semantic Enrichment: Natural language processing could automatically extract relationships between datasets and publications, populating the knowledge graph with minimal manual effort
◦ Quality Assessment: Agents could evaluate dataset documentation completeness, FAIR compliance, and consistency, flagging issues for human review
◦ Natural Language Querying: Users could interact with the map using natural language queries (e.g., “find me CAD datasets for aerospace components with permissive licenses”), with the agent translating them to graph queries

5.3 Call to Action

By pursuing this vision, the Map of Datasets in EDSE can become an indispensable, self-sustaining infrastructure that empowers the next generation of data-driven engineering innovation. Success requires:

◦ Community Engagement: Researchers, practitioners, and institutions contributing dataset entries and maintaining quality
◦ Institutional Support: Professional societies (ASME, INCOSE) or university consortia providing governance and resources
◦ Integration with Workflows: Linking dataset deposition to publication processes to ensure new datasets are discoverable
◦ Open Standards: Adopting DCAT, DOIs, and FAIR principles to ensure interoperability with broader data ecosystems

The design presented here provides the foundation; realizing its potential requires coordinated action across the engineering design and systems engineering community.

REFERENCES

[1] Ahmed, F., Picard, C., Chen, W., McComb, C., Wang, P., Lee, I., Stankovic, T., Allaire, D., and Menzel, S., 2025. “Special Issue: Design by Data: Cultivating Datasets for Engineering Design”. Journal of Mechanical Design, 147(4), p. 040301.
[2] Felten, F., Apaza, G., Bräunlich, G., Diniz, C., Dong, X., Drake, A., Habibi, M., Hoffman, N. J., Keeler, M., Massoudi, S., et al., 2025. “Engibench: A Framework for Data-Driven Engineering Design Research”. arXiv preprint arXiv:2508.00831.
[3] Fischer, L., Ehrlinger, L., Geist, V., Ramler, R., Sobiezky, F., Zellinger, W., Brunner, D., Kumar, M., and Moser, B., 2020. “AI System Engineering—Key Challenges and Lessons Learned”. Machine Learning and Knowledge Extraction, 3(1), p. 56–83.
[4] Briard, T., Jean, C., Aoussat, A., and Véron, P., 2023. “Challenges for Data-Driven Design in Early Physical Product Design: A Scientific and Industrial Perspective”. Computers in Industry, 145, p. 103814.
[5] Rad, M. A., 2025. “Data Engineering for Data-Driven Design”. PhD thesis.
[6] Chami, M., Abdoun, N., and Bruel, J.-M., 2022. “Artificial Intelligence Capabilities for Effective Model-Based Systems Engineering: A Vision Paper”. In INCOSE International Symposium, Vol. 32, Wiley Online Library, p. 1160–1174.
[7] Rudin, C., 2019. “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead”. Nature Machine Intelligence, 1(5), p. 206–215.
[8] Schmidt, P., Biessmann, F., and Teubner, T., 2020. “Transparency and Trust in Artificial Intelligence Systems”. Journal of Decision Systems, 29(4), p. 260–278.
[9] Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., et al., 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship”. Scientific Data, 3(1), p. 1–9.
[10] Saxena, A., and Goebel, K., 2008. “Turbofan Engine Degradation Simulation Data Set”. NASA Ames Prognostics Data Repository, 18, p. 878–887.
[11] Case Western Reserve University, 2000. Bearing Data Center.
[12] Geiger, A., Lenz, P., and Urtasun, R., 2012. “Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite”.
In 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, p. 3354–3361.
[13] Materials Project, 2011. Materials Project database. https://materialsproject.org.
[14] Ghose, A., 2024. “Can LCA be FAIR? Assessing the Status Quo and Opportunities for FAIR Data Sharing”. The International Journal of Life Cycle Assessment, 29(4), p. 733–744.
[15] Eickhoff, T., Eiden, A., Göbel, J. C., and Eigner, M., 2020. “A Metadata Repository for Semantic Product Lifecycle Management”. Procedia CIRP, 91, p. 249–254.
[16] Hamlaoui, R., Orimi, A. G., Donia, R., Backe, C., Briken, V., and Lachmayer, R., 2025. “Digital Twin for Field Data Management: Design of a Platform to Promote FAIR Principles and Ensure Data Reusability”. Procedia CIRP, 136, p. 165–170.
[17] Szykman, S., Fenves, S. J., Keirouz, W., and Shooter, S. B., 2000. “A Foundation for Interoperability in Next-Generation Product Development Systems”. In International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, Vol. 35111, American Society of Mechanical Engineers, p. 87–103.
[18] Sadeghi, M., Carenini, A., Corcho, O., Rossi, M., Santoro, R., and Vogelsang, A., 2024. “Interoperability of Heterogeneous Systems of Systems: From Requirements to a Reference Architecture”. The Journal of Supercomputing, 80(7), p. 8954–8987.
[19] Bodenbenner, M., Montavon, B., and Schmitt, R. H., 2021. “FAIR Sensor Services—Towards Sustainable Sensor Data Management”. Measurement: Sensors, 18, p. 100206.
[20] Karakoltzidis, A., Battistelli, C. L., Bossa, C., Bouman, E. A., Aguirre, I. G., Iavicoli, I., Jeddi, M. Z., Karakitsios, S., Leso, V., Løfstedt, M., et al., 2024. “The FAIR Principles as a Key Enabler to Operationalize Safe and Sustainable by Design Approaches”. RSC Sustainability, 2(11), p. 3464–3477.
[21] Kitamura, Y., et al., 2006. “Roles of Ontologies of Engineering Artifacts for Design Knowledge Modeling”.
[22] Borgo, S., and Vieu, L., 2009. “Artefacts in Formal Ontology”.
In Philosophy of Technology and Engineering Sciences. Elsevier, p. 273–307.
[23] Hirz, M., Dietrich, W., Gfrerrer, A., and Lang, J., 2013. “Integrated Computer-Aided Design in Automotive Development”. Springer-Verl. Berl.-Heidelb. DOI, 10, p. 978–3.
[24] Boussuge, F., Tierney, C. M., Vilmart, H., Robinson, T. T., Armstrong, C. G., Nolan, D. C., Léon, J.-C., and Ulliana, F., 2019. “Capturing Simulation Intent in an Ontology: CAD and CAE Integration Application”. Journal of Engineering Design, 30(10-12), p. 688–725.
[25] Zhu, L., Jayaram, U., Jayaram, S., and Kim, O., 2009. “Ontology-Driven Integration of CAD/CAE Applications: Strategies and Comparisons”. In International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, Vol. 48999, p. 1461–1472.
[26] Zhan, P., Jayaram, U., Kim, O., and Zhu, L., 2010. “Knowledge Representation and Ontology Mapping Methods for Product Data in Engineering Applications”.
[27] Zhan, P., 2007. “An Ontology-Based Approach for Semantic Level Information Exchange and Integration in Applications for Product Lifecycle Management”. PhD thesis, Washington State University.
[28] Peachavanish, R., Karimi, H. A., Akinci, B., and Boukamp, F., 2006. “An Ontological Engineering Approach for Integrating CAD and GIS in Support of Infrastructure Management”. Advanced Engineering Informatics, 20(1), p. 71–88.
[29] Summers, J. D., and Shah, J. J., 2004. “Representation in Engineering Design: A Framework for Classification”. In International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, Vol. 46962, p. 439–448.
[30] Witherell, P., Kulvatunyou, B., and Rachuri, S., 2013. “Towards the Synthesis of Product Knowledge Across the Lifecycle”. In ASME International Mechanical Engineering Congress and Exposition, Vol. 56413, American Society of Mechanical Engineers, p. V012T13A071.
[31] Hofer, M., Obraczka, D., Saeedi, A., Köpcke, H., and Rahm, E., 2024.
“Construction of Knowledge Graphs: Current State and Challenges”. Information, 15(8), p. 509.
[32] Di Pierro, D., et al., 2025. “Ontology-Enriched Graph Databases: An Interoperable Framework for Knowledge Integration and Management”. PhD thesis.
[33] Lorincz, J., Huljic, V., and Begusic, D., 2020. “Transforming Product Catalogue Relational into Graph Database: A Performance Comparison”. In MIPRO, p. 523–528.
[34] Zhang, K., Sun, S., Zou, D., Weng, S., and Ma, X., 2024. “Research on the Construction Method of Railway Data Resource Catalog Based on Knowledge Graphs”. In International Conference on Artificial Intelligence and Autonomous Transportation, Springer, p. 177–184.
[35] Wu, J., Orlandi, F., O’Sullivan, D., and Dev, S., 2022. “LinkClimate: An Interoperable Knowledge Graph Platform for Climate Data”. Computers & Geosciences, 169, p. 105215.
[36] Rojas, J. A., Aguado, M., Vasilopoulou, P., Velitchkov, I., Van Assche, D., Colpaert, P., and Verborgh, R., 2021. “Leveraging Semantic Technologies for Digital Interoperability in the European Railway Domain”. In International Semantic Web Conference, Springer, p. 648–664.
[37] Kalaycı, T. E., Bricelj, B., Lah, M., Pichler, F., Scharrer, M. K., and Rubeša-Zrim, J., 2021. “A Knowledge Graph-Based Data Integration Framework Applied to Battery Data Management”. Sustainability, 13(3), p. 1583.
[38] Toledo, J., Doña, D., Ruckhaus, E., Corcho, O., Aguado, M., Patru, D., Atemezing, G., and Vasilopoulou, P., 2025. “Using Semantic Technologies in the Railway Domain: The Register of Infrastructure (RINF) System”. In International Semantic Web Conference, Springer, p. 398–414.
[39] Koch, S., Matveev, A., Jiang, Z., Williams, F., Artemov, A., Burnaev, E., Alexa, M., Zorin, D., and Panozzo, D., 2019. “ABC: A Big CAD Model Dataset For Geometric Deep Learning”. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, p. 9601–9611.
[40] Ferrari, A., Spagnolo, G. O., and Gnesi, S., 2017.
“PURE: A Dataset of Public Requirements Documents”. In 2017 IEEE 25th International Requirements Engineering Conference (RE), IEEE, p. 502–505.
[41] Bank, H. S., and Herber, D. R., 2026. A navigable map of datasets in engineering design and systems engineering: Interactive web application. https://map-of-EDSE-datasets.github.io/map-of-EDSE-datasets. Accessed: 2026.
[42] Ebel, H., van Delden, J., Lüddecke, T., Borse, A., Gulakala, R., Stoffel, M., Yadav, M., Stender, M., Schindler, L., de Payrebrune, K. M., et al., 2025. “Data Publishing in Mechanics and Dynamics: Challenges, Guidelines, and Examples from Engineering Design”. Data-Centric Engineering, 6, p. e23.
[43] Afifi, N., Wittig, C., Paehler, L., Lindenmann, A., Wolter, K., Leitenberger, F., Dogru, M., Grauberger, P., Düser, T., Albers, A., et al., 2025. “Data-Driven Methods and AI in Engineering Design: A Systematic Literature Review Focusing on Challenges and Opportunities”. arXiv preprint arXiv:2511.20730.
[44] Araujo, H., Mousavi, M. R., and Varshosaz, M., 2023. “Testing, Validation, and Verification of Robotic and Autonomous Systems: A Systematic Review”. ACM Transactions on Software Engineering and Methodology, 32(2), p. 1–61.
[45] Nectoux, P., Gouriveau, R., Medjaher, K., Ramasso, E., Chebel-Morello, B., Zerhouni, N., and Varnier, C., 2012. “PRONOSTIA: An Experimental Platform for Bearings Accelerated Degradation Tests”. In IEEE International Conference on Prognostics and Health Management, PHM’12, IEEE Catalog Number: CPF12PHM-CDR, p. 1–8.
[46] Ramasso, E., and Saxena, A., 2014. “Performance Benchmarking and Analysis of Prognostic Methods for CMAPSS Datasets”. International Journal of Prognostics and Health Management, 5(2), p. 1–15.
[47] Ramasso, E., and Saxena, A., 2014. “Review and Analysis of Algorithmic Approaches Developed for Prognostics on CMAPSS Dataset”. In Annual Conference of the Prognostics and Health Management Society 2014.
[48] Hagmeyer, S., Mauthe, F., and Zeiler, P., 2021.
“Creation of Publicly Available Data Sets for Prognostics and Diagnostics Addressing Data Scenarios Relevant to Industrial Applications”. International Journal of Prognostics and Health Management, 12(2).
[49] Eker, O. F., Camci, F., and Jennions, I. K., 2012. “Major Challenges in Prognostics: Study on Benchmarking Prognostics Datasets”. In PHM Society European Conference, Vol. 1.
[50] Saxena, A., Celaya, J., Saha, B., Saha, S., and Goebel, K., 2009. “Evaluating Algorithm Performance Metrics Tailored for Prognostics”. In 2009 IEEE Aerospace Conference, IEEE, p. 1–13.
[51] von Hahn, T., and Mechefske, C. K., 2022. “Computational Reproducibility Within Prognostics and Health Management”. arXiv preprint arXiv:2205.15489.
[52] von Hahn, T., 2022. PyPHM: A Python Package for Prognostics and Health Management Datasets. Python package.
[53] Zhou, M., Miao, K., Sun, J., Shen, Y., and Han, B., 2024. “Data-Driven Modeling of Aero-Engine Performance Degradation Models”. IEEE Access, 12, p. 150020–150031.
[54] Fu, S., and Avdelidis, N. P., 2023. “Prognostic and Health Management of Critical Aircraft Systems and Components: An Overview”. Sensors, 23(19), p. 8124.
[55] Federal Highway Administration, 2023. National Bridge Inventory. https://w.fhwa.dot.gov/bridge/nbi.cfm.
[56] Soualhi, M., Soualhi, A., Nguyen, K. T., Medjaher, K., Clerc, G., and Razik, H., 2023. “Open Heterogeneous Data for Condition Monitoring of Multi Faults in Rotating Machines Used in Different Operating Conditions”. International Journal of Prognostics and Health Management, 14(2).
[57] Sufi, F., 2025. “Beyond the Sensor: A Systematic Review of AI’s Role in Next-Generation Machine Health Monitoring”. Applied Sciences, 15(19), p. 10494.
[58] Wang, H., Fu, T., Du, Y., Gao, W., Huang, K., Liu, Z., Chandak, P., Liu, S., Van Katwyk, P., Deac, A., et al., 2023. “Scientific Discovery in the Age of Artificial Intelligence”. Nature, 620(7972), p. 47–60.
[59] Anjelia, S. R., Sensuse, D. I., and Lusa, S., 2025.
“AI Agents for Organizational Knowledge Retrieval and Sharing: A Systematic Literature Review”. International Journal of Advances in Data and Information Systems, 6(3), p. 824–839.
[60] Colavizza, G., Blanke, T., Jeurgens, C., and Noordegraaf, J., 2021. “Archives and AI: An Overview of Current Debates and Future Perspectives”. ACM Journal on Computing and Cultural Heritage (JOCCH), 15(1), p. 1–15.