Co-researcher Dr. Fujita, M1 student Manabe, and Lecturer Hayashi gave presentations at IEEE BigData 2023.

2023-12-21

news
research

Our co-researcher Dr. Fujita (Toyota Motor Corporation), Manabe (M1 student in the Hayashi Laboratory), and Lecturer Hayashi gave oral presentations at the IEEE International Conference on Big Data 2023 (IEEE BigData 2023), held in Sorrento, Italy, from 15 to 18 December 2023.

Title: Topic-Based Search: Dataset Search without Metadata and Users' Knowledge about Data

Authors: Yukihisa Fujita, Teruaki Hayashi, and Masahiro Kuwahara

Abstract: With the advancement of information technologies, we can obtain various kinds of data, which can be leveraged for various purposes. The availability of a large amount of data is a desirable situation. However, it makes dataset retrieval a time-consuming and complex task. Conventional dataset search methods require unified metadata and knowledge about keywords representing the datasets. In other words, they require user knowledge regarding the datasets, such as the terms used in the dataset and fields in the metadata. To address this issue, we propose a topic-based search method without metadata, especially for users lacking knowledge about the datasets. The topic-based search can find datasets by using not the exact keywords but abstract keywords described as topics. In this paper, we focus on table data, which contain column names and data values and are widely used for storing data. As a preliminary analysis, we collected and analyzed public datasets available in Japanese data portals to clarify the features of datasets that should be searched through dataset search. The analysis results revealed the use of many general and common keywords as column names, but it is difficult to implement a dataset search using only column names. Therefore, based on the analysis results, we decided to use embeddings converted from the datasets to utilize both column names and data values to extract topics from datasets. The experimental results showed that we can extract topics from datasets by using the topic modeling method and obtain better search results when compared with the search method using exact keywords.

This study was supported by the joint research project with Toyota Motor Corporation.
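
For readers unfamiliar with the idea, the following is a minimal sketch of searching tables by topic instead of by exact keyword. It is not the method from the paper: the paper uses embeddings and topic modeling, whereas this sketch substitutes plain bag-of-words vectors and cosine similarity to stay self-contained. The table names, column names, data values, and topic keywords are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical tables: column names plus a few data values (no metadata).
tables = {
    "bus_stops.csv": {
        "columns": ["stop name", "latitude", "longitude"],
        "values": ["central station", "city hall"],
    },
    "clinics.csv": {
        "columns": ["name", "address", "department"],
        "values": ["pediatrics clinic", "dental clinic"],
    },
}

# Hypothetical topics, each described by abstract keywords.
topics = {
    "transport": ["station", "stop", "route", "bus"],
    "health": ["clinic", "hospital", "department", "dental"],
}


def bag_of_words(texts):
    """Count lowercase whitespace-separated words over a list of strings."""
    return Counter(word for text in texts for word in text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_by_topic(topic):
    """Rank tables by similarity between the topic keywords and table content."""
    query = bag_of_words(topics[topic])
    scores = {
        name: cosine(query, bag_of_words(t["columns"] + t["values"]))
        for name, t in tables.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


print(search_by_topic("transport"))
print(search_by_topic("health"))
```

Because both column names and data values feed the vector, a table can match a topic even when its column names are generic, which is the motivation stated in the abstract.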

Title: Variable-based Learning Considering Topic Specificity in Heterogeneous Data Clustering Tasks

Authors: Kosuke Manabe, Yukihisa Fujita, Masahiro Kuwahara, and Teruaki Hayashi

Abstract: In recent years, data mining by interdisciplinary co-creation has been attracting social attention, and various types of data have been published for free or for a fee. Data publication encourages the exchange and combination of data between different institutions, which is helpful for interdisciplinary data collaboration. However, issues remain in data search to design high-quality data for interdisciplinary data discovery. Variables are frameworks for data and are thought to have reflected the topics of the data and the intent of the data design. This study aims to quantitatively investigate the relationships between data topics and variables for a large dataset and provide suggestions that contribute to data design and exploration. In this study, we quantitatively evaluated the probability of occurrence of variables and their pairs for each topic to determine a specific relationship between the topics and variables and applied the clustering considering them.

This study was supported by the joint research project with Toyota Motor Corporation, and the JSPS Kakenhi (JP20H02384) "Elucidation of the Dynamics of Data Markets and Institutional Design."
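
As a rough illustration of what "the probability of occurrence of variables and their pairs for each topic" could look like in practice, here is a minimal sketch. The toy corpus, the topic labels, and the `specificity` ratio are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy corpus: (topic label, variable names of one dataset).
datasets = [
    ("weather", {"date", "temperature", "humidity"}),
    ("weather", {"date", "temperature", "rainfall"}),
    ("traffic", {"date", "road_id", "vehicle_count"}),
    ("traffic", {"road_id", "vehicle_count", "speed"}),
]


def occurrence_probabilities(records):
    """Return P(variable | topic) and P(variable pair | topic) as nested dicts."""
    topic_sizes = Counter(topic for topic, _ in records)
    var_counts = defaultdict(Counter)
    pair_counts = defaultdict(Counter)
    for topic, variables in records:
        var_counts[topic].update(variables)
        # Sort so that each unordered pair has a single canonical key.
        pair_counts[topic].update(combinations(sorted(variables), 2))
    var_prob = {
        t: {v: c / topic_sizes[t] for v, c in counts.items()}
        for t, counts in var_counts.items()
    }
    pair_prob = {
        t: {p: c / topic_sizes[t] for p, c in counts.items()}
        for t, counts in pair_counts.items()
    }
    return var_prob, pair_prob


def specificity(records, variable, topic):
    """Assumed measure: P(variable | topic) divided by P(variable) overall."""
    in_topic = [vs for t, vs in records if t == topic]
    p_topic = sum(variable in vs for vs in in_topic) / len(in_topic)
    p_all = sum(variable in vs for _, vs in records) / len(records)
    return p_topic / p_all if p_all else 0.0


var_prob, pair_prob = occurrence_probabilities(datasets)
print(var_prob["weather"]["temperature"])
print(specificity(datasets, "temperature", "weather"))
print(specificity(datasets, "date", "weather"))
```

In this toy data, "temperature" is more specific to the weather topic than "date", which also appears in traffic datasets. A ratio of this kind is one simple way to separate topic-specific variables from general ones before clustering.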

Title: Exploring the Fundamental Units of Semantic Representation of Data Using Heterogeneous Variable Network in Data

Authors: Teruaki Hayashi, Yukihisa Fujita, and Masahiro Kuwahara

Abstract: The value creation achieved through the exchange, distribution, and collaboration of data among different organizations has garnered significant attention as a new source of innovation. The mathematical treatment of the meaning of data helps measure its "quality" to formulate evaluation criteria for data exchange between stakeholders with distinct background knowledge in data ecosystems. This study examines the structure of data morphemes, the fundamental units of semantic representation of data, by conducting network and association analyses of variables present in metadata from diverse fields. Network analysis identifies the globally sparse and locally dense characteristics of variable co-occurrence networks and highlights essential relationships and core variables. Key findings include the discovery of "depth," "sediment/rock," and "sample code/label" as both universal variables and crucial nodes between datasets used in the experiment. Association analysis reveals vital variable pairs, such as "age" and "ring width" or "latitude" and "longitude." This research may provide an understanding of the structure and meaningful representation of data, facilitating smooth data exchange and utilization practices among stakeholders with different domains, purposes of data use, and background knowledge in data ecosystems.

This study was supported by the joint research project with Toyota Motor Corporation, and the JSPS Kakenhi (JP20H02384) "Elucidation of the Dynamics of Data Markets and Institutional Design."
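
To make the terms "variable co-occurrence network" and "association analysis" concrete, here is a minimal sketch. The metadata records are hypothetical (only the variable names echo the abstract), and support, confidence, and lift are the standard association-rule measures, assumed here for illustration and not confirmed as the measures used in the paper.

```python
from collections import Counter
from itertools import combinations

# Hypothetical metadata: each record is the set of variables in one dataset.
metadata = [
    {"age", "ring width", "latitude", "longitude"},
    {"age", "ring width"},
    {"depth", "latitude", "longitude"},
    {"depth", "sample code"},
]


def cooccurrence_edges(records):
    """Weighted edges of the co-occurrence network: pair -> number of datasets."""
    edges = Counter()
    for variables in records:
        edges.update(combinations(sorted(variables), 2))
    return edges


def association(records, a, b):
    """Support, confidence, and lift of the rule a -> b."""
    n = len(records)
    n_a = sum(a in r for r in records)
    n_b = sum(b in r for r in records)
    n_ab = sum(a in r and b in r for r in records)
    support = n_ab / n
    confidence = n_ab / n_a if n_a else 0.0
    lift = confidence / (n_b / n) if n_b else 0.0
    return support, confidence, lift


edges = cooccurrence_edges(metadata)
print(edges[("latitude", "longitude")])
print(association(metadata, "latitude", "longitude"))
```

A lift above 1 means the two variables appear together more often than their individual frequencies would predict, which is the sense in which a pair such as "latitude" and "longitude" stands out.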