Research Contents
- Understanding the dynamics of data ecosystem and developing institutional design
- Digitization of unobserved events with tailor-made data design support
- Creating heterogeneous data linkage schemes and analytical models
- Sensing of information digestion and value deepening in crossover sphere
1. Understanding the dynamics of data ecosystem and developing institutional design
A data ecosystem is a complex ecosystem created by the accumulation of autonomous elements, such as people and businesses, that emerge around data. We constantly generate a variety of information, and records of these actions accumulate as data. Service providers extract what they need from these data and turn them into services for third parties. The findings and analysis results obtained from data are not only fed back to ourselves, but also influence the preferences and behaviors of others.
The dynamic interaction of people, objects, technologies, and services connected by this cycle of data generates new functions and properties, creating a complex but orderly system. This is the data ecosystem.
However, the characteristics of the data handled in a data ecosystem, and the mechanisms behind the relationships among different data and industries, are not yet well understood. Although interest in and expectations for value creation through cross-disciplinary data collaboration in the real world are growing, the complexity of people, data, environments, and legal systems calls for a comprehensive discussion and understanding of the entire ecosystem.
1-1. Datascape: Overarching understanding through heterogeneous metadata network analysis
While no appropriate unit of analysis has been established to understand the data ecosystem, a panoramic view of the relationships among data and market players is one useful way to promote a common understanding. In this study, we focused on the individual datasets handled in the data ecosystem and attempted to elucidate the relationships among heterogeneous data using metadata.
As a result, we found that data networks have a "locally dense, globally sparse" structure similar to human relationship networks, and that although data are created in a largely disordered manner within individual organizations, ordered characteristics emerge when the whole is viewed panoramically. By dynamically capturing the growth process of heterogeneous data networks, we are also examining ways to support the strategic planning of data platform providers who wish to promote data distribution.
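As a rough illustration of what "locally dense, globally sparse" means for a metadata network, the sketch below builds a small dataset graph from hypothetical keyword metadata and compares its clustering and path length against a random graph. The edge rule (linking datasets that share a metadata tag) and all dataset names are illustrative assumptions, not the published Datascape pipeline.

```python
# Minimal sketch (not the Datascape implementation): link datasets whose
# metadata keywords overlap, then check the "locally dense, globally sparse"
# (small-world-like) signature against a size-matched random graph.
import itertools
import networkx as nx

# Hypothetical metadata: dataset name -> set of keyword tags
metadata = {
    "taxi_trips":    {"mobility", "gps", "city"},
    "bus_ridership": {"mobility", "city", "transit"},
    "air_quality":   {"environment", "city", "sensor"},
    "retail_sales":  {"commerce", "pos", "city"},
    "weather_obs":   {"environment", "sensor"},
}

G = nx.Graph()
G.add_nodes_from(metadata)
for a, b in itertools.combinations(metadata, 2):
    shared = metadata[a] & metadata[b]
    if shared:                              # link datasets sharing at least one tag
        G.add_edge(a, b, weight=len(shared))

# High clustering combined with short average paths, relative to a random graph
# with the same size, indicates the small-world-like structure described above.
clustering = nx.average_clustering(G)
path_len = nx.average_shortest_path_length(G)
rand = nx.gnm_random_graph(G.number_of_nodes(), G.number_of_edges(), seed=0)
print(f"clustering={clustering:.2f}, avg path={path_len:.2f}")
print(f"random graph clustering={nx.average_clustering(rand):.2f}")
```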
1-2. Data value chain analysis focusing on stakeholders

This study focused on people (stakeholders) in the data ecosystem, considering their roles based on cross-sectional stakeholder relationships across businesses, and attempted to assess bottlenecks and the robustness of business models in the event of data leakage.
We developed a data value chain notation (SVC: Stakeholder-centric Value Chain) that incorporates human factors and used it to analyze the structure of business relationships. The analysis revealed a hierarchical structure in stakeholder relationships across business models, suggesting segregation into stakeholder groups with different functions within the ecosystem. Research is also underway to evaluate the soundness of data businesses and the robustness of the ecosystem, including calculating the influence of hub stakeholders and the risk of data leakage based on network analysis. In addition, the SVC notation has been formalized as a method and introduced as a workshop technique for cross-industry data collaboration within companies.
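The SVC notation itself is a diagramming method, so the sketch below only illustrates the kind of network analysis mentioned above: scoring hub stakeholders in a directed "who passes data to whom" graph. All stakeholder roles and data flows are hypothetical examples.

```python
# Illustrative only: betweenness centrality as a rough proxy for hub influence.
# A stakeholder sitting on many data paths is both valuable to the ecosystem
# and a potential bottleneck or single point of leakage.
import networkx as nx

flows = [                                   # hypothetical data flows between roles
    ("sensor_vendor", "platform_operator"),
    ("app_user", "platform_operator"),
    ("platform_operator", "analytics_firm"),
    ("analytics_firm", "advertiser"),
    ("platform_operator", "municipality"),
]
G = nx.DiGraph(flows)

influence = nx.betweenness_centrality(G)
for stakeholder, score in sorted(influence.items(), key=lambda kv: -kv[1]):
    print(f"{stakeholder:20s} {score:.2f}")
```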
1-3. Data market simulator with multi-agent simulation
Compared to many goods and resources traded in the marketplace, data has unique characteristics: it requires no physical storage space and thus has extremely low storage costs, it is easily duplicated, and its distribution costs are almost zero. Therefore, data distribution and transactions across businesses require a different model from that of conventional commodities.
In this study, we conducted simulations of a data distribution market in which replication and resale are possible. Experimental results show that data prices are more stable when trading is conducted by rational agents, and that gross-profit settings have a significant impact on data distribution.
We also modeled the interaction between exchanged datasets and buyers in data markets and developed a multi-agent simulator for institutional design. Comparing purchasing behavior at different market sizes, we found that the frequency distribution of purchases of popular datasets differs with market size, suggesting that different rules and institutions are needed depending on the scale of the market.
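To make the setting concrete, the toy simulation below captures the two properties emphasized above, zero-cost replication and resale, in a deliberately simplified form. The willingness-to-pay distribution, the reseller capacity, and the undercutting margin are all assumptions for illustration; they do not reproduce the simulator or parameter settings used in the study.

```python
# Toy multi-agent sketch: every buyer of a dataset becomes a potential reseller
# (replication cost is zero), and competing resellers undercut the price.
import random

random.seed(0)
price = 100.0     # initial price set by the original data holder
margin = 0.9      # assumed gross-profit setting: resellers undercut by 10%
owners = 1        # agents holding a copy who can resell it
N_BUYERS = 50

history = []
for step in range(20):
    # buyers with a random willingness-to-pay decide whether to purchase
    demand = sum(1 for _ in range(N_BUYERS) if random.uniform(0, 150) >= price)
    sold = min(demand, owners * 5)      # each owner can serve a few buyers per round
    owners += sold                      # buyers can now resell their copies
    if owners > 1:
        price *= margin                 # competing resellers erode the price
    history.append((step, round(price, 1), owners))

print(history[-5:])                     # price and number of copy holders over time
```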
2. Digitization of unobserved events with tailor-made data design support
Because data potentially reflect the design intent of those who collect them, analyzing third-party data without accounting for that intent or its biases carries an inherent risk of erroneous hypothesis testing.
There is, therefore, a need for a mechanism to support data design that does not rely solely on experience and intuition, but instead understands the differences and diversity arising from the human cognitive models through which we observe a dynamically changing world and convert it into data.
2-1. Unexplored data design and data origination
Unexplored data refers to potential data in areas not directly connected to our dynamically changing perceptions and knowledge; a classification is shown in the figure on the left. More specifically, it covers areas where data exist but we do not know what events they can capture, and areas where no data exist, whether or not the events we wish to observe are known.
Data origination is an approach to data design, acquisition, and utilization that goes back to the human act of observation, the origin of data, in order to address such unexplored data. Variable Quest and TEEDA are tools that support data design and origination for unexplored data.
2-2. VARIABLE QUEST (VQ)
How to observe events that have not yet been converted into data is a highly individual skill, and no systematic knowledge of it exists. Precise data design, and methodologies that support it, are therefore important for determining the purpose of data collection and how the data will be utilized.
In this study, we focused on variables (the data attributes that make up a dataset), which are the main components of data. We implemented Variable Quest (VQ), a system that estimates a set of variables from a summary of data whose variables are unknown. By combining two models, "data similarity" and "variable co-occurrence," VQ makes it possible to supplement variables that are missing in one dataset with variables that appear in other datasets.
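The sketch below illustrates the intuition behind these two ideas in a few lines: summaries of known datasets are compared to a new summary (data similarity), and variables from the most similar datasets are suggested together (variable co-occurrence). The corpus, the query, and the use of TF-IDF are illustrative assumptions; the actual VQ models are more elaborate.

```python
# Minimal sketch of VQ's intuition: suggest variables for a dataset whose
# variables are unknown, using summary similarity and variable co-occurrence.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = {  # hypothetical known datasets: summary text -> variables
    "hourly taxi trips in the city":       ["timestamp", "pickup_area", "fare"],
    "daily retail sales by store":         ["date", "store_id", "sales_amount"],
    "air quality sensor readings by area": ["timestamp", "area", "pm25"],
}
query = "traffic volume observed per area in the city every hour"

summaries = list(corpus)
X = TfidfVectorizer().fit_transform(summaries + [query])
sims = cosine_similarity(X[-1], X[:-1]).ravel()     # data similarity

# Score candidate variables by the similarity of the datasets they appear in,
# so variables that co-occur in similar datasets are suggested together.
scores = Counter()
for sim, summary in zip(sims, summaries):
    for var in corpus[summary]:
        scores[var] += sim

print(scores.most_common(5))   # suggested variables for the new dataset
```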
2-3. Data matching system: TEEDA
Data catalogs and portal sites are increasing users' opportunities to learn about third-party data. However, information on what kind of data users want and for what purpose (Call for Data) is not sufficiently shared.
TEEDA is a web application that matches data providers and users based on the data users want (data requests) and the data holders can provide (providable data). In addition to one-way provision of information by data holders, TEEDA supports encounters between providers and users through the exchange of information on what kind of data users are looking for.
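As a simplified sketch of the matching idea only (TEEDA is a web application, and its internal matching logic is not reproduced here), the example below scores hypothetical data requests against hypothetical providable data using plain text similarity.

```python
# Toy matching of data requests (Call for Data) against providable data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

providable = {  # hypothetical data holders and what they can offer
    "transit_agency": "bus ridership counts by route and hour",
    "retailer":       "point-of-sale purchase records by store",
    "weather_bureau": "hourly temperature and rainfall observations",
}
requests = {    # hypothetical data users and what they are looking for
    "city_planner": "ridership by route and hour for public transport",
    "marketer":     "purchase behaviour of consumers by store",
}

texts = list(providable.values()) + list(requests.values())
X = TfidfVectorizer().fit_transform(texts)
sims = cosine_similarity(X[len(providable):], X[:len(providable)])

for user, row in zip(requests, sims):
    holder, score = max(zip(providable, row), key=lambda kv: kv[1])
    print(f"{user}: best match -> {holder} (score {score:.2f})")
```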
TEEDA is an acronym for "Treasuring Every Encounter of Data Affairs."
2-4. Human-object contact data: A use case of unexplored data design and origination
This is one of the cases in which data design and origination were practiced to address a problem that became apparent during the COVID-19 pandemic: how previously unobserved events should be observed and converted into data.
Countermeasures against infection via human-object contact (people touching things) had been limited to measures such as encouraging hand washing, sterilizing hands with chemicals, and disinfecting objects, with little data available to inform effective countermeasures. A workshop using TEEDA revealed a strong need to "examine the locations and objects that are frequently touched and derive guidelines for disinfection locations, priority disinfection items, and contact behavior in facilities and vehicles." Using Variable Quest, we then designed "human-object contact behavior data" for the COVID-19 pandemic and, in cooperation with the Cabinet Secretariat's "COVID-19 AI and Simulation Project," actually acquired and analyzed the data. As a result, we obtained guidelines for disinfection locations, priority disinfection items, and contact behavior in facilities and vehicles.
3. Creating heterogeneous data linkage schemes and analytical models
With the rapid advancement of data distribution and the growing demand for buying and selling data in the marketplace, there is great anticipation for new knowledge discovery and value creation by combining heterogeneous datasets. However, it is important to note that existing data were rarely designed or collected with interoperability in mind. While linking disparate data can enhance our understanding of target phenomena and reveal novel insights, it also increases complexity and the risk of diminishing the explainability and persuasive power of analytical results.
Therefore, a pressing challenge is to simultaneously search for and recommend data that are well suited for integration with one’s own assets, while also establishing data integration and analysis methods that account for the characteristics of each combined dataset. Moreover, there is a growing need for techniques that assess the reliability of analyses based on heterogeneous data linkages and make their results explainable to stakeholders.
3-1. Data similarity evaluation and exploration system using embedded representations of datasets
Data platforms for exchanging heterogeneous datasets have emerged, but most services still rely on general information retrieval methods applied to metadata written in natural language, which leads to challenges such as incomplete, inconsistent or unreliable metadata and insufficient information volume. To address these issues, our study proposes a metadata-less dataset embedding model based on tabular language models (TaLMs) that enhances search and recommendation functions by using the actual data. Specifically, we apply metric learning to HyTrel, one of the TaLMs, to train a dataset embedding model, and we then perform domain adaptation through additional training on GitTables and Kaggle datasets together with their associated metadata. Experimental results demonstrate that our metadata-less approach achieves search and recommendation performance in topic estimation and human evaluation tasks that is equivalent to or better than methods using metadata.
Previous research has depended heavily on metadata, which is prone to variable‐naming inconsistencies and misalignment with the actual data, and the cost of creating metadata for data publication or sale has been high. Our method is expected to support the development of systems capable of accurately exploring both internal and external datasets without relying on metadata.
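To make the metric-learning step concrete, the sketch below uses a toy bag-of-cells encoder and a triplet loss. It does not reproduce HyTrel or the domain-adaptation procedure described above; the encoder, its dimensions, and the training triplet are stand-in assumptions that only show how embeddings of related tables can be pulled together without metadata.

```python
# Highly simplified sketch of metric learning over dataset embeddings.
# ToyTableEncoder is a stand-in for a tabular language model such as HyTrel.
import torch
import torch.nn as nn

class ToyTableEncoder(nn.Module):
    """Embeds a table as the mean of its cell-token embeddings (toy stand-in)."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, cell_token_ids):            # (num_cells,) token ids
        return self.proj(self.emb(cell_token_ids).mean(dim=0))

encoder = ToyTableEncoder()
loss_fn = nn.TripletMarginLoss(margin=1.0)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# Hypothetical training triplet: anchor and positive are tables assumed to be
# related (e.g. same topic); negative is an unrelated table.
anchor   = torch.randint(0, 1000, (30,))
positive = torch.randint(0, 1000, (30,))
negative = torch.randint(0, 1000, (30,))

loss = loss_fn(encoder(anchor).unsqueeze(0),
               encoder(positive).unsqueeze(0),
               encoder(negative).unsqueeze(0))
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```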
3-2. Context-Aware Graph-Based Dataset Embedding Representations
Not only are organizations increasingly designing, generating, and using data internally, but they are also more often leveraging datasets created by other companies and combining them with their own. At the same time, challenges have emerged around integrating and jointly using multiple, heterogeneous datasets.
In this study, we propose a novel embedding method that harnesses graph structures encoding each dataset’s contextual information, and we apply it to the tasks of schema matching and entity resolution across datasets with diverse characteristics. Building on graph-based dataset embedding techniques, we construct a four-part graph for each table—consisting of rows, cells, variables, and variable semantics—and generate sequences from this graph to learn embedding representations. In experiments using both benchmark datasets and real operational datasets, our approach outperformed existing methods on both schema matching and entity resolution tasks.
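As a rough illustration of the graph construction and sequence generation described above, the sketch below builds a small graph with row, cell, variable, and variable-semantics nodes for a hypothetical table, then learns node embeddings from random walks with a skip-gram model. The node and edge design, the crude "semantics" heuristic, and the walk and model parameters are assumptions for illustration and may differ from the study's actual method.

```python
# Toy version of a four-part table graph (rows, cells, variables, variable
# semantics) with random-walk sequences fed to a skip-gram embedding model.
import random
import networkx as nx
from gensim.models import Word2Vec

table = {"price": [120, 80], "area_name": ["north", "south"]}  # hypothetical table

G = nx.Graph()
for col, values in table.items():
    sem = col.split("_")[0]                          # crude variable "semantics"
    G.add_edge(f"var:{col}", f"sem:{sem}")
    for i, v in enumerate(values):
        G.add_edge(f"row:{i}", f"cell:{col}={v}")    # cell belongs to a row
        G.add_edge(f"var:{col}", f"cell:{col}={v}")  # cell belongs to a variable

def random_walk(g, start, length=6):
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(list(g.neighbors(walk[-1]))))
    return walk

random.seed(0)
walks = [random_walk(G, n) for n in G.nodes for _ in range(10)]
model = Word2Vec(walks, vector_size=32, window=3, min_count=1, sg=1, epochs=20)
print(model.wv.most_similar("var:price", topn=3))
```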
4. Sensing of information digestion and value deepening in crossover sphere
The emergence of SNS, C2C online shopping platforms, the Metaverse, and Web3 technologies has significantly increased opportunities for different people, objects, and contexts—previously confined within specific domains—to interact across their boundaries. While such interactions between heterogeneous elements are often discussed positively in terms of innovation and cross-disciplinary collaboration, consumer-related issues and conflicts remain persistent challenges. Designing these crossover spaces and creating interfaces that resolve conflicts while fostering value creation are pressing issues. Additionally, although online (cyber) spaces are contextually continuous with the physical world, their interfaces are often limited to text and images, making communication between participants less natural. This leads to miscommunication and consumer disputes.
This is where the concept of "information digestion" becomes crucial. Information digestion refers to the process by which the receiver takes in and accurately understands the information provided. If we can measure whether provided information actually reaches and is understood by the receiver, it becomes possible to build mechanisms for delivering valuable products to the right people while avoiding unnecessary conflicts and disputes between providers and buyers.
4-1. Information Digestion Efficiency Across Text and Image-Based Channels
This study introduces the concept of "information digestion" during the reception of transmitted information. By doing so, it proposes an evaluation model for information digestibility using hierarchical factor analysis and identifies the factors that constitute digestibility across four types of communication channels, such as online news articles and e-commerce pages, through experiments with participants. The results revealed differences in information digestion efficiency depending on the communication channel, amount of information, and type of information. Moreover, information processed based on the medium-specific quantitative evaluation model of information digestion efficiency was found to be more easily understood.
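As an illustrative first step only, the sketch below runs a plain (single-level) exploratory factor analysis on simulated questionnaire scores; the study's hierarchical model, its survey items, and the channel-specific setup are not reproduced here.

```python
# Toy exploratory factor analysis on hypothetical comprehension-questionnaire data.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_respondents, n_items = 200, 8                   # e.g. 8 Likert-scale items
latent = rng.normal(size=(n_respondents, 2))      # two hidden "digestibility" factors
loadings = rng.normal(size=(2, n_items))
responses = latent @ loadings + rng.normal(scale=0.5, size=(n_respondents, n_items))

fa = FactorAnalysis(n_components=2, rotation="varimax")
fa.fit(responses)
print(np.round(fa.components_, 2))                # item loadings on each factor
```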
Conventional evaluations of information transmission only targeted the delivery of information to the receiver. However, this study extends its scope to include the perspectives through which the receiver interprets the content and purpose of the information obtained, enabling the development of methods to support receivers in obtaining information valuable to them.
4-2. Analysis of Differences in Information Digestion in C2C Markets
Online C2C (Consumer-to-Consumer) markets, where individuals exchange goods through web services, are rapidly expanding. However, excessive or inappropriate information makes it difficult for buyers to understand products, undermining trust in sellers and hindering the distribution of valuable goods. This study focuses on the act of "digesting" information in C2C online marketplace services and analyzes the differences in how sellers and buyers perceive and understand information.
Focusing on factors such as the roles of sellers and buyers, the components of product pages, and product categories, the study conducted a participant survey based on these roles. Using factor analysis and statistical tests, the study clarified the existence of differences in information digestion between sellers and buyers. Furthermore, using conjoint analysis, eye tracking, and text mining, the study examined the information that buyers wish to receive and the information sellers intend to convey. Insights were obtained on strategies to mitigate these differences between market participants. The outcomes of this research contribute to the development of personalized information delivery systems based on individual users' information digestion characteristics and can be applied to create effective communication methods that accurately convey the latent value of products.