DATA PRICING MECHANISM BASED ON PROPERTY RIGHTS COMPENSATION DISTRIBUTION

Abstract

While machine learning (ML) benefits from data, it also faces challenges caused by ambiguous data ownership, including privacy violations and the increased cost of using data. This suggests that the value created by data is determined not only by its utility but also by the cost of using it (negative externalities). Existing pricing methods mainly value data by its utility and ignore the negative externalities caused by unclear ownership, so they cannot yield an efficient pricing mechanism. Throughout the data life cycle (creation, pre-processing, training, etc.), the usufruct and ownership of the data are transferred at the same time, so benefits and costs are generated simultaneously. Because data rights confirmation and data pricing cannot be separated in a data transaction, we propose the first data valuation mechanism based on modern property rights theory. Specifically, we clarify ownership by integrating property rights and improve the final revenue of the whole workflow (the "data link") by treating it as a single collective, while compensating the process performers who lose ownership after the integration. We then consider the expectations of both the integrator and the integrated party when allocating compensation. For the former, we apply compound interest to assess a total compensation equivalent to the time value of the data chain. For the latter, we respect and meet their expectations as far as possible. To achieve this, we provide a framework based on the least-core to assign the compensation and show that it remains effective when compared with existing algorithms. Finally, to cope with more complex situations, we adjust the traditional least-core and demonstrate theoretically and experimentally that the compensation mechanism is feasible and effective in solving the data pricing problem.

1. INTRODUCTION

While ML benefits from data, it also faces challenges brought by the ambiguity of data ownership (Maini et al., 2021). Over its life cycle, data is generated by the producer, passes through agents such as data preprocessors and model pre-trainers, and finally generates value through model training. In this process, data ownership and access are transferred simultaneously, so using data carries a cost. On the one hand, it becomes unclear who is responsible for privacy protection, which creates the potential for privacy violations. The Cambridge Analytica scandal, for example, in which Facebook and Cambridge Analytica collected the personal data of up to 87 million Facebook users without their consent (Kelly), gave rise to an unprecedented discussion on data privacy. On the other hand, even though a large number of data pricing models are already available in the data marketplace (Chen et al., 2019; Liu, 2020; Li et al., 2013; Koutris et al., 2013; 2015), each use of a dataset requires a new valuation, since datasets behave differently in different models. This increases the negative externalities of ML (e.g., large computational volumes). The imbalance between the benefits of data and the costs of using it makes clarifying property rights an important issue. Demsetz (1967) points out that the emergence of property rights is essentially a cost-benefit tradeoff: property rights arise when the benefits of internalizing externalities by defining property rights exceed the costs of doing so. The purpose of clarifying property rights is therefore to maximize returns by internalizing externalities. Taking the data life cycle as an example, the concrete task of confirming ownership is to delineate the controllable rights and identify their owner (Asswad & Marx Gómez, 2021), with the aim of internalizing externalities to the maximum extent.
In other problem settings, such as feature selection or data source valuation, assigning ownership to the most valuable feature or data source can internalize externalities (e.g., reduce computation once an appropriate utility function is set) and improve performance. Note that, once data ownership is established, the revenue of a data transaction consists of the traditional revenue minus the cost caused by negative externalities, and the purpose of rights confirmation is to maximize this net revenue. Existing data marketplaces always assume that the ownership of data belongs to its producer and attribute the costs of data transactions to privacy protection, which makes privacy protection an issue that sits alongside data pricing rather than being unified with it. Specifically, traditional data pricing schemes, including data-based pricing, model-based pricing (Chen et al., 2019; Liu, 2020), and query-based pricing (Li et al., 2013; Koutris et al., 2013; 2015), pay varying degrees of attention to avoiding privacy leaks. A marketplace with data-based pricing allows customers to access dataset entries directly, which makes ownership protection challenging. To address this challenge, model-based pricing frameworks propose creating different query versions by carefully adding different noise (Chen et al., 2019) or sell a series of differentially private models to respect data owners' privacy restrictions (Liu, 2020). Query-based pricing frameworks place restrictions on data usage (Li et al., 2013), which partially alleviates the shortcomings of privacy protection. However, these pricing mechanisms still reflect the value of the data by quantifying its utility or the model training examples, which only incidentally keeps the data owner's privacy from being breached.
Such passive protection can only strive to prevent privacy leaks and fails to account for further negative externalities. We therefore propose a data pricing method based on property rights compensation, which differs from previous pricing mechanisms. Our main contributions are:
• Clarifying data ownership. By introducing modern property rights theory, we answer the question of how to maximize the internalization of externalities: integrate the cooperating parties into a whole, and let the party that makes the largest marginal contribution hold the overall data ownership. In fact, following the logic of modern property rights theory, ownership can be determined through integration for all agents with cooperative relations. The effect grows as the difference between agents' marginal contributions increases, which allows our framework to extend to the whole data life cycle. In this paper, we use feature selection and data source pricing as examples to demonstrate the feasibility of the method.
• Discussing data pricing from the perspective of compensation: the transfer of data property rights necessitates the payment of compensation. Since the use of data brings both benefits and costs, we propose that the valuation of data should be based not only on its benefits but also on its costs. Although the cost cannot be quantified directly, we internalize externalities as far as possible through property rights integration to maximize the value of the data. On this basis, the compensation for distorted ownership is estimated by the time value of the data. Experimental results show that this method can still complete the task of data pricing.
• Proposing the least-core, rather than other solution concepts in cooperative games, to solve the allocation scheme. Since ownership integration requires the integrated party to transfer ownership in order to form a grand coalition, the withdrawal of any agent increases the negative externalities of the coalition. We therefore prefer to secure the stability of the allocation scheme through the core. In this process, we balance the expected compensation of both the integrator and the integrated party. Finally, we discuss the feasibility of this framework under different conditions by adjusting the coefficients of the deficit parameter.
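As a minimal sketch of the ideas in the contributions above, the Python snippet below picks an integrator by marginal contribution and measures how stable a candidate compensation allocation is under the standard least-core constraints: the allocation x distributes v(N) exactly, and every proper coalition S must satisfy x(S) + e >= v(S), where e is the deficit parameter. The three agents, their utilities, and the helper names (`integrator`, `min_deficit`) are hypothetical illustration values and are not taken from this paper. Reading "marginal contribution" as v(N) - v(N \ {i}) is likewise an assumption.

```python
from itertools import combinations

# Hypothetical three-agent game; every utility below is a made-up value.
PLAYERS = ("A", "B", "C")
UTILITY = {
    frozenset("A"): 10, frozenset("B"): 20, frozenset("C"): 30,
    frozenset("AB"): 40, frozenset("AC"): 50, frozenset("BC"): 60,
    frozenset("ABC"): 90,
}


def proper_coalitions(players):
    """Yield every non-empty proper subset of the players as a frozenset."""
    for size in range(1, len(players)):
        for subset in combinations(players, size):
            yield frozenset(subset)


def integrator(players, utility):
    """Return the agent with the largest marginal contribution to the
    grand coalition, v(N) - v(N without i). Assumed reading of the text."""
    grand = frozenset(players)
    return max(players, key=lambda p: utility[grand] - utility[grand - {p}])


def min_deficit(allocation, players, utility):
    """Smallest e such that x(S) + e >= v(S) holds for all proper coalitions.

    The allocation is assumed to sum to v(N). A lower value means no
    coalition gains much by leaving; the least-core minimizes this value
    over all efficient allocations.
    """
    return max(
        utility[s] - sum(allocation[p] for p in s)
        for s in proper_coalitions(players)
    )


# Two candidate allocations that both distribute v(N) = 90.
balanced = {"A": 20, "B": 30, "C": 40}
equal_split = {"A": 30, "B": 30, "C": 30}

print(integrator(PLAYERS, UTILITY))                 # C
print(min_deficit(balanced, PLAYERS, UTILITY))      # -10
print(min_deficit(equal_split, PLAYERS, UTILITY))   # 0
```

In this toy game the equal split is only just stable (deficit 0), while the balanced allocation leaves every coalition 10 units better off than defecting. Summing the three single-agent constraints gives 90 + 3e >= 60, so no efficient allocation can reach a deficit below -10, and the balanced allocation is therefore a least-core solution here. Finding such an allocation in general requires solving a linear program, which this sketch does not do.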

