Abstract
Machine learning is developing rapidly and continues to change daily life. However, many machine learning tasks are limited by the need for large and diverse training data. Crowdsourcing has been shown to be effective for collecting data labels through a centralized server. The emergence of blockchain technology makes a decentralized platform possible, which provides better reliability and discoverability. While blockchain provides an ideal platform for crowdsourcing, all data become publicly available once placed on today's blockchain platforms, such as Ethereum. This could discourage users from contributing data that may contain highly sensitive information, e.g., medical records. In this proposal, we aim to design a blockchain-based data sharing and training platform that allows participants to contribute data and train models in a fully decentralized and privacy-preserving way. Compared with solutions that naively run training algorithms on the blockchain, our proposal has the following advantages. (1) Efficiency: we borrow ideas from federated machine learning. Instead of contributing raw data, each participant trains a model locally and contributes only model parameters to the blockchain; blockchain nodes simply aggregate the contributed models to build a global model. (2) Privacy: we adapt a secure aggregation protocol to hide each contributed model from other participants. Moreover, differential privacy is applied to prevent the global model from revealing any particular client’s information. Finally, we use system log anomaly detection as a case study to demonstrate the wide applicability of the proposed platform.
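To make the aggregation idea concrete, the following is a minimal sketch of pairwise-masked secure aggregation followed by optional Gaussian noise on the averaged model. It is an illustration under stated assumptions, not the protocol proposed here: the names `MODULUS`, `SCALE`, `mask_update`, and `aggregate` are hypothetical, the pairwise seeds sit in a shared table purely for demonstration (a real protocol would derive them through key agreement and would handle client dropouts), and `noise_std` is left uncalibrated rather than derived from a privacy budget.

```python
import random

MODULUS = 2**32   # hypothetical modulus for masked arithmetic
SCALE = 10**4     # hypothetical fixed-point scale for model parameters


def encode(x):
    """Map a float parameter to a fixed-point integer modulo MODULUS."""
    return round(x * SCALE) % MODULUS


def decode(v):
    """Map a residue back to a signed float."""
    if v >= MODULUS // 2:
        v -= MODULUS
    return v / SCALE


def pairwise_mask(i, j, dim, seeds):
    """Mask shared by clients i and j, expanded from their common seed."""
    rng = random.Random(seeds[(min(i, j), max(i, j))])
    return [rng.randrange(MODULUS) for _ in range(dim)]


def mask_update(i, update, n_clients, seeds):
    """Client i hides its update: add masks shared with higher-indexed
    peers and subtract masks shared with lower-indexed peers."""
    masked = [encode(x) for x in update]
    for j in range(n_clients):
        if j == i:
            continue
        sign = 1 if i < j else -1
        mask = pairwise_mask(i, j, len(update), seeds)
        masked = [(v + sign * m) % MODULUS for v, m in zip(masked, mask)]
    return masked


def aggregate(masked_updates, noise_std=0.0, rng=None):
    """Sum masked updates (the pairwise masks cancel), average the result,
    and add Gaussian noise to each coordinate of the average."""
    rng = rng or random.Random()
    n = len(masked_updates)
    totals = [sum(col) % MODULUS for col in zip(*masked_updates)]
    return [decode(t) / n + rng.gauss(0.0, noise_std) for t in totals]


# Three clients, each holding a two-parameter local model (toy values).
updates = [[0.5, -1.0], [1.5, 2.0], [-0.5, 0.5]]
seeds = {(0, 1): 11, (0, 2): 22, (1, 2): 33}  # illustration only
masked = [mask_update(i, u, len(updates), seeds) for i, u in enumerate(updates)]
global_model = aggregate(masked)  # noise_std=0.0, so this is the plain mean
```

Each masked update on its own looks like uniformly random residues, yet the masks cancel in the sum, so the aggregator recovers only the average. Passing a positive `noise_std` perturbs that average, which is the point at which a differential privacy mechanism would apply.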