Federated Learning. Yang Liu





       Preface

      This book is about how to build and use machine learning (ML) models in artificial intelligence (AI) applications when the data are scattered across different sites, owned by different individuals or organizations, and there is no easy way to bring the data together. Nowadays, we often hear that we are in the era of big data, and that big data is an important ingredient fueling AI advances in today’s society. The truth, however, is that we are in an era of small, isolated, and fragmented data silos. Data are collected and stored at edge devices such as mobile phones. Organizations such as hospitals often have only a partial view of users’ data due to their specialties. At the same time, privacy and security requirements make it increasingly infeasible to simply merge the data held by different organizations. In this context, federated machine learning (or federated learning, for short) emerges as a practical solution that can help build high-performance models shared among multiple parties while still complying with requirements for user privacy and data confidentiality.

      Besides privacy and security concerns, another strong motivation for federated learning is to make maximal use of the computing power at the edge devices of a cloud system, where communication is most efficient when only the computed results, rather than the raw data, are transmitted between devices and servers. For example, autonomous cars can handle most computation locally and exchange only the required results with the cloud at intervals. Satellites can perform most of the computation on the information they gather and communicate with Earth-based computers over minimal communication channels. Federated learning allows computation to be synchronized among multiple devices and computing servers by exchanging only computed results.

      We can explain federated learning with an analogy: an ML model is like a sheep, and the data is the grass. The traditional way to rear sheep is to buy the grass and transfer it to where the sheep is located, much as we buy datasets and move them to a central server. However, privacy concerns and regulations prevent us from physically moving the data; in our analogy, the grass can no longer travel outside its local area. Federated learning takes the dual approach: we let the sheep graze on multiple grasslands, much like an ML model that is built in a distributed manner without the data traveling outside its local area. In the end, the ML model grows from everyone’s data, just as the sheep feeds on everyone’s grass.

      Today, our modern society demands more responsible use of AI, and user privacy and data confidentiality are important properties of AI systems. In this direction, federated learning is already making a significant positive impact, ranging from securely updating user models on mobile phones to improving medical imaging performance across multiple hospitals. Many existing works in different areas of computer science have laid the foundation for the technology, such as distributed optimization and learning, homomorphic encryption, differential privacy, and secure multi-party computation.

      There are two types of federated learning, horizontal and vertical. The Google GBoard system adopts horizontal federated learning and is an example of B2C (business-to-consumer) applications. Horizontal federated learning can also support edge computing, where the devices at the edge of a cloud system handle many of the computing tasks and thus reduce the need to transmit raw data to the central servers. Vertical federated learning, proposed and advanced by WeBank, represents the B2B (business-to-business) model, in which multiple organizations join an alliance to build and use a shared ML model. The model is built while ensuring that no local data leaves any site and that the model performance meets business requirements. In this book, we cover both the B2C and B2B models.
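      To make the horizontal setting concrete, the following is a minimal sketch of federated averaging on a toy linear model, in the spirit of FedAvg. The model, data, and function names here are illustrative assumptions, not code from this book or from GBoard: each client trains the shared weight on its own private data, and a server averages the returned weights by sample count, so only model parameters, never raw data, are exchanged.

```python
# Minimal horizontal federated learning sketch (FedAvg-style).
# A shared linear model y = w * x is trained with plain SGD on
# each client's private data; only the weight w travels.

def local_update(w, data, lr=0.01, epochs=20):
    """Train the shared weight on one client's local data."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # gradient of squared error
            w -= lr * grad
    return w

def federated_round(w, client_datasets):
    """One communication round: clients train locally, then the
    server averages the weights, weighted by sample count."""
    updates = [(local_update(w, d), len(d)) for d in client_datasets]
    total = sum(n for _, n in updates)
    return sum(wi * n for wi, n in updates) / total

# Two clients hold disjoint samples of the same relation y = 3x,
# mimicking horizontally partitioned data.
clients = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0)]]
w = 0.0
for _ in range(10):
    w = federated_round(w, clients)
# After a few rounds, w converges near the true slope 3.0.
```

      Real systems add secure aggregation and compression on top of this loop, but the communication pattern, local computation plus exchange of computed results only, is exactly what the preceding paragraphs describe.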

      To develop a federated learning system, multiple disciplines are needed, including ML algorithms, distributed machine learning (DML), cryptography and security, privacy-preserving data mining, game theory and economic principles, incentive mechanism design, laws and regulatory requirements, etc. It is a daunting task for someone to be well-versed in so many diverse disciplines, and the only sources for studying this field are currently scattered across many research papers and blogs. Therefore, there is a strong need for a comprehensive introduction to this subject in a single text, which this book offers.

      This book is an introduction to federated learning and can serve as one’s first entry point into this subject area. It is written for students in computer science, AI, and ML, as well as for big data and AI application developers. Students at senior undergraduate or graduate levels, faculty members, and researchers at universities and research institutions can find the book useful. Lawmakers, policy regulators, and government service departments can also consider it as a reference book on legal matters involving big data and AI. In classrooms, it can serve as a textbook for a graduate seminar course or as a reference book on the federated learning literature.

      The idea of this book came about in our development of a federated learning platform at WeBank known as Federated AI Technology Enabler (FATE), which became the world’s first open-source federated learning platform and is now part of the Linux Foundation. WeBank is a digital bank that serves hundreds of millions of people in China. This digital bank has a business alliance across diverse backgrounds, including banking, insurance, Internet, and retail and supply-chain companies, just to name a few. We observe firsthand that data cannot be easily shared, but the need to collaborate to build new businesses supported by ML is very strong.

      Federated learning was practiced by Google at large scale in its mobile services for consumers as an example of B2C applications. We took a step further, expanding it to enable partnerships among multiple businesses for B2B applications. The categorization of federated learning into horizontal, vertical, and transfer learning-based variants was first summarized in our survey paper published in ACM Transactions on Intelligent Systems and Technology (ACM TIST) [Yang et al., 2019] and was also presented at the 2019 AAAI Conference on Artificial Intelligence (organized by the Association for the Advancement of Artificial Intelligence) in Hawaii. Subsequently, various tutorials were given at conferences such as the 14th Chinese Computer Federation Technology Frontier in 2019. In the process of developing this book, our open-source federated learning system, FATE, was born and publicized [WeBank FATE, 2019] (see https://www.fedai.org), and the first international standard on federated learning is being developed through the IEEE [IEEE P3652.1, 2019]. The tutorial notes and related research papers served as the basis for this book.

      Qiang Yang, Yang Liu, Yong Cheng, Yan Kang, Tianjian Chen, and Han Yu

      November 2019, Shenzhen, China

       Acknowledgments

      The writing of this book involved huge efforts from a group of very dedicated contributors. Besides the authors, different chapters were contributed by Ph.D. students, researchers, and research partners at various stages. We express our heartfelt gratitude to the following people who have made contributions toward the writing and editing of this book.

      • Dashan Gao helped with writing Chapters 2 and 3.

      • Xueyang Wu helped with writing Chapters 3 and 5.

      • Xinle Liang helped with writing Chapters 3 and 9.

      • Yunfeng Huang helped with writing Chapters 5 and 8.

      • Sheng Wan helped with writing Chapters

