Brief Primer on Federated Learning, Part 1.

Colin Byrne
7 min read · Dec 17, 2020

Ever heard the phrase “If you aren’t moving forward, you’re standing still”? A cursory look over technological items from the last three decades lends credence to this sentiment, e.g. VHS, Betamax, DVD, Vinyl, Cassette, CD, MP3, Home Phone, Cell Phone, Smart Phone, Home Assistants, etc. Advances in technology strive to increase user adoption by making products that are easier to use, easier to consume, and more personalized. The field of Artificial Intelligence (AI) is one area of technology driving innovations that personalize user interactions.

AI embedded smart phone. (pixabay.com)

Have you ever typed a message on your smart phone and wondered, “How are these personalized next-word predictions so accurate?” Words that may not have been in your lexicon a couple of weeks earlier are now being correctly suggested as you type. “How is this possible?”

Federated Learning (FL).

Part 1 of this article briefly introduces the related AI concepts and approaches that provide the platforms enabling these “intelligent” interactions. Traditional AI approaches like Machine Learning (ML) typically require “centralized data” and “centralized compute” resources, but this is becoming less feasible as both data volumes and privacy concerns increase.

FL, a recent research field within AI, proposes a new approach in which computation is “moved” to the data. It enables remote devices such as smart phones to “collaboratively learn” a shared model using the personalized data held on each device.

Common Federated Learning Considerations.

Federated Learning has its challenges. “How do you create reliable models when the data resides at the end of an unreliable internet connection?” is only one of them. Federated algorithms such as Federated Averaging (FedAvg¹) have been proposed to tackle this challenge; however, FL developers also face numerous modelling design considerations, such as:

  • How many rounds of remote training should take place?
  • How much local on-device training should take place?
  • What strategy should be used when evaluating FL models?

Part 2 of this article details how to practically implement FL to assist in answering these core questions. It demonstrates the FedAvg algorithm within the TensorFlow Federated (TFF) simulation environment. Sample characterization experiments are depicted along with the appropriate model parameters.

Federated Learning Theory and Concepts.

Most Machine Learning (ML) approaches work as an iterative process. They rest on the intuition that there is some function that maps the “input” data to the “output” data. Generally, all these approaches can be thought of as types of universal function approximators. How they approximate the required function (f) differs for each type of machine learner, but they all follow the same basic three-step flow.

Generic three step AI/ML flow.

The machine learning software ingests input “training data”, digests it (“learns the appropriate function parameters”), and egests “a trained model”, which can then be used to make reasonable predictions on new, previously unseen data.

When we say a machine is “learning”, what does this mean?

Generally, ML can be considered an optimization problem. For supervised learning, the model approximates a predictive function f(x), and a loss function measures how “close / accurate” the model’s predictions are to the correct answers. In a sense, the loss function quantifies the predictive accuracy of the trained model.

Loss function.

Mathematically, L is the full loss for the predictive function (f), with input (x) and output (y).
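One common way to write such a full loss is as the average of a per-example loss over the training set. This is a sketch under stated assumptions rather than the exact expression intended here: the number of training pairs N, the per-example loss ℓ, and the index i are notation introduced for illustration.

```latex
L = \frac{1}{N} \sum_{i=1}^{N} \ell\bigl(f(x_i),\, y_i\bigr)
```

For example, choosing ℓ as the squared error (f(x_i) − y_i)² makes L the mean squared error.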

There are many methods and algorithms for minimizing loss.

A common optimization method called Gradient Descent is typically used in Deep Learning Neural Networks (DLNN). Initially, the network is “naïve”: it doesn’t know the function mapping the inputs to the outputs. The network learns by being shown samples of training data and then adjusting its parameters so that it “approximates this function”.

To find these parameters, we need to know how poorly the network is predicting the real outputs. We iteratively optimize the network parameters against the “loss / cost” function using algorithms such as Gradient Descent on the training data, moving the “learned” parameters in the direction that minimizes loss.
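To make the idea concrete, here is a minimal sketch of Gradient Descent on a tiny linear model. The data, the mean squared error loss, the learning rate, and the step count are all illustrative assumptions, not values taken from this article.

```python
# Minimal gradient descent sketch: fit y ≈ w * x + b by minimizing mean squared error.
# The data, learning rate, and number of steps are illustrative assumptions.

def mse_loss(w, b, xs, ys):
    """Mean squared error between the predictions w*x + b and the targets y."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_step(w, b, xs, ys, learning_rate):
    """Move (w, b) one step against the gradient of the MSE loss."""
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return w - learning_rate * grad_w, b - learning_rate * grad_b

# Training data generated from the "unknown" function y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0                      # the "naive" starting parameters
initial_loss = mse_loss(w, b, xs, ys)
for _ in range(2000):
    w, b = gradient_step(w, b, xs, ys, learning_rate=0.05)
final_loss = mse_loss(w, b, xs, ys)

print(f"w={w:.4f}, b={b:.4f}, loss went from {initial_loss:.4f} to {final_loss:.2e}")
```

Each step nudges the parameters in the direction that reduces the loss, so the learned values drift towards the slope and intercept that generated the data.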

Centralized Machine Training Vs Federated Training.

The location where learning takes place differentiates traditional ML and FL approaches.
“On device” Gradient descent.

Remote communication is the inherent constraint that the Federated paradigm must contend with. Moving the learning steps (Gradient Descent) to the individual devices is one approach to mitigating the communication constraint.

FL inherently involves an iterative process². Moving these sequential steps to local devices limits the centralized communication required, which helps optimize the overall training process.

Federated Averaging Algorithm (FedAvg).

A typical FedAvg training round consists of the following sequence³:

a) A random subset of the Federation’s members (clients/devices) is selected to receive the current global model (the initial model in the first round), synchronously;

b) Each client selected for the current round of training computes updates to its local model using its local data (“on device” training);

c) The client sends its updates to the server;

d) The central server aggregates all the client updates (by averaging) to construct an improved global model.

The process repeats with another round of training until the model parameters converge, as determined by appropriate criteria.
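The round described above can be sketched in plain Python. This is a toy simulation under stated assumptions, not the TensorFlow Federated implementation covered in Part 2: the linear model, the synthetic client data, the full-batch local training, and the names `local_update` and `fedavg_round` are all hypothetical and chosen for illustration.

```python
import random

def local_update(weights, data, epochs, learning_rate):
    """Step (b): a client runs a few epochs of gradient descent on its own data."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - learning_rate * grad_w, b - learning_rate * grad_b
    return (w, b)

def fedavg_round(global_weights, clients, fraction, epochs, learning_rate, rng):
    """One training round: select clients, train locally, average on the server."""
    num_selected = max(1, int(fraction * len(clients)))
    selected = rng.sample(clients, num_selected)                      # step (a)
    updates = [(local_update(global_weights, data, epochs, learning_rate), len(data))
               for data in selected]                                  # steps (b), (c)
    total = sum(n for _, n in updates)                                # step (d)
    new_w = sum(wb[0] * n for wb, n in updates) / total
    new_b = sum(wb[1] * n for wb, n in updates) / total
    return (new_w, new_b)

# Hypothetical federation: 10 clients, each holding a different amount of
# private data drawn from the same underlying function y = 2x + 1.
rng = random.Random(0)
clients = []
for _ in range(10):
    xs = [rng.uniform(0.0, 2.0) for _ in range(rng.randint(5, 20))]
    clients.append([(x, 2.0 * x + 1.0) for x in xs])

weights = (0.0, 0.0)
for _ in range(200):
    weights = fedavg_round(weights, clients, fraction=0.5,
                           epochs=5, learning_rate=0.1, rng=rng)

print(f"global model after 200 rounds: w={weights[0]:.3f}, b={weights[1]:.3f}")
```

Only the model parameters travel between the clients and the server; the raw `(x, y)` pairs never leave the client lists. Here the loop runs for a fixed number of rounds, whereas a real deployment would stop on a convergence criterion.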

FedAvg Algorithm.

Client: “Gradient descent” takes place.

Server: Aggregates the “averaged client updates”.

The K clients are indexed by k; η is the learning rate.
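As a sketch of the two updates named in these captions, following the notation of McMahan et al.¹: the symbols w (model weights), ℓ (loss), b (a local minibatch), S_t (the set of clients selected in round t), and n_k (the number of examples on client k) are not defined elsewhere in this article, and normalizing over the selected clients is one common variant rather than the only form.

```latex
% Client k: one gradient descent step per local minibatch b
w^{k} \leftarrow w^{k} - \eta \, \nabla \ell\bigl(w^{k};\, b\bigr)

% Server: average the returned client models, weighted by local data size
w_{t+1} \leftarrow \sum_{k \in S_t} \frac{n_k}{\sum_{j \in S_t} n_j} \, w^{k}_{t+1}
```

Weighting by n_k means a client with more local data has proportionally more influence on the new global model.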

The amount of client computation is controlled by three key parameters.

  • C: the fraction of clients that perform computation in each round.
  • B: the local minibatch size.
  • E: the number of local epochs.
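A small arithmetic sketch shows how these three parameters translate into work per round. The function names are hypothetical, and rounding up the last partial minibatch is an assumption; McMahan et al.¹ give the client count as max(C·K, 1) and the local update count as E·n_k/B for a client holding n_k examples.

```python
import math

def clients_per_round(num_clients, fraction):
    """Clients selected each round: max(C * K, 1), so at least one client trains."""
    return max(int(fraction * num_clients), 1)

def local_updates_per_round(num_local_examples, batch_size, epochs):
    """Gradient steps one client performs per round: E * ceil(n_k / B)."""
    return epochs * math.ceil(num_local_examples / batch_size)

# Example values (illustrative): K = 100 clients, C = 0.1, and a client
# holding 600 local examples trained with B = 10 and E = 5.
print(clients_per_round(100, 0.1))            # 10 clients take part
print(local_updates_per_round(600, 10, 5))    # 300 local gradient steps
```

Raising E or lowering B increases the computation done on each device between communications, while C controls how many devices do that work in parallel.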

Federated Averaging Hyperparameters.

FedAvg key Tuning / optimization parameters.

Selecting suitable values for the number of training rounds and the number of remote user devices is generally a heuristic exercise.

Random devices selected to participate in a round of FL.

In real-world applications of FL, where the domain of remote devices is very large, device selection is done randomly, i.e.

  • [K]: the domain of available devices.

  • Randomly select a subset t of users/devices, where t ⊆ [K], to train on.

Summary.

The twenty twenties have been called the “Data Decade”³. With the advent of technologies like Edge computing and 5G communications, the generation of personalized data is on an ever-increasing trajectory. The challenge is that much of this data is created and resides at the edge, encapsulated⁴ on “isolated islands”, i.e. smart phones and edge devices.

There is major potential for insight if suitable methods can be found to enable these devices to collaborate globally. The global health pandemic at the start of the 2020s has seen mutually beneficial collaborations between users and health authorities, with contact tracing applications⁵ based on highly personalized “on device” user data. Utilizing FL to modernize the way census data is assimilated could revolutionize governmental planning. Imagine the potential societal benefits if real-time census data became available, based on “on device” user data.

Remote devices participate in collaborative learning.(freepik.com)

Federated Learning is poised as the vanguard technology to liberate this potential. Its success will largely depend on the availability of complementary development tools and continued confidence in the privacy and security benefits it offers. Having the technology is one thing, but successfully provisioning it is another!

Part 1 of this article took a brief look at some key Federated Learning concepts, and Part 2 applies them in practice within the TensorFlow Federated simulation environment. Together they seek to provide some insight and intuition for further research and experimentation by the would-be Federated Learning developer.

Concluding Comments.

The phrase “I think, therefore I am” (Descartes, 1637), by René Descartes, implies that thought is the defining human characteristic. The advent of writing allowed early societies to collaborate and share these thoughts. Humanity and the society we live in today are a manifestation of these collaborations and actions. In the same way as the pen enabled humans to effectively share their thoughts, Ubiquitous Personal Devices (UPDs) imbued with a plethora of sensors offer new potential for personal collaboration in ways heretofore unseen. What will humanity and the society of tomorrow look like if we find mutually beneficial ways to share the corpus of data generated on our personal devices?

Notes & Citations.

About the Author:

Colin Byrne is an IT professional with over 25 years of industry experience and a recent (2020) MSc postgraduate in Artificial Intelligence from the University of Limerick.

References:

[1] 2017 “Communication-Efficient Learning of Deep Networks from Decentralized Data”, Available at: https://arxiv.org/abs/1602.05629, McMahan, H. Brendan, et al.

[2] 2017 “DIMACS/Northeast Big Data Hub Workshop on Overcoming Barriers to Data Sharing including Privacy and Fairness”, Available at: http://dimacs.rutgers.edu/archive/Workshops/Barriers/Slides/mcmahan.pdf, McMahan, B.

[3] 2020 “CRN. News”, Available at: https://www.crn.com/news/running-your-business/ceo-outlook-why-2020-will-be-the-launch-of-the-data-decade-, Haranas, M.

[4] 2020 “An Overview of Federated Learning”, Available at: https://medium.com/datadriveninvestor/an-overview-of-federated-learning-8a1a62b0600d, Han, B.

[5] 2020 “Why are Google and Apple dictating how European democracies fight coronavirus?”, Available at: https://www.theguardian.com/commentisfree/2020/jun/16/google-apple-dictating-european-democracies-coronavirus?CMP=Share_AndroidApp_Gmail, Ivea, I, The Guardian.
