Brief Primer on Federated Learning, Part 2.

Colin Byrne
Dec 17, 2020


Part 1 of this article took a brief look at some key Federated Learning (FL) concepts. Here in Part 2 we apply them in practice within the TensorFlow Federated simulation environment.

Implementing Federated Learning.

The intention of the various Federated Learning frameworks, platforms, and tools is to reduce the burden on developers of dealing with massively distributed datasets and their associated complexities¹. Two popular frameworks are PySyft and TensorFlow Federated.

The OpenMined² organization offers several open-source libraries and platforms focused on remote execution problems, including FL (PySyft), differential privacy, and homomorphic encryption. They enable this by extending existing Machine Learning (ML) libraries such as PyTorch, TensorFlow, and Keras.

TensorFlow Federated (TFF) is an open-source FL platform from Google. It is in its nascent stages, primarily supporting experimentation and runtime simulation with suitable decentralized datasets.

TensorFlow Federated³ enables developers to declaratively express federated computations so that they can be deployed to diverse runtime environments. Included with TFF is a single-machine simulation runtime for experiments. A layered architecture comprising two APIs (the FL API and the Federated Core API) enables separation of development tasks.

Example TFF Task⁴: “Read sensor from remote client devices and identify clients exceeding global average”.

Example TFF task.

The federated abstraction “wraps python functions as TFF federated/composite computations”.

  • The basic unit of composition in TFF is a “federated computation”.
  • TFF offers several “federated operators”, e.g. federated mean.
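
The sensor task above can be sketched in plain Python to show the shape of the computation. This is a single-machine illustration only, not TFF code: the helper names `federated_mean` and `clients_above_average` and the sample readings are hypothetical. In TFF itself the averaging step would be expressed with a federated operator over values placed on the clients.

```python
from statistics import fmean

def federated_mean(client_values):
    """Server-side aggregate: average one value reported by each client."""
    return fmean(client_values.values())

def clients_above_average(client_values):
    """Compute the global mean, then flag clients whose reading exceeds it."""
    global_mean = federated_mean(client_values)
    flagged = sorted(cid for cid, v in client_values.items() if v > global_mean)
    return global_mean, flagged

# Hypothetical sensor readings, one per simulated client device.
readings = {"client_a": 68.0, "client_b": 70.0, "client_c": 75.0, "client_d": 67.0}
global_mean, flagged = clients_above_average(readings)
print(global_mean, flagged)
```

Only the aggregate (the mean) needs to travel back to the devices; each comparison can then be made locally.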

TFF provides access to a number of datasets suitable for FL experimentation, i.e. Massively Distributed Datasets (MDD) that mirror real-world data distributions, along with several worked tutorials. The TFF image classification tutorial leverages a federated version of the original NIST (NIST, 2019) dataset comprising handwritten digits, curated by LEAF⁵.


The Dataset is indexed by the original author of the handwritten digits⁶.

Image classification dataset

It comprises 62 different classes (10 digits, 26 lowercase letters, 26 uppercase letters), with image dimensions of 28 by 28 pixels.

Dataset distribution⁹, samples per device:
  • 3,550 authors/devices.
  • 805,263 sample digit .png files.
  • ~226 samples per author (mean).
  • ~89 samples per author (standard deviation).

Federated Experiments.

The following experiments were conducted within the TFF simulation framework. They aim to characterize the key FL design considerations (how many rounds of training, how much local on-device training, and how to evaluate) by varying some key FL modelling parameters, using the code provided in the TFF image classification tutorial⁷.

Rounds of Training:

Modify the existing TFF image classification tutorial code to vary the number of training rounds and the number of devices that participate in the experiment.

TFF Image classification, Training iteration experiment.
  • Increasing the number of training rounds and the number of devices generally resulted in increased accuracy.
  • Both experiment pairs, (A, C) and (B, D), show a similar accuracy improvement when the number of devices participating in training is increased.
  • The experiments above used the “same” client devices for both training and validation.

TFF Image classification, Random device selection.
  • “Random” device selection resulted in a ~10% reduction in accuracy, with higher variation.
  • Random device selection is more reflective of real-world scenarios, but static device selection can be useful as a “repeatable” experimental baseline.
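
The difference between static and random device selection can be sketched as follows. This is a minimal illustration, not the tutorial's code: the function `select_clients`, the `writer_XXXX` id format, and the seeding scheme are assumptions made for the example. Only the client count of 3,550 comes from the dataset figures above.

```python
import random

def select_clients(client_ids, num_clients, round_num, mode="static", seed=0):
    """Return the client ids that take part in one training round."""
    if mode == "static":
        # Same participants every round: a repeatable baseline.
        return client_ids[:num_clients]
    # Fresh random sample each round, reproducible via (seed, round_num).
    rng = random.Random(seed * 100_003 + round_num)
    return rng.sample(client_ids, num_clients)

# Hypothetical client ids, one per writer in the federated dataset.
client_ids = [f"writer_{i:04d}" for i in range(3550)]

static_rounds = [select_clients(client_ids, 10, r, "static") for r in range(3)]
random_rounds = [select_clients(client_ids, 10, r, "random") for r in range(3)]
print(static_rounds[0] == static_rounds[1], random_rounds[0] == random_rounds[1])
```

Seeding the random sampler keeps the “random” experiment repeatable while still changing the participants from round to round.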


The number of epochs specified in the Federated Averaging (FedAvg) algorithm defines how many sequential “steps of on-device learning” each client makes before updating the global model in each round of training.

The greater the number of epochs, the more local progress.

Note⁸: “The quality of the FL model updates should be optimized as much as possible. Generally, this works by increasing the amount of computation performed on each client/device.”

Running several iterations locally before synchronizing (at end of each round) with the server is a purely heuristic method⁹. There are no theoretical guarantees that this technique improves convergence. In some simulations it has been shown that running multiple iterations (epochs) locally improves convergence by a factor between 3 and 100.
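
The role of local epochs can be illustrated with a toy FedAvg loop. This is a sketch under stated assumptions rather than the TFF implementation: the model is a single weight `w` fitted to `y = w * x`, the client data is hypothetical and identically distributed (every client sees `y = 3x`), and the function names are made up for the example.

```python
def local_train(w, data, epochs, lr):
    """Run `epochs` passes of per-example gradient descent on one client."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2.0 * (w * x - y) * x  # d/dw of (w*x - y)**2
            w -= lr * grad
    return w

def fedavg_round(w_global, client_data, epochs, lr):
    """One FedAvg round: local training, then example-weighted averaging."""
    total = sum(len(d) for d in client_data)
    local = [local_train(w_global, d, epochs, lr) for d in client_data]
    return sum(len(d) * w for d, w in zip(client_data, local)) / total

def run(client_data, rounds, epochs, lr=0.05):
    """Train from w = 0 for a fixed number of rounds."""
    w = 0.0
    for _ in range(rounds):
        w = fedavg_round(w, client_data, epochs, lr)
    return w

# Hypothetical toy data: every client sees y = 3x.
clients = [[(1.0, 3.0), (2.0, 6.0)], [(0.5, 1.5), (1.5, 4.5), (1.0, 3.0)]]
w_1_epoch = run(clients, rounds=5, epochs=1)
w_5_epochs = run(clients, rounds=5, epochs=5)
print(w_1_epoch, w_5_epochs)
```

With identically distributed clients, the run with more local epochs ends closer to the true weight of 3 after the same number of rounds. That matches the heuristic benefit described above, but it says nothing about the non-IID case.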

Modify Amount of Local Training:

Modify the existing TFF image classification tutorial code to vary the amount of “Local On Device” training during the FL experiment.

  • Experiment: the same 10 devices are selected for 100 rounds of training.
TFF Image Classification Epoch experiment.

Epoch Analysis: Increasing the number of epochs manifests as a much faster convergence rate (green line) and higher accuracy. However, the more local on-device training that takes place, the higher the danger of “overfitting”¹⁰ and the higher the chance that an individual client may “drift” from the overall average of the global data.

The task of model optimization is rather akin to trying to manipulate the shape of an inflated balloon: pressing on one side compresses it while simultaneously expanding the other side, i.e.

“Increasing the amount of local learning may result in faster convergence, but conversely may reduce validation accuracy.”

Optimization¹¹ is a normal task with typical AI/ML models. With FL optimization, however, there is an extra risk of overfitting due to the data distribution on each client/device, i.e. non-IID¹² data. If each client/device dataset has the same local data distribution as the global data distribution, the same optimization techniques used for typical AI/ML models can be used. The less similar each client’s local dataset is to the “global” dataset, the more likely there will be “drift” (clients converging to different optimal points) when using a high number of client epochs during later training rounds.

Training accuracy can still be good, as each client/device fits its own local data well during training. However, test/validation accuracy is less likely to improve, as the averaged global model update will likely be very small, i.e. the different client/device local optimal points cancel each other out. Praneeth¹³ refers to this effect of running the FedAvg algorithm with a high number of epochs as “client-drift”.
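
Client drift can be reproduced in a few lines on a deliberately simple problem. The sketch below assumes two hypothetical non-IID clients, each with a one-dimensional quadratic loss `0.5 * curvature * (w - optimum)**2`. The client values and function names are illustrative and are not taken from the cited paper.

```python
def local_steps(w, curvature, optimum, steps, lr):
    """Gradient descent on one client's loss 0.5 * curvature * (w - optimum)**2."""
    for _ in range(steps):
        w -= lr * curvature * (w - optimum)
    return w

def fedavg(clients, steps, rounds=200, lr=0.1):
    """FedAvg with equally weighted clients; returns the final global weight."""
    w = 0.0
    for _ in range(rounds):
        w = sum(local_steps(w, a, c, steps, lr) for a, c in clients) / len(clients)
    return w

# Hypothetical non-IID clients as (curvature, local optimum) pairs.
clients = [(1.0, 0.0), (4.0, 10.0)]
# Minimiser of the average loss: curvature-weighted mean of the local optima.
w_star = sum(a * c for a, c in clients) / sum(a for a, _ in clients)

w_few = fedavg(clients, steps=1)    # one local step per round
w_many = fedavg(clients, steps=20)  # heavy local training per round
print(w_star, round(w_few, 3), round(w_many, 3))
```

With one local step per round the global weight settles at the true minimiser. With many local steps each client runs almost all the way to its own optimum, and the average lands away from the global minimiser, closer to the plain mean of the local optima.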

Evaluation Perspective:

When validating FL results, do you validate on one or more devices?

“What strategy should be used when evaluating FL models?”

Validating a typical AI/ML model is predicated on the dataset being fully available. The model developer generally splits the dataset, e.g. an 80/20 split, with 80% of the data for training and 20% for test/validation.

While a dataset may be fully available in an FL “simulation” environment, this scenario isn’t applicable to real-world FL instances, where dataset availability depends on access to remote client devices. In real-world FL scenarios the federated dataset should be seen as “dynamic” in nature, with random devices selected for both training and evaluation. Evaluation in the FL context is also complicated by the question of perspective.
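
In a simulation, one way to mimic this is to hold out whole devices rather than individual examples. The sketch below is an assumption about how such a split might be done, not a TFF utility: `split_clients`, the 20% hold-out, and the `writer_XXXX` ids are all hypothetical.

```python
import random

def split_clients(client_ids, eval_percent=20, seed=0):
    """Shuffle the client ids and hold out a percentage for evaluation."""
    rng = random.Random(seed)
    shuffled = list(client_ids)
    rng.shuffle(shuffled)
    n_eval = len(shuffled) * eval_percent // 100
    return shuffled[n_eval:], shuffled[:n_eval]

# Hypothetical client ids, one per simulated device.
client_ids = [f"writer_{i:04d}" for i in range(3550)]
train_ids, eval_ids = split_clients(client_ids)
print(len(train_ids), len(eval_ids))
```

Evaluating on devices that never took part in training gives a closer estimate of how the model behaves for an unseen user.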

You want to train a model in a way that enables good generalization over the full domain of users/devices. The choice of evaluation¹⁴ strategy may be somewhat context dependent.

Example scenario: federated personalization in a smartphone application. The solution developer may wish to optimize for the “average future user’s” experience, which is better modelled by the per-client weighted average than by the per-example weighted average. Their aim may be to make the personalized application work better on average across all users, rather than work better for the individuals who use it more. Problem context should drive evaluation strategies.
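
The two averages can differ substantially when a few clients hold most of the data. A minimal sketch with hypothetical per-client results (one heavy user and two light users):

```python
def per_example_accuracy(results):
    """Weight each client by its number of examples (heavy users dominate)."""
    correct = sum(c for c, _ in results)
    total = sum(n for _, n in results)
    return correct / total

def per_client_accuracy(results):
    """Weight every client equally, regardless of how much data it holds."""
    return sum(c / n for c, n in results) / len(results)

# Hypothetical (correct predictions, examples) pairs, one per client.
results = [(900, 1000), (5, 10), (6, 10)]
by_example = per_example_accuracy(results)
by_client = per_client_accuracy(results)
print(round(by_example, 3), round(by_client, 3))
```

Here the per-example figure is dominated by the heavy user, while the per-client figure reflects the weaker experience of the two light users.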

Modify the existing TFF image classification tutorial code to vary the number of devices that are used to validate the model’s accuracy during the FL experiment.

  • Experiment: 192 random devices are selected for 200 rounds of training.
  • When complete, the trained model is then evaluated on a separate random selection of 1, 10, and 192 devices over 200 evaluation rounds.
TFF Image Classification, Evaluation Perspective.
  • Training on 192 clients resulted in a mean validation accuracy of ~82% measured on (1, 10, 192) clients, with a standard deviation of 0.122, 0.037, and 0.008 measured on (1, 10, 192) clients respectively.
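
The shrinking spread is roughly what basic statistics would predict. As a sanity check, assuming per-client accuracies are independent (an assumption, not something the experiment establishes), the standard deviation of a mean over n clients should fall as 1/√n:

```python
import math

# Standard deviations reported in the experiment, keyed by evaluation clients.
reported_std = {1: 0.122, 10: 0.037, 192: 0.008}

def predicted_std(single_client_std, num_clients):
    """Std of a mean over independent clients shrinks as 1 / sqrt(n)."""
    return single_client_std / math.sqrt(num_clients)

predicted = {n: predicted_std(reported_std[1], n) for n in reported_std}
for n in reported_std:
    print(f"{n:>3} clients: reported {reported_std[n]:.3f}, predicted {predicted[n]:.3f}")
```

The predicted values sit close to the reported ones, which suggests that most of the variation seen by a single device is averaged away once many devices are evaluated together.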

Applying the intuition gleaned from the evaluation perspective experiment to a real-world scenario, take the Google Gboard¹⁵ next-word prediction use case as an example context. The FL model developer may very well be happy with 81% accuracy at a standard deviation of 0.008, while a single device user may also be happy with 81.9% accuracy, albeit at a standard deviation of 0.12, for next-word predictions on their smartphone. Context and perspective matter when evaluating FL models.

Experiments Conclusion.

A simple analogy helps to better understand the FL modelling design considerations:

  • How many rounds of remote training should take place?
  • How much local on-device training should take place?
  • What strategy should be used when evaluating FL models?

When studying for an exam, best practice is to study material from as wide a range of sources as possible, i.e. (“number of remote devices”), and the longer you study, i.e. (“number of rounds of training”), the better the learning outcome. When it comes to exam validation, i.e. (“model evaluation”), exam results can be measured at the level of a:

  • Single student,
  • Class,
  • Semester,
  • Faculty, etc.

or, from a Federated Learning point of view, at the level of a “single user device”, “several devices”, etc.

The key is to try to optimize the learning process and to contextualize its evaluation. You, the developer, decide “at what level evaluation is required!”

Notes & Citations.

About the Author:

Colin Byrne is an IT professional with over 25 years of industry experience, and a recent (2020) MSc. postgraduate in Artificial Intelligence from the University of Limerick.


[1] Gupta, A., 2019. “How Federated Learning is going to revolutionize AI”. Available at:

[2] OpenMined, 2020. Available at:

[3] Google, 2020. “TensorFlow Federated: Machine Learning on Decentralized Data”. Available at:

[4] Google, 2020. “TensorFlow Federated Modules”. Available at:

[5] Caldas, S. et al., 2018. “A Benchmark for Federated Settings”. Available at:

[6] Caldas, S. et al., 2019. “Workshop on Federated Learning for Data Privacy and Confidentiality”. Available at:

[7] Google, 2020. “Federated Learning for Image Classification”. Available at:

[8][9] Hartmann, F., 2018. “Federated Learning”. University of Berlin. [Online] Available at:

[10] 2020. “Overfitting in Machine Learning: What It Is and How to Prevent It”. Available at:

[11][13] Sai Praneeth, K. et al., 2020. “Stochastic Controlled Averaging for Federated Learning”. Available at:

[12] Yue, Z. et al., 2018. “Federated Learning with Non-IID Data”. Available at:

[15] Ramaswamy, S. et al., 2019. “Federated Learning for Emoji Prediction in a Mobile Keyboard”. Available at:


