2023-05787 - PhD Position F/M Privacy-preserving decentralized learning through Model fragmentation and Private Aggregation

April 09, 2023


Contract type : Fixed-term contract

Level of qualifications required : Graduate degree or equivalent

Function : PhD Position

About the research centre or Inria department

The Inria Rennes - Bretagne Atlantique Centre is one of Inria's eight centres and has more than thirty research teams. The Inria Centre is a major and recognized player in the field of digital sciences. It is at the heart of a rich R&D and innovation ecosystem: highly innovative SMEs, large industrial groups, competitiveness clusters, research and higher education players, laboratories of excellence, technology research institutes, etc.


The WIDE team is involved in a number of projects that tackle related problems. In the context of the SOTERIA H2020 project, Davide Frey is currently working on decentralized and privacy-preserving machine-learning algorithms using trusted execution environments. This thesis provides a complementary approach, so there is a concrete possibility of directly applying its results to the Personal Data Vault being developed by the SOTERIA project. Davide Frey is also active, with François Taiani, in the FedMalin Inria Challenge, which also investigates decentralized machine-learning platforms. In particular, in the context of FedMalin, WIDE is currently developing a library for decentralized machine learning that this thesis can exploit. Moreover, we envision close collaboration with other teams involved in the FedMalin project. Beyond these collaborations with the partners of the SOTERIA and FedMalin projects, we also plan to collaborate with Anne-Marie Kermarrec's group at EPFL.


Machine learning consists in producing (learning) a computer-based function (usually referred to as a model) from examples (training data). The accuracy and quality of the resulting model are usually directly related to the size of the training data, but training from very large datasets raises at least two problems. First, very large training sets require substantial computing power to train the model in a reasonable time. Second, as machine learning is increasingly applied to sensitive and personal data (e.g. health records, personal messages, user preferences, browsing histories), exposing this data to the learning algorithm raises far-reaching privacy-protection concerns and carries important risks of privacy violation.

These two problems have prompted the emergence of a range of distributed learning techniques, which seek to distribute the learning effort on many machines to scale the learning process and limit privacy leaks by keeping sensitive data on the learning devices. Two related strategies have, in particular, emerged to address these challenges: Federated Learning, initially promoted by Google, and Decentralized Learning, which forgoes entirely any centralized entity in the learning process.

Unfortunately, recent works have shown that, in spite of their promises, both of these approaches can be subject to privacy attacks, such as membership inference, data reconstruction, or attribute inference, that make it possible for malicious participants to access private and/or sensitive information through the learning process. This PhD aims to improve the privacy protection granted by decentralized learning by exploring how model fragmentation, a technique developed by the WIDE team within the ANR Pamela project (2016-2020), can be combined with private aggregation and random peer sampling, two of the strategies successfully applied to P2P networks.
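As a point of reference, a decentralized learning round built on random peer sampling can be sketched as follows. This is a minimal, hypothetical illustration in plain Python, not the WIDE team's library: real systems exchange gradients over a network rather than sharing a dictionary, and `rps_sample` stands in for a full peer-sampling service.

```python
# Minimal sketch (hypothetical; not the project's codebase) of one
# synchronous decentralized-learning round: each node uses random peer
# sampling (RPS) to pick neighbours and averages parameters with them.
import random

def rps_sample(nodes, me, k):
    """Random peer sampling: return k peers other than ourselves."""
    return random.sample([n for n in nodes if n != me], k)

def gossip_round(models, k=2):
    """One round of decentralized averaging.

    models maps node_id -> list of floats (model parameters); each
    node replaces its model with the average over itself and k peers.
    """
    nodes = list(models)
    new_models = {}
    for me in nodes:
        group = [models[me]] + [models[p] for p in rps_sample(nodes, me, k)]
        new_models[me] = [sum(vals) / len(group) for vals in zip(*group)]
    return new_models
```

Because every node learns its peers' parameters in the clear, such a baseline is exactly what the reconstruction and inference attacks mentioned above exploit.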

Main activities

The Ph.D. will investigate:

(1) how to organize model fragmentation to obtain privacy protection gains (e.g. which parts of a model are more sensitive than others),

(2) how to combine it with other obfuscation mechanisms (private epidemic aggregation, masking, randomization), and

(3) how to characterize the protection it brings in terms of privacy, and the costs (networks, time, loss of model quality) it causes.

  • 1. How to fragment, distribute, and reconstruct decentralized models? Fragmentation has long been identified as a useful principle to hide and protect sensitive information. Applying fragmentation to models during the learning process in a decentralized setting, however, brings a fresh range of research problems. How should a given model be partitioned? Finer fragments carry less information but may excessively slow down the learning procedure. Some parts of a model might carry more information than others. Can such differences be exploited to deliver varying levels of fragmentation, tailored to the sensitivity of individual model parts? Once fragmented, model updates must be disseminated to learning nodes and recombined. Although these steps appear trivial, the underlying topology used to perform them has far-reaching consequences for the overall security of the system. A static graph, in particular, allows dishonest peers to conduct colluding attacks, while the use of a random peer-sampling service (RPS) brings new potential weaknesses, as malicious peers could try to influence the service to serve their goals.
  • 2. How to combine fragmentation and privacy-preserving averaging without disrupting learning? Privacy-preserving averaging can be applied in a straightforward manner by using a privacy-preserving averaging protocol (PPAP) for all gradient exchanges. However, PPAP rounds are typically much slower than those of straightforward averaging. To reduce this cost, we plan to explore how privacy-preserving averaging could be applied only to the most sensitive part of the model, in tandem with heterogeneous fragmentation. One central challenge of this approach lies in designing a hybrid clear-text/obfuscated averaging algorithm that minimally disrupts learning while significantly raising the bar for reconstruction attacks. A second path we would like to explore is the notion of pipelining for the obfuscated part. When using a PPA protocol to compute averaged models (or gradients), peers must wait until the PPA protocol completes, which can add substantial overhead to the overall learning procedure. When a PPA protocol executes, however, although the initial values computed by individual nodes are random, the actual aggregated value is progressively revealed as the PPA protocol progresses. The intuition behind pipelining consists in exploiting these noisy yet partially converged values to reduce the delay caused by PPA averaging. Concretely, after an initial static delay, in which the part of the model subject to the PPA protocol remains unchanged, we plan to progressively apply PPA updates in a continuous fashion.
  • 3. How to characterize the gains of protection obtained from fragmentation and privacy-preserving averaging? Although experiments do suggest that privacy attacks on distributed learning, and in particular reconstruction attacks, are highly sensitive to a range of experimental parameters, this relationship is today not fully understood, and no framework exists to assess privacy-protection mechanisms in a systematic and sound way. Although this problem is particularly thorny and in all likelihood cannot be fully solved within a Ph.D., we will strive to develop a comprehensive privacy benchmark to assess the level of protection granted by the mechanisms developed in this thesis. We will in particular aim for representativeness, reproducibility, and, as far as possible, generalizability.
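To make the hybrid clear-text/obfuscated idea of question 2 concrete, here is a hedged sketch under invented assumptions: toy pairwise masks derived from shared integer seeds stand in for a real key exchange, and the sensitive fragment is given as an explicit slice rather than chosen by a fragmentation strategy. Only the sensitive fragment is protected; the pairwise additive masks cancel in the sum, so the aggregate is exact while individual contributions stay hidden. This is not the actual PPAP envisioned by the project.

```python
# Illustrative sketch of hybrid averaging over a fragmented model:
# the sensitive fragment is protected by pairwise additive masks that
# cancel in the sum (a simple secure-aggregation idea), while the
# remaining parameters are averaged in the clear.
import random

def masked_fragments(models, sensitive):
    """Return each node's sensitive fragment with pairwise masks applied."""
    ids = sorted(models)
    frags = {i: list(models[i][sensitive]) for i in ids}
    for ai, a in enumerate(ids):
        for b in ids[ai + 1:]:
            # Toy stand-in for a pairwise key exchange between a and b.
            rng = random.Random(a * 100003 + b)
            mask = [rng.uniform(-1, 1) for _ in frags[a]]
            frags[a] = [x + m for x, m in zip(frags[a], mask)]  # a adds
            frags[b] = [x - m for x, m in zip(frags[b], mask)]  # b subtracts
    return frags

def hybrid_average(models, sensitive):
    """Average models; only the `sensitive` slice goes through masking."""
    ids = sorted(models)
    n, dim = len(ids), len(models[ids[0]])
    avg = [sum(models[i][d] for i in ids) / n for d in range(dim)]  # clear text
    frags = masked_fragments(models, sensitive)
    flen = len(frags[ids[0]])
    avg[sensitive] = [sum(frags[i][d] for i in ids) / n for d in range(flen)]
    return avg
```

The design point this toy version illustrates: masking changes what each peer reveals, not the computed average, so the cost of the obfuscated path can be confined to the fragment deemed sensitive.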
We envision the following high-level work plan.

  • M0-M3: The PhD student will conduct a thorough survey of the state of the art in attacks and countermeasures for federated and decentralized learning.

  • M4-M12: The student will then leverage this state of the art to design and implement a benchmark suite that incorporates the major attack techniques, and will apply it to existing decentralized learning solutions. This will allow them to identify the strengths and pitfalls of existing solutions.

  • M13-M24: They will then leverage the developed benchmark to test the privacy guarantees offered by the current model-fragmentation approach developed by the WIDE team. In particular, they will evaluate several fragmentation strategies in combination with different topology-management approaches.

  • M25-M30: The student will then focus on how to combine fragmentation with privacy-preserving averaging. The goal here consists in designing a hybrid protocol that can combine privacy-preserving steps with clear-text steps on fragmented models.

  • M31-M36: The final months will be devoted to writing the manuscript and finalizing the publications of the thesis results.

    Skills

    Good programming skills and a willingness to learn about new techniques (decentralized machine learning and privacy protection) are crucial, as are good writing skills and the ability to propose, present, and discuss new ideas in a collaborative setting.

    Benefits package
  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Possibility of teleworking (90 days per year) and flexible organization of working hours
  • Partial payment of insurance costs
    Remuneration

    Monthly gross salary of 2,051 euros for the first and second years and 2,158 euros for the third year.

    General Information
  • Theme/Domain : Distributed Systems and Middleware; Statistics (Big Data) (BAP E)

  • Town/city : Rennes

  • Inria Center : Centre Inria de l'Université de Rennes
  • Starting date : 2023-09-01
  • Duration of contract : 3 years
  • Deadline to apply : 2023-04-09
    Contacts
  • Inria Team : WIDE
  • PhD Supervisor : Frey Davide / [email protected]
    The keys to success

    The candidate recruited for this Ph.D. should have a Master's Degree in Computer Science or equivalent, with a solid algorithmic and systems background, particularly regarding at least one of the following: distributed computer systems, machine learning, and/or mobile computing.

    About Inria

    Inria is the French national research institute dedicated to digital science and technology. It employs 2,600 people. Its 200 agile project teams, generally run jointly with academic partners, include more than 3,500 scientists and engineers working to meet the challenges of digital technology, often at the interface with other disciplines. The Institute also employs numerous talents in over forty different professions. 900 research support staff contribute to the preparation and development of scientific and entrepreneurial projects that have a worldwide impact.

    Instruction to apply

    Please submit online : your resume, cover letter and, if applicable, letters of recommendation

    For more information, please contact [email protected]

    Defence Security : This position is likely to be situated in a restricted area (ZRR), as defined in Decree No. 2011-1425 relating to the protection of national scientific and technical potential (PPST). Authorisation to enter such an area is granted by the director of the unit, following a favourable Ministerial decision, as defined in the decree of 3 July 2012 relating to the PPST. An unfavourable Ministerial decision in respect of a position situated in a ZRR would result in the cancellation of the appointment.

    Recruitment Policy : As part of its diversity policy, all Inria positions are accessible to people with disabilities.

    Warning : you must enter your e-mail address in order to save your application to Inria. Applications must be submitted online on the Inria website. Processing of applications sent from other channels is not guaranteed.
