Context

Differentiable programming is about languages with a built-in notion of efficient differentiation, of the kind used, for example, for gradient descent in machine learning. Stemming from the field of automatic differentiation, which has been developed since the 1970s, the topic has been the subject of intense research in the last few years. This workshop is intended to bring together as many as possible of those who have contributed, in all sorts of different ways, to its theoretical understanding.

Venue

The workshop will take place on June 29th and 30th, 2022, at Université Paris Cité, near IRIF:

Amphi Pierre-Gilles de Gennes
Condorcet building, Université Paris Cité
4 rue Elsa Morante
75013 Paris

Attendance will be in-person only. Talks will not be recorded. This is because the room's equipment does not allow us to organize a decent hybrid event, and also because this workshop is intended more to stimulate informal interaction than to disseminate work to large audiences.

Participants

A picture gallery (courtesy of Dan Zheng)

Registration

Participation is free and open to everyone. However, since the number of places is limited (especially given room capacity restrictions under COVID-19 regulations), we ask those who wish to attend to fill in and submit the registration form linked below. Registration will be on a first come, first served basis and will close as soon as the room's capacity is reached.

Click here to access the registration form.

Program

Click on the titles to jump to the abstracts (slides and related documents are also available for some talks).

Wednesday, 29 June 2022

09:30-10:20 Adam Paszke: You Only Linearize Once: Tangents Transpose to Gradients
10:20-10:50 coffee break
10:50-11:40 Dan Zheng: Demystifying differentiable programming
11:40-12:30 Mario Alvarez-Picallo: Pearlmutter-Siskind Automatic Differentiation Revisited
12:30-14:00 lunch break
14:00-14:50 Faustyna Krawiec: Provably Correct, Asymptotically Efficient, Higher-Order Reverse-Mode Automatic Differentiation
14:50-15:40 Tom Smeding: From dual-numbers reverse AD to an efficient implementation
15:40-16:10 coffee break
16:10-17:00 Thomas Ehrhard: Towards a comparison between automatic differentiation and the differential lambda-calculus
17:00-17:50 Aloïs Brunel: Functional data-types as a central component to deep learning architectures
20:00 social dinner

Thursday, 30 June 2022

09:30-10:20 Quentin Berthet: Perturbed Optimizers for Machine Learning
10:20-10:50 coffee break
10:50-11:40 Fabio Zanasi: A Lens-Theoretic Approach to Machine Learning
11:40-12:30 Jesse Sigal: Automatic differentiation via algebraic effects and handlers
12:30-14:00 lunch break
14:00-14:50 Wonyeol Lee: On Correctness of Automatic Differentiation for Non-Differentiable Functions
14:50-15:40 Alex Lew: Towards Semantics for Automatic Differentiation in Probabilistic Programming Languages
15:40-16:10 coffee break
16:10-17:00 Gordon Plotkin: Differentiation using Contexts
17:00-17:50 Jesse Michel: Automatic Differentiation of Discontinuous Integrals

Abstracts of the Talks

Mario Alvarez-Picallo
Pearlmutter-Siskind Automatic Differentiation Revisited

Pearlmutter and Siskind's work established the first automatic differentiation algorithm capable of correctly handling higher-order programs, kickstarting a fruitful trend of developments in functional AD. Their algorithm, however, has been criticised as being notably hard to understand, and while it has been thoroughly tested, it is known to produce incorrect outputs when evaluated on higher-order functions (as opposed to first-order functions containing higher-order subterms).

In this talk, we will show how to reformulate the core of Pearlmutter and Siskind's algorithm and formalize it as a collection of syntactic transformations on hierarchical string diagrams, a convenient graphical representation for lambda-terms. Using this formulation, we manage to provide what is, to our knowledge, the first proof of correctness of Pearlmutter and Siskind's algorithm.


Quentin Berthet
Perturbed Optimizers for Machine Learning

Machine learning pipelines often rely on optimization procedures to make discrete decisions (e.g., sorting, picking closest neighbors, or shortest paths). Although these discrete decisions are easily computed in a forward manner, they break the back-propagation of computational graphs. In order to expand the scope of learning problems that can be solved in an end-to-end fashion, we propose a systematic method to transform optimizers into operations that are differentiable and never locally constant. Our approach relies on stochastically perturbed optimizers and can be used readily within existing solvers. Their derivatives can be evaluated efficiently, and smoothness can be tuned via the chosen noise amplitude. We also show how this framework can be connected to a family of losses developed in structured prediction, and give theoretical guarantees for their use in learning tasks. We demonstrate experimentally the performance of our approach on various tasks, including recent applications on protein sequences.
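
To fix intuitions, here is a minimal Python sketch of a perturbed argmax over one-hot vectors (an illustration, not code from the talk), assuming Gaussian noise and the Monte Carlo Jacobian estimator E[argmax(theta + eps*Z) Z^T] / eps; the noise amplitude, sample count, and test values are arbitrary:

    import numpy as np

    def argmax_one_hot(theta):
        # The discrete optimizer: the one-hot vertex maximizing <y, theta>.
        y = np.zeros_like(theta)
        y[np.argmax(theta)] = 1.0
        return y

    def perturbed_argmax(theta, eps=0.5, n=10_000, rng=np.random.default_rng(0)):
        # Monte Carlo estimate of y_eps(theta) = E[argmax(theta + eps*Z)] and of its
        # Jacobian via the Gaussian-noise estimator E[argmax(theta + eps*Z) Z^T] / eps.
        Z = rng.standard_normal((n, theta.size))
        Y = np.stack([argmax_one_hot(theta + eps * z) for z in Z])
        y_eps = Y.mean(axis=0)                                    # smoothed, never locally constant
        jac = (Y[:, :, None] * Z[:, None, :]).mean(axis=0) / eps  # d y_eps / d theta
        return y_eps, jac

    theta = np.array([1.0, 1.2, 0.8])
    y, J = perturbed_argmax(theta)
    print(y)   # a soft, differentiable relaxation of the hard argmax
    print(J)   # a nonzero Jacobian, usable for backpropagation

Increasing eps smooths the output further, while decreasing it approaches the hard argmax; this is the noise-amplitude trade-off mentioned above.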

Based on


Aloïs Brunel
Functional data-types as a central component to deep learning architectures

In this talk, we will explore the benefits of bringing deep learning and functional programming together. We propose to revisit neural network architectures in the light of functional programming patterns, in particular inductive datatypes (such as algebraic data types) and their associated recursion schemes. We will show evidence that these patterns have been widely, though not explicitly, used to design successful architectures. We advocate that reframing neural networks in these terms allows for a more unified and principled approach to architecture design and opens up interesting possibilities.
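
As a toy illustration of this pattern (not from the talk; the cell and its weights below are arbitrary placeholders), a recurrent network over sequences can be read as a fold, i.e. the catamorphism of the list datatype, with the hidden state as accumulator and the learned cell as the step function:

    from functools import reduce

    def rnn_cell(h, x, w_h=0.5, w_x=1.0):
        # A toy ReLU recurrent cell; in a real model w_h and w_x would be learned.
        return max(0.0, w_h * h + w_x * x)

    def rnn(inputs, h0=0.0):
        # The architecture is the recursion scheme itself: a fold over the input list.
        return reduce(rnn_cell, inputs, h0)

    print(rnn([1.0, 2.0, 3.0]))   # 4.25

Tree-shaped architectures arise in the same way by folding over a tree datatype instead of a list.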


Thomas Ehrhard
Towards a comparison between automatic differentiation and the differential lambda-calculus

The differential lambda-calculus and linear logic (DiLL) have strong similarities with automatic differentiation (AD) and these similarities are even stronger in the recently introduced setting of coherent differentiation. There are also major differences, the most striking one being the fact that DiLL does not rely on a ground type of real numbers with respect to which derivatives are computed. This talk will introduce the basic ideas and intuitions of DiLL in a probabilistic setting and try to initiate a comparison with AD.


Faustyna Krawiec
Provably Correct, Asymptotically Efficient, Higher-Order Reverse-Mode Automatic Differentiation

Reverse-mode AD emerged as a solution to the problem of calculating large gradients more efficiently than the repeated application of forward-mode AD would allow. However, most correctness proofs of reverse-mode AD reason about algorithms that are not even asymptotically efficient. This is perhaps unsurprising, considering that the best implementations of reverse-mode AD are impressively complex pieces of software, whose inherently stateful and imperative nature makes them difficult to reason about.

Forward-mode AD, on the other hand, is conceptually simple and easy to prove correct, thanks to its presentation as a nearly-homomorphic and compositional source-to-source translation. Real numbers and operators are replaced with their dual-number counterparts, while the rest of the program remains basically unchanged, even in the presence of higher-order functions.
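
For concreteness, here is a minimal Python sketch of that dual-number translation (an illustration only, not the talk's formal development): every real is paired with a tangent, and every primitive is replaced by its lifted counterpart.

    import math

    class Dual:
        # A primal value paired with its tangent.
        def __init__(self, primal, tangent=0.0):
            self.primal, self.tangent = primal, tangent

        def __add__(self, other):
            return Dual(self.primal + other.primal, self.tangent + other.tangent)

        def __mul__(self, other):
            return Dual(self.primal * other.primal,
                        self.primal * other.tangent + self.tangent * other.primal)

    def sin(d):
        # Each primitive operator gets a dual-number counterpart.
        return Dual(math.sin(d.primal), math.cos(d.primal) * d.tangent)

    def derivative(f, x):
        return f(Dual(x, 1.0)).tangent   # seed the input tangent with 1

    # d/dx (x * sin(x)) at x = 2.0  ->  sin(2) + 2*cos(2)
    print(derivative(lambda x: x * sin(x), 2.0))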

In this talk, I will show step-by-step how applying only a few simple optimisations to this standard forward-mode AD algorithm yields a monadic translation that computes gradients without sacrificing efficiency. The resulting algorithm works on a purely functional language with higher-order functions and can be proved correct using a logical relations argument.


Wonyeol Lee
On Correctness of Automatic Differentiation for Non-Differentiable Functions

Differentiation lies at the core of many machine-learning algorithms and is well supported by popular autodiff systems, such as TensorFlow and PyTorch. These systems were originally developed to compute derivatives of differentiable functions, but in practice they are commonly applied to functions with non-differentiabilities. For instance, neural networks using ReLU generally define non-differentiable functions, yet the gradients of losses involving those functions are computed using autodiff systems in practice. This status quo raises a natural question: are autodiff systems correct in any formal sense when they are applied to such non-differentiable functions?

In this talk, we will give a positive answer to this question. Using counterexamples, we first point out flaws in commonly used informal arguments, such as the claim that non-differentiabilities arising in deep learning cause no issues because they form a measure-zero set. We then investigate a class of functions, called PAP functions, that includes nearly all (possibly non-differentiable) functions used in deep learning today. For these PAP functions, we propose a new type of derivatives, called intensional derivatives, and prove that these derivatives always exist and coincide with standard derivatives for almost all inputs. We also show that these intensional derivatives are essentially what most autodiff systems compute, or try to compute. In this way, we formally establish the correctness of autodiff systems applied to non-differentiable functions.
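
One counterexample in the spirit of the talk (an illustration only, relying on PyTorch's convention that relu'(0) = 0): the program below computes the identity function, whose derivative at 0 is 1, yet autodiff returns 0 there; the discrepancy is confined to a measure-zero set of inputs.

    import torch

    # Mathematically, f(x) = relu(x) - relu(-x) is the identity, so f'(0) = 1.
    x = torch.tensor(0.0, requires_grad=True)
    y = torch.relu(x) - torch.relu(-x)
    y.backward()
    print(x.grad)   # tensor(0.): the convention relu'(0) = 0 yields 0 instead of 1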


Alex Lew
Towards Semantics for Automatic Differentiation in Probabilistic Programming Languages

We present a new denotational model for reasoning about the use of automatic differentiation by probabilistic programming systems. First, we develop semantics for a higher-order language with probability, conditioning, and piecewise-differentiable primitives. Then, we present and justify an AD macro for the language, and use it to model and validate gradient-based features of modern PPLs, including AD Hamiltonian Monte Carlo, AD variational inference, and AD change-of-variable corrections. Finally, we show how our semantics can be extended to model partiality and recursion, yielding straightforward denotational arguments for recent results about AD in recursive languages (the a.e.-correctness of AD in PCF [Mazza and Pagani 2021], and the a.e.-differentiability of density functions of a.s.-terminating probabilistic programs [Mak et al. 2021]). Joint work with Mathieu Huot.


Jesse Michel
Automatic Differentiation of Discontinuous Integrals

Automatic differentiation generally ignores discontinuities but still produces correct results almost everywhere. In the presence of integration, however, ignoring discontinuities produces incorrect results and often has a significant impact on the optimization process. In my talk, I will discuss how to automatically compute derivatives of integrals with discontinuous integrands. I will also present applications in graphics and physical simulations, including image stylization and trajectory optimization.
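
A small numerical illustration of the issue (not from the talk; the integrand and parameters are made up): let F(t) be the integral over [0, 1] of the indicator [x < t], so F(t) = t and dF/dt = 1. The integrand is piecewise constant in t at almost every fixed x, so differentiating through the samples, as plain AD would, yields 0.

    import numpy as np

    def integrand(x, t):
        # Discontinuous at x = t; its pointwise derivative in t is 0 almost everywhere.
        return 1.0 if x < t else 0.0

    xs = (np.arange(100_000) + 0.5) / 100_000                 # sample points on [0, 1]
    F = lambda t: np.mean([integrand(x, t) for x in xs])      # F(t) = t

    t, h = 0.5, 1e-4
    print("differentiating the samples (naive AD):", 0.0)
    print("finite-difference check on F:", (F(t + h) - F(t - h)) / (2 * h))  # ~1.0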


Adam Paszke
You Only Linearize Once: Tangents Transpose to Gradients

Automatic differentiation (AD) is conventionally understood as a family of distinct algorithms, rooted in two "modes" -- forward and reverse -- which are typically presented (and implemented) separately. Can there be only one? Following up on the AD systems developed in the JAX and Dex projects, in this talk I’ll present how to decompose reverse-mode AD into three modular parts: (i) forward-mode AD, followed by (ii) unzipping the linear and non-linear parts, and then (iii) transposition of the linear part.

This decomposition also sheds light on checkpointing, which emerges naturally from a free choice in unzipping let expressions. As a corollary, checkpointing techniques are applicable to general-purpose partial evaluation, not just AD. We hope that our formalization will lead to a deeper understanding of automatic differentiation and that it will simplify implementations, by separating the concerns of differentiation proper from the concerns of gaining efficiency (namely, separating the derivative computation from the act of running it backward).
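
This decomposition can already be experimented with in JAX (a rough sketch, not the talk's formalization): jax.linearize performs forward-mode AD and retains the linear part, and jax.linear_transpose transposes that linear map into the gradient map.

    import jax
    import jax.numpy as jnp

    def f(x):
        return jnp.sin(x) * x

    x = jnp.array(2.0)
    y, f_lin = jax.linearize(f, x)          # (i) + (ii): forward mode, keeping the linear part
    f_t = jax.linear_transpose(f_lin, x)    # (iii): transpose the linear map
    (grad_x,) = f_t(jnp.ones_like(y))       # pull back the output cotangent 1.0

    print(grad_x)              # ~ 2*cos(2) + sin(2)
    print(jax.grad(f)(x))      # the same value via JAX's built-in reverse mode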


Gordon Plotkin
Differentiation using Contexts

Using contexts, we give compositional AD source code transformations for reverse-mode differentiation and for transposition for a simple first-order language. Neither the transformations nor the target code produced employ continuations or similar devices. The essential idea is that contexts act as a syntactic version of continuation transformers.

The target language is a first-order mixed linear/nonlinear language. We also give a standard forward-mode differentiation source code transformation to the mixed language, tracking primals and tangents in tandem in the usual way.

The target code produced is efficient (e.g., linear in the size of the input code), and, in the relevant case, as efficient as backprop. Further, reverse-mode differentiation is syntactically identical to forward-mode differentiation followed by transposition, so the two processes produce equally efficient code (by any measure!).

That all said, this is only a report on work in progress. There remains much to do, even at first-order. To give just two examples, it would be important to add sum types, and a richer linear language would enable iterated differentiation.


Jesse Sigal
Automatic differentiation via algebraic effects and handlers

Algebraic effects and handlers are a powerful control-flow construct which provides a more structured version of delimited continuations. I will show how to implement forward mode, reverse mode, checkpointed reverse mode, and a second-order reverse mode which computes the full Hessian in one pass (given in Mu Wang's thesis "High Order Reverse Mode of Automatic Differentiation"). Additionally, the constructions are compositional. Furthermore, I will show how various bugs such as perturbation confusion are avoided via the effect type system. Finally, I will discuss various benchmarks and issues concerning the implementation in Koka. This talk builds on the related PEPM'21 short paper.


Tom Smeding
From dual-numbers reverse AD to an efficient implementation

Dual-numbers-style reverse AD replaces the tangent scalars of the well-known dual-numbers forward AD algorithm with backpropagators: functions taking the cotangent of the scalar and returning the corresponding cotangent of the program input. The method captures the elegance of dual-numbers forward AD but does not naturally inherit its efficiency. However, staging/delaying calls to these backpropagators so that multiple calls can be combined using linear factoring [1], followed by some standard algorithmic improvements from functional programming, transforms the algorithm into one that enjoys an efficient implementation. This efficient algorithm still generalises to complex language features such as recursive types, just like dual-numbers forward AD.
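
A minimal Python rendering of this starting point (illustrative only, for a single real input, and without the optimisations the talk is about): each scalar carries a backpropagator mapping its own cotangent to the cotangent of the program input.

    class D:
        # A primal value paired with a backpropagator:
        # cotangent of this scalar -> cotangent of the (single, real) program input.
        def __init__(self, primal, bp):
            self.primal, self.bp = primal, bp

        def __add__(self, other):
            return D(self.primal + other.primal,
                     lambda ct: self.bp(ct) + other.bp(ct))

        def __mul__(self, other):
            return D(self.primal * other.primal,
                     lambda ct: self.bp(other.primal * ct) + other.bp(self.primal * ct))

    def grad(f, x):
        result = f(D(x, lambda ct: ct))   # the input's backpropagator is the identity
        return result.bp(1.0)             # seed the output cotangent with 1

    # d/dx (x*x + x) at x = 3.0  ->  7.0
    print(grad(lambda x: x * x + x, 3.0))

Note that this naive version re-invokes both argument backpropagators at every call, which is exactly the inefficiency that staging and linear factoring remove.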

We discuss the optimisations that bring dual-numbers reverse AD to an efficient algorithm, and furthermore link the algorithm to existing work on CHAD [2] and AD on Haskell using sharing recovery [3][4].


Fabio Zanasi
A Lens-Theoretic Approach to Machine Learning

I will present recent work (joint with G. Cruttwell, B. Gavranovic, N. Ghani, and P. Wilson) on developing semantic foundations for machine learning based on the notion of lens. This framework provides a much-needed unifying perspective on different training techniques for machine learning models, as well as paving the way for the application of methodologies from logic, algebra, and category theory. The abstract viewpoint not only allows a uniform study of learning with neural networks, but also extends these techniques to other, previously unexplored models, such as Boolean circuits.
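
To give a flavour of the pattern (a drastic simplification: the actual framework is categorical and also threads parameters, which are omitted here), a lens pairs a forward map with a backward map, and lens composition has precisely the shape of backpropagation.

    from dataclasses import dataclass
    from typing import Any, Callable

    @dataclass
    class Lens:
        fwd: Callable[[Any], Any]        # input -> output
        bwd: Callable[[Any, Any], Any]   # (input, output change) -> input change

    def compose(l1: Lens, l2: Lens) -> Lens:
        # Forward passes chain left to right, backward passes right to left.
        return Lens(
            fwd=lambda x: l2.fwd(l1.fwd(x)),
            bwd=lambda x, dy: l1.bwd(x, l2.bwd(l1.fwd(x), dy)),
        )

    # Smooth maps become lenses whose backward map is the transposed derivative.
    square = Lens(fwd=lambda x: x * x, bwd=lambda x, dy: 2 * x * dy)
    double = Lens(fwd=lambda x: 2 * x, bwd=lambda x, dy: 2 * dy)

    f = compose(square, double)             # f(x) = 2 * x**2
    print(f.fwd(3.0), f.bwd(3.0, 1.0))      # 18.0 and df/dx = 12.0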


Dan Zheng
Demystifying differentiable programming

This talk covers two topics: examining differentiable programming from a programming languages perspective, and developing differentiable programming as a feature in a general purpose programming language.

The first topic comes from “Demystifying Differentiable Programming: Shift/Reset the Penultimate Backpropagator”. The key insight is a connection between reverse-mode automatic differentiation (AD) and delimited continuations, which enables an AD implementation using operator overloading without managing auxiliary data structures. Combining this AD implementation with multi-stage programming achieves efficiency without sacrificing expressivity, scaling up to real-world neural networks.
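
The flavour of that insight can be conveyed by a small continuation-passing sketch in Python (the paper itself uses Scala with shift/reset and multi-stage programming, none of which is reproduced here): each operation runs the rest of the program as a continuation and then propagates cotangents on the way back, so no explicit tape is maintained.

    class Num:
        def __init__(self, x, d=0.0):
            self.x, self.d = x, d       # primal value and accumulated cotangent

    def add(a, b, k):
        c = Num(a.x + b.x)
        k(c)                            # run the rest of the computation first
        a.d += c.d                      # then propagate cotangents backwards
        b.d += c.d

    def mul(a, b, k):
        c = Num(a.x * b.x)
        k(c)
        a.d += b.x * c.d
        b.d += a.x * c.d

    def grad(f, x):
        a = Num(x)
        f(a, lambda r: setattr(r, "d", 1.0))   # the final continuation seeds the output cotangent
        return a.d

    # d/dx (x*x + x) at x = 3.0  ->  7.0
    print(grad(lambda a, k: mul(a, a, lambda t: add(t, a, k)), 3.0))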

The second topic comes from work on differentiable programming in Swift. Swift is a general-purpose, statically typed programming language, widely used for app development. This part of the talk focuses on the goals, design constraints, challenges, and lessons learned from adding differentiable programming as an idiomatic first-class feature in a general-purpose programming language.


Invited Speakers

Organizers

Participants

This list will be progressively updated as people register for the workshop.