What follows is a high-level overview of this work, for more details refer to our paper. Given a reward
R ( x )
and a deterministic episodic environment where episodes end with a ``generate
x x
'' action, how do we generate diverse and high-reward
x x
s?
We propose to use Flow Networks to model discrete
p ( x ) R ( x ) p(x) \propto R(x)
from which we can sample sequentially (like episodic RL, rather than iteratively as MCMC methods would). We show that our method, GFlowNet, is very useful on a combinatorial domain, drug molecule synthesis, because unlike RL methods it generates diverse
x x
s by design.

Flow Networks

A flow network is a directed graph with sources and sinks, and edges carrying some amount of flow between them through intermediate nodes -- think of pipes of water. For our purposes, we define a flow network with a single source, the root or
s 0
; the sinks of the network correspond to the terminal states. We'll assign to each sink
x x
an ``out-flow''
R(x)=e^{-energy(x)}
.
Like MCMC methods, GFlowNet can turn a given energy function into samples but it does it in an amortized way, converting the cost a lot of very expensive MCMC trajectories (to obtain each sample) into the cost training a generative model (in our case a generative policy which sequentially builds up
x x
). Sampling from the generative model is then very cheap (e.g. adding one component at a time to a molecule) compared to an MCMC. But the most important gain may not be just computational, but in terms of the ability to discover new modes of the reward function.
MCMC methods are iterative, making many small noisy steps, which can converge in the neighborhood of a mode, and with some probability jump from one mode to a nearby one. However, if two modes are far from each other, MCMC can require exponential time to mix between the two. If in addition the modes occupy a tiny volume of the state space, the chances of initializing a chain near one of the unknown modes is also tiny, and the MCMC approach becomes unsatisfactory. Whereas such a situation seems hopeless with MCMC, GFlowNet has the potential to discover modes and jump there directly, if there is structure that relates the modes that it already knows, and if its inductive biases and training procedure make it possible to generalize there.
GFlowNet does not need to perfectly know where the modes are: it is sufficient to make guesses which occasionally work well. Like for MCMC methods, once a point in the region of new mode is discovered, further training of GFlowNet will sculpt that mode and zoom in on its peak.
Note that we can put
R ( x )
to some power
R(x)^\beta = e^{-\beta\; energy(x)}
, making it possible to focus more or less on the highest modes (versus spreading probability mass more uniformly).

Generating molecule graphs

The motivation for this work is to be able to generate diverse molecules from a proxy reward
R R
that is imprecise because it comes from biochemical simulations that have a high uncertainty. As such, we do not care about the maximizer as RL methods would, but rather about a set of ``good enough'' candidates to send to a true biochemical assay.
Another motivation is to have diversity: by fitting the distribution of rewards rather than trying to maximize the expected reward, we're likely to find more modes than if we were being greedy after having found a good enough mode, which again and again we've found RL methods such as PPO to do.
Here we generate molecule graphs via a sequence of additive edits, i.e. we progressively build the graph by adding new leaf nodes to it. We also create molecules block-by-block rather than atom-by-atom.
We find experimentally that we get both good molecules, and diverse ones. We compare ourselves to PPO and MARS (an MCMC-based method).
Figure 3 shows that we're fitting a distribution that makes sense. If we change the reward by exponentiating it as
R β
with
, this shifts the reward distribution to the right.
Figure 4 shows the top-
k k
found as a function of the number of episodes.

Finally, Figure 5 shows that using a biochemical measure of diversity to estimate the number of distinct modes found, GFlowNet finds much more varied candidates.

Active Learning experiments

The above experiments assume access to a reward
R R
that is cheap to evaluate. In fact it uses a neural network proxy trained from a large dataset of molecules. This setup isn't quite what we would get when interacting with biochemical assays, where we'd have access to much fewer data. To emulate such a setting, we consider our oracle to be a docking simulation (which is relatively expensive to run, ~30 cpu seconds).
In this setting, there is a limited budget for calls to the true oracle
O O
. We use a proxy