Strategies for solution
Approaches to solving the MBAR equations
The multistate reweighting approach to calculate free energies can be
formulated in several ways. The multistate reweighting equations are
a set of coupled, implicit equations for the free energies of K
states, given samples from these K states. If one can calculate the
energies of each of the K states, for each sample, then one can
solve for the K free energies satisfying the equations. The
solutions are unique only up to an overall constant, which pymbar
removes by setting the first free energy to zero, leaving K-1
free energies.
By rearrangement, this set of self-consistent equations can be written as the simultaneous roots of K equations. These K equations also turn out to be the gradient of a single maximum likelihood function of all the free energies. We can therefore find the MBAR solutions either by maximization/minimization techniques or by root finding.
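For reference, the self-consistent form of these equations from Shirts et al. [2] is

$$\hat{f}_i = -\ln \sum_{j=1}^{K} \sum_{n=1}^{N_j} \frac{\exp[-u_i(x_{jn})]}{\sum_{k=1}^{K} N_k \exp[\hat{f}_k - u_k(x_{jn})]},$$

where $u_k$ is the reduced potential of state k, $x_{jn}$ is the n-th sample drawn from state j, and $N_k$ is the number of samples drawn from state k.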
Because the second derivative of the likelihood is always negative,
there is only one possible solution. However, if there is poor
overlap between states, the optimum can lie in an extremely flat
region of solution space, where significant round-off errors result
in slow or no convergence to the solution; low overlap can also lead
to underflow and overflow that cause the solvers to fail.
scipy solvers
pymbar is set up to use the scipy.optimize.minimize() and
scipy.optimize.root() functionality to perform this minimization. We
use only the gradient-based methods, as the analytical gradient is
obtainable directly from the MBAR equations. Available
scipy.optimize.minimize() methods include L-BFGS-B, dogleg, CG, BFGS,
Newton-CG, TNC, trust-ncg, trust-krylov, trust-exact, and SLSQP, and
the scipy.optimize.root() options are hybr and lm. Methods that take
a Hessian (dogleg, Newton-CG, trust-ncg, trust-krylov, trust-exact)
are passed the analytical Hessian. Options can be passed to each of
these methods through the MBAR object initialization interface.
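For example (a minimal sketch, assuming only the solver_protocol keyword described in this section; the toy Gaussian data is invented purely for illustration):

import numpy as np
from pymbar import MBAR

rng = np.random.default_rng(0)

# Toy data: two Gaussian states with reduced potentials u_k(x) = (x - mu_k)^2 / 2
mu = np.array([0.0, 1.0])
N_k = np.array([50, 50])
x_n = np.concatenate([rng.normal(m, 1.0, size=n) for m, n in zip(mu, N_k)])
u_kn = 0.5 * (x_n[None, :] - mu[:, None]) ** 2

# The inner options dict is forwarded unchanged to the scipy method
protocol = (dict(method="L-BFGS-B", tol=1.0e-8, options=dict(maxiter=500)),)
mbar = MBAR(u_kn, N_k, solver_protocol=protocol)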
Built-in solutions
In addition to the scipy solvers, pymbar also includes an adaptive
solver designed directly for MBAR. At every step, the adaptive solver
both calculates the next iteration of the self-consistent iterative
formula presented in Shirts et al. [2] and takes a Newton-Raphson
step. It then calculates the gradients at the points resulting from
the two updates, and selects the move that makes the magnitude of the
gradient (i.e. the dot product of the gradient with itself) smallest.
Far from the solution, the self-consistent iteration tends to have
the smaller gradient, while closer to the solution, the
Newton-Raphson step tends to have the smaller gradient. The solver
always chooses the self-consistent iteration for the first
min_sc_iter iterations, since Newton-Raphson iteration can fail if
overlap is poor. The default for min_sc_iter is 2, but if one is
starting from a good guess for the free energies, one could set
min_sc_iter=0.
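The following is an illustrative sketch of this selection logic, not pymbar's internal code; it assumes f_k, u_kn (a K×N array of reduced potentials), and N_k are numpy float arrays, and the helper names are hypothetical:

import numpy as np
from scipy.special import logsumexp

def _log_denominator(f_k, u_kn, N_k):
    # log sum_k N_k exp(f_k - u_kn), evaluated stably for each sample n
    return logsumexp(f_k[:, None] - u_kn + np.log(N_k)[:, None], axis=0)

def gradient(f_k, u_kn, N_k):
    # Gradient of the MBAR objective; zero when sum_n W_in = 1 for every i
    log_w = f_k[:, None] - u_kn - _log_denominator(f_k, u_kn, N_k)
    return N_k * (np.exp(logsumexp(log_w, axis=1)) - 1.0)

def self_consistent_step(f_k, u_kn, N_k):
    # One pass of the self-consistent iteration of Shirts et al. [2]
    f_new = -logsumexp(-u_kn - _log_denominator(f_k, u_kn, N_k), axis=1)
    return f_new - f_new[0]  # remove the arbitrary additive constant

def newton_raphson_step(f_k, u_kn, N_k):
    # One Newton-Raphson step using the analytical gradient and Hessian
    log_p = np.log(N_k)[:, None] + f_k[:, None] - u_kn - _log_denominator(f_k, u_kn, N_k)
    P = np.exp(log_p)  # P[i, n] = N_i * W_in; columns sum to 1
    g = P.sum(axis=1) - N_k
    H = np.diag(P.sum(axis=1)) - P @ P.T
    step = np.zeros_like(f_k)
    step[1:] = np.linalg.solve(H[1:, 1:], -g[1:])  # H is singular; pin f_0 = 0
    return f_k + step

def adaptive_step(f_k, u_kn, N_k, iteration, min_sc_iter=2):
    f_sci = self_consistent_step(f_k, u_kn, N_k)
    if iteration < min_sc_iter:
        return f_sci  # force self-consistent moves early on
    f_nr = newton_raphson_step(f_k, u_kn, N_k)
    g_sci, g_nr = gradient(f_sci, u_kn, N_k), gradient(f_nr, u_kn, N_k)
    # Keep whichever candidate has the smaller gradient magnitude
    return f_sci if g_sci @ g_sci <= g_nr @ g_nr else f_nr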
Constructing solver protocols
We have found that, in general, different solvers have different convergence properties and behavior. Even at the same input tolerance level, different algorithms may not yield the same result. Although data sets whose states have significant overlap in configuration space usually converge successfully with all solver algorithms, different algorithms succeed and fail on different "hard" data sets. We have therefore constructed a very general interface to allow the user to try different algorithms.
pymbar uses the concept of a "solver protocol", in which a series of
methods is applied one or more times. In most cases, one will never
need to interact with this interface, because several different
protocols have already been implemented and are accessible with
string keywords. These are currently solver_protocol=default (also
the default) and solver_protocol=robust.
However, a user has the option of creating their own solver protocol. Solver protocols are created as tuples of dictionaries, where each dictionary specifies one optimization operation. After each operation, the user can either continue with the resulting free energies or restart from the initialized free energies.
Take, for example, the default solver protocol, designed to give a high-accuracy answer in most circumstances:
solver_protocol = (
    dict(method="hybr", continuation=True),
    dict(method="adaptive", options=dict(min_sc_iter=0)),
)
Here, the first pass through optimization uses the
scipy.optimize.root() method hybr, and then, if it is successful,
continues on by running the built-in adaptive method with no
self-consistent iteration choices forced, as described above. If hybr
fails, the protocol will still attempt to continue with the resulting
free energies from whatever point it ended at, issuing a warning in
case adaptive is also unable to reach a solution.
The options dictionary is passed on to the method, so any option that
the scipy method accepts according to its documentation can be passed
through this approach. tol is a direct argument to the scipy.optimize
methods, not one of the options, and is thus given directly in the
solver protocol dictionary.
solver_protocol = (
    dict(method="hybr"),
    dict(method="L-BFGS-B", tol=1.0e-5, continuation=True, options=dict(maxiter=1000)),
    dict(method="adaptive", tol=1.0e-12, options=dict(maxiter=1000, min_sc_iter=5)),
)
In this case, it would first use “hybr” with the default options and
tolerance and, if successful, exit. If not successful with “hybr”, it
would continue on to “L-BFGS-B”, with 1000 maximum iterations and a
tolerance of 1.0e-5, but would not use the results from the “hybr”
call. Whether “L-BFGS-B” was successful or unsuccessful, it would
pass its results to “adaptive”, which would choose the
self-consistent iteration for the first five iterations, with a
maximum of 1000 iterations and a tolerance of 1.0e-12.
Initialization
One can initialize the solution process in a number of ways. The
simplest is to start from all zeros, which is the default (and also
has keyword initialize=zeros). If the keyword f_k_initial is used,
then the supplied vector of length K is used as the initial guess for
the free energies.
Two other options for initialize are BAR and average-enthalpies.
average-enthalpies approximates the free energy of each state by the
average enthalpy of that state, which is valid in the limit of no
entropy differences between states. initialize=BAR can be used
whenever states are given in a natural sequence of overlap, such that
state 0 has the most configurational overlap with state 1, state 1
has significant configurational overlap with both states 0 and 2, and
so forth. In the limit that there is only overlap between neighboring
states, MBAR converges to give the same answer that BAR gives.
Although BAR also requires an iterative solution, it is a
single-variable problem, and thus the K-1 BAR iterations that need to
be done are much faster than a single K-1 dimensional problem. The
initial input for state k in the solution process is then the
cumulative sum of the pairwise BAR estimates,
$f_k = \sum_{i=0}^{k-1} \Delta f^{BAR}_{i,i+1}$.
Note that if both initialize and f_k_initial are specified, the logic
is somewhat different. Specifying f_k_initial overrides
initialize=zeros, but initialize=BAR starts each application of BAR
with the (reduced) free energy difference between states k and k+1
taken from f_k_initial.
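For instance (a minimal sketch, reusing the u_kn and N_k arrays from the earlier toy example):

# Start from BAR estimates; states must be ordered so that neighbors overlap
mbar = MBAR(u_kn, N_k, initialize="BAR")

# Or supply an explicit initial guess of length K
mbar = MBAR(u_kn, N_k, f_k_initial=np.array([0.0, 0.4]))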
Calculating uncertainties
The MBAR equations contain analytical estimates of uncertainties. These are essentially standard error propagation estimates; however, the functional form is a bit more complicated, since they include modifications for error propagation through implicit equations.
For free energies and expectations, one includes the analytical
uncertainties by adding the keyword compute_uncertainties=True. In
some cases, to perform additional error analysis, one might need
access to the covariance matrix of $\ln f_k$. This is accessed in
results['Theta'], and is included by setting compute_theta=True, or
whenever compute_uncertainties=True and uncertainty_method is not
bootstrap.
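As a usage sketch (the keyword spellings follow this section's description; the method name compute_free_energy_differences and the result keys Delta_f and dDelta_f are assumed from the pymbar 4 API):

# Analytical uncertainties; keyword spelling as described above
results = mbar.compute_free_energy_differences(compute_uncertainties=True)
Delta_f = results["Delta_f"]    # estimated free energy differences
dDelta_f = results["dDelta_f"]  # analytical uncertainty estimates
Theta = results["Theta"]        # covariance matrix for further error analysis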
If uncertainty_method=bootstrap, then the analytical error analysis
is not performed; instead, bootstrap samples are drawn from the
original distribution. Bootstrapping is done on each set of $N_k$
samples from each of the K states individually, rather than on the
data set as a whole, since the number of samples drawn from each
state should not change in the bootstrapping process, or it would
describe a different statistical process.
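To make the per-state resampling concrete, here is a small sketch (illustrative only, not pymbar's internal implementation) of how bootstrap indices can be drawn so that each state keeps its original number of samples:

import numpy as np

rng = np.random.default_rng()

def per_state_bootstrap_indices(N_k):
    # Resample with replacement within each state's block of samples,
    # so each bootstrap replicate keeps exactly N_k samples per state.
    blocks = []
    start = 0
    for N in N_k:
        blocks.append(start + rng.integers(0, N, size=N))
        start += N
    return np.concatenate(blocks)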
For bootstrapping to be used in calculating error estimates, the
MBAR
object must be initialized with the keyword n_bootstraps,
which must be an integer greater than zero. In general, 50–200
bootstraps should be used to estimate uncertainties with a good degree
of accuracy.
Note that users have complete control over the solver sequence for
bootstrapped solutions, using the same API as for solvers of the
original solution, with keyword bootstrap_solver_protocol
. As an
example, the default bootstrap protocol is:
bootstrap_solver_protocol = (
    dict(method="adaptive", tol=1e-6, options=dict(min_sc_iter=0, maximum_iterations=100)),
)
The solutions for bootstrapped data should be relatively close to the solutions for the original data set; additionally, they do not need to be quite as accurate, since they are used to compute the variances.
Bootstrapped uncertainties (using uncertainty_method=bootstrap) are
also available for all functions calculating expectations, but again
require initialization with n_bootstraps when initializing the MBAR
object.
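For example (a sketch assuming the pymbar 4 method name compute_expectations and reusing the toy arrays from the first example; here the coordinates x_n double as the observable):

# n_bootstraps must be a positive integer to enable bootstrapping
mbar = MBAR(u_kn, N_k, n_bootstraps=200)
results = mbar.compute_expectations(x_n, uncertainty_method="bootstrap")
# The returned dictionary contains the expectations and their
# bootstrapped uncertainty estimates.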