The timeseries module pymbar.timeseries
The pymbar.timeseries
module contains tools for dealing with timeseries data.
The MBAR method is only applicable to
uncorrelated samples from probability distributions, so we provide a number of tools that can be used to decorrelate
simulation data.
Automatically identifying the equilibrated production region
Most simulations start from initial conditions that are highly unrepresentative of equilibrated samples that occur
late in the simulation.
We can improve our estimates by discarding these initial regions to “equilibration” (also known as “burn-in”).
We recommend a simple scheme described in Ref. [3], which
identifies the production region as the final contiguous region containing the largest number of uncorrelated samples.
This scheme is implemented in the detect_equilibration()
method:
from pymbar import timeseries
t0, g, Neff_max = timeseries.detect_equilibration(A_t) # compute indices of uncorrelated timeseries
A_t_equil = A_t[t0:]
indices = timeseries.subsample_correlated_data(A_t_equil, g=g)
A_n = A_t_equil[indices]
In this example, the detect_equilibration()
method is used on the correlated timeseries A_t
to identify the
sample index corresponding to the beginning of the production region, t_0
, the statistical inefficiency of the
production region [t0:]
, g
, and the effective number of uncorrelated samples in the production
region, Neff_max
.
The production (equilibrated) region of the timeseries is extracted as A_t_equil
and then subsampled using
the subsample_correlated_data()
method with the provided statistical inefficiency g
.
Finally, the decorrelated samples are stored in A_n
.
Note that, by default, the statistical inefficiency is computed for every time origin in a call to
detect_equilibration()
, which can be slow.
If your dataset is more than a few hundred samples, you may want to evaluate only every nskip
samples as potential
time origins.
This may result in discarding slightly more data than strictly necessary, but may not have a significant impact if the
timeseries is long.
nskip = 10 # only try every 10 samples for time origins
t0, g, Neff_max = timeseries.detect_equilibration(A_t, nskip=nskip)
Subsampling timeseries data
If there is no need to discard the initial transient to equilibration, the subsample_correlated_data()
method can be
used directly to identify an effectively uncorrelated subset of data.
from pymbar import timeseries
indices = timeseries.subsample_correlated_data(A_t_equil)
A_n = A_t_equil[indices]
Here, the statistical inefficiency g
is computed automatically.
Other utility timeseries functions
A number of other useful functions for computing autocorrelation functions from one or more timeseries sampled from the same process are also provided.
A module for extracting uncorrelated samples from correlated timeseries data.
This module provides various tools that allow one to examine the correlation functions and integrated autocorrelation times in correlated timeseries data, compute statistical inefficiencies, and automatically extract uncorrelated samples for data analysis.
Please reference the following if you use this code in your research:
[1] Shirts MR and Chodera JD. Statistically optimal analysis of samples from multiple equilibrium states. J. Chem. Phys. 129:124105, 2008 http://dx.doi.org/10.1063/1.2978177
[2] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.
- pymbar.timeseries.detect_equilibration(A_t, fast=True, nskip=1)
Automatically detect equilibrated region of a dataset using a heuristic that maximizes number of effectively uncorrelated samples.
- Parameters:
A_t (np.ndarray) – timeseries
nskip (int, optional, default=1) – number of samples to sparsify data by in order to speed equilibration detection
- Returns:
t (int) – start of equilibrated data
g (float) – statistical inefficiency of equilibrated data
Neff_max (float) – number of uncorrelated samples
Notes
If your input consists of some period of equilibration followed by a constant sequence, this function treats the trailing constant sequence as having Neff = 1.
Examples
Determine start of equilibrated data for a correlated timeseries.
>>> from pymbar import testsystems >>> A_t = testsystems.correlated_timeseries_example(N=1000, tau=5.0) # generate a test correlated timeseries >>> [t, g, Neff_max] = detect_equilibration(A_t) # compute indices of uncorrelated timeseries
Determine start of equilibrated data for a correlated timeseries with a shift.
>>> from pymbar import testsystems >>> A_t = testsystems.correlated_timeseries_example(N=1000, tau=5.0) + 2.0 # generate a test correlated timeseries >>> B_t = testsystems.correlated_timeseries_example(N=10000, tau=5.0) # generate a test correlated timeseries >>> C_t = np.concatenate([A_t, B_t]) >>> [t, g, Neff_max] = detect_equilibration(C_t, nskip=50) # compute indices of uncorrelated timeseries
- pymbar.timeseries.detect_equilibration_binary_search(A_t, bs_nodes=10)
Automatically detect equilibrated region of a dataset using a heuristic that maximizes number of effectively uncorrelated samples.
- Parameters:
A_t (np.ndarray) – timeseries
bs_nodes (int > 4) – number of geometrically distributed binary search nodes
- Returns:
t (int) – start of equilibrated data
g (float) – statistical inefficiency of equilibrated data
Neff_max (float) – number of uncorrelated samples
Notes
Finds the discard region (t) by a binary search on the range of possible lagtime values, with logarithmic spacings. This will give a local maximum. The global maximum is not guaranteed, but will likely be found if the N_eff[t] varies smoothly.
- pymbar.timeseries.integrated_autocorrelation_time(A_n, B_n=None, fast=False, mintime=3)
Estimate the integrated autocorrelation time.
See also
statisticalInefficiency
- pymbar.timeseries.integrated_autocorrelation_timeMultiple(A_kn, fast=False)
Estimate the integrated autocorrelation time from multiple timeseries.
See also
- pymbar.timeseries.normalized_fluctuation_correlation_function(A_n, B_n=None, N_max=None, norm=True)
Compute the normalized fluctuation (cross) correlation function of (two) stationary timeseries.
C(t) = (<A(t) B(t)> - <A><B>) / (<AB> - <A><B>)
This may be useful in diagnosing odd time-correlations in timeseries data.
- Parameters:
A_n (np.ndarray) – A_n[n] is nth value of timeseries A. Length is deduced from vector.
B_n (np.ndarray) – B_n[n] is nth value of timeseries B. Length is deduced from vector.
N_max (int, default=None) – if specified, will only compute correlation function out to time lag of N_max
norm (bool, optional, default=True) – if False will return the unnormalized correlation function D(t) = <A(t) B(t)>
- Returns:
C_n – C_n[n] is the normalized fluctuation auto- or cross-correlation function for timeseries A(t) and B(t).
- Return type:
np.ndarray
Notes
The same timeseries can be used for both A_n and B_n to get the autocorrelation statistical inefficiency. This procedure may be slow. The statistical error in C_n[n] will grow with increasing n. No effort is made here to estimate the uncertainty.
References
[1] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.
Examples
Estimate normalized fluctuation correlation function.
>>> from pymbar import testsystems >>> A_t = testsystems.correlated_timeseries_example(N=10000, tau=5.0) >>> C_t = normalized_fluctuation_correlation_function(A_t, N_max=25)
- pymbar.timeseries.normalized_fluctuation_correlation_function_multiple(A_kn, B_kn=None, N_max=None, norm=True, truncate=False)
Compute the normalized fluctuation (cross) correlation function of (two) timeseries from multiple timeseries samples.
C(t) = (<A(t) B(t)> - <A><B>) / (<AB> - <A><B>) This may be useful in diagnosing odd time-correlations in timeseries data.
- Parameters:
A_kn (Python list of numpy arrays) – A_kn[k] is the kth timeseries, and A_kn[k][n] is nth value of timeseries k. Length is deduced from arrays.
B_kn (Python list of numpy arrays) – B_kn[k] is the kth timeseries, and B_kn[k][n] is nth value of timeseries k. B_kn[k] must have same length as A_kn[k]
N_max (int, optional, default=None) – if specified, will only compute correlation function out to time lag of N_max
norm (bool, optional, default=True) – if False, will return unnormalized D(t) = <A(t) B(t)>
truncate (bool, optional, default=False) – if True, will stop calculating the correlation function when it goes below 0
- Returns:
C_n[n] – The normalized fluctuation auto- or cross-correlation function for timeseries A(t) and B(t).
- Return type:
np.ndarray
Notes
The same timeseries can be used for both A_n and B_n to get the autocorrelation statistical inefficiency. This procedure may be slow. The statistical error in C_n[n] will grow with increasing n. No effort is made here to estimate the uncertainty.
References
[1] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.
Examples
Estimate a portion of the normalized fluctuation autocorrelation function from multiple timeseries of different length.
>>> from pymbar import testsystems >>> N_k = [1000, 2000, 3000, 4000, 5000] >>> tau = 5.0 # exponential relaxation time >>> A_kn = [ testsystems.correlated_timeseries_example(N=N, tau=tau) for N in N_k ] >>> C_n = normalized_fluctuation_correlation_function_multiple(A_kn, N_max=25)
- pymbar.timeseries.statistical_inefficiency(A_n, B_n=None, fast=False, mintime=3, fft=False)
Compute the (cross) statistical inefficiency of (two) timeseries.
- Parameters:
A_n (np.ndarray, float) – A_n[n] is nth value of timeseries A. Length is deduced from vector.
B_n (np.ndarray, float, optional, default=None) – B_n[n] is nth value of timeseries B. Length is deduced from vector. If supplied, the cross-correlation of timeseries A and B will be estimated instead of the autocorrelation of timeseries A.
fast (bool, optional, default=False) – f True, will use faster (but less accurate) method to estimate correlation time, described in Ref. [1] (default: False). This is ignored when B_n=None and fft=True.
mintime (int, optional, default=3) – minimum amount of correlation function to compute (default: 3) The algorithm terminates after computing the correlation time out to mintime when the correlation function first goes negative. Note that this time may need to be increased if there is a strong initial negative peak in the correlation function.
fft (bool, optional, default=False) – If fft=True and B_n=None, then use the fft based approach, as implemented in statistical_inefficiency_fft().
- Returns:
g – g is the estimated statistical inefficiency (equal to 1 + 2 tau, where tau is the correlation time). We enforce g >= 1.0.
- Return type:
np.ndarray,
Notes
The same timeseries can be used for both A_n and B_n to get the autocorrelation statistical inefficiency. The fast method described in Ref [1] is used to compute g.
References
[1] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.
Examples
Compute statistical inefficiency of timeseries data with known correlation time.
>>> from pymbar.testsystems import correlated_timeseries_example >>> A_n = correlated_timeseries_example(N=100000, tau=5.0) >>> g = statistical_inefficiency(A_n, fast=True)
- pymbar.timeseries.statistical_inefficiency_fft(A_n, mintime=3)
Compute the (cross) statistical inefficiency of (two) timeseries.
- Parameters:
A_n (np.ndarray, float) – A_n[n] is nth value of timeseries A. Length is deduced from vector.
mintime (int, optional, default=3) – minimum amount of correlation function to compute (default: 3) The algorithm terminates after computing the correlation time out to mintime when the correlation function first goes negative. Note that this time may need to be increased if there is a strong initial negative peak in the correlation function.
- Returns:
g – g is the estimated statistical inefficiency (equal to 1 + 2 tau, where tau is the correlation time). We enforce g >= 1.0.
- Return type:
np.ndarray,
Notes
The same timeseries can be used for both A_n and B_n to get the autocorrelation statistical inefficiency. The fast method described in Ref [1] is used to compute g.
References
- [1] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted
histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.
- pymbar.timeseries.statistical_inefficiency_multiple(A_kn, fast=False, return_correlation_function=False)
Estimate the statistical inefficiency from multiple stationary timeseries (of potentially differing lengths).
- Parameters:
A_kn (list of np.ndarrays) – A_kn[k] is the kth timeseries, and A_kn[k][n] is nth value of timeseries k. Length is deduced from arrays.
fast (bool, optional, default=False) – f True, will use faster (but less accurate) method to estimate correlation time, described in Ref. [1] (default: False)
return_correlation_function (bool, optional, default=False) – if True, will also return estimates of normalized fluctuation correlation function that were computed (default: False)
- Returns:
g (np.ndarray,) – g is the estimated statistical inefficiency (equal to 1 + 2 tau, where tau is the correlation time). We enforce g >= 1.0.
Ct (list (of tuples)) – Ct[n] = (t, C) with time t and normalized correlation function estimate C is returned as well if return_correlation_function is set to True
Notes
The autocorrelation of the timeseries is used to compute the statistical inefficiency. The normalized fluctuation autocorrelation function is computed by averaging the unnormalized raw correlation functions. The fast method described in Ref [1] is used to compute g.
References
- [1] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted
histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.
Examples
Estimate statistical efficiency from multiple timeseries of different lengths.
>>> from pymbar import testsystems >>> N_k = [1000, 2000, 3000, 4000, 5000] >>> tau = 5.0 # exponential relaxation time >>> A_kn = [ testsystems.correlated_timeseries_example(N=N, tau=tau) for N in N_k ] >>> g = statistical_inefficiency_multiple(A_kn)
Also return the values of the normalized fluctuation autocorrelation function that were computed.
>>> [g, Ct] = statistical_inefficiency_multiple(A_kn, return_correlation_function=True)
Determine the indices of an uncorrelated subsample of the data.
- Parameters:
A_t (np.ndarray) – A_t[t] is the t-th value of timeseries A(t). Length is deduced from vector.
g (float, optional) – if provided, the statistical inefficiency g is used to subsample the timeseries – otherwise it will be computed (default: None)
fast (bool, optional, default=False) – fast can be set to True to give a less accurate but very quick estimate (default: False)
conservative (bool, optional, default=False) – if set to True, uniformly-spaced indices are chosen with interval ceil(g), where g is the statistical inefficiency. Otherwise, indices are chosen non-uniformly with interval of approximately g in order to end up with approximately T/g total indices
verbose (bool, optional, default=False) – if True, some output is printed
- Returns:
indices – the indices of an uncorrelated subsample of the data
- Return type:
list of int
Notes
The statistical inefficiency is computed with the function computeStatisticalInefficiency().
Examples
Subsample a correlated timeseries to extract an effectively uncorrelated dataset.
>>> from pymbar import testsystems >>> A_t = testsystems.correlated_timeseries_example(N=10000, tau=5.0) # generate a test correlated timeseries >>> indices = subsample_correlated_data(A_t) # compute indices of uncorrelated timeseries >>> A_n = A_t[indices] # extract uncorrelated samples
Extract uncorrelated samples from multiple timeseries data from the same process.
>>> # Generate multiple correlated timeseries data of different lengths. >>> T_k = [1000, 2000, 3000, 4000, 5000] >>> K = len(T_k) # number of timeseries >>> tau = 5.0 # exponential relaxation time >>> A_kt = [ testsystems.correlated_timeseries_example(N=T, tau=tau) for T in T_k ] # A_kt[k] is correlated timeseries k >>> # Estimate statistical inefficiency from all timeseries data. >>> g = statistical_inefficiency_multiple(A_kt) >>> # Count number of uncorrelated samples in each timeseries. >>> N_k = np.array([ len(subsample_correlated_data(A_t, g=g)) for A_t in A_kt ]) # N_k[k] is the number of uncorrelated samples in timeseries k >>> N = N_k.sum() # total number of uncorrelated samples >>> # Subsample all trajectories to produce uncorrelated samples >>> A_kn = [ A_t[subsample_correlated_data(A_t, g=g)] for A_t in A_kt ] # A_kn[k] is uncorrelated subset of trajectory A_kt[t] >>> # Concatenate data into one timeseries. >>> A_n = np.zeros([N], np.float32) # A_n[n] is nth sample in concatenated set of uncorrelated samples >>> A_n[0:N_k[0]] = A_kn[0] >>> for k in range(1,K): A_n[N_k[0:k].sum():N_k[0:k+1].sum()] = A_kn[k]