Attention
You are looking at the documentation for the pymbar 3 LTS branch. If you want to update to the latest pymbar version, see this page https://pymbar.readthedocs.io/en/stable/moving_from_pymbar3.html
The timeseries module pymbar.timeseries
¶
The pymbar.timeseries
module contains tools for dealing with timeseries data.
The MBAR method is only applicable to
uncorrelated samples from probability distributions, so we provide a number of tools that can be used to decorrelate
simulation data.
Automatically identifying the equilibrated production region¶
Most simulations start from initial conditions that are highly unrepresentative of equilibrated samples that occur
late in the simulation.
We can improve our estimates by discarding these initial regions to “equilibration” (also known as “burn-in”).
We recommend a simple scheme described in Ref. [2], which
identifies the production region as the final contiguous region containing the largest number of uncorrelated samples.
This scheme is implemented in the detectEquilibration()
method:
from pymbar import timeseries
[t0, g, Neff_max] = timeseries.detectEquilibration(A_t) # compute indices of uncorrelated timeseries
A_t_equil = A_t[t0:]
indices = timeseries.subsampleCorrelatedData(A_t_equil, g=g)
A_n = A_t_equil[indices]
In this example, the detectEquilibration()
method is used on the correlated timeseries A_t
to identify the
sample index corresponding to the beginning of the production region, t_0
, the statistical inefficiency of the
production region [t0:]
, g
, and the effective number of uncorrelated samples in the production
region, Neff_max
.
The production (equilibrated) region of the timeseries is extracted as A_t_equil
and then subsampled using
the subsampleCorrelatedData()
method with the provided statistical inefficiency g
.
Finally, the decorrelated samples are stored in A_n
.
Note that, by default, the statistical inefficiency is computed for every time origin in a call to
detectEquilibration()
, which can be slow.
If your dataset is more than a few hundred samples, you may want to evaluate only every nskip
samples as potential
time origins.
This may result in discarding slightly more data than strictly necessary, but may not have a significant impact if the
timeseries is long.
nskip = 10 # only try every 10 samples for time origins
[t0, g, Neff_max] = timeseries.detectEquilibration(A_t, nskip=nskip)
Subsampling timeseries data¶
If there is no need to discard the initial transient to equilibration, the subsampleCorrelatedData()
method can be
used directly to identify an effectively uncorrelated subset of data.
from pymbar import timeseries
indices = timeseries.subsampleCorrelatedData(A_t_equil)
A_n = A_t_equil[indices]
Here, the statistical inefficiency g
is computed automatically.
Other utility timeseries functions¶
A number of other useful functions for computing autocorrelation functions from one or more timeseries sampled from the same process are also provided.
A module for extracting uncorrelated samples from correlated timeseries data.
This module provides various tools that allow one to examine the correlation functions and integrated autocorrelation times in correlated timeseries data, compute statistical inefficiencies, and automatically extract uncorrelated samples for data analysis.
Please reference the following if you use this code in your research:
[1] Shirts MR and Chodera JD. Statistically optimal analysis of samples from multiple equilibrium states. J. Chem. Phys. 129:124105, 2008 http://dx.doi.org/10.1063/1.2978177
[2] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.
- pymbar.timeseries.detectEquilibration(A_t, fast=True, nskip=1)¶
Automatically detect equilibrated region of a dataset using a heuristic that maximizes number of effectively uncorrelated samples.
- Parameters
- A_tnp.ndarray
timeseries
- nskipint, optional, default=1
number of samples to sparsify data by in order to speed equilibration detection
- Returns
- tint
start of equilibrated data
- gfloat
statistical inefficiency of equilibrated data
- Neff_maxfloat
number of uncorrelated samples
Notes
If your input consists of some period of equilibration followed by a constant sequence, this function treats the trailing constant sequence as having Neff = 1.
Examples
Determine start of equilibrated data for a correlated timeseries.
>>> from pymbar import testsystems >>> A_t = testsystems.correlated_timeseries_example(N=1000, tau=5.0) # generate a test correlated timeseries >>> [t, g, Neff_max] = detectEquilibration(A_t) # compute indices of uncorrelated timeseries
Determine start of equilibrated data for a correlated timeseries with a shift.
>>> from pymbar import testsystems >>> A_t = testsystems.correlated_timeseries_example(N=1000, tau=5.0) + 2.0 # generate a test correlated timeseries >>> B_t = testsystems.correlated_timeseries_example(N=10000, tau=5.0) # generate a test correlated timeseries >>> C_t = np.concatenate([A_t, B_t]) >>> [t, g, Neff_max] = detectEquilibration(C_t, nskip=50) # compute indices of uncorrelated timeseries
- pymbar.timeseries.detectEquilibration_binary_search(A_t, bs_nodes=10)¶
Automatically detect equilibrated region of a dataset using a heuristic that maximizes number of effectively uncorrelated samples.
- Parameters
- A_tnp.ndarray
timeseries
- bs_nodesint > 4
number of geometrically distributed binary search nodes
- Returns
- tint
start of equilibrated data
- gfloat
statistical inefficiency of equilibrated data
- Neff_maxfloat
number of uncorrelated samples
Notes
Finds the discard region (t) by a binary search on the range of possible lagtime values, with logarithmic spacings. This will give a local maximum. The global maximum is not guaranteed, but will likely be found if the N_eff[t] varies smoothly.
- pymbar.timeseries.integratedAutocorrelationTime(A_n, B_n=None, fast=False, mintime=3)¶
Estimate the integrated autocorrelation time.
See also
- pymbar.timeseries.integratedAutocorrelationTimeMultiple(A_kn, fast=False)¶
Estimate the integrated autocorrelation time from multiple timeseries.
See also
- pymbar.timeseries.normalizedFluctuationCorrelationFunction(A_n, B_n=None, N_max=None, norm=True)¶
Compute the normalized fluctuation (cross) correlation function of (two) stationary timeseries.
C(t) = (<A(t) B(t)> - <A><B>) / (<AB> - <A><B>)
This may be useful in diagnosing odd time-correlations in timeseries data.
- Parameters
- A_nnp.ndarray
A_n[n] is nth value of timeseries A. Length is deduced from vector.
- B_nnp.ndarray
B_n[n] is nth value of timeseries B. Length is deduced from vector.
- N_maxint, default=None
if specified, will only compute correlation function out to time lag of N_max
- norm: bool, optional, default=True
if False will return the unnormalized correlation function D(t) = <A(t) B(t)>
- Returns
- C_nnp.ndarray
C_n[n] is the normalized fluctuation auto- or cross-correlation function for timeseries A(t) and B(t).
Notes
The same timeseries can be used for both A_n and B_n to get the autocorrelation statistical inefficiency. This procedure may be slow. The statistical error in C_n[n] will grow with increasing n. No effort is made here to estimate the uncertainty.
References
[1] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.
Examples
Estimate normalized fluctuation correlation function.
>>> from pymbar import testsystems >>> A_t = testsystems.correlated_timeseries_example(N=10000, tau=5.0) >>> C_t = normalizedFluctuationCorrelationFunction(A_t, N_max=25)
- pymbar.timeseries.normalizedFluctuationCorrelationFunctionMultiple(A_kn, B_kn=None, N_max=None, norm=True, truncate=False)¶
Compute the normalized fluctuation (cross) correlation function of (two) timeseries from multiple timeseries samples.
C(t) = (<A(t) B(t)> - <A><B>) / (<AB> - <A><B>) This may be useful in diagnosing odd time-correlations in timeseries data.
- Parameters
- A_knPython list of numpy arrays
A_kn[k] is the kth timeseries, and A_kn[k][n] is nth value of timeseries k. Length is deduced from arrays.
- B_knPython list of numpy arrays
B_kn[k] is the kth timeseries, and B_kn[k][n] is nth value of timeseries k. B_kn[k] must have same length as A_kn[k]
- N_maxint, optional, default=None
if specified, will only compute correlation function out to time lag of N_max
- norm: bool, optional, default=True
if False, will return unnormalized D(t) = <A(t) B(t)>
- truncate: bool, optional, default=False
if True, will stop calculating the correlation function when it goes below 0
- Returns
- C_n[n]np.ndarray
The normalized fluctuation auto- or cross-correlation function for timeseries A(t) and B(t).
Notes
The same timeseries can be used for both A_n and B_n to get the autocorrelation statistical inefficiency. This procedure may be slow. The statistical error in C_n[n] will grow with increasing n. No effort is made here to estimate the uncertainty.
References
[1] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.
Examples
Estimate a portion of the normalized fluctuation autocorrelation function from multiple timeseries of different length.
>>> from pymbar import testsystems >>> N_k = [1000, 2000, 3000, 4000, 5000] >>> tau = 5.0 # exponential relaxation time >>> A_kn = [ testsystems.correlated_timeseries_example(N=N, tau=tau) for N in N_k ] >>> C_n = normalizedFluctuationCorrelationFunctionMultiple(A_kn, N_max=25)
- pymbar.timeseries.statisticalInefficiency(A_n, B_n=None, fast=False, mintime=3, fft=False)¶
Compute the (cross) statistical inefficiency of (two) timeseries.
- Parameters
- A_nnp.ndarray, float
A_n[n] is nth value of timeseries A. Length is deduced from vector.
- B_nnp.ndarray, float, optional, default=None
B_n[n] is nth value of timeseries B. Length is deduced from vector. If supplied, the cross-correlation of timeseries A and B will be estimated instead of the autocorrelation of timeseries A.
- fastbool, optional, default=False
f True, will use faster (but less accurate) method to estimate correlation time, described in Ref. [1] (default: False). This is ignored when B_n=None and fft=True.
- mintimeint, optional, default=3
minimum amount of correlation function to compute (default: 3) The algorithm terminates after computing the correlation time out to mintime when the correlation function first goes negative. Note that this time may need to be increased if there is a strong initial negative peak in the correlation function.
- fftbool, optional, default=False
If fft=True and B_n=None, then use the fft based approach, as implemented in statisticalInefficiency_fft().
- Returns
- gnp.ndarray,
g is the estimated statistical inefficiency (equal to 1 + 2 tau, where tau is the correlation time). We enforce g >= 1.0.
Notes
The same timeseries can be used for both A_n and B_n to get the autocorrelation statistical inefficiency. The fast method described in Ref [1] is used to compute g.
References
- [1] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted
histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.
Examples
Compute statistical inefficiency of timeseries data with known correlation time.
>>> from pymbar.testsystems import correlated_timeseries_example >>> A_n = correlated_timeseries_example(N=100000, tau=5.0) >>> g = statisticalInefficiency(A_n, fast=True)
- pymbar.timeseries.statisticalInefficiencyMultiple(A_kn, fast=False, return_correlation_function=False)¶
Estimate the statistical inefficiency from multiple stationary timeseries (of potentially differing lengths).
- Parameters
- A_knlist of np.ndarrays
A_kn[k] is the kth timeseries, and A_kn[k][n] is nth value of timeseries k. Length is deduced from arrays.
- fastbool, optional, default=False
f True, will use faster (but less accurate) method to estimate correlation time, described in Ref. [1] (default: False)
- return_correlation_functionbool, optional, default=False
if True, will also return estimates of normalized fluctuation correlation function that were computed (default: False)
- Returns
- gnp.ndarray,
g is the estimated statistical inefficiency (equal to 1 + 2 tau, where tau is the correlation time). We enforce g >= 1.0.
- Ctlist (of tuples)
Ct[n] = (t, C) with time t and normalized correlation function estimate C is returned as well if return_correlation_function is set to True
Notes
The autocorrelation of the timeseries is used to compute the statistical inefficiency. The normalized fluctuation autocorrelation function is computed by averaging the unnormalized raw correlation functions. The fast method described in Ref [1] is used to compute g.
References
- [1] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted
histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.
Examples
Estimate statistical efficiency from multiple timeseries of different lengths.
>>> from pymbar import testsystems >>> N_k = [1000, 2000, 3000, 4000, 5000] >>> tau = 5.0 # exponential relaxation time >>> A_kn = [ testsystems.correlated_timeseries_example(N=N, tau=tau) for N in N_k ] >>> g = statisticalInefficiencyMultiple(A_kn)
Also return the values of the normalized fluctuation autocorrelation function that were computed.
>>> [g, Ct] = statisticalInefficiencyMultiple(A_kn, return_correlation_function=True)
- pymbar.timeseries.statisticalInefficiency_fft(A_n, mintime=3, memsafe=None)¶
Compute the (cross) statistical inefficiency of (two) timeseries.
- Parameters
- A_nnp.ndarray, float
A_n[n] is nth value of timeseries A. Length is deduced from vector.
- mintimeint, optional, default=3
minimum amount of correlation function to compute (default: 3) The algorithm terminates after computing the correlation time out to mintime when the correlation function first goes negative. Note that this time may need to be increased if there is a strong initial negative peak in the correlation function.
- memsafe: bool, optional, default=None (in depreciation)
If this function is used several times on arrays of comparable size then one might benefit from setting this option to False. If set to True then clear np.fft cache to avoid a fast increase in memory consumption when this function is called on many arrays of different sizes.
- Returns
- gnp.ndarray,
g is the estimated statistical inefficiency (equal to 1 + 2 tau, where tau is the correlation time). We enforce g >= 1.0.
Notes
The same timeseries can be used for both A_n and B_n to get the autocorrelation statistical inefficiency. The fast method described in Ref [1] is used to compute g.
References
- [1] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted
histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.
Determine the indices of an uncorrelated subsample of the data.
- Parameters
- A_tnp.ndarray
A_t[t] is the t-th value of timeseries A(t). Length is deduced from vector.
- gfloat, optional
if provided, the statistical inefficiency g is used to subsample the timeseries – otherwise it will be computed (default: None)
- fastbool, optional, default=False
fast can be set to True to give a less accurate but very quick estimate (default: False)
- conservativebool, optional, default=False
if set to True, uniformly-spaced indices are chosen with interval ceil(g), where g is the statistical inefficiency. Otherwise, indices are chosen non-uniformly with interval of approximately g in order to end up with approximately T/g total indices
- verbosebool, optional, default=False
if True, some output is printed
- Returns
- indiceslist of int
the indices of an uncorrelated subsample of the data
Notes
The statistical inefficiency is computed with the function computeStatisticalInefficiency().
Examples
Subsample a correlated timeseries to extract an effectively uncorrelated dataset.
>>> from pymbar import testsystems >>> A_t = testsystems.correlated_timeseries_example(N=10000, tau=5.0) # generate a test correlated timeseries >>> indices = subsampleCorrelatedData(A_t) # compute indices of uncorrelated timeseries >>> A_n = A_t[indices] # extract uncorrelated samples
Extract uncorrelated samples from multiple timeseries data from the same process.
>>> # Generate multiple correlated timeseries data of different lengths. >>> T_k = [1000, 2000, 3000, 4000, 5000] >>> K = len(T_k) # number of timeseries >>> tau = 5.0 # exponential relaxation time >>> A_kt = [ testsystems.correlated_timeseries_example(N=T, tau=tau) for T in T_k ] # A_kt[k] is correlated timeseries k >>> # Estimate statistical inefficiency from all timeseries data. >>> g = statisticalInefficiencyMultiple(A_kt) >>> # Count number of uncorrelated samples in each timeseries. >>> N_k = np.array([ len(subsampleCorrelatedData(A_t, g=g)) for A_t in A_kt ]) # N_k[k] is the number of uncorrelated samples in timeseries k >>> N = N_k.sum() # total number of uncorrelated samples >>> # Subsample all trajectories to produce uncorrelated samples >>> A_kn = [ A_t[subsampleCorrelatedData(A_t, g=g)] for A_t in A_kt ] # A_kn[k] is uncorrelated subset of trajectory A_kt[t] >>> # Concatenate data into one timeseries. >>> A_n = np.zeros([N], np.float32) # A_n[n] is nth sample in concatenated set of uncorrelated samples >>> A_n[0:N_k[0]] = A_kn[0] >>> for k in range(1,K): A_n[N_k[0:k].sum():N_k[0:k+1].sum()] = A_kn[k]