The timeseries module pymbar.timeseries

The pymbar.timeseries module contains tools for dealing with timeseries data. The MBAR method is only applicable to uncorrelated samples from probability distributions, so we provide a number of tools that can be used to decorrelate simulation data.

Automatically identifying the equilibrated production region

Most simulations start from initial conditions that are highly unrepresentative of equilibrated samples that occur late in the simulation. We can improve our estimates by discarding these initial regions to “equilibration” (also known as “burn-in”). We recommend a simple scheme described in Ref. [2], which identifies the production region as the final contiguous region containing the largest number of uncorrelated samples. This scheme is implemented in the detectEquilibration() method:

from pymbar import timeseries
[t0, g, Neff_max] = timeseries.detectEquilibration(A_t) # compute indices of uncorrelated timeseries
A_t_equil = A_t[t0:]
indices = timeseries.subsampleCorrelatedData(A_t_equil, g=g)
A_n = A_t_equil[indices]

In this example, the detectEquilibration() method is used on the correlated timeseries A_t to identify the sample index corresponding to the beginning of the production region, t_0, the statistical inefficiency of the production region [t0:], g, and the effective number of uncorrelated samples in the production region, Neff_max. The production (equilibrated) region of the timeseries is extracted as A_t_equil and then subsampled using the subsampleCorrelatedData() method with the provided statistical inefficiency g. Finally, the decorrelated samples are stored in A_n.

Note that, by default, the statistical inefficiency is computed for every time origin in a call to detectEquilibration(), which can be slow. If your dataset is more than a few hundred samples, you may want to evaluate only every nskip samples as potential time origins. This may result in discarding slightly more data than strictly necessary, but may not have a significant impact if the timeseries is long.

nskip = 10 # only try every 10 samples for time origins
[t0, g, Neff_max] = timeseries.detectEquilibration(A_t, nskip=nskip)

Subsampling timeseries data

If there is no need to discard the initial transient to equilibration, the subsampleCorrelatedData() method can be used directly to identify an effectively uncorrelated subset of data.

from pymbar import timeseries
indices = timeseries.subsampleCorrelatedData(A_t_equil)
A_n = A_t_equil[indices]

Here, the statistical inefficiency g is computed automatically.

Other utility timeseries functions

A number of other useful functions for computing autocorrelation functions from one or more timeseries sampled from the same process are also provided.

A module for extracting uncorrelated samples from correlated timeseries data.

This module provides various tools that allow one to examine the correlation functions and integrated autocorrelation times in correlated timeseries data, compute statistical inefficiencies, and automatically extract uncorrelated samples for data analysis.

Please reference the following if you use this code in your research:

[1] Shirts MR and Chodera JD. Statistically optimal analysis of samples from multiple equilibrium states. J. Chem. Phys. 129:124105, 2008 http://dx.doi.org/10.1063/1.2978177

[2] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.

pymbar.timeseries.detectEquilibration(A_t, fast=True, nskip=1)

Automatically detect equilibrated region of a dataset using a heuristic that maximizes number of effectively uncorrelated samples.

Parameters:

A_t : np.ndarray

timeseries

nskip : int, optional, default=1

number of samples to sparsify data by in order to speed equilibration detection

Returns:

t : int

start of equilibrated data

g : float

statistical inefficiency of equilibrated data

Neff_max : float

number of uncorrelated samples

Notes

If your input consists of some period of equilibration followed by a constant sequence, this function treats the trailing constant sequence as having Neff = 1.

Examples

Determine start of equilibrated data for a correlated timeseries.

>>> from pymbar import testsystems
>>> A_t = testsystems.correlated_timeseries_example(N=1000, tau=5.0) # generate a test correlated timeseries
>>> [t, g, Neff_max] = detectEquilibration(A_t) # compute indices of uncorrelated timeseries

Determine start of equilibrated data for a correlated timeseries with a shift.

>>> from pymbar import testsystems
>>> A_t = testsystems.correlated_timeseries_example(N=1000, tau=5.0) + 2.0 # generate a test correlated timeseries
>>> B_t = testsystems.correlated_timeseries_example(N=10000, tau=5.0) # generate a test correlated timeseries
>>> C_t = np.concatenate([A_t, B_t])
>>> [t, g, Neff_max] = detectEquilibration(C_t, nskip=50) # compute indices of uncorrelated timeseries

Automatically detect equilibrated region of a dataset using a heuristic that maximizes number of effectively uncorrelated samples.

Parameters:

A_t : np.ndarray

timeseries

bs_nodes : int > 4

number of geometrically distributed binary search nodes

Returns:

t : int

start of equilibrated data

g : float

statistical inefficiency of equilibrated data

Neff_max : float

number of uncorrelated samples

Notes

Finds the discard region (t) by a binary search on the range of possible lagtime values, with logarithmic spacings. This will give a local maximum. The global maximum is not guaranteed, but will likely be found if the N_eff[t] varies smoothly.

pymbar.timeseries.integratedAutocorrelationTime(A_n, B_n=None, fast=False, mintime=3)

Estimate the integrated autocorrelation time.

pymbar.timeseries.integratedAutocorrelationTimeMultiple(A_kn, fast=False)

Estimate the integrated autocorrelation time from multiple timeseries.

pymbar.timeseries.normalizedFluctuationCorrelationFunction(A_n, B_n=None, N_max=None, norm=True)

Compute the normalized fluctuation (cross) correlation function of (two) stationary timeseries.

C(t) = (<A(t) B(t)> - <A><B>) / (<AB> - <A><B>)

This may be useful in diagnosing odd time-correlations in timeseries data.

Parameters:

A_n : np.ndarray

A_n[n] is nth value of timeseries A. Length is deduced from vector.

B_n : np.ndarray

B_n[n] is nth value of timeseries B. Length is deduced from vector.

N_max : int, default=None

if specified, will only compute correlation function out to time lag of N_max

norm: bool, optional, default=True

if False will return the unnormalized correlation function D(t) = <A(t) B(t)>

Returns:

C_n : np.ndarray

C_n[n] is the normalized fluctuation auto- or cross-correlation function for timeseries A(t) and B(t).

Notes

The same timeseries can be used for both A_n and B_n to get the autocorrelation statistical inefficiency. This procedure may be slow. The statistical error in C_n[n] will grow with increasing n. No effort is made here to estimate the uncertainty.

References

[1] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.

Examples

Estimate normalized fluctuation correlation function.

>>> from pymbar import testsystems
>>> A_t = testsystems.correlated_timeseries_example(N=10000, tau=5.0)
>>> C_t = normalizedFluctuationCorrelationFunction(A_t, N_max=25)
pymbar.timeseries.normalizedFluctuationCorrelationFunctionMultiple(A_kn, B_kn=None, N_max=None, norm=True, truncate=False)

Compute the normalized fluctuation (cross) correlation function of (two) timeseries from multiple timeseries samples.

C(t) = (<A(t) B(t)> - <A><B>) / (<AB> - <A><B>) This may be useful in diagnosing odd time-correlations in timeseries data.

Parameters:

A_kn : Python list of numpy arrays

A_kn[k] is the kth timeseries, and A_kn[k][n] is nth value of timeseries k. Length is deduced from arrays.

B_kn : Python list of numpy arrays

B_kn[k] is the kth timeseries, and B_kn[k][n] is nth value of timeseries k. B_kn[k] must have same length as A_kn[k]

N_max : int, optional, default=None

if specified, will only compute correlation function out to time lag of N_max

norm: bool, optional, default=True

if False, will return unnormalized D(t) = <A(t) B(t)>

truncate: bool, optional, default=False

if True, will stop calculating the correlation function when it goes below 0

Returns:

C_n[n] : np.ndarray

The normalized fluctuation auto- or cross-correlation function for timeseries A(t) and B(t).

Notes

The same timeseries can be used for both A_n and B_n to get the autocorrelation statistical inefficiency. This procedure may be slow. The statistical error in C_n[n] will grow with increasing n. No effort is made here to estimate the uncertainty.

References

[1] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.

Examples

Estimate a portion of the normalized fluctuation autocorrelation function from multiple timeseries of different length.

>>> from pymbar import testsystems
>>> N_k = [1000, 2000, 3000, 4000, 5000]
>>> tau = 5.0 # exponential relaxation time
>>> A_kn = [ testsystems.correlated_timeseries_example(N=N, tau=tau) for N in N_k ]
>>> C_n = normalizedFluctuationCorrelationFunctionMultiple(A_kn, N_max=25)
pymbar.timeseries.statisticalInefficiency(A_n, B_n=None, fast=False, mintime=3, fft=False)

Compute the (cross) statistical inefficiency of (two) timeseries.

Parameters:

A_n : np.ndarray, float

A_n[n] is nth value of timeseries A. Length is deduced from vector.

B_n : np.ndarray, float, optional, default=None

B_n[n] is nth value of timeseries B. Length is deduced from vector. If supplied, the cross-correlation of timeseries A and B will be estimated instead of the autocorrelation of timeseries A.

fast : bool, optional, default=False

f True, will use faster (but less accurate) method to estimate correlation time, described in Ref. [1] (default: False). This is ignored when B_n=None and fft=True.

mintime : int, optional, default=3

minimum amount of correlation function to compute (default: 3) The algorithm terminates after computing the correlation time out to mintime when the correlation function first goes negative. Note that this time may need to be increased if there is a strong initial negative peak in the correlation function.

fft : bool, optional, default=False

If fft=True and B_n=None, then use the fft based approach, as implemented in statisticalInefficiency_fft().

Returns:

g : np.ndarray,

g is the estimated statistical inefficiency (equal to 1 + 2 tau, where tau is the correlation time). We enforce g >= 1.0.

Notes

The same timeseries can be used for both A_n and B_n to get the autocorrelation statistical inefficiency. The fast method described in Ref [1] is used to compute g.

References

[1] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted
histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.

Examples

Compute statistical inefficiency of timeseries data with known correlation time.

>>> from pymbar.testsystems import correlated_timeseries_example
>>> A_n = correlated_timeseries_example(N=100000, tau=5.0)
>>> g = statisticalInefficiency(A_n, fast=True)
pymbar.timeseries.statisticalInefficiencyMultiple(A_kn, fast=False, return_correlation_function=False)

Estimate the statistical inefficiency from multiple stationary timeseries (of potentially differing lengths).

Parameters:

A_kn : list of np.ndarrays

A_kn[k] is the kth timeseries, and A_kn[k][n] is nth value of timeseries k. Length is deduced from arrays.

fast : bool, optional, default=False

f True, will use faster (but less accurate) method to estimate correlation time, described in Ref. [1] (default: False)

return_correlation_function : bool, optional, default=False

if True, will also return estimates of normalized fluctuation correlation function that were computed (default: False)

Returns:

g : np.ndarray,

g is the estimated statistical inefficiency (equal to 1 + 2 tau, where tau is the correlation time). We enforce g >= 1.0.

Ct : list (of tuples)

Ct[n] = (t, C) with time t and normalized correlation function estimate C is returned as well if return_correlation_function is set to True

Notes

The autocorrelation of the timeseries is used to compute the statistical inefficiency. The normalized fluctuation autocorrelation function is computed by averaging the unnormalized raw correlation functions. The fast method described in Ref [1] is used to compute g.

References

[1] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted
histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.

Examples

Estimate statistical efficiency from multiple timeseries of different lengths.

>>> from pymbar import testsystems
>>> N_k = [1000, 2000, 3000, 4000, 5000]
>>> tau = 5.0 # exponential relaxation time
>>> A_kn = [ testsystems.correlated_timeseries_example(N=N, tau=tau) for N in N_k ]
>>> g = statisticalInefficiencyMultiple(A_kn)

Also return the values of the normalized fluctuation autocorrelation function that were computed.

>>> [g, Ct] = statisticalInefficiencyMultiple(A_kn, return_correlation_function=True)
pymbar.timeseries.statisticalInefficiency_fft(A_n, mintime=3, memsafe=None)

Compute the (cross) statistical inefficiency of (two) timeseries.

Parameters:

A_n : np.ndarray, float

A_n[n] is nth value of timeseries A. Length is deduced from vector.

mintime : int, optional, default=3

minimum amount of correlation function to compute (default: 3) The algorithm terminates after computing the correlation time out to mintime when the correlation function first goes negative. Note that this time may need to be increased if there is a strong initial negative peak in the correlation function.

memsafe: bool, optional, default=None (in depreciation)

If this function is used several times on arrays of comparable size then one might benefit from setting this option to False. If set to True then clear np.fft cache to avoid a fast increase in memory consumption when this function is called on many arrays of different sizes.

Returns:

g : np.ndarray,

g is the estimated statistical inefficiency (equal to 1 + 2 tau, where tau is the correlation time). We enforce g >= 1.0.

Notes

The same timeseries can be used for both A_n and B_n to get the autocorrelation statistical inefficiency. The fast method described in Ref [1] is used to compute g.

References

[1] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted
histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.
pymbar.timeseries.subsampleCorrelatedData(A_t, g=None, fast=False, conservative=False, verbose=False)

Determine the indices of an uncorrelated subsample of the data.

Parameters:

A_t : np.ndarray

A_t[t] is the t-th value of timeseries A(t). Length is deduced from vector.

g : float, optional

if provided, the statistical inefficiency g is used to subsample the timeseries – otherwise it will be computed (default: None)

fast : bool, optional, default=False

fast can be set to True to give a less accurate but very quick estimate (default: False)

conservative : bool, optional, default=False

if set to True, uniformly-spaced indices are chosen with interval ceil(g), where g is the statistical inefficiency. Otherwise, indices are chosen non-uniformly with interval of approximately g in order to end up with approximately T/g total indices

verbose : bool, optional, default=False

if True, some output is printed

Returns:

indices : list of int

the indices of an uncorrelated subsample of the data

Notes

The statistical inefficiency is computed with the function computeStatisticalInefficiency().

Examples

Subsample a correlated timeseries to extract an effectively uncorrelated dataset.

>>> from pymbar import testsystems
>>> A_t = testsystems.correlated_timeseries_example(N=10000, tau=5.0) # generate a test correlated timeseries
>>> indices = subsampleCorrelatedData(A_t) # compute indices of uncorrelated timeseries
>>> A_n = A_t[indices] # extract uncorrelated samples

Extract uncorrelated samples from multiple timeseries data from the same process.

>>> # Generate multiple correlated timeseries data of different lengths.
>>> T_k = [1000, 2000, 3000, 4000, 5000]
>>> K = len(T_k) # number of timeseries
>>> tau = 5.0 # exponential relaxation time
>>> A_kt = [ testsystems.correlated_timeseries_example(N=T, tau=tau) for T in T_k ] # A_kt[k] is correlated timeseries k
>>> # Estimate statistical inefficiency from all timeseries data.
>>> g = statisticalInefficiencyMultiple(A_kt)
>>> # Count number of uncorrelated samples in each timeseries.
>>> N_k = np.array([ len(subsampleCorrelatedData(A_t, g=g)) for A_t in A_kt ]) # N_k[k] is the number of uncorrelated samples in timeseries k
>>> N = N_k.sum() # total number of uncorrelated samples
>>> # Subsample all trajectories to produce uncorrelated samples
>>> A_kn = [ A_t[subsampleCorrelatedData(A_t, g=g)] for A_t in A_kt ] # A_kn[k] is uncorrelated subset of trajectory A_kt[t]
>>> # Concatenate data into one timeseries.
>>> A_n = np.zeros([N], np.float32) # A_n[n] is nth sample in concatenated set of uncorrelated samples
>>> A_n[0:N_k[0]] = A_kn[0]
>>> for k in range(1,K): A_n[N_k[0:k].sum():N_k[0:k+1].sum()] = A_kn[k]