The timeseries module pymbar.timeseries

The pymbar.timeseries module contains tools for dealing with timeseries data. The MBAR method is only applicable to uncorrelated samples from probability distributions, so we provide a number of tools that can be used to decorrelate simulation data.

Automatically identifying the equilibrated production region

Most simulations start from initial conditions that are highly unrepresentative of equilibrated samples that occur late in the simulation. We can improve our estimates by discarding these initial regions to “equilibration” (also known as “burn-in”). We recommend a simple scheme described in Ref. [3], which identifies the production region as the final contiguous region containing the largest number of uncorrelated samples. This scheme is implemented in the detect_equilibration() method:

from pymbar import timeseries
t0, g, Neff_max = timeseries.detect_equilibration(A_t) # compute indices of uncorrelated timeseries
A_t_equil = A_t[t0:]
indices = timeseries.subsample_correlated_data(A_t_equil, g=g)
A_n = A_t_equil[indices]

In this example, the detect_equilibration() method is used on the correlated timeseries A_t to identify the sample index corresponding to the beginning of the production region, t_0, the statistical inefficiency of the production region [t0:], g, and the effective number of uncorrelated samples in the production region, Neff_max. The production (equilibrated) region of the timeseries is extracted as A_t_equil and then subsampled using the subsample_correlated_data() method with the provided statistical inefficiency g. Finally, the decorrelated samples are stored in A_n.

Note that, by default, the statistical inefficiency is computed for every time origin in a call to detect_equilibration(), which can be slow. If your dataset is more than a few hundred samples, you may want to evaluate only every nskip samples as potential time origins. This may result in discarding slightly more data than strictly necessary, but may not have a significant impact if the timeseries is long.

nskip = 10 # only try every 10 samples for time origins
t0, g, Neff_max = timeseries.detect_equilibration(A_t, nskip=nskip)

Subsampling timeseries data

If there is no need to discard the initial transient to equilibration, the subsample_correlated_data() method can be used directly to identify an effectively uncorrelated subset of data.

from pymbar import timeseries
indices = timeseries.subsample_correlated_data(A_t_equil)
A_n = A_t_equil[indices]

Here, the statistical inefficiency g is computed automatically.

Other utility timeseries functions

A number of other useful functions for computing autocorrelation functions from one or more timeseries sampled from the same process are also provided.

A module for extracting uncorrelated samples from correlated timeseries data.

This module provides various tools that allow one to examine the correlation functions and integrated autocorrelation times in correlated timeseries data, compute statistical inefficiencies, and automatically extract uncorrelated samples for data analysis.

Please reference the following if you use this code in your research:

[1] Shirts MR and Chodera JD. Statistically optimal analysis of samples from multiple equilibrium states. J. Chem. Phys. 129:124105, 2008 http://dx.doi.org/10.1063/1.2978177

[2] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.

pymbar.timeseries.detect_equilibration(A_t, fast=True, nskip=1)

Automatically detect equilibrated region of a dataset using a heuristic that maximizes number of effectively uncorrelated samples.

Parameters:
  • A_t (np.ndarray) – timeseries

  • nskip (int, optional, default=1) – number of samples to sparsify data by in order to speed equilibration detection

Returns:

  • t (int) – start of equilibrated data

  • g (float) – statistical inefficiency of equilibrated data

  • Neff_max (float) – number of uncorrelated samples

Notes

If your input consists of some period of equilibration followed by a constant sequence, this function treats the trailing constant sequence as having Neff = 1.

Examples

Determine start of equilibrated data for a correlated timeseries.

>>> from pymbar import testsystems
>>> A_t = testsystems.correlated_timeseries_example(N=1000, tau=5.0) # generate a test correlated timeseries
>>> [t, g, Neff_max] = detect_equilibration(A_t) # compute indices of uncorrelated timeseries

Determine start of equilibrated data for a correlated timeseries with a shift.

>>> from pymbar import testsystems
>>> A_t = testsystems.correlated_timeseries_example(N=1000, tau=5.0) + 2.0 # generate a test correlated timeseries
>>> B_t = testsystems.correlated_timeseries_example(N=10000, tau=5.0) # generate a test correlated timeseries
>>> C_t = np.concatenate([A_t, B_t])
>>> [t, g, Neff_max] = detect_equilibration(C_t, nskip=50) # compute indices of uncorrelated timeseries

Automatically detect equilibrated region of a dataset using a heuristic that maximizes number of effectively uncorrelated samples.

Parameters:
  • A_t (np.ndarray) – timeseries

  • bs_nodes (int > 4) – number of geometrically distributed binary search nodes

Returns:

  • t (int) – start of equilibrated data

  • g (float) – statistical inefficiency of equilibrated data

  • Neff_max (float) – number of uncorrelated samples

Notes

Finds the discard region (t) by a binary search on the range of possible lagtime values, with logarithmic spacings. This will give a local maximum. The global maximum is not guaranteed, but will likely be found if the N_eff[t] varies smoothly.

pymbar.timeseries.integrated_autocorrelation_time(A_n, B_n=None, fast=False, mintime=3)

Estimate the integrated autocorrelation time.

See also

statisticalInefficiency

pymbar.timeseries.integrated_autocorrelation_timeMultiple(A_kn, fast=False)

Estimate the integrated autocorrelation time from multiple timeseries.

pymbar.timeseries.normalized_fluctuation_correlation_function(A_n, B_n=None, N_max=None, norm=True)

Compute the normalized fluctuation (cross) correlation function of (two) stationary timeseries.

C(t) = (<A(t) B(t)> - <A><B>) / (<AB> - <A><B>)

This may be useful in diagnosing odd time-correlations in timeseries data.

Parameters:
  • A_n (np.ndarray) – A_n[n] is nth value of timeseries A. Length is deduced from vector.

  • B_n (np.ndarray) – B_n[n] is nth value of timeseries B. Length is deduced from vector.

  • N_max (int, default=None) – if specified, will only compute correlation function out to time lag of N_max

  • norm (bool, optional, default=True) – if False will return the unnormalized correlation function D(t) = <A(t) B(t)>

Returns:

C_n – C_n[n] is the normalized fluctuation auto- or cross-correlation function for timeseries A(t) and B(t).

Return type:

np.ndarray

Notes

The same timeseries can be used for both A_n and B_n to get the autocorrelation statistical inefficiency. This procedure may be slow. The statistical error in C_n[n] will grow with increasing n. No effort is made here to estimate the uncertainty.

References

[1] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.

Examples

Estimate normalized fluctuation correlation function.

>>> from pymbar import testsystems
>>> A_t = testsystems.correlated_timeseries_example(N=10000, tau=5.0)
>>> C_t = normalized_fluctuation_correlation_function(A_t, N_max=25)
pymbar.timeseries.normalized_fluctuation_correlation_function_multiple(A_kn, B_kn=None, N_max=None, norm=True, truncate=False)

Compute the normalized fluctuation (cross) correlation function of (two) timeseries from multiple timeseries samples.

C(t) = (<A(t) B(t)> - <A><B>) / (<AB> - <A><B>) This may be useful in diagnosing odd time-correlations in timeseries data.

Parameters:
  • A_kn (Python list of numpy arrays) – A_kn[k] is the kth timeseries, and A_kn[k][n] is nth value of timeseries k. Length is deduced from arrays.

  • B_kn (Python list of numpy arrays) – B_kn[k] is the kth timeseries, and B_kn[k][n] is nth value of timeseries k. B_kn[k] must have same length as A_kn[k]

  • N_max (int, optional, default=None) – if specified, will only compute correlation function out to time lag of N_max

  • norm (bool, optional, default=True) – if False, will return unnormalized D(t) = <A(t) B(t)>

  • truncate (bool, optional, default=False) – if True, will stop calculating the correlation function when it goes below 0

Returns:

C_n[n] – The normalized fluctuation auto- or cross-correlation function for timeseries A(t) and B(t).

Return type:

np.ndarray

Notes

The same timeseries can be used for both A_n and B_n to get the autocorrelation statistical inefficiency. This procedure may be slow. The statistical error in C_n[n] will grow with increasing n. No effort is made here to estimate the uncertainty.

References

[1] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.

Examples

Estimate a portion of the normalized fluctuation autocorrelation function from multiple timeseries of different length.

>>> from pymbar import testsystems
>>> N_k = [1000, 2000, 3000, 4000, 5000]
>>> tau = 5.0 # exponential relaxation time
>>> A_kn = [ testsystems.correlated_timeseries_example(N=N, tau=tau) for N in N_k ]
>>> C_n = normalized_fluctuation_correlation_function_multiple(A_kn, N_max=25)
pymbar.timeseries.statistical_inefficiency(A_n, B_n=None, fast=False, mintime=3, fft=False)

Compute the (cross) statistical inefficiency of (two) timeseries.

Parameters:
  • A_n (np.ndarray, float) – A_n[n] is nth value of timeseries A. Length is deduced from vector.

  • B_n (np.ndarray, float, optional, default=None) – B_n[n] is nth value of timeseries B. Length is deduced from vector. If supplied, the cross-correlation of timeseries A and B will be estimated instead of the autocorrelation of timeseries A.

  • fast (bool, optional, default=False) – f True, will use faster (but less accurate) method to estimate correlation time, described in Ref. [1] (default: False). This is ignored when B_n=None and fft=True.

  • mintime (int, optional, default=3) – minimum amount of correlation function to compute (default: 3) The algorithm terminates after computing the correlation time out to mintime when the correlation function first goes negative. Note that this time may need to be increased if there is a strong initial negative peak in the correlation function.

  • fft (bool, optional, default=False) – If fft=True and B_n=None, then use the fft based approach, as implemented in statistical_inefficiency_fft().

Returns:

g – g is the estimated statistical inefficiency (equal to 1 + 2 tau, where tau is the correlation time). We enforce g >= 1.0.

Return type:

np.ndarray,

Notes

The same timeseries can be used for both A_n and B_n to get the autocorrelation statistical inefficiency. The fast method described in Ref [1] is used to compute g.

References

[1] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.

Examples

Compute statistical inefficiency of timeseries data with known correlation time.

>>> from pymbar.testsystems import correlated_timeseries_example
>>> A_n = correlated_timeseries_example(N=100000, tau=5.0)
>>> g = statistical_inefficiency(A_n, fast=True)
pymbar.timeseries.statistical_inefficiency_fft(A_n, mintime=3)

Compute the (cross) statistical inefficiency of (two) timeseries.

Parameters:
  • A_n (np.ndarray, float) – A_n[n] is nth value of timeseries A. Length is deduced from vector.

  • mintime (int, optional, default=3) – minimum amount of correlation function to compute (default: 3) The algorithm terminates after computing the correlation time out to mintime when the correlation function first goes negative. Note that this time may need to be increased if there is a strong initial negative peak in the correlation function.

Returns:

g – g is the estimated statistical inefficiency (equal to 1 + 2 tau, where tau is the correlation time). We enforce g >= 1.0.

Return type:

np.ndarray,

Notes

The same timeseries can be used for both A_n and B_n to get the autocorrelation statistical inefficiency. The fast method described in Ref [1] is used to compute g.

References

[1] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted

histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.

pymbar.timeseries.statistical_inefficiency_multiple(A_kn, fast=False, return_correlation_function=False)

Estimate the statistical inefficiency from multiple stationary timeseries (of potentially differing lengths).

Parameters:
  • A_kn (list of np.ndarrays) – A_kn[k] is the kth timeseries, and A_kn[k][n] is nth value of timeseries k. Length is deduced from arrays.

  • fast (bool, optional, default=False) – f True, will use faster (but less accurate) method to estimate correlation time, described in Ref. [1] (default: False)

  • return_correlation_function (bool, optional, default=False) – if True, will also return estimates of normalized fluctuation correlation function that were computed (default: False)

Returns:

  • g (np.ndarray,) – g is the estimated statistical inefficiency (equal to 1 + 2 tau, where tau is the correlation time). We enforce g >= 1.0.

  • Ct (list (of tuples)) – Ct[n] = (t, C) with time t and normalized correlation function estimate C is returned as well if return_correlation_function is set to True

Notes

The autocorrelation of the timeseries is used to compute the statistical inefficiency. The normalized fluctuation autocorrelation function is computed by averaging the unnormalized raw correlation functions. The fast method described in Ref [1] is used to compute g.

References

[1] J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, and K. A. Dill. Use of the weighted

histogram analysis method for the analysis of simulated and parallel tempering simulations. JCTC 3(1):26-41, 2007.

Examples

Estimate statistical efficiency from multiple timeseries of different lengths.

>>> from pymbar import testsystems
>>> N_k = [1000, 2000, 3000, 4000, 5000]
>>> tau = 5.0 # exponential relaxation time
>>> A_kn = [ testsystems.correlated_timeseries_example(N=N, tau=tau) for N in N_k ]
>>> g = statistical_inefficiency_multiple(A_kn)

Also return the values of the normalized fluctuation autocorrelation function that were computed.

>>> [g, Ct] = statistical_inefficiency_multiple(A_kn, return_correlation_function=True)
pymbar.timeseries.subsample_correlated_data(A_t, g=None, fast=False, conservative=False, verbose=False)

Determine the indices of an uncorrelated subsample of the data.

Parameters:
  • A_t (np.ndarray) – A_t[t] is the t-th value of timeseries A(t). Length is deduced from vector.

  • g (float, optional) – if provided, the statistical inefficiency g is used to subsample the timeseries – otherwise it will be computed (default: None)

  • fast (bool, optional, default=False) – fast can be set to True to give a less accurate but very quick estimate (default: False)

  • conservative (bool, optional, default=False) – if set to True, uniformly-spaced indices are chosen with interval ceil(g), where g is the statistical inefficiency. Otherwise, indices are chosen non-uniformly with interval of approximately g in order to end up with approximately T/g total indices

  • verbose (bool, optional, default=False) – if True, some output is printed

Returns:

indices – the indices of an uncorrelated subsample of the data

Return type:

list of int

Notes

The statistical inefficiency is computed with the function computeStatisticalInefficiency().

Examples

Subsample a correlated timeseries to extract an effectively uncorrelated dataset.

>>> from pymbar import testsystems
>>> A_t = testsystems.correlated_timeseries_example(N=10000, tau=5.0) # generate a test correlated timeseries
>>> indices = subsample_correlated_data(A_t) # compute indices of uncorrelated timeseries
>>> A_n = A_t[indices] # extract uncorrelated samples

Extract uncorrelated samples from multiple timeseries data from the same process.

>>> # Generate multiple correlated timeseries data of different lengths.
>>> T_k = [1000, 2000, 3000, 4000, 5000]
>>> K = len(T_k) # number of timeseries
>>> tau = 5.0 # exponential relaxation time
>>> A_kt = [ testsystems.correlated_timeseries_example(N=T, tau=tau) for T in T_k ] # A_kt[k] is correlated timeseries k
>>> # Estimate statistical inefficiency from all timeseries data.
>>> g = statistical_inefficiency_multiple(A_kt)
>>> # Count number of uncorrelated samples in each timeseries.
>>> N_k = np.array([ len(subsample_correlated_data(A_t, g=g)) for A_t in A_kt ]) # N_k[k] is the number of uncorrelated samples in timeseries k
>>> N = N_k.sum() # total number of uncorrelated samples
>>> # Subsample all trajectories to produce uncorrelated samples
>>> A_kn = [ A_t[subsample_correlated_data(A_t, g=g)] for A_t in A_kt ] # A_kn[k] is uncorrelated subset of trajectory A_kt[t]
>>> # Concatenate data into one timeseries.
>>> A_n = np.zeros([N], np.float32) # A_n[n] is nth sample in concatenated set of uncorrelated samples
>>> A_n[0:N_k[0]] = A_kn[0]
>>> for k in range(1,K): A_n[N_k[0:k].sum():N_k[0:k+1].sum()] = A_kn[k]