Validation and testing analysis

Here we present two experiments: one tests the accuracy of HMMSeg in recovering known model parameters, and the other judges the efficacy of wavelet smoothing in removing noise from an embedded signal.

Accuracy of model estimation and state assignments

The purpose of this set of experiments is to judge the ability of HMMSeg to recapitulate parameters used to generate synthetic data from a model of the same type assumed by HMMSeg. Specifically, data are generated from a two-state model, with output values coming from two Gaussian distributions: N(0, sigma) for state 0 and N(1, sigma) for state 1. Random transitions between states are governed by fixed probabilities: 0.01 for switching between states and 0.99 for staying in the same state.
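The generative model described above can be sketched in a few lines of Python. This is an illustrative sketch, not the code used for the experiments: the function name `simulate_two_state`, its parameters, and the uniformly random initial state are assumptions made for the example.

```python
import random

def simulate_two_state(n, sigma, p_switch=0.01, seed=None):
    """Return (states, values) for a two-state Gaussian model.

    State k emits from N(k, sigma), where sigma is the standard deviation.
    After each observation the chain switches state with probability
    p_switch. The uniformly random initial state is an assumption.
    """
    rng = random.Random(seed)
    state = rng.randrange(2)
    states, values = [], []
    for _ in range(n):
        states.append(state)
        values.append(rng.gauss(float(state), sigma))
        if rng.random() < p_switch:
            state = 1 - state
    return states, values

states, values = simulate_two_state(n=200, sigma=0.3, seed=1)
```

With a switch probability of 0.01, the expected length of a run in one state is 100 observations, so the shortest series (N = 200) contain only a handful of segments.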

Each experiment therefore consists of generating a time series of a particular length from the model, and applying HMMSeg (no smoothing) to the resulting time series, with nstates set to 2. The fitted parameters for emission and transition probabilities are then compared to those of the generative model. The accuracy of the state assignment (concordance) is also measured, by computing the percent of observations that were assigned by HMMSeg to the same state as that of the generative model.
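As a minimal sketch of the concordance measure, the function below computes the percentage of positions at which two state sequences agree. The name `concordance` is hypothetical, and the sketch assumes the fitted state labels are already aligned with the generative ones (state 0 is the low-mean state); how label assignment was handled in the actual experiments is not stated here.

```python
def concordance(true_states, predicted_states):
    """Percentage of positions where the two labelings agree."""
    if len(true_states) != len(predicted_states):
        raise ValueError("state sequences must have equal length")
    if not true_states:
        raise ValueError("state sequences must be non-empty")
    matches = sum(t == p for t, p in zip(true_states, predicted_states))
    return 100.0 * matches / len(true_states)

truth = [0, 0, 0, 1, 1, 1, 1, 0]
predicted = [0, 0, 1, 1, 1, 1, 0, 0]
print(concordance(truth, predicted))  # 75.0
```

For a two-state model with roughly balanced states, a concordance near 50 corresponds to a segmentation that is no better than random assignment.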

The heatmaps below show the average differences between the actual and the HMMSeg-predicted emission means and standard deviations, and the segmentation concordance. The standard deviation sigma was varied over the set 0.1, 0.3, 1, 3, and 10, and the time series length N varied from 200 to 102,400 (ten different lengths), for a total of 50 combinations of sigma and N. We performed 10 separate simulations for each (sigma, N) pair, for a total of 500 simulations. Each box in the heatmap reflects the average over ten simulations. Column headers are emission standard deviations, and row headers are time series lengths. Heatmaps were generated using matrix2png.
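The simulation grid can be enumerated as follows. The individual lengths are not listed in the text; the sketch assumes they double at each step, which is the spacing that gives exactly ten lengths between 200 and 102,400.

```python
sigmas = [0.1, 0.3, 1, 3, 10]
# Assumption: the ten lengths double from 200 up to 102,400.
lengths = [200 * 2 ** k for k in range(10)]
replicates = 10

# One heatmap cell per (sigma, N) pair.
grid = [(sigma, n) for sigma in sigmas for n in lengths]

print(lengths[0], lengths[-1])             # 200 102400
print(len(grid), len(grid) * replicates)   # 50 500
```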

We see that HMMSeg is generally very successful in recapitulating all of the generative model parameters and segmentations for sigma <= 1. Performance degrades rapidly for sigma >= 3. Although there seems to be improvement in all accuracy measures as N increases from 200 to 800, there is no clear further improvement when N increases past 800.

Concordances:

Larger values indicate greater degrees of agreement with the correct segmentation.

State 0 means:

Smaller values indicate smaller differences (greater degrees of agreement) between the actual mean of 0 and the estimated mean.

State 1 means:

Smaller values indicate smaller differences (greater degrees of agreement) between the actual mean of 1 and the estimated mean.

State 0 standard deviations:

Smaller values indicate smaller differences (greater degrees of agreement) between the actual standard deviation indicated in the column header and the estimated standard deviation.

State 1 standard deviations:

Smaller values indicate smaller differences (greater degrees of agreement) between the actual standard deviation indicated in the column header and the estimated standard deviation.

Regular-frequency noise filtering using wavelet smoothing

The purpose of this experiment is to illustrate the effectiveness of wavelet smoothing as a technique for removing high-frequency noise. We start with a generative two-state model as above, with sigma = 0.5, and fixed state transitions every 500 observations (an effective transition probability of 0.002). We then superimpose upon that signal a sinusoidal noise curve. The frequency of the noise is chosen to be much higher than the frequency of the state transitions (period 20 for the sinusoid versus a state transition every 500 observations), while the amplitude of the noise (400, as in the formula given below) is chosen to be so great as to completely mask the amplitude of the state transitions.

We generated a single time series of length 10,000 using this model. We then applied HMMSeg to the raw data, and to the data after wavelet smoothing at a scale of 64 bp. The table below illustrates the futility of trying to recover the embedded signal without wavelet noise reduction: the concordance of the final segmentation with the original segmentation is 50%, or essentially no better than random. When wavelet smoothing is applied, however, the concordance jumps to 97%. The figures below illustrate the effect.

Time series: length 10000, resolution 1 bp, alternating length-500 segments
Noise superimposed on the signal: 400 cos(pi * i / 10) (20-bp period)
Smoothing scale: 64 bp
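
The test series specified above can be reproduced approximately with the sketch below. The function name `noisy_series` and its parameters are hypothetical, and starting the series in state 0 is an assumption; the wavelet smoothing step itself is not shown.

```python
import math
import random

def noisy_series(n=10000, segment=500, sigma=0.5, amplitude=400.0,
                 period=20, seed=None):
    """Return (states, values): alternating segments plus cosine noise."""
    rng = random.Random(seed)
    states, values = [], []
    for i in range(n):
        # Assumption: the series starts in state 0 and alternates
        # every `segment` observations.
        state = (i // segment) % 2
        signal = rng.gauss(float(state), sigma)
        # 2 * pi * i / 20 equals pi * i / 10, the noise term given above.
        noise = amplitude * math.cos(2.0 * math.pi * i / period)
        states.append(state)
        values.append(signal + noise)
    return states, values

states, values = noisy_series(seed=1)
```

The noise swings between -400 and 400, whereas the two state means differ by only 1, so the state signal is not visible in the raw series.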

Statistic	Actual model	HMMSeg model (unsmoothed)	HMMSeg model (smoothed)
State 0 mean	0	-248.22	0.06
State 0 standard deviation	0.5	130.91	0.178
State 0 state-switch probability	0.002	0.099	0.002
State 1 mean	1	255.79	0.993
State 1 standard deviation	0.5	125.93	0.069
State 1 state-switch probability	0.002	0.101	0.002
Percent agreement with actual segmentation	n/a	50	97
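
The unsmoothed column is roughly what one would expect if HMMSeg fits its two states to the positive and negative halves of the noise wave rather than to the embedded signal. The back-of-envelope check below rests on that assumption, treats the cosine as continuous, and ignores the much smaller signal term; it is an interpretation of the table, not a description of HMMSeg's internals.

```python
import math

amplitude = 400.0   # from the noise term 400 cos(pi * i / 10)
period = 20         # observations per noise cycle

# Mean and standard deviation of A*cos(theta) restricted to the half
# of the cycle where it is positive (continuous approximation).
half_wave_mean = 2.0 * amplitude / math.pi
half_wave_sd = amplitude * math.sqrt(0.5 - 4.0 / math.pi ** 2)

# The noise changes sign twice per cycle.
switch_prob = 2.0 / period

print(round(half_wave_mean, 1), round(half_wave_sd, 1), switch_prob)
# 254.6 123.1 0.1
```

These values are close to the fitted means (-248.22 and 255.79), standard deviations (130.91 and 125.93), and state-switch probabilities (0.099 and 0.101) in the unsmoothed column, which is consistent with the unsmoothed fit tracking the noise.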

All plots below were generated using R.

Original signal with state 1 indicated:

Signal with superimposed period-20 noise:

Without noise:

With noise: