A Two-stage Peak Alignment Algorithm for Two-Dimensional Gas Chromatography Time-of-Flight Mass Spectrometry-Based Metabolomics

Comprehensive two-dimensional gas chromatography coupled with time-of-flight mass spectrometry (GC×GC/TOF-MS) has been applied to metabolomics analyses recently. However, retention time shifts in the two-dimensional gas chromatography will introduce difficulty to compare compound profiles obtained from multiple samples. In this work, a novel two-stage peak alignment algorithm has been developed for data analysis of GC×GC/TOF-MS. In the first stage, our algorithm detects and merges multiple peak entries of the same metabolite into one peak entry. After a z-score transformation of metabolite retention times, landmark peaks will be selected from all samples based on both two-dimensional retention times and mass spectrum similarity of fragment ions measured by Pearson's correlation coefficient. In the second stage, the original two-dimensional retention time shift will be corrected using a local linear fitting method. A progressive retention time map searching method is used to align peaks in all samples together based on the parameters optimized in the first stage. Our algorithm can avoid defining a threshold of retention time window and spectrum similarity, which is very difficult for the users since the experimental condition is always changed in different experimental runs, even for the repeat experiments. The experimental results show that our algorithm can work well in peak alignment from real biological samples, which is very important for the further analysis.


Introduction
An emerging technology, comprehensive two-dimensional gas chromatography coupled with time-of-flight mass spectrometry (GC×GC/TOF-MS), brings much increased signal-to-noise ratio, dynamic range, chemical selectivity, and sensitivity to metabolomics analyses [1][2][3][4].This approach uses a multidimensional separation technique, where a short column after the main analytical column, to separate as many compounds as possible [5,6].The orthogonal setup of two columns in separation part makes GC×GC/TOF-MS platform get an order-of-magnitude increase in separation capacity, which is very important for the analysis of many complex samples.
After analyzing these samples using GC×GC/TOF-MS, it is necessary to recognize molecular features of the same compound occurring in different samples from each of the raw instrument data [7,8].Ideally, the same compound should have the identical retention times in the two-dimensional GC if the instrument configuration is the same.However, retention times always shift in both GC dimensions as a result of several, sometimes uncontrollable factors such as temperature and pressure fluctuations, matrix effects on samples, and stationary phase degradation.Retention time shifts introduce difficulty to compare compound profiles obtained from multiple samples.Therefore, aligning the instrumental signals which generated from same compounds in different samples, i.e., peak alignment, have to consider the retention time variation [9].
Currently, four studies addressed alignment issue for the two-dimensional GC separations using the raw instrument data as input material.Fraga et al. developed a rank-based algorithm using the generalized rank annihilation method (GRAM) to correct retention time variations in the two-dimensional GC [10,11] .Mispelaar et al. developed a correlation-optimized shifting-based algorithm to align a local region of a GC×GC chromatogram [12].These two methods can only be used to align small regions of interest in the two-dimensional GC data set.To correct the entire chromatogram in both GC dimensions, Pierce et al. proposed an indexing CSBJ Abstract: Comprehensive two-dimensional gas chromatography coupled with time-of-flight mass spectrometry (GC×GC/TOF-MS) has been applied to metabolomics analyses recently.However, retention time shifts in the two-dimensional gas chromatography will introduce difficulty to compare compound profiles obtained from multiple samples.In this work, a novel two-stage peak alignment algorithm has been developed for data analysis of GC×GC/TOF-MS.In the first stage, our algorithm detects and merges multiple peak entries of the same metabolite into one peak entry.After a z-score transformation of metabolite retention times, landmark peaks will be selected from all samples based on both two-dimensional retention times and mass spectrum similarity of fragment ions measured by Pearson's correlation coefficient.In the second stage, the original twodimensional retention time shift will be corrected using a local linear fitting method.A progressive retention time map searching method is used to align peaks in all samples together based on the parameters optimized in the first stage.Our algorithm can avoid defining a threshold of retention time window and spectrum similarity, which is very difficult for the users since the experimental condition is always changed in different experimental runs, even for the repeat experiments.The experimental results show that our algorithm can work well in peak alignment from real biological samples, which is very important for the further analysis.

A Two-stage Peak Alignment Algorithm for Two-Dimensional Gas Chromatography Time-of-Flight Mass Spectrometry-Based Metabolomics
Bing Wang a,b,c,* Volume No: 7, Issue: 8, April 2013, e201304002, http://dx.doi.org/10.5936/csbj.201304002scheme together with a piecewise retention time alignment algorithm [13].Zhang et al. developed a two dimensional correlation optimized warping (2-D COW) method by extending the correlation optimized warping method from 1-D to 2-D [14,15].However, these methods align the GCGC/TOF-MS data based on two-dimensional retention times alone, even though the signature feature of a metabolite, i.e., mass spectrum of fragment ions, is readily available in the raw instrument data.Aligning metabolite peaks solely based on the two dimensional retention times may introduce a high rate of false alignment because some metabolites with similar chemical functional groups have similar retention times in both GC dimensions.For this reason, two peak alignment methods, MSort [16] and DISCO [17], were developed.In these two methods, the raw instrument data are first subjected for spectrum deconvolution to generate a list of metabolite peaks for each sample, of which each metabolite peak is characterized by multiple molecular features including retention times in the two-dimensional GC, peak area, fragment spectrum, and other associated features.Both of MSort and DISCO employ two dimensional retention times and the mass spectrum of compound fragment ions for peak alignment.MSort was designed to align homogeneous data, while DISCO can align homogeneous and heterogeneous data.These two methods greatly reduced the rate of false alignment compared to existing alignment approaches.However, MSort and DISCO softwares use a user-defined retention time window with a fixed size in the two retention time dimensions to filter the peak candidates first, then the peak pair with the highest similarity will be choose as corresponding peaks in the different experiments if its value is bigger than a user-defined threshold.The separated application of retention time distance and spectrum similarity increase false alignment since there is no a golden criteria to select the distance window and the spectrum similarity threshold.
To overcome the limitations of current alignment algorithms, this paper reports a novel alignment algorithm for GC×GC/TOF-MS which can consider the distance and spectrum similarity together and automatically set up the threshold values.After the peak lists are generated from instrumental software, the present algorithm is implemented using a two-stage strategy.In the first stage, landmark peaks, a set of compound peaks present in each biological sample, are selected from the entire original peak lists using a mixture similarity comprised of the Euclidean distances of two-dimensional retention times and mass spectrum similarity of two corresponding peaks.In the second stage, retention time shifts are corrected using a local partial linear fitting method to handle nonlinear retention time distortion, and the compound peaks of all samples then are aligned using a progressive retention time map searching method.

Methods
For each experimental run, ChromaTOF will generate a peak list analyzed by instrumental software under predefined parameters.For all of the peak lists, our algorithm will align them using a two-stage peak alignment algorithm: stage one is full alignment which will find the peaks present in all lists and align them together; stage two is partial alignment which will align all remained peaks from stage one based on the results of full alignment.The workflow can be described as Figure 1.
In our method, a mixture similarity measurement is adopted to judge two peaks from different peak list are from the same compound or not.This mixture similarity mS can be calculated as follows: where w is contribution factor, d is the Euclidean distance of two peaks in two-dimensional retention time space, max d is the maximum distance for all peaks within all samples, and Sim is Pearson's correlation value of spectra of two peaks.
The alignment algorithm is implemented using a two-stage strategy: full alignment and partial alignment.In the full alignment, the peaks presented in all peak lists will be found and aligned together.Meanwhile, the parameter w and max d in the formula (1) can be obtained for the partial alignment.

Peak merging
Assuming there is a set of peak lists: , our algorithm will merge the peaks which come from the same compound in each peak list.Ideally, all instrument signals generated by one type of metabolite should be reported as a single peak in the output file of ChromaTOF software, i.e., one peak entry in the metabolite peak list.However, multiple peak entries of the same metabolite can be reported due to the abnormal metabolite peak shape and/or the high sensitivity of the peak detection algorithm.Therefore, the multiple entries should be merged before peak alignment.
1) Initialize a user-defined retention time window RT1w, RT2w, and a user-defined mass spectrum similarity 0 R , set in the th j peak list j s , if the th i peak is the last peak in the peak list j s , set , or stop when N j  , otherwise our algorithm will store this peak into a peak set from the peak list j s , then merge them together as a new peak within j s and set its peak area and retention time as: (2) where n p A denotes peak area of the representative peak, i p A is peak area of the ith multiple entry to be merged, k is the index number of multiple peak entries to be merged, n p RT denotes retention time of the representative peak, i p RT is the retention time of the ith multiple entry to be merged.Here, the retention time is either the first dimension retention time or the second dimension retention time.

Z-score transformation
For typical GC×GC/TOF-MS configuration, the second GC column is much shorter than the first column, which will result in the second retention time is smaller than the first retention time obviously.To balance the contribution of the two dimensional retention times, the retention time values in both the first and the second dimension GC are therefore transformed into z-scores as following: where 1z RT is the z-score value after transformation, 1 RT is the original value of the first dimension retention time, 1 RT  is the mean value of the original first dimension retention times within a peak list, 1 RT  is the standard deviation of the original first dimension retention times.Accordingly, the symbols in the second dimension retention time have similar meanings.For all the peaks in all peak lists, the pairwise distances will be calculated and the maximum value is chosen as max d .

Landmark peak finding
The purpose of full alignment is to find a list of landmark peaks which present in all peak lists.Therefore, one of peak lists has been selected randomly from sample set as reference sample R s , and the remain are seen as target sample set . Then landmark peaks can be found as follows: 1) Selecting a sample randomly as the target sample T s from the set of remaining samples S , then 2) For each peak r p in the reference sample R s , our algorithm will extract the peaks with same name from the target sample T s as a candidate peak set sameName candidate P , and choose the one with the minimum distance from r p in z-score transformation space of retention time as the corresponding peak t p of r p in T s .
3) Storing the peak pair ( r p , t p ) into a set of landmark peaks lp P .
4) Updating the reference sample R s using the peaks which has found the corresponding peak, and if the number of sample in S is not one, go to step 1), else stop.
After landmark peaks set lp P has been found, we can get an optimized value of w from a candidate value set (0.05,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95).For each sample pair s is the reference sample and from the remained samples, our algorithm calculate the mixture similarity mSim of the each landmark peak using the formula (1), and get a summary value mS for each sample pair.And choose the w value with the largest summary value of mS as the optimized value for each sample pair.

Two-stage Peak Alignment Algorithm for Metabolomics
The partial alignment is to align the remained peaks which are not the landmark peaks selected the full alignment.After full alignment, we already get the value of max d and w in formula (1) and a set of landmark peaks which present in all the samples.However the remained peaks also should be aligned together even they present in part of samples.Our algorithm will use an outlier detection method to decide the threshold of distance and similarity automatically, and implement partial alignment based on those threshold values.

Outlier detection
Our algorithm considers the landmark peaks as true peak alignment, therefore the information from the landmark peaks can be used to help the partial alignment because the remained peaks are detected by the same experimental conditions and the same data processing procedure.Here we select the maximum peak distance 0 d , the minimum similarity 0 Sim and the minimum mixture similarity 0 mS as threshold to judge the remained peaks from different sample come from same compound or not.
However, because the complexity of sample source and some inevitable variation from experiment, the data points will be further away from the mean value of the above three parameters than what is deemed reasonable.The simple selection of the extreme value of parameters as the threshold will cause faulty alignment of peaks.Therefore, our algorithm remove the outliers using Grubbs' test method from the above three parameters, and set the threshold values 0 d , 0 Sim and 0 mS after outlier removal.

Peak alignment
Our algorithm aligns the remained peaks in a pair wise way, i.e.

) , ( i T R s s
where R s is the reference sample and from the remained samples.For each sample pair , the remained peaks are aligned as follows: 1) For each peak i p in R s , the peak i p in i T s , if they satisfy the criteria that 3) For each compound name, we account the times present within sameName candidate P , if it just presents in one peak pair, this peak pair will be seen as the same compound and transfer this pair into aligned peak pair set i aligned P ; if it presents in more than one peak pair, the peak pair with the largest mS value will be transferred into 6) The above steps 1)-5) will be repeated until all sample pair are aligned and all
The performance of our algorithm was tested by aligning three datasets, which are replicate analyses of the same sample using an identical two dimensional GC configuration with different column temperature gradients, i.e., 5 o C /min, 7 o C/min, and 10 o C /min, respectively.To evaluate the performance of our proposed algorithm more objectively, the experiments ramped at different temperature gradients have been repeated different times, i.e., 10 replicate analyses for 5 o C /min, 3 replicate analyses for 7 o C/min, and 4 replicate analyses for 10 o C/min.The design of the different repeat times of different experimental conditions will help us to evaluate how robustness our algorithm is.The experiments can be found in our previous work in more details [17].
Here the algorithm performance is evaluated by three measurements, i.e., true positive rate (TPR), positive predictive value (PPV), and F1 score of the peak alignment.For the sample set , if a compound presents in all N samples, it will be called as a positive peak pair.After peak alignment, if the number of positive peak pairs is p N and the number of matched peak pairs is m N , then the values of TPR, PPV and F1 score can be calculated as follows: where TP is the number of positive peak pairs that were aligned as positive (true positive), FP is the number of negative peak pairs that were aligned as positive (false positive) and is m N -TP, FN is the number of positive peak pairs that were not aligned (false negative) and is p N -TP.TPR is also called recall and PPV precision and F1 score is their harmonic mean.
One of advantages of our algorithm proposed here is there are no predefined parameters when the peak alignment is implemented.The method can automatically decide the threshold of Pearson's correlation of two spectra, the distance window in zscore transformation retention time space and the mixture similarity value.However, the complexity of samples and some inevitable variations of experiments may make the distributions of these parameters not normal.Our algorithm decides the thresholds based on the landmark peaks, which are presented all the samples.Except the information of retention time and spectrum similarity, our algorithm also uses the peak name assigned by the ChromaTOF software.After the full alignment, the distribution of parameters: Eculid distance, spectrum similarity mS , and mixture similarity mSim can be found in Figure 2. It can be seen that the each distribution of three parameters has an apparent tail.Therefore, the simple selection of maximum or minimum of these parameters may cause some inaccurate setting.Our algorithm uses an outlier detection method to remove the observations which are distant from the rest of the parameters, which makes the selections more reasonable.
Our algorithm is tested using the experimental data which are generated under different temperature gradient setup, i.e., 5 o C/min, 7 o C/min, and 10 o C/min.Therefore, 17 peak lists can be obtained from ChromaTOF where each peak list means one sample analyzed by GC×GC/TOF-MS, and they can be denoted as for the different temperature gradients respectively.After peak alignment, the overall performance can be found in Figure 3.

Figure 1 .
Figure 1.Workflow of the two-stage peak alignment algorithm.
pairs have the same name which assigned by ChormaTOF software will be extracted as a subgroup sameName candidate P , and assign the remained peak pairs as another subgroup sameName candidate P .
, our algorithm will transfer the peak pair with the largest mS value from partial alignment, our align algorithm will be finished with combination of full alignment result lp P and partial alignment result aligned P .Two-stage Peak Alignment Algorithm for Metabolomics