However, the amount of work on each worker node might differ significantly. Another, more careful strategy is based on the scoring count of each spectrum: each worker handles the same number of scoring processes. However, different scoring processes can involve very different numbers of operations. For example, assuming the peaks are fully matched, the SDP operation count is 20 for a matched spectrum pair with 5 hit peaks, whereas it is 200 for a matched spectrum pair with 50 hit peaks.
As a result, the above methods do not balance the workload across workers very well, and the communication between the master and the workers is also higher than in our strategy.
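The imbalance described above can be illustrated with a small sketch. The per-peak cost of 4 operations is an assumption derived from the 20-operations-for-5-peaks figure quoted in the text; the variable names are illustrative, not the authors' code.

```python
def sdp_op_count(hit_peaks, ops_per_peak=4):
    # Fixed cost per matched peak; 4 ops/peak reproduces the
    # 20-operations-for-5-hit-peaks figure quoted in the text.
    return ops_per_peak * hit_peaks

# Four matched spectrum pairs with very different hit-peak counts.
hit_peaks = [5, 7, 50, 48]

# Naive split: the same number of pairs per worker, very uneven work.
worker_a = sum(sdp_op_count(h) for h in hit_peaks[:2])   # 20 + 28 = 48 ops
worker_b = sum(sdp_op_count(h) for h in hit_peaks[2:])   # 200 + 192 = 392 ops
```

Even though each worker receives the same number of spectrum pairs, one ends up with roughly eight times the operations of the other, which is the imbalance the pre-calculation strategy described later avoids.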
Moreover, our method can easily be extended to support other similar scoring methods, such as XCorr, KSDP, or other probability-based methods. In the future, we will implement a complete GPU cluster-based search engine for protein identification; an initial estimate of its effect can be seen in Additional file 1. We achieve an approximate 30- to 60-fold speedup on a single GPU node relative to a serial CPU version, and a favorable further speedup on a four-node GPU cluster. The workflow of the scoring module contains three steps, as shown in Algorithm 1.
First, lines 1 and 2 perform the theoretical and experimental spectrum matching. We adopt the spectrum hash indexing technique from our previous study [15] to find each matched spectrum in O(1) time, with an O(C) cost for building the index. Second, lines 4-7 conduct the peak matching for each matched theoretical and experimental spectrum pair.
We again adopt the spectrum peak hash indexing technique from our previous study [15] to find each matched peak in O(1) time. Third, line 6, inside the second step, computes the SDP value from the matched peaks for each matched theoretical and experimental spectrum pair. The GPU threads are organized into thread blocks, and each block of threads is executed concurrently on one streaming multiprocessor (SM). Each SM has four types of on-chip memory, namely registers, shared memory, constant cache, and texture cache [31].
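The three-step serial workflow of Algorithm 1 (hash-indexed spectrum matching, peak matching, and SDP accumulation) can be sketched as follows. The dictionary-based index and the rounding used for binning are illustrative assumptions, not the authors' actual implementation.

```python
def score_sdp(experimental, theoretical_db, index):
    """Serial sketch of the three-step scoring workflow (Algorithm 1).

    `index` maps a discretised precursor mass to candidate theoretical
    spectrum ids, giving the O(1) candidate lookup described in the text.
    """
    scores = {}
    for eid, (e_mass, e_peaks) in experimental.items():
        # Step 1: O(1) spectrum matching via the precursor-mass hash index.
        for tid in index.get(round(e_mass), []):
            t_peaks = theoretical_db[tid]
            # Step 2: peak matching via a hash on discretised peak m/z.
            peak_index = {round(mz): inten for mz, inten in t_peaks}
            # Step 3: accumulate the spectral dot product over matched peaks.
            sdp = 0.0
            for mz, inten in e_peaks:
                hit = peak_index.get(round(mz))
                if hit is not None:
                    sdp += inten * hit
            scores[(eid, tid)] = sdp
    return scores

# Tiny example: only the 100 m/z peak matches, so SDP = 2.0 * 1.0.
experimental = {"e1": (500.0, [(100.0, 2.0), (200.0, 3.0)])}
theoretical_db = {"t1": [(100.0, 1.0), (250.0, 5.0)]}
index = {500: ["t1"]}
scores = score_sdp(experimental, theoretical_db, index)
```

Both lookups are single hash probes, so the per-pair cost is dominated by the SDP accumulation itself, which is the part the GPU algorithms below parallelize.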
Constant cache and texture cache are both read-only memories shared by all of the scalar processors (SPs). Off-chip memories, such as local memory and global memory, offer more space but relatively long access latency, usually several hundred clock cycles [35]. The properties of the different types of memory are summarized in [35, 36].
In general, the scarce registers and shared memory should be used carefully to amortize the cost of global memory latency. Our first SDP algorithm on the GPU assigns each thread one theoretical spectrum, which it scores against its entire set of matched experimental spectra, as shown in Algorithm 2.
The differences between Algorithms 1 and 2 are as follows. First, Algorithm 2 unrolls the outer for loop of Algorithm 1 by assigning each theoretical spectrum to its own thread, which reduces running time significantly because many threads work in parallel. Second, Algorithm 2 merges the peak matching and SDP calculation steps to reduce the space needed for intermediate variables.
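The thread-per-spectrum idea of Algorithm 2 can be sketched on the CPU with a thread pool standing in for GPU threads. This is an illustrative analogue under that assumption, not the authors' CUDA kernel; note how the fused loop never materialises a list of matched peaks.

```python
from concurrent.futures import ThreadPoolExecutor

def thread_body(t_peaks, e_index):
    # Work of one "GPU thread" in Algorithm 2: peak matching and SDP
    # accumulation are fused, so no intermediate match list is stored.
    sdp = 0.0
    for mz, inten in t_peaks:
        hit = e_index.get(round(mz))
        if hit is not None:
            sdp += inten * hit       # fused match + multiply-accumulate
    return sdp

# One "thread" per theoretical spectrum, all scored against the same
# matched experimental spectrum (here a hash on discretised m/z).
e_index = {100: 2.0, 200: 3.0}
theoretical = [
    [(100.0, 1.0)],                  # SDP = 1.0 * 2.0 = 2.0
    [(100.0, 2.0), (200.0, 4.0)],    # SDP = 2*2 + 4*3 = 16.0
]
with ThreadPoolExecutor() as pool:
    scores = list(pool.map(lambda t: thread_body(t, e_index), theoretical))
```

On the GPU the per-thread state (`sdp`, loop indices) lives in registers, which is why register capacity becomes the bottleneck discussed next.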
When implementing Algorithm 2, we first copy the theoretical spectra to global memory, then store the experimental spectra in texture memory and place the spectrum index file in constant memory. We note that when the spectrum dataset is small, in both total number and spectrum length, we can keep the experimental spectrum and other variables in on-chip registers. We illustrate this effect in detail in the Results section. However, the problem with the implementation of Algorithm 2 is the limited number of registers.
In fact, users cannot fully control register allocation and can rely on registers only when the data are small enough. As the number and length of the spectra grow, the data no longer fit in registers and are instead stored in local or global memory, which increases read latency and degrades performance significantly. In each mass tolerance window, a group of experimental spectra is scored against a group of theoretical spectra.
Based on this observation, we design Algorithm 3 for dense mass windows, using registers and shared memory together. TH and TW are preset values chosen as integral multiples of the number of threads in a half GPU warp, such as 16, 32, or 64, to make the best use of the GPU warp mechanism.
For each thread, indexT points to the correct position in the theoretical spectra and combines three parts, as shown in line 4: theo, the beginning address of the theoretical spectra; a block offset derived from blockIdx (the height of theo is divided into tiles of TH rows); and a per-thread offset. In line 5, indexC points to the correct position in the experimental spectra and likewise combines three parts: expe, the beginning address of the current spectrum; a block offset from blockIdx; and a per-thread offset. The threads in one block therefore access the experimental spectra at consecutive addresses, which is known as coalesced access. (Figure: the computing process in one dense mass window.)
In the loop from lines 11 to 16, the algorithm loads a tile of data from global memory into shared memory and accumulates the SDP score in TResult, which is held in on-chip registers; the loop ends when the whole row has been processed.
Line 17 waits for all of the threads to finish their work, and line 18 writes the result back from TResult to SR.
The main purpose of Algorithm 3 is to reduce the number and latency of global memory accesses by loading the theoretical spectra into shared memory tile by tile. Thus, Algorithm 3 reads each theoretical spectrum from global memory only once, the same as Algorithm 2. The key feature of Algorithm 3 is how efficiently it accesses global and shared memory: it adopts coalesced reads, in which the threads of a half warp access sixteen consecutive addresses, avoiding bank conflicts.
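The tiling scheme of Algorithm 3 is structurally the same as a tiled matrix product over binned intensities, and can be sketched on the CPU as below. The tile slices play the role of the blocks staged in shared memory; `result` stands in for the per-thread register TResult. Shapes and names are illustrative assumptions.

```python
import numpy as np

def tiled_sdp(theo, expe, tile=16):
    """CPU sketch of Algorithm 3's tiling: `theo` (T x L) and `expe`
    (E x L) hold binned peak intensities for one dense mass window."""
    T, L = theo.shape
    E = expe.shape[0]
    result = np.zeros((T, E))            # per-thread accumulators (TResult)
    for start in range(0, L, tile):       # tile loop (lines 11-16)
        t_tile = theo[:, start:start + tile]  # tile staged in "shared memory"
        e_tile = expe[:, start:start + tile]
        result += t_tile @ e_tile.T           # partial SDP for this tile
    return result

# The tiled result equals the untiled dot product theo @ expe.T.
theo = np.arange(6, dtype=float).reshape(2, 3)
expe = np.ones((1, 3))
out = tiled_sdp(theo, expe, tile=2)
```

On the GPU, each staged tile is read from global memory once and then reused by every thread in the block, which is where the memory-traffic saving comes from.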
On the GPU cluster, since each node can run Algorithm 2 or 3, the main concern is how to dispatch the work to the nodes and achieve a well-balanced workload. We design a new complete pre-calculation strategy to give each node nearly the same amount of work. In this strategy, we first run the protein identification workflow up to the scoring stage to obtain the experimental and theoretical matching results. These results tell us how many peptides each spectrum will be scored against, as shown in Algorithm 4, lines 1-4.
Second, we compute the operation count for each mass range, such as a one-Dalton window, based on the experimental and theoretical precursor masses, the matching results, and the cost of each SDP operation. Third, we dispatch the task by mass range, giving each node the same amount of work. As shown in Algorithm 5, lines 1-3 calculate the total number of operations; line 4 computes the average number of operations per GPU node; lines 5-10 traverse the Oper array, and whenever the running sum of operations exceeds the average WorkerOper, Algorithm 5 calls Algorithm 2 or 3 to deal with the current mass range, Oper[p] to Oper[k].
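The dispatch step can be sketched as a greedy scan over the pre-computed operation counts. Function and variable names mirror the text's Oper and WorkerOper but are illustrative, not the authors' code.

```python
def dispatch(oper, n_workers):
    """Sketch of the pre-calculation dispatch (Algorithms 4-5).

    `oper[k]` is the pre-computed SDP operation count for mass range k.
    Mass ranges are assigned greedily so each worker receives roughly
    total / n_workers operations; returns (start, end) slices of `oper`
    per worker, i.e. worker i handles Oper[start..end-1]."""
    total = sum(oper)                 # lines 1-3: total operation count
    target = total / n_workers        # line 4: average per worker (WorkerOper)
    tasks, acc, p = [], 0, 0
    for k, ops in enumerate(oper):    # lines 5-10: traverse the Oper array
        acc += ops
        if acc >= target and len(tasks) < n_workers - 1:
            tasks.append((p, k + 1))  # this worker handles Oper[p] to Oper[k]
            p, acc = k + 1, 0
    tasks.append((p, len(oper)))      # remaining ranges go to the last worker
    return tasks

# Two workers over four mass ranges: the split lands after the heavy range.
oper = [5, 50, 5, 20]
parts = dispatch(oper, 2)
```

Because the costs are known before any scoring starts, the master can send each node its whole slice in a single transfer, which is the communication saving noted below.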
The overhead of the pre-calculation strategy is also very low, as shown in the Results section. After the pre-calculation, the master node transfers data to each computing node only once, which lowers the communication cost.
Correspondence to Xiaowen Chu. In the past five years, none of the authors has received reimbursements, fees, funding, or salary from an organization that may in any way gain or lose financially from the publication of this manuscript.
Nor will the authors receive any such payments in the future. No such organization is financing this manuscript, including the article-processing charge. None of the authors holds any stocks or shares in an organization that may in any way gain or lose financially from the publication of this manuscript, now or in the future.
None of the authors holds or is applying for any patents relating to the content of the manuscript, and none has received reimbursements, fees, funding, or salary from an organization that holds or has applied for such patents. There are no non-financial competing interests (political, personal, religious, ideological, academic, intellectual, commercial, or any other) to declare in relation to this manuscript. YL and XC designed this study. YL implemented the algorithms and performed the experiments. All of the authors have read and approved the final manuscript.
Reprints and Permissions. Li, Y.: Accelerating the scoring module of mass spectrometry-based peptide identification using GPUs. BMC Bioinformatics 15.
Abstract (Background): Tandem mass spectrometry-based database searching is currently the main method for protein identification in shotgun proteomics.
Conclusions: Our GPU-based SDP algorithm can significantly improve the speed of the scoring module in mass spectrometry-based protein identification.
Results: All the experiments were conducted on a GPU cluster that included one master node (mu01) and four computing nodes (Fermi). (Figure 1; Table 1: GPU cluster specifications.)