Uses functionality from MSstats to clean, summarize, and normalize PTM-level and protein-level data. Imputes missing values and summarizes peptide-level quantification to the protein and PTM level. Applies global median normalization to the peptide-level data and normalizes between runs.
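As an illustration of what median normalization between runs means, here is a minimal base-R sketch. It assumes a toy matrix of log2 intensities (features in rows, runs in columns) with made-up values, and it shifts each run so that all run medians agree. It is not the MSstats implementation, only one simple way to equalize medians.

```r
# Toy log2 intensities: rows are features (peptides), columns are MS runs.
# All values and names are hypothetical and only serve the illustration.
log_int <- matrix(
  c(20, 21, 19,
    22, 23, 21,
    24, 25, NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("pep", 1:3), paste0("run", 1:3))
)

# Median of each run, ignoring missing values.
run_medians <- apply(log_int, 2, median, na.rm = TRUE)

# Use the median of the run medians as the common target.
target <- median(run_medians)

# Subtract each run's offset from the target; NA values stay NA.
normalized <- sweep(log_int, 2, run_medians - target, FUN = "-")

# After the shift, every run has the same median.
apply(normalized, 2, median, na.rm = TRUE)
```

Because the shift is a constant per run, differences between features within a run are left unchanged.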
dataSummarizationPTM(
  data,
  logTrans = 2,
  normalization = "equalizeMedians",
  normalization.PTM = "equalizeMedians",
  nameStandards = NULL,
  nameStandards.PTM = NULL,
  fillIncompleteRows = TRUE,
  featureSubset = "all",
  featureSubset.PTM = "all",
  remove_uninformative_feature_outlier = FALSE,
  remove_uninformative_feature_outlier.PTM = FALSE,
  n_top_feature = 3,
  n_top_feature.PTM = 3,
  summaryMethod = "TMP",
  equalFeatureVar = TRUE,
  censoredInt = "NA",
  cutoffCensored = "minFeature",
  MBimpute = TRUE,
  MBimpute.PTM = TRUE,
  remove50missing = FALSE,
  address = "",
  maxQuantileforCensored = 0.999,
  clusters = NULL
)
Argument | Description |
---|---|
data | Name of the list with PTM and (optionally) Protein data.tables, which can be the output of the MSstatsPTM converter functions |
logTrans | Base of the logarithm transformation: 2 (default) or 10. |
normalization | Normalization for the protein-level dataset, to remove systematic bias between MS runs. Three normalization methods are supported. 'equalizeMedians' (default) performs constant normalization (equalizing the medians) based on reference signals. 'quantile' performs quantile normalization based on reference signals. 'globalStandards' performs normalization with global standard proteins. FALSE performs no normalization. |
normalization.PTM | Normalization for the PTM-level dataset. Default is "equalizeMedians". Can be set to any of the options described above. |
nameStandards | Vector of global standard peptide names for the protein dataset. Only used for normalization with global standard peptides. |
nameStandards.PTM | Same as above for PTM dataset. |
fillIncompleteRows | If the input dataset has incomplete rows, TRUE (default) adds rows with intensity value NA for the missing peaks. FALSE reports an error message listing the features that have incomplete rows. |
featureSubset | For protein dataset only. "all" (default) uses all features in the dataset. "top3" uses the 3 features with the highest average log2(intensity) across runs. "topN" uses the N features with the highest average log2(intensity) across runs; it requires the n_top_feature option. "highQuality" flags uninformative features and outliers. |
featureSubset.PTM | For PTM dataset only. Options same as above. |
remove_uninformative_feature_outlier | For protein dataset only. Only takes effect after featureSubset="highQuality" has been used in dataProcess. TRUE removes, before run-level summarization, 1) features flagged in the column feature_quality="Uninformative", which are features of bad quality, and 2) outliers flagged in the column is_outlier=TRUE. FALSE (default) uses all features and intensities for run-level summarization. |
remove_uninformative_feature_outlier.PTM | For PTM dataset only. Options same as above. |
n_top_feature | For protein dataset only. The number of top features for featureSubset='topN'. Default is 3, which means to use top 3 features. |
n_top_feature.PTM | For PTM dataset only. Options same as above. |
summaryMethod | "TMP" (default) uses Tukey's median polish, a robust estimation method. "linear" uses a linear mixed model. |
equalFeatureVar | Only for summaryMethod="linear". Logical; whether the model assumes equal variance among intensities from different features. TRUE (default) assumes equal variance among intensities from features. FALSE does not assume equal variance and accounts for heterogeneous variation among features. |
censoredInt | Whether missing values are censored or missing at random. 'NA' (default) assumes that all NAs in the 'Intensity' column are censored. '0' treats zero intensities as censored; in this case, NA intensities are missing at random. Output from Skyline should use '0'. NULL assumes that all NA intensities are missing at random. |
cutoffCensored | Cutoff value for censoring. Only used with censoredInt='NA' or '0'. Default is 'minFeature', which uses the minimum value for each feature. 'minFeatureNRun' uses the smaller of the minimum value of the corresponding feature and the minimum value of the corresponding run. 'minRun' uses the minimum value for each run. |
MBimpute | For protein dataset only. Only for summaryMethod="TMP" and censoredInt='NA' or '0'. TRUE (default) imputes 'NA' or '0' values (depending on the censoredInt option) with an accelerated failure model. FALSE uses the values assigned by cutoffCensored. |
MBimpute.PTM | For PTM dataset only. Options same as above. |
remove50missing | Only for summaryMethod="TMP". TRUE removes the runs that have more than 50% missing values. FALSE is the default. |
address | The name of the folder that will store the results. The default folder is the current working directory. The address argument specifies where to store the file and can also be used to modify the beginning of the file name. |
maxQuantileforCensored | Maximum quantile for deciding censored missing values. Default is 0.999. |
clusters | A user-specified number of clusters. Default is NULL, which does not use clusters. |
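To make the summaryMethod="TMP" option more concrete, the following base-R sketch applies Tukey's median polish to a toy feature-by-run matrix of log2 intensities for a single protein. It uses stats::medpolish from base R. The matrix values and the choice to report overall effect plus run effect as the run-level summary are assumptions made for illustration, not a statement of how dataSummarizationPTM computes its output.

```r
# Toy log2 intensities for one protein: rows are features, columns are runs.
# Values are hypothetical; the matrix is exactly additive (feature + run).
abundance <- matrix(
  c(20.0, 21.0, 19.5,
    22.0, 23.0, 21.5,
    18.0, 19.0, 17.5),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("feature", 1:3), paste0("run", 1:3))
)

# Tukey's median polish: alternately sweeps out row and column medians.
fit <- stats::medpolish(abundance, na.rm = TRUE, trace.iter = FALSE)

# One value per run: overall effect plus the run (column) effect.
run_summary <- fit$overall + fit$col
run_summary
```

Because medians are used instead of means, a single outlying feature intensity has limited influence on the run-level value, which is why the method is described as robust.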
List of summarized PTM and Protein results. These results contain the reformatted input to the summarization function, as well as the run-level summarization results.
#> # A tibble: 6 x 10
#>   ProteinName PeptideSequence Condition BioReplicate Run        Intensity
#>   <chr>       <chr>           <chr>     <chr>        <chr>          <dbl>
#> 1 Q9UHD8_K262 DAGLK*QAPASR    CCCP      BCH1         CCCP-B1T1   1423906.
#> 2 Q9UHD8_K262 DAGLK*QAPASR    CCCP      BCH1         CCCP-B1T2    877045.
#> 3 Q9UHD8_K262 DAGLK*QAPASR    CCCP      BCH2         CCCP-B2T1    384418.
#> 4 Q9UHD8_K262 DAGLK*QAPASR    CCCP      BCH2         CCCP-B2T2    454858.
#> 5 Q9UHD8_K262 DAGLK*QAPASR    Combo     BCH1         Combo-B1T1  1603377.
#> 6 Q9UHD8_K262 DAGLK*QAPASR    Combo     BCH1         Combo-B1T2   676555.
#> # ... with 4 more variables: PrecursorCharge <chr>, FragmentIon <lgl>,
#> #   ProductCharge <lgl>, IsotopeLabelType <chr>
#> # A tibble: 6 x 10
#>   ProteinName PeptideSequence Condition BioReplicate Run           Intensity
#>   <chr>       <chr>           <chr>     <chr>        <chr>             <dbl>
#> 1 Q9UHD8      STLINTLFK       CCCP      BCH2         CCCP-B2T1       367944.
#> 2 Q9UHD8      STLINTLFK       CCCP      BCH2         CCCP-B2T2       341207.
#> 3 Q9UHD8      STLINTLFK       Combo     BCH2         Combo-B2T1      185843.
#> 4 Q9UHD8      STLINTLFK       Ctrl      BCH2         Ctrl-B2T1       529224.
#> 5 Q9UHD8      STLINTLFK       Ctrl      BCH2         Ctrl-B2T2       483355.
#> 6 Q9UHD8      STLINTLFK       USP30_OE  BCH2         USP30_OE-B2T1   447795.
#> # ... with 4 more variables: PrecursorCharge <chr>, FragmentIon <lgl>,
#> #   ProductCharge <lgl>, IsotopeLabelType <chr>
quant.lf.msstatsptm <- dataSummarizationPTM(raw.input)
#> INFO [2021-04-30 11:44:12] ** Features with one or two measurements across runs are removed.
#> INFO [2021-04-30 11:44:12] ** Fractionation handled.
#> INFO [2021-04-30 11:44:12] ** Updated quantification data to make balanced design. Missing values are marked by NA
#> INFO [2021-04-30 11:44:12] ** Log2 intensities under cutoff = 13.751 were considered as censored missing values.
#> INFO [2021-04-30 11:44:12] ** Log2 intensities = NA were considered as censored missing values.
#> INFO [2021-04-30 11:44:12] ** Use all features that the dataset originally has.
#> INFO [2021-04-30 11:44:12]
#>  # proteins: 125
#>  # peptides per protein: 1-5
#>  # features per peptide: 1-1
#> INFO [2021-04-30 11:44:12] Five or more proteins have only one feature:
#>  Q9UHD8_K028,
#>  Q9UHD8_K069,
#>  Q9UHD8_K141,
#>  Q9UHQ9_K046,
#>  Q9UHQ9_K062 ...
#> INFO [2021-04-30 11:44:12]
#>                     CCCP Combo Ctrl USP30_OE
#>              # runs    4     4    4        4
#>     # bioreplicates    2     2    2        2
#>  # tech. replicates    2     2    2        2
#> INFO [2021-04-30 11:44:12] Five or more features are completely missing in at least one condition,
#>  VYLK*GVHPK_2_NA_NA,
#>  GVHPK*FPEGGK_2_NA_NA,
#>  MSQYLDSLK*VGDVVEFR_3_NA_NA,
#>  DWAYSK*GFVTADMIR_2_NA_NA,
#>  DWAYSK*GFVTADMIR_3_NA_NA
#> INFO [2021-04-30 11:44:12]
#>  == Start the summarization per subplot...
#> Warning: Ran out of iterations and did not converge
#> Warning: Ran out of iterations and did not converge
#> INFO [2021-04-30 11:44:13] == the summarization per subplot is done.
#> INFO [2021-04-30 11:44:14] ** Features with one or two measurements across runs are removed.
#> INFO [2021-04-30 11:44:14] ** Fractionation handled.
#> INFO [2021-04-30 11:44:14] ** Updated quantification data to make balanced design. Missing values are marked by NA
#> INFO [2021-04-30 11:44:14] ** Log2 intensities under cutoff = 16.998 were considered as censored missing values.
#> INFO [2021-04-30 11:44:14] ** Log2 intensities = NA were considered as censored missing values.
#> INFO [2021-04-30 11:44:14] ** Use all features that the dataset originally has.
#> INFO [2021-04-30 11:44:14]
#>  # proteins: 26
#>  # peptides per protein: 1-9
#>  # features per peptide: 1-1
#> INFO [2021-04-30 11:44:14] Five or more proteins have only one feature:
#>  Q9UHD8,
#>  Q9UHQ9,
#>  Q9UIF8,
#>  Q9UL25,
#>  Q9UNH7 ...
#> INFO [2021-04-30 11:44:14]
#>                     CCCP Combo Ctrl USP30_OE
#>              # runs    4     4    4        4
#>     # bioreplicates    2     2    2        2
#>  # tech. replicates    2     2    2        2
#> INFO [2021-04-30 11:44:14] Five or more features are completely missing in at least one condition,
#>  TVYSHLFDHVVNR_4_NA_NA,
#>  TDQFPLFLIIMGK_2_NA_NA,
#>  LMKMAR_2_NA_NA,
#>  TVYSHLFDHVVNR_4_NA_NA,
#>  TDQFPLFLIIMGK_2_NA_NA
#> INFO [2021-04-30 11:44:14]
#>  == Start the summarization per subplot...
#> Warning: Ran out of iterations and did not converge
#> INFO [2021-04-30 11:44:14] == the summarization per subplot is done.
#>       PROTEIN                        PEPTIDE TRANSITION
#> 1 Q9UHQ9_K218 AILKVPEDPTQCFLLFANQTEK*DIILR_4      NA_NA
#> 2 Q9UHQ9_K218 AILKVPEDPTQCFLLFANQTEK*DIILR_4      NA_NA
#> 3 Q9UHQ9_K218 AILKVPEDPTQCFLLFANQTEK*DIILR_4      NA_NA
#> 4 Q9UHQ9_K218 AILKVPEDPTQCFLLFANQTEK*DIILR_4      NA_NA
#> 5 Q9UHQ9_K218 AILKVPEDPTQCFLLFANQTEK*DIILR_4      NA_NA
#> 6 Q9UHQ9_K218 AILKVPEDPTQCFLLFANQTEK*DIILR_4      NA_NA
#>                                FEATURE LABEL GROUP RUN SUBJECT FRACTION
#> 1 AILKVPEDPTQCFLLFANQTEK*DIILR_4_NA_NA     L  CCCP   1    BCH1        1
#> 2 AILKVPEDPTQCFLLFANQTEK*DIILR_4_NA_NA     L  CCCP   2    BCH1        1
#> 3 AILKVPEDPTQCFLLFANQTEK*DIILR_4_NA_NA     L  CCCP   3    BCH2        1
#> 4 AILKVPEDPTQCFLLFANQTEK*DIILR_4_NA_NA     L  CCCP   4    BCH2        1
#> 5 AILKVPEDPTQCFLLFANQTEK*DIILR_4_NA_NA     L Combo   5    BCH1        1
#> 6 AILKVPEDPTQCFLLFANQTEK*DIILR_4_NA_NA     L Combo   6    BCH1        1
#>   originalRUN censored INTENSITY ABUNDANCE newABUNDANCE predicted
#> 1   CCCP-B1T1    FALSE   4930434  20.94886     20.94886        NA
#> 2   CCCP-B1T2    FALSE   5473017  21.19945     21.19945        NA
#> 3   CCCP-B2T1     TRUE        NA        NA           NA        NA
#> 4   CCCP-B2T2     TRUE        NA        NA           NA        NA
#> 5  Combo-B1T1     TRUE        NA        NA     21.03963  21.03963
#> 6  Combo-B1T2    FALSE   6020716  21.33488     21.33488        NA