
Omics Quality Control
quality_control.RdThis function is a wrapper function that performs the key quality controls steps on an 'omics data set. Key principles: 1. keep the source underlying data as it is 2. copy the source data to a new data layer called qcing for processing 3. build an exclusion list, accumulating codes for exclusion reasons 4. make any adjustments needed in the destination copy of the data, flag these in the exclusion list 5. copy the final result to a data layer called post_qc 6. return the Omiprep object with the newly populated data layers
Usage
quality_control(
omiprep,
source_layer = "input",
sample_missingness = 0.2,
feature_missingness = 0.2,
feature_skewness_threshold = NULL,
feature_skewness_direction = "left",
total_peak_area_sd = 5,
outlier_udist = 5,
outlier_treatment = "leave_be",
winsorize_quantile = 1,
tree_cut_height = 0.5,
feature_selection = "max_var_exp",
pc_outlier_sd = 5,
max_num_pcs = 10,
sample_ids = NULL,
feature_ids = NULL,
features_exclude_but_keep = NULL,
cores = NULL,
fast = FALSE
)Arguments
- omiprep
an object of class Omiprep
- source_layer
character, the data layer to summarise
- sample_missingness
numeric 0-1, percentage of data missingness which should prompt exclusion of a sample
- feature_missingness
numeric 0-1, percentage of data missingness which should prompt exclusion of a feature
- feature_skewness_threshold
numeric, optional skewness threshold to exclude features with skewed distributions. Set to `NULL` to disable.
- feature_skewness_direction
character, direction of skewness to apply when `feature_skewness_threshold` is set. One of `"left"`, `"right"`, or `"both"`.
- total_peak_area_sd
numeric, number of TPA SD after which a sample would be excluded
- outlier_udist
the unit distance in SD or IQR from the mean or median estimate, respectively outliers are identified at. Default value is 5.
- outlier_treatment
character, how to handle outlier data values - options 'leave_be', 'turn_NA', or 'winsorize'
- winsorize_quantile
numeric, quantile to winsorize to, only relevant if 'outlier_treatment'='winsorize'
- tree_cut_height
numeric, the threshold for feature independence in hierarchical clustering. Default is 0.5.
- feature_selection
character, either 'max_var_exp' or 'least_missingness', how to select the independent feature within clusters
- pc_outlier_sd
numeric, number of PCA SD after which a sample would be excluded
- max_num_pcs
numeric, the maximum number of PCs to use (look in) when filtering samples on PC outlier SD, default=10, set to NULL to use all informative PCs from the Scree analysis
- sample_ids
character, vector of sample ids to retain and work with, all others samples will be excluded
- feature_ids
character, vector of feature ids to retain and work with, all other features will be excluded
- features_exclude_but_keep
character, vector of feature ids indicating features to exclude from the sample and PCA quality control analysis but keep in the data, OR a name of a logical column in the features data indicating the same
- cores
number of cores available for parallelism; the default null will try find the maximum available cores - 1; set to 1 for linear, but potentially slow, computation of the correlation matrix.
- fast
If
TRUE, accelerates correlation computation by imputing missing values to the column minimum, pre-ranking all columns, and computing Pearson correlation on ranked data (approximating Spearman). Substantially faster than exact Spearman at large feature dimensions (\(p > 5000\)) but assumes missing data are missing at random. Features with high missingness will have inflated rank ties at the median (ensure these are filtered out appropriately with the missingness option). DefaultFALSE.