
Core Bioinformatics group

 

microRNAs play a key role in RNA interference, the sequence-driven targeting of mRNAs that regulates their translation to proteins, through translation inhibition or degradation of the mRNA. Around 30% of animal genes may be tuned by microRNAs. The prediction of miRNA/mRNA interactions is hindered by the short length of the interaction (seed) region (~7-8 nt). We collate several large datasets of validated interactions and propose feamiR, a novel pipeline comprising optimised classification approaches (Decision Trees/Random Forests and an efficient feature selection based on embryonic Genetic Algorithms used in conjunction with Support Vector Machines) aimed at identifying discriminative nucleotide features, on the seed, compensatory and flanking regions, that increase the prediction accuracy for interactions.

Common and specific combinations of features illustrate differences between reference organisms, validation techniques or tissue/cell localisation. feamiR revealed new key positions that drive the miRNA/mRNA interactions, leading to novel questions on the mode-of-action of miRNAs.

Manuscript preprint: https://www.biorxiv.org/content/10.1101/2020.12.23.424130v1

Github: https://github.com/Core-Bioinformatics/feamiR

Documentation: https://core-bioinformatics.github.io/feamiR/reference/index.html

CRAN: https://CRAN.R-project.org/package=feamiR

feamiR was presented at the EMBL Symposium on the Non-Coding Genome in October 2021 (poster and video).

 

Overview of the feamiR workflow.

feamiR R package

Preprocessing

The dataset used for classification and feature selection is generated using the preparedataset function. This function also performs statistical analysis using chi-squared and Fisher exact tests, from which heatmaps can be created.
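A minimal sketch of this step is shown below; the argument names are assumptions for illustration only (the exact signature is given at ?preparedataset and in the reference documentation linked above):

```r
library(feamiR)

# Sketch only: argument names are assumptions, not the verified signature.
# Check ?preparedataset in the package documentation before running.
preparedataset(
  miRNA       = "mature_mirnas.fa",          # assumed: mature miRNA FASTA
  mRNA        = "utr_sequences.fa",          # assumed: 3'UTR FASTA
  positiveset = "validated_interactions.csv" # assumed: validated interaction pairs
)
```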

Heatmaps illustrating chi-squared and Fisher exact p-values when assessing differences in positional proportions of single nucleotides (left) and dinucleotides (right) for each position in the positive and negative datasets, using a subset of D. melanogaster miRNAs and mRNAs.
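The same per-position tests can be reproduced with base R. In the sketch below, pos_seeds and neg_seeds are hypothetical character vectors holding equal-length seed-region sequences from the positive and negative sets; the package performs the equivalent analysis internally:

```r
# Base-R reproduction of the per-position single-nucleotide tests.
# pos_seeds and neg_seeds are hypothetical inputs, not feamiR objects.
position_pvals <- function(pos_seeds, neg_seeds) {
  len <- nchar(pos_seeds[1])
  sapply(seq_len(len), function(i) {
    counts <- rbind(  # 2 x 4 contingency table of nucleotides at position i
      positive = table(factor(substr(pos_seeds, i, i), levels = c("A", "C", "G", "U"))),
      negative = table(factor(substr(neg_seeds, i, i), levels = c("A", "C", "G", "U")))
    )
    c(chisq  = suppressWarnings(chisq.test(counts)$p.value),
      fisher = fisher.test(counts, workspace = 2e7)$p.value)
  })
}
```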

Based on the subsamples created by the preparedataset function, several miRNA/mRNA classifiers can be trained and evaluated with feamiR. In addition, functions for selecting discriminative features are available.

Classifiers

The classifier functions are: decisiontree, randomforest and svm. For the Random Forest (RF), the optimal number of trees can be selected as a hyperparameter (selectrfnumtrees); for the Support Vector Machines (SVMs), the optimal kernel can be selected as a hyperparameter (selectsvmkernel). Both optimisation functions rely on cross-validation and produce plots from which an appropriate number of trees and kernel can be identified. They should be run on multiple subsamples to evaluate the robustness of the parameter selection.
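A sketch of the tuning step on a single subsample is shown below; the argument names are assumptions, so check ?selectsvmkernel and ?selectrfnumtrees for the exact signatures:

```r
library(feamiR)

# Hyperparameter tuning on one subsample; argument names are assumptions.
subsample <- read.csv("subsample0.csv")  # hypothetical file output by preparedataset

selectsvmkernel(subsample, title = "SVM kernel selection")          # plots accuracy per kernel
selectrfnumtrees(subsample, maxnumtrees = 100, title = "RF trees")  # plots accuracy vs trees
```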
 
Performance of SVM model with a variety of kernels on seed features using 10-fold cross-validation on a set of H. sapiens miRNA/mRNA interactions validated by Immunoprecipitation experiments. The distribution of training and test accuracy across cross-validation runs for each kernel is shown.
 
Performance of RF model with a range of number of trees, from 1 to 100, using 10-fold cross-validation on a set of H. sapiens miRNA/mRNA interactions validated by Immunoprecipitation experiments. The distribution of test and training accuracy across cross-validation runs for each number of trees is shown.
 
Next, the runallmodels function produces results for Decision Trees, Random Forests and SVMs on all 100 subsamples. The hyperparameters determined using selectsvmkernel and selectrfnumtrees should be supplied as parameters. The function outputs a data frame containing test and training accuracy, sensitivity and specificity for each model on each subsample; summary box plots are also produced. Additional output comprises dtreevote, containing the features selected by the decision trees for each subsample and their level within the tree, and ongoingginis, containing the Gini index of each feature from the Random Forest model. These methods are described in more detail in the next section.
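As a sketch, with assumed argument names (see ?runallmodels for the exact signature):

```r
library(feamiR)

# Run DTs, RFs and SVMs across the subsamples with the tuned hyperparameters.
# Argument names are assumptions -- see ?runallmodels.
results <- runallmodels(
  num_runs  = 100,       # assumed: number of subsamples
  num_trees = 30,        # assumed: RF size chosen via selectrfnumtrees
  kernel    = "linear"   # assumed: kernel chosen via selectsvmkernel
)
head(results)  # test/training accuracy, sensitivity, specificity per model and subsample
```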

 

Feature selection

The feamiR package implements five feature selection approaches:

  • dtreevoting: Uses trained Decision Trees (DTs) for feature selection by maintaining a record of frequently used features and the level of their first occurrence. Selection focuses only on features which appear in the top num_levels levels (from the root; default 10), as these are expected to have higher discriminative power. The selection is performed over num_runs runs (user-defined parameter with a default value of 100) on different subsamples output by preparedataset. The first column of the output table contains the number of runs in which each feature was used; the table can then be ordered by this frequency and the top-ranked features selected (see the sketch after the figure below).

Boxplot showing distribution, on DTs, over 100 subsamples of the full H. sapiens dataset, of depth from the root for each feature (considering only features appearing in the top 10 levels). The number of trees where each feature was used is shown above.
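A sketch of how the voting table might be used, with assumed output handling (see ?dtreevoting):

```r
library(feamiR)

# DT voting: num_runs and num_levels follow the description above; the shape
# of the output table is an assumption -- see ?dtreevoting.
votes <- dtreevoting(num_runs = 100, num_levels = 10)
# First column: number of runs in which each feature was used.
top_features <- rownames(votes)[order(votes[, 1], decreasing = TRUE)][1:10]
```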

  • rfgini: Using Random Forest (RF) models, we assess variable importance with impurity-based measures: the importance of each predictor is measured by its 'mean decrease in node impurity', based on the Gini index. The function calculates the cumulative mean decrease in the Gini index across num_runs (default: 100) subsamples output by preparedataset. The first column of the output table contains the cumulative Gini index for each feature across all runs (see the sketch after the figure below).

Boxplot showing distribution of mean decrease in Gini index of features across 100 runs on different subsamples of the full H. sapiens dataset, ordered by cumulative mean decrease in Gini.
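An analogous sketch for the Gini ranking, again with assumed output handling (see ?rfgini):

```r
library(feamiR)

# RF Gini ranking: num_runs follows the description above; the shape of the
# output table is an assumption -- see ?rfgini.
ginis <- rfgini(num_runs = 100)
# First column: cumulative mean decrease in Gini across all runs.
top_gini <- rownames(ginis)[order(ginis[, 1], decreasing = TRUE)][1:10]
```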

  • forwardfeatureselection: Forward Feature Selection (FFS) uses a greedy approach to select the k most discriminative features. Features are ordered by discriminative power, at each iteration selecting the feature which most increases test accuracy. The accuracy against the selection order is plotted and smoothed using LOESS; when this curve plateaus, i.e. adding further features does not significantly improve accuracy, the selection stops. Accuracy is assessed using a specified model, e.g. a linear SVM. The function outputs an ordered list of features, along with the accuracy, sensitivity and specificity achieved using them (see the sketch after the figure below).

Improvement in performance when adding features, ordered on the x-axis, selected by FFS on a subset of the full H. sapiens dataset. The line type indicates the training (dotted line) and test (continuous line) accuracy, sensitivity and specificity for the first 20 features.
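A sketch, with assumed argument names (see ?forwardfeatureselection):

```r
library(feamiR)

# Greedy forward selection scored with a linear SVM. All argument names are
# assumptions -- see ?forwardfeatureselection.
subsample <- read.csv("subsample0.csv")  # hypothetical subsample from preparedataset
ffs <- forwardfeatureselection(model = feamiR::svm, data = subsample, k = 20)
# Output: ordered features plus accuracy, sensitivity and specificity.
```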

  • geneticalgorithm: Implements a standard genetic algorithm, using the GA package, with a fitness function specialised for feature selection.
  • eGA: Feature selection based on embryonic Genetic Algorithms. It maintains an ongoing set of 'good' features which are improved run by run, by randomly selecting new features to combine with the 'good' set and applying forwardfeatureselection. It outputs training and test accuracy, sensitivity and specificity, and a list of at most k features. Both GA-based selectors are illustrated in the sketch after the figure below.

Summary of features selected by eGA on the full H. sapiens positive and negative sets: distribution of ranks (top) and accuracies (bottom) across 100 subsamples for the training data. Above the boxplots we present the number of selections per feature.
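A sketch of both GA-based selectors, with assumed argument names (see ?geneticalgorithm and ?eGA):

```r
library(feamiR)

# The two GA-based selectors; argument names are assumptions -- see
# ?geneticalgorithm and ?eGA.
subsample  <- read.csv("subsample0.csv")          # hypothetical subsample from preparedataset
ga_result  <- geneticalgorithm(data = subsample)  # standard GA over feature subsets
ega_result <- eGA(k = 10, data = subsample)       # embryonic GA, returns at most k features
```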

It is recommended to use combinations of these feature selection approaches across multiple subsamples, corroborated with statistical analysis to select discriminative features.

The features selected using these various methods can be summarised in heatmaps.

Heatmap showing the discriminative power of features across the 5 feature-selection methods on the full H. sapiens positive set: Fisher exact p-values, DT voting scheme, RF cumulative mean decrease in Gini, FFS and eGA. The top 10 features across methods were included; all scores were quantile normalised on all 144 features per method, for comparability. The stronger features are shown in darker red.
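For illustration, the sketch below assumes the per-method scores have been collected into a features-by-methods matrix called scores (hypothetical object) and uses the generic CRAN package pheatmap, which is not part of feamiR:

```r
library(pheatmap)  # generic CRAN heatmap package, used here for illustration

# `scores` is a hypothetical features x methods matrix of selection scores
# (one column per feature-selection method).
quantile_normalise <- function(m) {
  ranked <- apply(m, 2, rank, ties.method = "average")  # rank within each method
  means  <- rowMeans(apply(m, 2, sort))                 # reference distribution
  out <- apply(ranked, 2, function(r) means[round(r)])  # map ranks onto it
  rownames(out) <- rownames(m)
  out
}
pheatmap(quantile_normalise(scores), cluster_cols = FALSE)
```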
 
The truth is rarely pure and never simple. Oscar Wilde, The Importance of Being Earnest