dc.description.abstract |
Essential gene prediction helps identify the minimal set of genes indispensable for
the survival of an organism. Machine learning (ML) algorithms have been useful for
predicting gene essentiality. However, existing ML techniques for essential gene
prediction have inherent problems: imbalanced training datasets even when sufficient
labeled data are available (labeled ≥80%), limited experimentally labeled data
(labeled ≥1%), biased choice of the best model for a given balanced dataset, choice of
a complex ML algorithm, and data-based automated selection of biologically relevant
features for classification. To address these issues, two ML strategies (ML strategy 1
and ML strategy 2) were developed to predict essential genes.
ML strategy 1 is based on a supervised ML classifier, the Support Vector Machine
(SVM), and predicts essential genes in Escherichia coli from sufficient but imbalanced
experimental data (labeled ≥80%). As a novel contribution, we introduced flux-coupled
metabolic subnetwork-based features to enhance classification performance. This
strategy proved superior to existing SVM-based strategies.
ML strategy 1 underperforms when labeled data are limited (labeled ≥1%), and hence ML
strategy 2 was developed to circumvent this issue. ML strategy 2 combines an
unsupervised feature selection technique, dimensionality reduction (Kamada-Kawai
algorithm), and a semi-supervised ML algorithm (Laplacian Support Vector Machine).
A novel scoring technique, the Semi-Supervised Model Selection Score (equivalent to
the area under the ROC curve (auROC)), was developed to select the best model when
supervised performance metrics are difficult to calculate because labeled data are
lacking. In validation, this ML pipeline achieved high accuracy (auROC > 0.85) even
with 1% labeled data in both eukaryotes and prokaryotes. This strategy was applied to
Leishmania sp., for which experimentally known data are inadequate, to predict
essential genes.
Existing essential gene prediction platforms, such as Geptop and EGP, can only
annotate essential genes for model prokaryotic organisms, not for eukaryotes, and in
most cases no source code is publicly available. Hence, to annotate essential genes
with minimal effort and time, an open-source server, PRESGENE, was developed by
integrating these two ML strategies. Users can submit and analyze their data for
essential gene prediction through a user-friendly platform. The essential genes
predicted using this platform can provide important leads for assessing gene
essentiality and identifying novel therapeutic targets for antibiotic and vaccine
development against disease-causing organisms. |
en |