CSIR-NCL Digital Repository

Development of machine learning strategies and integrated web platform for the prediction of essential genes

dc.contributor.advisor Sarkar, Ram Rup
dc.contributor.author Nandi, Sutanu
dc.date.accessioned 2021-10-12T06:10:07Z
dc.date.available 2021-10-12T06:10:07Z
dc.date.issued 2021-05-25
dc.identifier.uri http://dspace.ncl.res.in:8080/xmlui/handle/20.500.12252/5994
dc.description.abstract Essential gene prediction helps identify the minimal set of genes indispensable for the survival of an organism. Machine learning (ML) algorithms have been useful for the prediction of gene essentiality. Existing ML techniques for essential gene prediction have inherent problems, such as imbalanced training datasets even when sufficient data are available (labeled ≥80%), limited experimentally labeled data (labeled ≥1%), biased choice of the best model for a given balanced dataset, choice of a complex ML algorithm, and data-based automated selection of biologically relevant features for classification. To address these issues, two ML strategies (ML strategy 1 and ML strategy 2) were developed to predict essential genes. ML strategy 1 is based on a supervised ML classifier, the Support Vector Machine (SVM), and predicts essential genes in Escherichia coli from sufficient but imbalanced experimental data (labeled ≥80%). As a novel feature, we introduced flux-coupled metabolic subnetwork-based features to enhance classification performance. Our strategy proved superior to existing SVM-based strategies. ML strategy 1 underperforms on limited labeled data (labeled ≥1%), and hence ML strategy 2 was developed to circumvent this issue. ML strategy 2 utilizes an unsupervised feature selection technique, dimension reduction (Kamada-Kawai algorithm), and a semi-supervised ML algorithm (Laplacian Support Vector Machine). A novel scoring technique, the Semi-Supervised Model Selection Score (equivalent to the area under the ROC curve (auROC)), was developed to select the best model when supervised performance metrics are difficult to calculate due to lack of data. In validation, this ML pipeline achieved highly accurate performance (auROC > 0.85) even with 1% labeled data on both eukaryotes and prokaryotes. This strategy was applied to Leishmania sp., for which experimentally known data are inadequate, to predict essential genes.
Existing essential gene prediction platforms, such as Geptop and EGP, can annotate essential genes only for model prokaryotic organisms, not for eukaryotes, and in most cases no source code is publicly available. Hence, to annotate essential genes with minimal effort and time, an open-source server, PRESGENE, was developed by integrating these two ML strategies. Users can submit and analyze their data for essential gene prediction through a user-friendly platform. The essential genes predicted using this platform will provide an important lead for predicting gene essentiality and identifying novel therapeutic targets for antibiotic and vaccine development against disease-causing organisms. en
dc.description.sponsorship DST-INSPIRE for Senior Research Fellowship [IF150015] en
dc.format.extent 208 p. en
dc.language.iso en_US en
dc.publisher CSIR–National Chemical Laboratory, Pune en
dc.subject Research Subject Categories::NATURAL SCIENCES::Biology::Other biology::Bioinformatics en
dc.title Development of machine learning strategies and integrated web platform for the prediction of essential genes en
dc.type Thesis(Ph.D.) en
local.division.division Chemical Engineering and Process Development (CEPD) Division en
dc.description.university AcSIR en
dc.identifier.accno Th2487

