Development of machine learning strategies and integrated web platform for the prediction of essential genes

Nandi, S.

dc.contributor.advisor	Sarkar, R. R.
dc.contributor.author	Nandi, S.
dc.date.accessioned	2021-10-12T06:10:07Z
dc.date.available	2021-10-12T06:10:07Z
dc.date.issued	2021-05-25
dc.identifier.uri	http://dspace.ncl.res.in:8080/xmlui/handle/20.500.12252/5994
dc.description.abstract	Essential gene prediction helps to identify the minimal set of genes indispensable for the survival of an organism. Machine learning (ML) algorithms have been useful for the prediction of gene essentiality. Existing ML techniques for essential gene prediction have inherent problems, such as imbalanced training datasets even when sufficient labeled data are available (labeled ≥80%), limited experimental labeled data (labeled ≥1%), biased choice of the best model for a given balanced dataset, choice of a complex ML algorithm, and data-based automated selection of biologically relevant features for classification. To address these issues, two ML strategies (ML strategy 1 and ML strategy 2) were developed to predict essential genes. ML strategy 1 is based on a supervised ML classifier, the Support Vector Machine (SVM), and predicts essential genes in Escherichia coli from sufficient but imbalanced experimental data (labeled ≥80%). As a novel feature, we introduced flux-coupled metabolic subnetwork-based features to enhance classification performance. This strategy proved superior to existing SVM-based strategies. ML strategy 1 underperforms with limited labeled data (labeled ≥1%), and hence ML strategy 2 was developed to circumvent this issue. ML strategy 2 combines an unsupervised feature selection technique, dimension reduction (Kamada-Kawai algorithm), and a semi-supervised ML algorithm (Laplacian Support Vector Machine). A novel scoring technique, the Semi-Supervised Model Selection Score (equivalent to the area under the ROC curve (auROC)), was developed to select the best model when supervised performance metrics are difficult to calculate owing to lack of data. In validation, this ML pipeline performed with high accuracy (auROC > 0.85) even with 1% labeled data, on both eukaryotes and prokaryotes. This strategy was applied to Leishmania sp., for which experimentally known data are inadequate, to predict essential genes.
Existing essential gene prediction platforms, such as Geptop and EGP, can annotate essential genes only for model prokaryotic organisms, not for eukaryotes, and in most cases no source code is publicly available. Hence, to annotate essential genes with minimal effort and time, an open-source server, PRESGENE, was developed by integrating these two ML strategies. Users can submit and analyze their data for essential gene prediction through a user-friendly platform. The essential genes predicted using this platform will provide an important lead for predicting gene essentiality and identifying novel therapeutic targets for antibiotic and vaccine development against disease-causing organisms.	en
dc.description.sponsorship	DST-INSPIRE for Senior Research Fellowship [IF150015]	en
dc.format.extent	208 p.	en
dc.language.iso	en_US	en
dc.publisher	CSIR–National Chemical Laboratory, Pune	en
dc.subject	Bioinformatics	en
dc.title	Development of machine learning strategies and integrated web platform for the prediction of essential genes	en
dc.type	Thesis(Ph.D.)	en
local.division.division	Chemical Engineering and Process Development Division	en
dc.description.university	AcSIR	en
dc.identifier.accno	TH2487