2021-2022 Capstone Projects

RACE FROM FACE

Race classification is a core component of many security and defense applications. This project attempts the race classification problem from facial images using transfer learning. All pre-trained models are from Keras Applications; the main ones are DenseNet121, ResNet50, MobileNetV3Large, and EfficientNetV2B2, all pre-trained on the ImageNet data set. This project uses the FairFace data set, which includes seven race groups: White, Black, Indian, East Asian, Southeast Asian, Middle Eastern, and Latino. The authors of FairFace collected the images from the YFCC-100M Flickr data set and labeled them with race, gender, and age group. For some of the experiments, we use the MaskTheFace algorithm to add surgical masks to the FairFace images. All convolutional neural network models consist of a pre-trained Keras model and a classification layer. First, the models are trained on the FairFace data set. Based on a comparative experiment, MobileNetV3Large is then chosen as the most efficient model in terms of the trade-off between accuracy and memory requirements. We evaluate that model on the data set containing masked face images. Finally, the model is trained on the masked data set to compare its performance in two setups: when it sees only images without masks and when it sees images with and without masks.
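The described setup, a frozen pre-trained Keras base model with a classification layer on top, can be sketched as follows. The input size and optimizer here are illustrative assumptions; only the choice of MobileNetV3Large, ImageNet weights, and the seven FairFace classes come from the abstract.

```python
# Sketch of the transfer-learning setup: a frozen Keras base model
# plus a small classification head. Input size and optimizer are
# assumptions; the base model and class count follow the abstract.
import tensorflow as tf

NUM_CLASSES = 7  # FairFace race groups

def build_model(weights="imagenet"):
    base = tf.keras.applications.MobileNetV3Large(
        input_shape=(224, 224, 3), include_top=False, weights=weights)
    base.trainable = False  # keep the ImageNet features fixed
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Freezing the base keeps memory and training cost low, which matches the accuracy/memory trade-off the comparison above is concerned with.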

Student: Arpi Hunanyan

Supervisor: Varduhi Yeghiazaryan

OPTIMALITY OF STATE-OF-THE-ART LANGUAGE REPRESENTATION MODELS IN NLP APPLICATIONS

Pre-trained models like BERT, Word2Vec, and FastText are widely used in many NLP applications such as chatbots, text classification, and machine translation. Such models are trained on huge corpora of text data and can capture statistical, semantic, and relational properties of the language. As a result, they provide numeric representations of text tokens (words, sentences) that can be used in downstream tasks. Having such pre-trained models off the shelf is convenient in practice, as it may not be possible to obtain good-quality representations by training them from scratch due to lack of data or resource constraints. In practical settings, such embeddings are often used as inputs to models that serve the purpose of the task. For example, in a sentence classification task, it is possible to train a Logistic Regression on top of averaged Word2Vec embeddings. Using such embeddings on real-life industrial problems can produce optimistic improvements over baselines; however, it is not clear whether those improvements are reliable. In our study, we investigate this question by formulating multiple tasks that are applicable and viable in industry and replicating the workflow of data scientists. Our goal is to construct various models (differing in sophistication) that use embeddings as inputs and to apply a methodology for reporting the confidence bounds of the metric of interest. With this experiment we hope to develop an understanding of the phenomenon of results that look optimal on paper but might not be optimal in reality, aiming to find a reliable method that will aid decision making and facilitate model selection.
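One standard way to obtain confidence bounds for a metric of interest, as described above, is a percentile bootstrap over the test-set predictions. A minimal stdlib-only sketch (the labels below are toy data, not from the study):

```python
# Percentile-bootstrap confidence bounds for accuracy, as a stand-in
# for the metric-reporting methodology described above.
# The labels below are toy data, not from the study.
import random

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Return (lower, upper) percentile bootstrap CI for accuracy."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        acc = sum(y_true[i] == y_pred[i] for i in idx) / n
        scores.append(acc)
    scores.sort()
    lo = scores[int((alpha / 2) * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]  # 8/10 correct
low, high = bootstrap_accuracy_ci(y_true, y_pred)
```

If the interval for an embedding-based model overlaps the baseline's interval, the "improvement" may not be reliable, which is exactly the failure mode the study targets.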

Student: Larisa Poghosyan

Supervisor: Vahe Hakobyan

COMPUTATIONAL ANALYSIS OF ADENO-ASSOCIATED VIRUS VECTORS FOR GENE EDITING APPLICATIONS

Adeno-associated viruses (AAV) are one of the most actively investigated gene therapy vehicles. However, several factors challenge their applications in humans, such as immune response and inability to reach target tissues. Various strategies are being applied to produce novel AAV variants with properties that overcome these challenges. One of them is directed evolution, which facilitates the engineering of proteins with desired features by applying mutagenesis and selective pressure. In particular, DNA shuffling is used to produce novel variants by fragmentation and reassembly of AAV capsid genes. However, systematic computational analysis of the resulting variants is still limited. This paper introduces a new computational tool that enables comprehensive exploratory analysis of AAV chimeric libraries and identification of successful variants by extracting quantitative data from the sequence libraries.
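A core step in analyzing a shuffled library is assigning each region of a chimeric sequence to its likely parent serotype. A toy sketch of the idea via exact k-mer matching (the parent sequences and k are invented for illustration, not real AAV capsid genes):

```python
# Toy illustration of parental assignment for a chimeric (shuffled)
# sequence by exact k-mer lookup. Parent sequences are made-up
# strings, not real AAV capsid genes; k=4 is arbitrary.
K = 4

def parent_kmer_index(parents):
    """Map each k-mer to the set of parents that contain it."""
    index = {}
    for name, seq in parents.items():
        for i in range(len(seq) - K + 1):
            index.setdefault(seq[i:i + K], set()).add(name)
    return index

def assign_parents(chimera, parents):
    """For each k-mer window of the chimera, list matching parents."""
    index = parent_kmer_index(parents)
    return [sorted(index.get(chimera[i:i + K], set()))
            for i in range(len(chimera) - K + 1)]

parents = {"AAV2": "ACGTACGTTT", "AAV9": "TTGGCCAAGG"}
chimera = "ACGTACCAAGG"  # left part from AAV2, right part from AAV9
calls = assign_parents(chimera, parents)
```

Windows matching no parent (or several) mark recombination breakpoints or shared regions; aggregating such calls across a library yields the quantitative composition data the tool extracts.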

Student: Tatevik Jalatyan

Supervisors: Dr. Lilit Nersisyan, Dr. Erik Aznauryan

USING HANDWRITING FEATURES TO PREDICT STUDENT PERFORMANCE IN IT AND ENGINEERING DOMAIN

This paper presents a research project in the field of handwriting analysis aimed at designing models that can assess student performance in IT and engineering majors. The anonymized data was provided by the American University of Armenia, and the handwriting samples came from midterm exams of two different university courses. Various machine learning approaches, such as Logistic Regression, Support Vector Machines, Random Forest, Decision Trees, K Nearest Neighbors, and Multilayer Perceptron, as well as statistical methods, were used to design a predictive model that could infer student performance from handwriting characteristics. The main findings concern the impact of the choice of data, handwriting features, and machine learning models. The research was divided into two stages, each with a different dataset and problem statement. The first stage was concerned with grade prediction, and the second with general student performance prediction. Lessons learned during the first stage were applied during the second: some handwriting features were dropped, some domain-specific handwriting features were added, and directions for defining new handwriting features in the future were discovered. The dataset was limited and contained a very small number of samples, which was a major limitation of the study. Therefore, in future stages, more data will need to be collected in order to obtain better results.
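Of the models listed, K Nearest Neighbors is simple enough to sketch end to end. The feature names and values below (slant, letter size, spacing) are invented for illustration; the study's actual feature set differs.

```python
# Minimal k-nearest-neighbours classifier over handwriting-style
# feature vectors. Feature values and labels are toy examples,
# not the study's data.
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label); returns majority label
    among the k training points closest to query (Euclidean)."""
    dists = sorted((math.dist(x, query), label) for x, label in train)
    top = [label for _, label in dists[:k]]
    return Counter(top).most_common(1)[0][0]

# (slant_deg, letter_size_mm, spacing_mm) -> performance band
train = [
    ((5.0, 3.0, 1.2), "high"),
    ((4.0, 3.2, 1.1), "high"),
    ((15.0, 5.5, 0.5), "low"),
    ((14.0, 6.0, 0.6), "low"),
]
pred = knn_predict(train, (4.5, 3.1, 1.0), k=3)
```

With so few samples, as the abstract notes, distance-based methods like this are especially sensitive to which features are kept, which is why feature selection changed between the two stages.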

Student: Naira Khachatryan

Supervisor: Suren Khachatryan


MOLECULAR SUBTYPING OF COLORECTAL CANCER: AT THE FRONTIERS OF PERSONALIZED DIAGNOSTICS AND TREATMENT

Being the third most diagnosed and second most deadly cancer worldwide, colorectal cancer (CRC) is a highly complex, multigenic disease with very high inter-patient variability in tumor genetics. This raises the need for personalized treatment of CRC patients for better efficacy and reduced toxicity. Molecular subtyping of the disease is a way to define biological subgroups for which targeted treatment can be optimized. Our research aimed to test different Machine Learning and Deep Learning models that combine theoretically and practically tested state-of-the-art concepts to obtain biologically meaningful clusters, serving as CRC patient subgroups, from somatic mutations and copy number alterations. Four methods with different types of inputs were tested: Spectrum, xGeneModel, K-means clustering, and Deep Embedded Clustering (DEC). Our results showed that the most effective way of obtaining these subgroups was the DEC model applied to data biologically enriched beforehand using Biological Process gene sets from Gene Ontology. The obtained clusters were treated as labels to build classifiers as a predictive tool for incoming patient records, of which Logistic Regression performed best. Survival analysis showed that the obtained clusters were not distinct in terms of overall survival patterns. However, we put forward the hypothesis that they may differ significantly under specific drug treatments, for which we did not have sufficient data to test. The code and materials are available at: https://github.com/susieavagyan/capstone-cancer-subtyping.
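The biological enrichment step, collapsing gene-level mutation profiles onto Gene Ontology Biological Process gene sets before clustering, can be illustrated with toy data. The gene-set names and patient profiles below are invented placeholders, not the study's real GO terms or cohort:

```python
# Collapse per-patient mutated-gene sets onto gene sets (pathway-level
# features), illustrating the enrichment step described above.
# Gene sets and patients are toy examples, not real GO terms or data.
def pathway_features(mutations, genesets):
    """mutations: {patient: set of mutated genes};
    returns {patient: {geneset: fraction of its genes mutated}}."""
    out = {}
    for patient, genes in mutations.items():
        out[patient] = {
            gs: len(genes & members) / len(members)
            for gs, members in genesets.items()
        }
    return out

genesets = {
    "GO:wnt_signaling": {"APC", "CTNNB1", "AXIN2"},
    "GO:mapk_cascade": {"KRAS", "BRAF"},
}
mutations = {
    "patient1": {"APC", "KRAS", "TP53"},
    "patient2": {"BRAF"},
}
features = pathway_features(mutations, genesets)
```

The resulting low-dimensional pathway-level matrix, rather than the sparse gene-level one, is what the clustering models (including DEC) would consume.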

Student: Susanna Avagyan

Supervisor: Hans Binder

SCALABLE DATA COLLECTION AND MACHINE LEARNING FOR AUTOMATED VALUATION OF ARMENIAN REAL ESTATE

Real estate is one of the major sectors of the Armenian economy and has been developing dynamically. Recently, large online platforms have emerged in Armenia to advertise real estate offerings, reducing information asymmetry and increasing liquidity in both sales and rental markets. With granular data covering a representative portion of real estate offerings available online, it is increasingly tenable to monitor the real estate market and develop analytical tools that can accurately estimate the value of real estate assets based on their internal and external features. This research sets out not only to assess the performance of a special class of machine learning models, tree-based bagging and boosting ensembles, in estimating the prices of apartments and houses in Armenia, but also to create a highly accurate computer vision framework that predicts which design style a property corresponds to: modern, classic, or Soviet. We created scalable data collection pipelines to build an Armenian real estate database, which is further used to develop robust models for price prediction and interior style detection. Our experiments showed that XGBoost outperforms the Random Forest and CatBoost models. Furthermore, using the SHAP approach for feature importance calculation, we determined that the three most decisive factors are surface area, coordinates, and construction material for predicting apartment value, and the number of bathrooms, coordinates, and interior area for predicting house value. The best results for price prediction were achieved with XGBoost, yielding an R-squared of 0.69 for houses and 0.83 for apartments. We further enhance our understanding of the features important for automated price prediction by assessing various deep learning architectures for visual interior style prediction in a few-shot fine-tuning setting.
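The R-squared values reported above (0.69 for houses, 0.83 for apartments) measure the fraction of price variance the model explains. For reference, a minimal implementation of the metric on toy numbers (not listing data):

```python
# R-squared (coefficient of determination), the metric used above to
# score the price models. The prices below are toy numbers.
def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

actual = [100.0, 150.0, 200.0, 250.0]    # e.g. prices in thousands
predicted = [110.0, 140.0, 195.0, 260.0]
score = r_squared(actual, predicted)
```

A score of 1.0 means perfect prediction and 0.0 means no better than always predicting the mean price, so 0.83 for apartments leaves roughly 17% of price variance unexplained by the listed features.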

Student: Davit Martirosyan

Supervisor: Erik Arakelyan

RETARGETING OPTIMIZATION FOR REFINANCING CAR LOANS: A CASE STUDY OF A US-BASED LENDER

Gradient boosting methods have proven to be a very effective strategy: many successful machine learning solutions have been built on XGBoost and its derivatives. The aim of this study is to investigate and compare the efficiency of XGBoost, LightGBM, and Logistic Regression on a car loan retargeting problem. The work uses a car loan dataset containing 232 features and 9,500,000 records. For the purposes of the study, the features are analyzed and several techniques are used to rank and select the best ones. For filling in missing values, MissForest, KNN, and simpler strategies, such as filling with zeros or with the mean/median, are compared. The implementation indicates that LightGBM in combination with MissForest is faster and more accurate than Logistic Regression and XGBoost across varying numbers of features and records.
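The simpler imputation baselines compared above (zero, mean, and median fill) can be sketched with the standard library alone; the column values below are toy numbers, not loan records:

```python
# Simple missing-value imputation strategies, the baselines compared
# against MissForest and KNN in the study: fill None entries with 0,
# the column mean, or the column median. The column is toy data.
from statistics import mean, median

def impute(column, strategy="mean"):
    """Return a copy of the column with None entries filled in."""
    observed = [v for v in column if v is not None]
    fill = {"zero": 0,
            "mean": mean(observed),
            "median": median(observed)}[strategy]
    return [fill if v is None else v for v in column]

col = [4.0, None, 10.0, None, 1.0]
filled = impute(col, "median")
```

Unlike MissForest or KNN, these strategies ignore relationships between features, which is the usual accuracy trade-off against their speed and simplicity.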

Student: Lia Harutyunyan

Supervisor: Anna Sargsyan