Mathematics, Statistics & Physics
http://hdl.handle.net/10576/3082
2024-03-28T20:36:30Z
http://hdl.handle.net/10576/48144
Assessment and Prediction of Body Fat Composition Using A Variety of Machine Learning Algorithms
Shajahan, Tahsin Raahila
Body composition is critical to health outcomes and has been studied across various populations and in conditions such as obesity and diabetes. Qatar Biobank collected anthropometric and biomedical data from individuals across all age groups. Body fat and lean mass are important measures of body composition that help identify health risks related to cardiovascular health and nutrition. Machine learning (ML) algorithms in Python were used to predict Total Fat Percentage (TFP) and Total Lean Mass (TLM).
All variables in the dataset were used to test different ML algorithms on the TFP variable. Based on performance metrics such as R2, Mean Absolute Error (MAE), and Root Mean Square Error (RMSE), linear regression, support vector regression (SVR), and extreme gradient boosting (XGBoost) performed well. Subsequently, further analysis of these models was performed using feature selection methods (forward, backward, stepwise, and information gain) at multiple cross-validation (CV) levels. We found that backward selection with 10-fold CV on the SVR model predicted TFP best, with R2 of 86.7% (train) and 80.2% (test) and MAE of 0.025 (train) and 0.030 (test). Some of the strongest variables selected by this model are testosterone, urea, gender, body mass index (BMI), and bone mineral density (BMD).
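For reference, the performance metrics cited above follow their standard definitions, with $y_i$ the observed values, $\hat{y}_i$ the predictions, and $\bar{y}$ the sample mean:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2},\qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i-\hat{y}_i\rvert,\qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}.$$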
Next, TLM was analyzed using the three models selected earlier for TFP. It was found that the linear regression and SVR models predicted TLM well, while XGBoost performed poorly. Since backward selection with 10-fold CV produced good results for TFP, the same approach was applied for feature selection. Based on the results obtained, we conclude that the linear regression model after feature selection predicts TLM best, with R2 of 83.7% (train) and 82.9% (test) and MAE of 0.313 (train and test). Some of the best variables explaining TLM are gender, age, BMI, cholesterol, and BMD.
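A minimal sketch of the kind of pipeline described above, backward feature selection with 10-fold CV wrapped around an SVR, can be written with scikit-learn as follows. The data, column names, and hyperparameters below are illustrative stand-ins only; the Qatar Biobank dataset and the thesis' exact preprocessing are not reproduced here.

# Sketch only: synthetic stand-in data, not the Qatar Biobank dataset.
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical predictors standing in for Biobank variables (names illustrative only).
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame(rng.normal(size=(n, 6)),
                 columns=["testosterone", "urea", "gender", "bmi", "bmd", "age"])
y = 0.4 * X["bmi"] - 0.3 * X["testosterone"] + 0.1 * X["bmd"] + rng.normal(0, 0.1, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Backward feature selection with 10-fold CV around an SVR estimator,
# mirroring the configuration reported above as best for TFP.
svr = SVR(kernel="rbf")
selector = SequentialFeatureSelector(svr, direction="backward",
                                     n_features_to_select=4, cv=10, scoring="r2")
model = make_pipeline(StandardScaler(), selector, svr)
model.fit(X_train, y_train)

for split, Xs, ys in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(Xs)
    print(f"{split}: R2={r2_score(ys, pred):.3f}  MAE={mean_absolute_error(ys, pred):.3f}")

Replacing SVR with LinearRegression in the same pipeline corresponds to the configuration reported above as best for TLM.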
2023-06-01T00:00:00Z
http://hdl.handle.net/10576/47659
VOLATILITY ESTIMATION IN MISSING AT RANDOM HIGH-FREQUENCY FINANCIAL TIME SERIES
ACHAIBOU, FERIEL
Over the past 15 years or more, capital markets have seen significant development, with the introduction of high-frequency trading and a shift of markets toward algorithmic trading. High-frequency and automated trading have long been believed to be a source of price shocks and rising volatility. Therefore, growing interest has recently been given to modeling volatility with high-frequency financial data. However, financial data can still be missing despite modern technology that allows data collection on a very fine time scale. Thus, this thesis focuses on the estimation of regression and volatility functions from missing data using a nonparametric heteroscedastic regression model. A Nadaraya-Watson type estimator is used when the response variable is a real-valued random variable subject to a missing-at-random mechanism, while the predictor is a completely observed infinite-dimensional (functional) random variable. Based on the observed data, we first introduce a simplified estimator as well as an inverse-probability-weighted estimator. Second, these initial estimators are used to impute missing values and to define estimators of the regression and volatility operators based on the imputed data. Third, the performance of the proposed estimators is assessed using simulated data. Finally, an application to the estimation and forecasting of the daily volatility of Brent oil price returns, conditionally on 1-minute frequency daily natural gas returns curves, is also investigated.
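In rough terms, and as a sketch rather than the thesis' exact formulation (with $\delta_i$ the indicator that $Y_i$ is observed, $K$ a kernel, $d$ a semi-metric on the functional space, $h$ a bandwidth, and $p(x)=P(\delta=1\mid X=x)$ estimated by $\hat{p}$), the simplified and inverse-probability-weighted Nadaraya-Watson estimators take the form
$$\hat{m}_S(x)=\frac{\sum_{i=1}^{n}\delta_i\,K\!\left(d(x,X_i)/h\right)Y_i}{\sum_{i=1}^{n}\delta_i\,K\!\left(d(x,X_i)/h\right)},\qquad \hat{m}_{IPW}(x)=\frac{\sum_{i=1}^{n}\frac{\delta_i}{\hat{p}(X_i)}\,K\!\left(d(x,X_i)/h\right)Y_i}{\sum_{i=1}^{n}\frac{\delta_i}{\hat{p}(X_i)}\,K\!\left(d(x,X_i)/h\right)},$$
and, under the heteroscedastic model $Y_i=m(X_i)+\sigma(X_i)\varepsilon_i$ with $E(\varepsilon_i\mid X_i)=0$ and $E(\varepsilon_i^2\mid X_i)=1$, the volatility operator can be estimated by applying the same weights to $Y_i^2$ and setting $\hat{\sigma}^2(x)=\widehat{E}(Y^2\mid X=x)-\hat{m}(x)^2$.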
2023-06-01T00:00:00Z
http://hdl.handle.net/10576/44991
ON THE PREFERENCE OF ZERO-INFLATION MODELS WITH THE PRESENCE OF DATA CONTAMINATION
ELSOUSY, REEM MOHAMAD RIFAAT
Data has become central to how researchers solve problems and improve everyday life, and it is not unusual for different data sources to generate data with different characteristics. In fields such as engineering, epidemiology, psychology, sociology, public health, agriculture, road safety, economics, biology, medicine, and others, the dependent variable commonly contains a high proportion of zeros; this phenomenon is called zero-inflation of the count. For example, the number of deaths in a car accident is most likely zero, and counting rare birds in a specific region will yield zero in most sectors. When the outcome consists of non-negative count values with excessive zeros, classical models cannot properly infer the relationship between the covariates and the dependent variable. Furthermore, some approaches, such as neural networks and logistic regression, require an equal percentage of the output classes when modeling the data, for instance an equal number of observations for class 0 and class 1. Following this leads to discarding a large portion of the data, resulting in inference based on a sample that does not represent the entire population. To cope with these downsides, researchers have focused on developing methods to model the entire set of zero-inflated count data. Several approaches have been constructed to express the dependency between the covariates and the dependent variable in the presence of excessive zero counts. However, discrepancies in performance were found when certain factors played a role, and some algorithms suffer from specific obstacles: Poisson and zero-inflated Poisson regression are restricted by the assumption of equidispersion; the negative binomial cannot handle zero-inflation in the response; and the zero-inflated negative binomial may fail to converge in some cases. Moreover, other factors can influence model selection, such as the proportion of zeros, the sample size, and the degree of dispersion; different values of these factors can change the performance of a model. In addition, the structure of the selected population can be another factor that affects the results, as can the condition of the data, whether it contains outliers, missing data, or measurement errors. Considering all of the above, this study aims to lay down basic guidelines that researchers can follow when dealing with zero-inflated phenomena.
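As a rough illustration of the modeling choice discussed above, the following sketch simulates zero-inflated counts and compares a Poisson fit with a zero-inflated Poisson fit by AIC using statsmodels. The data, coefficients, and zero-inflation proportion are invented for illustration and are not from this study.

# Sketch only: simulated zero-inflated counts, not data from the thesis.
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
X = sm.add_constant(x)                      # design matrix for the count part
lam = np.exp(0.5 + 0.8 * x)                 # Poisson mean of the count component
pi = 0.30                                   # assumed proportion of structural zeros
structural_zero = rng.random(n) < pi
y = np.where(structural_zero, 0, rng.poisson(lam))

poisson_fit = sm.Poisson(y, X).fit(disp=0)
zip_fit = ZeroInflatedPoisson(y, X, exog_infl=np.ones((n, 1)),
                              inflation='logit').fit(disp=0)

# The model with the lower AIC is preferred; with excessive zeros the ZIP
# fit typically dominates the plain Poisson fit on data like this.
print(f"Poisson AIC: {poisson_fit.aic:.1f}   ZIP AIC: {zip_fit.aic:.1f}")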
2023-06-01T00:00:00Z
http://hdl.handle.net/10576/41067
Reliability analysis of the Stress-Strength model from truncated Pareto distribution based on progressive Type-II censored samples.
Ali, Hadeel Mohammed
In this project, we studied stress-strength reliability (SSR) models. The stress-strength model has many applications in engineering problems, for example the strength of a building subjected to an earthquake, the strength of a rocket motor relative to its working pressure, and the strength of a bridge. We estimated the reliability parameter using the maximum likelihood estimation method in three cases (the arbitrary case, the common truncated case, and the common resilience parameter case). We computed the maximum likelihood estimator (MLE) of the reliability parameter R, studied the properties of the estimator of R through extensive simulation studies, and illustrated our method through some real data examples. Moreover, we computed generalized confidence intervals based on pivotal quantities, as well as bootstrap confidence intervals. We found that the confidence interval is wider in the arbitrary parameter case, and that there is no large difference between the estimators of the reliability parameter obtained using the different methods.
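For context, in the continuous case the stress-strength reliability of a strength $X$ against an independent stress $Y$ is commonly written as
$$R = P(Y < X) = \int F_Y(t)\, f_X(t)\, dt, \qquad \hat{R} = R(\hat{\theta}_X, \hat{\theta}_Y),$$
where the plug-in estimator $\hat{R}$ follows from the invariance property of maximum likelihood, and a percentile bootstrap interval is obtained from the empirical $\alpha/2$ and $1-\alpha/2$ quantiles of the bootstrap replicates $\hat{R}^{*}$. This is a generic formulation; the thesis' expressions specific to the truncated Pareto distribution under progressive Type-II censoring are not reproduced here.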
2023-01-01T00:00:00Z