
Simple Machine Learning Project using Python with pandas, numpy and sklearn


Simple Machine Learning Project using Python with pandas, numpy and sklearn. Only a Jupyter notebook is required, no presentation. It should show the data source (it can be from Kaggle), the steps of exploratory data analysis, data cleaning, visualization (charts), transformation, feature selection, etc., and the model evaluation and selection. Preferred domain: retail, such as sales prediction or inventory optimization. I am attaching an example of a notebook I did on my own for a different project (it should be something similar, but a little better). Dataset requirement: minimum 10,000 rows and 5 columns.
Additional Instructions:
Hi @channel Please kindly use this thread for Machine Learning Project Q&A

Machine Learning Project: Instructions + Hand-in

Rules
· Reply to this thread for Q&A
· Don't share your code
· SQL and Python documentation links are allowed

General Questions

Q: Is the ML capstone in groups or individual? What should we be handing in?
A: It's individual. Every student has to hand in:
1. Your short two-page analytics plan with a business background intro
2. Your notebook for the project (see Slides for more details on the project outline)
3. A 15-20 page slide deck for the project presentation

Q: Expectations for the presentation/hand-in on the 9th are still a bit ambiguous. What does 70% complete look like, and what should we be presenting at this phase of the project?
A: Here is the checklist for the 70%:
Part 1 (Presentation)
1. Agenda
2. Motivation for the dataset (business) chosen
3. Show your workflow + notebook
4. Insights/Conclusions
5. Challenges
6. Next steps
Part 2 (Notebook)
1. Gathering data (example: a dataset from Kaggle, or the dataset from the web scraping project)
2. Data cleaning
3. Data visualization
4. Data transformation
5. Creating new features and feature selection
6. Basic model
7. Training & evaluation
8. GridSearchCV (optional)
9. Final ML model
10. Deep learning (optional)
11. Prediction (explain the metrics you choose)
Here is the checklist for the remaining 30%:
1. More data cleaning, visualization and new features
2. Interpreting the machine learning model
3. Hyperparameter tuning
4. Deep learning
5. Post it on your blog or website

Ideas for Datasets
· Practice Machine Learning with Small In-Memory Datasets
· Tour of Real-World Machine Learning Problems
· Work on Machine Learning Problems That Matter To You
· Top 47 Machine Learning Projects for 2022
· 285+ Machine Learning Projects with Python

Readings & Documentation
Visualization
· Plotly Open Source Graphing Library for Python
ML
· New Understanding Train Test Split
· Preprocessing: OneHotEncoder() vs pandas.get_dummies
· Ordinal and One-Hot Encodings for Categorical Data
· sklearn.compose.ColumnTransformer
· Various ways to evaluate a machine learning model's performance
· Pipelines and composite estimators
· Cross-validation: evaluating estimator performance
· Tuning the hyper-parameters of an estimator
(A minimal sketch showing how ColumnTransformer, Pipeline and GridSearchCV fit together follows this list.)
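To make a few of the readings above concrete, here is a minimal sketch of how ColumnTransformer, a Pipeline and GridSearchCV fit together for a regression task. All names (df, feature_a, feature_b, feature_c, target) are synthetic placeholders, not from any course dataset:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

#synthetic toy data standing in for a real dataset
df = pd.DataFrame({'feature_a': [1.0, 2.0, 3.0, 4.0] * 10,
                   'feature_b': [0.5, 1.5, 2.5, 3.5] * 10,
                   'feature_c': ['x', 'y', 'x', 'y'] * 10,
                   'target':    [10.0, 20.0, 30.0, 40.0] * 10})

#scale the numeric columns and one-hot encode the categorical one in a single step
preprocess = ColumnTransformer([('num', MinMaxScaler(), ['feature_a', 'feature_b']),
                                ('cat', OneHotEncoder(handle_unknown='ignore'), ['feature_c'])])

#chain preprocessing and model so cross-validation re-fits both on every fold
pipe = Pipeline([('prep', preprocess),
                 ('model', RandomForestRegressor(random_state=1))])

#cross-validated search over a small hyperparameter grid
grid = GridSearchCV(pipe,
                    param_grid={'model__n_estimators': [50, 100, 200],
                                'model__max_depth': [None, 7, 10]},
                    scoring='neg_root_mean_squared_error', cv=5)

X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='target'), df['target'],
                                                    test_size=0.3, random_state=1)
grid.fit(X_train, y_train)
print(grid.best_params_, -grid.best_score_)   #best settings and their cross-validated RMSE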
Machine Learning Midterm Project

Project Outline

Domain
Modern society relies heavily on institutionalized public policies to solve relevant problems, usually implemented by governments or nonprofit organizations. Such policies affect most aspects of our daily lives, and policies implemented inadequately or under incorrect assumptions bear a high cost to society (e.g. resources, quality of life, health, etc.); many times the damage is visible only in the long term, outlasting the administration that implemented them. Considering this, it feels natural that policymakers rely on vast amounts of historical data and statistical methods. As many policies target improvements in human development aiming at future goals, leveraging machine learning alongside human decision-making has the potential to improve the effectiveness of those policies and their outcomes for citizens.

Life Expectancy as Target
This project aims to visualize how income, health, education, etc. indices affect life expectancy, and to use those indices to predict it. Life expectancy is a factor in measuring human development and is usually used to describe the physical quality of life. It is also a critical demographic indicator for setting effective policies. For example, a rising life-expectancy trend may signal to policymakers that funding pension plans or updating their rules may be necessary, while the opposite may signal that investments in healthcare are required to reduce mortality.

Program Structure

Data Collection
For this project, the Life Expectancy dataset from the WHO will be used. The dataset can be found at: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who

Exploratory Data Analysis & Preparation
The dataset will be loaded into a pandas DataFrame and analyzed (including the plotting of some charts) to understand traits such as:
· What is the data distribution?
· What is the relationship/correlation between the features and the target?
· What is the relationship/collinearity between the features themselves?
· Are there missing values? What strategy makes sense to handle them: drop, impute values, etc.?
· What is the shape of the dataset? Which approaches can be used to wrangle the data?
· Are there any features that need encoding? Should the dataset be scaled?
· Which features could be used to predict the label Life Expectancy, and could the analysis focus on a subset of them (e.g. economic factors only)?

Model Evaluation & Selection
Given this is a regression problem, the following models will be evaluated:
· Linear Regression
· Polynomial Regression
· Decision Tree Regression
· Random Forest Regression
The approach to evaluate them consists of the following steps (a minimal sketch of this workflow follows the outline):
1. Split the data into train and test sets.
2. Further split the train data into training and validation sets using the K-fold approach.
3. Use the K-folds to assess the performance of different algorithms with different hyperparameters.
4. Compare the performance of the models and select the one with the best performance.
5. Use the test data (from step 1) to evaluate whether the selected model is adequate to predict the Life Expectancy.

Compare Model Performance Using Alternative Features
Given that this dataset has several features, the performance of alternative models focusing on specific areas (e.g. economic factors, immunization, etc.) will also be compared.
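As a minimal sketch of the split-then-K-fold workflow outlined in steps 1-5 above (toy data stands in for the prepared features and target; the model list is abbreviated):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

#toy data standing in for the prepared features/target
X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=1)

#step 1: hold out a test set that is only touched at the very end
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

#steps 2-4: score candidate models on K folds of the training data only
kf = KFold(n_splits=5, shuffle=True, random_state=1)
for name, model in [('Linear Regression', LinearRegression()),
                    ('Random Forest', RandomForestRegressor(random_state=1))]:
    rmse = -cross_val_score(model, X_train, y_train, cv=kf,
                            scoring='neg_root_mean_squared_error')
    print(f'{name}: mean RMSE {rmse.mean():.2f} (std {rmse.std():.2f})')

#step 5: fit the chosen model on the full training set and evaluate once on the test set
best = RandomForestRegressor(random_state=1).fit(X_train, y_train)
print('test R^2:', best.score(X_test, y_test))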
Data Summary
The WHO Life Expectancy dataset contains data from 2000 to 2015 for all countries. Each observation contains immunization, mortality, economic and social factors that may affect life expectancy. For the purpose of this project, the data has been cleaned, reorganized, and the features renamed to make them less confusing. The raw dataset can be found at: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who

Known Issues in the Raw Dataset That Will Be Addressed
· Measles Cases comprises the total number of cases reported in the given year instead of cases per 1,000 population. (Values compared against the WHO dataset: https://immunizationdata.who.int/pages/incidence/MEASLES.html?CODE=Global&YEAR=)
· Population has some invalid observations. Given that Population is not relevant to predict Life Expectancy, it will be dropped.
· Percentage Expenditure is actually a dollar amount of health expenditure per capita.
· Total Expenditure is actually the percentage of health expenditure in relation to GDP, not total government expenditure. (Approximate values compared against: https://www.statista.com/statistics/268826/health-expenditure-as-gdp-percentage-in-oecd-countries/)

Data Mapping (*)
· Country: country name.
· Year: year.
· Status: indicates whether the country is a "Developed" country or still a "Developing" country.
· Infant Deaths: number of infant deaths per 1,000 population.
· Child Mortality: number of deaths of children under 5 years old per 1,000 population.
· Adult Mortality: adult mortality rate (both sexes) between 15 and 60 years old per 1,000 population.
· HIV/AIDS Deaths: number of deaths per 1,000 births.
· Measles Cases: number of reported cases per year (for the entire population).
· Hepatitis B Immunization: immunization coverage among 1-year-olds (in percent).
· Polio Immunization: immunization coverage among 1-year-olds (in percent).
· Diphtheria Immunization: diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (in percent).
· Alcohol Consumption: consumption per capita (above 15 years old) in litres of pure alcohol.
· Average BMI: average Body Mass Index of the entire population.
· Malnutrition 5-9: prevalence of thinness among children aged 5 to 9 years old (in percent).
· Malnutrition 10-19: prevalence of thinness among children and adolescents aged 10 to 19 years old (in percent).
· Population: population of the country in that year.
· GDP per Capita: Gross Domestic Product per capita in USD.
· Health Expenditure: health expenditure per capita in USD.
· Health Expenditure GDP: health expenditure as a percentage of GDP.
· Schooling Years: number of years of schooling.
· Income Composition: Human Development Index income composition of resources, comprising the relative share of each income source in the aggregate total income of that area (an index in the 0-1 range).
· Life Expectancy: target/label, life expectancy in years.
(*) After column renaming and information correction.

Data Format
The dataset is a CSV file containing 22 columns (factors) and 2,938 rows (observations). To support the analysis, the data will be loaded into a pandas DataFrame.
[Output of raw_data_df.head(): the first five rows of the raw dataset (Afghanistan, 2011-2015) across all 22 columns - 5 rows × 22 columns.]

(2938, 22)

array(['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure',
       'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population', ' thinness 1-19 years',
       ' thinness 5-9 years', 'Income composition of resources', 'Schooling'],
      dtype=object)

Note the inconsistent column names in the raw file (stray spaces, mixed casing), which motivates the renaming below.

Converting Raw Data to Modeling Data
As the purpose of this project is to predict the Life Expectancy using the health, economic, etc. factors, the Country and Year features are not considered relevant and will be dropped. Also, to make the dataset more intelligible, some features will be renamed.

[Output of model_data_df.head(): the first five rows with the renamed and reordered columns - Infant Deaths, Child Mortality, Adult Mortality, HIV/AIDS Deaths, Measles Cases, Hepatitis B Immunization, Polio Immunization, Diphtheria Immunization, Alcohol Consumption, Average BMI, Malnutrition 5-9, Malnutrition 10-19, Population, GDP per Capita, Health Expenditure, Health Expenditure GDP, Schooling Years, Income Composition, Status, Target.]

Checking for Missing or Invalid Values
Now that the modeling DataFrame is ready, it is necessary to check whether there are observations where the Life Expectancy is missing. Given that this is the target (label), an observation without this value is not useful for training and testing a machine learning algorithm, and therefore it should be dropped. Some features also have units that need to be checked for consistency:
· Percentage columns should not have values above 100.
· Columns with rates per 1,000 population should not have values of 1,000 or more.
· Income Composition should not have values outside the 0-1 range.
If a considerable volume of inconsistent observations is encountered, an empty (missing) value will be imputed in their place, so they can be handled by the same rules as missing values. The other features (factors) should also be checked in order to decide what to do with them (e.g. imputation, removal, etc.). Given that this check may be performed more than once, a function is created for it.

Observations with target variable missing: 10
Observations with target variable missing after cleanup: 0
Infant Deaths has 14 invalid values.
Child Mortality has 16 invalid values.
Analysis of the Missing Values
The volume of missing data is high for the features Hepatitis B Immunization, Population and GDP per Capita. Given that Population is not a relevant feature to predict the label Life Expectancy, it will be dropped. For the other features, the data distribution will be analyzed to support an imputation decision using either the mean (average) or the median. That seems to be a good approach, considering that:
· The dataset comprises very generalized data (all countries over a period of 15 years) and the Life Expectancy is an estimated (not precise) value.
· The dataset is not large, and it would be preferable not to lose many data points which may contain other relevant features to predict the Life Expectancy.
Though data imputation seems to be a good strategy for this project, it should not be taken lightly. To reduce the distortions it may introduce, the data will be analyzed in two groups that share common traits: Developed countries and Developing countries.

Missing values per feature (count and percentage of all rows):
Feature                    Count   Percentage
Hepatitis B Immunization   553     18.9
Polio Immunization         19      0.6
Diphtheria Immunization    19      0.6
Alcohol Consumption        193     6.6
Average BMI                32      1.1
Malnutrition 5-9           32      1.1
Malnutrition 10-19         32      1.1
Population                 644     22.0
GDP per Capita             443     15.1
Health Expenditure GDP     226     7.7
Schooling Years            160     5.5
Income Composition         160     5.5

Imputation of Missing Data for Developed Countries
After analysis of the histograms below, the choice is to impute the median for the following features: Hepatitis B, Polio, Diphtheria, Population, Malnutrition and Income Composition. All other features with missing values will have the mean imputed.

Imputation of Missing Data for Developing Countries
After analysis of the histograms below, the choice is to impute the median for the following features: Hepatitis B, Polio and Diphtheria. All other features with missing values will have the mean imputed. (A compact pandas alternative to this grouped imputation is sketched below.)

[Output: the post-imputation missing-value check returns an empty table.]
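As a side note, the same grouped mean/median imputation can be written compactly with pandas groupby/transform. A minimal sketch, assuming a DataFrame with a 'Status' group column (the toy values below are placeholders, not the real dataset):

import numpy as np
import pandas as pd

#toy frame standing in for the modeling data
df = pd.DataFrame({'Status': ['Developed', 'Developed', 'Developing', 'Developing'],
                   'GDP per Capita': [40000.0, np.nan, 1200.0, np.nan]})

#fill each group's missing values with that group's own median (use .mean() analogously)
df['GDP per Capita'] = (df.groupby('Status')['GDP per Capita']
                          .transform(lambda s: s.fillna(s.median())))
print(df)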
Analyze Data for Feature Selection

Correlation
Looking at the correlation, Population can be dropped, as its correlation with the Life Expectancy is close to none. Along with Schooling and Income, Adult Mortality has a strong correlation with the Life Expectancy, whereas Infant Deaths and Child Mortality (under 5 years old) have a much weaker one. However, it would not be wise to drop Child Mortality: as a matter of common sense, if child mortality is high it affects other demographic factors that may directly or indirectly influence life expectancy. For instance, there is a strong negative correlation between Malnutrition and Life Expectancy, and a positive correlation between those same factors and Child Mortality, suggesting that in countries where malnutrition is prevalent, child mortality is higher and life expectancy lower. Furthermore, looking at historical factors, until the middle of the 20th century infant mortality was approximately 40-60% of total mortality. Excluding child mortality, the average life expectancy during the 12th-19th centuries was approximately 55 years: if a medieval person survived childhood, they had about a 50% chance of living 50-55 years, instead of only 25-40 years. Given that Infant Deaths is included in Child Mortality (and they have a very strong correlation), the Infant Deaths feature will be dropped. (A systematic way to surface such highly correlated pairs is sketched at the end of this section.)

The country Status will also be dropped: after being used for missing-value imputation it is no longer necessary (and, being categorical, dropping it also removes the need to encode it).

Correlation of each feature with Target, Child Mortality and Adult Mortality (rounded to three decimals):
Feature                     Target    Child Mortality   Adult Mortality
Infant Deaths               -0.197     0.997             0.079
Child Mortality             -0.223     1.000             0.094
Adult Mortality             -0.696     0.094             1.000
HIV/AIDS Deaths             -0.557     0.038             0.524
Measles Cases               -0.158     0.508             0.031
Hepatitis B Immunization     0.161    -0.167            -0.119
Polio Immunization           0.458    -0.189            -0.270
Diphtheria Immunization      0.473    -0.196            -0.271
Alcohol Consumption          0.380    -0.111            -0.178
Average BMI                  0.552    -0.236            -0.375
Malnutrition 5-9            -0.454     0.469             0.295
Malnutrition 10-19          -0.460     0.464             0.289
Population                  -0.029     0.539            -0.005
GDP per Capita               0.387    -0.110            -0.239
Health Expenditure           0.382    -0.088            -0.243
Health Expenditure GDP       0.206    -0.129            -0.106
Schooling Years              0.681    -0.192            -0.408
Income Composition           0.647    -0.143            -0.404
Target                       1.000    -0.223            -0.696

Remaining feature columns after dropping Status, Infant Deaths and Population:
array(['Child Mortality', 'Adult Mortality', 'HIV/AIDS Deaths', 'Measles Cases',
       'Hepatitis B Immunization', 'Polio Immunization', 'Diphtheria Immunization',
       'Alcohol Consumption', 'Average BMI', 'Malnutrition 5-9', 'Malnutrition 10-19',
       'GDP per Capita', 'Health Expenditure', 'Health Expenditure GDP',
       'Schooling Years', 'Income Composition'], dtype=object)

Visually Check Features against the Target
To understand which regression algorithm may be appropriate, it is useful to see how some of the relevant features relate to the Target variable. Looking at the charts below, it seems that a Linear Regression may not perform very well; other algorithms should be considered.
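To make the multicollinearity check above systematic rather than eyeballed, feature pairs whose absolute correlation exceeds a threshold can be listed programmatically. A minimal sketch, assuming model_data_df is the prepared DataFrame and its remaining columns are all numeric:

import numpy as np

#absolute pairwise correlations among the features
corr = model_data_df.corr().abs()

#keep only the upper triangle so each pair appears once, then list pairs above the threshold
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack().sort_values(ascending=False)
print(high_pairs[high_pairs > 0.9])   #e.g. Infant Deaths vs Child Mortality at ~0.997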
In [1]:
#import the libraries that will be used in this project
import math

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor

In [2]:
#load and inspect the data
raw_data_df = pd.read_csv('Life Expectancy Data.csv')
raw_data_df.head()
Out[2]: [first five rows - shown in the Data Format section above]

In [3]:
#check how many data points (rows) and factors (columns)
raw_data_df.shape
Out[3]: (2938, 22)

In [4]:
#check the factor names
raw_data_df.columns.values[:]
Out[4]: [column name array - shown above]

In [5]:
#get the first dataset for modelling, dropping some factors that will not be used
model_data_df = raw_data_df[['Status', 'Adult Mortality', 'infant deaths', 'Alcohol',
                             'percentage expenditure', 'Hepatitis B', 'Measles ', ' BMI ',
                             'under-five deaths ', 'Polio', 'Total expenditure', 'Diphtheria ',
                             ' HIV/AIDS', 'GDP', 'Population', ' thinness 1-19 years',
                             ' thinness 5-9 years', 'Income composition of resources',
                             'Schooling', 'Life expectancy ']]

#copy to avoid accidental changes to the raw data
model_data_df = model_data_df.copy()

#rename some columns (e.g. correct typos, remove spaces, adhere to consistent upper/lower case usage, etc.)
model_data_df.rename(columns = {'Life expectancy ' : 'Target',
                                'infant deaths' : 'Infant Deaths',
                                'Alcohol' : 'Alcohol Consumption',
                                'percentage expenditure' : 'Health Expenditure',
                                'Hepatitis B' : 'Hepatitis B Immunization',
                                'Measles ' : 'Measles Cases',
                                ' BMI ' : 'Average BMI',
                                'under-five deaths ' : 'Child Mortality',
                                'Polio' : 'Polio Immunization',
                                'Total expenditure' : 'Health Expenditure GDP',
                                'Diphtheria ' : 'Diphtheria Immunization',
                                ' HIV/AIDS' : 'HIV/AIDS Deaths',
                                'GDP' : 'GDP per Capita',
                                ' thinness 1-19 years' : 'Malnutrition 10-19',
                                ' thinness 5-9 years' : 'Malnutrition 5-9',
                                'Income composition of resources' : 'Income Composition',
                                'Schooling' : 'Schooling Years'}, inplace = True)

#reorganize the dataset column order
model_data_df = model_data_df[['Infant Deaths', 'Child Mortality', 'Adult Mortality',
                               'HIV/AIDS Deaths', 'Measles Cases', 'Hepatitis B Immunization',
                               'Polio Immunization', 'Diphtheria Immunization',
                               'Alcohol Consumption', 'Average BMI', 'Malnutrition 5-9',
                               'Malnutrition 10-19', 'Population', 'GDP per Capita',
                               'Health Expenditure', 'Health Expenditure GDP',
                               'Schooling Years', 'Income Composition', 'Status', 'Target']]
model_data_df.head()
Out[5]: [first five rows of the modeling dataset - shown above]

In [6]:
#check if there are observations where the target variable is missing and drop those observations
missing_target = sum(model_data_df['Target'].isna() == True)
print('Observations with target variable missing: ', missing_target)

if missing_target > 0:
    model_data_df.dropna(subset=['Target'], inplace = True)

missing_target = sum(model_data_df['Target'].isna() == True)
print('Observations with target variable missing after cleanup: ', missing_target)
del missing_target

In [7]:
#check for values that are inconsistent with each feature's unit
range_1000_pop = ['Infant Deaths', 'Child Mortality', 'Adult Mortality', 'HIV/AIDS Deaths']
range_100_pct = ['Hepatitis B Immunization', 'Polio Immunization', 'Diphtheria Immunization',
                 'Health Expenditure GDP', 'Malnutrition 5-9', 'Malnutrition 10-19']
range_0_1 = ['Income Composition']

for feature in range_1000_pop:
    occurrences_int = sum(model_data_df[feature] >= 1000)
    if occurrences_int > 0:
        print(feature, ' has ', occurrences_int, ' invalid values.')

for feature in range_100_pct:
    occurrences_int = sum(model_data_df[feature] > 100)
    if occurrences_int > 0:
        print(feature, ' has ', occurrences_int, ' invalid values.')

for feature in range_0_1:
    occurrences_int = sum(model_data_df[feature] > 1)
    if occurrences_int > 0:
        print(feature, ' has ', occurrences_int, ' invalid values.')

In [8]:
#create a function to check for missing data, count occurrences and display their percentage of the total
def check_missing(data_df):
    '''
    This function takes a dataframe as input and checks for missing values.
    It returns a dataframe containing the name of each column with missing values,
    along with its count and percentage of the total number of rows (records).
    '''
    #initialize auxiliary variables
    null_features_lst = []
    null_count_lst = []
    null_pct_lst = []
    total = len(data_df)

    #check which features contain missing values
    for feature in data_df.columns.values[:-1]:
        null_records = sum(data_df[feature].isna() == True)
        if null_records > 0:
            null_features_lst.append(feature)
            null_count_lst.append(null_records)
            null_pct_lst.append(round(((null_records / total) * 100), 1))

    missing_data_df = pd.DataFrame({'Feature' : null_features_lst,
                                    'Count' : null_count_lst,
                                    'Percentage' : null_pct_lst})
    return missing_data_df, null_features_lst

In [9]:
missing_data_df, null_features_lst = check_missing(model_data_df)
missing_data_df
Out[9]: [missing-values table - shown above]

In [10]:
#create a function to plot the data distribution (as histograms) of a given feature list;
#it also shows the mean and median to visually aid imputation decisions
def plot_data_distribution(data_df, features_lst, group_col_str, group_value_str):
    '''
    This function takes a dataframe and a list of columns (subset by group) and
    plots histograms showing the data distribution of those columns.
    The histograms contain two vertical lines showing the following statistics:
    - red: mean
    - dashed blue: median
    '''
    #define the layout
    cols = len(features_lst)
    subplot_cols = 3
    subplot_rows = math.ceil(cols / subplot_cols)
    figure_width = 6
    figure, axes = plt.subplots(subplot_rows, subplot_cols,
                                figsize = (subplot_cols * figure_width, subplot_rows * figure_width))

    #plot the histogram for each column in "features_lst"
    for col_index in range(cols):
        ax_row_index = col_index // subplot_cols
        ax_col_index = col_index % subplot_cols
        group_data = data_df[features_lst[col_index]][data_df[group_col_str] == group_value_str]
        n, bins, patches = axes[ax_row_index][ax_col_index].hist(group_data, bins = 40, color = 'gray')
        axes[ax_row_index][ax_col_index].set_title(features_lst[col_index])

        #plot the vertical lines for the mean and median of the group subset
        axes[ax_row_index][ax_col_index].axvline(group_data.mean(), color='red', linewidth=2)
        axes[ax_row_index][ax_col_index].axvline(group_data.median(), color='blue', linestyle='dashed', linewidth=2)

In [11]:
#create a function to impute either the median or the mean for missing values
def impute_missing_values(data_df, features_lst, median_features_lst, group_col_str, group_value_str):
    '''
    This function takes a dataframe and a list of columns (subset by group) and imputes
    the group mean or median of those columns for the missing values.
    Columns listed in "median_features_lst" have the median imputed,
    while the others have the mean imputed.
    '''
    for feature in features_lst:
        #fill only the rows belonging to the given group, using that group's statistic
        mask = data_df[group_col_str] == group_value_str
        if feature in median_features_lst:
            feature_median = data_df[feature][mask].median()
            data_df.loc[mask, feature] = data_df.loc[mask, feature].fillna(round(feature_median, 1))
        else:
            feature_mean = data_df[feature][mask].mean()
            data_df.loc[mask, feature] = data_df.loc[mask, feature].fillna(round(feature_mean, 1))
    return data_df

In [12]:
#analyze the missing data for Developed countries
plot_data_distribution(model_data_df, null_features_lst, 'Status', 'Developed')

In [13]:
#impute values for Developed countries based on the above analysis
median_features_lst = ['Hepatitis B Immunization', 'Polio Immunization', 'Diphtheria Immunization',
                       'Population', 'Malnutrition 10-19', 'Malnutrition 5-9', 'Income Composition']
model_data_df = impute_missing_values(model_data_df, null_features_lst, median_features_lst,
                                      'Status', 'Developed')

In [14]:
#analyze the missing data for Developing countries
plot_data_distribution(model_data_df, null_features_lst, 'Status', 'Developing')

In [15]:
#impute values for Developing countries based on the above analysis
median_features_lst = ['Hepatitis B Immunization', 'Polio Immunization', 'Diphtheria Immunization']
model_data_df = impute_missing_values(model_data_df, null_features_lst, median_features_lst,
                                      'Status', 'Developing')

In [16]:
#check that imputation worked (it should return an empty dataframe when checking for missing values)
missing_data_df, null_features_lst = check_missing(model_data_df)
missing_data_df
Out[16]: [empty table]

In [17]:
#analyze the correlation between features and the target variable
sns.heatmap(model_data_df.corr(), vmin = -1, vmax = 1,
            cmap = sns.diverging_palette(15, 220, as_cmap = True), linewidths = 0.1);

In [18]:
model_data_df.corr()['Target']
Out[18]: [correlation with Target - shown above]

In [19]:
model_data_df.corr()['Child Mortality']
Out[19]: [correlation with Child Mortality - shown above]

In [20]:
model_data_df.corr()['Adult Mortality']
Out[20]: [correlation with Adult Mortality - shown above]

In [21]:
#drop some additional features that will not be used in the model
model_data_df.drop(['Status', 'Infant Deaths', 'Population'], axis = 1, inplace = True)
model_data_df.columns.values[:-1]
Out[21]: [remaining feature columns - shown above]

In [22]:
#check the relationship of some relevant features against the target variable
#to understand which model may be appropriate
def chart_scatter_plot(data_df, x, y = 'Target'):
    '''
    This function plots a scatter plot between two variables.
    '''
    sns.scatterplot(data = data_df, x = x, y = y, hue = y, palette = 'ch:s=.25,rot=-.25');

In [23]:
chart_scatter_plot(model_data_df, 'Adult Mortality')

In [24]:
chart_scatter_plot(model_data_df, 'Child Mortality')

In [25]:
chart_scatter_plot(model_data_df, 'Average BMI')

In [26]:
chart_scatter_plot(model_data_df, 'Malnutrition 10-19')

In [27]:
chart_scatter_plot(model_data_df, 'HIV/AIDS Deaths')

Model Evaluation & Selection

Summary
Now that the data has been analyzed and prepared for use, it is time to evaluate different models and select the one that performs best at predicting the Life Expectancy (i.e. the one with the lowest root mean squared error).
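For reference, the selection metric used throughout is the root mean squared error. For n evaluated observations it is

    RMSE = sqrt( (1/n) * sum_i (y_i - yhat_i)^2 )

where y_i is the observed Life Expectancy and yhat_i is the model's prediction. Lower is better, and the value is expressed in the same unit as the target (years), which is what makes statements like "predicted within 1.5 years" directly comparable to it.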
Steps

Data Split & Scaling
First, the data needs to be split into two sets: training and test. Then, considering that the features are measured in different units (percentages, monetary amounts, indices, rates per 1,000 population, etc.), scaling the data is recommended. As the data distribution is not Gaussian, the Min-Max scaling method will be used. To avoid data leakage, the scaler is fit on the training data only, after the split, and then applied to both the train and test datasets (a fold-safe variant using a Pipeline is sketched after the model list below).

Evaluate Model Performance Using K-Folds
The training dataset will be split into K folds to evaluate the performance of different algorithms tuned with different hyperparameters. The models that will be evaluated are:
· Linear Regression
· Polynomial Regression
· Decision Tree Regressor
· Random Forest Regressor
The models' performance, especially their RMSE (Root Mean Squared Error), will be compared to support the model selection decision. Since the models are tried across several K-folds, special attention will be paid to the mean RMSE and its standard deviation (to avoid selecting a model with a lower error that varies too much from one fold to the next).

Train and Test the Selected Model
After the model selection described above, the model will be trained using the full training set and tested, for the very first time, on the test dataset. If its performance is acceptable, it will be the final prediction model.

[Figure: visual representation of the strategy of splitting the data, then using K-folds for model evaluation & selection.]

[Output of model_data_df: the final modeling dataset, 16 features plus Target - 2928 rows × 17 columns.]

Data successfully split.
X_train type :  <class 'pandas.core.frame.DataFrame'>   Shape: (2049, 16)
X_test type  :  <class 'pandas.core.frame.DataFrame'>   Shape: (879, 16)
y_train type :  <class 'pandas.core.series.Series'>   Shape: (2049,)
y_test type  :  <class 'pandas.core.series.Series'>   Shape: (879,)

Evaluating the Regression Models
The following models are evaluated, with their hyperparameters tuned as follows:
· Linear Regression: no hyperparameters to try.
· Polynomial Regression: trying degrees 2 through 4. Based on the results below, the 2nd-degree polynomial fares better than a 1st-degree polynomial (i.e. Linear Regression) or higher degrees.
· Decision Tree: trying the unbounded maximum depth (which grows to around 30), then depths 10, 7 and 5. Depth 7 seems to be optimal.
· Random Forest: this ensemble algorithm is tried with 200, 100 (the default), 50, 25 and 10 estimators. 100 estimators seems to be optimal; above that there is no improvement, and below it the model fares slightly worse.
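One caveat on the strategy above: because the scaler is fit once on the full training set before the K-fold loop, each validation fold has already influenced the scaling statistics. A minimal sketch of a leakage-free alternative, wrapping scaler and model in a sklearn Pipeline so the scaler is re-fit inside every fold (toy data stands in for the real training split):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

#toy data standing in for the training split
X, y = make_regression(n_samples=300, n_features=5, noise=5, random_state=1)

#the pipeline re-fits the scaler on each fold's training portion only,
#so no validation-fold statistics leak into the scaling step
pipe = make_pipeline(MinMaxScaler(), RandomForestRegressor(n_estimators=100, random_state=1))
rmse = -cross_val_score(pipe, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=1),
                        scoring='neg_root_mean_squared_error')
print(rmse.mean(), rmse.std())

With Min-Max scaling the effect is usually small, but the pipeline version is the safer default.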
RMSE per fold (5 folds), with mean and standard deviation:

Model                              RMSE per fold                        Mean     Std
Linear Regression                  4.4, 4.4, 4.5, 4.4, 4.2              4.38     0.11
Polynomial Regression 2nd degree   3.1, 3.4, 3.1, 3.3, 3.6              3.30     0.21
Polynomial Regression 3rd degree   128.8, 79.7, 61.9, 111.6, 149.6      106.32   35.67
Polynomial Regression 4th degree   666.6, 301.7, 196.0, 328.6, 411.3    380.84   177.30
Decision Tree Max Depth            3.3, 3.1, 2.9, 2.9, 2.9              3.02     0.18
Decision Tree depth 10             3.2, 3.1, 2.7, 2.8, 2.7              2.90     0.23
Decision Tree depth 7              3.0, 3.1, 3.0, 2.6, 2.7              2.88     0.22
Decision Tree depth 5              3.2, 3.3, 3.5, 3.0, 3.1              3.22     0.19
Random Forest estimators 200       1.9, 2.2, 2.1, 1.9, 2.0              2.02     0.13
Random Forest estimators 100       1.9, 2.2, 2.1, 1.9, 2.0              2.02     0.13
Random Forest estimators 50        1.9, 2.2, 2.1, 2.0, 2.0              2.04     0.11
Random Forest estimators 25        1.9, 2.3, 2.2, 2.0, 2.0              2.08     0.16
Random Forest estimators 10        2.1, 2.3, 2.3, 2.1, 2.1              2.18     0.11

Try 10 K-Folds Instead of 5
Though the mean performance is better (lower RMSE), the standard deviation is a little higher. Nonetheless, the models' performance relative to each other is very similar, and Random Forest Regression is still the algorithm that performs best.
RMSE per fold (10 folds), with mean and standard deviation:

Model                              RMSE per fold                                                           Mean     Std
Linear Regression                  4.4, 4.5, 4.6, 4.3, 4.2, 4.7, 4.7, 4.1, 4.1, 4.3                        4.39     0.23
Polynomial Regression 2nd degree   3.4, 2.7, 3.3, 3.3, 3.1, 3.3, 3.6, 2.8, 3.1, 3.4                        3.20     0.28
Polynomial Regression 3rd degree   58.8, 69.9, 30.8, 47.6, 52.8, 32.0, 127.0, 114.1, 54.4, 13.7            60.11    35.80
Polynomial Regression 4th degree   697.1, 468.7, 567.6, 234.7, 673.6, 249.8, 476.0, 674.4, 328.5, 595.9    496.63   175.44
Decision Tree Max Depth            3.4, 2.6, 2.9, 3.3, 2.2, 2.6, 2.7, 2.5, 2.6, 2.9                        2.77     0.37
Decision Tree depth 10             3.2, 2.3, 2.7, 2.9, 2.3, 2.8, 2.7, 2.4, 2.4, 2.7                        2.64     0.29
Decision Tree depth 7              3.0, 2.6, 2.4, 2.9, 2.3, 2.9, 2.7, 2.7, 2.7, 2.6                        2.68     0.22
Decision Tree depth 5              3.2, 2.7, 3.1, 3.5, 2.7, 3.5, 3.0, 3.0, 3.2, 2.9                        3.08     0.28
Random Forest estimators 200       2.2, 1.5, 1.8, 2.2, 1.8, 2.1, 2.0, 1.7, 2.0, 1.8                        1.91     0.23
Random Forest estimators 100       2.2, 1.5, 1.8, 2.2, 1.8, 2.1, 2.0, 1.8, 2.0, 1.8                        1.92     0.22
Random Forest estimators 50        2.3, 1.5, 1.8, 2.2, 1.8, 2.2, 2.0, 1.8, 2.0, 1.8                        1.94     0.25
Random Forest estimators 25        2.3, 1.5, 1.8, 2.3, 1.8, 2.3, 2.1, 1.8, 2.0, 1.9                        1.98     0.27
Random Forest estimators 10        2.6, 1.7, 1.9, 2.3, 1.9, 2.3, 2.1, 1.9, 2.2, 1.9                        2.08     0.27

Selecting the Model & Evaluating its Performance
Analyzing the performance of the models above, the Random Forest with 100 estimators seems to have the best performance. Now it is time to double-check that this model also performs well on the unseen test data. Based on the results below, it performs even better on the test set than it did on the training K-folds. Looking at the target data, 80% of the Life Expectancy values were predicted with less than 1.5 years of difference.

Test RMSE (Random Forest, 100 estimators): 1.8449583566801624

[Output of y_df: Target vs Predicted for the 879 test observations, e.g. row 0: target 73.0, predicted 73.326; row 878: target 83.0, predicted 79.804.]

90.0 % predictions under 2.5 years difference - count 768 observations
80.0 % predictions under 1.5 years difference - count 670 observations
60.0 % predictions under 1 year difference - count 562 observations

Alternative Models
Given that the dataset is not very large, it is feasible to try different approaches and compare how they perform against the model selected above.
Using a Different Scaler
Using the same features, let's use the Standard Scaler instead of Min-Max to see whether the model performs better, worse, or shows no significant difference. The Standard Scaler is generally chosen when the data is normally distributed, which is not the case for the Life Expectancy dataset. There are no significant differences for Random Forests; however, the Min-Max scaler fares better for models such as Polynomial Regression and slightly worse for the Decision Tree Regressor, for example.

Using Different Features
· Focus on economic factors tied to education, health expenditure and income: though those factors have a high correlation with the Life Expectancy, they are not sufficient to predict it more precisely.
· Focus on immunization factors: disregarding the other factors (e.g. social, economic, etc.) makes the model fare worse.

Using the Standard Scaler

Data successfully split.
X_train type :  <class 'pandas.core.frame.DataFrame'>   Shape: (2049, 16)
X_test type  :  <class 'pandas.core.frame.DataFrame'>   Shape: (879, 16)
y_train type :  <class 'pandas.core.series.Series'>   Shape: (2049,)
y_test type  :  <class 'pandas.core.series.Series'>   Shape: (879,)

RMSE per fold (5 folds, Standard Scaler), with mean and standard deviation:

Model                              RMSE per fold                        Mean     Std
Linear Regression                  4.4, 4.4, 4.5, 4.4, 4.2              4.38     0.11
Polynomial Regression 2nd degree   3.1, 3.4, 3.1, 3.3, 3.6              3.30     0.21
Polynomial Regression 3rd degree   128.8, 79.7, 61.9, 111.6, 149.6      106.32   35.67
Polynomial Regression 4th degree   761.5, 380.9, 293.9, 2129.2, 442.5   801.60   762.86
Decision Tree Max Depth            3.3, 3.1, 2.8, 2.9, 2.9              3.00     0.20
Decision Tree depth 10             3.2, 3.1, 2.7, 2.8, 2.7              2.90     0.23
Decision Tree depth 7              3.0, 3.1, 3.0, 2.6, 2.7              2.88     0.22
Decision Tree depth 5              3.2, 3.3, 3.5, 3.0, 3.1              3.22     0.19
Random Forest estimators 200       1.9, 2.2, 2.1, 1.9, 2.0              2.02     0.13
Random Forest estimators 100       1.9, 2.2, 2.1, 1.9, 2.0              2.02     0.13
Random Forest estimators 50        1.9, 2.2, 2.1, 2.0, 2.0              2.04     0.11
Random Forest estimators 25        2.0, 2.3, 2.2, 2.0, 2.0              2.10     0.14
Random Forest estimators 10        2.1, 2.3, 2.3, 2.1, 2.1              2.18     0.11

Features Focused on Economic Factors

[Output: the reduced dataset containing Health Expenditure GDP, Schooling Years, Income Composition and Target - 2928 rows × 4 columns.]
Data successfully split.
X_train type :  <class 'pandas.core.frame.DataFrame'>   Shape: (2049, 3)
X_test type  :  <class 'pandas.core.frame.DataFrame'>   Shape: (879, 3)
y_train type :  <class 'pandas.core.series.Series'>   Shape: (2049,)
y_test type  :  <class 'pandas.core.series.Series'>   Shape: (879,)

RMSE per fold (5 folds, economic features only), with mean and standard deviation:

Model                              RMSE per fold                Mean     Std
Linear Regression                  7.3, 6.9, 6.9, 6.4, 6.8      6.86     0.32
Polynomial Regression 2nd degree   6.9, 6.1, 6.4, 6.0, 6.4      6.36     0.35
Polynomial Regression 3rd degree   5.7, 5.0, 5.6, 5.5, 5.4      5.44     0.27
Polynomial Regression 4th degree   5.4, 5.0, 5.3, 5.4, 5.2      5.26     0.17
Decision Tree Max Depth            5.5, 5.2, 5.6, 6.0, 5.1      5.48     0.36
Decision Tree depth 10             5.2, 5.1, 4.8, 5.1, 5.0      5.04     0.15
Decision Tree depth 7              4.9, 5.0, 4.5, 5.2, 4.8      4.88     0.26
Decision Tree depth 5              5.4, 5.1, 4.9, 5.2, 5.1      5.14     0.18
Random Forest estimators 200       4.7, 4.6, 4.3, 4.7, 4.2      4.50     0.23
Random Forest estimators 100       4.7, 4.6, 4.3, 4.8, 4.2      4.52     0.26
Random Forest estimators 50        4.7, 4.6, 4.3, 4.8, 4.2      4.52     0.26
Random Forest estimators 25        4.7, 4.7, 4.3, 4.8, 4.3      4.56     0.24
Random Forest estimators 10        4.8, 4.7, 4.5, 4.8, 4.4      4.64     0.18

Test RMSE: 4.100811663938941

Features Focused on Immunization Factors

[Output: the reduced dataset containing Hepatitis B Immunization, Polio Immunization, Diphtheria Immunization and Target - 2928 rows × 4 columns.]
Data successfully split.
X_train type :  <class 'pandas.core.frame.DataFrame'>   Shape: (2049, 3)
X_test type  :  <class 'pandas.core.frame.DataFrame'>   Shape: (879, 3)
y_train type :  <class 'pandas.core.series.Series'>   Shape: (2049,)
y_test type  :  <class 'pandas.core.series.Series'>   Shape: (879,)

RMSE per fold (5 folds, immunization features only), with mean and standard deviation:

Model                              RMSE per fold                Mean     Std
Linear Regression                  8.6, 8.4, 8.2, 8.4, 7.9      8.30     0.26
Polynomial Regression 2nd degree   7.7, 7.6, 7.5, 7.3, 7.4      7.50     0.16
Polynomial Regression 3rd degree   7.7, 7.7, 7.5, 7.3, 7.4      7.52     0.18
Polynomial Regression 4th degree   7.6, 7.7, 7.5, 7.3, 7.4      7.50     0.16
Decision Tree Max Depth            9.4, 8.9, 7.8, 8.0, 8.4      8.50     0.66
Decision Tree depth 10             8.3, 8.6, 7.6, 7.3, 7.8      7.92     0.53
Decision Tree depth 7              7.7, 7.9, 7.4, 7.2, 7.3      7.50     0.29
Decision Tree depth 5              7.7, 7.6, 7.5, 7.3, 7.2      7.46     0.21
Random Forest estimators 200       7.8, 7.6, 7.1, 6.9, 6.8      7.24     0.44
Random Forest estimators 100       7.8, 7.7, 7.1, 6.9, 6.9      7.28     0.44
Random Forest estimators 50        7.8, 7.7, 7.1, 6.9, 6.9      7.28     0.44
Random Forest estimators 25        7.8, 7.7, 7.0, 6.9, 6.9      7.26     0.45
Random Forest estimators 10        7.8, 7.8, 7.2, 7.0, 7.1      7.38     0.39

Test RMSE: 6.88824922521834

In [28]:
chart_scatter_plot(model_data_df, 'Income Composition')

In [29]:
chart_scatter_plot(model_data_df, 'Schooling Years')

In [30]:
#check the final dataset that will be used for model training and test
model_data_df
Out[30]: [final modeling dataset - shown above]

In [31]:
#function to split the data to support the model evaluation and selection
def split_data(data_df, scaling_str = 'minmax', test_size_fl = 0.3):
    '''
    This function separates features and target, splits them into train and test sets,
    and optionally scales the features using the chosen method ("standard", "minmax" or None).
    '''
    #separate the features (X) from the target (y)
    X = data_df.drop(['Target'], axis = 1)
    y = data_df['Target']

    #split the data into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size_fl, random_state = 1)

    #scale the data using the chosen method (if any) - after the split to avoid data leakage
    if scaling_str is not None:
        if scaling_str == 'standard':
            scaler = StandardScaler()
        elif scaling_str == 'minmax':
            scaler = MinMaxScaler()
        else:
            raise ValueError('Scaling method not supported. Choose "standard", "minmax" or None.')

        scaler.fit(X_train)
        X_train = pd.DataFrame(scaler.transform(X_train))
        X_test = pd.DataFrame(scaler.transform(X_test))

    print('Data successfully split.')
    print('X_train type : ', type(X_train), ' Shape: ', X_train.shape)
    print('X_test type  : ', type(X_test), ' Shape: ', X_test.shape)
    print('y_train type : ', type(y_train), ' Shape: ', y_train.shape)
    print('y_test type  : ', type(y_test), ' Shape: ', y_test.shape)

    return X_train, X_test, y_train, y_test

In [32]:
#split the data, scaling with the Min-Max scaler
X_train, X_test, y_train, y_test = split_data(model_data_df)

In [33]:
#function to try different regression models using K-fold data
def try_model(model_str, X, y, fold_index, hyperparameter_int = None):
    '''
    This function validates different machine learning algorithms for a regression problem.
    It takes as input the model, the train and validation data (split using the K-fold method),
    and optionally a hyperparameter to tune the selected model.
    It returns the model's RMSE metric to support model selection.
    Valid inputs:
    - model_str: linear (Linear Regression), poly (Polynomial Regression),
      dt (Decision Tree Regressor), rf (Random Forest Regressor)
    - hyperparameter_int: used for Polynomial Regression (degree), Decision Tree (tree depth)
      and Random Forest (number of estimators)
    '''
    #get the K-fold data to evaluate the model
    train_index = fold_index[0]
    validation_index = fold_index[1]
    X_train = X.iloc[train_index]
    y_train = y.iloc[train_index]
    X_validation = X.iloc[validation_index]
    y_validation = y.iloc[validation_index]

    #instantiate the model
    if model_str == 'linear':
        model = LinearRegression()
    elif model_str == 'poly':
        #transform the data to use in a polynomial algorithm
        poly = PolynomialFeatures(degree = hyperparameter_int, include_bias = False)
        X_train = poly.fit_transform(X_train)
        X_validation = poly.transform(X_validation)
        model = LinearRegression()
    elif model_str == 'dt':
        model = DecisionTreeRegressor(max_depth = hyperparameter_int, random_state = 1)
    elif model_str == 'rf':
        model = RandomForestRegressor(n_estimators = hyperparameter_int, random_state = 1)
    else:
        raise ValueError('Algorithm not supported. Choose one of: "linear", "poly", "dt" or "rf".')

    #train the algorithm
    model.fit(X_train, y_train)

    #validate the algorithm's predictions
    y_prediction = model.predict(X_validation)
    rmse = round(np.sqrt(mean_squared_error(y_validation, y_prediction)), 1)
    return rmse

In [34]:
#validate different models with different hyperparameters and compare their performance
def validate_models(X, y, folds_int = 5):
    #get a copy of the training data to split into K-folds for model validation
    X_fold = pd.DataFrame(X).copy()
    y_fold = pd.Series(y).copy()

    #create a dictionary holding a list of results for each model that will be evaluated
    results = {'Linear Regression': [],
               'Polynomial Regression 2nd degree': [],
               'Polynomial Regression 3rd degree': [],
               'Polynomial Regression 4th degree': [],
               'Decision Tree Max Depth': [],
               'Decision Tree depth 10': [],
               'Decision Tree depth 7': [],
               'Decision Tree depth 5': [],
               'Random Forest estimators 200': [],
               'Random Forest estimators 100': [],
               'Random Forest estimators 50': [],
               'Random Forest estimators 25': [],
               'Random Forest estimators 10': []}

    #split the data into K folds
    kf = KFold(n_splits = folds_int, random_state = 1, shuffle = True)

    #try the models and collect their performance (RMSE)
    for fold in kf.split(X_fold):
        results['Linear Regression'].append(try_model('linear', X_fold, y_fold, fold))
        results['Polynomial Regression 2nd degree'].append(try_model('poly', X_fold, y_fold, fold, 2))
        results['Polynomial Regression 3rd degree'].append(try_model('poly', X_fold, y_fold, fold, 3))
        results['Polynomial Regression 4th degree'].append(try_model('poly', X_fold, y_fold, fold, 4))
        results['Decision Tree Max Depth'].append(try_model('dt', X_fold, y_fold, fold))
        results['Decision Tree depth 10'].append(try_model('dt', X_fold, y_fold, fold, 10))
        results['Decision Tree depth 7'].append(try_model('dt', X_fold, y_fold, fold, 7))
        results['Decision Tree depth 5'].append(try_model('dt', X_fold, y_fold, fold, 5))
        results['Random Forest estimators 200'].append(try_model('rf', X_fold, y_fold, fold, 200))
        results['Random Forest estimators 100'].append(try_model('rf', X_fold, y_fold, fold, 100))
        results['Random Forest estimators 50'].append(try_model('rf', X_fold, y_fold, fold, 50))
        results['Random Forest estimators 25'].append(try_model('rf', X_fold, y_fold, fold, 25))
        results['Random Forest estimators 10'].append(try_model('rf', X_fold, y_fold, fold, 10))

    return results

In [35]:
#evaluate different models tuned with different hyperparameters using K-folds
results_df = pd.DataFrame(validate_models(X_train, y_train, 5))
results_df
Out[35]: [5-fold RMSE table - shown above]

In [36]:
#check the RMSE results - especially the mean and standard deviation
results_df.describe()
Out[36]: [summary statistics - shown above]

In [37]:
#evaluate the models using 10 K-folds instead of 5 to see if there is any significant change
results_df = pd.DataFrame(validate_models(X_train, y_train, 10))
results_df
Out[37]: [10-fold RMSE table - shown above]

In [38]:
results_df.describe()
Out[38]: [summary statistics - shown above]

In [39]:
#test the Random Forest model with 100 estimators
model = RandomForestRegressor(n_estimators = 100)
model.fit(X_train, y_train)
y_prediction = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_prediction))
rmse
Out[39]: 1.8449583566801624

In [40]:
y_test.reset_index(drop = True, inplace = True)

In [41]:
y_df = pd.concat([y_test, pd.Series(y_prediction)], axis = 1, keys = ['Target', 'Predicted'])
y_df
Out[41]: [Target vs Predicted table - shown above]

In [42]:
#count how many predictions fall within a given tolerance (in years) of the observed value
total_observations = len(y_df)
under_1 = sum(abs(y_df['Target'] - y_df['Predicted']) < 1)
#(the original cell is truncated here; it continues analogously for the
#1.5- and 2.5-year tolerances reported above)