Do My Homework
/
Homework Help Answers
/
Data Science Homework Help
/ Simple Machine Learning Project using Python with pandas, numpy and sklearn

# Simple Machine Learning Project using Python with pandas, numpy and sklearn

**Need help with this question or any other
Data Science assignment help
task?**

Simple Machine Learning Project using Python with pandas, numpy and sklearn. Only Jupyter notebook, no presentation required. Should show data source (it can be from Kaggle), the steps of exploratory data analysis, data cleaning, visualization (charts), transformation, feature selection, etc. the model evaluation and selection. Preferred domain: retail - such as sales prediction or inventory optimization. I am attaching an example of a notebook I did on my own for a different project (it should be something similar, but a little better). Dataset requirement: minimum 10,000 rows and 5 columns.

##### Additional Instructions:

Hi @channel Please kindly use this for thread for Machine Learning Project Q&A
-------------------------------------------------------------------------------------
Machine Learning Project
Machine Learning Project Instructions + Hand in
--------------------------------------------------------------------------------------
Rules
· Reply to this thread for Q & A with
· Don't share your code
· SQL, Python documentation link is allowed
--------------------------------------------------------------------------------------- General Questions
Q: is the ML capstone in groups or individual?
+ What should we be handing in?A: It's individual, every student has to hand in:
1. Your short two-page analytics plan with a business background intro
2. Your notebook for the project (see Slides for more details on the project outline)
3. 15-20 page slide deck for the project presentation
Q: Expectations for presentations/hand-in on the 9th is still a bit ambiguous. What does 70% complete look like?
+ what should we be presenting at this phase of the project?A: Here is the checklist for the 70%:
Part 1 (Presentation)
1. Agenda
2. Motivation for dataset(business) chosen
3. Show your workflow + notebook
4. Insights/Conclusions
5. Challenges
6. Next steps
Part 2 (Notebook)
1. Gathering Data(example: a dataset from Kaggle, dataset from the web scraping project)
2. Data Cleaning
3. Data Visualization
4. Data transformation
5. Create new feature and feature selection
6. Basic model
7. Training & Evaluation
8. GridSearchCV (optional)
9. Final ML model
10. Deep Learning (optional)
11. Prediction (Explain the Metrics you choose)
Here is the checklist for the rest of 30%:
1. More Data Cleaning, Visualization and create new feature
2. Interpreting Machine Learning Model
3. Hyperparameter Tuning
4. Deep Learning
5. Post it in your blog or website
--------------------------------------------------------------------------------------- Ideas for datasets
· Practice Machine Learning with Small In-Memory Datasets
· Tour of Real-World Machine Learning Problems
· Work on Machine Learning Problems That Matter To You
· Top 47 Machine Learning Projects for 2022
· 285+ Machine Learning Projects with Python
--------------------------------------------------------------------------------------- Readings & Documentations
Visualization
· Plotly Open Source Graphing Library for Python
ML
· New Understanding Train Test Split
· Preprocessing: OneHotEncoder() vs pandas.get_dummies
· Ordinal and One-Hot Encodings for Categorical Data
· sklearn.compose.ColumnTransformer
· Various ways to evaluate a machine learning model’s performance
· Pipelines and composite estimators
· Cross-validation: evaluating estimator performance
· Tuning the hyper-parameters of an estimator

Machine Learning Midterm Project
Project Outline
Domain
Modern society relies heavily on institutionalized public policies to solve relevant problems, usually implemented by governments or nonprofit organizations. Such policies affect most aspects of our daily lives, and policies implemented inadequately or under incorrect assumptions bear a high cost to society (e.g. resources, life
quality, health, etc.), and many times the damage is visible only in the long-term, surpassing the administration that has implemented them. Considering this, it feels natural that policymakers rely on vast amounts of historical data and statistical methods. As many policies target improvements in human development aiming at
future goals, leveraging machine learning and human decision-making capabilities has the potential to improve the effectiveness of those policies and their outcomes for citizens.
Life Expectancy as Target
This project aims to visualize how income, health, education, etc. indices affect life expectancy, and use those to predict it. Life expectancy is a factor in measuring human development and is usually used to describe the physical quality of life. Life expectancy is also a critical demographic indicator for setting effective policies.
For example, higher life expectancy trends may signal to policymakers that funding pension plans or updates in its rules may be necessary, while the opposite may signal that investments in healthcare may be required to reduce mortality.
Program Structure
Data Collection
For this project, the Life Expectancy dataset from WHO will be used. This dataset can be found at: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who
Exploratory Data Analysis & Preparation
The dataset will be loaded in a pandas dataframe and analyzed (including the plotting of some charts) to understand some of its traits such as:
What is the data distribution?
What is the relationship/correlation between features and the target?
What is the relationship/colinearity between features themselves?
Are there missing values? What is the strategy that makes sense to handle those: drop or imput values, etc.?
What is the shape of the dataset? Which approaches can be used to wrangle data?
Are there any features that need encoding?
Does the dataset should be scaled?
Which features could be used to predict the label Life Expectancy ? If want to focus in some
Model Evaluation & Selection
Given this is a regression problem, the following models will be evaluated:
Linear Regression
Polynomial Regression
Decision Tree Regression
Random Forests Regression
The approach to evaluate those, will consist in the following steps:
1. Split the data into train and test.
2. Further split the train data into training and validation using the K-Fold approach.
3. Use the K-Folds to assess the performance of different algorithms with different hyperparameters.
4. Compare the performance between models, and select the one with the best performance.
5. Use the test data (from #1) to evaluate the if the selected model is adequate to predict the Life Expectancy .
Compare Model Performance Using Aternative Features
Given this dataset has several features, let's compare the performance of alternative models which focus on specific areas such as: economical factors, immunization, etc.
Data
Summary
The dataset from WHO on Life Expectancy, contains data from 2000 to 2015 for all countries. Each observation contains immunization factors, mortality factors, economic factors and social factors that may affect the life expectancy.
For the purpose of this project, the data has been cleaned, reorganized and the features renamed to make them less confusing.
The raw dataset can be found at: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who
Known issues in the Raw Dataset that will be addressed
Measles Cases comprise the total number of cases reported by the given year instead of cases per 1.000 population. (Values compared against WHO dataset: https://immunizationdata.who.int/pages/incidence/MEASLES.html?CODE=Global&YEAR=)
The Population have some invalid observations, causing the value to be invalid. *Given the Population** is not relevant to predict Life Expectancy it will be dropped.
The Percentage Expenditure actually is a dollar amount of health expenditure per capita.
The Total Expenditure is actually the percentage of expenditure in health in relation to GDP instead of total government expenditure. (Approximated values compared against: https://www.statista.com/statistics/268826/health-expenditure-as-gdp-percentage-in-oecd-countries/).
Data Mapping (*)
Country : country name.
Year : year.
Status : indicates whether the country is a "Developed" country or is still a "Developing" country.
Infant Deaths : number of infant deathes per 1.000 population.
Child Mortality : number of deaths of children under 5 years old per 1.000 population.
Adult Mortality : adult mortality rate (including both sexes) between 15 and 60 years old per 1.000 population.
HIV/AIDS Deaths : number of deaths per 1.000 births.
Measles Cases : number of reported cases per year (for the entire population).
Hepatitis B Immunization : immunization coverage among 1 year olds (in percentage).
Polio Immunization : immunization coverage among 1 year olds (in percentage).
Diphtheria Immunization : Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage 1 year olds (in percentage).
Alcohol Consumption : consumption per capita (above 15 years old) in litres (of pure alcohol).
Average BMI : average Body Mass Index of entire population.
Malnutrition 5-9 : prevalence of thinness among children aged 5 to 9 years old (in percentage).
Malnutrition 10-19 : prevalence of thinness among children and adolescents aged 10 to 19 years old (in percentage).
Population : population of the country in that year.
GDP per Capita : Gross Domestic Product per capita in USD (US Dollars).
Health Expenditure : expenditure on health amount per capita in USD (US Dollars).
Health Expenditure GDP : expenditure on health as pecentage of the GDP (in percentage).
Schooling Years : number of years of schooling.
Income Composition : Human Development Index comprising the relative share of each income source, expressed as a percentage of the aggregate total income of that area.
Life Expectancy : target/label of life expectancy in years.
(*) After columns renaming and information correction.
Data Format
The dataset format is a CSV file, containing 22 columns (factors) and 2938 rows (observations).
To support the analysis, the data will be loaded into a Pandas DataFrame .
Country Year Status Life expectancy Adult Mortality infant deaths Alcohol percentage expenditure Hepatitis B Measles ... Polio Total expenditure Diphtheria HIV/AIDS GDP Population thinness 1-19 years thinness 5-9 years Income composition of resources Schooling
0 Afghanistan 2015 Developing 65.0 263.0 62 0.01 71.279624 65.0 1154 ... 6.0 8.16 65.0 0.1 584.259210 33736494.0 17.2 17.3 0.479 10.1
1 Afghanistan 2014 Developing 59.9 271.0 64 0.01 73.523582 62.0 492 ... 58.0 8.18 62.0 0.1 612.696514 327582.0 17.5 17.5 0.476 10.0
2 Afghanistan 2013 Developing 59.9 268.0 66 0.01 73.219243 64.0 430 ... 62.0 8.13 64.0 0.1 631.744976 31731688.0 17.7 17.7 0.470 9.9
3 Afghanistan 2012 Developing 59.5 272.0 69 0.01 78.184215 67.0 2787 ... 67.0 8.52 67.0 0.1 669.959000 3696958.0 17.9 18.0 0.463 9.8
4 Afghanistan 2011 Developing 59.2 275.0 71 0.01 7.097109 68.0 3013 ... 68.0 7.87 68.0 0.1 63.537231 2978599.0 18.2 18.2 0.454 9.5
5 rows × 22 columns
(2938, 22)
array(['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality',
'infant deaths', 'Alcohol', 'percentage expenditure',
'Hepatitis B', 'Measles ', ' BMI ', 'under-five deaths ', 'Polio',
'Total expenditure', 'Diphtheria ', ' HIV/AIDS', 'GDP',
'Population', ' thinness 1-19 years', ' thinness 5-9 years',
'Income composition of resources', 'Schooling'], dtype=object)
Converting Raw Data to Modeling Data
As the purpose of this project is to predict the Life Expectancy using the health, economical, etc. factors. The Country and Year features/factors are not considered relevant, therefore they will be dropped.
Also, to make it more intelligible, some features will be renamed.
Infant
Deaths
Child
Mortality
Adult
Mortality
HIV/AIDS
Deaths
Measles
Cases
Hepatitis B
Immunization
Polio
Immunization
Diphtheria
Immunization
Alcohol
Consumption
Average
BMI
Malnutrition
5-9
Malnutrition 10-
19 Population
GDP per
Capita
Health
Expenditure
Health Expenditure
GDP
Schooling
Years
Income
Composition Status Target
0 62 83 263.0 0.1 1154 65.0 6.0 65.0 0.01 19.1 17.3 17.2 33736494.0 584.259210 71.279624 8.16 10.1 0.479 Developing 65.0
1 64 86 271.0 0.1 492 62.0 58.0 62.0 0.01 18.6 17.5 17.5 327582.0 612.696514 73.523582 8.18 10.0 0.476 Developing 59.9
2 66 89 268.0 0.1 430 64.0 62.0 64.0 0.01 18.1 17.7 17.7 31731688.0 631.744976 73.219243 8.13 9.9 0.470 Developing 59.9
3 69 93 272.0 0.1 2787 67.0 67.0 67.0 0.01 17.6 18.0 17.9 3696958.0 669.959000 78.184215 8.52 9.8 0.463 Developing 59.5
4 71 97 275.0 0.1 3013 68.0 68.0 68.0 0.01 17.2 18.2 18.2 2978599.0 63.537231 7.097109 7.87 9.5 0.454 Developing 59.2
Checking for Missing or Invalid Values
Now that the modeling DataFrame is ready, it is necessary to check whether there are any observations where the Life Expectancy is missing.
Given this is the target (label), an observation without this data is not useful for the training and testing a machine learning algorithm, therefore it should be dropped.
Some features have units that need to be checked for consistency:
Percentage columns should not have values above 100.
Columns with indices per 1.000 population should not have values above 100.
Income composition should not have values outside the range of 0-1.
If a considerable volume of observations are encountered with inconsistencies, an empty value will be imputed (given they are invalid, so they can be handled by the same rules as missing values).
The other features (factors) should also be checked, in order to decide what to do with them (e.g. imputation, removal, etc.). Given this check may be performed more than once, a function is created.
Observations with target variable missing: 10
Observations with target variable missing after cleanup: 0
Infant Deaths has 14 invalid values.
Child Mortality has 16 invalid values.
Analysis of the Missing Values
The volume of the missing data is high for the features Hepatits B Immunization , Population and GDP per Capita .
Given the Population is not a relevant feature to predict the label Life Expectancy , this will be dropped.
As for the other features, the data distribution will be analyzed, to support an imputation decision using either the mean (average) or median statistcs. That seems to be a good approach, considering that:
The dataset comprises a very generalized data (for all countries in a period of 15 years) and the Life Expectation is an estimated (not precise) value.
The dataset is not large, and it would be preferrable not to loose many datapoints which may contain other relevant features to predict the Life Expectancy .
Though data imputation seems to be a good strategy for this given project, it should not be taken lightly. Therefore to reduce the potential distortions this will introduce, the data will be analyzed in two groups that share common traits: Developed countries and Developing countries.
Feature Count Percentage
0 Hepatitis B Immunization 553 18.9
1 Polio Immunization 19 0.6
2 Diphtheria Immunization 19 0.6
3 Alcohol Consumption 193 6.6
4 Average BMI 32 1.1
5 Malnutrition 5-9 32 1.1
6 Malnutrition 10-19 32 1.1
7 Population 644 22.0
8 GDP per Capita 443 15.1
9 Health Expenditure GDP 226 7.7
10 Schooling Years 160 5.5
11 Income Composition 160 5.5
Imputation of Missing data for Developed Countries
After the analysis of histograms below, the choice goes to imput the median on the following features:
Hepatitis B, Polio, Diphteria, Population, Malnutrition and Income Composition.
All the other features with missing values will have the mean imputed.
Imputation of Missing data for Developing Countries
After the analysis of histograms below, the choice goes to imput the median on the following features:
Hepatitis B, Polio and Diphteria.
All the other features with missing values will have the mean imputed.
Feature Count Percentage
Analyze Data for Feature Selection
Correlation
Looking at the correlation, we can see that Population could be dropped, as its correlation with life expectancy is close to none.
Now, along Schooling and Income , Adult mortality has a strong correlation with the Life expectancy , whereas Infant deaths and Child mortality (under 5 years old) have a much weaker correlation.
However, it would not be wise to drop the Child mortality , because as a common sense, if the child mortality is high, it would affect other demografic factors that may directly or indirectly influence in the Life expectancy .
For instance, there is a strong negative correlation between Malnutrition and Life expectancy , and a positive correlation between those same factors and Child mortality , suggesting that in countries where malnutrition is prevalent, the child mortality is higher and the life expectancy lower.
Furthermore, when looking at the historical factors, until the middle of the 20th century, the infant mortability was approximately 40-60% of the total mortality. Excluding child mortality, the average life expectancy during the 12th–19th centuries was approximately 55 years. If a medieval person survived childhood, they had
about a 50% chance of living 50–55 years, instead of only 25–40 years.
Given that Infant deaths is included in the Child mortality (and they have a very strong correlation), the Infant deaths feature will be dropped.
The country Status will also be dropped, as after using it for missing values imputation, it is not longer necessary (so it would not also be required to be encoded - given this is a categorical value).
Infant Deaths -0.196557
Child Mortality -0.222529
Adult Mortality -0.696359
HIV/AIDS Deaths -0.556556
Measles Cases -0.157586
Hepatitis B Immunization 0.161439
Polio Immunization 0.458450
Diphtheria Immunization 0.473104
Alcohol Consumption 0.379716
Average BMI 0.552138
Malnutrition 5-9 -0.454095
Malnutrition 10-19 -0.459834
Population -0.029020
GDP per Capita 0.386862
Health Expenditure 0.381864
Health Expenditure GDP 0.206041
Schooling Years 0.680577
Income Composition 0.647289
Target 1.000000
Name: Target, dtype: float64
Infant Deaths 0.996628
Child Mortality 1.000000
Adult Mortality 0.094146
HIV/AIDS Deaths 0.037783
Measles Cases 0.507718
Hepatitis B Immunization -0.166863
Polio Immunization -0.189286
Diphtheria Immunization -0.196226
Alcohol Consumption -0.110979
Average BMI -0.235948
Malnutrition 5-9 0.468563
Malnutrition 10-19 0.464186
Population 0.539221
GDP per Capita -0.110448
Health Expenditure -0.088152
Health Expenditure GDP -0.128889
Schooling Years -0.191567
Income Composition -0.143130
Target -0.222529
Name: Child Mortality, dtype: float64
Infant Deaths 0.078756
Child Mortality 0.094146
Adult Mortality 1.000000
HIV/AIDS Deaths 0.523821
Measles Cases 0.031176
Hepatitis B Immunization -0.119226
Polio Immunization -0.269759
Diphtheria Immunization -0.270741
Alcohol Consumption -0.178054
Average BMI -0.374971
Malnutrition 5-9 0.294792
Malnutrition 10-19 0.289482
Population -0.005252
GDP per Capita -0.238565
Health Expenditure -0.242860
Health Expenditure GDP -0.105980
Schooling Years -0.407538
Income Composition -0.404365
Target -0.696359
Name: Adult Mortality, dtype: float64
array(['Child Mortality', 'Adult Mortality', 'HIV/AIDS Deaths',
'Measles Cases', 'Hepatitis B Immunization', 'Polio Immunization',
'Diphtheria Immunization', 'Alcohol Consumption', 'Average BMI',
'Malnutrition 5-9', 'Malnutrition 10-19', 'GDP per Capita',
'Health Expenditure', 'Health Expenditure GDP', 'Schooling Years',
'Income Composition'], dtype=object)
Visually Check Features against Target
In order to understand which Regression algorithm may be used, it is useful to understand how some of the relevant features relate to the Target variable.
Looking at the charts below, it seems that a Linear Regression may not perform very well; other algorithms should be considered.
In [1]: #import the libraries that will be used in this project
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor
In [2]: #load and inspect the data
raw_data_df = pd.read_csv('Life Expectancy Data.csv')
raw_data_df.head()
Out[2]:
In [3]: #check how many data points (rows) and factors (columns)
raw_data_df.shape
Out[3]:
In [4]: #check factors names,
raw_data_df.columns.values[:]
Out[4]:
In [5]: #get the first dataset for modelling, dropping some factors that will not be used,
model_data_df = raw_data_df[[ 'Status', 'Adult Mortality',
'infant deaths', 'Alcohol', 'percentage expenditure',
'Hepatitis B', 'Measles ', ' BMI ', 'under-five deaths ', 'Polio',
'Total expenditure', 'Diphtheria ', ' HIV/AIDS', 'GDP',
'Population', ' thinness 1-19 years', ' thinness 5-9 years',
'Income composition of resources', 'Schooling', 'Life expectancy ']]
#copy to avoid accidental change to the raw data
model_data_df = model_data_df.copy()
#rename some columns (e.g. correct typos, remove spaces, adhere to consistent upper/case usage, etc.)
model_data_df.rename(columns = {'Life expectancy ' : 'Target',
'infant deaths' : 'Infant Deaths',
'Alcohol' : 'Alcohol Consumption',
'percentage expenditure' : 'Health Expenditure',
'Hepatitis B' : 'Hepatitis B Immunization',
'Measles ' : 'Measles Cases',
' BMI ' : 'Average BMI',
'under-five deaths ' : 'Child Mortality',
'Polio' : 'Polio Immunization',
'Total expenditure' : 'Health Expenditure GDP',
'Diphtheria ' : 'Diphtheria Immunization',
' HIV/AIDS': 'HIV/AIDS Deaths',
'GDP' : 'GDP per Capita',
' thinness 1-19 years' : 'Malnutrition 10-19',
' thinness 5-9 years' : 'Malnutrition 5-9',
'Income composition of resources': 'Income Composition',
'Schooling' : 'Schooling Years'
}, inplace = True)
#reorganize the dataset columns order
model_data_df = model_data_df[['Infant Deaths', 'Child Mortality' , 'Adult Mortality',
'HIV/AIDS Deaths', 'Measles Cases',
'Hepatitis B Immunization', 'Polio Immunization', 'Diphtheria Immunization',
'Alcohol Consumption', 'Average BMI', 'Malnutrition 5-9', 'Malnutrition 10-19',
'Population', 'GDP per Capita', 'Health Expenditure', 'Health Expenditure GDP',
'Schooling Years',
'Income Composition', 'Status',
'Target']]
model_data_df.head()
Out[5]:
In [6]: #Check if there are observations where the target variable is missing and drop those observations
missing_target = sum(model_data_df['Target'].isna() == True)
print('Observations with target variable missing: ', missing_target)
if missing_target > 0:
model_data_df.dropna(subset=['Target'], inplace = True)
missing_target = sum(model_data_df['Target'].isna() == True)
print('Observations with target variable missing after cleanup: ', missing_target)
del missing_target
In [7]: #Check invalid values and imput an empty (missing) value to inconsistent observations (against their unit)
range_1000_pop = ['Infant Deaths', 'Child Mortality', 'Adult Mortality', 'HIV/AIDS Deaths']
range_100_pct = ['Hepatitis B Immunization', 'Polio Immunization', 'Diphtheria Immunization',
'Health Expenditure GDP', 'Malnutrition 5-9', 'Malnutrition 10-19']
range_0_1 = ['Income Composition']
for feature in range_1000_pop:
occurrences_int = sum(model_data_df[feature] >= 1000)
if occurrences_int > 0:
print(feature, ' has ', occurrences_int, ' invalid values.')
for feature in range_100_pct:
occurrences_int = sum(model_data_df[feature] > 100)
if occurrences_int > 0:
print(feature, ' has ', occurrences_int, ' invalid values.')
for feature in range_0_1:
occurrences_int = sum(model_data_df[feature] > 1)
if occurrences_int > 0:
print(feature, ' has ', occurrences_int, ' invalid values.')
In [8]: #create a function to check for missing data, count occurrences and display their percentage in relation to the total
def check_missing(data_df):
'''
This function takes a dataframe as input and check for missing values.
It returns a dataframe containing the name of the colum with missing values,
along with its count and the percentage in relation to the total number of rows (records)
'''
#initialize auxiliar variables
null_features_lst = []
null_count_lst = []
null_pct_lst = []
total = len(data_df)
#check which feature contains missing values
for feature in data_df.columns.values[:-1]:
null_records = sum(data_df[feature].isna() == True)
if null_records > 0:
null_features_lst.append(feature)
null_count_lst.append(null_records)
null_pct_lst.append(round(((null_records / total) * 100), 1))
missing_data_df = pd.DataFrame({'Feature' : null_features_lst,
'Count' : null_count_lst,
'Percentage' : null_pct_lst})
return missing_data_df, null_features_lst
In [9]: missing_data_df, null_features_lst = check_missing(model_data_df)
missing_data_df
Out[9]:
In [10]: #create a function to plot the data distribution (as a histogram) of a given features list;
#It will also show their average and median as well to visually aid imputation decisions
def plot_data_distribution(data_df, features_lst, group_col_str, group_value_str):
'''
This function takes a dataframe and a list of columns (subset by group) and plots histograms to show the
data distribution of those columns. The histograms contain two vertical lines showing the following statistics:
- red: mean
- dashed blue: median
'''
#define the layout
cols = len(null_features_lst)
subplot_cols = 3
subplot_rows = math.ceil(cols/subplot_cols)
figure_width = 6
figure, axes = plt.subplots(subplot_rows, subplot_cols,
figsize = (subplot_cols * figure_width, subplot_rows * figure_width))
#plots the histogram for each column from the "features_list"
for col_index in range(cols):
ax_row_index = col_index // subplot_cols
ax_col_index = col_index % subplot_cols
n, bins, patches = axes[ax_row_index][ax_col_index].hist(data_df[features_lst[col_index]][data_df[group_col_str] == group_value_str],
bins = 40, color = 'gray')
axes[ax_row_index][ax_col_index].set_title(features_lst[col_index])
#plots the vertical lines for mean and median
axes[ax_row_index][ax_col_index].axvline(data_df[features_lst[col_index]].mean(),
color='red', linewidth=2)
axes[ax_row_index][ax_col_index].axvline(data_df[features_lst[col_index]].median(),
color='blue', linestyle='dashed', linewidth=2)
In [11]: #create a function to imput either a median or mean on missing values
def imput_missing_values(data_df, features_lst, median_features_lst, group_col_str, group_value_str):
'''
This function takes a dataframe and a list of columns (subset by group) and imputs the mean or median of
those columns for the missing values. The columns explicit in the input list "median_features_list" will
have the median imputted, while others will have the mean imputted to them.
'''
for feature in features_lst:
if feature in median_features_lst:
feature_median = data_df[feature][data_df[group_col_str] == group_value_str].median()
data_df.loc[:, feature] = data_df.loc[:, feature].fillna(round(feature_median, 1))
else:
feature_mean = data_df[feature][data_df[group_col_str] == group_value_str].mean()
data_df.loc[:, feature] = data_df.loc[:, feature].fillna(round(feature_mean, 1))
return data_df
In [12]: #analyze the missing data for Developed countries
plot_data_distribution(model_data_df, null_features_lst, 'Status', 'Developed')
In [13]: #imput values for missing values of Developed countries based on the above analysis
median_features_lst = ['Hepatitis B Immunization', 'Polio Immunization', 'Diphteria Immunization',
'Population', 'Malnutrition 10-19', 'Malnutrition 5-9', 'Income Composition']
model_data_df = imput_missing_values(model_data_df, null_features_lst, median_features_lst, 'Status', 'Developed')
In [14]: #analyze the missing data for Developing countries
plot_data_distribution(model_data_df, null_features_lst, 'Status', 'Developing')
In [15]: #imput values for missing values of Developing countries based on the above analysis
median_features_lst = ['Hepatitis B Immunization', 'Polio Immunization', 'Diphteria Immunization']
model_data_df = imput_missing_values(model_data_df, null_features_lst, median_features_lst, 'Status', 'Developing')
In [16]: #check that imputation worked (it should return an empty dataframe when checking for missing values)
missing_data_df, null_features_lst = check_missing(model_data_df)
missing_data_df
Out[16]:
In [17]: #analyze the correlation between features and target variable
sns.heatmap(model_data_df.corr(), vmin = -1, vmax = 1,
cmap = sns.diverging_palette(15, 220, as_cmap = True), linewidths = 0.1);
In [18]: model_data_df.corr()['Target']
Out[18]:
In [19]: model_data_df.corr()['Child Mortality']
Out[19]:
In [20]: model_data_df.corr()['Adult Mortality']
Out[20]:
In [21]: #drop some additional features that will not be used in the model
model_data_df.drop(['Status', 'Infant Deaths', 'Population'], axis = 1, inplace = True)
model_data_df.columns.values[:-1]
Out[21]:
In [22]: #Check the relationship of some relevant features against the target variable
#to understand which model may be appropriate
def chart_scatter_plot(data_df, x, y = 'Target'):
'''
This function plots a scatter plot between two variables.
'''
sns.scatterplot(data = data_df, x = x, y = y, hue = y, palette = 'ch:s=.25,rot=-.25');
In [23]: chart_scatter_plot(model_data_df, 'Adult Mortality')
In [24]: chart_scatter_plot(model_data_df, 'Child Mortality')
In [25]: chart_scatter_plot(model_data_df, 'Average BMI')
In [26]: chart_scatter_plot(model_data_df, 'Malnutrition 10-19')
In [27]: chart_scatter_plot(model_data_df, 'HIV/AIDS Deaths')
https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who
https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who
https://immunizationdata.who.int/pages/incidence/MEASLES.html?CODE=Global&YEAR=
https://www.statista.com/statistics/268826/health-expenditure-as-gdp-percentage-in-oecd-countries/
Model Evaluation & Selection
Summary
Now that the data has been analyzed and prepared for used, it is time to evaluate different models and select the one that performs better to predict the Life Expectancy (i.e. the one that has the lower root mean square error).
Steps
Data Split & Scaling
First, the data needs to be split into two sets: Training and Test. Then, considering the data has several features measured in different units (percentage, monetary amounts, indices, population number per 1000, etc.), scaling the data is recommended.
As the data distribution is not gaussian, the Min Max scaling method will be used. To avoid data leakage, the scalling will be applied (to both train and test datasets) after the split.
Evaluate Models Performances Using K-Folds
The Train dataset will be split into K-Folds to evaluate the performance of different algorithms, tuned with different hyperparameters. The models that will be evaluated are:
Linear Regression
Polynomial Regression
Decision Tree Regressor
Random Forest Regressor
The models performance - especially their RMSE (Root Mean Squared Error) will be compared to support a model selection decision.
Considering those models will be tried using n- K-Fold s, especiall attention will be paid to the mean RMSE and its standard deviation (i.e. to avoid selecting a model which has lower error, but which varies too much for one K-Fold to the next).
Train and Test the Selected Model
After the model selection described above, the model will be trained using the full Training set, and tested using the Test dataset for the very first time. If it is performance is acceptable, this will be final prediction model to be used.
Visual Representation of the stragegy of splitting the data, then using K-Folds for Model Evaluation & Selection
Child Mortality Adult Mortality HIV/AIDS Deaths Measles Cases Hepatitis B Immunization Polio Immunization Diphtheria Immunization Alcohol Consumption Average BMI Malnutrition 5-9 Malnutrition 10-19 GDP per Capita Health Expenditure Health Expenditure GDP Schooling Years Income Composition Target
0 83 263.0 0.1 1154 65.0 6.0 65.0 0.01 19.1 17.3 17.2 584.259210 71.279624 8.16 10.1 0.479 65.0
1 86 271.0 0.1 492 62.0 58.0 62.0 0.01 18.6 17.5 17.5 612.696514 73.523582 8.18 10.0 0.476 59.9
2 89 268.0 0.1 430 64.0 62.0 64.0 0.01 18.1 17.7 17.7 631.744976 73.219243 8.13 9.9 0.470 59.9
3 93 272.0 0.1 2787 67.0 67.0 67.0 0.01 17.6 18.0 17.9 669.959000 78.184215 8.52 9.8 0.463 59.5
4 97 275.0 0.1 3013 68.0 68.0 68.0 0.01 17.2 18.2 18.2 63.537231 7.097109 7.87 9.5 0.454 59.2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2933 42 723.0 33.6 31 68.0 67.0 65.0 4.36 27.1 9.4 9.4 454.366654 0.000000 7.13 9.2 0.407 44.3
2934 41 715.0 36.7 998 7.0 7.0 68.0 4.06 26.7 9.9 9.8 453.351155 0.000000 6.52 9.5 0.418 44.5
2935 40 73.0 39.8 304 73.0 73.0 71.0 4.43 26.3 1.3 1.2 57.348340 0.000000 6.53 10.0 0.427 44.8
2936 39 686.0 42.1 529 76.0 76.0 75.0 1.72 25.9 1.7 1.6 548.587312 0.000000 6.16 9.8 0.427 45.3
2937 39 665.0 43.5 1483 79.0 78.0 78.0 1.68 25.5 11.2 11.0 547.358878 0.000000 7.10 9.8 0.434 46.0
2928 rows × 17 columns
Data sucessfully split.
X_train type : Shape: (2049, 16)
X_test type : Shape: (879, 16)
y_train type : Shape: (2049,)
y_test type : Shape: (879,)
Evaluating the Regression Models
The following models are being evaluated, and have their hyperparameters tuned as follow:
Linear Regression : no hyperparameters to try.
Polynomial Regression : trying from 2nd degree to 4th degree.
*Based on the results below, the 2nd degree polynomial regression fares better than a 1-degress polynomial (i.e. Linear Regression) or higher degrees.
Decision Tree: trying max depth, then 5-10.
The max depth goes near 30. The depth 7 seems to be the optimal.
Random Forests : this emsemble algorithm will be tried with 200 estimators, 100 (default), 50, 25 and 10.
The 100 estimators seem to be the optimal.Above this, no improvements, below that it fares slightly worse.
Linear
Regression
Polynomial Regression 2nd
degree
Polynomial Regression 3rd
degree
Polynomial Regression 4th
degree
Decision Tree Max
Depth
Decision Tree depth
10
Decision Tree
depth 7
Decision Tree
depth 5
Random Forest estimators
200
Random Forest estimators
100
Random Forest
estimators 50
Random Forest
estimators 25
Random Forest
estimators 10
0 4.4 3.1 128.8 666.6 3.3 3.2 3.0 3.2 1.9 1.9 1.9 1.9 2.1
1 4.4 3.4 79.7 301.7 3.1 3.1 3.1 3.3 2.2 2.2 2.2 2.3 2.3
2 4.5 3.1 61.9 196.0 2.9 2.7 3.0 3.5 2.1 2.1 2.1 2.2 2.3
3 4.4 3.3 111.6 328.6 2.9 2.8 2.6 3.0 1.9 1.9 2.0 2.0 2.1
4 4.2 3.6 149.6 411.3 2.9 2.7 2.7 3.1 2.0 2.0 2.0 2.0 2.1
Linear
Regression
Polynomial Regression 2nd
degree
Polynomial Regression 3rd
degree
Polynomial Regression 4th
degree
Decision Tree Max
Depth
Decision Tree depth
10
Decision Tree
depth 7
Decision Tree
depth 5
Random Forest estimators
200
Random Forest estimators
100
Random Forest
estimators 50
Random Forest
estimators 25
Random Forest
estimators 10
count 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000
mean 4.380000 3.300000 106.320000 380.840000 3.020000 2.900000 2.880000 3.220000 2.020000 2.020000 2.040000 2.080000 2.180000
std 0.109545 0.212132 35.665628 177.302406 0.178885 0.234521 0.216795 0.192354 0.130384 0.130384 0.114018 0.164317 0.109545
min 4.200000 3.100000 61.900000 196.000000 2.900000 2.700000 2.600000 3.000000 1.900000 1.900000 1.900000 1.900000 2.100000
25% 4.400000 3.100000 79.700000 301.700000 2.900000 2.700000 2.700000 3.100000 1.900000 1.900000 2.000000 2.000000 2.100000
50% 4.400000 3.300000 111.600000 328.600000 2.900000 2.800000 3.000000 3.200000 2.000000 2.000000 2.000000 2.000000 2.100000
75% 4.400000 3.400000 128.800000 411.300000 3.100000 3.100000 3.000000 3.300000 2.100000 2.100000 2.100000 2.200000 2.300000
max 4.500000 3.600000 149.600000 666.600000 3.300000 3.200000 3.100000 3.500000 2.200000 2.200000 2.200000 2.300000 2.300000
Try 10 K-folds instead of 5
Though the mean performance is better (lower RMSE ), the standard deviation is a little higher. Nonetheless, the models performance when compared to each other is very similar, and Random Forest Regression still the algorithm that performs better.
Linear
Regression
Polynomial Regression 2nd
degree
Polynomial Regression 3rd
degree
Polynomial Regression 4th
degree
Decision Tree Max
Depth
Decision Tree depth
10
Decision Tree
depth 7
Decision Tree
depth 5
Random Forest estimators
200
Random Forest estimators
100
Random Forest
estimators 50
Random Forest
estimators 25
Random Forest
estimators 10
0 4.4 3.4 58.8 697.1 3.4 3.2 3.0 3.2 2.2 2.2 2.3 2.3 2.6
1 4.5 2.7 69.9 468.7 2.6 2.3 2.6 2.7 1.5 1.5 1.5 1.5 1.7
2 4.6 3.3 30.8 567.6 2.9 2.7 2.4 3.1 1.8 1.8 1.8 1.8 1.9
3 4.3 3.3 47.6 234.7 3.3 2.9 2.9 3.5 2.2 2.2 2.2 2.3 2.3
4 4.2 3.1 52.8 673.6 2.2 2.3 2.3 2.7 1.8 1.8 1.8 1.8 1.9
5 4.7 3.3 32.0 249.8 2.6 2.8 2.9 3.5 2.1 2.1 2.2 2.3 2.3
6 4.7 3.6 127.0 476.0 2.7 2.7 2.7 3.0 2.0 2.0 2.0 2.1 2.1
7 4.1 2.8 114.1 674.4 2.5 2.4 2.7 3.0 1.7 1.8 1.8 1.8 1.9
8 4.1 3.1 54.4 328.5 2.6 2.4 2.7 3.2 2.0 2.0 2.0 2.0 2.2
9 4.3 3.4 13.7 595.9 2.9 2.7 2.6 2.9 1.8 1.8 1.8 1.9 1.9
Linear
Regression
Polynomial Regression 2nd
degree
Polynomial Regression 3rd
degree
Polynomial Regression 4th
degree
Decision Tree Max
Depth
Decision Tree depth
10
Decision Tree
depth 7
Decision Tree
depth 5
Random Forest estimators
200
Random Forest estimators
100
Random Forest
estimators 50
Random Forest
estimators 25
Random Forest
estimators 10
count 10.000000 10.000000 10.000000 10.000000 10.0000 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000
mean 4.390000 3.200000 60.110000 496.630000 2.7700 2.640000 2.680000 3.080000 1.910000 1.920000 1.940000 1.980000 2.080000
std 0.228279 0.278887 35.796382 175.437802 0.3653 0.291357 0.220101 0.282056 0.228279 0.220101 0.245855 0.269979 0.269979
min 4.100000 2.700000 13.700000 234.700000 2.2000 2.300000 2.300000 2.700000 1.500000 1.500000 1.500000 1.500000 1.700000
25% 4.225000 3.100000 35.900000 363.550000 2.6000 2.400000 2.600000 2.925000 1.800000 1.800000 1.800000 1.800000 1.900000
50% 4.350000 3.300000 53.600000 521.800000 2.6500 2.700000 2.700000 3.050000 1.900000 1.900000 1.900000 1.950000 2.000000
75% 4.575000 3.375000 67.125000 654.175000 2.9000 2.775000 2.850000 3.200000 2.075000 2.075000 2.150000 2.250000 2.275000
max 4.700000 3.600000 127.000000 697.100000 3.4000 3.200000 3.000000 3.500000 2.200000 2.200000 2.300000 2.300000 2.600000
Selecting the Model & Evaluating its Performance
Analyzing the above performance of the tried models, the Random Forest with 100 estimators seems to have the best performace.
Now it is time to double-check that this model also performs well with unseen Test Data.
*Based on the results below, it performs even better than the tried models with train K-folds data**
Looking at the target data, it is possible see that 80% of the Life Expectancy was predicted with less than 1.5 years of difference.
1.8449583566801624
Target Predicted
0 73.0 73.326
1 71.8 71.182
2 55.4 56.212
3 71.6 71.935
4 68.3 64.746
... ... ...
874 79.6 79.055
875 78.7 81.363
876 78.7 78.996
877 64.3 64.156
878 83.0 79.804
879 rows × 2 columns
90.0 % predictions under 2.5 years difference - count 768 observations
80.0 % predictions under 1.5 years difference - count 670 observations
60.0 % predictions under 1 years difference - count 562 observations
Alternative Models
Given the dataset is not very large, it gives the freedom to try it out with different approaches to compare how it performs agains the selected model above.
Using a Different Scaler
Using the same features, let's try to use the Standard Scaler instead of the Min Max to see if the model performs better, worse or whethere there are no significant differences.
The Standard Scaler is generally chosen when the data is normally distributed, which is not the case for the Life Expectancy dataset.
There are no significant differences for Random Forests, however the Min Max scaler fares better for model such Polynomial Regression and slightly worse for Decistion Tree Regressor for example.
Using Different Features
Focus on Economic Factors tied to education, health expenditure and income.
Though those factors have a high correlation with the Life Expectancy they are not sufficient to predict it more precisely.
Focus on Imunization Factors.
Disregard other factors (e.g. social, economic, etc.) makes the model fare worse.
Using Standard Scaler
Data sucessfully split.
X_train type : Shape: (2049, 16)
X_test type : Shape: (879, 16)
y_train type : Shape: (2049,)
y_test type : Shape: (879,)
Linear
Regression
Polynomial Regression 2nd
degree
Polynomial Regression 3rd
degree
Polynomial Regression 4th
degree
Decision Tree Max
Depth
Decision Tree depth
10
Decision Tree
depth 7
Decision Tree
depth 5
Random Forest estimators
200
Random Forest estimators
100
Random Forest
estimators 50
Random Forest
estimators 25
Random Forest
estimators 10
0 4.4 3.1 128.8 761.5 3.3 3.2 3.0 3.2 1.9 1.9 1.9 2.0 2.1
1 4.4 3.4 79.7 380.9 3.1 3.1 3.1 3.3 2.2 2.2 2.2 2.3 2.3
2 4.5 3.1 61.9 293.9 2.8 2.7 3.0 3.5 2.1 2.1 2.1 2.2 2.3
3 4.4 3.3 111.6 2129.2 2.9 2.8 2.6 3.0 1.9 1.9 2.0 2.0 2.1
4 4.2 3.6 149.6 442.5 2.9 2.7 2.7 3.1 2.0 2.0 2.0 2.0 2.1
Linear
Regression
Polynomial Regression 2nd
degree
Polynomial Regression 3rd
degree
Polynomial Regression 4th
degree
Decision Tree Max
Depth
Decision Tree depth
10
Decision Tree
depth 7
Decision Tree
depth 5
Random Forest estimators
200
Random Forest estimators
100
Random Forest
estimators 50
Random Forest
estimators 25
Random Forest
estimators 10
count 5.000000 5.000000 5.000000 5.000000 5.0 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000
mean 4.380000 3.300000 106.320000 801.600000 3.0 2.900000 2.880000 3.220000 2.020000 2.020000 2.040000 2.100000 2.180000
std 0.109545 0.212132 35.665628 762.861449 0.2 0.234521 0.216795 0.192354 0.130384 0.130384 0.114018 0.141421 0.109545
min 4.200000 3.100000 61.900000 293.900000 2.8 2.700000 2.600000 3.000000 1.900000 1.900000 1.900000 2.000000 2.100000
25% 4.400000 3.100000 79.700000 380.900000 2.9 2.700000 2.700000 3.100000 1.900000 1.900000 2.000000 2.000000 2.100000
50% 4.400000 3.300000 111.600000 442.500000 2.9 2.800000 3.000000 3.200000 2.000000 2.000000 2.000000 2.000000 2.100000
75% 4.400000 3.400000 128.800000 761.500000 3.1 3.100000 3.000000 3.300000 2.100000 2.100000 2.100000 2.200000 2.300000
max 4.500000 3.600000 149.600000 2129.200000 3.3 3.200000 3.100000 3.500000 2.200000 2.200000 2.200000 2.300000 2.300000
Features Focused on Economical Factors
Health Expenditure GDP Schooling Years Income Composition Target
0 8.16 10.1 0.479 65.0
1 8.18 10.0 0.476 59.9
2 8.13 9.9 0.470 59.9
3 8.52 9.8 0.463 59.5
4 7.87 9.5 0.454 59.2
... ... ... ... ...
2933 7.13 9.2 0.407 44.3
2934 6.52 9.5 0.418 44.5
2935 6.53 10.0 0.427 44.8
2936 6.16 9.8 0.427 45.3
2937 7.10 9.8 0.434 46.0
2928 rows × 4 columns
Data sucessfully split.
X_train type : Shape: (2049, 3)
X_test type : Shape: (879, 3)
y_train type : Shape: (2049,)
y_test type : Shape: (879,)
Linear
Regression
Polynomial Regression 2nd
degree
Polynomial Regression 3rd
degree
Polynomial Regression 4th
degree
Decision Tree Max
Depth
Decision Tree depth
10
Decision Tree
depth 7
Decision Tree
depth 5
Random Forest estimators
200
Random Forest estimators
100
Random Forest
estimators 50
Random Forest
estimators 25
Random Forest
estimators 10
0 7.3 6.9 5.7 5.4 5.5 5.2 4.9 5.4 4.7 4.7 4.7 4.7 4.8
1 6.9 6.1 5.0 5.0 5.2 5.1 5.0 5.1 4.6 4.6 4.6 4.7 4.7
2 6.9 6.4 5.6 5.3 5.6 4.8 4.5 4.9 4.3 4.3 4.3 4.3 4.5
3 6.4 6.0 5.5 5.4 6.0 5.1 5.2 5.2 4.7 4.8 4.8 4.8 4.8
4 6.8 6.4 5.4 5.2 5.1 5.0 4.8 5.1 4.2 4.2 4.2 4.3 4.4
Linear
Regression
Polynomial Regression 2nd
degree
Polynomial Regression 3rd
degree
Polynomial Regression 4th
degree
Decision Tree Max
Depth
Decision Tree depth
10
Decision Tree
depth 7
Decision Tree
depth 5
Random Forest estimators
200
Random Forest estimators
100
Random Forest
estimators 50
Random Forest
estimators 25
Random Forest
estimators 10
count 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000
mean 6.860000 6.360000 5.440000 5.260000 5.480000 5.040000 4.880000 5.140000 4.500000 4.520000 4.520000 4.560000 4.640000
std 0.320936 0.350714 0.270185 0.167332 0.356371 0.151658 0.258844 0.181659 0.234521 0.258844 0.258844 0.240832 0.181659
min 6.400000 6.000000 5.000000 5.000000 5.100000 4.800000 4.500000 4.900000 4.200000 4.200000 4.200000 4.300000 4.400000
25% 6.800000 6.100000 5.400000 5.200000 5.200000 5.000000 4.800000 5.100000 4.300000 4.300000 4.300000 4.300000 4.500000
50% 6.900000 6.400000 5.500000 5.300000 5.500000 5.100000 4.900000 5.100000 4.600000 4.600000 4.600000 4.700000 4.700000
75% 6.900000 6.400000 5.600000 5.400000 5.600000 5.100000 5.000000 5.200000 4.700000 4.700000 4.700000 4.700000 4.800000
max 7.300000 6.900000 5.700000 5.400000 6.000000 5.200000 5.200000 5.400000 4.700000 4.800000 4.800000 4.800000 4.800000
4.100811663938941
Features Focused on Immunization Factors
Hepatitis B Immunization Polio Immunization Diphtheria Immunization Target
0 65.0 6.0 65.0 65.0
1 62.0 58.0 62.0 59.9
2 64.0 62.0 64.0 59.9
3 67.0 67.0 67.0 59.5
4 68.0 68.0 68.0 59.2
... ... ... ... ...
2933 68.0 67.0 65.0 44.3
2934 7.0 7.0 68.0 44.5
2935 73.0 73.0 71.0 44.8
2936 76.0 76.0 75.0 45.3
2937 79.0 78.0 78.0 46.0
2928 rows × 4 columns
Data sucessfully split.
X_train type : Shape: (2049, 3)
X_test type : Shape: (879, 3)
y_train type : Shape: (2049,)
y_test type : Shape: (879,)
Linear
Regression
Polynomial Regression 2nd
degree
Polynomial Regression 3rd
degree
Polynomial Regression 4th
degree
Decision Tree Max
Depth
Decision Tree depth
10
Decision Tree
depth 7
Decision Tree
depth 5
Random Forest estimators
200
Random Forest estimators
100
Random Forest
estimators 50
Random Forest
estimators 25
Random Forest
estimators 10
0 8.6 7.7 7.7 7.6 9.4 8.3 7.7 7.7 7.8 7.8 7.8 7.8 7.8
1 8.4 7.6 7.7 7.7 8.9 8.6 7.9 7.6 7.6 7.7 7.7 7.7 7.8
2 8.2 7.5 7.5 7.5 7.8 7.6 7.4 7.5 7.1 7.1 7.1 7.0 7.2
3 8.4 7.3 7.3 7.3 8.0 7.3 7.2 7.3 6.9 6.9 6.9 6.9 7.0
4 7.9 7.4 7.4 7.4 8.4 7.8 7.3 7.2 6.8 6.9 6.9 6.9 7.1
Linear
Regression
Polynomial Regression 2nd
degree
Polynomial Regression 3rd
degree
Polynomial Regression 4th
degree
Decision Tree Max
Depth
Decision Tree depth
10
Decision Tree
depth 7
Decision Tree
depth 5
Random Forest estimators
200
Random Forest estimators
100
Random Forest
estimators 50
Random Forest
estimators 25
Random Forest
estimators 10
count 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000
mean 8.300000 7.500000 7.520000 7.500000 8.500000 7.920000 7.500000 7.460000 7.240000 7.280000 7.280000 7.260000 7.380000
std 0.264575 0.158114 0.178885 0.158114 0.655744 0.526308 0.291548 0.207364 0.439318 0.438178 0.438178 0.450555 0.389872
min 7.900000 7.300000 7.300000 7.300000 7.800000 7.300000 7.200000 7.200000 6.800000 6.900000 6.900000 6.900000 7.000000
25% 8.200000 7.400000 7.400000 7.400000 8.000000 7.600000 7.300000 7.300000 6.900000 6.900000 6.900000 6.900000 7.100000
50% 8.400000 7.500000 7.500000 7.500000 8.400000 7.800000 7.400000 7.500000 7.100000 7.100000 7.100000 7.000000 7.200000
75% 8.400000 7.600000 7.700000 7.600000 8.900000 8.300000 7.700000 7.600000 7.600000 7.700000 7.700000 7.700000 7.800000
max 8.600000 7.700000 7.700000 7.700000 9.400000 8.600000 7.900000 7.700000 7.800000 7.800000 7.800000 7.800000 7.800000
6.88824922521834
In [28]: chart_scatter_plot(model_data_df, 'Income Composition')
In [29]: chart_scatter_plot(model_data_df, 'Schooling Years')
In [30]: #Checking the final dataset that will be used for model training and test
model_data_df
Out[30]:
In [31]: #Function to split the data to support the model evaluation and selection
def split_data(data_df, scaling_str = 'minmax', test_size_fl = 0.3):
#start by splitting the data into training and test
X = data_df.drop(['Target'], axis = 1)
y = data_df['Target']
#split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size_fl, random_state = 1)
#scale the data
if scaling_str != None:
if scaling_str == 'standard':
scaler = StandardScaler()
elif scaling_str == 'minmax':
scaler = MinMaxScaler()
else:
print('Scaling method not supported by this function. Choose "standard", "minmax" or None.')
#scale the data using the chosen method (if any) - after the split to avoid data leakage
scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train))
X_test = pd.DataFrame(scaler.transform(X_test))
print('Data sucessfully split.')
print('X_train type : ', type(X_train), ' Shape: ', X_train.shape)
print('X_test type : ', type(X_test), ' Shape: ', X_test.shape)
print('y_train type : ', type(y_train), ' Shape: ', y_train.shape)
print('y_test type : ', type(y_test), ' Shape: ', y_test.shape)
return X_train, X_test, y_train, y_test
In [32]: #Split the data and scaling using the Min Max scaler
X_train, X_test, y_train, y_test = split_data(model_data_df)
In [33]: #Function to try different regression models using K-fold data
def try_model(model_str, X, y, fold_index, hyperparameter_int = None):
'''
This function validates different machine learn algorithms for a regression problem.
It takes as input the model, the train and validation data (split using the K-fold method),
and also a hyperparemeter (optional) to tune the selected model.
It returns the model RMSE metric to support model selection.
Valid inputs:
- model: linear (Linear Regression), poly (Polynomial Regression),
dt (Decision Tree Regressor), rf (Random Forest Regressor)
- hyperparameter_int: used for Polynomial Regression (degree), Decistion Tree (tree depth) and
Random Forest (number of estimators)
'''
#get the K-fold data to evaluate the model
train_index = fold_index[0]
validation_index = fold_index[1]
X_train = X.iloc[train_index]
y_train = y.iloc[train_index]
X_validation = X.iloc[validation_index]
y_validation = y.iloc[validation_index]
#instantiate the model
if model_str == 'linear':
model = LinearRegression()
elif model_str == 'poly':
#transform the data to use in a polynomial algorithm
poly = PolynomialFeatures(degree = hyperparameter_int, include_bias = False)
X_train = poly.fit_transform(X_train)
X_validation = poly.fit_transform(X_validation)
model = LinearRegression()
elif model_str == 'dt':
model = DecisionTreeRegressor(max_depth = hyperparameter_int, random_state = 1)
elif model_str == 'rf':
model = RandomForestRegressor(n_estimators = hyperparameter_int, random_state = 1)
else:
print('Algorithm not supported by this function. Choose one of the following values: "linear", "poly", "dt" or "rf".')
#train the algorithm
model.fit(X_train, y_train)
#validate the algorithm predictions
y_prediction = model.predict(X_validation)
rmse = round(np.sqrt(mean_squared_error(y_validation, y_prediction)), 1)
return rmse
In [34]: #Validate different models with different hyperperameters and compare their performance
def validate_models(X, y, folds_int = 5):
#Get a copy of the test data to split into K-folds for model validation
X_fold = pd.DataFrame(X).copy()
y_fold = pd.Series(y).copy()
#Create a data dictionary to hold a list with the results of the models that will be evaluated
results = {'Linear Regression': [],
'Polynomial Regression 2nd degree' : [],
'Polynomial Regression 3rd degree': [],
'Polynomial Regression 4th degree' : [],
'Decision Tree Max Depth': [],
'Decision Tree depth 10' : [],
'Decision Tree depth 7' : [],
'Decision Tree depth 5' : [],
'Random Forest estimators 200': [],
'Random Forest estimators 100': [],
'Random Forest estimators 50' : [],
'Random Forest estimators 25' : [],
'Random Forest estimators 10': []
}
#Split the data into 5 folds
kf = KFold(n_splits = folds_int, random_state = 1, shuffle = True)
#try the models and collect their performance (RMSE)
for fold in kf.split(X_fold):
results['Linear Regression'].append(try_model('linear', X_fold, y_fold, fold))
results['Polynomial Regression 2nd degree'].append(try_model('poly', X_fold, y_fold, fold, 2))
results['Polynomial Regression 3rd degree'].append(try_model('poly', X_fold, y_fold, fold, 3))
results['Polynomial Regression 4th degree'].append(try_model('poly', X_fold, y_fold, fold, 4))
results['Decision Tree Max Depth'].append(try_model('dt', X_fold, y_fold, fold))
results['Decision Tree depth 10'].append(try_model('dt', X_fold, y_fold, fold, 10))
results['Decision Tree depth 7'].append(try_model('dt', X_fold, y_fold, fold, 7))
results['Decision Tree depth 5'].append(try_model('dt', X_fold, y_fold, fold, 5))
results['Random Forest estimators 200'].append(try_model('rf', X_fold, y_fold, fold, 200))
results['Random Forest estimators 100'].append(try_model('rf', X_fold, y_fold, fold, 100))
results['Random Forest estimators 50'].append(try_model('rf', X_fold, y_fold, fold, 50))
results['Random Forest estimators 25'].append(try_model('rf', X_fold, y_fold, fold, 25))
results['Random Forest estimators 10'].append(try_model('rf', X_fold, y_fold, fold, 10))
return results
In [35]: #evaluate different models tuned with different hyperparameters using K-folds
results_df = pd.DataFrame(validate_models(X_train, y_train, 5))
results_df
Out[35]:
In [36]: #check the results of RMSE - especially the mean and standard deviation
results_df.describe()
Out[36]:
In [37]: #Evaluate the models using 10 K-folds instead of 5 to see if there is any significant change
results_df = pd.DataFrame(validate_models(X_train, y_train, 10))
results_df
Out[37]:
In [38]: results_df.describe()
Out[38]:
In [39]: #Test Random Forests model with 100 estimators
model = RandomForestRegressor(n_estimators = 100)
model.fit(X_train, y_train)
y_prediction = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_prediction))
rmse
Out[39]:
In [40]: y_test.reset_index(drop = True, inplace = True)
In [41]: y_df = pd.concat([y_test, pd.Series(y_prediction)], axis = 1, keys = ['Target', 'Predicted'])
y_df
Out[41]:
In [42]: total_observations = len(y_df)
under_1 = sum(abs(y_df['Target'] - y_df['Predicted'])

There are no answers to this question.

Login to buy an answer or post yours. You can also vote on other
others

Get Help With a similar task to - Simple Machine Learning Project using Python with pandas, numpy and sklearn

## Related Questions

Similar orders to
Simple Machine Learning Project using Python with pandas, numpy and sklearn

Tutlance Experts offer help in a wide range of topics. Here are some
of our top services:

- Online writing help
- Online homework help
- Personal statement help
- Essay writing help
- Research paper help
- Term paper help
- Do my homework
- Online assignment help
- Online class help
- Dissertation help
- Thesis help
- Proofreading and editing help
- Lab report writing help
- Case study writing help
- White paper writing help
- Letter writing help
- Resume writing help

Post your project now for free and watch professional experts outbid each other in just a few minutes.