What does impact body performance? | A machine learning approach


Data analysis with Pandas, data visualization with Seaborn and machine learning with scikit-learn applied to body performance features | Age, weight, body fat...

Published on December 22, 2021 by Andrés Ingelmo Poveda

jupyter notebook python data analysis pandas seaborn data visualization machine learning scikit learn

15 min READ

What impacts body performance? This is a question many of us have asked ourselves at some point. Am I going to perform better if I’m lighter? What if I’m taller?

In this blog post, I will take a machine learning approach to analyse more than 13,000 observations of individuals aged between 20 and 64, looking at how their results in different performance tests depend on height and weight, among other variables.

The final objective of the analysis is to apply unsupervised learning to discover possible clusters with common characteristics, and to fit a linear regression to estimate which features matter most for each result.

Data cleaning

The dataset was already fairly clean, but a few modifications were needed. Because I don’t want to spend a lot of time here, you can check the steps I took to clean it on my Kaggle notebook or my GitHub project.

Data exploration

The first 5 rows of the cleaned dataset have the following structure:

   age  gender  height_cm  weight_kg  body fat_%  diastolic  systolic  gripForce  sit and bend forward_cm  sit-ups counts  broad jump_cm  class  lean_BMI
0   27       M      172.3      75.24        21.3         80       130       54.9                     18.4              60            217      C   19.9459
1   25       M        165       55.8        15.7         77       126       36.4                     16.3              53            229      A    17.278
2   31       M      179.6         78        20.1         92       152       44.8                       12              49            181      C    19.321
3   32       M      174.5       71.1        18.4         76       147       41.4                     15.2              53            219      B   19.0532
4   28       M      173.8       67.7        17.1         70       127       43.5                     27.1              45            217      B   18.5799

I usually continue my analysis by plotting histograms and the correlation matrix to extract some insights from the data:

histogram

From this visualization we can see that two features are unevenly distributed: age and gender. There are more males than females in the dataset, and more young people than old. The other features are fairly well balanced.

correlation

This correlation matrix shows some interesting patterns. For example, there seems to be a strong negative correlation between body fat and the broad jump test results: the more body fat, the worse the result in this test. Another insight is that height is positively correlated with weight (the taller, the heavier), while body fat is negatively correlated with height. This means that taller people usually have a lower body fat percentage.
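As a rough illustration of how such a matrix can be computed, here is a minimal sketch with pandas. The data is synthetic and only borrows four of the dataset's column names; the relationships between the columns are made up so that the signs are predictable, and they are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned dataset (NOT the real data).
# Body fat is built to lower the jump distance, and weight to grow with height.
rng = np.random.default_rng(0)
n = 500
body_fat = rng.normal(23, 7, n)
height = rng.normal(168, 8, n)
weight = 0.9 * (height - 100) + rng.normal(0, 6, n)
broad_jump = 260 - 3.0 * body_fat + rng.normal(0, 15, n)

df = pd.DataFrame({
    "height_cm": height,
    "weight_kg": weight,
    "body fat_%": body_fat,
    "broad jump_cm": broad_jump,
})

# Pearson correlation between every pair of numeric columns.
corr = df.select_dtypes("number").corr()
print(corr.round(2))
```

Passing `corr` to `seaborn.heatmap` is one way to draw it as a colored grid like the chart above.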

Data preprocessing

After this superficial data analysis, it is time to prepare the dataset for machine learning. In this step, I will encode the categorical variables and standardize the features. You can check the steps I took on my Kaggle notebook or my GitHub project.
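For readers who don't want to open the notebook, this is roughly what encoding and standardizing can look like. It is a sketch under assumptions: the integer codes chosen for gender and class are my guess rather than something stated in the post, the use of `StandardScaler` is also assumed, and the rows are made up.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny made-up sample using some of the dataset's column names (NOT the real data).
df = pd.DataFrame({
    "age": [27, 25, 31, 52, 44, 60],
    "gender": ["M", "M", "M", "F", "F", "M"],
    "weight_kg": [75.2, 55.8, 78.0, 58.1, 61.4, 83.5],
    "class": ["C", "A", "C", "D", "B", "D"],
})

# Assumed encodings: the post does not spell out which codes were used.
encoded = df.copy()
encoded["gender"] = encoded["gender"].map({"F": 0, "M": 1})
encoded["class"] = encoded["class"].map({"A": 0, "B": 1, "C": 2, "D": 3})

# Standardize: subtract each column's mean and divide by its standard deviation.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(encoded), columns=encoded.columns)
print(scaled.round(3))
```

After this step every column has mean 0 and standard deviation 1, so no feature dominates the others just because of its units.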

The scaled dataset looks like this:

         age    gender  height_cm  weight_kg  body fat_%  diastolic   systolic   gripForce  sit and bend forward_cm  sit-ups counts  broad jump_cm      class   lean_BMI
0  -0.709833  0.770644   0.454339   0.686597   -0.236497   0.114976  -0.015168     1.67392                 0.318338         1.41688       0.653997   0.542015   0.793553
1  -0.857298  0.770644  -0.417975  -0.966094    -1.02083  -0.165797   -0.28817  -0.0684791               0.00181325        0.907212       0.962995   -1.29698  -0.331118
2  -0.414904  0.770644    1.32665   0.921238   -0.404569    1.23807    1.48634    0.722665                 -0.64631        0.615971      -0.272998   0.542015   0.530114
3  -0.341171  0.770644   0.717228   0.334635   -0.642671  -0.259388    1.14509     0.40244                -0.163986        0.907212       0.705497  -0.377481   0.417254
4  -0.636101  0.770644   0.633581   0.045584   -0.824749  -0.820934  -0.219919    0.600226                  1.62966        0.324729       0.653997  -0.377481   0.217715

I also thought it would be good to perform some dimensionality reduction. In this dataset, there are a lot of features that influence the final classification, and the more features there are, the harder the data is to work with. Since many of these features are correlated, they are partly redundant.

I will use Principal Component Analysis (PCA) to reduce the dimensions.

pca

The above chart shows that, to explain 95% of the variance, we need at least 8 components.
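The reasoning behind that chart can be sketched as follows: fit PCA with all components, accumulate the explained variance ratio, and pick the smallest number of components that reaches the 95% threshold. The data here is synthetic (13 correlated columns generated from 4 hidden factors, an assumption made only for the demo), so the number it prints will not match the 8 found on the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled dataset: 13 columns driven by 4 hidden
# factors plus noise, so many columns are correlated (NOT the real data).
rng = np.random.default_rng(42)
factors = rng.normal(size=(1000, 4))
X = factors @ rng.normal(size=(4, 13)) + 0.5 * rng.normal(size=(1000, 13))
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained variance ratio.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%.
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_components, cumulative.round(3))

# Project the data onto those components.
X_reduced = PCA(n_components=n_components).fit_transform(X_scaled)
```

scikit-learn also accepts a fraction directly, as in `PCA(n_components=0.95)`, which selects the number of components for you.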

Clustering

Once the dimensionality reduction is done, we’ll move on to clustering. The technique I’m going to use is Agglomerative Clustering, a hierarchical clustering method that starts with each observation as its own cluster and keeps merging the closest clusters until the desired number of clusters is reached.

elbow

The chart above indicates that the optimal number of clusters for this dataset is 5. After performing Agglomerative Clustering and adding the cluster labels to the original dataset, it looks like this:

   age  gender  height_cm  weight_kg  body fat_%  diastolic  systolic  gripForce  sit and bend forward_cm  sit-ups counts  broad jump_cm  class  lean_BMI  clusters
0   27       M      172.3      75.24        21.3         80       130       54.9                     18.4              60            217      C   19.9459         0
1   25       M        165       55.8        15.7         77       126       36.4                     16.3              53            229      A    17.278         0
2   31       M      179.6         78        20.1         92       152       44.8                       12              49            181      C    19.321         0
3   32       M      174.5       71.1        18.4         76       147       41.4                     15.2              53            219      B   19.0532         0
4   28       M      173.8       67.7        17.1         70       127       43.5                     27.1              45            217      B   18.5799         0
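The clustering step can be sketched in a few lines with scikit-learn. This is an illustration on synthetic data, not the notebook's code: the five blobs, the `pc1` and `pc2` column names and the default Ward linkage are all assumptions made for the demo. In the post, the labels are attached to the original dataset rather than to a frame of components.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in for the PCA output: five well-separated blobs (NOT the real data).
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 20]])
X_reduced = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])
df = pd.DataFrame(X_reduced, columns=["pc1", "pc2"])  # hypothetical column names

# Every observation starts as its own cluster; the two closest clusters are
# merged repeatedly (Ward linkage is scikit-learn's default) until 5 remain.
model = AgglomerativeClustering(n_clusters=5)
df["clusters"] = model.fit_predict(X_reduced)

print(df["clusters"].value_counts().sort_index())
```

Because the labels come back in the same row order as the input, assigning them to a new column is enough to join them with the rest of the features.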

Evaluating models

Since this is an unsupervised machine learning model, we don’t have a labeled target to evaluate it against. The purpose of this section is to study the clusters that were formed and determine the nature of their patterns.

cluster-dist

As we can see in the above chart, the clusters are not equally sized. Let’s try to figure out what each one means. The first thing I’m going to do is check whether height and weight have something to do with the score.

cluster-height-weight-score

It is pretty clear that there is a correlation between height, weight and score. The individuals who weighed more scored the worst in the test on average. We can also see that the majority of cluster 4 observations are in class D. Let’s get more details on this.

cluster-height-weight-score

The chart above doesn’t provide much information we didn’t know before. Cluster 4 is associated with the people with the worst score. However, the other four clusters are distributed evenly across the other three scores, so they must have another meaning.

Profiling

In this section, I will try to deduce which individuals are in which cluster. To decide that, I will be plotting some of the features present in the dataset.

Let’s first analyse the relation between weight and the score on different tests:

weight-gripForce

weight-broad jump_cm

weight-sit and bend forward_cm

weight-sit-ups counts

Looking at the charts, we can say that cluster 4 individuals are heavier, on average, than the other individuals. It is also important to notice that the heavier the individual, the better the gripForce score. In the sit-ups and broad jump tests, there seems to be a positive correlation between weight and better results, but it is not as strong as for grip force.

Let’s now see how height compares to the score.

height-gripForce

height-broad jump_cm

height-sit and bend forward_cm

height-sit-ups counts

With these plots we can get some interesting insights!

  1. It seems that taller people perform better in broad jump and grip force tests. However, in the other two, height doesn’t seem to provide any advantage at all.
  2. Individuals from cluster 0 are taller than the average while individuals from cluster 3 are shorter.

Now, let’s compare the results with the age.

age-gripForce

age-broad jump_cm

age-sit and bend forward_cm

age-sit-ups counts

From the above charts, we can also get two important insights:

  1. The younger the individual, the better the score.
  2. Individuals from clusters 2 and 3 are, on average, older than the rest of the dataset.

Let’s compare it now with the lean BMI.

leanBMI-gripForce

leanBMI-broad jump_cm

leanBMI-sit and bend forward_cm

leanBMI-sit-ups counts

From the charts above, we can also get two important insights:

  1. Individuals with higher lean BMI performed better in strength tests (grip force and broad jump).
  2. Individuals from clusters 0 and 4 have the highest lean BMI while individuals from cluster 1 have the lowest.

Before trying to get more detailed information to profile each cluster, let’s get the gender distribution for each one.

cluster-gender-distribution

Now, let’s group all observations by cluster type and get the mean of all the features to get more details regarding each cluster.

clusters      age  height_cm  weight_kg  body fat_%  diastolic  systolic  gripForce  sit and bend forward_cm  sit-ups counts  broad jump_cm  lean_BMI
       0  29.7555    174.261    72.9637     17.8405    78.6195   131.545    45.3652                  15.9044         51.2011        225.307   19.6739
       1  29.5054    162.187     56.051       26.59    74.3606   120.725    26.7833                  20.4487         37.5047        166.169   15.5609
       2   54.924    169.396    69.0541     21.6944    83.4058   138.102    39.8659                  13.3849         34.7137        187.094   18.7648
       3  53.7969    157.224     58.085     31.6835    78.1181   130.494    24.3199                  17.5218         20.2031        132.638   15.9078
       4   33.952    175.025     83.541     26.6361    86.0193   138.744    42.8916                  8.46951         38.1858        200.612    19.874
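A table like the one above, together with the gender distribution per cluster, can be produced with a `groupby` and a `crosstab`. The sketch below uses a handful of made-up rows with the dataset's column names, so the numbers are purely illustrative.

```python
import pandas as pd

# Made-up rows using the dataset's column names (NOT the real data).
df = pd.DataFrame({
    "age": [27, 25, 55, 53, 34, 32],
    "gender": ["M", "M", "M", "F", "M", "F"],
    "height_cm": [172.3, 165.0, 169.4, 157.2, 175.0, 174.0],
    "weight_kg": [75.2, 55.8, 69.1, 58.1, 83.5, 85.5],
    "clusters": [0, 0, 2, 3, 4, 4],
})

# Mean of every numeric feature within each cluster.
profile = df.groupby("clusters").mean(numeric_only=True)
print(profile)

# Share of each gender within each cluster (each row sums to 1).
gender_share = pd.crosstab(df["clusters"], df["gender"], normalize="index")
print(gender_share)
```

Non-numeric columns such as gender are left out of the means, which is why the gender split is computed separately.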

Now we can draw some conclusions to determine which type of people form each cluster:

  • Cluster 0: young, tall males with high lean BMI.
  • Cluster 1: young, short females with low lean BMI.
  • Cluster 2: old, tall males with high lean BMI.
  • Cluster 3: old, short females with low lean BMI.
  • Cluster 4: mostly overweight males.

Linear Regression

At this point of the analysis, I thought it would be good to know which features determine the test results. For example, is a heavier person going to perform better in a strength test? To answer these kinds of questions, I’m going to run a linear regression model.

I built four different models: grip force, sit and bend forward, sit-ups count and broad jump. Again, you can find more information about the steps taken for the linear regression on my Kaggle notebook or my GitHub project. The score for each model is as follows:

The score for model gripForce is:  0.7744484932198941
The score for model sit and bend forward_cm is:  0.24417579569506664
The score for model sit-ups counts is:  0.616084673391197
The score for model broad jump_cm is:  0.7425962295683317

Then, I plotted the coefficients to check which ones made the biggest difference within each model. In other words, I wanted to know which features contributed the most, positively or negatively, to the test score.
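To make the idea concrete, here is a minimal sketch of fitting one such model and ranking its coefficients. Everything in it is an assumption for illustration: the data is synthetic, the relationship between features and target is invented, and the train/test split is my choice, since the post does not say how the scores were computed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic features (NOT the real data); gender is coded 0/1, the rest are
# standardized-looking values.
rng = np.random.default_rng(1)
n = 800
X = pd.DataFrame({
    "gender": rng.integers(0, 2, n),
    "height_cm": rng.normal(0, 1, n),
    "weight_kg": rng.normal(0, 1, n),
    "lean_BMI": rng.normal(0, 1, n),
})
# Invented relationship used only to have something to fit.
y = 8 * X["gender"] + 3 * X["lean_BMI"] + 1 * X["height_cm"] + rng.normal(0, 2, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# score() returns the coefficient of determination (R^2) on the given data.
r2 = model.score(X_test, y_test)

# One coefficient per feature, sorted by absolute size.
coefs = pd.Series(model.coef_, index=X.columns).sort_values(key=abs, ascending=False)
print(round(r2, 3))
print(coefs.round(2))
```

Comparing coefficient sizes like this is only meaningful when the features are on comparable scales, which is one reason to standardize them first.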

linear-regression-coefs

As we can see in the above graph, the most influential parameter in all tests is gender. It looks like males perform better than females in strength tests, while females do better in flexibility ones. The second most important parameter is the lean BMI, and it makes sense: the higher the lean BMI, the lower the body fat and the higher the muscle mass, so the better an individual can perform in physical tests. Height and weight are somewhat important and can make a difference depending on the test. For example, a heavier person is probably going to perform worse in the broad jump and better in the grip force test than a lighter one.

At this point, we know that gender plays a huge role in explaining the test results. Let’s remove this bias and try again.

images/linear-regression-coefs-genders

Both plots show similar results. Height has a positive influence on all results while weight has a negative one. This means that the taller an individual is, the better they are going to perform and, the heavier, the worse. Lean BMI seems to be the most influential variable of all, and it makes sense. This variable measures the amount of muscle an individual has relative to their height. The higher the lean BMI, the more athletic the individual is likely to be and, therefore, the better they are going to perform in physical tests.
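The post does not give the formula behind lean_BMI. A plausible reading, and the one assumed in this sketch, is fat-free mass in kilograms divided by height in metres squared. This assumption reproduces the lean_BMI values of the first rows of the sample table shown earlier; the function name `lean_bmi` is mine.

```python
def lean_bmi(weight_kg: float, body_fat_pct: float, height_cm: float) -> float:
    """Fat-free mass (kg) divided by height (m) squared. Assumed formula."""
    lean_mass_kg = weight_kg * (1 - body_fat_pct / 100)
    height_m = height_cm / 100
    return lean_mass_kg / height_m ** 2


# First two rows of the sample table: (weight_kg, body fat_%, height_cm, lean_BMI)
rows = [(75.24, 21.3, 172.3, 19.9459), (55.8, 15.7, 165.0, 17.278)]
for weight, fat, height, reported in rows:
    print(round(lean_bmi(weight, fat, height), 4), reported)
```

Under this definition, two people with the same weight and height differ in lean BMI only through their body fat percentage, which is why it works as a proxy for muscle mass.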

Conclusions

The unsupervised clustering provided a good data segmentation. Although this technique wasn’t strictly needed to carry out the analysis, it revealed five different groups with common characteristics.

From the data analyzed, we can conclude that body performance decreases with age and that males and females have different strengths. In this case, females were more flexible while males did better in strength tests.

Also, a higher body fat percentage is associated with worse body performance and higher blood pressure.