Interpreting Feature Scores in Principal Component Analysis (PCA)

Andrea Grianti
Published in The Startup
8 min read · Nov 25, 2020


My previous story covered the mathematical intuition behind PCA. Today my objective is to apply those concepts to a simple use case, and in particular to a part that is often overlooked: how to interpret the results of principal component analysis.

As you probably know, PCA is a sophisticated tool for exploratory data analysis and dimensionality reduction. In extreme summary, with PCA a dataset with a large number of variables can hopefully be represented by a small number of new variables, each of which is a special linear combination of the original variables (the weights come from the eigenvectors; “eigen” is German for “own” or “characteristic”).

The problem with PCA is that the original data is transformed, so the new variables have to be interpreted, and that interpretation depends on the weights PCA assigns to each of the original variables.

To do that I will use a sample dataset of attributes for football players in defensive roles, extracted from EA Sports’ FIFA game.

I chose FIFA players because football is popular, so most fans can immediately check the results and follow the reasoning. Other data would have required specific industry expertise and made the story harder to read.

Interpreting the PCs

First of all, I must assume that you know how to calculate the PCA (eigenvectors and eigenvalues) and how to calculate the “loadings”: simplifying, the weight each original variable has in the calculated linear combination. Loadings are useful because you can see the importance of each variable across all of the calculated principal components (reading by row) and the relative importance of the variables within each PC (reading by column). If this assumption is wrong, don’t leave: one day you will get here.
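For readers who have not computed loadings before, here is a minimal sketch with NumPy. It assumes one common convention (loading = eigenvector entry times the square root of the eigenvalue, computed on the correlation matrix); the article does not say which convention was used, and the data and the name `pca_loadings` are made up for illustration.

```python
import numpy as np

def pca_loadings(X):
    """Return eigenvalues, eigenvectors and loadings of the correlation matrix of X.

    X is an (n_samples, n_variables) array. Loadings are defined here as
    eigenvector * sqrt(eigenvalue), which is one common convention among several.
    """
    corr = np.corrcoef(X, rowvar=False)          # variables are in columns
    eigvals, eigvecs = np.linalg.eigh(corr)      # symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # reorder by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    eigvals = np.clip(eigvals, 0.0, None)        # guard against tiny negative round-off
    loadings = eigvecs * np.sqrt(eigvals)        # scale each column (PC) by sqrt(eigenvalue)
    return eigvals, eigvecs, loadings

# Hypothetical data: 200 "players" with 5 made-up attributes
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] += X[:, 0]                               # make two attributes correlated

eigvals, eigvecs, loadings = pca_loadings(X)
print(loadings[:, :2])                           # rows = variables, columns = first 2 PCs
```

With this convention and standardized variables, each loading can be read as the correlation between a variable and a component, so values always fall between -1 and 1.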

This is the cross-correlation matrix. The rows are the original variables, the columns the first 4 PCs. The cells contain the loadings: values that take into account both the eigenvalues and the eigenvectors.

The picture above shows the result I got back from PCA when analysing the attributes of FIFA players in defensive roles.

The exercise now is to interpret the numbers above. Deep blue marks values far from zero on the positive side, deep red marks values far from zero on the negative side, and white marks values near zero.

To ease the reading, instead of looking at the numbers in each column I defined a rule that turns them into symbols. The rule divides the range (max - min) of every column into 5 units and assigns a symbol according to the value: a blank if the value is within 1 unit of zero on either side, a single ‘+’ if it is within 2 units on the positive side, a single ‘-’ if it is within 2 units on the negative side, ‘++’ or ‘--’ if within 3 units, and so on up to 4 symbols, which is the maximum I decided on. Looking for ‘++++’ or ‘----’ gives you a quicker view of the most important variables in a PC.
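The symbol rule can be sketched as follows. This is only one possible reading of the description above (the exact thresholds are not spelled out in the article), and the function name `to_symbols` and the sample loadings are made up.

```python
import numpy as np

def to_symbols(column, n_units=5, max_level=4):
    """Map one PC column of loadings to strings such as '++' or '---'.

    Assumed reading of the rule: unit = (max - min) / n_units; values whose
    absolute value is below one unit become a blank, and each further unit
    adds one '+' or '-' up to max_level symbols.
    """
    column = np.asarray(column, dtype=float)
    unit = (column.max() - column.min()) / n_units
    if unit == 0:                                   # constant column: nothing to show
        return [" "] * len(column)
    levels = np.minimum(np.floor(np.abs(column) / unit), max_level).astype(int)
    return [("+" if v > 0 else "-") * lvl if lvl > 0 else " "
            for v, lvl in zip(column, levels)]

# Hypothetical loadings for one PC
pc = [0.05, 0.30, -0.45, 0.95, -1.05]
print(to_symbols(pc))
```

Because the unit is computed per column, the same loading can map to a different number of symbols in different PCs, so the symbols are best compared within a column rather than across columns.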

Principal Component 1: “The Defense Ministers”

I invented this name because most of the coefficients are positive and quite far from zero. This means that almost all of the variables are positively correlated with each other. A growing positive value in PC1 therefore means a rather uniform growth in all of the variables: the higher a player’s PC1 score, the more likely he is to have high skills in almost all of the variables. In other words, this component tells me that the higher the numbers, the higher the value of the player, which is very intuitive. It’s like saying that at school the best students are those with the highest marks.

To check this statement immediately, we can apply the PC1 weights to the original data (multiply each value in the PC1 column by the player’s value for the corresponding variable, then sum), calculate the score, sort, and look at the top 5 players.
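The multiply-and-sum step is a matrix-vector product. Below is a minimal sketch with made-up players, attributes and weights; whether the real data was centered before this step is not stated in the article, so no centering is applied here.

```python
import numpy as np

# Hypothetical rescaled data: 4 players x 3 attributes (values in 0-1)
X = np.array([
    [0.9, 0.8, 0.7],
    [0.2, 0.3, 0.1],
    [0.6, 0.5, 0.9],
    [0.4, 0.4, 0.3],
])
names = np.array(["A", "B", "C", "D"])    # made-up player labels
pc1_weights = np.array([0.6, 0.5, 0.6])   # made-up PC1 column

scores = X @ pc1_weights                  # multiply each attribute by its weight, then sum
top = np.argsort(scores)[::-1]            # player indices sorted by decreasing score
for i in top:
    print(names[i], round(float(scores[i]), 3))
```

Replacing `pc1_weights` with another column of the weights matrix gives the scores for that component, and `X @ W` with the full matrix `W` gives all of them at once.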

PC1 Scores (name, club, position, pc-score; LB = Left Back, CB = Center Back). The highest positive scores should be the top overall players (and they actually were in 2020).

Top Five (positive) defenders in FIFA 20
5)Marcelo Real Madrid LB 12.412590
4)Kolarov Roma LB 12.488453
3)De Ligt Piemonte Calcio CB 12.657573
2)Telles FC Porto LB 12.682743
1)Robertson Liverpool LB 13.786357

Principal Component 2: “The slow walls”

For PC2 you can see high values on both the positive and the negative side. This means that the further a player sits on the positive side, the more he has the skills with positive loadings and the less he has the skills with negative loadings. The opposite is true for players far on the negative side. So on the red (-) side we should find players who are strong on the specific defensive and physical attributes (including height, so tall players), while on the blue (+) side we should find those strong in acceleration, speed and some dribbling attributes. Hence the name I chose for this PC: giving priority to the red direction, “slow walls” highlights strong defensive skills and slow acceleration/speed. Of course I could have given PC2 a different name, like “fastest defenders”, because if you read the PC from the positive side you find players who are very fast but have below-average defensive skills. In any case, let’s check whether the PC2 scores on the negative side confirm my theory:

PC2 Scores (name, club, position, pc-score). The players here on the negative side are the top on pure defensive attributes (def_*) but slow on speed and acceleration variables. The opposite is true for the second group: they have speed and acceleration but low defensive skills.

1)Maripán AS Monaco FC CB -7.427195
2)Acerbi Lazio CB -7.422273
3)Fazio Roma CB -7.410685
4)Vestergaard Southampton CB -7.382580
5)Süle FC Bayern München CB -7.375576
...
5)Gan Chao Shenzhen FC LB 6.338888
4)Pierias Western United FC RB 6.352141
3)Escoboza Club América LB 6.743281
2)Guenouche KFC Uerdingen 05 LB 7.111342
1)Wen Jiabao Shanghai Shenhua FC LB 7.38066

I went back to the original dataset to check Maripán’s rescaled attributes and see if my hypothesis holds:

player_name Maripán
club AS Monaco Football Club SA
league Ligue 1 Conforama
position CB
height 0.791667 (+) Note: 1.93m
pace_acceleration 0.265823 (-)
pace_sprint_speed 0.270270 (-)
drib_agility 0.131579 (-)
drib_balance 0.142857 (-)
...
def_interceptions 0.808989 (+)
def_heading 0.753086 (+)
def_marking 0.831461 (+)
def_stand_tackle 0.820225 (+)
def_slid_tackle 0.764045 (+)
...
phys_strength 0.958333 (+)

Given that the variables in the original dataset were rescaled to the range 0–1, the hypothesis seems to hold: he has values close to 1 (the maximum) in the variables where I expected high values (def_*) and values closer to 0 for acceleration and speed. So Maripán is indeed stronger than average in defensive skills but slower than average on the pace attributes.
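The article does not show how the 0–1 rescaling was done. Assuming plain min-max scaling per column, it could look like the sketch below. The raw numbers are made up; the height bounds of 155 and 203 cm were picked only so that a 1.93 m player lands near the 0.79 shown above.

```python
import numpy as np

def min_max_rescale(X):
    """Rescale every column of X linearly so that its min maps to 0 and its max to 1."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0        # constant column: avoid division by zero
    return (X - col_min) / col_range

# Hypothetical raw data: height in cm and sprint speed on a made-up 0-100 scale
raw = np.array([
    [155, 95],
    [193, 48],
    [203, 30],
])
scaled = min_max_rescale(raw)
print(scaled.round(6))
```

After this kind of rescaling a value says where a player sits between the column minimum and maximum, not how he compares with the column average.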

Principal Component 1 and 2 together

Given that PC1 can be read as a measure of the overall value of players and PC2 highlights those with strong defensive skills (as opposed to speed and acceleration), by plotting PC1 and PC2 together I should find, along the diagonal towards high positive PC1 and high negative PC2, the players who excel in both overall value AND specific defensive skills: something like the best of the best.

Let’s see if this is true by looking at these 2 charts. The first is the overall distribution of players by position (light blue for Center Backs, green and red for Left and Right Backs). To find the top players on both PC1 (higher positive is better) and PC2 (higher negative is better) I have to look at the bottom right.

With Orange’s scatter plot facility it was easy to show the players’ names directly on the graph. Just export the PC scores from Python as a CSV file, together with the corresponding data (names, club, position …) from the original dataset, and that’s it.

The chart below is an enlargement of the bottom-right part of the first chart with the names of the players there: as said, a mix of the best of the best.
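The export step can be sketched with Python’s standard `csv` module. The file name, column names and all values below are hypothetical; the only assumption is that the metadata and the score rows are in the same player order.

```python
import csv

# Hypothetical metadata and PC scores, in the same row order
players = [
    {"player_name": "Player A", "club": "Club X", "position": "CB"},
    {"player_name": "Player B", "club": "Club Y", "position": "LB"},
]
pc_scores = [(3.21, -1.50), (-0.75, 2.10)]   # made-up (PC1, PC2) pairs

def export_scores(path, players, pc_scores):
    """Write one CSV row per player: metadata columns followed by PC1 and PC2."""
    fieldnames = ["player_name", "club", "position", "PC1", "PC2"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for meta, (pc1, pc2) in zip(players, pc_scores):
            writer.writerow({**meta, "PC1": pc1, "PC2": pc2})

export_scores("pc_scores.csv", players, pc_scores)
```

The resulting file can then be loaded into a plotting tool, with PC1 and PC2 as the axes and the name column as the label.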

This is an enlargement of the bottom right graph of the previous picture to highlight the names of the players with high values in both PC1 and PC2

Principal Component 3: “The defenders in attack”

PC3 accounts for about 7% of the overall variance and highlights something not so intuitive at first. Some defensive players are better at shooting and free kicks than at acceleration/speed (so they are slower than average) and at general defensive skills (below average). So on one side of this eigenvector we have players who play as defenders but lack the strong skills of real defensive players. It looks as if, when they move forward, they become more effective as attackers than as defenders. Defensive players sometimes move forward for corner kicks, or when the team is losing in the final minutes of a game. Could this be the explanation, or is it a kind of mistake by EA Sports? I don’t know.

In the table below you find this kind of player among those with positive PC3 scores, while those with negative scores are typical top defenders lacking shooting skills (which makes sense).
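If the 7% figure is read as the share of total variance captured by PC3 (an assumption, since the article does not show the computation), it comes from dividing each eigenvalue by the sum of all eigenvalues. The eigenvalues below are made up so that the third share comes out at 7%.

```python
import numpy as np

def explained_variance_ratio(eigvals):
    """Share of total variance carried by each component (eigenvalue / sum of eigenvalues)."""
    eigvals = np.asarray(eigvals, dtype=float)
    return eigvals / eigvals.sum()

# Hypothetical eigenvalues, sorted in decreasing order
eigvals = [4.5, 2.5, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.2, 0.1]
ratios = explained_variance_ratio(eigvals)
for i, r in enumerate(ratios[:4], start=1):
    print(f"PC{i}: {r:.1%}")
```

A component with a small share can still be interesting, but patterns found in it describe only a small part of the differences between players.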

PC3 Scores (name, club, position, pc-score)
1)Steven Vitória Moreirense FC CB 6.88884
2)Kippe Lillestrøm SK CB 6.785333
3)Wasilewski Wisła Kraków CB 6.429422
4)Abe Urawa Red Diamonds CB 6.029964
5)Lepoint KV Kortrijk CB 5.959654
...
5)Manolas Napoli CB -6.638891
4)Fai Standard de Liège RB -5.119398
3)Koulibaly Napoli CB -4.782433
2)Shoji Toulouse Football Club CB -4.694171
1)Gomez Liverpool CB -4.567751

Here too I was curious to see whether my interpretation was correct, so I checked the values of the first player (Steven Vitória) in the original dataset. Note that the rescaled values are not centered, so each one has to be compared with the average of its column: some attributes are above that average and others below. The values seem to match the analysis, even if I wonder whether this reflects the reality of these players:

player_name Steven Vitória
club Moreirense FC
league Liga NOS
position CB
height 0.833333
pace_acceleration 0.126582
pace_sprint_speed 0.162162
drib_agility 0.0921053
drib_balance 0.142857
drib_reactions 0.514706
drib_ball_control 0.453333
drib_dribbling 0.313253
drib_composure 0.597222
shoot_positioning 0.113636
shoot_finishing 0.41573 (higher than avg 0.285098)
shoot_shot_power 0.705882 (higher than avg 0.443230)
shoot_long_shots 0.545455 (higher than avg 0.323520)
shoot_volleys 0.352273 (higher than avg 0.289339)
shoot_penalties 0.643678 (higher than avg 0.359228)

pass_vision 0.356322
pass_crossing 0.295455
pass_free_kick 0.629213
pass_short 0.459459
pass_long 0.43038
pass_curve 0.625 (higher than avg 0.370765)
def_interceptions 0.573034 (lower than avg 0.608587)
def_heading 0.580247 (higher than avg 0.537645)
def_marking 0.494382 (lower than avg 0.607551)
def_stand_tackle 0.595506 (lower than avg 0.639204)
def_slid_tackle 0.573034 (lower than avg 0.620089)

phys_jumping 0.014084
phys_stamina 0.342857
phys_strength 0.902778
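Because the rescaled values are not centered, the “higher/lower than avg” notes come from comparing each value with its column average. A minimal sketch of that check, using three value/average pairs from the table above (the function name `compare_to_average` is made up):

```python
def compare_to_average(rows):
    """For each (name, value, column_average) tuple, say whether the value is above or below average."""
    notes = []
    for name, value, avg in rows:
        if value > avg:
            label = "higher"
        elif value < avg:
            label = "lower"
        else:
            label = "equal"
        notes.append((name, label))
    return notes

# Player values and column averages as listed in the table
rows = [
    ("shoot_finishing", 0.41573, 0.285098),
    ("def_marking", 0.494382, 0.607551),
    ("def_heading", 0.580247, 0.537645),
]
for name, label in compare_to_average(rows):
    print(f"{name}: {label} than average")
```

Running the comparison in code avoids mislabelling a value by eye, which is easy to do in a long list of attributes.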

Principal Component 4: “The jumpers”

PC4 is another demonstration that PCA has to be taken with a grain of salt. Looking at the values for PC4, the only really meaningful loading seems to be the ability to jump, which on its own is not an important feature for football players unless it is correlated with other skills like “heading”, which is not the case here.

So this is another possible case where I’d suggest that the guys at EA Sports check whether this is what they really wanted, or whether they should adjust the attributes of some players :-) to make them more realistic.

Conclusions

Even if the example was built from a popular game, everything I have taken into consideration can be applied to other, more “serious” contexts, especially the logic for evaluating which variables most influence the PCs. Current research offers more sophisticated methods for finding the most important variables in a PC, but here I’ve used the most common one, which is to evaluate the terms of the cross-correlation matrix. If you have any comments I’m happy to read them: I created this for my own reference, but it’s open to improvements.


Andrea Grianti
The Startup

IT Senior Manager and Consultant. Data Warehouse and Business Intelligence expertise in design and build. Freelance.