
Who is at Risk for Heart Disease?

  • Writer: Anamika Singh
  • Nov 16, 2021
  • 4 min read

Updated: Jul 26, 2023

Topics: Machine Learning, Modeling, Logistic Regression, Lasso Regression, R, Tableau



Surprisingly, heart disease still accounts for the highest average number of daily deaths in the US (the only exceptions being Dec '20, Jan '21 and Feb '21, when Covid took over). To me, this is disappointing, especially since, unlike Covid, heart disease is not transmissible and is very much preventable and treatable. Wouldn't it be useful if we could harness the power of machine learning (ML) to build models that predict the occurrence of heart disease in a person?


The Process: 5 Steps

1. Every ML model is trained on an existing data set. As I scrolled through Kaggle, I came across this dataset, which contained certain health characteristics for a sample of people. The key characteristic here, which was also our variable of interest, was whether or not each person had heart disease.

2. Once I had the dataset, the next step was to decide on what we call the "core task" in machine learning. In this case, it was classification, since we are trying to classify our dependent variable into one of two categories (i.e., 1 = heart disease, 0 = no heart disease). Before moving on to training my model, I randomly split the data into "training" and "testing" sets. The idea is to build the model on the training set and then feed the testing set into it to see whether the predicted outcomes match up against the actual outcomes.

3. The most frequently used technique for classification is logistic regression, and while I knew this would be the best method to train my model, I wanted to ensure that my model was not "overfitting" the data, since so many independent variables (e.g., age, sex, resting blood pressure, etc.) were fed into it. Overfitting is when your model predicts the data used to train it almost perfectly but makes poor predictions when new data is fed into it. One of the best ways to overcome this obstacle is a method called "Lasso Regression", which essentially picks out the variables that are most relevant to your dependent variable and shrinks the rest out of the model. (A rough R sketch of steps 1 to 3 appears right after this list.)

4. Once I knew my most important independent variables, I retrained the model using only those variables for prediction and then used the testing data to check the accuracy of my model.

5. Although there are many ways to check accuracy, I chose to look at ROC/AUC graphs. An ROC (receiver operating characteristic) curve plots the true positive rate (how often people who actually have heart disease are correctly flagged) against the false positive rate (how often people without heart disease are incorrectly flagged) at different probability thresholds, and it is used to decide above what predicted probability we should classify the dependent variable as 1. The AUC is the 'area under the curve', specifically the area under the ROC curve, and it indicates how good or bad the model is (the higher the AUC, the better the model). The entire process is summarized below:
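To make the first few steps concrete, here is the rough R sketch of steps 1 to 3 mentioned above: loading the data, splitting it, and running Lasso. The file name heart.csv, the 80/20 split and the use of the glmnet package are assumptions on my part, not necessarily the exact choices behind the model in this post.

# A minimal sketch of steps 1-3, assuming the Kaggle file is saved as "heart.csv"
# and that the glmnet package is used for the lasso penalty (alpha = 1).
library(glmnet)

heart <- read.csv("heart.csv")
heart$HeartDisease <- as.factor(heart$HeartDisease)   # 1 = heart disease, 0 = no heart disease

# Step 2: randomly split into training and testing sets (80/20 is an assumption)
set.seed(42)
train_idx <- sample(seq_len(nrow(heart)), size = floor(0.8 * nrow(heart)))
train <- heart[train_idx, ]
test  <- heart[-train_idx, ]

# Step 3: lasso-penalized logistic regression to pick out the relevant variables.
# model.matrix() turns categorical predictors (sex, chest pain type, ...) into dummies.
x_train <- model.matrix(HeartDisease ~ ., data = train)[, -1]
y_train <- train$HeartDisease

cv_fit <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 1)
coef(cv_fit, s = "lambda.min")   # variables shrunk all the way to zero print as "."

The coefficient table printed at the end is where the dots mentioned in the outcomes below come from: any variable that Lasso shrinks to exactly zero shows up as a "." instead of a number.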


The Outcomes:

1. Lasso Regression: I used R to build my entire model, and when I performed Lasso, I got the following as the most relevant independent variables:

The variables with dots are those that Lasso deemed irrelevant and hence removed. Just to verify some of the results of Lasso, I also quickly hopped into Tableau and looked at what some of the data indicated:


For instance, when looking at BP and the occurrence of heart disease, it didn't seem like there was a relationship between the two - which is also what Lasso shows.

Here, it does indeed seem like males were more likely to have heart disease than females, which Lasso shows as well.

Here, it does seem like there is a positive relationship between age and cases of heart disease, but only until about age 62-63, after which the relationship seems to turn negative. Lasso shows a positive relationship between the two, but a very weak one.


2. The Final Model & Accuracy: The final post-lasso model was the following:


Log Odds(HeartDisease) = 1.24 + 0.00174*Age + 0.62268*SexM - 0.65316*ATA - 0.23710*NAP - 0.00106*Cholesterol + 0.35691*FastingBS - 0.00766*MaxHR + 0.55885*AnginaY + 0.15251*Oldpeak + 0.17839*SlopeFlat - 1.53247*SlopeUp
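To read this equation, the log odds have to be converted back into a probability with the logistic function: probability = 1 / (1 + e^(-log odds)). As a rough illustration, here is how a hypothetical person's characteristics (all values made up purely for this example) would be plugged into the equation in R:

# Plug a hypothetical person's values into the post-lasso equation above.
# Every characteristic below is invented for illustration only.
log_odds <- 1.24 +
  0.00174 * 55 +       # Age = 55
  0.62268 * 1 +        # SexM = 1 (male)
  -0.65316 * 0 +       # ATA chest pain = 0
  -0.23710 * 0 +       # NAP chest pain = 0
  -0.00106 * 230 +     # Cholesterol = 230
  0.35691 * 0 +        # FastingBS = 0
  -0.00766 * 140 +     # MaxHR = 140
  0.55885 * 1 +        # AnginaY = 1 (exercise-induced angina)
  0.15251 * 1.5 +      # Oldpeak = 1.5
  0.17839 * 1 +        # SlopeFlat = 1
  -1.53247 * 0         # SlopeUp = 0

plogis(log_odds)       # same as 1 / (1 + exp(-log_odds)); the predicted probability

Anything above the chosen probability threshold (around 0.3-0.4, as discussed below) would then be classified as likely heart disease.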


In terms of the ROC, I got the following graph:

In our case, catching true positives (correctly flagging people who do have heart disease) matters more than avoiding the occasional false positive, which is why I would recommend setting a threshold of about 0.3-0.4.

In terms of AUC, the model received an AUC of 0.92, which is above the usual cutoff of 0.8, implying that the model is quite accurate.
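For completeness, here is a rough sketch of how the post-lasso model, the ROC curve, the AUC and the ~0.35 cutoff could be produced in R. I am assuming the pROC package here, and the column names are inferred from the dummy variables in the equation above, so treat this as an outline rather than the exact code behind the numbers in this post.

# Refit a plain logistic regression on the lasso-selected variables
# (column names are assumed from the dummies in the log odds equation).
final_model <- glm(HeartDisease ~ Age + Sex + ChestPainType + Cholesterol +
                     FastingBS + MaxHR + ExerciseAngina + Oldpeak + ST_Slope,
                   data = train, family = binomial)

# Evaluate on the held-out testing set
library(pROC)
pred_prob <- predict(final_model, newdata = test, type = "response")

roc_obj <- roc(response = test$HeartDisease, predictor = pred_prob)
plot(roc_obj)    # the ROC curve
auc(roc_obj)     # area under the curve (0.92 was reported above)

# Apply a threshold of about 0.35 and compare predictions with the actual labels
pred_class <- ifelse(pred_prob > 0.35, 1, 0)
table(Predicted = pred_class, Actual = test$HeartDisease)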

The entire code can be found below:


Long Story Short:

Machine learning can often be difficult to digest all at once. So if these concepts seemed a bit too difficult to grasp, just understand this - our mission was to design a model that predicts the occurrence of heart disease, and that is exactly what we achieved through this exercise. Hence, if provided with a specific person's health characteristics, as listed in the log odds model above, the model can predict whether he/she is likely to have heart disease or not.
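As a closing illustration, this is roughly what that prediction would look like in R for one made-up person, using the model object from the sketch above (again, the values and column names are assumptions for illustration):

# Predict for one hypothetical person; values are invented for illustration and
# column names follow the dummy variables in the log odds equation.
new_person <- data.frame(
  Age = 48, Sex = "F", ChestPainType = "NAP", Cholesterol = 195,
  FastingBS = 0, MaxHR = 160, ExerciseAngina = "N", Oldpeak = 0,
  ST_Slope = "Up"
)

predict(final_model, newdata = new_person, type = "response")
# A probability above the ~0.35 threshold would be read as "likely to have
# heart disease"; below it, "unlikely".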



