Across the globe, many people are dealing with kidney diseases. These can appear suddenly and are driven by risk factors such as diet, environment, and lifestyle. Screening for them can be invasive, expensive, slow, and even risky. As a result, especially in places with limited resources, many patients are not diagnosed and treated until their kidney disease is already advanced. Finding ways to detect these diseases early is therefore really important, especially in developing countries where late diagnosis is common.
There are many datasets containing information about patients with this disease that we could work with to achieve our objective: predicting chronic kidney disease. In this case we are going to use the Chronic Kidney Disease dataset provided by the UCI Machine Learning Repository. Looking at the dataset description and the data itself, we can define the types and roles of the attributes:
| Name | Abbreviation | UCI Description | Observed type |
|---|---|---|---|
| Age | age | Numerical | Integer |
| Blood Pressure | bp | Numerical | Real (in mm/Hg) |
| Specific Gravity | sg | Nominal | Polynomial (1.005, 1.010, 1.015, 1.020, 1.025) |
| Albumin | al | Nominal | Polynomial (0, 1, 2, 3, 4, 5) |
| Sugar | su | Nominal | Polynomial (0, 1, 2, 3, 4, 5) |
| Red Blood Cells | rbc | Nominal | Binary (normal, abnormal) |
| Pus Cell | pc | Nominal | Binary (normal, abnormal) |
| Pus Cell Clumps | pcc | Nominal | Binary (present, notpresent) |
| Bacteria | ba | Nominal | Binary (present, notpresent) |
| Blood Glucose Random | bgr | Numerical | Real (bgr in mgs/dl) |
| Blood Urea | bu | Numerical | Real (bu in mgs/dl) |
| Serum Creatinine | sc | Numerical | Real (sc in mgs/dl) |
| Sodium | sod | Numerical | Real (sod in mEq/L) |
| Potassium | pot | Numerical | Real (pot in mEq/L) |
| Hemoglobin | hemo | Numerical | Real (hemo in gms) |
| Packed Cell Volume | pcv | Numerical | Real |
| White Blood Cell Count | wc | Numerical | Real (wc in cells/cumm) |
| Red Blood Cell Count | rc | Numerical | Real (rc in millions/cmm) |
| Hypertension | htn | Nominal | Binary (yes, no) |
| Diabetes Mellitus | dm | Nominal | Binary (yes, no) |
| Coronary Artery Disease | cad | Nominal | Binary (yes, no) |
| Appetite | appet | Nominal | Binary (good, poor) |
| Pedal Edema | pe | Nominal | Binary (yes, no) |
| Anemia | ane | Nominal | Binary (yes, no) |
| Class | class | Nominal | Binary, and label (ckd, notckd) |
When importing the data, we noticed that it was necessary to modify the types of several attributes, because RapidMiner's automatic type detection did not get them right.
Most of the attributes are recognized as 'polynomial', even though only a few of them should be, so we edit all the attributes and assign the types defined in the table above. We also remove rows that might cause issues: for example, one row contains 'no' in the 'class' column; it could be interpreted as 'notckd', but since it is a single value and we are not certain, removing it causes no problems. Finally, we rename the attributes to make their names more descriptive and easier to work with.
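If you prefer to follow along outside RapidMiner, here is a minimal pandas sketch of the same cleanup. It assumes the UCI data has been exported to a CSV called `ckd.csv` with the abbreviated column names from the table above (both the file name and the export are assumptions, not part of the original workflow); the descriptive renaming step is only hinted at so the short abbreviations can be reused further down.

```python
import pandas as pd

# Hypothetical CSV export of the UCI data; "?" marks missing values there.
df = pd.read_csv("ckd.csv", na_values=["?"])

# Force the nominal attributes to categorical and everything else to numeric.
nominal = ["sg", "al", "su", "rbc", "pc", "pcc", "ba",
           "htn", "dm", "cad", "appet", "pe", "ane", "class"]
numeric = [c for c in df.columns if c not in nominal]
df[nominal] = df[nominal].astype("category")
df[numeric] = df[numeric].apply(pd.to_numeric, errors="coerce")

# Drop the suspicious row whose label is not one of the two expected classes.
df = df[df["class"].isin(["ckd", "notckd"])]

# The renaming to descriptive names (e.g. "bp" -> "blood_pressure") is skipped
# here so that the abbreviations from the table keep working in later sketches.
```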
Once the data is loaded into RM, we can study the following statistics:
From the statistics, we highlight the following observations:
*(Screenshots: the relevant statistics before and after the preprocessing described above.)*
Examining the statistics, we notice a substantial number of missing values. There are several ways to deal with this: we can substitute them with averages or logically derived values, remove the rows containing missing values entirely, or use machine learning methods to predict the missing values from the available data.
Nevertheless, it's essential to be cautious with these techniques, as they might introduce inaccurate data into the system, potentially compromising or deteriorating the final outcome. Additionally, some algorithms are capable of accommodating missing values. Therefore, in our initial version, we will work with the missing values in their current state. If we identify room for improvement, we can explore these techniques at a later stage.
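For reference, here is a rough pandas sketch of what those options could look like outside RapidMiner. It reuses `df`, `numeric` and `nominal` from the loading sketch above; as stated, the first model runs keep the missing values untouched, so the imputed copy is only there for later experiments.

```python
# Option 1: drop every row that has at least one missing value.
complete_rows = df.dropna()

# Option 2: fill numeric gaps with the column mean and nominal gaps with the
# most frequent value, kept in a separate copy (the label is left untouched).
df_imputed = df.copy()
for col in numeric:
    df_imputed[col] = df_imputed[col].fillna(df[col].mean())
for col in [c for c in nominal if c != "class"]:
    df_imputed[col] = df_imputed[col].fillna(df[col].mode().iloc[0])

# Option 3: model-based imputation (e.g. scikit-learn's IterativeImputer for
# the numeric attributes) predicts each missing value from the other columns.
```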
Once in RapidMiner, we can quickly use the Correlation Matrix operator to learn more about these attributes.
At first sight we can't find heavily correlated attributes. The strongest correlations are between the class attribute and Hemoglobin, Packed Cell Volume, and Red Blood Cell Count, but none of them exceed 0.77, so for now we will leave the data as it is.
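A rough equivalent of that operator, sketched with pandas (the 0/1 encoding of the label is an assumption made for the sketch):

```python
# Correlate the numeric attributes with each other and with a 0/1 encoding of
# the class label, then look at the correlations involving the label.
corr = df[numeric].assign(ckd=(df["class"] == "ckd").astype(int)).corr()
print(corr["ckd"].sort_values(ascending=False))
```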
Now to the fun part! We will start processing the data with some ML models to find out whether we can predict the target value. Since we are trying to classify new incoming data into two possible outcomes, 'ckd' and 'notckd', this is clearly a classification problem, so we will use classification models.
Some classification models that we can make use of are logistic regression, linear discriminant analysis (LDA), k-nearest neighbours (KNN), and naive Bayes.
To assess the performance of the models, we will employ the Cross-Validation operator from RM, using a 5-fold strategy. Cross-validation involves dividing the dataset into five subsets, and the model is trained and validated five times. Each fold serves as the validation set exactly once, ensuring thorough assessment. This approach not only utilizes all available data but also tests the model's ability to generalize to unseen data, making it a robust validation method.
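Outside RapidMiner, the same 5-fold setup could look roughly like this scikit-learn sketch. It uses `df_imputed` from the missing-values sketch above (scikit-learn's learners, unlike some RapidMiner operators, refuse missing values), and the one-hot encoding via `get_dummies`, the stratified shuffling, and the random seed are all choices made for the sketch rather than part of the original process.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Encode the nominal attributes as 0/1 columns and build the binary target.
X = pd.get_dummies(df_imputed.drop(columns=["class"]))
y = (df_imputed["class"] == "ckd").astype(int)

# 5-fold cross-validation: every example is used for validation exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean(), scores.std())
```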
We can see that there are many attributes carrying information here. Some of them may not be useful, and may even hurt us, not only in terms of computational efficiency but also in the quality of the outcome itself.
To analyze these attributes in search of the most useful ones, we can employ various techniques. However, we will opt for the RM operator called Optimize Selection (evolutionary). This operator essentially repeats the entire process multiple times, selecting different attributes each time and evaluating their performance. Ultimately, this operator will help us identify the most useful attributes, resulting in improved performance. On a side note, we choose 'evolutionary' because it is the option most likely to prevent reaching a local maximum.
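To make the idea concrete, here is a toy, mutation-only version of evolutionary attribute selection (RapidMiner's actual operator also uses crossover and other refinements): a 0/1 mask over the attributes is evolved, and each mask is scored with the cross-validated accuracy from the sketch above. Population size, mutation rate, and number of generations are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = X.shape[1]

def fitness(mask):
    """Cross-validated accuracy of logistic regression on the masked columns."""
    if not mask.any():
        return 0.0
    return cross_val_score(LogisticRegression(max_iter=1000),
                           X.loc[:, mask], y, cv=cv).mean()

# Start from 20 random attribute subsets and evolve them by mutation only.
population = rng.integers(0, 2, size=(20, n_features)).astype(bool)
for generation in range(15):
    scores = np.array([fitness(ind) for ind in population])
    survivors = population[np.argsort(scores)[-10:]]              # keep the better half
    children = survivors ^ (rng.random(survivors.shape) < 0.05)   # 5% bit flips
    population = np.vstack([survivors, children])

best = max(population, key=fitness)
print("selected attributes:", list(X.columns[best]))
```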
Let's work with the first algorithm, logistic regression. We plug it into the Cross Validation operator and press play.
After it finishes processing we look at the results:
Seems a little too good to be true, doesn't it? There may be some issues with the data that lead to such a good, but false, outcome. Also, playing around with the operators, we observed that the Optimize Selection operator raised the performance from 98% to 100%. Even though this may be a misleading result, we will keep working with the other models to see how they perform.
LDA does not support missing values, so we are going to fill them with average values. This is not a good idea, because it is made-up information, but we will try it anyway just to see how it performs. LDA also does not support binomial or polynomial attributes; to solve this we use the Nominal to Numerical operator, which transforms the values of binomial and polynomial attributes into numerical ones.
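Outside RapidMiner this step is short, since the one-hot encoding used when building `X` above already plays the role of Nominal to Numerical, and the mean/mode imputation stands in for the average filling (again, an approximation of the process, not the original operators):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# X already contains the mean/mode-imputed data with the nominal attributes
# one-hot encoded, which is what LDA needs here.
print(cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=cv).mean())
```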
After it finishes processing we look at the results:
Not bad, considering how many made-up values there are. Playing around with the operators, we found that leaving out the rows with missing values (a lot of rows) gave a 61% success rate, while auto-completing the missing values raised it to 90%. This is probably due to the very few examples in the dataset that have all attributes complete. We also noted that the Optimize Selection operator raised the performance from 90% to 96%! We can clearly see that this operator rocks.
KNN, or k-nearest neighbours, is a computationally expensive algorithm, but since we won't be working with a huge dataset, we will try it out and see how it performs.
After it finishes processing we look at the results:
Really good performance! We accomplished this after some tweaking of the KNN operator's parameters and some data preprocessing. Running it as-is resulted in 61% performance. We then added the Normalize operator, which normalizes every attribute; this is especially helpful for KNN and raised the performance to 91%. After that we tweaked the k parameter of the KNN operator, finding that values around 25 were the most performant, raising the performance to 95.49%. Lastly, we used the trusty Optimize Selection operator, reaching the result shown, almost 99% 🚀.
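The same pipeline, sketched with scikit-learn: z-score scaling stands in for the Normalize operator (the exact normalization method used in RapidMiner is an assumption here), and k = 25 matches the value found above.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale inside the pipeline so each CV fold is normalized on its own training data.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=25))
print(cross_val_score(knn, X, y, cv=cv).mean())
```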
Naive Bayes assumes independence of the features and may not work well with highly correlated ones. However, as we saw previously, there are no highly correlated attributes, so we are going to give it a chance and see how it performs:
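A short sketch of the equivalent run outside RapidMiner (using Gaussian naive Bayes on the same encoded attributes, which is an assumption about which naive Bayes variant fits the data best):

```python
from sklearn.naive_bayes import GaussianNB

# Naive Bayes needs no scaling, so the encoded matrix X is used as-is.
print(cross_val_score(GaussianNB(), X, y, cv=cv).mean())
```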
After it finishes processing we look at the results:
Again, amazing performance. And the Optimize Selection operator did its thing again, turning an already excellent 99% performance into 100%.
All these performances seem too good to be true, and this is a bit unsettling. However, after some time thinking about why it might work too well, I couldn't find a satisfactory explanation other than: it really works!
After all, thanks to the Cross Validation operator, the models should not be suffering from overfitting, which would otherwise be a likely explanation for such high performance. To continue this work, the best way to further validate the models would be to test them on another, very similar, already-classified dataset and check whether the performance really is as good as it appears.
Lastly, the MVP of this case study is the Optimize Selection operator, which improved every model, both in resource efficiency and in overall performance. 🥳