In the healthcare industry, a primary concern when doing research or building a machine learning model to improve patient care is handling imbalanced data. For example, in any given year, most individuals do not suffer heart failure, which is excellent news. However, when we try to build a model that predicts an individual's chances of having heart failure, the scarcity of heart failure cases in our data set can lead the model to misrepresent them. Another example: are all races and socioeconomic classes equally represented in our data? Unfortunately, given the population or data available, an imbalanced data set may be the only option we have.
In health care, we must provide the utmost care to everyone. A machine learning model must therefore learn the genuine relationships in our data, not patterns created by underrepresentation. To overcome this obstacle, many data science methods and techniques can be employed to leverage all of our data.
Data sampling techniques
When dealing with imbalanced data, one way to adjust for the imbalance is to apply sampling techniques to the data set. Common techniques include undersampling, oversampling, the synthetic minority over-sampling technique (SMOTE) and synthetic minority over-sampling for regression with Gaussian noise (SMOGN).
With undersampling, we adjust for the imbalance by keeping all the data in the underrepresented class (group) and randomly sampling from the larger classes (groups) until each has the same number of points as the minority class. This creates a new data set in which the larger classes are reduced to the size of the underrepresented class or classes, giving us a balanced data set.
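As a minimal sketch, assuming a pandas DataFrame df with a class-label column (df and label_col are hypothetical names), random undersampling might look like this:

```python
import pandas as pd

def undersample(df: pd.DataFrame, label_col: str, seed: int = 42) -> pd.DataFrame:
    """Randomly undersample so every class matches the smallest class's size."""
    minority_size = df[label_col].value_counts().min()
    # The minority class is kept in full (sampling n of its n rows without
    # replacement); larger classes are randomly reduced to the same size.
    return (
        df.groupby(label_col)
          .sample(n=minority_size, random_state=seed)
          .reset_index(drop=True)
    )
```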
Oversampling works in the opposite direction: we keep the entire majority class along with the minority class or classes, then repeatedly resample (draw data points with replacement) from the minority class or classes, adding the resampled points to the new data set until all the classes (groups) are the same size. Depending on the severity of the imbalance, there are concerns with both of these traditional methods.
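A corresponding sketch for random oversampling, under the same hypothetical df and label_col, keeps every original row and duplicates minority rows to close the gap:

```python
import pandas as pd

def oversample(df: pd.DataFrame, label_col: str, seed: int = 42) -> pd.DataFrame:
    """Keep every original row; duplicate minority rows until classes are equal."""
    majority_size = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        parts.append(group)  # each class, majority included, is kept as-is
        deficit = majority_size - len(group)
        if deficit > 0:
            # Draw the shortfall with replacement from the smaller class.
            parts.append(group.sample(n=deficit, replace=True, random_state=seed))
    return pd.concat(parts).reset_index(drop=True)
```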
For example, with undersampling, if the minority classes are much smaller than the majority class, randomly picking records from the majority until it matches the size of the smallest group may discard critical aspects of that majority class. With oversampling, if we keep resampling from the minority class to add more data points, the model sees many copies of the same points and can build false confidence in those specific cases.
To address these concerns, SMOTE and SMOGN were created. Both are forms of oversampling, but they work differently from the random resampling described above. With SMOTE and SMOGN, the larger class (group) is kept in its entirety. For the smaller or minority classes, rather than duplicating existing records, these techniques examine the characteristics of the minority points, typically their nearest neighbors, and generate synthetic data points that are similar to, but not the same as, the originals. The minority classes are thus bolstered with plausible new points rather than duplicates, yielding a balanced data set that avoids the pitfalls of traditional undersampling and oversampling. The difference between the two is that SMOTE is designed for classification problems, while SMOGN handles continuous targets (regression). In addition, one can tune the sampling strategy and the number of nearest neighbors that determine what counts as "similar" to a given minority data point.
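A sketch of SMOTE in practice, using the imbalanced-learn library; the synthetic data set built with scikit-learn's make_classification is a stand-in for real patient records:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic, heavily imbalanced data standing in for real records
# (roughly 5% positive class).
X, y = make_classification(n_samples=2_000, weights=[0.95, 0.05], random_state=42)
print("before:", Counter(y))

# SMOTE interpolates between each minority point and its k nearest minority
# neighbors; sampling_strategy controls how much is generated, and
# k_neighbors controls what counts as "similar".
smote = SMOTE(sampling_strategy="auto", k_neighbors=5, random_state=42)
X_balanced, y_balanced = smote.fit_resample(X, y)
print("after:", Counter(y_balanced))
```

For regression targets, the separate smogn Python package provides an analogous SMOGN implementation.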
By implementing the correct sampling technique, an imbalanced data set can be resampled into a balanced one, helping prevent a model from learning false patterns created by the original imbalance.
Selecting machine learning algorithms for use cases
Another method for handling imbalanced data is carefully selecting the machine learning algorithm for your use case. Some tree-based boosting algorithms, for example, can naturally handle imbalanced data sets: boosting iteratively focuses on the examples the current model gets wrong, which often includes minority-class records, and most implementations also expose class-weighting parameters. Examples of these algorithms include AdaBoost, XGBoost, CatBoost and LightGBM. By choosing an algorithm that handles imbalance natively, the data itself does not have to be resampled for the model to learn the actual patterns in our data and avoid the false patterns caused by misrepresentation. With these algorithms, we can leverage all of our data to draw powerful insights and drive critical business decisions while ensuring our imbalanced data is not producing false assumptions.
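As one illustration, XGBoost exposes a scale_pos_weight parameter for reweighting the positive (minority) class during training; a commonly suggested starting value is the ratio of negative to positive examples. A minimal sketch on the same kind of synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Synthetic imbalanced data standing in for real records (5% positive class).
X, y = make_classification(n_samples=2_000, weights=[0.95, 0.05], random_state=42)

# Commonly suggested starting point: (# negative examples) / (# positive examples).
ratio = float(np.sum(y == 0)) / np.sum(y == 1)

model = XGBClassifier(
    n_estimators=300,
    scale_pos_weight=ratio,  # upweight the minority class during training
    eval_metric="logloss",
)
model.fit(X, y)
```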
Using data science methods to overcome imbalanced data
In health care, we often find our data imbalanced, and obtaining a balanced data set may be impossible. While an imbalanced data set poses difficulties, it can still be used. With the right data science methods and techniques, we can overcome the imbalance in our data and start making important business decisions that improve patient outcomes and deepen our understanding of the relationships in our data. Several methods for handling imbalanced data were discussed here, but others exist. If you are trying to overcome imbalanced data, please connect with a CGI expert to discuss leveraging your imbalanced data to generate actionable insights, or explore our life sciences expertise.