More data is not better! Well, that took long enough to glean!
With the emergence of expressions such as “big data”, it has become the norm to think that more data is always better than less. Counterintuitive as it may seem, it should have been clear that this is not true. Now a paper from researchers at Stanford University and Harvard University lays out the case that more data is not always better, and that adding data is not necessarily harmless. The paper’s focus is risk modeling, where the available data often does not directly measure the outcome under study, such as recidivism or health outcomes.
These areas are prone to what the authors call label bias: the recorded outcome is only a proxy for the outcome of real interest. Because the underlying data can carry biases tied to geography, economics or other factors, models built on it can produce confounded, incorrect and often dangerous results for racial minorities, for instance for African Americans when it comes to crime or healthcare outcomes. In such instances, removing misleading data or proxies can actually improve outcomes.
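To make the idea concrete, here is a minimal toy sketch in Python. It is not the paper's method or data: every number, feature name and function below is hypothetical, chosen only to illustrate the point. It assumes that true offense risk depends only on prior behavior, while the recorded label (arrest) also depends on how heavily a neighborhood is policed.

```python
from fractions import Fraction as F

# Hypothetical numbers for illustration only; none come from the paper.
# True offense risk depends only on prior behavior, not on neighborhood.
P_OFFENSE = {"low": F(1, 5), "high": F(3, 5)}
# The chance that an offense is recorded as an arrest depends on policing.
P_ARREST_GIVEN_OFFENSE = {"lightly_policed": F(3, 10), "heavily_policed": F(9, 10)}
# Assumption: half of each behavior group lives in each neighborhood.
NEIGHBORHOOD_SHARE = {"lightly_policed": F(1, 2), "heavily_policed": F(1, 2)}


def kitchen_sink_score(behavior, neighborhood):
    """Exact arrest probability when the model sees both features."""
    return P_OFFENSE[behavior] * P_ARREST_GIVEN_OFFENSE[neighborhood]


def reduced_score(behavior):
    """Exact arrest probability once the neighborhood feature is dropped."""
    return sum(
        share * kitchen_sink_score(behavior, hood)
        for hood, share in NEIGHBORHOOD_SHARE.items()
    )


# Compare each score against the true (unobserved) offense risk.
for behavior, true_risk in P_OFFENSE.items():
    for hood in P_ARREST_GIVEN_OFFENSE:
        print(
            f"{behavior:>4} risk, {hood:<15}",
            f"true={true_risk}",
            f"kitchen_sink={kitchen_sink_score(behavior, hood)}",
            f"reduced={reduced_score(behavior)}",
        )
```

In this toy setup, the kitchen-sink score predicts the arrest label perfectly, yet it gives a low-risk person in the heavily policed neighborhood the same score as a high-risk person in the lightly policed one. The reduced score, which ignores neighborhood, stays proportional to the true risk.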
Read the summary article here: https://hai.stanford.edu/news/how-bias-hides-kitchen-sink-approaches-data?utm_source=Stanford+HAI&utm_campaign=013d8c6d1d-hai_news_june_9_2024
The original paper is available here: https://www.science.org/doi/epdf/10.1126/sciadv.adi8411
The cover image for the post is from here: https://pixabay.com/vectors/statistic-analytic-diagram-1564428/