More Data is not better! Well, that took long enough to glean!
With the emergence of expressions such as “big data”, it has become the norm to think that more data is always better than less. Counterintuitive as it may sound, it should have been clear that this is not true. Now a paper out of Stanford University and Harvard University lays out the case that more data is not always better, and not even always harmless. In risk modeling, the crux of the paper, the data often doesn’t directly measure the outcome under study, such as recidivism, health outcomes or related topics.
The authors call this label bias. Because the underlying data can carry biases tied to geography, economics or other factors, it can lead to confounded, incorrect and often dangerous results for racial minorities, for instance for African Americans when it comes to crime or healthcare outcomes. In such instances, removing incorrect data or proxies can actually improve outcomes.
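To make the idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the two neighborhoods, the 20% offense rate, the detection rates, and the toy "model" (a plain group average). It is not the paper's data or method. It only shows how a proxy feature can soak up bias in the recorded label (arrests) when the outcome we care about (offending) is identical across groups.

```python
from collections import defaultdict

# Hypothetical numbers, chosen only for illustration (not taken from the paper).
TRUE_OFFENSE_PCT = 20                      # identical in both neighborhoods
DETECTION_PCT = {"heavily_policed": 90,    # share of offenses that end in an arrest
                 "lightly_policed": 30}
PEOPLE_PER_NEIGHBORHOOD = 1000


def build_population():
    """Return one record per person: (neighborhood, offended, arrested)."""
    rows = []
    for neighborhood, detect_pct in DETECTION_PCT.items():
        offenders = PEOPLE_PER_NEIGHBORHOOD * TRUE_OFFENSE_PCT // 100
        arrested = offenders * detect_pct // 100
        for i in range(PEOPLE_PER_NEIGHBORHOOD):
            # Only people who offended can be arrested in this toy setup.
            rows.append((neighborhood, i < offenders, i < arrested))
    return rows


def fit_risk_scores(rows, use_neighborhood):
    """Toy 'model': the mean arrest rate within each feature group.

    With use_neighborhood=False every person falls into one group,
    which mimics dropping the proxy feature from the model.
    """
    counts = defaultdict(lambda: [0, 0])   # key -> [arrests, people]
    for neighborhood, _offended, arrested in rows:
        key = neighborhood if use_neighborhood else "everyone"
        counts[key][0] += int(arrested)
        counts[key][1] += 1
    return {key: arrests / people for key, (arrests, people) in counts.items()}


def true_offense_rates(rows):
    """Mean of the outcome we actually care about, per neighborhood."""
    counts = defaultdict(lambda: [0, 0])   # neighborhood -> [offenses, people]
    for neighborhood, offended, _arrested in rows:
        counts[neighborhood][0] += int(offended)
        counts[neighborhood][1] += 1
    return {key: offenses / people for key, (offenses, people) in counts.items()}


if __name__ == "__main__":
    population = build_population()
    print("true offense rate:  ", true_offense_rates(population))
    print("score with proxy:   ", fit_risk_scores(population, use_neighborhood=True))
    print("score without proxy:", fit_risk_scores(population, use_neighborhood=False))
```

Under these made-up numbers, the score that uses the neighborhood feature rates people in the heavily policed area three times as risky (0.18 versus 0.06), even though the true offense rate is 0.20 in both places. Dropping the feature gives everyone the same score (0.12). That score still understates the true rate, but it no longer ranks one group above the other for reasons that have nothing to do with behavior.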
Read the summary article here: https://hai.stanford.edu/news/how-bias-hides-kitchen-sink-approaches-data?utm_source=Stanford+HAI&utm_campaign=013d8c6d1d-hai_news_june_9_2024
The original paper is available here: https://www.science.org/doi/epdf/10.1126/sciadv.adi8411
The cover image for the post is from here: https://pixabay.com/vectors/statistic-analytic-diagram-1564428/
Reposting: Quick Review: Big Data, by Brian Clegg
Reposting from my other young site: http://bibliomaniac.me/
Big Data is now more than hype, which, however, won't stop those who wish to hype things away from reality from continuing to do so.
If you want a clear and concise book on Big Data, or if, like me, you can never stay away from a refresher, this book is for you.
From obvious Big Data examples, such as Netflix, to some very Brit-specific stuff (the author is British, as you might have guessed), the book is an easy read at just under 150 pages, and can be easily understood by a wide range of readers.
It also lays out important details, such as the purpose of analyzing Big Data, the pitfalls of rushing into it without informing a panic-prone public (hear that, Google?), and more.
Enjoy the read!