Zach Quinn

Summary

The article discusses the issue of data hoarding, its potential legal implications, and the misconception that retaining vast amounts of historical data equates to business value.

Abstract

The article titled "The Data Vice No One Talks About: Data Hoarding" delves into the practice of excessive data retention, highlighting the subreddit r/DataHoarder as an example of a community that collects data with the intent of archiving and preservation. It underscores the legal risks associated with data hoarding in the context of evolving privacy laws like the GDPR and CCPA, which impose strict guidelines and penalties for non-compliance. The piece argues that the indiscriminate accumulation of data is not only potentially illegal but also often unnecessary for business insights, leading to increased costs and digital clutter. It suggests that organizations should focus on the quality and purposeful use of data rather than the quantity, advocating for a more strategic approach to data collection and analysis.

Opinions

  • Data hoarding, while sometimes seen as a hobby or necessity for archiving, can lead to legal violations as privacy laws become more stringent.
  • The value of historical data is questioned, suggesting that it may not provide the expected business insights and can result in unnecessary storage costs.
  • The r/DataHoarder community is portrayed as having a noble purpose in archiving data, contrasting with organizations that hoard data without a clear objective.
  • Organizations are encouraged to shift their focus from data collection to data analysis, ensuring that the data they retain is relevant and useful for decision-making.
  • The article implies that the current obsession with big data has led to the misconception that more data equates to better insights, which is not always the case.
  • The author suggests that organizations should be more selective and purposeful in their data retention practices to avoid legal and financial repercussions.

The Data Vice No One Talks About: Data Hoarding.

How long-term data retention could violate the law… and not even yield business results.

Photo by Andrew Haimerl (ANDREWNEF) on Unsplash

A Digital Disease

One of the more peculiar subreddits (on a site full of them) is r/DataHoarder. The subreddit’s moderators describe the community as a forum for those suffering from the ‘digital disease’ of data hoarding, the practice of retaining, to an extreme degree, all forms of data. With more than half a million members, the community prides itself on enabling those who suffer from an inclination to hoard data. One of the top posts is from a verified user who claims to have 87 TB of storage, at a cost of approximately 5,000 dollars.

The aim of the community isn’t to amass stockpiles of useless data. On the contrary, r/DataHoarder’s members see themselves as archivists in an almost noble fashion. The content they collect serves a purpose. There is a discussion on finding an archive of resources to rebuild after the apocalypse. There is another seeking to coordinate an effort to document the 2022 Ukrainian conflict. Although several media outlets have published stories on this community, many treat it as an eccentric collective rather than what the group seems to want to be known for: intentional archiving. I’d argue, in fact, that r/DataHoarder users make better use of their historic data than most data-oriented organizations.

Unfortunately, the window for any organization seeking to leverage historic data in production is closing due to ongoing privacy legislation. We’re now at a point where data hoarding won’t just be a bad organizational practice, akin to knowledge hoarding. It will be a difficult act to defend as federal and international privacy legislation evolves.

Data Retention Law

Currently, European Union (EU) law goes furthest in specifying what power individuals have to prevent their information from being stored (and used) indefinitely. When exercised, the General Data Protection Regulation (GDPR)’s much-publicized ‘right to be forgotten’ enables users to request that their personal information be deleted, including from virtual public forums like Google Search. Yet despite being the self-described ‘toughest privacy and security law in the world’, the GDPR does not outright ban retaining data on EU residents. Instead, its storage limitation principle requires that personal data be kept no longer than necessary for the purpose it was collected for, without fixing a single retention period. The GDPR’s closest U.S. counterpart, the California Consumer Privacy Act (CCPA), establishes a 45-day window for companies to respond to a consumer’s deletion request.

Photo by Christian Lue on Unsplash

Two other U.S. states, Virginia and Colorado, have passed consumer privacy laws, set to go into effect in 2023. Penalties for data hoarding can be severe for data-opaque organizations, especially smaller or mid-sized businesses. The GDPR states that companies caught engaging in more serious violations can be fined up to 20 million euros or four percent of their worldwide annual revenue, whichever is higher. While data hoarding itself is not explicitly mentioned, violations related to processing and data subjects’ rights, both of which are rooted in long-term data storage, are.
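
To see why that ceiling weighs more heavily on smaller businesses, here is a minimal arithmetic sketch. It assumes the GDPR's higher fine tier, where the cap is the greater of 20 million euros or four percent of worldwide annual revenue. The function name and the revenue figures are hypothetical, and the result is only the legal maximum, not a prediction of what a regulator would actually impose.

```python
def gdpr_fine_cap_eur(annual_revenue_eur: float) -> float:
    """Return the maximum higher-tier GDPR fine, in euros.

    Assumption: the cap is the greater of a flat EUR 20 million
    or 4% of worldwide annual revenue.
    """
    flat_cap_eur = 20_000_000
    revenue_cap_eur = annual_revenue_eur * 4 / 100
    return max(flat_cap_eur, revenue_cap_eur)


# Hypothetical companies: a mid-sized business and a large enterprise.
for revenue in (50_000_000, 1_000_000_000):
    cap = gdpr_fine_cap_eur(revenue)
    share = cap / revenue * 100
    print(f"revenue EUR {revenue:,}: cap EUR {cap:,.0f} ({share:.0f}% of revenue)")
```

Because the flat amount applies whenever it exceeds four percent of revenue, any company earning less than 500 million euros a year faces a ceiling that is a larger share of its revenue than a bigger competitor's would be.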

Making matters worse, many businesses bound by the CCPA make it difficult, sometimes intentionally, to petition for the cessation of individual data collection. A 2020 Consumer Reports study determined that over 60 percent of a sample of 400 individuals either had difficulty navigating the mechanisms to submit a request or did not hear back after they sent a petition.

Long-term data retention is particularly unnerving to data consumers because of the permanence of the act. Admittedly, even though I work in data, the thought that my personal data, traces of my essence on the Internet, could outlive me is a bit unsettling. However, what’s even crazier than the notion of your Google Searches living for decades on a server is the idea that this kind of data isn’t even particularly helpful to businesses. One of the biggest misconceptions about big data is that size equals quality. Another is that the further back you retain data, the more insight you possess.

Historic Data != Business Value

The effectiveness of historic data depends entirely on a business’s needs, and it is often unnecessary to store years or even decades of data on a single consumer. That kind of storage takes up space and creates digital clutter. If your organization relies on a cloud computing service like Amazon Web Services or Google Cloud, your leadership is going to be constantly adjusting slot usage and long-term storage rates, which can be costly for data that might only be used for ‘big picture’ analyses. A byproduct of this corporate data hoarding is that the focus shifts from data analysis to data collection. Organizational leaders can lose focus and begin demanding data without knowing what it will be used for.
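
One way to make retention a deliberate choice rather than a default is to state the window explicitly and separate what falls outside it. The following is a minimal sketch, not a production purge job. The record layout, the `collected_at` field, and the 365-day window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def split_by_retention(records, retention_days, now=None):
    """Partition records into (keep, purge) lists by age.

    Assumption: each record is a dict with a timezone-aware
    'collected_at' datetime. Records older than the retention
    window are marked for purging.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    keep = [r for r in records if r["collected_at"] >= cutoff]
    purge = [r for r in records if r["collected_at"] < cutoff]
    return keep, purge


# Hypothetical consumer records collected at different times.
records = [
    {"user_id": 1, "collected_at": datetime(2023, 6, 1, tzinfo=timezone.utc)},
    {"user_id": 2, "collected_at": datetime(2020, 6, 1, tzinfo=timezone.utc)},
]
keep, purge = split_by_retention(
    records, retention_days=365, now=datetime(2024, 1, 1, tzinfo=timezone.utc)
)
print(len(keep), "kept;", len(purge), "flagged for purging")
```

The useful part is less the code than the question it forces: someone has to choose `retention_days` and justify it against an actual analysis need.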

The Data Hoarder Intervention

Photo by Alexandre Debiève on Unsplash

Even though it’s easy to see the r/DataHoarder community as a cross between cyberpunks and actual, physical hoarders, organizations that blindly collect data could learn something from their passion, vision and commitment to an end goal. Too many organizations speak of being ‘data-driven’ without sourcing and leveraging data effectively. Truthfully, if actively collected data from the past year is of a high enough quality and ingested at a frequent interval, the only reason data from a short time period will be ineffective is a lack of thoughtful application.

Organizations don’t need to panic and voraciously ingest every third-party data point before laws such as the GDPR, the CCPA and other U.S. state laws crack down on long-term data storage. They simply need to be more purposeful about using the data points they have.

