The Data Vice No One Talks About: Data Hoarding.
How long-term data retention could violate the law… and not even yield business results.

A Digital Disease
One of the more peculiar subreddits (on a site full of them) is r/DataHoarder. The subreddit’s moderators describe the community as a forum for those suffering from the ‘digital disease’ of data hoarding, the practice of retaining, to an extreme degree, all forms of data. With more than half a million members, the community prides itself on enabling those who suffer from an inclination to hoard data. One of the top posts is from a verified user who claims to have 87 TB of storage, at a cost of approximately 5,000 dollars.
The aim of the community isn’t to amass stockpiles of useless data. On the contrary, r/DataHoarder’s members see themselves as archivists, in an almost noble fashion. The content they collect serves a purpose. One discussion focuses on finding an archive of resources for rebuilding after an apocalypse. Another seeks to coordinate an effort to document the 2022 conflict in Ukraine. Although several media outlets have published stories on this community, many treat it as an eccentric collective rather than what the group seems to want to be known for: intentional archiving. I’d argue, in fact, that r/DataHoarder users make better use of their historic data than most data-oriented organizations.
Unfortunately, the window for any organization seeking to leverage historic data in production is closing as privacy legislation advances. We’re now at a point where data hoarding won’t just be a bad organizational practice, akin to knowledge hoarding. It will be a difficult act to defend as federal and international privacy legislation evolves.
Data Retention Law
European Union (EU) law is currently the most explicit about what power civilians have to prevent their information from being stored (and used) indefinitely. When exercised, the much-publicized ‘right to be forgotten’ in the General Data Protection Regulation (GDPR) enables users to request that their personal information be deleted, including from public-facing services like Google Search. Yet despite being the self-described ‘toughest privacy and security law in the world’, the GDPR does not outright ban retaining data on EU citizens. Nor does it set a fixed retention period: it requires that personal data be kept no longer than necessary for the purpose it was collected for, and the six-year figure often cited is a common rule of thumb rather than a limit written into the regulation. The California Consumer Privacy Act (CCPA), often described as the GDPR’s U.S. counterpart, establishes a 45-day window for companies to respond to a consumer’s deletion request.

Two other U.S. states, Virginia and Colorado, have passed consumer privacy laws, set to go into effect in 2023. Penalties can be severe for organizations that hoard data and are opaque about how they use it, especially smaller or mid-sized businesses. Under the GDPR, companies caught engaging in the more serious violations can be fined up to 20 million euros or four percent of their worldwide annual revenue, whichever is higher. While data hoarding itself is not explicitly mentioned, violations related to processing and data subjects’ rights, two areas closely tied to long-term data storage, are.
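To make that fine ceiling concrete, here is a minimal Python sketch of the arithmetic. It assumes the GDPR's upper tier, where the cap is the greater of a flat 20 million or four percent of worldwide annual revenue. The function name and the sample revenue figures are hypothetical, and actual fines are set case by case by regulators, so treat this as an illustration of the ceiling only.

```python
FLAT_CAP = 20_000_000   # flat ceiling for the more serious tier of violations
REVENUE_PERCENT = 4     # four percent of worldwide annual revenue


def max_fine_ceiling(annual_revenue: int) -> int:
    """Return the upper bound of a fine: the greater of the flat cap
    or four percent of worldwide annual revenue (whole currency units)."""
    revenue_based = annual_revenue * REVENUE_PERCENT // 100
    return max(FLAT_CAP, revenue_based)


# Hypothetical revenue figures, for illustration only.
mid_sized_revenue = 50_000_000
large_revenue = 2_000_000_000

print(max_fine_ceiling(mid_sized_revenue))  # 20000000
print(max_fine_ceiling(large_revenue))      # 80000000
```

For the hypothetical mid-sized business, the flat cap applies and equals 40 percent of a year's revenue, which is why the exposure is proportionally heavier for smaller companies.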
Making matters worse, businesses bound by the CCPA often make it difficult to petition for the cessation of individual data collection. A 2020 Consumer Reports study determined that over 60 percent of a sample of 400 individuals either had difficulty navigating the mechanisms to submit a request or did not hear back after they sent a petition.
Long-term data retention is particularly unnerving to data consumers because of its permanence. Admittedly, even though I work in data, the thought that my personal data, traces of my essence on the Internet, could outlive me is a bit unsettling. However, what’s even crazier than the notion of your Google searches living for decades on a server is the idea that this kind of data isn’t even particularly helpful to businesses. One of the biggest misconceptions about big data is that size equals quality. Another is that the further back your historic data goes, the more insight you possess.
Historic Data != Business Value
The effectiveness of historic data depends entirely on a business’s needs, and it’s often unnecessary to store years or even decades of data on a single consumer. That kind of storage takes up space and creates digital clutter. If your organization relies on a cloud computing service like Amazon Web Services or Google Cloud, your leadership is going to be constantly managing slot usage and long-term storage costs, which can be expensive for data that might only be used for ‘big picture’ analyses. A byproduct of this corporate data hoarding is that the focus shifts from data analysis to data collection. Organizational leaders can lose focus and begin demanding data without knowing what it will be used for.
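One way to keep storage from growing unchecked is to define an explicit retention window and flag anything older for deletion or archival. The sketch below illustrates the idea in plain Python. The record layout, the `last_used` field and the 365-day window are all assumptions made for the example, not a recommendation for any particular dataset or legal requirement.

```python
from datetime import date, timedelta

RETENTION_DAYS = 365  # assumed window; pick one that matches the business need


def split_by_retention(records, today, retention_days=RETENTION_DAYS):
    """Split records into (keep, expired) based on when each was last used."""
    cutoff = today - timedelta(days=retention_days)
    keep = [r for r in records if r["last_used"] >= cutoff]
    expired = [r for r in records if r["last_used"] < cutoff]
    return keep, expired


# Hypothetical records, for illustration only.
records = [
    {"id": 1, "last_used": date(2022, 3, 1)},
    {"id": 2, "last_used": date(2019, 6, 15)},
    {"id": 3, "last_used": date(2021, 11, 30)},
]

keep, expired = split_by_retention(records, today=date(2022, 6, 1))
print([r["id"] for r in keep])     # [1, 3]
print([r["id"] for r in expired])  # [2]
```

Keying the window on when data was last used, rather than when it was collected, keeps the question on whether the data still serves a purpose.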
The Data Hoarder Intervention

Even though it’s easy to see the r/DataHoarder community as a cross between cyberpunks and actual, physical hoarders, organizations that blindly collect data could learn something from their passion, vision and commitment to an end goal. Too many organizations speak of being ‘data-driven’ without sourcing and leveraging data effectively. Truthfully, if data actively collected over the past year is of high enough quality and ingested at a frequent interval, the only reason data from such a short time period would be ineffective is a lack of thoughtful application.
Organizations don’t need to panic and voraciously ingest every third-party data point before laws such as the GDPR, the CCPA and other U.S. state laws crack down on long-term data storage. They simply need to be more purposeful about using the data points they have.





