CIOs everywhere will be familiar with the major issues caused by collecting and retaining data at an increasingly rapid rate. Industry research shows 64% of enterprises manage at least one petabyte of data, creating substantial cost, governance and compliance pressures.
If that wasn’t enough, organizations frequently default to retaining these enormous datasets, even when they are no longer needed. To put this into context, the average useful life of most enterprise data has now shrunk to 30–90 days; however, for various reasons, businesses continue to store it indefinitely, thereby adding to the cost and complexity of their underlying infrastructure.
As much as 90% of this information comes in the form of unstructured data files spread across hybrid, multi-vendor environments with little to no centralized oversight. This can include everything from MS Office docs to the photo and video content routinely used by marketing teams. The list is extensive, stretching to invoices, service reports, log files and, in some organizations, even scans or faxes of handwritten documents, often dating back decades.
In these circumstances, CIOs often lack clear visibility into what data exists, where it resides, who owns it, how old it is or whether it holds any business value. This matters because, in many cases, that data has tremendous value, with the potential to offer insight into a range of important business issues, from customer behaviour to field quality challenges.
With the advent of GenAI, it is now realistic to tap the knowledge embedded in all kinds of documents and retrieve high-quality (i.e., relevant, useful and correct) content from them, even when the documents themselves are of poor visual or graphical quality. As a result, running AI across a combination of structured and unstructured input can help reconstruct an organization’s institutional memory, the so-called “tribal knowledge”.
Visibility and governance
The first point to appreciate is that the biggest challenge is not the amount of data being collected and retained, but the absence of meaningful visibility into what is being stored.
Without an enterprise-wide view (a situation common to many organizations), teams cannot determine which data is valuable, which is redundant, or which poses a risk. In particular, metadata remains underutilized, even though insights such as creation date, last access date, ownership, activity levels and other basic indicators can immediately reveal security risks, duplication, orphaned content and stale data.
Visibility begins by building a thorough understanding of the existing data landscape. This can be done by using tools that scan storage platforms across multi-vendor and multi-location environments, collect metadata at scale, and generate virtual views of datasets. This allows teams to understand the size, age, usage and ownership of their data, enabling them to identify duplicate, forgotten or orphaned files.
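To make this concrete, the sketch below shows, in simplified form, the kind of metadata sweep such scanning tools perform. It assumes a file share mounted at a hypothetical path and an arbitrary staleness threshold; a production tool would scan multiple platforms, handle object stores and permissions, and operate at far greater scale.

```python
import os
import time
from pathlib import Path

STALE_DAYS = 365            # hypothetical threshold for "stale" data
ROOT = Path("/mnt/shared")  # hypothetical mount point for a file share

def scan(root: Path):
    """Walk a directory tree and collect basic metadata for each file."""
    now = time.time()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = Path(dirpath) / name
            try:
                st = path.stat()
            except OSError:
                continue  # skip files we cannot read
            yield {
                "path": str(path),
                "size_bytes": st.st_size,
                "owner_uid": st.st_uid,
                "last_access_days": (now - st.st_atime) / 86400,
                "last_modified_days": (now - st.st_mtime) / 86400,
            }

if __name__ == "__main__":
    records = list(scan(ROOT))
    stale = [r for r in records if r["last_access_days"] > STALE_DAYS]
    total_gb = sum(r["size_bytes"] for r in records) / 1e9
    stale_gb = sum(r["size_bytes"] for r in stale) / 1e9
    print(f"{len(records)} files, {total_gb:.1f} GB total; "
          f"{len(stale)} files ({stale_gb:.1f} GB) untouched for over a year")
```

Even a basic sweep like this surfaces the indicators mentioned above: size, age, ownership and last access, which is usually enough to start flagging duplicate, forgotten or orphaned content.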
It’s a complex challenge. In most cases, some data will be on-premises, some in the cloud, some stored as files and some as objects (for example, in Amazon S3 or Azure Blob Storage). In these circumstances, the multi-vendor infrastructure approach adopted by many organizations is sound, as it facilitates data redundancy and replication while also protecting against increasingly common cloud outages, such as those seen at Amazon and Cloudflare.
With visibility tools and processes in place, the next requirement is to introduce governance frameworks that bring structure and control to unstructured data estates. Good governance enables CIOs to align information with retention rules, compliance obligations and business requirements, reducing unnecessary storage and risk.
It’s also dependent on effective data classification processes, which help determine which data should be retained, which can be relocated to lower-cost platforms and which no longer serves a purpose. Together, these processes establish clearer ownership and ensure data is handled consistently across the organization, while also providing the basis for reliable decision-making by keeping data accurate. Without this foundation, visibility alone cannot deliver operational or financial benefits, because there is no framework for acting on what the organization discovers.
Lifecycle management
Once CIOs have a clear view of what exists and a framework to control it, they need a practical method for acting on those findings across the data lifecycle. By applying metadata-based policies, teams can migrate older or rarely accessed data to lower-cost platforms, thereby reducing pressure on primary storage. Files that have not been accessed for an extended period can be relocated to more economical systems, while long-inactive data can be archived or removed entirely if appropriate.
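As a simplified illustration, the hypothetical policy sketch below shows how last-access metadata might be mapped to lifecycle actions. The thresholds, action names and sample records are assumptions for illustration, not any specific vendor’s policy engine, and any deletion decision would still need to pass a retention and compliance review.

```python
from collections import Counter

# Hypothetical thresholds; real values come from retention and compliance rules.
WARM_AFTER_DAYS = 180      # untouched this long -> candidate for lower-cost storage
ARCHIVE_AFTER_DAYS = 730   # untouched this long -> candidate for archive
DELETE_AFTER_DAYS = 2555   # roughly 7 years; deletion still needs compliance review

def tiering_action(last_access_days: float) -> str:
    """Map a file's last-access age to a recommended lifecycle action."""
    if last_access_days >= DELETE_AFTER_DAYS:
        return "review-for-deletion"       # never delete automatically
    if last_access_days >= ARCHIVE_AFTER_DAYS:
        return "archive"
    if last_access_days >= WARM_AFTER_DAYS:
        return "move-to-low-cost-tier"
    return "keep-on-primary"

if __name__ == "__main__":
    # Sample metadata records of the kind produced by a storage scan.
    records = [
        {"path": "/mnt/shared/reports/q1.pdf", "last_access_days": 45},
        {"path": "/mnt/shared/video/launch.mov", "last_access_days": 400},
        {"path": "/mnt/shared/legacy/fax_scan_1998.tif", "last_access_days": 4000},
    ]
    summary = Counter(tiering_action(r["last_access_days"]) for r in records)
    print(dict(summary))
```

The point of a sketch like this is less the specific thresholds than the principle: placement decisions are driven by recorded metadata and written policy, not by where the data happened to land when it was created.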
A big part of the challenge is that the data lifecycle is now much longer than it used to be, a situation that has profoundly affected how organizations approach storage strategy and spend.
For example, datasets considered ‘active’ will typically be stored on high- or mid-performance systems. Again, there are both on-premises and cloud options to consider, depending on the use case, but most environments include both file and object requirements.
As time passes (often years), data gradually becomes eligible for archival. It is then moved to an archive tier, where it is better protected but may be less accessible or require additional checks before access. Within the archive, it can (after even more years) be tiered down to cheaper storage such as tape, at which point retrieval times can range from minutes to hours, or even days. Throughout, archived data typically remains subject to regulatory requirements and may need to be produced during e-discovery.
In most circumstances, it is only after this stage has been reached that data is finally eligible to be deleted.
When organizations take this approach, many discover that a significant proportion of their stored information falls into the inactive or long-inactive category. Addressing this issue immediately frees capacity, reduces infrastructure expenditure and helps prevent the further accumulation of redundant content.
Policy-driven lifecycle management also improves operational control. It ensures that data is retained according to its relevance rather than by default and reduces the risk created by carrying forgotten or outdated information. It supports data quality by limiting the spread of stale content across the estate and provides CIOs with a clearer path to meeting retention and governance obligations.
What’s more, at a strategic level, lifecycle management transforms unstructured data from an unmanaged cost into a controlled process that aligns storage with business value. It strengthens compliance by ensuring only the data required for operational or legal reasons is kept, and it improves readiness for AI and analytics initiatives by ensuring that underlying datasets are accurate and reliable.
To put all these issues into perspective, the business obsession with data shows no sign of slowing down. Indeed, the growing adoption of AI technologies is raising the stakes even further, particularly for organizations that continue to prioritize data collection and storage over management and governance. As a result, getting data management and storage strategies in order sooner rather than later is likely to rise to the top of the to-do list for CIOs across the board.
This article is published as part of the Foundry Expert Contributor Network.

