Innovation Insight: File Analysis Innovation Delivers an Understanding of Unstructured Dark Data
Explosive growth in unstructured data is forcing IT leaders to rethink data management. IT, data and storage managers use file analysis to gain insight into their data, enabling better management and governance to improve business value, reduce risk and lower management cost.
- Unstructured data growth is rapidly outpacing structured data and is poorly controlled, stored and managed on file shares, on personal devices and in the cloud.
- Organizations have little awareness of the volume, composition, risk and business value of their unstructured data.
- Instead of addressing the holistic picture of unstructured data, including content, data access and data storage, IT leaders tend to view unstructured data only from the perspective of age, and do little if anything to support information governance.
- Organizations should review the scope of their unstructured data problems by using file analysis (FA) tools to understand where dark unstructured data resides and who has access to it.
- Identify the value and risks of unstructured data, and prioritize unstructured data management needs for classification and information governance, file and identity governance, storage management and content migration.
- Delete redundant or unneeded data once unstructured data is classified and mapped, then move legal, regulatory and stale data for compliance or low-touch retention reasons to lower-cost storage, and assign policies for retention and access.
FA tools go beyond traditional storage reporting tools: in addition to reporting on simple file attributes, they provide detailed metadata and contextual information to enable better information governance and storage management actions. These tools analyze, index, search, track and report on file metadata and, in some cases, file content, to assist in taking action on files according to what was collected.
FA tools offer a variety of options, for example:
- Storage management FA tools focus on how frequently unstructured data is used, identifying data associated with different applications and taking action on that data, such as migrating it to an archive or a tiered storage layer, or deleting it.
- File and identity governance tools focus on who has access to which files and can identify and correct anomalies directly through the tools or through integration with Active Directory.
- Another class of FA tools provides a full content index, and is used for classification and information governance. These tools focus on what actions to take on unstructured data for information governance, e-discovery (such as legal hold), archiving, defensible deletion and storage management.
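The metadata-level analysis described above can be illustrated with a minimal sketch. It is not how any particular FA product works: the `FileRecord` type, the `scan` and `stale` functions, and the use of last-modified time as a proxy for frequency of use are all assumptions made for illustration. Commercial tools typically collect far richer metadata, including permissions and application associations.

```python
import os
import stat
import time
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical per-file metadata record (illustrative only)."""
    path: str
    size_bytes: int
    modified: float   # POSIX timestamp of last modification
    owner_uid: int    # numeric owner ID (0 on platforms without UIDs)


def scan(root: str) -> Iterator[FileRecord]:
    """Walk `root` and yield one metadata record per regular file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.lstat(full)  # do not follow symbolic links
            except OSError:
                continue  # unreadable or vanished file: skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # ignore symlinks, sockets and other non-regular files
            yield FileRecord(full, st.st_size, st.st_mtime, st.st_uid)


def stale(records: Iterable[FileRecord], max_age_days: int,
          now: Optional[float] = None) -> List[FileRecord]:
    """Return the records not modified within the last `max_age_days` days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return [r for r in records if r.modified < cutoff]
```

A report built from such records, for example `stale(scan("/some/share"), 3 * 365)` on a hypothetical share path, only lists candidates. Deciding whether to archive, tier or delete them remains a policy decision.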
FA provides business value in the following ways:
- Reducing risk by identifying which files reside where and who has access to them. This allows remediation in areas such as eliminating personally identifiable information, corralling and controlling intellectual property, and finding and eliminating redundant and outdated data that may lead to business difficulties (for example, multiple copies of a contract)
- Reducing cost by reducing the amount of data stored
- Classifying valuable business data so that it can be more easily found and leveraged
- Supporting e-discovery efforts for legal and regulatory investigations
Unstructured data is growing at a much faster pace than traditional relational database management system (RDBMS) data, now accounting for well over half of all data storage by organizations and presenting a major challenge to manage. This data resides in file shares, email, SharePoint, file sync and share (FSS) applications, and individuals' laptops and desktops. Organizations are better equipped to take action when data access, usage, associations, redundancy and content are fully understood. Identifying the users and groups with access to data, matching them to who should or shouldn't have access and recognizing anomalies reduces security risks and increases organizational effectiveness. Understanding data's use and its associations with applications can identify the data to be moved to lower-cost storage or to be deleted. Going beyond the metadata and understanding the content of dark data (data gathered by companies that is not part of their day-to-day operations; see Note 1) can provide even more value as organizations initiate information governance strategies.
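As one illustration of how redundancy can be measured, the sketch below groups files with identical content by hashing them. This is a common general technique, not a description of any vendor's implementation; the function names `content_digest`, `duplicate_groups` and `redundant_bytes` are hypothetical.

```python
import hashlib
import os
from collections import defaultdict
from typing import Dict, Iterable, List


def content_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def duplicate_groups(paths: Iterable[str]) -> List[List[str]]:
    """Group paths whose contents are byte-for-byte identical."""
    # Cheap first pass: files of different sizes cannot be duplicates.
    by_size: Dict[int, List[str]] = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    # Second pass: hash only the files that share a size with another file.
    by_digest: Dict[str, List[str]] = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        for p in same_size:
            by_digest[content_digest(p)].append(p)

    return [sorted(group) for group in by_digest.values() if len(group) > 1]


def redundant_bytes(groups: List[List[str]]) -> int:
    """Bytes that could be reclaimed by keeping one copy per group."""
    return sum(os.path.getsize(g[0]) * (len(g) - 1) for g in groups)
```

Comparing sizes before hashing avoids reading files that cannot have a duplicate, which matters when a share holds millions of files.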
FA improves e-discovery readiness through searching, indexing and categorizing unstructured data that can be fed into archiving, enterprise content management (ECM) and e-discovery tools.
FA tools enable IT to create a visualization of unstructured data that can be presented to others in the organization so that they can make decisions based on the data. Key to this process is the creation of an effective cross-functional team of IT, lines of business (content experts), legal and compliance stakeholders who work together to use the data generated by FA to make better information governance decisions.
FA tools enable views into an organization's unstructured data, much as master data management (MDM) does for structured data. As organizations visualize the content of their unstructured data, the use cases will move beyond storage management and governance into business support.
Use cases for FA include:
- Classifications for information security purposes
- Enforcement of information governance and retention policies
- Support for archiving/e-discovery and business reporting
- Storage management
- Data center or server consolidation
- Cloud migration
- Support for management of data as a result of mergers or acquisitions
- Data deletion/legacy data cleanup
- Copy data management
Examples of specific use case scenarios:
- Organization: Manufacturing Company
- Use Case: Storage management
- Objective: Cleanse a file share environment that contained 30TB of file system data.
- Implementation: Initially delayed because of the newness of the FA approach internally. Once permissions were received, the file discovery and analysis project took less than three months to complete.
- Outcome: A total of 50% more content was identified beyond the original 30TB. After analysis, almost 60% of the data was identified for removal. As a result, the CIO authorized policies for the deletion of the data (currently being implemented). The payback period is two years, not counting the cost avoided by deferring a storage hardware purchase.
- Buyer: CIO and storage team
- Organization: Oil and Gas Company
- Use Case: Migration to SharePoint
- Objective: Clean up many sites worldwide that held an unknown volume of unstructured data, in the tens of terabytes, prior to migration to SharePoint. Remove sensitive data, reduce the document count and provide good-quality data.
- Implementation: The implementations and initial cleansing at all sites worldwide were completed in one year. Massive savings were realized, as the tool identified more than 30% of data that could be removed prior to the migration. The FA product generated metadata for tagging the files and for reorganizing the storage. This information was passed on to a migration tool. The project is still running and is being funded with new success factors focused on improving business user engagement.
- Outcome: The FA tool reduced the time and costs for migration, and provided metadata tags that could not have been practically generated by manual processes.
- Buyer: CTO
- Organization: Financial Services
- Use Case: Reporting
- Objective: The organization had previously deployed another tool, with limited success that fell short of expectations. While storage savings were a factor, the overall drivers weren't 100% clear. However, one long-term objective was to add metadata prior to migration to SharePoint.
- Implementation: During the six-week project, the tool identified 100TB of data, of which only 35TB were unique. Of the 35TB, the FA tool identified 15TB for removal.
- Outcome: At the onset, there was data everywhere that was being poorly managed. The storage team was able to identify the potential to go from 100TB to 20TB of necessary data.
- Buyer: CIO and storage team
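The arithmetic behind the storage figures in the scenarios above can be checked with a short sketch. The helper `cleanup_summary` is hypothetical. For the manufacturing case, two assumptions are made because the source does not say otherwise: that "almost 60%" applies to the full discovered volume and can be rounded up to 60%, and that none of the discovered data is treated separately as duplicate.

```python
def cleanup_summary(total_tb: float, unique_tb: float, removable_tb: float) -> dict:
    """Summarize a cleanup: duplicates, data retained and overall reduction.

    total_tb     -- all data discovered
    unique_tb    -- data left after removing duplicate copies
    removable_tb -- unique data additionally identified for removal
    """
    if total_tb <= 0 or not 0 <= removable_tb <= unique_tb <= total_tb:
        raise ValueError("expected 0 <= removable <= unique <= total and total > 0")
    retained_tb = unique_tb - removable_tb
    return {
        "duplicate_tb": total_tb - unique_tb,
        "retained_tb": retained_tb,
        "reduction_fraction": (total_tb - retained_tb) / total_tb,
    }


# Financial services scenario: 100TB found, 35TB unique, 15TB of that removable.
financial = cleanup_summary(total_tb=100, unique_tb=35, removable_tb=15)

# Manufacturing scenario: 30TB expected, 50% more found, roughly 60% removable
# (an upper bound, since the source says "almost 60%").
discovered_tb = 30 * 1.5
manufacturing = cleanup_summary(discovered_tb, discovered_tb, 0.6 * discovered_tb)
```

For the financial services case this gives 65TB of duplicate copies and 20TB retained, matching the reported reduction from 100TB to 20TB (80%). Under the stated assumptions, the manufacturing case works out to 45TB discovered and roughly 18TB retained.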
As organizations implement FA tools to assist in general information governance activities, more use cases will become apparent. The impact of understanding and taking action on unstructured data will be greatest on organizations that generate millions or billions of files from many applications. The potential for a high payback will help drive the adoption of these tools (see Figure 1).
Source: Gartner (March 2013)
The impact of FA tools on IT can be dramatic. Storage administrators now have a tool that shows detailed information about the data being stored to take to business owners so that more-informed decisions can be made on data retention and optimized data protection policies. File shares can be dramatically cleaned up by deleting old, orphaned and irrelevant data, greatly reducing the burden on IT when an e-discovery, regulatory or compliance request is presented. Data generated by FA tools can be integrated with data loss prevention (DLP) tools to provide proactive management of intellectual property.
Gartner considers FA to be a high-impact technology, and estimates that it will take two to five years before it reaches mainstream adoption. Adoption rates will differ according to use cases. FA for storage management purposes, namely for migrations or technology refreshes, may evolve more quickly as organizations view massive amounts of data stored on file shares as cumbersome to move in totality.
FA for classification, governance and e-discovery will increase in frequency and importance as organizations ascertain the legal, compliance and intellectual property loss potential around their unknown unstructured data and associated costs to manage.
Figure 2 shows the responses of organizations at the December 2012 Data Center Conference in Las Vegas to the question, "Do you have management tools in place to help better understand your unstructured data?"
N = 52
Source: Gartner (March 2013)
Figure 3 shows the responses of organizations at the December 2012 Data Center Conference in Las Vegas to the question, "What type of aging data represents your biggest challenge?"
N = 52
Source: Gartner (March 2013)
The main challenge organizations face in adopting FA is confronting the black hole of data they have ignored for too long. Some organizations have said outright that they are afraid of what they might find.
Yet, FA technology is relatively risk-free, as the main outcomes of data scans are data reports and visualizations. Based on these, organizations can take action on segments of the data. Risk arises as a result of poorly defined policies on what to do with the data once it's classified, and/or the improper movement of the data in response to the classification. For example, if an organization runs a report on access times for documents and then deletes everything that has not been touched in three years, issues can arise if regulations require the organization to keep some of the data longer. Legal problems also may arise if the data is moved from one repository to another and the chain of custody is not maintained. While FA tools may have a slight impact on system performance, most are configured to run at a low rate of impact on CPUs, or to run during idle periods.
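The deletion risk described above, removing everything untouched for three years without regard to retention obligations, can be made concrete with a small sketch. Everything here is hypothetical: the `Document` fields, the category labels, the retention periods and the 365-day approximation of a year are illustrative assumptions, not regulatory guidance or a feature of any FA product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Mapping


@dataclass(frozen=True)
class Document:
    """Hypothetical classified file record (illustrative only)."""
    path: str
    category: str            # classification label assigned during analysis
    created: datetime
    last_accessed: datetime
    legal_hold: bool = False


def deletion_candidates(docs: Iterable[Document],
                        retention_years: Mapping[str, int],
                        now: datetime,
                        idle_years: int = 3) -> List[Document]:
    """Return idle documents that no retention rule or legal hold protects."""
    idle_cutoff = now - timedelta(days=365 * idle_years)
    candidates = []
    for doc in docs:
        if doc.legal_hold:
            continue  # never propose data under legal hold
        if doc.last_accessed >= idle_cutoff:
            continue  # still in use
        years = retention_years.get(doc.category)
        if years is None:
            continue  # unclassified or unknown policy: leave for human review
        if doc.created + timedelta(days=365 * years) > now:
            continue  # retention period has not yet expired
        candidates.append(doc)
    return candidates
```

The design choice is to fail safe: a document is proposed for deletion only when it is idle, its category has a known retention period that has expired, and no legal hold applies. Anything else is left for review rather than removed.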
Technology providers with FA capabilities have varying backgrounds, including storage management, e-discovery, indexing and FA, which may be the providers' primary product areas. The providers offer at least one of the following capabilities for either file metadata or content reporting: storage management, file/identity governance, classification/information governance and content migration. Sample vendors include Acaveo, Active Navigation, Aptare, Autonomy (an HP company), AvePoint, Clearswift, Content Analyst, dataglobal, Dell-Quest Software, EMC, Equivio, FileTek, IBM-Stored IQ, Idera, Imperva, Index Engines, Litera, Metalogix, Northern, NTP Software, Nuix, Proofpoint, Recommind, RSD, Symantec, Tarmin, Varonis Systems and ZyLAB.
By 2018, 25% of progressive organizations will manage all their unstructured data using information governance and storage management policies, up from less than 1% today.
Gartner defines dark data as the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing). Similar to dark matter in physics, dark data often comprises most organizations' universe of information assets. Thus, organizations often retain dark data for compliance purposes only. Storing and securing data typically incurs more expense (and sometimes greater risk) than value (see "The Importance of 'Big Data': A Definition").