Sullivan and Shinkle - Managing Shared Content

Managing Shared Content
Part 1 - Susan J. Sullivan, CRM; NARA
Part 2 - Tim Shinkle, Gimmal
Target Outcomes
Knowledge of:
• How NARA combined people and technology to
meet the shared drive cleanup challenge
• How cleaning up shared drives can lead to
efficient management of all electronic content
• How people interact with Indexing and
Categorization (I&C) management tools to
manage content
• Start thinking about how you can do this in your
organization
First meeting with the CIO as Records Officer
Clean up those shared drives!
ARMA is in Florida this year.
*This is a dramatization
A bit of luck at ARMA
ARMA 2009 - How to Clean Shared Drives
May I have your card?
*This is a dramatization
Services Contract
• Award-winning RM experts
• Approach rooted in records management
• Appliance with crawlers
• Data gathering period, pre-crawls
• Pilot project, draft rules – cleanup pilot
• Information Management Plan
• Communications Plan (posters, wiki, meetings)
• Met and worked with 36 SMEs, crawled and
reported, validated, cleaned content
Overall Process
Business Assessment
Content Assessment
Information Management Plan
Litigation Preservation
Human Validation
Cleanup
Policies
Organization
Migration
People + Tools = Cleanup
People | Technology
Where is the content? | Pre-crawl for overview
What to protect / preserve? | Find / lock preservation content
Develop and validate auto-cleanup rules (.tmp, backup, ~) | Find / report on auto-cleanup content
Approve auto-clean rules | Auto-clean (2% moved)
Develop rules for content to be reviewed before cleaning or moving (large, old, marked “draft, old, trash”, music, video, audio) | Find / report on reviewable content
Review and identify for cleanup | Cleanup or move individual or groups of files
Develop data integrity policies, develop and communicate cleanup maintenance rules and policies | Cleanup / identify continually
What is eTrash?
• Non-business value, non-records
– Temporary files
– Obsolete applications
– Files with zero content
– Un-openable/un-readable content
• Less important content that is not subject to FOIA,
litigation, or a retention schedule, and is:
– Duplicates
– Large
– Voluminous
– Obsolete
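Duplicates are one of the eTrash categories listed above. As a minimal sketch of how a crawler might group exact duplicates, assuming byte-identical files count as duplicates: the function name `group_duplicates` and the in-memory sample are hypothetical and are not taken from the tools used in the project.

```python
import hashlib
from collections import defaultdict

def group_duplicates(files):
    """Group file names by the SHA-256 digest of their content.

    `files` maps a file name to its content as bytes (an in-memory
    stand-in for reading files from a share). Only groups with two
    or more members are returned.
    """
    by_digest = defaultdict(list)
    for name, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]

# Hypothetical sample data for illustration only.
sample = {
    "plan.doc": b"project plan",
    "copy of plan.doc": b"project plan",
    "notes.txt": b"meeting notes",
}
print(group_duplicates(sample))
```

Hashing content rather than comparing names catches renamed copies, but it will not catch near duplicates such as a re-saved rendition.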
Content identified for auto-clean
• System-generated temporary files
• System-generated backup files
• Zero-content files and folders
• Abandoned applications
• Obsolete install files
• System status reporting files
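Auto-clean rules of this kind can be expressed as simple checks on file name and size. The following is a rough sketch only: the `.tmp` suffix and `~` prefix come from the rules mentioned earlier in the deck, while the `.bak` suffix and the function name `is_auto_clean_candidate` are assumptions made for illustration.

```python
import ntpath  # handles both "\" and "/" separators on any platform

# Illustrative rule set; ".bak" is an assumed stand-in for "backup" files.
TEMP_SUFFIXES = (".tmp", ".bak")
TEMP_PREFIXES = ("~",)

def is_auto_clean_candidate(path, size_bytes):
    """Return True if a file matches one of the illustrative auto-clean rules."""
    name = ntpath.basename(path).lower()
    if size_bytes == 0:                    # zero-content file
        return True
    if name.endswith(TEMP_SUFFIXES):       # system-generated temp or backup file
        return True
    if name.startswith(TEMP_PREFIXES):     # e.g. "~$report.docx" temp files
        return True
    return False

print(is_auto_clean_candidate(r"S:\docs\~$report.docx", 1024))
print(is_auto_clean_candidate("S:/docs/budget.xlsx", 2048))
```

In practice such rules would be validated and approved by people before any file is moved, as the process in the deck describes.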
Content reviewed for cleanup
• Old files
• Large files
• Large duplicate archives
• Identified old files
• Compressed file duplicates
• Renditions, convenience copies
• Non-accessed drafts older than 1 year
• Human-identified backup files
• Human-identified delete-able content
• Non-business images, music, audio, video or other media
• Office document duplicates
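Unlike auto-clean content, these categories are flagged for a person to review. A minimal sketch of such flagging follows. The one-year draft rule is from the list above; the size and age thresholds and the function name `review_reasons` are assumptions chosen for illustration, not values from the project.

```python
from datetime import datetime, timedelta

LARGE_BYTES = 100 * 1024 * 1024          # assumed threshold (100 MiB)
OLD_AFTER = timedelta(days=5 * 365)      # assumed threshold (about 5 years)
DRAFT_STALE_AFTER = timedelta(days=365)  # "older than 1 year" per the list above

def review_reasons(name, size_bytes, modified, last_accessed, now):
    """Return the reasons a file should be queued for human review."""
    reasons = []
    if size_bytes >= LARGE_BYTES:
        reasons.append("large")
    if now - modified >= OLD_AFTER:
        reasons.append("old")
    if "draft" in name.lower() and now - last_accessed > DRAFT_STALE_AFTER:
        reasons.append("stale draft")
    return reasons

now = datetime(2012, 1, 1)
print(review_reasons("Draft memo.doc", 1000,
                     datetime(2010, 1, 1), datetime(2010, 6, 1), now))
```

An empty list means no review rule fired; the file is left alone rather than cleaned.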
Content Identified for Risk
• PII and eDiscovery
– Litigation hold files
– Credit card numbers
– Social security numbers
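Pattern matching is one common way to surface this kind of risk content. The sketch below is an assumption about how such a check could look, not the method used by the crawler in the project: it finds Social Security number patterns with a regular expression and filters candidate credit card numbers with the Luhn checksum. The names `find_pii` and `luhn_ok` are illustrative.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

def luhn_ok(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_pii(text):
    """Return (kind, value) pairs for SSN-like and card-like strings."""
    hits = [("ssn", m.group()) for m in SSN_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if luhn_ok(digits):
            hits.append(("card", digits))
    return hits

print(find_pii("SSN 123-45-6789 on file; card 4111 1111 1111 1111."))
```

Pattern hits are candidates only; a number that looks like an SSN may be a part number, so flagged files still need human validation.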
A new way of viewing content
Outcomes
• Shares cleaned or under review
• Bulletin issued - NARA Bulletin 2012-02, December 06, 2011;
Guidance on Managing Content on Shared Drives
[http://www.archives.gov/records-mgmt/bulletins/2012/2012-02.html]
• Inputs to Presidential Memorandum -- Managing Government
Records, November 28, 2011
[http://www.whitehouse.gov/the-press-office/2011/11/28/presidential-memorandum-managing-government-records]
• Policies identified
– cleaning shares,
– managing content on shared drives
• Most importantly: a clear path forward
Data Management Policies
• Reliability and content – Preserve file attributes
• Context – Preserve file path, include recordkeeping
metadata within user profiles
• Usability
– Align storage with usage, preservation and access needs.
Provide a widely accessible storage location for high-value content.
– Use robust search capability to eliminate complex hierarchical
folder structures.
• Authenticity – Only authorized users can add, change
and delete content for a specified data set.
Benefits
• Storage space – cleaning up all identified content can
reclaim upwards of 47% of storage.
• Money – keeping 27 TB clean saves
$187K per year (shares and email).
• Time – less staff time spent culling through volumes of
content to find or migrate information (FOIA,
e-discovery)
• Access efficiencies – right information, right
place, right time
Enterprise Value of Cleaning
• If users are to migrate their own content to a
target system (e.g., new share or ECRM), they
decide:
1. Should it migrate? (30 seconds)
2. How should it be tagged? (30 seconds)
• If 55% of 10 TB should not be migrated:
– There are 16.8M documents (at 360 KB per document)
– 140K hours of wasted decision making
– $10.4M of staff time (at $75 per hour)
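The staff-time figures above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming one 30-second decision per document (the value that reproduces the 140K hours on the slide); the function name `decision_cost` is illustrative. Straight multiplication gives about $10.5M, so the $10.4M on the slide presumably reflects rounding of the inputs.

```python
def decision_cost(documents, seconds_per_document, hourly_rate):
    """Return (hours, dollars) spent deciding on documents one at a time."""
    hours = documents * seconds_per_document / 3600
    return hours, hours * hourly_rate

# Inputs taken from the slide: 16.8M documents, $75 per hour.
# 30 seconds per document is an assumption (see the note above).
hours, cost = decision_cost(16_800_000, 30, 75)
print(f"{hours:,.0f} hours, ${cost:,.0f}")
```

If both 30-second decisions were counted for every document, the hours and cost would double, which only strengthens the case for cleaning before migrating.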
Next Opportunity:
Align management with value
Permanent
Mid to long term retention
Short term retention
E-trash
Beyond cleanup – Auto-class
• Keyword/Boolean - group
• Concept Search
– Ontology (file plan) based
– Analyze word/phrase
relationships (Bayesian)
• Clustering (Bayesian)
– Auto-classifier
– Near duplicates
– Predictive coding
• Learning by example
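To make "learning by example" concrete, here is a small sketch of a Bayesian text classifier. The slides do not say which Bayesian method the tools use; multinomial naive Bayes with add-one smoothing is swapped in here as a common, simple choice, and the class name, labels and training phrases are all hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Tiny multinomial naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of examples
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayes()
nb.train("invoice amount due payment", "financial")
nb.train("payment received invoice number", "financial")
nb.train("meeting agenda notes attendees", "meeting")
nb.train("notes from project meeting", "meeting")
print(nb.classify("invoice payment overdue"))
```

Each example a reviewer confirms or corrects can be fed back through `train`, which is the sense in which more review means more learning.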
People will still have a role
• Connect with business owners
• Develop / enforce file integrity & metadata policies
• Locate and categorize content
– Develop rules (Boolean / Bayesian) / train tools by example
– Run & review crawls and refine rules
• Determine and assign value (metadata)
– Search legal terms, financial terms, health terms
– How often accessed, how often duplicated?
• Develop / enforce storage / management policies
according to value
• Manage by exception
Vision of future
I&C Tools
Workers store content
Crawler indexes content nightly
Maintain current item-level, categorized index
Tweak index, crawler learns
Retention Bell Curve
Granular retention schedules: paper-based file plans
Big bucket retention schedules: large categories by business function / creator role
Granular retention schedules: I&C Tools, indexed and categorized at item level
Vision Evolution
• Get clean and lean
– Mature cleanup program
– Cleanup and network policies enforced
– Comfortable with crawlers and rules
• Develop rules to index at various value levels
– Transitory, 3 year, 7 year, 10 year, etc.
– Permanent
• Group, cluster, train, extract metadata, migrate, and
tag with retention
• Go granular as needed
• Crawl nightly, manage by exception
Automated Policies
• Run continuously
– Cleanup policies for files that are no longer needed
– Security risk policies for files meeting PII criteria
– Legal policies for files meeting legal hold criteria
– IT drive management policies for databases, multimedia, images,
applications, web content and other categories
– Lifecycle management policies for
    • inactive files
    • draft files
    • active files
    • records
    • reference material
– Content integrity policies for abandoned files
– Migration policies for files that require migration to target
systems
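One way to picture continuously running policies is an ordered list of rules evaluated against each crawled file. The sketch below is an assumption about how that could be structured: the `Policy` class, the record fields (`legal_hold`, `has_pii`, `size`) and the action names are hypothetical. Legal hold is checked first so that held content is never routed to cleanup.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Policy:
    name: str
    applies: Callable[[dict], bool]  # predicate over a crawled file record
    action: str

# Order matters: the first matching policy wins.
POLICIES = [
    Policy("legal hold", lambda f: f.get("legal_hold", False), "lock"),
    Policy("PII risk", lambda f: f.get("has_pii", False), "report"),
    Policy("cleanup", lambda f: f.get("size", 1) == 0, "move to cleanup"),
]

def evaluate(file_record):
    """Return (policy name, action) for the first policy that applies."""
    for policy in POLICIES:
        if policy.applies(file_record):
            return policy.name, policy.action
    return None, "leave"

print(evaluate({"legal_hold": True, "size": 0}))
print(evaluate({"size": 0}))
```

Files that match no policy are left in place, which fits the "manage by exception" approach described in the deck.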
I&C Tools
• Indexes & categorizes content according to
rules
• Clusters content around trends (retention)
• Ingests content samples and learns.
• More crawling = more learning
– Manage by exception = more learning
• Acts upon content (move, lock)
Part 2
Tim Shinkle
Professional Services Approach
• Phase 0 – Source repository assessment
• Phase I – Program setup
• Phase II – Pilot
• Phase III – Enterprise Transformation
Phase 0 – Source Repository Assessment
• Perform preliminary analysis, e.g., Shares, SMEs, business
processes
• Scan a percentage of enterprise shares that represent a
good cross section of the data
• Produce an assessment report and plan moving forward
– Percentage of candidate eTrash, migrate, relocate, review, leave
– PII analysis
– Metadata extraction
– Classification and categorization
– Strategy and approach
– Expected cost savings and benefit analysis
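The percentage breakdown in the assessment report can be sketched as a simple tally over crawl results. This example assumes each scanned file has already been assigned one disposition and that percentages are computed by file count rather than by bytes; the function name `disposition_percentages` and the sample data are hypothetical.

```python
from collections import Counter

def disposition_percentages(dispositions):
    """Return the share of files per disposition, as percentages."""
    counts = Counter(dispositions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {d: round(100 * n / total, 1) for d, n in counts.items()}

# Hypothetical sample: ten scanned files.
sample = ["eTrash"] * 5 + ["migrate"] * 3 + ["review"] * 2
print(disposition_percentages(sample))
```

Weighting by file size instead of count would answer the storage question more directly, at the cost of letting a few large files dominate the report.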
Phase I – Program setup
• Meet with sponsors
• Identify scope (shares, groups, SMEs)
• Develop communication plan
• Stand up program communications portal (e.g.,
corporate portal, wiki)
• Identify pilot group(s) (shares and SMEs)
• Meet with RM and Legal to identify candidate
records, holds and PII
• Approve policies, cleanup and classification rules
Phase II – Pilot
• Interview pilot group
• Perform analysis and refine rules
• Crawl and report on pilot group shares
• Review results with SMEs and move approved
files to cleanup location
• Refine global cleanup and classification rules and
update published policies
• Develop benefit analysis report and enterprise
strategy
• Publish updates to program portal (wiki)
Phase III – Enterprise Transformation
• Outline schedule, project plan and group priority
• Update communications on portal (e.g., wiki)
• Repeat for each group
– Meet, refine share mapping to SMEs and perform high level
analysis, size and scope
– Crawl shares, generate reports and publish for review
– Meet with SMEs, review results and start the clock
– Move approved files to temporary storage location
– Perform any additional migration and transformation based on
classification, categorization and metadata extraction of results
– Repeat until completed
• Publish final reports, develop and implement strategy for
ongoing automation of information governance policy
Technology Demonstration
Example Data
Classification Best Practices
• Create categories that can be leveraged by machines
(e.g., put like content in like containers, such as expense
reports)
• People can provide context that is not found in the
content (e.g., the folder name is a project number
where anything to do with the project is dumped into
the folder, such as proposal, meeting notes,
deliverables, reference material, directions, local
hotels, expenses, etc.)
• Folder names only sometimes tell you what can be found
in the folder (e.g., \Invoices vs.
\Misc)
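The \Invoices vs. \Misc contrast can be sketched as a lookup that returns a category only when the folder name is informative. This is an illustration under assumptions: the mapping `FOLDER_CATEGORIES` and the function name `category_from_folder` are hypothetical, and a real rule set would come from the SMEs.

```python
import ntpath  # parses Windows-style paths on any platform

# Hypothetical mapping from informative folder names to categories.
FOLDER_CATEGORIES = {"invoices": "invoice", "proposals": "proposal"}

def category_from_folder(path):
    """Return a category implied by the parent folder, or None if it says nothing."""
    folder = ntpath.basename(ntpath.dirname(path)).lower()
    return FOLDER_CATEGORIES.get(folder)

print(category_from_folder(r"\Invoices\2011-03.pdf"))
print(category_from_folder(r"\Misc\2011-03.pdf"))
```

A `None` result is the signal to fall back on content-based classification or on context supplied by a person.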
Classification Best Practices
• File names are mixed and matched (e.g.,
\Proposals\Proposal XYZ.doc, \Meeting
notes\Proposal XYZ.doc)
• Don’t confuse how individuals want to organize
their content with the classification of the
content
• The more the information is shared, the more it
needs to be generally classified (big buckets)
• Combine algorithm extracted topics and
identified classifications with subject matter
expert search rules
Q&A
• Susan Sullivan, CRM
Director - Corporate Records Management
National Archives and Records Administration
susan.sullivan@nara.gov
V: (301) 837-2088
• Tim Shinkle
Director, Gimmal Group
tim.shinkle@gimmal.com
(703) 927-5650