Retrospecting VM Images Wolfgang Richter Glenn Ammons*, Jan Harkes, Adam Goode, Vasanth Bala*, Nilton Bila+, Eyal de Lara+, Mahadev Satyanarayan * IBM Research + University of Toronto The Importance of Content http://www.pdl.cmu.edu/ 2 Wolfgang Richter © July 17, 2016 Retrospection • VM collections growing • 300% Year over year, IBM Research RC2 • System, application, and user content • Searchable history • • • • Debugging opportunities Legal data or code origin Malware tracking License violations http://www.pdl.cmu.edu/ 3 Wolfgang Richter © July 17, 2016 Roadmap • What is the retrospection problem? • What are the main challenges? • How can we solve them? http://www.pdl.cmu.edu/ 4 Wolfgang Richter © July 17, 2016 The retrospection mechanism should place as few constraints as possible on the code used for search computations. PRINCIPLE 1 http://www.pdl.cmu.edu/ 5 Wolfgang Richter © July 17, 2016 A search computation should only be performed on demand for a specific query, and its scope should be restricted to the smallest relevant subset of VM images. PRINCIPLE 2 http://www.pdl.cmu.edu/ 6 Wolfgang Richter © July 17, 2016 Control of policy for retrospection should reside with the owners of VM images and their delegates. PRINCIPLE 3 - WIP http://www.pdl.cmu.edu/ 7 Wolfgang Richter © July 17, 2016 Find The Picture • Rich content-based and application-specific queries • 10 Graphics OK • 100 Graphics OK • 1000+ Graphics ? http://www.pdl.cmu.edu/ 8 Wolfgang Richter © July 17, 2016 OpenDiamond Platform • Distributed, interactive, unindexed search • Focuses on the principle of early discard • Enables arbitrary search queries • Arbitrary x86 binary code The retrospection mechanism should place as few constraints as possible on the code used for search computations. http://www.pdl.cmu.edu/ 9 Wolfgang Richter © July 17, 2016 Available Structured Data • VM’s have attributes and metadata • Owners • Files • File Systems • Files have attributes and metadata • • • • Owners File Type Permissions Modification Timestamp http://www.pdl.cmu.edu/ 10 Wolfgang Richter © July 17, 2016 Scoping Solution • Metadata MySQL database • Scope Server • Manage access to data A search computation should only be performed on demand • Scope Cookie for a specific query, and • X.509 signed cookie its scope should be • Determines accessible data restricted to the smallest relevant subset of VM images. http://www.pdl.cmu.edu/ 11 Wolfgang Richter © July 17, 2016 Problem: VM Sprawl in Files Data from 78 NCSU VCL VM Images based on Windows XP http://www.pdl.cmu.edu/ 12 Wolfgang Richter © July 17, 2016 Problem: VM Sprawl in Bytes Data from 78 NCSU VCL VM Images based on Windows XP http://www.pdl.cmu.edu/ 13 Wolfgang Richter © July 17, 2016 Solution: Deduplication in Files Reduce Search Time Data from 78 NCSU VCL VM Images based on Windows XP http://www.pdl.cmu.edu/ 14 Wolfgang Richter © July 17, 2016 Solution: Deduplication in Bytes Reduce Storage Space Data from 78 NCSU VCL VM Images based on Windows XP http://www.pdl.cmu.edu/ 15 Wolfgang Richter © July 17, 2016 IBM Research Mirage • File-level deduplication • Files are referenced by SHA-1 tag • Reads VM image partitions and file systems • On-disk deduplicated format • Centralized VM store – a potential bottleneck http://www.pdl.cmu.edu/ 16 Wolfgang Richter © July 17, 2016 Network Bottlenecks Megabytes Takeaway: Centralized store limited by network bandwidth, limiting parallelism. http://www.pdl.cmu.edu/ 17 Wolfgang Richter © July 17, 2016 Network Bottlenecks Objects Takeaway: The number of objects pushed determines the possible number of search processes. http://www.pdl.cmu.edu/ 18 Wolfgang Richter © July 17, 2016 Dataretriever • Abstract data source • getObject() interface • Search process oblivious to where data comes from • Access deduplicated data • Unmodified client and search server • Solve network bottleneck: Data partitioning • Compute on local data rather than central store • Layer of indirection enables this without modification http://www.pdl.cmu.edu/ 19 Wolfgang Richter © July 17, 2016 Architecture Scope Cookie Scope Definition Scope Mirage Server Client Request Objects Raw Data Dataretriever Dataretriever Dataretriever Server Server Metadata Query Query+Cookie MySQL http://www.pdl.cmu.edu/ 20 Wolfgang Richter © July 17, 2016 Revisiting Network Bottlenecks How bad is the bottleneck with content-based queries? http://www.pdl.cmu.edu/ 21 Wolfgang Richter © July 17, 2016 CPU-Bound Search Process Takeaway: Content search is limited by computation, although embarrassingly parallel. http://www.pdl.cmu.edu/ 22 Wolfgang Richter © July 17, 2016 Achievable Efficient Retrospection Takeaway: Search scales with servers, and the Mirage case closely matches local. http://www.pdl.cmu.edu/ 23 Wolfgang Richter © July 17, 2016 Current Research: Principle 2 • Control of policy to owners via encryption • Proof of concept: convergent encrypt /home • Encrypt files using file hash as key • Fine-grained, per file • Future direction: key escrow? • Support investigations and warrants • Support multiple encryption methods? • Per VM Image? Groups? http://www.pdl.cmu.edu/ 24 Wolfgang Richter © July 17, 2016 Recap • Retrospection – search VM image content • Main challenges 1. Get data efficiently – Solution: Dataretriever 2. Handle big and growing data – Solution: Scoping + Deduplication 3. Privacy and encryption http://www.pdl.cmu.edu/ 25 Wolfgang Richter © July 17, 2016 OpenDiamond - http://diamond.cs.cmu.edu IBM Mirage - http://doi.acm.org/10.1145/1346256.1346272 Convergent Encryption - http://doi.acm.org/10.1145/339331.339345 LEARN MORE http://www.pdl.cmu.edu/ 26 Wolfgang Richter © 7/17/2016