Next Generation Information Systems

Avi Silberschatz
Department of Computer Science
Yale University
URL: www.cs.yale.edu/~avi

The Digital Age

- Digital information forms the glue for blending the fields of computing, communication and entertainment.
- At the center of this revolution is data that is stored, accessed and delivered in digital format.
- Some of the major issues surrounding this type of data are:
  - Data must be available to users anytime and anywhere, with the desired quality of service (QoS).
  - Data access must adhere to privacy and security policies.
  - Data interoperability.
  - Fast access to data, which implies support for queries with approximate answers.
  - Data analysis and mining capabilities over very large datasets.
- Many of the advances in information systems are due to the development of new technologies. These advances, in turn, are pushing the development of even newer technologies.

Research Challenges

- Storage, retrieval and delivery of multimedia data
  - Storage system issues
  - QoS issues of continuous media data (e.g., video and audio)
- Approximate answers
  - useful for very large data sets
  - useful for Web searching
- Data mining
  - Discovering “interesting” patterns in very large data sets
  - Discovering “interesting” patterns from incomplete information
- Data interoperability
- Privacy and security
- Next-generation networks
  - Converged networks
  - Network management

Multimedia Data

- Regular data: text, binary, image
- Database data: tuples, objects
- Continuous media data
  - Video data
    - The display (playback) of the data must be continuous at a fixed rate, which is typically 30 frames/second.
    - A viewer may wish to control the way the data is displayed by applying various VCR-type operations to the video data.
  - Audio data
    - The playback must be continuous at a fixed rate, which depends on the sample rate.
    - A listener may wish to control the way the data is played back.
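As a rough illustration of how the fixed playback rate follows from the media format, the sketch below computes the bit rate needed for uncompressed audio and the time budget per video frame. The function names and the CD-quality audio parameters (44.1 kHz, 16-bit, stereo) are illustrative assumptions and do not come from the slides.

```python
def pcm_audio_bit_rate(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Bits per second that must be delivered for gap-free PCM audio playback."""
    return sample_rate_hz * bits_per_sample * channels


def frame_period_ms(frames_per_second: float) -> float:
    """Time budget, in milliseconds, for delivering each video frame."""
    return 1000.0 / frames_per_second


if __name__ == "__main__":
    # Assumed example: CD-quality audio (44.1 kHz, 16-bit samples, stereo).
    print(pcm_audio_bit_rate(44_100, 16, 2), "bits/second")
    # 30 frames/second is the video rate quoted on the slide.
    print(round(frame_period_ms(30), 2), "ms per frame")
```

A server that misses either budget produces an audible gap or a dropped frame, which is why continuous media needs QoS guarantees rather than best-effort delivery.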

Storage System Issues

- Rapid growth in storage capacity demand
  - world-wide installed storage: 738 petabytes in 2000
  - over 75% per year storage capacity increase over the next 5 years
  - reaches a zettabyte in 2009
  - data stored at Global 2500 companies doubles every 18 months
  - data stored at e-commerce companies grows at 400% a year
- Management
  - 40-50% of company IT budget is spent on storage
  - fraction of IT budget spent on storage is expected to grow
  - cost for storage management exceeds cost of storage equipment
    - management: $300 per GB per year
    - low-end storage: $14 - $50 per GB (packaged, powered, networked)
  - management cost is expected to grow
- Storage requirements
  - 24 x 7
  - Disaster recovery

Storage is Moving Into the Network

- Motivation
  - Use commodity IP-based networks
  - IT staff know-how
  - Distance and universal access
- Applications
  - Disaster recovery
  - Archiving
  - Backups
  - Content distribution
  - Managed storage
  - Value-added storage services
  - Consolidation of storage

IP-Based Network Storage

- Storage is managed, possibly by different domains
- Storage devices are connected over networking infrastructure
[Figure: network diagram with Client site #1, Client site #2, LANs, Metro/WAN, file servers and SANs]

IP-Based Network Storage (Cont.)

- IETF standards are being drafted
  - Most popular: iSCSI and FCIP
  - Almost all networking and storage companies are participating in these standards
- Issues
  - Performance
  - Reliability
- Future
  - end-to-end iSCSI; end-to-end IP storage networking? demise of FC?
  - Hybrid?
    - FC (InfiniBand) SAN islands connected over IP networks
    - FC SANs in data centers accessed by IP networks

Network Storage Security

- Customers may not trust the storage service provider (SSP)
- Storage consolidation over different customers is essential to make storage outsourcing viable. However, customers may not trust each other
- Threat model
  - Disclosure of data to an eavesdropper intercepting communication
  - Disclosure of data to the SSP and to other customers of the SSP
  - Manipulation of communication by an attacker
  - Manipulation of data by the SSP or other customers of the SSP
- Challenges
  - high-throughput encryption (e.g., 1 Gbps, 10 Gbps)
  - security without hindering performance

Multimedia Storage and Delivery Issues

- The size of some databases is enormous, especially those that are used for data mining (e.g., cash register transactions).
  - 30 terabytes: largest commercial database
- Some information sources generate data at an astonishing rate (e.g., satellite images).
  - EOS: 1-2 terabytes per day
- The BBC is planning to digitize the last 50 years of programming.
- Continuous media data is voluminous:
  - a 100-minute MPEG-1 video requires 1.125 GB
  - a 100-minute HDTV video requires 15 GB
- Continuous media data requires support for QoS.

System Resources to be Managed for QoS

- Storage server resources
  - Tertiary storage
  - I/O bus
  - Secondary storage
  - I/O bus
  - Buffer space
  - Processor(s)
- Network

Research Issues

- Admission control
- Disk scheduling
- Buffer management
- Storage management
  - data layout
  - varying disk transfer rates
  - disk striping
  - meta data
  - fault tolerance
- Tertiary storage

Cycle-based Scheduling

- Let $T$ be the length of a service cycle.
- Maintain a queue of requests $R_1, R_2, \ldots, R_n$.
- Each $R_i$ corresponds to a request to view a continuous media (CM) clip.
- Each request has an associated rate $r_i$.
- For each request, a buffer of size $2 T r_i$ is allocated.
- Requests in the queue are served in cyclic order using double buffering.
- In each cycle $I$:
  - get data from disk into buffer $(I \bmod 2)$
  - transfer data from buffer $((I + 1) \bmod 2)$ to the client

Disk Scheduling

- Requests are serviced in service cycles (rounds).
- At the beginning of a service cycle, requests are ordered in C-SCAN order.
- At the beginning of every service cycle, it is ensured that

  $$\sum_{i} 2 T r_i \le B \quad \text{and} \quad \sum_{i} \left( \frac{T r_i}{r_{disk}} + t_{rot} + t_{settle} \right) + 2\, t_{seek} \le T$$

  hold (where $t_{rot}$, $t_{settle}$, $t_{seek}$ are the rotational delay, settle time, and seek time, respectively, $r_{disk}$ is the disk transfer rate, and $B$ is the buffer pool size).
- The value of $T$ is adjusted depending on the workload.
- In every service cycle,

  $$\min\left\{\, T r_i,\; 2 T r_i - (\text{offset of last retrieved} - \text{offset of last consumed}) \,\right\}$$

  bits of data are retrieved for each request.

Admission Control

- The queue is bounded by an admission control scheme.
- For each request, the service time is estimated.
- A request is admitted only if the sum of the estimated service times for all admitted requests does not exceed the duration of the service cycle $T$.

Admission Control (cont.)

- Reserve a fraction of the service cycle $T$, say $\alpha T$ ($0 < \alpha < 1$), for continuous media requests.
- A request (real-time or non-real-time) is admitted if

  $$\sum_{\text{real-time}} \left( \frac{T r_i}{r_{disk}} + t_{rot} + t_{settle} \right) + \sum_{\text{non-real-time}} \left( \frac{n_i}{r_{disk}} + t_{rot} + t_{settle} \right) + 2\, t_{seek} \le T$$

- A real-time request is admitted if

  $$\sum_{\text{real-time}} \left( \frac{T r_i}{r_{disk}} + t_{rot} + t_{settle} \right) + 2\, t_{seek} \le \alpha T$$

- The above scheme ensures that both continuous and non-continuous media requests are allocated time during a service cycle.
- Any time during a service cycle left unused by continuous media requests is allocated to non-continuous media requests.

Length of T

- What about the length of $T$?
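The admission test from the Disk Scheduling and Admission Control slides can be sketched as follows, which also gives a feel for how the choice of T changes the number of admissible requests. This is a minimal sketch, not the lecture's implementation: the `Disk` class, the function names, and all numeric disk parameters are hypothetical, and only real-time (continuous media) requests are modelled.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Disk:
    """Hypothetical disk parameters; the field names mirror the slide symbols."""
    r_disk: float    # transfer rate, bits/second
    t_rot: float     # rotational delay, seconds
    t_settle: float  # settle time, seconds
    t_seek: float    # seek time, seconds


def service_time(rates: Sequence[float], T: float, disk: Disk) -> float:
    """Time needed in one cycle: sum(T*r_i/r_disk + t_rot + t_settle) + 2*t_seek."""
    per_request = sum(T * r / disk.r_disk + disk.t_rot + disk.t_settle for r in rates)
    return per_request + 2 * disk.t_seek


def can_admit(admitted: Sequence[float], new_rate: float, T: float,
              B: float, disk: Disk, alpha: float = 1.0) -> bool:
    """Check the buffer and disk inequalities with the new request included.

    alpha is the fraction of the cycle reserved for continuous media;
    alpha = 1.0 gives the plain test that uses the whole cycle.
    """
    rates = list(admitted) + [new_rate]
    buffer_ok = sum(2 * T * r for r in rates) <= B  # double buffering: 2*T*r_i each
    disk_ok = service_time(rates, T, disk) <= alpha * T
    return buffer_ok and disk_ok


def max_clients(rate: float, T: float, B: float, disk: Disk, alpha: float = 1.0) -> int:
    """Number of identical-rate requests admitted before the test first fails."""
    admitted: List[float] = []
    while can_admit(admitted, rate, T, B, disk, alpha):
        admitted.append(rate)
    return len(admitted)


if __name__ == "__main__":
    # Assumed example values, chosen only for illustration.
    disk = Disk(r_disk=80e6, t_rot=0.008, t_settle=0.001, t_seek=0.015)
    buffer_bits = 2.048e9   # a 256 MB buffer pool, expressed in bits
    mpeg1_rate = 1.5e6      # 1.5 Mbps, consistent with 1.125 GB per 100 minutes
    for T in (0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0):
        print(f"T = {T:5.2f} s -> {max_clients(mpeg1_rate, T, buffer_bits, disk)} clients")
```

Running the loop for several values of T shows the two inequalities pulling in opposite directions: a longer cycle amortizes the per-request disk overhead, while the buffer needed per request grows linearly with T.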

Buffer Space Constraints

- Let $B$ be the available buffer size.
- Let $N$ be the number of admitted clients.
- Assume infinite disk bandwidth.
- Requirement:

  $$\sum_{i=1}^{N} 2 T r_i \le B$$

- For a given buffer size $B$, the larger $T$ is, the fewer clients can be admitted.
[Figure: N plotted against T]

Disk Bandwidth Constraints

- Assume infinite buffer space.
- Use C-SCAN disk scheduling.
- Requirement:

  $$\sum_{i=1}^{N} \frac{T r_i}{r_{disk}} + 2\, t_{seek} + N \,(t_{rot} + t_{settle}) \le T$$

- The larger $T$ is, the larger $N$ is.
[Figure: N plotted against T]

Combining Disk & Buffer Constraints

[Figure: N plotted against T, showing the disk constraint and buffer constraint curves]
- The optimal $T$ is obtained by solving a quadratic equation derived from the disk and buffer space constraints.

Minimizing Response Time

- Under some workloads (e.g., requests with small $r_i$'s such as 64 Kbps), the value of $T$ that maximizes throughput can be high (e.g., 20 seconds). This might yield high response times.
- Solution:
  - maintain small $T$ values
  - in order not to degrade throughput, for each request $R_i$ data is prefetched from disk every $k_i$ service cycles (instead of every service cycle)
  - the maximum amount of data prefetched is $k_i T r_i$
  - the buffer space allocated to $R_i$ is $(k_i + 1)\, T r_i$

Minimizing Response Time (contd.)

- Issues:
  - Calculation of the $k_i$'s
  - Admission control: $\mathrm{lcm}(k_1, k_2, \ldots, k_n)$ service cycles to manage
  - For a request $R_i$, finding the least loaded service cycles

    $$u_i + k_i\, l, \qquad 0 \le l \le \frac{\mathrm{lcm}(k_1, k_2, \ldots, k_n)}{k_i} - 1$$

- In order to reduce response time, start a new request $R_i$ in the first possible service cycle and then move it incrementally to the selected least loaded service cycle.
- This solution also provides higher throughput for workloads with small $r_i$'s.

Querying Huge Data Sets

- Give me all objects (e.g., images) that look like this.
- If we are dealing with petabytes of data, this may take days or weeks.
- One solution is to capture “meta data” information about the stored objects as the objects are stored in the database.
  - Querying is done against the “meta data”.
  - Major issue: the nature of the meta data.
- Another solution is to provide support for “approximate answers”.

Providing Approximate Answers

- Traditional databases provide exact answers to queries, but...
  - In massive data environments, queries can take minutes to hours due to disk I/Os
  - In distributed environments, data may be remote or currently unavailable
  - In real-time environments, even a single I/O may be too slow

Providing Approximate Answers (Cont.)

- Trade off accuracy for performance: e.g., 30 minutes for an exact answer vs. 3 seconds for an approximate answer with 5% error
- Examples where fast approximate answers are preferred:
  - drill-down query sequence in data mining: searching for the “interesting” queries
  - tentative answer when base data is unavailable
  - leading digits suffice (e.g., 3.5 million vs. 3.512 million)
- Can proceed to the exact answer, if desired

The AQUA System

- Approximate Query Engine for data warehousing
[Figure: a Browser or Excel client sends SQL query Q over the network (HTML, XML); Aqua rewrites it into SQL query Q’, a fast query on the Aqua synopses instead of a slow query on the warehouse data held in the DBMS for the large data warehouse; the result is returned with error bounds]
- Aqua precomputes and maintains small synopses of the data
- Aqua provides approximate answers with accuracy guarantees, by rewriting user queries as depicted above

Aqua Synopses: The Key Ingredient

- (Small) surrogate for the actual data.
- Must accurately estimate the exact answers from the synopses.
- As data is updated, must keep synopses up-to-date.
- We developed new techniques for summarizing data, and for adapting these summaries to changes in both the data and the query mix.
- First system to provide fast, highly accurate approximate answers for a broad class of queries arising in data warehousing scenarios

Private, Public, and Sensitive Information in a Wired World

- Private information
  - Only the data subject has a right to it.
- Public information
  - Everyone has a right to it.
- Sensitive information
  - “Legitimate users” have a right to it.
  - It can harm data subjects, data owners, or data users if it is misused.

Erosion of Privacy

- “You have zero privacy. Get over it.” (Scott McNealy, 1999)
- Changes in technology are making privacy harder.
  - increased use of computers and networks
  - reduced cost for data storage
  - increased ability to process large amounts of data
- Becoming more critical as public awareness, potential misuse, and conflicting goals increase.

“Public Records” in the Internet Age

- Depending on State and Federal law, “public records” can include:
  - Birth, death, marriage, and divorce records
  - Court documents and arrest warrants (including those of people who were acquitted)
  - Property ownership and tax-compliance records
  - Driver’s license information
  - Occupational certification
- They are, by definition, “open to inspection by any person.”
- Traditionally: many public records were “practically obscure.”
  - Stored at the local level on hard-to-search media, e.g., paper, microfiche, or offline computer disks.
  - Not often accurately and usefully indexed.
- Now: more and more public records, especially Federal records, are being put on public web pages in standard, searchable formats.
- Issues
  - Should some Internet-accessible public records be only conditionally accessible?
  - Should data subjects have more control?
  - Should data collectors be legally obligated to correct mistakes?

Examples of Sensitive Information

- Copyrighted works
- Certain financial information
- Health information
- Question: Should some information now in “public records” be reclassified as “sensitive”?

State of Technology

- We have the ability (if not always the will) to prevent improper access to private information. Encryption is very helpful here.
- We have little or no ability to prevent improper use of sensitive information. Encryption is less helpful here.

The PORTIA Project

- PORTIA: Privacy, Obligations, and Rights in Technologies of Information Assessment
- Large ITR grant from NSF.
- It is a five-year, multi-institutional, multi-disciplinary, multi-modal research project on end-to-end handling of sensitive information in a wired world.
- Researchers from:
  - Stanford: Dan Boneh, Hector Garcia-Molina, John Mitchell, Rajeev Motwani
  - Yale: Joan Feigenbaum, Ravi Kannan, Avi Silberschatz
  - University of NM: Stephanie Forrest
  - Stevens Institute: Rebecca Wright
  - NYU: Helen Nissenbaum
- Plus participation by software industry, key user communities, advocacy organizations, and non-CS academics.
- http://crypto.stanford.edu/portia

PORTIA Goals

- Produce a next generation of technology for handling sensitive information that is qualitatively better than the current generation’s.
- Enable end-to-end handling of sensitive information over the course of its lifetime.
- Formulate an effective conceptual framework for policy making and philosophical inquiry into the rights and responsibilities of data subjects, data owners, and data users.

Five Major Research Themes

- Privacy-preserving data mining and privacy-preserving surveillance
- Database policy enforcement tools
- Sensitive data in P2P systems
- Policy-enforcement tools for database systems
- Identity theft and identity privacy

Privacy and Security on the Web

- An increasing number of web sites require user registration, which enables personalized services. This, however, raises some concerns:
  - Privacy concerns: providing the same user name (or e-mail address) allows creation of comprehensive dossiers; providing your e-mail address reveals your true identity
  - Security concerns: using the same user name and password at multiple web sites allows passwords from insecure sites to be used to help determine passwords at secure sites
  - Junk e-mail: giving your e-mail address makes you susceptible to junk e-mail
  - Inconvenience: people have to invent and remember multiple user names and passwords

The LPWA System

- A tool for combining privacy, security and convenience.
- Enables personalized services by generating consistent, untraceable aliases for use on the web.
[Figure: the user (Arun Netravali) reaches quote.com, my.yahoo.com and expedia through LPWA, which presents a different alias and password pair to each site (axyz, x45t; Czar, 4rt5; Boss, 56yh)]

The LPWA Proxy Properties

- Privacy: web sites cannot collude to create dossiers
- Security: different passwords for different web sites
- Convenience: no need to remember multiple user names and passwords
- Alias e-mail addresses support communication from web sites back to users and allow control of junk e-mail

Generation of Aliases

- At the first invocation of the LPWA proxy, the user provides:
  - the user’s e-mail address, id
  - a secret S (random string)
- Registering
  - User types \u, \p, \@ for username, password and e-mail address, respectively.
  - LPWA uses id, S, and the domain name of the web site being visited to compute the user’s alias
- Repeat visits
  - User again types \u and \p for username and password
  - LPWA computes the same alias username/password.

Network System Challenges

- Next-generation network: will be simpler, lower cost, and will provide customized services for consumers and businesses
- Converged networks: will incorporate the best features of today’s voice and data networks
- Network management: automate many of the functions that are currently done by people.

Next-Generation Networks

[Figure: layered diagrams (service, electronic and optical layers) contrasting the two architectures, with NM, 5E, ADM and DCS elements and Local ISP, CLEC, PSTN, video, data and voice services]
- Yesterday’s networks
  - Point-to-point optical links
  - Circuit switched, centrally managed
  - Separate networks for voice, data, video
  - Fixed, closed
- Next-generation networks
  - All-optical mesh backbone
  - Packet switched, distributed
  - Unified network for customized multimedia services
  - Open APIs for ISV services

Next-Generation Converged Networks

[Figure: Data Network and Voice Network, Next Generation Network, Converged Applications]

Network Management Challenges

- Managing today’s networks is extremely challenging due to their increased complexity
  - Networks contain hundreds of network elements and thousands of physical links
  - Network elements follow a multitude of protocols (e.g., BGP, OSPF, IS-IS, RIP)
  - Networks are heterogeneous and contain equipment from multiple vendors
- Manually managing networks
  - is tedious, labor-intensive, time-consuming and error-prone
  - is not cost-effective due to severe shortages and high costs of skilled labor
- Critical need for software tools that automate network management tasks

Next-Generation Network Management

- Next-generation network management software functionality includes
  - Keeping track of network inventory and topology
  - Monitoring network link bandwidth and latency
  - Storing, analyzing and reporting network performance data
  - Load balancing by appropriately configuring network parameters
  - Automating and simplifying network configuration tasks (e.g., VPNs)
- Value proposition:
  - Ease management and configuration of ISP networks
  - Optimize utilization of network resources
- Goal: make networks self-administering and self-tuning

There are many approaches to predicting the future

- I think there is a world market for maybe five computers. (Thomas Watson, 1943)
- Video won’t be able to hold onto any market it captures after the first six months. People will soon get tired of staring at a plywood box every night. (Darryl F. Zanuck, head of 20th Century Fox, 1946)
- 640K ought to be enough for anybody. (Bill Gates, 1981)
- “How do you want it – the crystal mumbo-jumbo or statistical probability?”

Five predictions for the new millennium: 1

- A mega-network of networks will enfold the earth in a communications “skin” with ubiquitous connectivity and enormous bandwidth.

Five predictions for the new millennium: 2

- By 2010, there will be so many interconnected devices that the volume of “infrachatter” among communicating machines will surpass communications among humans.

Five predictions for the new millennium: 3

- Bandwidth will be too cheap to meter.

Five predictions for the new millennium: 4

- Consumers and businesses will have a vast variety of individualized, custom services, written by countless programmers on an open mega-network.

Five predictions for the new millennium: 5

- Virtual reality will become a reality and will transform the way people live and conduct their business.
- This lecture will be given from the comfort of my office without me having to travel.