Computer Science
EMFS: Email-based Personal Cloud Storage
NAS 2011
Jagan Srinivasan, Wei Wei, Xiaosong Ma, Ting Yu
Data Organization and Access
Email-based File System Design
Performance Evaluation
Related Work
 Existing personal cloud storage services
o Tie storage to internal data formats and processing applications
o Non-free general-purpose storage and not widely utilized
 Existing email services
o The capacity of a single email account has increased dramatically
o Provided by many reliable and reputable online service providers
 Leveraging existing email services
o Benefit service providers as it extends their access to valuable customer
EMFS Overview
 Target Workload and Assumptions
o Typical personal workload
 Reading, editing, and backing up documents such as Word, PDF, etc.
 Targets file sizes ranging from several KBs to tens of MBs
o Users will not share storage with others or allow concurrent access to
their data.
 Design Goals
o Usability (generic file system interface)
o Scalability (extensible personal storage space)
o Reliability (access despite single email failure)
EMFS System Architecture
•Email File System Interface through FUSE
•Memory Cache
•Email Mapping Service
•Local Cache
•Email Cloud Storage Interface
Data Organization and Access
 File Organization
o Metadata
o File Data stored as attachments or in the body of emails
Data Organization and Access cont’d
 Metadata and Data Access
o Client cache management
o Metadata update
o Data access operations
 Consistency and Failure Recovery
o Adopt a mechanism to ensure the atomicity of updates
•(a) Lost metadata update
•(b) Lost part of data update
Email Protocol Selection
 Simple Mail Transfer Protocol (SMTP)
o Only used for transferring emails to the server
o Restriction on number of messages sent through SMTP
 Internet Message Access Protocol (IMAP)
o Supports both sending and retrieving messages
o Allows users to “append” a message to their own mailbox
o Not limited by traffic restrictions
 Post Office Protocol (POP)
o Primarily used for retrieving emails
o Supports simple download-and-delete access pattern
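The IMAP "append" operation mentioned above can be sketched with Python's standard `imaplib` module. This is only an illustration, not the EMFS code: the helper name `append_block` and the mailbox name used in the usage note are assumptions, and `conn` is expected to be an already authenticated `imaplib.IMAP4_SSL` connection.

```python
import imaplib
import time


def append_block(conn, mailbox, raw_message):
    """Store one message in the user's own mailbox via IMAP APPEND.

    conn        -- object behaving like an authenticated imaplib.IMAP4_SSL
    mailbox     -- target mailbox name (hypothetical, chosen by the caller)
    raw_message -- complete RFC 5322 message as bytes
    """
    # Timestamp formatted the way IMAP expects for the message's internal date.
    internal_date = imaplib.Time2Internaldate(time.time())
    # flags=None means no flags are set on the stored message.
    status, data = conn.append(mailbox, None, internal_date, raw_message)
    if status != "OK":
        raise RuntimeError(f"IMAP APPEND failed: {status} {data!r}")
    return data
```

In real use the connection would be created with something like `imaplib.IMAP4_SSL(host)` followed by `conn.login(user, password)`, where the host and credentials come from the email provider. Because the message goes straight into the mailbox, no SMTP delivery is involved.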
Email Protocol Selection cont’d
 Email sending and appending performance
o IMAP is faster than SMTP in almost all cases, by 5.5% on average and up
to 42.64%
Data Placement Within Emails
 Multiple places used to store data in an email
Subject line
o Metadata is stored in the body section
o The unique identifiers are stored in the subject line
o Data can be stored either as attachments or in the body
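The layout described above (identifier in the subject line, metadata in the body, data as an attachment) can be sketched with Python's standard `email` package. The function names, the JSON encoding of the metadata, and the example identifier are assumptions made for illustration; the slides do not specify the actual formats.

```python
import json
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_metadata_email(file_id, metadata):
    """Metadata email: identifier in the subject, metadata in the body."""
    msg = EmailMessage()
    msg["Subject"] = file_id
    msg.set_content(json.dumps(metadata))  # JSON is an assumed encoding
    return msg.as_bytes()


def build_block_email(block_id, payload):
    """Data email: identifier in the subject, data block as an attachment."""
    msg = EmailMessage()
    msg["Subject"] = block_id
    msg.set_content("data block")  # placeholder body text
    msg.add_attachment(payload, maintype="application",
                       subtype="octet-stream", filename=block_id)
    return msg.as_bytes()


def read_block_email(raw):
    """Return (identifier, payload) from a data email built above."""
    msg = message_from_bytes(raw, policy=policy.default)
    attachment = next(msg.iter_attachments())
    return str(msg["Subject"]), attachment.get_content()


def read_metadata_email(raw):
    """Return (identifier, metadata dict) from a metadata email built above."""
    msg = message_from_bytes(raw, policy=policy.default)
    return str(msg["Subject"]), json.loads(msg.get_content())
```

Keeping the identifier in the subject line means a block can be located with a subject search, without downloading the message body or attachment first.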
Data Placement Within Emails cont’d
 Single email sending/retrieving performance
o Similar performance regardless of whether the payload is placed in the
body or the attachment
o Attachment payload slightly outperforms the body payload with Gmail
Block Size and File Striping
 Organize email accounts as a RAID
o Each account identified by a “RAID Index” from 0 to n-1
o Data blocks striped across email accounts
o Blocks stored on randomly chosen disks instead of having a fixed array
of email disks and striping data in a round-robin manner
o Metadata emails are usually small, so they are not striped
 EMFS uses 512KB as its default block size and 8 as the default
stripe width
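A minimal sketch of the block splitting and random placement described above, assuming the default 512KB block size and stripe width of 8. The function names are hypothetical, and the choice to give the blocks within one stripe distinct accounts is an assumption; the slides only state that blocks go to randomly chosen email disks rather than following a round-robin order.

```python
import random

BLOCK_SIZE = 512 * 1024  # default block size from the slides
STRIPE_WIDTH = 8         # default stripe width from the slides


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut file contents into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, num_accounts, stripe_width=STRIPE_WIDTH, rng=None):
    """Return one RAID index (0..num_accounts-1) per block.

    Assumption: blocks are handled one stripe at a time, and the blocks
    of a stripe are sent to distinct, randomly chosen accounts.
    """
    if rng is None:
        rng = random.Random()
    width = min(stripe_width, num_accounts)
    placement = []
    for start in range(0, num_blocks, width):
        count = min(width, num_blocks - start)
        # sample() picks `count` different accounts without repetition
        placement.extend(rng.sample(range(num_accounts), count))
    return placement
```

For example, a 4MB file yields 8 blocks of 512KB, which with 8 or more accounts can all be transferred to different accounts in a single stripe.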
Block Size and File Striping cont’d
 Figure 5 measures a 4MB file’s read/write latency
o File access latency steadily decreases when we increase the file block
(attachment) size, for both Gmail and Gawab Mail
Block Size and File Striping cont’d
 Figures 6 and 7 show the effect of striping with different block sizes
o Striping provides a significant performance improvement
o Increasing the stripe width beyond 8 or the block size beyond 1MB does
not help the performance
o Block sizes smaller than 256KB degrade performance in almost all cases
Data Replication
 Replication group
o Consists of two or more disks mirroring the same data
o Updates written to one of the email disks within the group
o Email disks (accounts) can be added or removed from a group
 Replication Strategies
o Read-one and Write-one
 All reads and writes from EMFS go to the same email account
o Read-fast and Write-fast
 Reads and writes go to different accounts based on their uploading
and downloading performance
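The two replication strategies can be contrasted with a small sketch. The `EmailDisk` record, its throughput fields, and the rule that the first disk in the group acts as the single target for read-one/write-one are assumptions for illustration; the slides do not say how the account is chosen or how performance is measured.

```python
from dataclasses import dataclass


@dataclass
class EmailDisk:
    name: str
    upload_kbps: float    # measured upload throughput (assumed metric)
    download_kbps: float  # measured download throughput (assumed metric)


def pick_one(group):
    """Read-one / write-one: the same account serves reads and writes."""
    primary = group[0]  # assumption: first disk in the group is the target
    return primary, primary


def pick_fast(group):
    """Read-fast / write-fast: choose the best account per direction."""
    reader = max(group, key=lambda d: d.download_kbps)
    writer = max(group, key=lambda d: d.upload_kbps)
    return reader, writer
```

With a group where one account downloads faster and another uploads faster, `pick_fast` returns two different disks, while `pick_one` always returns the same disk twice.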
EMFS Evaluation
 System Implementation
o Prototype is based on FUSE
o Implemented in around 3000 lines of Python code
o Two replication strategies implemented for comparison
 What we do
o Compare EMFS with three existing distributed file systems
o Use Postmark and IOZone and a synthetic file access benchmark
 Experiment Setup
o Dual-core desktop (2.66 GHz) with 3 GB of RAM running Ubuntu 8.10
o Both NFS and AFS servers were configured on dedicated machines
inside the campus network
o Jungle Disk was configured such that background or asynchronous
transfers were disabled
o EMFS was configured using accounts from Gmail and Gawab Mail
Performance Results – Postmark
 Postmark measures performance for network-based systems by
simulating access to short-lived small files
 Generate different workloads (equal bias, read heavy, append
heavy, and create heavy) by varying the operation bias
 200 files
 File sizes range from 4KB to 16MB
 200 transactions
 AFS and NFS perform better than EMFS
and Jungle Disk
 EMFS offers comparable performance to
Jungle Disk
 EMFS-Fast does offer better performance
than EMFS-One
Performance Results – IOZone
 Unlike Postmark, IOZone mainly focuses on file data access
 16 MB file
 Request sizes range from 128
KB to 4 MB
 AFS and Jungle Disk achieve a transfer rate between 25 and 50 MB/s for sequential read
 EMFS reports very high transfer rates
 Jungle Disk reports very low throughput (about 550-600 KB/s) for random reads
Performance Results – IOZone cont’d
 16 MB file
 Request sizes range from 128
KB to 4 MB
 EMFS is slightly better than Jungle Disk in terms of write throughput
 NFS and AFS are faster due to their high file transfer performance and low overhead
Performance Results – Editing Workload
 A synthetic benchmark that simulates a document editing task
 100 files, 14 directories (with
a maximum depth of 3)
 File sizes range from 8KB to
 Lookup operations for AFS are
very fast
 EMFS-Prefetch helps reduce the
total lookup time by 17.4%
 All systems perform nearly the same for editing operations.
 EMFS-Fast brings a 31% improvement for the file save operation, which is quite close to
Jungle Disk.
Related Work
 Email-based file systems
GmailFS []
YaFS [Lu, et al., IPDPS 2009]
Free email accounts for data backup [Traeger, et al., StorageSS 2006]
EMFS systematically examines email-based file system design issues
 Other existing client-server systems
 LftpFS []
 ExpandDrive []
o EMFS enables users to take advantage of widely available and
increasingly powerful web-based email services
 Distributed file systems
 NFS [Pawlowski, et al., USENIX 1994], AFS [Howard, et al., ACM
Trans 1988], LBFS [Muthitacharoen, et al., SOSP 2001], GFS
[Ghemawat, et al., SOSP 2003], and Ceph [Weil, et al., OSDI 2006]
o EMFS complements existing studies on distributed file/storage systems
 To the best of our knowledge, our work is the first to systematically
examine email-based file system design issues
 Contributions
o Provides a personal cloud storage solution on top of multiple web-based
free email accounts
o Implements a prototype based on FUSE
o Evaluates the effectiveness of features such as multi-account space
aggregation, file striping, and data replication
•Thank you