Final Report: Distributed Consistent Secure USB Storage
Sean Busch, Matt Dube, Eddie Lai, and Zhou Zheng

Abstract – Backing up data is a necessary task for anyone who hopes to access their digital data at another time or location. Personal computers are typically equipped with hard disk drives that allow users to back up their data internally, while floppy disks and, more recently, USB flash drives have been used to store and transport data. While these are trusted solutions, each can fail through wear from use, physical damage, or loss. Thus, to guarantee the availability of a backup, one needs to create several backups. Here, we propose a simple, reliable hub that instantly creates several copies of a single backup on several different USB sticks with the push of a button. This device will be capable of syncing with other hubs across a network, allowing backups to be created or updated regardless of where they are located.

I. INTRODUCTION

One of the greatest benefits of using a computer is the ease with which one can save and copy data, and backing up data has become a daily occurrence for billions of people. While many people have their own personal or home computer, users commonly need to transfer data from one location to another, such as from a work computer to a home computer, or to save work done on a public computer to a secure location. This backing up and transferring of data was once accomplished using floppy disks and is now more commonly done with USB flash drives or other memory cards, because of their greater capacity and smaller form factor. While this is a convenient solution, these small, portable memory devices have their own vulnerabilities. Data backed up on a portable storage device, such as a USB flash drive, is prone to physical damage, such as the wear and tear of being stored in a pocket or on a key ring.
The diminutive size that makes these devices so portable also makes them easy to lose or misplace. To counteract these concerns, it is natural to make backups in several places or on several different devices. Keeping each of these backups up to date and consistent, however, is a chore: the most recent files must be copied from one place to another, and the process must be repeated for each device.

With the advent of cloud computing, several products, such as Dropbox, allow users to save their data to the cloud in order to access it from anywhere at any time. However, some users' data may be too sensitive to release to external storage. Users should be able to have total control over where their data is stored and how it is accessed. Without owning and possessing the physical media on which the data is stored, it is hard to provide this kind of security.

Our hub aims to solve this problem by taking several USB drives and synchronizing them with the touch of a button. This relieves the user of the tedious task of determining which backup is the most recent and then manually copying files from one place to another. With the added capability of networking, our hub is able to synchronize drives in different locations. This gives users functionality beyond simply creating and maintaining consistent backups. For example, a user on one side of the country could share photos or files with another hub owner on the other side of the country by simply loading the files onto a USB flash drive and pressing a button.

II. DESIGN

A. System Overview

The system consists of multiple USB storage devices plugged into several hubs, which are connected to a network. This arrangement is shown in the high-level diagram in Figure 1. The system's operation begins with the user initializing the hub network via a web browser on their PC.
Once the IP address of each hub has been identified and stored on each hub, the user writes a file from their PC to a USB drive mounted on one of the hubs. Once the user initiates synchronization, the hub determines what changes have been made to the files on the drive and distributes the updates to the other USB drives in the group, both on the same hub and on other hubs distributed across the network. Each of the USB drives in the synchronized group is recognized by the drive’s unique identifier, ensuring that only the correct drives are updated.

An additional feature of the hub is its secure application. One issue that arises when creating several copies of the data is that the data becomes more vulnerable, because more physical copies of it exist. To ensure the security of each copy, the hub implements a secret sharing scheme that renders each copy useless unless it is combined with other copies, or “shares” as they are referred to in this report. Upon synchronization, the data is broken up into several of these shares, which are distributed to the individual USB drives. For a user to access the original data, a sufficiently large subset of the shares must be present in the hub network. Once this number of shares is present, the data can be rebuilt and recovered.

B. Detailed System Specification

4) User Interface

There are two primary operations that the user must have control over: when a synchronization is initiated and when the USB drives can be unmounted. Additionally, the hub must give the user feedback on when a synchronization is occurring, when the USB drives are mounted, and when an error has occurred. The most efficient way to provide this interface is with two buttons and two LEDs. The buttons are connected to the BeagleBoard's GPIO ports; one allows the user to initiate a synchronization request, while the other unmounts the USB drives.
The LEDs are also connected to the BeagleBoard's GPIO ports; one indicates when a synchronization is occurring, while the other indicates when the USB drives are mounted.

To achieve the functionality described in the system overview, each hub is built on an embedded system running a Linux operating system. In addition to processing power and an operating system, each hub must have the features detailed in the block diagram shown in Figure 2. These features include internal storage, network connectivity, a USB interface, several USB ports, LEDs, external buttons, and file consistency software.

5) Secret Sharing

This application is based on Shamir’s Secret Sharing algorithm and implements what is called a (k, n) threshold scheme. Following this algorithm, a piece of data is broken into n shares on a synchronization request, with each individual share containing no discernible information about the original data. For the original data to be recovered, at least k of the original n shares must be present on the USB network.

1) Consistency Software

The consistency software running on each hub is responsible for maintaining consistent data for each group of USB storage devices, including devices connected to any of the other hubs. The software is also partially responsible for presenting the hub to the host computer as a single USB storage device containing a single copy of the consistent data across the system. As part of these responsibilities, the custom software must detect when a change is made to data on a USB storage device and set up secure connections to other hubs in order to distribute the changes. The consistency software recognizes when changes have been made using conventional UNIX tools. Timestamps and checksums are recorded and then compared with the previous record using diff to determine what updates need to be made.
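The comparison of recorded timestamps and checksums against the previous record can be illustrated with a minimal Python sketch. This is not the project's bash script: the `Listing` layout, the function name `classify_changes`, and the sample paths are assumptions made for illustration, and the sketch treats any checksum change as an edit regardless of the timestamp.

```python
from typing import Dict, List, Tuple

# Hypothetical listing format: file path -> (checksum, modification time).
Listing = Dict[str, Tuple[str, float]]


def classify_changes(previous: Listing, current: Listing) -> Dict[str, List[str]]:
    """Compare two listings and group paths by the kind of change."""
    changes: Dict[str, List[str]] = {
        "added": [], "removed": [], "edited": [], "touched": []
    }
    for path, (checksum, mtime) in current.items():
        if path not in previous:
            changes["added"].append(path)
            continue
        old_checksum, old_mtime = previous[path]
        if checksum != old_checksum:
            # Content differs, so the file itself must be sent.
            changes["edited"].append(path)
        elif mtime != old_mtime:
            # Same content, new timestamp: the file was only touched.
            changes["touched"].append(path)
    for path in previous:
        if path not in current:
            changes["removed"].append(path)
    for paths in changes.values():
        paths.sort()
    return changes


# Illustrative data only; the checksums are placeholders, not real digests.
previous = {"a.txt": ("111", 10.0), "b.txt": ("222", 10.0), "c.txt": ("333", 10.0)}
current = {"a.txt": ("111", 20.0), "b.txt": ("999", 20.0), "d.txt": ("444", 20.0)}
result = classify_changes(previous, current)
```

Under these assumptions, only the "added" and "edited" groups require file contents to be transmitted; "touched" and "removed" entries can be propagated as metadata-only updates, which is the reason for recording both the checksum and the timestamp.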
2) Hub Networking

To maintain consistency across distributed hubs, the hubs must be able to communicate with one another across a network. This is achieved using the TCP client/server model, with the hub distributing the updates acting as the client and the hub receiving the updates acting as the server. When a sync is initiated, the client hub opens a secure socket to each of the server hubs using SSL. Once the client has connected to a server and distributed its updates, it disconnects.

3) Host PC Interface

The hub interfaces with the user's PC via the BeagleBoard's USB On-The-Go port. This connection allows the hub to connect directly to the user's PC without any configuration or software installation. Once connected to the user's PC, the hub is mounted as a USB Mass Storage Device. Mounted this way, the hub lets the user directly access the USB drives mounted on the hub, creating a simple, transparent connection on both ends.

6) Hub Initialization

For the hubs to be able to network with one another, each must first know the IP addresses of the other hubs in the network. Since it is not known in advance what IP address will be assigned to a hub, finding and recording this information cannot be automated, and the user must enter it in order to configure the hub. To record these values, the user accesses a web server on the hub via a web client of their choice. After entering the IP addresses of the other hubs in the network in the GUI, the user is able to network the hubs.

7) Design Alternatives

Several design alternatives were discussed throughout the design and implementation of this project. One such alternative was using Windows XP as the operating system running on the hubs. It was determined that a lightweight Linux operating system is better suited to our needs. One major decision that needed to be made was how the hub would interface with a computer.
One option we considered was for the hub to be set up and accessed as network-attached storage. This alternative, however, was ruled out because it would require too much initial configuration on the part of the user, in direct conflict with our goal of making a device that is as simple and intuitive as possible.

The greatest design decision we had to make was which embedded system to develop our hub on. We originally began developing on an Advantech development board featuring an Intel Atom processor. Despite this system’s relatively high processing power and large memory, it was determined that it was not properly equipped to interface with another PC. Since one of our design specifications is that the hub must appear to a user’s PC as a USB mass storage device, the embedded PC must have the proper hardware to achieve this. Because the Advantech board only features USB A ports, it can only be used as a USB host, and therefore cannot be seen as a mass storage device. Since we determined that this functionality was paramount to the success of our project, the development platform was changed to the BeagleBoard-xM because of its USB On-The-Go port. Despite having reduced processing power, the BeagleBoard-xM has the correct hardware to be mounted on a host PC as a mass storage device.

FDR PROTOTYPE IMPLEMENTATION

A. Overview

Each USB hub is built on a BeagleBoard-xM development board. The BeagleBoard includes an ARM Cortex-A8 CPU clocked at 1 GHz, 512 MB of RAM, an onboard Ethernet jack, a 4-port USB hub, a USB On-The-Go port, and flash memory provided by MicroSD. The hub is booted

The prototype developed for the Midway Design Review (MDR) consists of two Intel Atom prototype boards (hubs) connected to each other via a router. There will be two USB thumb drives plugged into each hub, each of which will be configured for a different “hub network”.
An update on USB1 on HUB1 will only be mirrored to the corresponding USB drive on HUB2 that is on the same network. Upon the user’s request, a “synchronize” function can be executed. This function mirrors the changes from the USB drives connected to one hub onto the other hub’s USB drives. Each hub acts as both a server and a client. Files and folders copied from one hub to another keep their original last-modified timestamp.

B. Algorithm

The algorithm used to detect changes is described below.

Each hub will act as a server. The server operates in the following way (flowchart nodes, listed in order):
Listen to incoming port.
Incoming connection? (N: keep listening. Y: continue.)
Interpret message.
Global sync message? Run synchronize (client).
Folder added/removed? File deleted? Generate trusted USBs on the hub and distribute update to those.
Timestamp changed? File being added? Generate one trusted USB on the update’s USB network and distribute to that one USB.
Distribute update locally to other USBs on the same USB network.
Update USB’s listing so they do not detect these updates as their own modifications.
Any more incoming data? (N: close all connected socket connections.)

Each hub will need to run a bash script for each of the USB drives connected. The bash script performs the following operations (flowchart nodes, listed in order):
Take a listing of all files' checksums and timestamps, and of all folders.
Diff the current listing against the previous listing.
Timestamp changed but CRC same? The file was only touched.
File removed from listings? The file was removed.
Folder added or removed from listing? The folder was added/removed.
Timestamp and CRC changed for file? The file was added or edited; write the file update to the “file modification” file.
For the first three cases, write the update to the “non file modification” file.

The client operates in the following way (flowchart nodes, listed in order):
Any USBs connected? (Y: continue.)
Run Consistency_Script on a trusted USB.
Does a “non file modification” file exist? (Y: send update to other hubs for 1) adding/removing folders, 2) removing files, 3) timestamp changes; repeat while there are more updates.)
Does a “file modification” file exist? (Y: continue.)
Secret Sharing or Normal? Normal: send file update to self and other hubs. Secret Sharing: take the modified file, split the file, and distribute shares to self and other hubs. Repeat while there are more updates.
Any more USBs? (N: close all outgoing socket connections and exit program.)

PROJECT MANAGEMENT

C. Team Roles

Linux Bash Script, C Socket Program: Eddie
Linux Bash Script, C Socket Program, (k, n) threshold: Matt
USB Interface: Sean
GPIO: Zhou

Summary and Conclusion

Overall, the main focus of the project is creating a distributed, consistent, and secure environment. Users are able to sync with other hubs across the network, allowing backups or updates regardless of their location. Distribution is based on the TCP/IP protocol. For consistency, we record checksums and timestamps to see whether any changes were made since the last check. Security focuses on a (k, n) threshold scheme. With these features, our device will create a consistent, secure, and distributed network environment.

APPENDIX

D. Application of Mathematics, Science and Engineering

In developing this prototype, we have used material from at least three science or engineering courses:
1. ECE 353: Learned how to build embedded systems and program microcontrollers.
2. ECE 354: Designed a network between two FPGA development boards to send pixels of a large image.
3. ECE 374: Learned about IP and server/client connections; created a simple client/server socket application in Java.

Detailed Example

For our prototype, we had to set up a TCP/IP socket connection to allow two hubs to communicate with each other. Each hub had to act as both a server and a client to the other hub. Initially, the client and server were set up on one hub through a local socket connection communicating through a single port.
We successfully maintained consistency between two directories using our software and the local socket connection. After that, we had to expand this function to two different hubs. In our networking course, we had to develop a simple client/server application. Based on our experience from that socket application and the image-sharing network, we were able to debug any issues that occurred.

E. Design and Performance of Experiments, Data Analysis and Interpretation

Not yet performed.

F. Design of System, Component, or Process to Meet Desired Needs within Realistic Constraints

The first system requirement is a small, compact hub that users can carry with them easily. Currently, our hub is an Intel Atom prototype board and is too large to carry around conveniently. This will be addressed in future prototypes and the final design, in which a custom-built PCB will be used. The other system requirement concerns data transfer during consistency maintenance. With extremely large files, the transfer will take a long time. Also, data transfer rates depend on the connected network, which is out of our control.

G.

Eddie Lai, CSE: Worked on socket and bash programming to maintain consistency with Matt.
Matt Dube, CSE: Set up hubs to be able to network with each other, researched potential applications, worked on socket and bash programming.
Sean Busch, EE: Worked on the hub-to-host-PC interface. Looked into building a custom PCB with Zhou.
Zhou Zheng, EE: Webmaster. Worked on hardware research.

H. Identification, Formulation, and Solution of Engineering Problems

One of the main obstacles was how to detect whether a file was modified based on a timestamp update. We solved this issue by recording the checksum of the file, but there is also the case where the timestamp may update without a modification.
Because of these issues, we record both the checksum and the timestamp. Detecting a timestamp update without a modification avoids re-transmitting the file.

I. Understanding of Professional and Ethical Responsibility

An issue of professional and ethical responsibility is that, if we were a General Dynamics group, we would be expected to deliver a satisfactory prototype that they find useful. Since none of the team members have had any involvement with military and secret applications, we had to contact an employee at General Dynamics to give us an idea of some relevant projects. During our brainstorming of a project, we received several emails from the General Dynamics representative, which gave us an idea of something they would find beneficial. During the career fair, we also brought up the idea to the representative for General Dynamics, and he loved the idea.

J. Team Communication

Weekly team meetings with our advisor are scheduled for each Friday at 11 am. Team meetings without the advisor are also held weekly. During the faculty meeting, individual and group progress is discussed, along with challenging problems that our group has encountered. Our weekly team meetings are for assigning roles to team members, ensuring the previous roles were completed, and brainstorming about potential upcoming issues. Unscheduled meetings occur many times a week in Duda, often for many hours at a time, for brainstorming and discussion. Written and stored communication has been done through email and individual journals shared among the team. Eddie Lai has served as the primary faculty contact in charge of responding to emails and setting up any meetings.

K. Understanding of the Impact of Engineering Solutions in a Global, Economic, Environmental and Societal Context

Our project will have positive ____ impacts by allowing users to collaborate over a distributed network.
Our project allows distributed collaboration on a set of data and lets a group of people control when its members can access the data. Data cannot be accessed without the minimum subset of users, which prevents any one user from having all the power.

L. Application of Material Acquired Outside of Coursework

Three examples of sources used outside of coursework are:
1. Md5deep manual page. This provided us with the solution for recording all checksums and their file paths.
2. Du manual page. This provided us with the solution for listing all folder paths.
3. Bash scripting and C socket programming tutorials posted on the internet. These provided us with specific solutions to issues of how to detect and send changes.

M. Knowledge of Contemporary Issues

People always want to make a backup of their data for safekeeping; often, there are multiple backups. After a change is made on one of the backups, it is extremely tedious to mirror the updates to all of the other backups, especially with many backups or many updates. Our project addresses this issue and provides an easy and transparent method for keeping backups consistent in many different locations. Our project also allows people to collaborate on a common set of data all over the world. Distributed file sharing is an important part of communication.

References:
[1] Cooper, Mendel. “Advanced Bash-Scripting Guide”. http://tldp.org/LDP/abs/html/
[2] Kurihara, Jun. “A New (k, n)-Threshold Secret Sharing Scheme and Its Extension”. http://isc08.twisc.org/slides/S10P2_A_New_(k,n)Threshold_Secret_Sharing_Scheme_and_Its_Extension.pdf
[3] Metalx1000. “Learn Bash Scripts-tutorial”. http://www.youtube.com/watch?v=QGvvJO5UIs4
[4] AIMB-212_DS (English Version). PDF. http://support.advantech.com.tw/Support/DownloadSearchByProduct.aspx?keyword=AIMB212&ctl00_ContentPlaceHolder1_EbizTabStripNoForm1_Tab=Datasheet
[5] Aimb-212_user_manual_ed.2-FINAL. PDF. http://support.advantech.com.tw/Support/DownloadSearchByProduct.aspx?keyword=AIMB212&ctl00_ContentPlaceHolder1_EbizTabStripNoForm1_Tab=Manual