Secured bulk file transfer over HTTP(S)

Yibiao Li and Andrew McNab
Manchester University

Abstract

A method of secured bulk file transfer over HTTP(S) for GridSite is discussed in this article. Unlike FTP, this method can transfer a file from one Grid node to another Grid node directly, without using the memory of the client computer. The verification information is transferred over HTTPS while the file itself is transferred over HTTP. To speed up the file transfer, a multi-connection technique is adopted.

Keywords: bulk file, HTTP(S) transfer, GridSite, GridHTTP protocol

1. Background

GridSite[5] was originally a web application developed for managing and formatting the content of the GridPP[8] website. Over the past three years it has grown into a set of extensions to the Apache web server and a toolkit for Grid credentials, GACL access control lists and HTTP(S) protocol operations. A powerful client-side command, htcp, was developed for users to operate on (delete, move, copy etc.) files and directories on GridSite nodes. Recently, bulk file transfer was added to the command, which can now transfer a bulk file between two GridSite nodes directly without using the memory of the local machine. To help the reader, we briefly introduce GridSite here.

1.1 GridSite node

Each GridSite node is equipped with Apache[6] and the GridSite package. Normally, a GridSite node is accessible by both HTTP and HTTPS. An authorized user (see the following section) can write or update the contents of a GridSite node over HTTPS, in addition to reading its contents over HTTP.

1.2 GridSite authentication and authorization

To access a GridSite node over HTTPS, a user must have a user certificate issued by the relevant Certification Authority (CA). A user certificate usually has a version of the user's name and affiliation as its Distinguished Name (DN), for example "/C=UK/O=eScience/OU=UniversityName/L=GroupName/CN=FirstName Surname".
Once the user has obtained a user certificate in his or her name from the CA, the user needs to make sure it is loaded into the browser normally used to browse the web. Browsers want the certificate and private key in the PKCS#12 format, which is normally a single file with the extension ".p12". Many programs based on OpenSSL, such as Globus and curl, prefer the PEM (".pem") format for certificates, with separate certificate and key files ("usercert.pem" and "userkey.pem"). The two formats can easily be converted to each other with software tools.

Once the user certificate is loaded into the browser, the user should see his or her certificate name appear when looking at an HTTPS GridSite page which has the page footers enabled. If GridSite understands the user certificate, it displays a "You are ..." line in the footer.

Once users access a GridSite node with their identity, they are granted the appropriate rights for that identity. GridSite allows site administrators to specify these rights for individuals and groups using GACL access control files (see next section). GACL defines who can read files, who can list directories, who can write or create files and who can modify the GACL policy files. To get increased access to an area of a site, the user needs to contact the administrator for that area and give the DN of the user's certificate (it is not necessary to send any certificate files).

1.3 Access Control

DN Lists appear in the Grid Access Control Lists (GACL) used by GridSite. These are stored as .gacl files in directories: if the .gacl file is present, it governs access to the directory; if it is absent, the parent directories are searched upwards until a .gacl is found. The GridSite GACL Reference explains the XML format of these files, but they can also be edited using the ACL editor built into the GridSite system by people who have the Admin permission within the ACL.
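The upward search for a governing .gacl file described in section 1.3 can be illustrated with a short sketch. This is not GridSite's code: the function name find_governing_gacl and the site_root parameter (the directory at which the search stops) are assumptions made for illustration only.

```python
from pathlib import Path
from typing import Optional


def find_governing_gacl(directory: Path, site_root: Path) -> Optional[Path]:
    """Return the .gacl file that governs `directory`, or None if none is found.

    The directory itself is checked first, then each parent in turn.
    `site_root` is a hypothetical stopping point: the search does not
    continue above it.
    """
    current = directory.resolve()
    root = site_root.resolve()
    while True:
        candidate = current / ".gacl"
        if candidate.is_file():
            return candidate
        # Stop at the site root, or at the file system root as a safety net.
        if current == root or current == current.parent:
            return None
        current = current.parent
```

With a .gacl only in the site root, a call for any subdirectory returns the root's file; as soon as a subdirectory gets its own .gacl, that file takes over for everything beneath it.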
If a user has the Admin permission in a given directory, then when viewing directory listings or files in that directory with a browser the user will see the option "Manage Directory" in the page footer. This gives the user a listing of the directory, with the .gacl file at the top if it is present. If it is not, there will be a button to create a new .gacl file with the same permissions as the directory has inherited from its parent.

GACL allows quite complex conditions to be imposed on access, but normally an ACL can be thought of as a number of entries, each of which contains one condition (the required credential) and a set of allowed and denied permissions. Credentials can be individual users' certificate names, or whole groups of certificate names if a DN List is given. (Users can also specify hostname patterns using Unix shell wildcards, e.g. *.ac.uk, or EDG VOMS attribute certificates; see the GACL Reference for details.) Permissions can be Admin (edit the ACL), Write (create, modify or delete files), List (browse the directory) or Read (read files). Permissions can be allowed or denied. If a permission is denied by any entry, it is not available to that user or DN List (depending on what credential type was associated with the Deny).

2. Why transfer files over HTTP(S)?

Normally, there are the following ways to transfer files over the Internet.

2.1 Email

One of the most important aspects of the Internet is the ability to send large files easily. Email is still the primary way to send or receive large files over the Internet. Unfortunately, using email for large files is fraught with drawbacks. File sizes are getting larger and larger, but email technology has not advanced at the same pace. It is no longer efficient to send large files via email, and in many cases it is impossible.
2.2 FTP

FTP is a method for exchanging files over the Internet, using standard TCP/IP protocols to enable data transfer. FTP can be used to upload and download files of almost any size to or from a central server. It is a well-established and consistently implemented protocol that can be enabled on the Windows Storage Server. The advantages of FTP include:

Support for all kinds of clients: standardized implementation of the protocol means that virtually any FTP client, running on a Microsoft or non-Microsoft operating system, can use the FTP server.

High performance and simplicity: the performance and simplicity of the protocol make it a convenient option for file transfers across the Internet.

The primary disadvantage of FTP is that data and logon information are sent unencrypted across the network. This could expose logon accounts or passwords, which could then be used by unauthorized individuals to access other systems.

2.3 HTTP

HTTP[1] is a protocol for file transfer over the Internet. It is often used to download HTML files or image files through a web browser such as IE, Mozilla or Firefox, but it can also be used to upload files with some commands under Unix/Linux.

2.4 HTTPS

HTTPS is a communications protocol designed to transfer encrypted information between computers over the Internet. HTTPS is HTTP carried over the Secure Sockets Layer (SSL). It is recommended that users use HTTPS when transferring files containing security-sensitive information.

3. Design

3.1 GridHTTP protocol

To realize file transfer between GridSite nodes, the GridHTTP protocol was designed. It supports bulk data transfers via unencrypted HTTP, while authentication and authorization with the usual grid credentials take place over HTTPS.

To initiate a GridHTTP transfer, clients set an Upgrade: GridHTTP/1.0 header when making an HTTPS request for a file. This header notifies the server that the client would prefer to retrieve the file by HTTP rather than HTTPS, if possible. Authentication and authorization are done via HTTPS (X.509, VOMS, GACL etc. deciding whether the request is permitted), and then the server may redirect the client to an HTTP version of the file using a standard HTTP 302 redirect response giving the HTTP URL (which can be on a different server, in the general case). For small files, the server can choose to return the file over HTTPS as the response body. When contacting a legacy server, the Upgrade header will be silently ignored and the file will be returned via HTTPS as normal.

For redirection to plain HTTP transport, a standard HTTP Set-Cookie header is used to send the client a one-time pass-code in the form of a cookie, GRIDHTTP_PASSCODE, which must be presented to obtain the file via HTTP. This one-time pass-code only works for the file in question, and only works once: the current implementation stores it in a file and deletes the file when the pass-code is used. (This mechanism is no worse than GridFTP for providing an unencrypted data channel: it is vulnerable to man-in-the-middle attacks or snooping to obtain a copy of the requested file, but not vulnerable to replay attacks or to other files being obtained by the attacker.)

As you can see, GridHTTP is really a profile for using the HTTP/1.1 standard, rather than a new protocol or a set of extensions: no new headers or methods are involved. Two additions are being made to the GridSite implementation: ways of supporting variable TCP window sizes, so that it can be used for a mix of long and short distance connections (currently the TCP window size has to be set in the Apache configuration file), and support for third-party transfers using the HTTP COPY method from WebDAV.

3.2 Advantages

One big advantage of redirecting to a pure HTTP GET transfer is not just that the server and client do not have to spend CPU time encrypting and decrypting the data, but that Apache can use the sendfile() system call to tell the kernel to copy it directly from the file system to the network socket (or can use the Linux kernel module HTTP server, which has much the same effect). This means the data never has to be copied through user space (the so-called zero copy mode).

As far as client side APIs go, any client side library which supports HTTP redirects and cookies and lets the user add his or her own headers is sufficient (even the curl command line tool lets the user do this, with the -H and -c options, without any modifications to its code). From GridSite version 1.1.11, htcp supports GridHTTP redirection through the --grid-http option.

3.3 Bulk file transfer between GridSite nodes

Assume that a grid user wants to copy a bulk file from one GridSite node (the source server) to another GridSite node (the destination server) by giving a batch of commands on a computer denoted as the client computer (so the user cannot log on to the destination server to copy the file directly).

First consider the simple case of copying files without any security. In general, using a command like wget or curl, the user can copy the file to the local computer first, then upload it to the destination server, as shown in figure 1.

Fig 1 user downloads file first to local computer, then uploads it to destination server

Though this method does the job, it wastes time, Internet bandwidth, and local memory and disk space. Instead, we can seek a way to send a command to the destination server, asking it to get the file from the source server.
Fig 2 user sends command to destination server to ask it to copy file from source server

To realize this, there should be a module on the server side which can:
1. receive an HTTP request,
2. retrieve the file information from the request,
3. send a request to the source server to get the file,
4. receive the file and save it.

Thanks to the extensible design of Apache, we can develop such a module and attach it to Apache.

Now let us add security to the above case in the GridSite setting. To support remote bulk file copy, the GridSite nodes should:
1. verify the user certificate (source server),
2. produce a one-time pass-code (source server),
3. check the user's access to the destination directory, and respond to the user if there is no write access (destination server),
4. send the file request with the pass-code as a cookie via HTTP (destination server),
5. redirect the HTTPS request to HTTP (source server),
6. retrieve the file and save it to some directory (destination server),
7. respond to the user when finished (destination server).

The client-side command should do the following:
1. send a pass-code request for the file to the source GridSite node with the user certificate over HTTPS,
2. retrieve the pass-code from the source GridSite node over HTTPS,
3. send the file copy request to the destination GridSite node with the pass-code over HTTPS.

A complete description of the bulk file copy system can be given as follows (see figure 3):
1. The client uses the user ID (user certificate and user key) to request a one-time pass-code from the source GridSite server (HTTPS).
2. The source server verifies the user ID.
3. The source server issues a one-time pass-code to the client (HTTPS).
4. The client sends a request to the destination server to get the file from the source server with the one-time pass-code (HTTPS).
5. The destination server sends a request to the source server with the pass-code (HTTPS).
6. The source server verifies the pass-code.
7. The source server transfers the file to the destination server (HTTP).
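Steps 3 and 6 of the description above depend on a pass-code that is valid for one file and can be used only once. The following is a minimal sketch of such a store, written in Python for illustration rather than taken from GridSite: the class name OneTimePasscodeStore, its method names and the storage layout (one small file per pass-code) are assumptions. Consuming the pass-code even when it is presented for the wrong file is also a choice made here, not something the protocol description specifies.

```python
import os
import secrets
from pathlib import Path


class OneTimePasscodeStore:
    """Hypothetical file-backed store: one small file per issued pass-code."""

    def __init__(self, directory: Path) -> None:
        # In a real deployment this directory would be readable only by
        # the web server; that is not enforced in this sketch.
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)

    def issue(self, file_path: str) -> str:
        """Create a random pass-code bound to exactly one file path."""
        passcode = secrets.token_hex(16)
        (self.directory / passcode).write_text(file_path)
        return passcode

    def redeem(self, passcode: str, file_path: str) -> bool:
        """Return True once, and only for the file the pass-code was issued for."""
        # Accept lowercase hex only, so a pass-code can never name a path
        # outside the store directory.
        if not passcode or any(c not in "0123456789abcdef" for c in passcode):
            return False
        record = self.directory / passcode
        claimed = self.directory / (passcode + ".claimed")
        try:
            # The rename succeeds for at most one caller, so two concurrent
            # requests cannot both use the same pass-code.
            os.rename(record, claimed)
        except FileNotFoundError:
            return False
        granted = claimed.read_text()
        claimed.unlink()  # the pass-code is consumed whether or not it matches
        return granted == file_path
```

A source server following this sketch would call issue() after verifying the user's credentials over HTTPS, and redeem() when the pass-code cookie arrives with the HTTP request for the file.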
Fig 3 a completed bulk file transfer system (single connection)

Furthermore, since transferring a bulk file can take a long time, we can use the multi-connection technique to fetch a different part of the file in each connection at the same time, which speeds up the bulk file copy. One problem is that when the destination server sends the requests for the file, a one-time pass-code is required for each connection. To obtain these one-time pass-codes (one for each connection), a secured connection has to be created to carry the pass-code requests and responses, and the source server needs a mechanism to produce a one-time pass-code when it receives a request with the original pass-code. Thus the complete procedure for multi-connection bulk file copy can be described as:
1. The client sends a request to the source server with the user ID (user certificate and user key) to request a reusable pass-code (HTTPS).
2. The source server verifies the user ID.
3. The source server issues a reusable pass-code to the client computer if the ID is OK, or responds with an error message (HTTPS).
4. The client sends a request to the destination server to get the file from the source server with the reusable pass-code (HTTPS).
5. The destination server sends a request to the source server with the reusable pass-code to get the file size (HTTPS).
6. The destination server creates multiple connections, and requests a one-time pass-code for each connection (HTTPS).
7. The source server verifies the pass-code, produces a one-time pass-code and sends it back to the destination server (HTTPS).
8. The destination server requests part of the file in each connection with a one-time pass-code (HTTPS).
9. The source server checks the one-time pass-code and transfers part of the file to the destination server (HTTP).
10. The destination server combines the parts into a complete file and saves it to the required directory.

Please note that in steps 9 and 10, if we change the HTTP connection to an HTTPS connection, the whole file will be transferred over HTTPS. So the method is easily extended to the case of transferring both the file and the security data over HTTPS.

Fig 4 a completed bulk file transfer system (multi-connection)

4. Implementation

To realize the design in section 3, we consider the implementation on the server side and the client side separately.

4.1 Server side

On the server side, we need two main modules. One is responsible for producing the one-time and reusable pass-codes and checking them (the source server in the above discussion); the other is responsible for responding to client requests and copying the file (the destination server in the above discussion). In practice, we developed the first module (pass-code) as part of GridSite, and the second module as a CGI program.

The pass-code module is part of the GridSite package, which can be compiled and run as an Apache module[2][3]. The key task here is the one-time pass-code encoding: as described in the protocol, the pass-code should be tied to one file, including its full path, and to the time. We generate a random number and save it to a directory that is accessible only by the Apache server. Once the file related to this pass-code has been sent, the pass-code is deleted; it is also deleted after a specified period even if the file has not been sent. This mechanism ensures that even if someone learns a pass-code by some means, he or she cannot gain access to other files.

The second module is responsible for receiving the user request and obtaining the file from the source server. In the current version of the GridSite package, we developed it as a CGI program; in the Apache configuration file, it is mapped to the HTTP COPY method. It was developed with C and libcurl. The key matter in the single-connection case is to transfer the pass-code as a cookie via an HTTPS connection, then get the file over HTTP. In the multi-connection case, we used the multi-thread technique. For each connection, we use the reusable pass-code to get the one-time pass-code over HTTPS, then get the part of the file over HTTP.

4.2 Client side

On the client side, the command was built into the htcp command. The options that need to be passed to the destination server, such as the connection number, the block size of each connection and the thread number specified by the user, are transferred with OPTS in the HTTP header over HTTPS.

5. The server configuration and command usage

5.1 The server configuration

As described in section 4.1, there are two modules on the server side. The pass-code module is built into the GridSite package, so once the package is installed and the Apache service starts, this module is running; no extra configuration is needed. The second module is also included in the GridSite package but runs as a separate program named gridsite-copy.cgi. It should be installed into a directory that is normally mapped to /cgi-bin/ in the Apache configuration file.
So after the GridSite package is installed, check that directory if you know where it is, or first check the Apache configuration file httpd.conf to find out where it is, then check that the file gridsite-copy.cgi is there. The following line needs to be added to the Apache configuration file httpd.conf:

    Script COPY /cgi-bin/gridsite-copy.cgi

If you want to copy a file from one GridSite node to another, you must have write access to the destination directory on the destination node. The access configuration is described in section 1.3, Access Control.

5.2 Client command

To copy a file from one GridSite node to another, use the htcp command provided by the GridSite package. Here we give two examples.

Example 1: copy a file data.dat from node A to node B's data directory using one connection:

    htcp -rmtcp https://a/data.dat https://b/data/

Example 2: copy a file data.dat from node A to node B's data directory using multiple connections:

    htcp -rmtcp -connection-number 5 -block-size 20 https://a/data.dat https://b/data/

Note that in the above examples, the option rmtcp asks htcp to execute the remote copy module, the option connection-number indicates how many connections will be used to get the file, and the option block-size indicates the maximum number of kilobytes for each connection. To use the htcp command, users should place their certificates in the .globus directory or in the current directory.

6. Notes

There is another tool for transferring files between Grid sites, GridFTP[9], which is based on the popular File Transfer Protocol (FTP) and supports functionalities such as:
1) Grid Security Infrastructure (GSI)
2) Third-party control of data transfer
3) Parallel data transfer
4) Striped data transfer
5) Partial file transfer

As shown in the previous sections, GridHTTP provides similar functionalities; the main difference from GridFTP is that GridHTTP is based on the HTTP protocol. There are many arguments about the advantages and disadvantages of transferring files over HTTP or FTP.
It is hard to determine from theory which is better. For grid environments, GridFTP needs extra installation and configuration, while the method used by htcp is embedded in the GridSite package as an Apache module plus an independent CGI program, and needs only simple configuration.

The remote copy method for GridSite nodes discussed in this article can easily be applied to, and developed further on, normal Apache nodes.

7. Acknowledgements

This work was funded by the Particle Physics and Astronomy Research Council through the GridPP programme.

References:
[1] R. Fielding et al., "Hypertext Transfer Protocol - HTTP/1.1", http://www.w3.org/Protocols/rfc2616/rfc2616.html, 1999
[2] L. Stein and D. MacEachern, "Writing Apache Modules with Perl and C", O'Reilly & Associates, 1999
[3] B. Laurie and P. Laurie, "Apache: The Definitive Guide, Third Edition", O'Reilly & Associates, 2002
[4] Thomas Boutell, "Featuring C and Perl 5 Source Code", Addison Wesley, 1996
[5] GridSite software and documents: http://www.gridsite.org
[6] Apache official website: http://www.apache.org
[7] Curl and libcurl: http://curl.haxx.se
[8] GridPP website: http://www.gridpp.ac.uk
[9] GridFTP website: http://www.globus.org/toolkit/docs/3.2/gridftp/