Secured bulk file transfer over HTTP(S)
Yibiao Li and Andrew McNab
Manchester University
Abstract
A method for secure bulk file transfer over HTTP(S) for GridSite is discussed in this article.
Unlike FTP, this method can transfer a file from one Grid node to another Grid node directly,
without using the memory of the client computer. The authentication and authorization information
is transferred over HTTPS while the file is transferred over HTTP. To speed up the file transfer,
a multi-connection technique is adopted.
Keywords: bulk file, HTTP(s) transfer, GridSite, GridHTTP protocol
1. Background
GridSite[5] was originally a web application
developed for managing and formatting the
content of the GridPP[8] website. Over the past
three years it has grown into a set of extensions
to the Apache web server and a toolkit for Grid
credentials, GACL access control lists and
HTTP(S) protocol operations. A powerful
client-side command, htcp, was developed for
users to operate on (delete, move, copy etc.)
files and directories on GridSite nodes.
Recently, bulk file transfer functionality was
added to the command, which can now transfer a
bulk file between two GridSite nodes directly
without using the memory of the local machine.
To help the reader, we briefly introduce GridSite here.
1.1 GridSite node
Each GridSite node runs Apache[6] with the
GridSite package installed. Normally, a GridSite
node is accessible by both HTTP and HTTPS. In
addition to reading its contents over HTTP, an
authorized user (see the following section) can
write or update the contents of a GridSite node
over HTTPS.
1.2 GridSite authentication and authorization
To access a GridSite node over HTTPS, a user
must have a user certificate issued by the
relevant Certification Authority (CA). A user
certificate usually has a version of the user's
name and affiliation as its Distinguished Name
(DN), for example
"/C=UK/O=eScience/OU=UniversityName/L=GroupName/CN=FirstName Surname".
Once the user has obtained a user certificate
from the CA, the user needs to make sure it is
loaded into the browser normally used to browse
the web. Browsers want the certificate and
private key in the PKCS#12 format, which is
normally a single file with the extension ".p12".
Many programs which are based on OpenSSL, such
as Globus and curl, prefer the PEM (".pem")
format for certificates, with separate
certificate and key files ("usercert.pem" and
"userkey.pem"). The two formats can easily be
converted to each other with software tools.
Once the user certificate is loaded into the
browser, the user should be able to see his/her
certificate name appear when looking at an
HTTPS GridSite page which has the page
footers enabled. If GridSite understands the user
certificate, it displays a "You are ..." line in the
footer.
Once users access a GridSite node with their
identity, they are granted the rights
appropriate to that identity. GridSite allows
site administrators to specify these rights for
individuals and groups using GACL access control
files (see next section). GACL defines who can
read files, who can list directories, who can
write or create files and who can modify the
GACL policy files. To get increased access to an
area of a site, the user needs to contact the
administrator for that area and give the DN of
the user's certificate (it is not necessary to
send any certificate files).
1.3 Access Control
DN Lists appear in the Grid Access Control
Lists (GACL) used by GridSite. These are stored
as .gacl files in directories: if the .gacl file
is present, it governs access to the directory;
if it is absent, then the parent directories are
searched upwards until a .gacl is found.
The GridSite GACL Reference explains the XML
format of these files, but they can be edited
using the ACL editor built into the GridSite
system by people who have the Admin permission
within the ACL.
If a user has this permission in a given
directory, then when viewing directory listings
or files in that directory with a browser the
user will see the option "Manage Directory" in
the page footer. This gives the user a listing
of the directory, and the .gacl file will appear
at the top if it is present. If not, there will
be a button to create a new .gacl file with the
same permissions as have been inherited by that
directory from its parent.
GACL allows quite complex conditions to be
imposed on access, but normally a user can think
of an ACL as being composed of a number of
entries, each of which contains one condition
(the required credential) and a set of allowed
and denied permissions.
Credentials can be individual users' certificate
names or whole groups of certificate names if a
DN List is given. (Users can also specify
hostname patterns using Unix shell wildcards,
e.g. *.ac.uk, or EDG VOMS attribute
certificates; see the GACL Reference for
details.)
Permissions can be Admin (edit the ACL), Write
(create, modify or delete files), List (browse
the directory) or Read (read files).
Permissions can be allowed or denied. If denied
by any entry, the permission is not available to
that user or DN List (depending on what
credential type was associated with the Deny).
2. Why transfer files over HTTP(S)?
Normally, there are the following ways to
transfer files over the Internet.
2.1 Email
One of the most important aspects of the
Internet is the ability to send large files easily.
Email is still the primary way to receive or send
large files over the Internet. Unfortunately,
using Email to send large files or receive large
files is fraught with drawbacks. In today's world,
file sizes are getting larger and larger but email
technology has not advanced at the same pace. It
is no longer efficient to send large files via
Email and in many cases it is impossible.
2.2 FTP
FTP is a method for exchanging files over the
Internet, using standard TCP/IP protocols to
carry the data.
FTP can be used to upload and download files of
almost any size from or to a central server. It
is a well-established and consistently
implemented protocol that can be enabled on the
Windows Storage Server.
The advantages of FTP include:
Support for all kinds of clients: Standardized
implementation of the protocol means that
virtually any FTP client, running on a Microsoft
or non-Microsoft operating system, can use the
FTP server.
High performance and simplicity: The
performance and simplicity of the protocol make
it a convenient option for file transfers across
the Internet.
The primary disadvantage of FTP is that data and
logon information are sent unencrypted across
the network. This could result in the discovery
of logon accounts or passwords, which could then
be used by unauthorized individuals to access
other systems.
2.3 HTTP
The HTTP[1] protocol is a protocol for file
transfer over the Internet. It is often used to
download HTML files or image files through a web
browser such as IE, Mozilla, or Firefox. It can
also be used to upload files, for example with
command-line tools under Unix/Linux.
2.4 HTTPS
HTTPS is a communications protocol designed to
transfer encrypted information between computers
over the Internet. HTTPS is HTTP over the Secure
Sockets Layer (SSL).
It is recommended that users use HTTPS when
transferring files containing security-sensitive
information.
3. Design
3.1 GridHTTP protocol
To support file transfer between GridSite nodes, the GridHTTP protocol
was designed. It carries bulk data over unencrypted HTTP, while
authentication and authorization with the usual Grid credentials are
performed over HTTPS.
To initiate a GridHTTP transfer, clients set an Upgrade: GridHTTP/1.0
header when making an HTTPS request for a file. This header notifies the
server that the client would prefer to retrieve the file by HTTP rather
than HTTPS, if possible. The authentication and authorization are done
via HTTPS (with X.509, VOMS, GACL etc. deciding whether the request is
permitted) and then the server may redirect the client to an HTTP
version of the file using a standard HTTP 302 redirect response giving
the HTTP URL (which can be on a different server, in the general case).
For small files, the server can choose to return the file over HTTPS as
the response body. When contacting a legacy server, the Upgrade header
will be silently ignored and the file will be returned via HTTPS as
normal.
For redirection to plain HTTP transport, a standard HTTP Set-Cookie
header is used to send the client a one-time pass-code in the form of a
cookie, GRIDHTTP_PASSCODE, which must be presented to obtain the file
via HTTP. This one-time pass-code only works for the file in question,
and only works once: the current implementation stores it in a file and
deletes the file when the pass-code is used. (This mechanism is no worse
than GridFTP for providing an unencrypted data channel: it is vulnerable
to man-in-the-middle attacks or snooping to obtain a copy of the
requested file, but not vulnerable to replay attacks or to other files
being obtained by the attacker.)
As you can see, GridHTTP is really a profile for using the HTTP/1.1
standard, rather than a new protocol or a set of extensions: no new
headers or methods are involved.
Ways of extending it to support variable TCP window sizes, so it can be
used for a mix of long and short distance connections (currently the TCP
window size has to be set in the Apache configuration file), and support
for third-party transfers using the HTTP COPY method from WebDAV are
being added to the GridSite implementation.
3.2 Advantages
One big advantage of redirecting to a pure HTTP GET transfer is not just
that the server and client don't have to spend CPU en/decrypting it, but
that Apache can use the sendfile() system call to tell the kernel to
copy it directly from the file system to the network socket (or can use
the Linux kernel module HTTP server, which has much the same effect).
This means the data never has to be copied through user space (the
so-called zero copy mode).
As far as client side APIs go, any client side library which supports
HTTP redirects and cookies and lets the user add his/her own headers is
sufficient (even the curl command line tool lets the user do this, with
the -H and -c options, without having to make any modifications to its
code).
From GridSite version 1.1.11, htcp supports GridHTTP redirection, by
using the --grid-http option.
3.3 Bulk file transfer between GridSite nodes
Assume that a Grid user wants to copy a bulk file from one GridSite node
(the source server) to another GridSite node (the destination server) by
issuing a batch of commands on a third computer, denoted the client
computer (so the user cannot log on to the destination server to copy
the file directly).
We first consider the simple case of copying files without any security.
In general, using a command like wget or curl, the user can copy the
file to the local computer first, then upload it to the destination
server, as shown in figure 1.
Fig 1 User downloads the file to the local computer first, then uploads
it to the destination server
Although this method does the job, it clearly wastes time, network
bandwidth, and the memory and disk space of the local machine.
Instead, we can look for a way to send a command to the destination
server, asking it to get the file from the source server.
Fig 2 User sends a command to the destination server to ask it to copy
the file from the source server
To realize this, there should be a module on the
server side, which can:
1. receive an HTTP request,
2. extract the file information from the
request,
3. send a request to the source server to get
the file,
4. receive the file and save it.
Since Apache is extensible, we can develop such
a module and attach it to Apache.
Now let us add security to the above case in the
GridSite setting.
To support the remote bulk file copy, the
GridSite nodes should:
1. verify the user certificate (source server).
2. produce a one-time pass-code (source server).
3. check the user's access to the destination
directory, and respond to the user if there is
no write access (destination server).
4. send the file request with the pass-code as a
cookie via HTTP (destination server).
5. redirect the HTTPS request to HTTP (source
server).
6. retrieve the file and save it to a directory
(destination server).
7. respond to the user when finished
(destination server).
The client-side command should do the following:
1. send a pass-code request for the file to the
source GridSite node, with the user certificate,
over HTTPS
2. retrieve the pass-code from the source
GridSite node over HTTPS
3. send a file copy request to the destination
GridSite node, with the pass-code, over HTTPS
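Steps 1 and 2 above rest on the GridHTTP exchange described in section 3.1. The following is a minimal Python sketch of the header handling only, not the htcp implementation: the Upgrade header value, the 302 redirect and the GRIDHTTP_PASSCODE cookie name come from section 3.1, while the function names and the example URL are illustrative assumptions. No network traffic is generated; responses are passed in as plain values.

```python
from http.cookies import SimpleCookie

PASSCODE_COOKIE = "GRIDHTTP_PASSCODE"  # cookie name given in section 3.1


def upgrade_request_headers():
    """Headers a client adds to its HTTPS request to ask for GridHTTP."""
    return {"Upgrade": "GridHTTP/1.0"}


def parse_redirect(status, headers):
    """Return (http_url, passcode) for a GridHTTP redirect, else None.

    `headers` is a plain dict of response header names to values
    (a simplification: real responses may repeat Set-Cookie).
    """
    if status != 302:
        # Legacy server, or a small file returned directly over HTTPS.
        return None
    lower = {name.lower(): value for name, value in headers.items()}
    location = lower.get("location")
    cookie = SimpleCookie()
    cookie.load(lower.get("set-cookie", ""))
    if location is None or PASSCODE_COOKIE not in cookie:
        return None
    return location, cookie[PASSCODE_COOKIE].value


def fetch_headers(passcode):
    """Headers for the follow-up plain HTTP GET that presents the pass-code."""
    return {"Cookie": f"{PASSCODE_COOKIE}={passcode}"}


if __name__ == "__main__":
    # Hypothetical redirect response from a source server.
    reply = {
        "Location": "http://source.example/data.dat",
        "Set-Cookie": "GRIDHTTP_PASSCODE=abc123; Path=/data.dat",
    }
    url, code = parse_redirect(302, reply)
    print(url, fetch_headers(code))
```

A client that receives None from parse_redirect would simply read the file from the HTTPS response body, which matches the fallback behaviour described for legacy servers.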
A complete description of the bulk file copy
system can be given as follows (see figure 3):
1. Client uses the user ID (user certificate and
user key) to request a one-time pass-code from
the source GridSite server (HTTPS)
2. Source server verifies the user ID.
3. Source server issues a one-time pass-code to
the client (HTTPS)
4. Client sends a request, with the one-time
pass-code, to the destination server to get the
file from the source server (HTTPS)
5. Destination server sends a request to the
source server with the pass-code (HTTPS)
6. Source server verifies the pass-code
7. Source server transfers the file to the
destination server (HTTP)
Fig 3 A complete bulk file transfer system (single connection)
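The one-time pass-code issued in step 3 and verified in step 6 can be modelled as follows. This is a hedged Python sketch rather than the GridSite code: the paper states only that a random number is stored in a file, that it is valid for one file and one use, and (in section 4.1) that it is deleted after a specified period. The class name, the on-disk record layout and the 60-second default are assumptions made for illustration.

```python
import os
import secrets
import tempfile
import time


class PasscodeStore:
    """File-backed one-time pass-codes, each bound to a single file path."""

    def __init__(self, directory, ttl_seconds=60.0):
        self.directory = directory      # should be readable by the server only
        self.ttl_seconds = ttl_seconds  # assumed expiry period

    def issue(self, file_path, now=None):
        """Create a random pass-code for file_path and record it on disk."""
        now = time.time() if now is None else now
        code = secrets.token_hex(16)
        with open(os.path.join(self.directory, code), "w") as fh:
            fh.write(f"{now + self.ttl_seconds}\n{file_path}\n")
        return code

    def redeem(self, code, file_path, now=None):
        """Consume the code; True only if it matches file_path and is unexpired."""
        now = time.time() if now is None else now
        if not code.isalnum():
            return False  # reject anything that could escape the directory
        record = os.path.join(self.directory, code)
        try:
            with open(record) as fh:
                expires, bound_path = fh.read().splitlines()
            os.remove(record)  # a code is spent by its first use
            return bound_path == file_path and now <= float(expires)
        except (OSError, ValueError):
            return False


if __name__ == "__main__":
    store = PasscodeStore(tempfile.mkdtemp())
    code = store.issue("/data/data.dat")
    print(store.redeem(code, "/data/data.dat"))  # first use succeeds
    print(store.redeem(code, "/data/data.dat"))  # second use is refused
```

In this sketch a pass-code presented for the wrong file is also consumed, which is one possible reading of "only works once"; the paper does not say how the real module treats that case.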
Furthermore, since transferring a bulk file
could take a long time, we can use a
multi-connection technique to fetch a different
part of the file in each connection at the same
time, which speeds up the bulk file copy. One
problem is that when the destination server
sends the requests for the file, a one-time
pass-code is required for each connection. To
obtain these one-time pass-codes (one for each
connection), a secured connection has to be
created to carry the pass-code requests and
responses, and the source server needs a
mechanism to produce a one-time pass-code when
it receives a request carrying the original
pass-code.
Thus the complete procedure for multi-connection
bulk file copy can be described as:
1. Client sends a request to the source server,
using the user ID (user certificate and user
key), to request a reusable pass-code (HTTPS),
2. Source server verifies the user ID,
3. Source server issues a reusable pass-code to
the client computer if the ID is OK, or responds
with an error message (HTTPS),
4. Client sends a request, with the reusable
pass-code, to the destination server to get the
file from the source server (HTTPS),
5. Destination server sends a request to the
source server with the reusable pass-code to get
the file size (HTTPS),
6. Destination server creates multiple
connections, and requests a one-time pass-code
for each connection (HTTPS),
7. Source server verifies the pass-code,
produces a one-time pass-code and sends it back
to the destination server (HTTPS),
8. Destination server requests part of the file
in each connection with a one-time pass-code
(HTTPS),
9. Source server checks the one-time pass-code
and transfers part of the file to the
destination server (HTTP),
10. Destination server combines the parts into a
complete file, and saves it to the required
directory.
Note that if the HTTP connection in steps 9 and
10 is changed to an HTTPS connection, the whole
file is transferred over HTTPS. The method is
therefore easily extended to the case where both
the file and the security data are transferred
over HTTPS.
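Steps 5 to 10 amount to splitting the file into parts, fetching the parts in parallel and joining them in order. The Python sketch below illustrates that idea under stated assumptions: the paper does not say how a part is addressed (an HTTP Range request is one possibility), so fetch_part is a hypothetical callable standing in for one ranged request made with its own one-time pass-code, and the way the connection count and block size interact here is a guess, not the documented behaviour of htcp.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(file_size, block_size):
    """Inclusive (start, end) byte ranges of at most block_size bytes each."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        (start, min(start + block_size, file_size) - 1)
        for start in range(0, file_size, block_size)
    ]


def parallel_copy(fetch_part, file_size, connections, block_size):
    """Fetch every range with up to `connections` workers and join in order.

    fetch_part(start, end) must return the bytes start..end inclusive.
    """
    ranges = split_ranges(file_size, block_size)
    with ThreadPoolExecutor(max_workers=connections) as pool:
        # map() yields results in the order of `ranges`, so the parts
        # can be concatenated directly even if they finish out of order.
        parts = list(pool.map(lambda r: fetch_part(*r), ranges))
    return b"".join(parts)


if __name__ == "__main__":
    source = bytes(range(256)) * 4  # stands in for the file on the source server

    def fake_fetch(start, end):
        return source[start:end + 1]

    copy = parallel_copy(fake_fetch, len(source), connections=5, block_size=100)
    print(copy == source)
```

A real destination module would write each part to disk at its offset instead of holding the whole file in memory; joining in memory keeps the sketch short.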
Fig 4 A complete bulk file transfer system (multi-connection)
4. Implementation
To realize the design in section 3, we consider the implementation on
the server side and client side separately.
4.1 Server side
On the server side, we need to implement two main modules. The first is
responsible for producing and checking the one-time and reusable
pass-codes (the source server in the above discussion); the other is
responsible for responding to client requests and copying the file (the
destination server in the above discussion).
In practice, we developed the first module (pass-code) as part of
GridSite, and the second module as a CGI program.
The pass-code module is part of the GridSite package, which is compiled
and run as an Apache module[2][3]. The key task here is the encoding of
the one-time pass-code: as described in the protocol, the pass-code
should be tied to one file, including its full path, and to the time. We
generate a random number and save it in a directory that is only
accessible by the Apache server. Once the file related to this pass-code
has been sent, the pass-code is deleted; it is also deleted after a
specified period even if the file has not been sent. This mechanism
ensures that even if someone learns a pass-code by some means, he cannot
gain access to other files.
The second module is responsible for receiving the user request and
obtaining the file from the source server. In the current version of the
GridSite package, we developed it as a CGI program which is mapped to
the HTTP COPY method in the Apache configuration file. It was developed
with C and libcurl. The key point in the single-connection case is to
transfer the pass-code as a cookie via an HTTPS connection, then get the
file over HTTP. In the multi-connection case, we use multiple threads.
For each connection, we use the reusable pass-code to get the one-time
pass-code over HTTPS, and then get that part of the file over HTTP.
4.2 Client side
On the client side, the functionality was built into the htcp command.
The options that need to be passed to the destination server, such as
the connection number, the block size of each connection and the thread
number specified by the user, are transferred with OPTS in the HTTP
header over HTTPS.
5. The server configuration and command usage
5.1 The server configuration
As described in section 4.1, there are two
modules on the server side. The pass-code module
is built into the GridSite package, so once the
package is installed and the Apache service is
started, this module is active and no extra
configuration is needed.
The second module is also included in the
GridSite package but runs as a separate program,
named gridsite-copy.cgi. It should be installed
into the directory that is mapped to /cgi-bin/
in the Apache configuration file. So after the
GridSite package is installed, check that
directory if you know where it is, or first look
in the Apache configuration file httpd.conf to
find out where it is, then check that the file
gridsite-copy.cgi is there. The following line
needs to be added to the Apache configuration
file httpd.conf:
Script COPY /cgi-bin/gridsite-copy.cgi
To copy a file from one GridSite node to
another, you must have write access to the
destination directory on the destination node.
The access configuration is described in section
1.3, Access Control.
5.2 Client command
To copy a file from one GridSite node to
another, use the htcp command provided by the
GridSite package. Here we give two examples.
Example 1: copy a file data.dat from node A to
node B’s data directory using one connection:
htcp rmtcp https://a/data.dat https://b/data/
Example 2: copy a file data.dat from node A to
node B’s data directory using multiple
connections:
htcp -rmtcp -connection-number 5 -block-size 20 https://a/data.dat https://b/data/
Note that in the above examples, the option
rmtcp asks htcp to execute the remote copy
module; the option connection-number indicates
how many connections will be used to get the
file, and the option block-size indicates the
maximum number of kilobytes for each connection.
To use the htcp command, users should place
their certificates in the .globus directory or
in the current directory.
6. Notes
There is another tool for transferring files
between Grid sites, GridFTP[9], which is based
on the popular File Transfer Protocol (FTP) and
supports functionality such as:
1) Grid Security Infrastructure (GSI)
2) Third-party control of data transfer
3) Parallel data transfer
4) Striped data transfer
5) Partial file transfer
As shown in the previous sections, GridHTTP
provides similar functionality; the main
difference from GridFTP is that GridHTTP is
based on the HTTP protocol.
There are many arguments about the advantages
and disadvantages of file transfer over HTTP or
FTP, and it is hard to determine in theory which
is better. In Grid environments, GridFTP needs
extra installation and configuration, while the
remote copy support used by htcp is included in
the GridSite package as an Apache module and an
independent CGI program, and needs only quite
simple configuration.
The remote copy method for GridSite nodes
discussed in this article can easily be applied
to, and developed for, normal Apache nodes.
7. Acknowledgements
This work was funded by the Particle
Physics and Astronomy Research Council
through the GridPP programme.
References:
[1] R. Fielding et al., “Hypertext Transfer Protocol – HTTP/1.1”,
http://www.w3.org/Protocols/rfc2616/rfc2616.html, 1999
[2] L. Stein and D. MacEachern, “Writing Apache Modules with Perl and C”,
O'Reilly & Associates, 1999
[3] B. Laurie and P. Laurie, “Apache: The Definitive Guide”,
O'Reilly & Associates, Third Edition, 2002
[4] Thomas Boutell, “Featuring C and Perl 5 Source Code”,
Addison Wesley, 1996
[5] GridSite software and documents: http://www.gridsite.org
[6] Apache official website: http://www.apache.org
[7] Curl and libcurl: http://curl.haxx.se
[8] GridPP website: http://www.gridpp.ac.uk
[9] GridFTP website: http://www.globus.org/toolkit/docs/3.2/gridftp/