Department of Information Technology
IT-Vth SEM
Information Storage and Mgt. (IT-502)
PUT- TEST
Time: 2:30Hrs.
M. Marks: 20
Note: - Attempt any one question from each unit.
UNIT-I
Q.1. What do you mean by Storage Technology?
Q.2 Discuss the characteristics of developing an ILM strategy. Also list its advantages.
UNIT-II
Q.3 Explain architecture of intelligent disk subsystems.
Q.4 What are various factors that affect the performance of disk drives?
UNIT-III
Q.5 Discuss various forms of network storage used in ISM.
Q.6 Explain CAS protocol.
UNIT-IV
Q.7 What are the Disaster recovery principles which are required to protect essential data of
any organization?
Q.8 Discuss the key components which are required to manage network in SNMP.
UNIT-V
Q.9. What is cloud computing?
Q.10. Discuss Types of techniques used by RAID Model.
Information Storage and Mgt. (IT-502)
II-MIDTERM TEST- SOLUTIONS
SOLUTION Q.1
Evolution of Storage Technology and Architecture
The evolution of open systems and the affordability and ease of deployment that they offer made
it possible for business units/departments to have their own servers and storage. In earlier
implementations of open systems, the storage was typically internal to the server. Originally,
there were very limited policies and processes for managing these servers and the data created.
To overcome these challenges, storage technology evolved from non-intelligent internal storage
to intelligent networked storage. Highlights of this technology evolution include:
Redundant Array of Independent Disks (RAID): This technology was developed to address
the cost, performance, and availability requirements of data. It continues to evolve today and is
used in all storage architectures such as DAS, SAN, and so on.
Direct-attached storage (DAS): This type of storage connects directly to a server (host) or a
group of servers in a cluster. Storage can be either internal or external to the server. External
DAS alleviated the challenges of limited internal storage capacity.
Storage area network (SAN): This is a dedicated, high-performance Fibre Channel (FC)
network to facilitate block-level communication between servers and storage. Storage is
partitioned and assigned to a server for accessing its data. SAN offers scalability, availability,
performance, and cost benefits compared to DAS.
Network-attached storage (NAS): This is dedicated storage for file serving applications. Unlike
a SAN, it connects to an existing communication network (LAN) and provides file access to
heterogeneous clients. Because it is purpose-built for providing storage to file server
applications, it offers higher scalability, availability, performance, and cost benefits compared to
general purpose file servers.
Internet Protocol SAN (IP-SAN): One of the latest evolutions in storage architecture, IP-SAN
is a convergence of technologies used in SAN and NAS. IP-SAN provides block-level
communication across a local or wide area network (LAN or WAN), resulting in greater
consolidation and availability of data.
Storage technology and architecture continue to evolve, which enables organizations to
consolidate, protect, optimize, and leverage their data to achieve the highest return on
information assets.
Information Lifecycle
The information lifecycle is the “change in the value of information” over time. When data is
first created, it often has the highest value and is used frequently. As data ages, it is accessed less
frequently and is of less value to the organization. Understanding the information lifecycle helps
to deploy appropriate storage infrastructure, according to the changing value of information. For
example, in a sales order application, the value of the information changes from the time the
order is placed until the time that the warranty becomes void (see Figure 1-7). The value of the
information is highest when a company receives a new sales order and processes it to deliver the
product. After order fulfilment, the customer or order data need not be available for real-time
access. The company can transfer this data to less expensive secondary storage with lower
accessibility and availability requirements unless or until a warranty claim or another event
triggers its need. After the warranty becomes void, the company can archive or dispose of data to
create space for other high-value information.
Figure: Changing value of sales order information
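The sales order example can be sketched as a simple placement rule. This is only a minimal illustration: the state names, the tier names, and the `choose_tier` function are hypothetical and do not come from any ILM product.

```python
from enum import Enum


class OrderState(Enum):
    PROCESSING = "processing"        # new order, being fulfilled
    FULFILLED = "fulfilled"          # delivered, warranty still valid
    WARRANTY_CLAIM = "claim"         # an event that raises the value again
    WARRANTY_VOID = "warranty_void"  # warranty has expired


def choose_tier(state: OrderState) -> str:
    """Map the current value of sales order data to a storage tier."""
    if state in (OrderState.PROCESSING, OrderState.WARRANTY_CLAIM):
        return "primary"      # highest value: real-time access is needed
    if state is OrderState.FULFILLED:
        return "secondary"    # lower value: cheaper storage is acceptable
    return "archive"          # warranty void: archive or dispose


for state in OrderState:
    print(state.name, "->", choose_tier(state))
```

A real policy would also look at access frequency and retention rules; the point here is only that the tier follows the changing value of the information.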
Information Lifecycle Management
Today’s business requires data to be protected and available 24 × 7. Data centres can accomplish
this with the optimal and appropriate use of storage infrastructure.
An effective information management policy is required to support this infrastructure and
leverage its benefits. Information lifecycle management (ILM) is a proactive strategy that
enables an IT organization to effectively manage the data throughout its lifecycle, based on
predefined business policies. This allows an IT organization to optimize the storage
infrastructure for maximum return on investment.
SOLUTION 2:
An ILM strategy should include the following characteristics:
Business-centric: It should be integrated with key processes, applications, and initiatives of the
business to meet both current and future growth in information.
Centrally managed: All the information assets of a business should be under the purview of the
ILM strategy.
Policy-based: The implementation of ILM should not be restricted to a few departments. ILM
should be implemented as a policy and encompass all business applications, processes, and
resources.
Heterogeneous: An ILM strategy should take into account all types of storage platforms and
operating systems.
Optimized: Because the value of information varies, an ILM strategy should consider the
different storage requirements and allocate storage resources based on the information’s value to
the business.
Implementing an ILM strategy has the following key benefits that directly address the challenges
of information management:
Improved utilization by using tiered storage platforms and increased visibility of all enterprise
information.
Simplified management by integrating process steps and interfaces with individual tools and by
increasing automation.
A wider range of options for backup and recovery to balance the need for business continuity.
Maintaining compliance by knowing what data needs to be protected for what length of time.
Lower Total Cost of Ownership (TCO) by aligning the infrastructure and management costs
with information value. As a result, resources are not wasted, and complexity is not introduced
by managing low-value data at the expense of high-value data.
SOLUTION 3
Disk Drive Components: A disk drive uses a rapidly moving arm to read and write data across a
flat platter coated with magnetic particles. Data is transferred from the magnetic platter through
the R/W head to the computer. Several platters are assembled together with the R/W head and
controller, most commonly referred to as a hard disk drive (HDD). Data can be recorded and
erased on a magnetic disk any number of times. This section details the different components of
the disk, the mechanism for organizing and storing data on disks, and the factors that affect disk
performance. Key components of a disk drive are platter, spindle, read/write head, actuator arm
assembly, and controller.
Figure: Disk Drive Components
Platter
A typical HDD consists of one or more flat circular disks called platters (Figure 2-3). The data is
recorded on these platters in binary codes (0s and 1s). The set of rotating platters is sealed in a
case, called a Head Disk Assembly (HDA). A platter is a rigid, round disk coated with magnetic
material on both surfaces (top and bottom). The data is encoded by polarizing the magnetic area,
or domains, of the disk surface. Data can be written to or read from both surfaces of the platter.
The number of platters and the storage capacity of each platter determine the total capacity of the
drive.
Figure: Spindle and platter
Spindle
A spindle connects all the platters, as shown in the figure above, and is connected to a motor. The
motor of the spindle rotates at a constant speed. The disk platter spins at a speed of several
thousands of revolutions per minute (rpm). Disk drives have spindle speeds of 7,200 rpm, 10,000
rpm, or 15,000 rpm. Disks used on current storage systems have a platter diameter of 3.5” (90
mm). When the platter spins at 15,000 rpm, the outer edge is moving at around 25 percent of the
speed of sound. The speed of the platter is increasing with improvements in technology, although
the extent to which it can be improved is limited.
Read/Write Head
Read/Write (R/W) heads read and write data from or to a platter. Drives have two R/W heads
per platter, one for each surface of the platter. The R/W head changes the magnetic polarization
on the surface of the platter when writing data. While reading data, this head detects magnetic
polarization on the surface of the platter. During reads and writes, the R/W head senses the
magnetic polarization and never touches the surface of the platter. When the spindle is rotating,
there is a microscopic air gap between the R/W heads and the platters, known as the head flying
height. This air gap is removed when the spindle stops rotating and the R/W head rests on a
special area on the platter near the spindle. This area is called the landing zone. The landing zone
is coated with a lubricant to reduce friction between the head and the platter. The logic on the
disk drive ensures that heads are moved to the landing zone before they touch the surface. If the
drive malfunctions and the R/W head accidentally touches the surface of the platter outside the
landing zone, a head crash occurs. In a head crash, the magnetic coating on the platter is
scratched and may cause damage to the R/W head. A head crash generally results in data loss.
Figure: Actuator arm assembly
Actuator Arm Assembly
The R/W heads are mounted on the actuator arm assembly, which positions the R/W head at the
location on the platter where the data needs to be written or read. The R/W heads for all platters
on a drive are attached to one actuator arm assembly and move across the platters
simultaneously.
Controller
The controller is a printed circuit board, mounted at the bottom of a disk drive. It consists of a
microprocessor, internal memory, circuitry, and firmware. The firmware controls power to the
spindle motor and the speed of the motor. It also manages communication between the drive and
the host. In addition, it controls the R/W operations by moving the actuator arm and switching
between different R/W heads, and performs the optimization of data access.
Physical Disk Structure
Data on the disk is recorded on tracks, which are concentric rings on the platter around the
spindle. The tracks are numbered, starting from zero, from the outer edge of the platter. The
number of tracks per inch (TPI) on the platter (or the track density) measures how tightly the
tracks are packed on a platter.
Each track is divided into smaller units called sectors. A sector is the smallest, individually
addressable unit of storage. The track and sector structure is written on the platter by the drive
manufacturer using a formatting operation. The number of sectors per track varies according to
the specific drive. The first personal computer disks had 17 sectors per track. Recent disks have a
much larger number of sectors on a single track.
SOLUTION: 4
A disk drive is an electromechanical device that governs the overall performance of the storage
system environment. The various factors that affect the performance of disk drives are discussed
below:
Disk Service Time
Disk service time is the time taken by a disk to complete an I/O request. Components that
contribute to service time on a disk drive are seek time, rotational latency, and data transfer rate.
Seek Time
The seek time (also called access time) describes the time taken to position the R/W heads across
the platter with a radial movement (moving along the radius of the platter). In other words, it is
the time taken to reposition and settle the arm and the head over the correct track. The lower the
seek time, the faster the I/O operation. Disk vendors publish the following seek time
specifications:
Full Stroke: The time taken by the R/W head to move across the entire width of the disk, from
the innermost track to the outermost track.
Average: The average time taken by the R/W head to move from one random track to another,
normally listed as the time for one-third of a full stroke.
Track-to-Track: The time taken by the R/W head to move between adjacent tracks.
Each of these specifications is measured in milliseconds. The average seek time on a modern disk is
typically in the range of 3 to 15 milliseconds. Seek time has more impact on reads of random
tracks than on reads of adjacent tracks. To minimize the seek time, data can be written to
only a subset of the available cylinders. This results in lower usable capacity than the actual
capacity of the drive. For example, a 500 GB disk drive is set up to use only the first 40 percent
of the cylinders and is effectively treated as a 200 GB drive. This is known as short-stroking the
drive.
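The short-stroking arithmetic can be checked with a few lines of Python. This is a rough sketch that assumes, as the example above does, that usable capacity is proportional to the fraction of cylinders used; the function name is hypothetical.

```python
def short_stroked_capacity_gb(total_gb: float, fraction_used: float) -> float:
    """Usable capacity when only a leading fraction of the cylinders is used."""
    if not 0.0 < fraction_used <= 1.0:
        raise ValueError("fraction_used must be in the range (0, 1]")
    return total_gb * fraction_used


usable = short_stroked_capacity_gb(500, 0.40)
print(f"{usable:.0f} GB usable out of 500 GB")
```

The capacity given up is the price paid for shorter head movements and therefore lower seek times.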
Rotational Latency
To access data, the actuator arm moves the R/W head over the platter to a particular track while
the platter spins to position the requested sector under the R/W head. The time taken by the
platter to rotate and position the data under the R/W head is called rotational latency. This
latency depends on the rotation speed of the spindle and is measured in milliseconds. The
average rotational latency is one-half of the time taken for a full rotation. Similar to the seek
time, rotational latency has more impact on the reading/writing of random sectors on the disk
than on the same operations on adjacent sectors. Average rotational latency is around 5.5 ms for
a 5,400-rpm drive, and around 2.0 ms for a 15,000-rpm drive.
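The rule that average rotational latency is half of one full rotation can be written as a short calculation. The function name below is only illustrative.

```python
def average_rotational_latency_ms(rpm: float) -> float:
    """Half of the time taken for one full rotation, in milliseconds."""
    full_rotation_ms = 60_000.0 / rpm   # 60,000 ms per minute / rotations per minute
    return full_rotation_ms / 2.0


for rpm in (5_400, 7_200, 10_000, 15_000):
    print(f"{rpm:>6} rpm: {average_rotational_latency_ms(rpm):.2f} ms")
```

For 5,400 rpm this gives about 5.56 ms and for 15,000 rpm it gives 2.00 ms, in line with the figures quoted above.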
Data Transfer Rate
The data transfer rate (also called transfer rate) refers to the average amount of data per unit time
that the drive can deliver to the HBA. It is important to first understand the process of read and
write operations in order to calculate data transfer rates. In a read operation, the data first moves
from disk platters to R/W heads, and then it moves to the drive’s internal buffer. Finally, data
moves from the buffer through the interface to the host HBA. In a write operation, the data
moves from the HBA to the internal buffer of the disk drive through the drive’s interface. The
data then moves from the buffer to the R/W heads. Finally, it moves from the R/W heads to the
platters. The data transfer rates during the R/W operations are measured in terms of internal and
external transfer rates. Internal transfer rate is the speed at which data moves from a single track
of a platter’s surface to internal buffer (cache) of the disk. Internal transfer rate takes into
account factors such as the seek time. External transfer rate is the rate at which data can be
moved through the interface to the HBA. External transfer rate is generally the advertised speed
of the interface, such as 133 MB/s for ATA. The sustained external transfer rate is lower than the
interface speed.
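Putting the three components together, the disk service time for one I/O can be estimated as seek time plus average rotational latency plus the time to transfer the block. The sketch below is an approximation under stated assumptions: the input values (5 ms seek, 15,000 rpm, 32 KB block, 40 MB/s) are made up for illustration, and 1 MB is taken as 1024 KB.

```python
def disk_service_time_ms(seek_ms: float, rpm: float,
                         block_kb: float, transfer_mb_per_s: float) -> float:
    """Estimate the time to serve one I/O request, in milliseconds."""
    rotational_latency_ms = 0.5 * 60_000.0 / rpm            # half a rotation
    transfer_ms = (block_kb / 1024.0) / transfer_mb_per_s * 1000.0
    return seek_ms + rotational_latency_ms + transfer_ms


t = disk_service_time_ms(seek_ms=5.0, rpm=15_000, block_kb=32, transfer_mb_per_s=40)
print(f"service time: {t:.2f} ms")   # 5 + 2 + 0.78 ms
```

With these numbers the mechanical parts (seek and rotation) account for most of the service time, which is why they matter so much for random I/O.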
SOLUTION 5
DAS: The most common form of server storage today is still direct attached storage. The disks
may be internal to the server or they may be in an array that is connected directly to the server.
Either way, the storage can be accessed only through that server. An application server will have
its own storage; the next application server will have its own storage; and the file and print
servers will each have their own storage. Backups must either be performed on each individual
server with a dedicated tape drive or across the LAN to a shared tape device consuming a
significant amount of bandwidth. Storage can only be added by taking down the application
server, adding physical disks, and rebuilding the storage array. When a server is upgraded, its
data must be migrated to the new server.
In small installations this setup can work well, but it becomes much more difficult to manage as
the number of servers increases. Backups become more challenging, and because storage is not
shared anywhere, storage utilization is typically very low in some servers and overflowing in
others. Disk storage is physically added to a server based on the predicted needs of the
application. If that application is underutilized, then capital cost has been unnecessarily tied up.
If the application runs out of storage, it must be taken down and rebuilt after more disk storage
has been added. In the case of file services, the typical response to a full server is to add another
file server. This adds more storage, but all the clients must now be reset to point to the new
network location, adding complexity to the client side. Additional cost in the form of Client
Access Licenses (CAL) must also be taken into account.
NAS: NAS challenges the traditional file server approach by creating systems designed
specifically for data storage. Instead of starting with a general-purpose computer and configuring
or removing features from that base, NAS designs begin with the bare-bones components
necessary to support file transfers and add features "from the bottom up."
Like traditional file servers, NAS follows a client/server design. A single hardware device, often
called the NAS box or NAS head, acts as the interface between the NAS and network clients.
These NAS devices require no monitor, keyboard or mouse. They generally run an embedded
operating system rather than a full-featured NOS. One or more disk (and possibly tape) drives
can be attached to many NAS systems to increase total capacity. Clients always connect to the
NAS head, however, rather than to the individual storage devices.
Clients generally access a NAS over an Ethernet connection. The NAS appears on the network as
a single "node" that is the IP address of the head device.
A NAS can store any data that appears in the form of files, such as email boxes, Web content,
remote system backups, and so on. Overall, the uses of a NAS parallel those of traditional file
servers.
NAS systems strive for reliable operation and easy administration. They often include built-in
features such as disk space quotas, secure authentication, or the automatic sending of email alerts
should an error be detected.
A NAS appliance is a simplified form of file server; it is optimized for file sharing in an
organization. Authorized clients can see folders and files on the NAS device just as they can on
their local hard drive. NAS appliances are so called because they have all of the required
software preloaded and they are easy to install and simple to use. Installation consists of rack
mounting, connecting power and Ethernet, and configuring via a simple browser-based tool.
Installation is typically achieved in less than half an hour. NAS devices are frequently used to
consolidate file services. To prevent the proliferation of file servers, a single NAS appliance can
replace many regular file servers, simplifying management and reducing cost and workload for
the systems administrator. NAS appliances are also multiprotocol, which means that they can
share files among clients using Windows® and UNIX®-based operating systems.
SAN : A SAN allows more than one application server to share storage. Data is stored at a block
level and can therefore be accessed by an application, not directly by clients. The physical
elements of the SAN (servers, switches, storage arrays, etc.) are typically connected with Fibre
Channel, an interconnect technology that permits high-performance resource sharing. Backups
can be performed centrally and can be more easily managed to avoid interrupting the
applications. The primary advantages of a SAN are its scalability and flexibility. Storage can be
added without disrupting the applications, and different types of storage can be added to the
pool. With the advent of storage area networks, adding storage capacity has become simplified
for systems administrators, because it’s no longer necessary to bring down the application server.
Additional storage is simply added to the SAN, and the new storage can then be configured and
made immediately available to those applications that need it. Upgrading the application server is
also simplified; the data can remain on the disk arrays, and the new server just needs to point to
the appropriate data set. Backups can be centralized, reducing workload and providing greater
assurance that the backups are complete. The time taken for backups is dramatically reduced
because the backup is performed over the high speed SAN, and no backup traffic ever impacts
users on the LAN. However, the actual implementation of a SAN can be quite daunting, given
the cost and complexity of Fibre Channel infrastructure components. For this reason, SAN
installations have primarily been confined to large organizations with dedicated storage
management resources. The last few years have seen the emergence of iSCSI (which means
SCSI over IP or Internet Protocol) as a new interconnect for a SAN. iSCSI is a lower cost
alternative to Fibre Channel SAN infrastructure and is an ideal solution for many small and
medium sized businesses. Essentially all of the same capability of FC-SAN is provided, but the
interconnect is Ethernet cable and the switches are Gigabit Ethernet – the same low-cost
technology you use today for your LAN. The tradeoff is slightly lower performance, but most
businesses simply will not notice. iSCSI also provides a simple migration path for businesses to
use more comprehensive data storage management technologies without the need for a “fork-lift”
upgrade.
SOLUTION 6:
The Central Authentication Service (CAS) is a single sign-on protocol for the web. Its purpose is
to permit a user to access multiple applications while providing their credentials (such as userid
and password) only once. It also allows web applications to authenticate users without gaining
access to a user's security credentials, such as a password. The name CAS also refers to a
software package that implements this protocol.
The CAS protocol involves at least three parties: a client web browser, the web application
requesting authentication, and the CAS server. It may also involve a back-end service, such as a
database server, that does not have its own HTTP interface but communicates with a web
application.
When the client visits an application desiring to authenticate to it, the application redirects it to
CAS. CAS validates the client's authenticity, usually by checking a username and password
against an authentication source (such as Kerberos or Active Directory).
If the authentication succeeds, CAS returns the client to the application, passing along a security
ticket. The application then validates the ticket by contacting CAS over a secure connection and
providing its own service identifier and the ticket. CAS then gives the application trusted
information about whether a particular user has successfully authenticated.
CAS allows multi-tier authentication via proxy address. A cooperating back-end service, like a
database or mail server, can participate in CAS, validating the authenticity of users via
information it receives from web applications. Thus, a webmail client and a webmail server can
both implement CAS.
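The login and ticket validation steps described above can be sketched with a toy, in-memory model. This is not the real CAS software or its API: the class `ToyCasServer`, its method names, and the service URL are hypothetical, passwords are kept in plain text only to keep the demo short, and the sketch assumes a ticket is usable once and only for the service it was issued for.

```python
import secrets


class ToyCasServer:
    """In-memory stand-in for a CAS server (illustration only)."""

    def __init__(self, users):
        self._users = dict(users)   # username -> password
        self._tickets = {}          # ticket -> (service, username)

    def login(self, username, password, service):
        """Check credentials and issue a ticket bound to the given service."""
        if self._users.get(username) != password:
            return None
        ticket = "ST-" + secrets.token_hex(8)
        self._tickets[ticket] = (service, username)
        return ticket

    def validate(self, service, ticket):
        """Called by the application; returns the username or None."""
        entry = self._tickets.pop(ticket, None)   # a ticket is consumed on use
        if entry is None or entry[0] != service:
            return None
        return entry[1]


cas = ToyCasServer({"alice": "s3cret"})
service = "https://app.example.org/"
ticket = cas.login("alice", "s3cret", service)   # user logs in at CAS
print(cas.validate(service, ticket))             # application validates the ticket
print(cas.validate(service, ticket))             # replaying the ticket fails
```

Note that the application only ever handles the ticket and the validated username; the password is seen by the CAS server alone.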
SOLUTION 7
Disaster recovery principles for any organization
Disaster recovery is becoming increasingly important for businesses aware of the threat of both
man-made and natural disasters. Having a disaster recovery plan will not only protect your
organization’s essential data from destruction, it will help you refine your business processes and
enable your business to recover its operations in the event of a disaster. Though each
organization has unique knowledge and assets to maintain, general principles can be applied to
disaster recovery. This set of planning guidelines can assist your organization in moving forward
with an IT disaster recovery project.
Restoration and recovery procedures
Imagine that a disaster has occurred. You have the data, now what should you do with it? If you
don’t have any restoration and recovery procedures, your data won’t be nearly as useful to you.
With the data in hand, you need to be able to re-create your entire business from brand-new
systems. You’re going to need procedures for rebuilding systems and networks. System recovery
and restoration procedures are typically best written by the people that currently administer and
maintain the systems. Each system should have recovery procedures that indicate which versions
of software and patches should be installed on which types of hardware platforms. It's also
important to indicate which configuration files should be restored into which directories. A good
procedure will include low-level file execution instructions, such as what commands to type and
in what order to type them.
Backups are key
As an IT or network administrator, you need to bring all your key data, processes, and
procedures together through a backup system that is reliable and easy to replicate. Your IT
director's most important job is to ensure that all systems are being backed up on a reliable
schedule. This process, though it seems obvious, is often not realized. Assigning backup
responsibilities to an administrator is not enough. The IT department needs to have a written
schedule that describes which systems get backed up when and whether the backups are full or
incremental. You also need to have the backup process fully documented. Finally, test your
backup process to make sure it works. Can you restore lost databases? Can you restore lost
source code? Can you restore key system files?
Finally, you need to store your backup media off-site, preferably in a location at least 50 miles
from your present office. Numerous off-site storage vendors offer safe media storage. Iron
Mountain is one example. Even if you’re using an off-site storage vendor, it doesn't hurt to send
your weekly backup media to another one of your field offices, if you have one.
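A written backup schedule of the kind described above can be as simple as a table of systems and their full-backup days. The sketch below is one hypothetical way to encode it: the system names, the chosen weekdays, and the `backup_type` function are invented for illustration.

```python
import datetime

# Hypothetical schedule: system name -> weekday of the full backup (0 = Monday).
FULL_BACKUP_DAY = {"db-server": 6, "file-server": 5, "source-repo": 6}


def backup_type(system: str, day: datetime.date) -> str:
    """Return 'full' on the system's scheduled day, otherwise 'incremental'."""
    if system not in FULL_BACKUP_DAY:
        raise KeyError(f"{system} is missing from the backup schedule")
    return "full" if day.weekday() == FULL_BACKUP_DAY[system] else "incremental"


today = datetime.date(2024, 1, 7)   # a Sunday
for system in FULL_BACKUP_DAY:
    print(system, "->", backup_type(system, today))
```

Raising an error for an unlisted system is deliberate: a server that is not in the schedule is a server that is not being backed up.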
SNMP:
Simple Network Management Protocol (SNMP) is an "Internet-standard protocol for managing
devices on IP networks". Devices that typically support SNMP include routers, switches, servers,
workstations, printers, modem racks, and more. It is used mostly in network management
systems to monitor network-attached devices for conditions that warrant administrative attention.
SNMP is a component of the Internet Protocol Suite as defined by the Internet Engineering Task
Force (IETF). It consists of a set of standards for network management, including an application
layer protocol, a database schema, and a set of data objects.
SNMP exposes management data in the form of variables on the managed systems, which
describe the system configuration. These variables can then be queried (and sometimes set) by
managing applications.
SOLUTION 8
An SNMP-managed network consists of three key components:
Managed device
Agent: software which runs on managed devices
Network management system (NMS): software which runs on the manager
A managed device is a network node that implements an SNMP interface that allows
unidirectional (read-only) or bidirectional access to node-specific information. Managed devices
exchange node-specific information with the NMSs. Sometimes called network elements, the
managed devices can be any type of device, including, but not limited to, routers, access servers,
switches, bridges, hubs, IP telephones, IP video cameras, computer hosts, and printers.
An agent is a network-management software module that resides on a managed device. An agent
has local knowledge of management information and translates that information to or from an
SNMP specific form.
A network management system (NMS) executes applications that monitor and control managed
devices. NMSs provide the bulk of the processing and memory resources required for network
management. One or more NMSs may exist on any managed network.
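The relationship between the three components can be illustrated with a toy model. No SNMP packets are sent here and no SNMP library is used: `ToyAgent` and `ToyNms` are hypothetical classes that only mimic the roles, with an agent exposing variables on a managed device and an NMS querying them. The OID shown is the standard MIB-II sysName.0 object.

```python
class ToyAgent:
    """Software on a managed device that exposes variables keyed by OID."""

    def __init__(self, variables, writable=()):
        self._variables = dict(variables)
        self._writable = set(writable)   # OIDs that allow bidirectional access

    def get(self, oid):
        return self._variables[oid]

    def set(self, oid, value):
        if oid not in self._writable:
            raise PermissionError(f"{oid} is read-only")
        self._variables[oid] = value


class ToyNms:
    """Manager-side software that polls the agents it knows about."""

    def __init__(self):
        self._agents = {}   # device name -> agent

    def register(self, device, agent):
        self._agents[device] = agent

    def poll(self, oid):
        return {device: agent.get(oid) for device, agent in self._agents.items()}


SYS_NAME = "1.3.6.1.2.1.1.5.0"   # sysName.0
router = ToyAgent({SYS_NAME: "router-1"}, writable={SYS_NAME})
printer = ToyAgent({SYS_NAME: "printer-7"})          # read-only device
nms = ToyNms()
nms.register("router", router)
nms.register("printer", printer)
router.set(SYS_NAME, "core-router")
print(nms.poll(SYS_NAME))
```

The printer agent rejects any `set` call, which corresponds to the unidirectional (read-only) access mentioned above.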
SMI-S: SMI-S, or the Storage Management Initiative – Specification, is a storage standard
developed and maintained by the Storage Networking Industry Association (SNIA). It has also
been ratified as an ISO standard. SMI-S is based upon the Common Information Model and the
Web-Based Enterprise Management standards defined by the Distributed Management Task
Force, which define management functionality via HTTP. The most recent approved version of
SMI-S is available at the SNIA.
The main objective of SMI-S is to enable broad interoperable management of heterogeneous
storage vendor systems. The current version is SMI-S V1.6.0. Over 75 software products and
over 800 hardware products are certified as conformant to SMI-S.
At a very basic level, SMI-S entities are divided into two categories:
Clients are management software applications that can reside virtually anywhere within a
network, provided they have a communications link (either within the data path or outside
the data path) to providers.
Servers are the devices under management. Servers can be disk arrays, virtualization engines,
host bus adapters, switches, tape drives, etc.
CIM: The Common Information Model (CIM) is an open standard that defines how managed
elements in an IT environment are represented as a common set of objects and relationships
between them. This is intended to allow consistent management of these managed elements,
independent of their manufacturer or provider.
One way to describe CIM is to say that it allows multiple parties to exchange management
information about these managed elements. However, this falls short in expressing that CIM not
only represents these managed elements and the management information, but also provides
means to actively control and manage these elements. By using a common model of information,
management software can be written once and work with many implementations of the common
model without complex and costly conversion operations or loss of information.
The CIM standard is defined and published by the Distributed Management Task Force (DMTF).
A related standard is Web-Based Enterprise Management (WBEM, also defined by DMTF)
which defines a particular implementation of CIM, including protocols for discovering and
accessing such CIM implementations.
The CIM standard includes the CIM Infrastructure Specification and the CIM Schema:
CIM Infrastructure Specification
The CIM Infrastructure Specification defines the architecture and concepts of CIM, including a
language by which the CIM Schema (including any extension schema) is defined, and a method
for mapping CIM to other information models, such as SNMP. The CIM architecture is based
upon UML, so it is object-oriented: The managed elements are represented as CIM classes and
any relationships between them are represented as CIM associations. Inheritance allows
specialization of common base elements into more specific derived elements.
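The object-oriented idea described above can be pictured with a minimal Python sketch. The class
and attribute names below (ManagedElement, ComputerSystem, StorageVolume, Association, os_name,
capacity_gb) are simplified, hypothetical stand-ins chosen for illustration; they are not
definitions taken from the actual CIM Schema.

```python
from dataclasses import dataclass


@dataclass
class ManagedElement:
    """Hypothetical common base for anything that can be managed."""
    name: str


@dataclass
class ComputerSystem(ManagedElement):
    """Derived element: specializes the base with host-specific data."""
    os_name: str = ""


@dataclass
class StorageVolume(ManagedElement):
    """Derived element: specializes the base with storage-specific data."""
    capacity_gb: int = 0


@dataclass
class Association:
    """A relationship between two managed elements, modeled as its own object."""
    antecedent: ManagedElement
    dependent: ManagedElement


def describe(element: ManagedElement) -> str:
    # Management code written against the base class works for every derived class.
    return f"{type(element).__name__}: {element.name}"
```

Because describe() depends only on the common base, it works unchanged for any derived element,
which mirrors the point that management software can be written once against a common model.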
CIM Schema
The CIM Schema is a conceptual schema which defines the specific set of objects and
relationships between them that represent a common base for the managed elements in an IT
environment. The CIM Schema covers most of today's elements in an IT environment, for
example computer systems, operating systems, networks, middleware, services and storage. The
CIM Schema defines a common basis for representing these managed elements. Since most
managed elements have product and vendor specific behavior, the CIM Schema is extensible in
order to allow the producers of these elements to represent their specific features seamlessly
together with the common base functionality defined in the CIM Schema.
Sol. 9 Cloud Computing
In cloud computing, large, accessible computing resources are provided “as a service” to users
over the Internet; it is an Internet-based approach to system development. It includes SaaS, web
infrastructure, and other web technologies, and it is attracting growing interest from industry
and the research community.
Published work on building a cloud computing platform explains its construction and the problems
that arise along the way. A GFS-compatible file system is designed with chunks of different
sizes to support the processing of huge data sets, and implementation enhancements to MapReduce
are introduced to improve the system's throughput. Some open issues are also discussed. The
platform is implemented for a specific domain of cloud computing services, with large-scale web
text mining as the final application.
Cloud computing is a new model for delivering and hosting services on the Internet. It removes
the need for users to plan ahead for provisioning, which makes it attractive to the business
community. A deployment can start at a very small scale and grow as more resources are needed.
Cloud computing provides many opportunities for the IT industry, but many issues related to it
remain open. The points below define cloud computing and outline its key concepts,
state-of-the-art implementation, and architectural principles.
Cloud computing is not a new technology, which is why there are different perceptions of it.
Rather, it is an operations model that brings together existing technologies to run businesses
in different ways. Virtualization and utility-based pricing, for example, are not new; cloud
computing uses them to meet economic requirements and demands. Grid computing is a distributed
computing model that organizes network resources to attain common computational objectives.
Cloud applications reside in the application layer, the highest level of the hierarchy. To
achieve better performance and availability and a lower operating cost, cloud applications can
use the automatic scaling feature.
In a public cloud, service providers offer their resources as services to the general public.
Private clouds, also known as internal clouds, are built for use by a single organization. A
hybrid cloud is a combination of the private and public cloud models that tries to address the
limitations of each approach. A hybrid cloud is more flexible than private and public clouds.
Sol. 10
A RAID set is a group of disks. Within each disk, a predefined number of contiguously
addressable disk blocks are defined as strips. The set of aligned strips
that spans across all the disks within the RAID set is called a stripe. Figure 3-2 shows physical
and logical representations of a striped RAID set.
Striped RAID set
Strip size (also called stripe depth) describes the number of blocks in a strip, and is the maximum
amount of data that can be written to or read from a single HDD in the set before the next HDD
is accessed, assuming that the accessed data starts at the beginning of the strip. Note that all
strips in a stripe have the same number of blocks, and decreasing strip size means that data is
broken into smaller pieces when spread across the disks. Stripe size is the strip size multiplied
by the number of HDDs in the RAID set. Stripe width refers to the number of data strips in a stripe.
Striped RAID does not protect data unless parity or mirroring is used. However, striping may
significantly improve I/O performance. Depending on the type of RAID implementation, the
RAID controller can be configured to access data across multiple HDDs simultaneously.
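The relationship between strip size, stripe size, and the placement of blocks can be sketched in
Python. This is a minimal illustration that assumes a plain striped set with no parity and a
simple round-robin layout; the function names stripe_size and locate_block are hypothetical and
do not correspond to any particular RAID controller.

```python
def stripe_size(strip_size_blocks: int, num_disks: int) -> int:
    """Stripe size = strip size multiplied by the number of HDDs in the set."""
    return strip_size_blocks * num_disks


def locate_block(logical_block: int, strip_size_blocks: int, num_disks: int):
    """Map a logical block number to (disk index, block offset on that disk).

    Assumes strips are laid out round-robin: strip 0 on disk 0,
    strip 1 on disk 1, and so on, wrapping around to disk 0.
    """
    strip_number = logical_block // strip_size_blocks    # strip index across the whole set
    offset_in_strip = logical_block % strip_size_blocks  # position inside that strip
    disk = strip_number % num_disks                       # which HDD holds the strip
    stripe_number = strip_number // num_disks             # which stripe (row) it belongs to
    block_on_disk = stripe_number * strip_size_blocks + offset_in_strip
    return disk, block_on_disk
```

With a strip size of 4 blocks and 3 HDDs, stripe_size(4, 3) gives 12 blocks, and
locate_block(13, 4, 3) places logical block 13 on disk 0 at block offset 5. A smaller strip size
moves consecutive blocks to the next disk sooner, which is what "broken into smaller pieces"
means above.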
Mirroring
Mirroring is a technique whereby data is stored on two different HDDs, yielding two copies of
data. In the event of one HDD failure, the data is intact on the surviving HDD (see Figure 3-3)
and the controller continues to service the host’s data requests from the surviving disk of a
mirrored pair.
Mirrored disks in an array
When the failed disk is replaced with a new disk, the controller copies the data from the
surviving disk of the mirrored pair. This activity is transparent to the host. In addition to
providing complete data redundancy, mirroring enables faster recovery from disk failure.
However, disk mirroring provides only data protection and is not a substitute for data backup.
Mirroring constantly captures changes in the data, whereas a backup captures point-in-time
images of data. Mirroring involves duplication of data — the amount of storage capacity needed
is twice the amount of data being stored. Therefore, mirroring is considered expensive and is
preferred for mission-critical applications that cannot afford data loss. Mirroring improves read
performance because read requests can be serviced by both disks. However, write performance
deteriorates, as each write request manifests as two writes on the HDDs. In other words,
mirroring does not deliver the same levels of write performance as a striped RAID.
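The behavior described above (every write goes to both disks, reads are served by a surviving
disk, and a replacement disk is rebuilt from the survivor) can be sketched as a small in-memory
model. The MirroredPair class and its method names are hypothetical and only illustrate the idea;
a real controller works on physical disks and handles many more cases.

```python
class MirroredPair:
    """Toy model of a mirrored pair; each disk is a dict of block -> data."""

    def __init__(self):
        self.disks = [{}, {}]
        self.failed = [False, False]

    def write(self, block, data):
        # One host write becomes one write per working disk (two in normal operation).
        for i in range(2):
            if not self.failed[i]:
                self.disks[i][block] = data

    def read(self, block):
        # Any working disk can serve the read.
        for i in range(2):
            if not self.failed[i]:
                return self.disks[i][block]
        raise IOError("both disks of the mirrored pair have failed")

    def fail(self, i):
        # Simulate the loss of disk i and everything stored on it.
        self.failed[i] = True
        self.disks[i] = {}

    def replace(self, i):
        # Rebuild the new disk by copying from the surviving disk.
        survivor = 1 - i
        if self.failed[survivor]:
            raise IOError("no surviving disk to copy from")
        self.disks[i] = dict(self.disks[survivor])
        self.failed[i] = False
```

The model also shows the cost: both dicts always hold the same data, so the capacity used is
twice the amount of data stored.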
Parity
Parity is a method of protecting striped data from HDD failure without the cost of mirroring. An
additional HDD is added to the stripe width to hold parity, a mathematical construct that allows
re-creation of the missing data. Parity is a redundancy check that ensures full protection of data
without maintaining a full set of duplicate data. Parity information can be stored on separate,
dedicated HDDs or distributed across all the drives in a RAID set. Figure 3-4 shows a parity
RAID. The first four disks, labeled D, contain the data. The fifth disk, labeled P, stores the parity
information, which in this case is the sum of the elements in each row. Now, if one of the Ds
fails, the missing value can be calculated by subtracting the sum of the rest of the elements from
the parity value.
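The sum-and-subtract description above can be written out as a short Python sketch. The function
names are hypothetical. The arithmetic sum follows the simplified explanation in the text; real
RAID implementations commonly compute parity with a bitwise XOR instead, so an XOR variant is
included for comparison.

```python
from functools import reduce
from operator import xor


def sum_parity(data):
    """Parity as described in the text: the sum of the data elements in a row."""
    return sum(data)


def rebuild_from_sum(surviving, parity):
    """Recover the missing element by subtracting the surviving elements from the parity."""
    return parity - sum(surviving)


def xor_parity(data):
    """XOR-based parity over the data elements in a row."""
    return reduce(xor, data, 0)


def rebuild_from_xor(surviving, parity):
    """XOR the parity with the surviving elements to recover the missing one."""
    return reduce(xor, surviving, parity)
```

For a row of data values 3, 1, 2, 3 the sum parity is 9. If the disk holding the value 2 fails,
9 - (3 + 1 + 3) = 2 recovers it. Either method can rebuild only one missing element per row,
which is why a single parity disk protects against the failure of one HDD.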