
Presentation of the global web crawling architecture at the Bibliothèque nationale de France
Global architecture of the DLWeb
• Simplified application diagram
• Platforms
• Hardware specification
• Virtual machines
Simplified application diagram
[Diagram: the PILOT (gulliver 101), with its database, GUI, JobManager, Broker, administration interface and viewerProxy, sends jobs to the CRAWLERS (gulliver 105 - 134), which harvest the Internet; a DEDUP INDEX SERVER runs the Indexer; harvested data is kept on a storage array.]
Platforms
Simplified application diagram
[Diagram: an Operational Platform grouping the PILOT (databases, GUI, JobManager, Broker, ViewerProxy), the DEDUP INDEX SERVER with an indexer master and several indexers, and a NAS.]
Platforms
Operational Platform: PFO
1 PILOT, 1 deduplication indexer, 1 indexer master, 2 to 8 indexers, 30 to 60 CRAWLERS (see the sketch below).
• A variable and scalable number of computers
• A need to rationalise and share hardware resources
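As a rough illustration only (not an actual BnF configuration file), the PFO composition can be written down as a small Python structure; the field names are hypothetical, only the numbers come from the slide above.

```python
# Hypothetical sketch of the PFO sizing: (minimum, maximum) machines per role.
PFO = {
    "pilot": (1, 1),
    "deduplication_indexer": (1, 1),
    "indexer_master": (1, 1),
    "indexers": (2, 8),
    "crawlers": (30, 60),
}

def total_vms(platform):
    """Lower and upper bounds on the number of machines a platform needs."""
    lows, highs = zip(*platform.values())
    return sum(lows), sum(highs)

print(total_vms(PFO))  # (35, 71) -> hence the need for flexible, shared hardware
```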
Trial Run Platform: PFO-MAB
With a setup identical to the PFO, the PFO-MAB (MAB = Marche A Blanc = trial run) aims to simulate and test harvests in real conditions for our curators' team (DLN). It is used, for example, to refine and/or test new crawler configurations, to run load tests, etc. Its size is also variable and subject to change.
Pre-production Platform: PFP
The PFP, for its part, is a technical test platform for our engineering team (DSI). It allows us to validate architectural choices, to test deployments, etc.
Platforms
Our needs:
• Flexibility regarding the number of CRAWLERS allocated to a platform.
• Hardware resource sharing and optimisation.
• All the classical needs of production environments, such as robustness and reliability.
Solution: virtualisation!
A hypervisor provides:
• Virtual computers
• Configuration « templates »
• Resource pools grouping the computers
• Automatic management of all shared resources
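A minimal sketch of what « templates » and resource pools allow in practice, assuming the cluster is driven through pyVmomi (the VMware vSphere Python SDK); the vCenter address, credentials, template name, pool name and VM name below are placeholders, not the BnF setup.

```python
# Hedged sketch: clone a new crawler VM from a configuration template into a
# resource pool. Every name and credential below is a placeholder.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example", user="admin", pwd="secret",
                  sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

def find(vimtype, name):
    """Return the first managed object of the given type bearing the given name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    return next(o for o in view.view if o.name == name)

template = find(vim.VirtualMachine, "crawler-template")  # configuration template
pool = find(vim.ResourcePool, "PFO")                     # target resource pool

spec = vim.vm.CloneSpec(location=vim.vm.RelocateSpec(pool=pool),
                        powerOn=True, template=False)
template.CloneVM_Task(folder=template.parent, name="new-crawler", spec=spec)
Disconnect(si)
```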
The DL-WEB cluster
[Diagram: seven physical servers (1 to 7) pooling their shared resources in the DL-WEB cluster.]
Physical servers
The chassis can hold up to 14 blade servers plus one media blade. Currently, seven of those servers are dedicated to the DL-WEB.
Blade server
• 4 Ethernet ports -> 4 LAN switches
• 2 HBAs -> 2 SAN (FC) switches
• Secured, redundant SAN connection to the NetApp array
Networking
[Diagram: each virtual server's eth0 and eth1 are attached to the blade's virtual switch, which carries VLAN521 and VLAN522, plus a management port.]
Each blade has a virtual switch configured as follows:
• VLAN521 (eth0, BnF private network): 1 active NIC + 2 spare
• VLAN522 (eth1, direct Internet connection via a dedicated gateway): 1 active NIC + 2 spare
• Management port: 1 dedicated interface + 1 spare
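As an illustration of this layout, a hedged pyVmomi sketch of how such port groups (one VLAN each, 1 active + 2 standby NICs) could be declared on a blade's virtual switch; the switch name and physical NIC names are assumptions, not the actual BnF values.

```python
# Hedged sketch: declare the two VLAN port groups on a blade's virtual switch.
from pyVmomi import vim

def add_portgroup(host, vswitch, name, vlan_id, active_nic, standby_nics):
    """Add a port group carrying one VLAN, with one active and two standby NICs."""
    teaming = vim.host.NetworkPolicy.NicTeamingPolicy(
        nicOrder=vim.host.NetworkPolicy.NicOrderPolicy(
            activeNic=[active_nic], standbyNic=standby_nics))
    spec = vim.host.PortGroup.Specification(
        name=name, vlanId=vlan_id, vswitchName=vswitch,
        policy=vim.host.NetworkPolicy(nicTeaming=teaming))
    host.configManager.networkSystem.AddPortGroup(portgrp=spec)

# blade = a vim.HostSystem object, e.g. located with the find() helper above.
# VLAN 521 = BnF private network (eth0), VLAN 522 = direct Internet access (eth1).
# add_portgroup(blade, "vSwitch0", "VLAN521", 521, "vmnic0", ["vmnic1", "vmnic2"])
# add_portgroup(blade, "vSwitch0", "VLAN522", 522, "vmnic3", ["vmnic4", "vmnic5"])
```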
Dive into the hardware
Physical machines:
• RAM: 2 x 9 DIMMs of 4 GB = 72 GB of RAM per machine
• CPU: 2 sockets; on every socket, 1 CPU with 2 cores and 4 threads, for a total of 16 logical CPUs per machine (2 x 2 x 4 = 16)
Cluster totals:
• RAM: 7 x 72 GB = 504 GB
• CPU: 7 x 16 = 112 logical CPUs
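The totals above follow from straightforward multiplication; a quick check in Python:

```python
# Per-blade figures and cluster totals quoted on this slide.
ram_per_blade_gb = 2 * 9 * 4        # 2 x 9 DIMMs of 4 GB = 72 GB per machine
logical_cpus_per_blade = 2 * 2 * 4  # 2 sockets x 2 cores x 4 threads = 16
blades = 7

print(blades * ram_per_blade_gb)        # 504 GB of RAM in the cluster
print(blades * logical_cpus_per_blade)  # 112 logical CPUs in the cluster
```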
Platforms
Functional choice: 1 machine = 1 crawler
Reasons:
- Flexibility of use: a crawler can move from one platform to another
- Single templates
- CPU power completely dedicated to a crawler if needed
- Robustness: if a machine crashes, only one job is affected
Drawbacks:
- Loss of disk space due to duplicated file systems
- Loss of RAM
Virtual Machines
Administration interface « VMware vSphere Client » for hosts and clusters
[Screenshot: the vSphere Client showing the physical servers, the platforms and the functional groups (PILOT, CRAWLER, INDEX MASTER, INDEXER), the virtual machines hosted on gulliver03, active and inactive machines, and templates.]
DRS and vMotion
A virtual machine is hosted on a single physical server at a given time.
If the load of the VMs hosted on one of the servers becomes too heavy, some of the VMs are moved onto another host dynamically and without interruption.
If one of the hosts fails, all the VMs hosted on this server are moved to other hosts and rebooted.
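DRS triggers such moves automatically; the same live migration can also be requested by hand through the vSphere API. A minimal pyVmomi sketch under the same assumptions as the earlier ones (the VM and host names are placeholders):

```python
# Hedged sketch: live-migrate (vMotion) one VM onto a less loaded blade, without downtime.
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example", user="admin", pwd="secret",
                  sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

def find(vimtype, name):
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    return next(o for o in view.view if o.name == name)

vm = find(vim.VirtualMachine, "new-crawler")
target = find(vim.HostSystem, "blade-02")   # hypothetical destination host
vm.MigrateVM_Task(host=target,
                  priority=vim.VirtualMachine.MovePriority.defaultPriority)
```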
Fault tolerance (FT)
• An active copy of the FT VM runs on another server.
• If the server where the master VM is hosted fails, the ghost VM instantly takes control without interruption.
• A copy is then created on a third server; this process takes some time (up to a few minutes).
• The other VMs are moved and restarted as explained previously.
FT can be quite greedy in terms of resources, especially network consumption, which explains why we have activated this functionality only for the PILOT.
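For completeness, a hedged sketch of turning FT on for the PILOT VM through the vSphere API; vSphere then creates the secondary ("ghost") copy on another host. The VM name is illustrative and the find() helper from the previous sketches is reused.

```python
# Hedged sketch: enable Fault Tolerance on the PILOT virtual machine.
pilot = find(vim.VirtualMachine, "PILOT")   # illustrative name, see find() above
pilot.CreateSecondaryVM_Task()              # vSphere places the secondary copy itself
```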
Multiple advantages
• Flexibility: dynamically add/remove/move virtual machines
• Easily modify/adapt the hardware configuration (e.g. add a CPU)
• Optimisation and mutualisation of hardware resources
• Centralised administration for managing and supervising our server fleet
• Easy increase of resources if needed (by adding one or more blades to the cluster)
• More reliable physical servers
• Some key machines can be secured even further with the FT functionality