Presentation of the global web crawling architecture at the Bibliothèque nationale de France

Global architecture of the DLWeb
• Simplified application diagram
• Platforms
• Hardware specification
• Virtual machines

Simplified application diagram
[Diagram showing the main components: database, PILOT (gulliver 101), CRAWLERS (gulliver 105 - 134), DEDUP INDEX SERVER (Indexer), jobs GUI, JobManager, Broker, administration interface, viewerProxy, storage array, and the Internet.]

Platforms
[Diagram of the Operational Platform: PILOT, databases, GUI, JobManager, Broker, ViewerProxy, DEDUP INDEX SERVER, indexer master, indexers, NAS.]

Operational Platform : PFO
• 1 PILOT, 1 deduplication indexer, 1 indexer master, 2 to 8 indexers, 30 to 60 CRAWLERS.
• A variable and scalable number of machines.
• Hence the need to rationalise and share hardware resources.

Trial Run Platform : PFO-MAB
• Identical in setup to the PFO, the PFO-MAB (MAB = Marche À Blanc = trial run) is used to simulate and test harvests in real conditions for our curators team (DLN).
• It is used, for example, to refine and/or test new crawler configurations, to run load tests, etc.
• Its size is also variable and subject to change.

Pre-production Platform : PFP
• The PFP, for its part, is a technical test platform for our engineers team (DSI).
• It allows us to validate architectural choices, to test deployments, etc.

Platforms
Our needs :
• Flexibility regarding the number of CRAWLERS allocated to a platform.
• Hardware resource sharing and optimisation.
• All the classical needs of production environments, such as robustness and reliability.
Solution : virtualisation! A hypervisor gives us :
• Virtual machines
• Configuration « templates »
• Resource pools grouping the machines
• Automatic management of all shared resources

The DL-WEB cluster
[Diagram: the DL-WEB cluster pools the shared resources of seven physical servers (blades 1 to 7).]

Blade servers
• The chassis can hold up to 14 blade servers plus one media blade. Currently, seven of those blades are dedicated to the DL-WEB.
• Each blade has 4 Ethernet interfaces connected to 4 LAN switches, and 2 HBAs connected to 2 SAN switches (FC).
• This provides a secured, redundant SAN connection to the NetApp array.

Networking
Each blade has a virtual switch configured as follows :
• VLAN521 (eth0, BnF private network) : 1 active NIC + 2 spares
• VLAN522 (eth1, direct Internet connection via a dedicated gateway) : 1 active NIC + 2 spares
• Management port : 1 dedicated interface + 1 spare

A dive into the hardware
• RAM : 2 x 9 modules of 4 GB = 72 GB of RAM per machine
• CPU : 2 sockets; on each socket, 1 CPU with 2 cores and 4 threads, i.e. 2 x 2 x 4 = 16 logical CPUs per machine

Physical machines
• Per machine : 2 x 9 x 4 GB = 72 GB of RAM, 2 x 2 x 4 = 16 logical CPUs
• Cluster totals : 7 x 72 = 504 GB of RAM, 7 x 16 = 112 logical CPUs

Platforms
Functional choice : 1 virtual machine = 1 crawler
Reasons :
- Flexibility of use : a crawler can move from one platform to another
- A single template
- CPU power completely dedicated to a crawler if needed
- Robustness : if a machine crashes, only one job is affected
Drawbacks :
- Loss of disk space due to duplicated file systems
- Loss of RAM

Virtual machines
• Administration interface : « VMware vSphere Client » for hosts and clusters.
• The inventory can be viewed by physical server, by platform or by functional group (PILOT, CRAWLER, INDEX MASTER, INDEXER), and distinguishes active machines, inactive machines and templates (for example, the virtual machines hosted on gulliver03).
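As a side note on this centralised inventory, the short Python sketch below shows one way such a view could be pulled programmatically, using the open-source pyVmomi SDK for vSphere. It is only an illustration, not part of our setup: the vCenter address, credentials and output format are hypothetical placeholders.

```python
# Minimal sketch: list the virtual machines of a vSphere inventory together with
# the blade currently hosting them, using the pyVmomi SDK.
# The vCenter address and credentials below are hypothetical placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

context = ssl._create_unverified_context()   # lab setting: skip certificate checks
si = SmartConnect(host="vcenter.example.org",
                  user="administrator", pwd="secret",
                  sslContext=context)
try:
    content = si.RetrieveContent()
    # Recursive view over every VirtualMachine below the root folder.
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        host = vm.runtime.host.name if vm.runtime.host else "unassigned"
        print(f"{vm.name:20s} on {host:15s} ({vm.runtime.powerState})")
    view.DestroyView()
finally:
    Disconnect(si)
```

Run against a vCenter that manages the cluster, this would print one line per virtual machine with its current host and power state, roughly the information the vSphere Client inventory displays.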
DRS and vMotion
• A virtual machine is hosted on a single physical server at a given time.
• If the load on one of the servers becomes too heavy, some of its VMs are moved to another host dynamically and without interruption.
• If one of the hosts fails, all the VMs hosted on that server are moved to other hosts and rebooted.

Fault Tolerance (FT)
• An active copy of the FT-protected VM runs on another server.
• If the server hosting the master VM fails, the ghost VM instantly takes over without interruption.
• A new copy is then created on a third server; this process takes some time (up to a few minutes).
• The other VMs are moved and restarted as explained previously.
FT can be quite greedy with resources, especially network consumption, which is why we have enabled this functionality only for the PILOT.

Multiple advantages
• Flexibility : dynamically add/remove/move virtual machines
• Easily modify/adapt the hardware configuration (e.g. add a CPU)
• Optimisation and pooling of hardware resources
• Centralised administration for managing and supervising our fleet
• Easy increase of resources if needed (by adding one or more blades to the cluster)
• More reliable physical servers
• Some key machines can be made even more secure with the FT functionality
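To make the scaling point concrete, here is a back-of-the-envelope sketch in plain Python, using only the figures quoted on the hardware slides above (72 GB of RAM and 16 logical CPUs per blade), showing how the shared pool grows when a blade is added to the cluster.

```python
# Back-of-the-envelope check of the cluster figures quoted earlier:
# each blade has 2 x 9 x 4 GB = 72 GB of RAM and 2 x 2 x 4 = 16 logical CPUs.
RAM_PER_BLADE_GB = 2 * 9 * 4   # 72 GB per machine
CPUS_PER_BLADE = 2 * 2 * 4     # 16 logical CPUs per machine

def cluster_capacity(blades):
    """Return (total RAM in GB, total logical CPUs) for a given blade count."""
    return blades * RAM_PER_BLADE_GB, blades * CPUS_PER_BLADE

ram7, cpus7 = cluster_capacity(7)
print(f"7 blades : {ram7} GB RAM, {cpus7} logical CPUs")   # 504 GB, 112 CPUs

# Adding a single blade to the chassis grows the shared pool by one blade's worth:
ram8, cpus8 = cluster_capacity(8)
print(f"8 blades : {ram8} GB RAM, {cpus8} logical CPUs")   # 576 GB, 128 CPUs
```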