Cluster Management - UFM Test Installation Guide

Background:

The NVIDIA Unified Fabric Manager (UFM) platform combines enhanced real-time network telemetry with AI-driven network intelligence and analytics to support resilient and scalable InfiniBand data centers. Once UFM is deployed on a server, unmanaged switches can also be brought under management. The UFM platform lets research and industry data center operators efficiently provision, monitor, manage, proactively troubleshoot, and maintain InfiniBand data center networks. It is offered at several solution levels, with a comprehensive feature set that covers a wide range of modern scale-out data center needs. With UFM, you can achieve higher network resource utilization, gain a competitive advantage, and reduce operational expenses.

HOW IS UFM DIFFERENT FROM OPENSM

  OpenSM is a routing engine, not a fabric management solution.
  UFM uses OpenSM as its routing engine and adds many other functions on top of it.

KEY FEATURES

  Subnet Management (SM)
  Automatic Network Discovery
  Fabric Performance / Error / Congestion Monitoring
  Fabric Visualization + Multisite Portal
  Congestion and Performance Analysis
  Performance Optimization
  Device Discovery
  Chassis and FRU Monitoring
  Device Management
  Event / Fault Management
  Fabric Configuration Automation

The following steps describe how to apply for a test license and install UFM for evaluation.

1. Online Application for Test License

Fill out the form at the following link:

  https://enterpriseproductregistration.nvidia.com/?LicType=EVAL&ProductFamily=UFM

The customer or partner must accept the terms, complete the information, and submit the application to receive a 60-day test and evaluation license. Note that you must use your corporate email address; otherwise the application will not be approved. Watch for emails after submitting the application. Once the application is approved, you will receive the test license.

2. Register to Activate Test License

If you are logging in to the NVIDIA APPLICATION HUB website for the first time, you will receive another email to activate your account and set your password. Click "SET PASSWORD" in the email to open the password screen; you can then log in to the NVIDIA APPLICATION HUB website (https://nvid.nvidia.com/dashboard/#/dashboard).

Click the link to NVIDIA LICENSING PORTAL to start the activation process. Click NETWORK ENTITLEMENT to view network-related licenses. On this screen, the PAK ID given in the attachment to the license email corresponds to the software license listed here. Click Action and select Manage license. Confirm the ID in the new window, enter the MAC address of the UFM server NIC and, optionally, the MAC address of the HA UFM server, then click GENERATE LICENSE FILE to generate the license file. If the server has multiple NICs, you can use the MAC address of any of them. Click the link to download the license.

3. Download UFM Software

In the NVIDIA LICENSING PORTAL, click SOFTWARE DOWNLOAD, select the package that matches your OS version, accept the agreement, and download it.

4. Install UFM Software

The default UFM installation path is /opt/ufm. UFM supports the following deployment options:

  Standalone deployment
  HA (high-availability) deployment
  Container deployment

Some services may be affected during the installation process:

  httpd (apache2 on Ubuntu)
  dhcpd

After the installation is complete, you need to activate the license and perform the initial configuration.
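Because the installer may affect httpd and dhcpd, it can be useful to record their state before running it. The following is a minimal sketch, assuming a systemd-based host. The function name report_service_states and the SYSTEMCTL override variable are hypothetical helpers introduced for illustration, and the unit names are simply the service names mentioned above; they may differ on your distribution.

```shell
#!/usr/bin/env bash
# Sketch: report the state of services that the UFM installer may affect.
# Assumptions: systemd is in use; unit names match the service names in the text.

# Print "<unit> <state>" for each unit name passed as an argument.
# SYSTEMCTL may be overridden (for example in tests); it defaults to systemctl.
report_service_states() {
    local unit state
    for unit in "$@"; do
        # `systemctl is-active` prints the state and exits non-zero when the
        # unit is not active, so the exit code is deliberately ignored.
        state="$("${SYSTEMCTL:-systemctl}" is-active "$unit" 2>/dev/null || true)"
        printf '%s %s\n' "$unit" "${state:-unknown}"
    done
}

report_service_states httpd apache2 dhcpd
```

Running the same command again after the installation makes it easy to see whether either service changed state.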
To activate the license, copy the license file downloaded in step 2 to /opt/ufm/files/licenses:

    [root@localhost conf]# ls /opt/ufm/files/licenses
    mlnx-ufm-mdpagop65w-2dp0y6hdq4-9j5knopcmf-20221009063602.lic

For UFM software dependencies, refer to the online documentation, "Prerequisites for UFM Server Software Installation". Unpacking the downloaded zip archive yields several files, including ufm-6.10.0-3.rhel7.mofed5.tgz. Decompress that file, run the install.sh script, and follow the printed messages to resolve any missing dependencies or conflicting software. When the installation completes, follow the prompts to enable and start UFM.

    [root@localhost ufm-6.10.0-3.rhel7.mofed5]# ./install.sh
    Do you want to install UFM server [Y|n]? Y
    UFM IB PREREQUISITE TEST
    Installed distribution [OK]
    Server architecture [OK]
    OFED version [OK]
    Other SM [OK]
    Timezone cofiguration [OK]
    Python version [OK]
    IPtables service [OK]
    Required RPM(s) [OK]
    Required Python Packages [OK]
    Sudoers directory existence [OK]
    Sudoers directory inclusion [OK]
    Conflicting RPM(s) [OK]
    Conflicting unhandled packages(s) [OK]
    IB interface [OK]
    Localhost resolving [OK]
    Hostname resolving [OK]
    SELinux disabled [OK]
    Available disk space [OK]
    Write permissions on /tmp for other [OK]
    Virtual IP Port [OK]
    Ufmapp user definitions [OK]
    Checking that all required ports are available
    Checking tcp ports
    Checking state of port 3307
    Port 3307 is free
    Checking state of port 2222
    Port 2222 is free
    Checking state of port 8088
    Port 8088 is free
    Checking state of port 8080
    Port 8080 is free
    Checking state of port 8081
    Port 8081 is free
    Checking state of port 8082
    Port 8082 is free
    Checking state of port 8083
    Port 8083 is free
    Checking state of port 8089
    Port 8089 is free
    Checking udp ports
    Checking state of port 6306
    Port 6306 is free
    Checking state of port 8005
    Port 8005 is free
    Checking tcp ports allowed for httpd
    Checking state of port 443
    Port 443 is free
    Checking state of port 80
    Port 80 is free
    localhost.localdomain: All prerequisite tests passed.
    See /tmp/ufm_prereq.log for more details
    Installing UFM...
    Installing UFM related mft utilities
    [*] Restoring HA flags...
    [*] UFM installation log : /tmp/ufm_install_5874.log
    [*] UFM Installation finished successfully.
    [*] To enable UFM on startup run: systemctl enable ufm-enterprise.service
    [*] To Start UFM Please run: systemctl start ufm-enterprise.service

Enable the service so that UFM starts at boot:

    [root@localhost ufm-6.10.0-3.rhel7.mofed5]# systemctl enable ufm-enterprise.service
    Created symlink from /etc/systemd/system/multi-user.target.wants/ufm-enterprise.service to /usr/lib/systemd/system/ufm-enterprise.service.

Start the service and check its status:

    [root@localhost licenses]# systemctl start ufm-enterprise.service
    [root@localhost licenses]# systemctl status ufm-enterprise.service
    ● ufm-enterprise.service - UFM Enterprise
       Loaded: loaded (/usr/lib/systemd/system/ufm-enterprise.service; enabled; vendor preset: disabled)
       Active: active (exited) since Sun 2022-10-09 17:06:08 CST; 8s ago
      Process: 7477 ExecStart=/etc/init.d/ufmd start (code=exited, status=0/SUCCESS)
     Main PID: 7477 (code=exited, status=0/SUCCESS)
        Tasks: 167
       CGroup: /system.slice/ufm-enterprise.service
               ├─7758 tail -f /opt/ufm/opensm/smc_in
               ├─7759 /opt/ufm/opensm/sbin/opensm --config /opt/ufm/files/conf/opensm/opensm.conf -q local
               ├─7768 osm_crashd
               ├─8392 /usr/bin/python /bin/supervisord -config=/opt/ufm/files/conf/telemetry/supervisord.conf
               ├─8442 /opt/ufm/venv_ufm/bin/python3 -O /opt/ufm/periodicreport/main/periodic_report_runner.pyc
               ├─8446 /opt/ufm/telemetry/bin/launch_ibdiagnet --config /opt/ufm/files/conf/telemetry/launch_ibdiagnet_config.ini
               ├─8447 /opt/ufm/telemetry/bin/watcher --config /opt/ufm/files/conf/telemetry/launch_ibdiagnet_config.ini
               ├─8448 /opt/ufm/telemetry/bin/watcher --config /opt/ufm/files/conf/telemetry/launch_ibdiagnet_config.ini
               ├─8455 /opt/ufm/telemetry/bin/launch_ibdiagnet --config /opt/ufm/files/conf/telemetry/launch_ibdiagnet_config.ini
               ├─8496 /opt/ufm/venv_ufm/bin/python3 -O /opt/ufm/unhealthyports/upcore/unhealthy_ports_main.pyc
               ├─8498 timeout 10010 /opt/ufm/telemetry/bin/ibdiagnet --long_run_timeout 1000 --long_run_iteration 10000 -o /opt/ufm/files/log -i mlx5_0 --skip dup_guids --config_file /opt/ufm/conf/opensm/ibd...
               ├─8499 /opt/ufm/telemetry/bin/ibdiagnet --long_run_timeout 1000 --long_run_iteration 10000 -o /opt/ufm/files/log -i mlx5_0 --skip dup_guids --config_file /opt/ufm/conf/opensm/ibdiag.conf --ski...
               └─8520 /opt/ufm/venv_ufm/bin/python3 /opt/ufm/ufmhealth/UfmHealthRunner.pyc /opt/ufm/files/conf/UFMHealthConfiguration.xml

    Oct 09 17:06:05 localhost.localdomain ufmd[7477]: Starting Daily Report: [ OK ]
    Oct 09 17:06:07 localhost.localdomain ufmd[7477]: Starting UnhealthyPorts: [ ]
    Oct 09 17:06:08 localhost.localdomain systemd[1]: Started UFM Enterprise.
    Oct 09 17:06:12 localhost.localdomain ibdiagnet[8657]: No scope files. Total switches/ports [1/41], CAs/ports [2/2]
    Oct 09 17:06:12 localhost.localdomain ibdiagnet[8657]: Stage "Discovery" real: 0.031801 user: 0.003511, sys: 0.009811
    Oct 09 17:06:13 localhost.localdomain ibdiagnet[8657]: Stage "Port Counters" real: 1.009172 user: 0.001860, sys: 0.000648
    Oct 09 17:06:13 localhost.localdomain ibdiagnet[8657]: Stage "Virtualization" real: 0.000449 user: 0.000197, sys: 0.000251
    Oct 09 17:06:13 localhost.localdomain ibdiagnet[8657]: Stage "Temperature Sensing" real: 0.000476 user: 0.000085, sys: 0.000108
    Oct 09 17:06:13 localhost.localdomain ibdiagnet[8657]: Stage "Routers" real: 0.000113 user: 0.000049, sys: 0.000062
    Oct 09 17:06:13 localhost.localdomain ibdiagnet[8657]: Stage "Post Reports Generation" real: 0.000245 user: 0.000108, sys: 0.000137
    OK
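The prerequisite test in the installation log checks that a fixed set of TCP and UDP ports is free. If you want to look for conflicts before running install.sh, the check can be approximated as follows. This is only a sketch: the port lists are copied from the log for this UFM version and may differ in other releases, the function name busy_ports is a hypothetical helper, and the script assumes the ss utility from iproute2 is installed.

```shell
#!/usr/bin/env bash
# Sketch: find listeners on the ports that the UFM prerequisite test checks.
# Port lists are taken from the installer output shown in this guide.
TCP_PORTS="3307 2222 8088 8080 8081 8082 8083 8089 443 80"
UDP_PORTS="6306 8005"

# Read `ss -ltn` / `ss -lun` style output on stdin and print every port from
# the argument list that appears as a local listening port, in argument order.
busy_ports() {
    awk -v wanted="$*" '
        BEGIN { n = split(wanted, w, " "); for (i = 1; i <= n; i++) want[w[i]] = 1 }
        {
            # Field 4 is "Local Address:Port"; the port follows the last colon.
            m = split($4, parts, ":")
            port = parts[m]
            if (port in want) busy[port] = 1
        }
        END { for (i = 1; i <= n; i++) if (w[i] in busy) print w[i] }
    '
}

if command -v ss >/dev/null 2>&1; then
    # Word splitting of the port lists is intentional here.
    for p in $(ss -ltn | busy_ports $TCP_PORTS); do echo "tcp port $p is in use"; done
    for p in $(ss -lun | busy_ports $UDP_PORTS); do echo "udp port $p is in use"; done
fi
```

No output means none of the listed ports has a listener. The installer's own prerequisite test remains the authoritative check; this only gives an early warning.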