Cluster Management - UFM Test Installation Guide
Background:
The NVIDIA Unified Fabric Manager (UFM) platform combines enhanced real-time network telemetry with
AI-driven network intelligence and analytics to support resilient and scalable InfiniBand data centers.
Once UFM is deployed on a server, unmanaged switches in the fabric can also be managed through it.
The UFM platform enables research and industry data center operators to efficiently provision, monitor,
manage, proactively troubleshoot, and maintain InfiniBand data center networks. It is offered at several
solution levels, with a comprehensive feature set that covers a wide range of modern scale-out data
center needs. With UFM, you can achieve higher network resource utilization and reduce operational
expenses.
HOW IS UFM DIFFERENT FROM OPENSM?
OpenSM is a routing engine, not a fabric management solution.
UFM uses OpenSM as its routing engine and adds many other capabilities on top of it.
KEY FEATURES
Subnet Management (SM)
Automatic Network Discovery
Fabric Performance / Error / Congestion Monitoring
Fabric Visualization + Multisite Portal
Congestion and Performance Analysis
Performance Optimization
Device Discovery
Chassis and FRU Monitoring
Device Management
Event / Fault Management
Fabric Configuration
Automation
The following steps walk through applying for a test license and installing UFM for evaluation.
1. Online Application for Test License
Fill out the form at the following link:
https://enterpriseproductregistration.nvidia.com/?LicType=EVAL&ProductFamily=UFM
The customer or partner must complete the form, accept the terms, and submit the application to
receive a 60-day test and evaluation license. Note that you must use your corporate email address;
otherwise the application will not be approved.
Watch for emails after submitting the application. Once it is approved, you will receive the test
license by email.
2. Register to Activate Test License
If you are logging in to the NVIDIA APPLICATION HUB website for the first time, you will receive
another email to activate your account and set your password. Click "SET PASSWORD" in the
email to open the password screen; after that you can log in to the NVIDIA APPLICATION HUB website
(https://nvid.nvidia.com/dashboard/#/dashboard).
Click on the link to go to NVIDIA LICENSING PORTAL to start the activation process. Click
NETWORK ENTITLEMENT to view network-related Licenses.
On this screen, the PAK ID from the email attachment mentioned above identifies the corresponding
software license.
Click Action and select Manage license.
Confirm the ID in the new window, enter the MAC address of the UFM server NIC and the MAC
address of the HA UFM server (optional), and then you can click GENERATE LICENSE FILE to
generate the License file. If you have multiple NICs, you can select the MAC address of any NIC.
Click the link to download License.
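The license form above asks for the MAC address of a NIC on the UFM server. As a minimal sketch, assuming a Linux host that exposes interfaces under /sys/class/net, the helper below lists each interface with its hardware address so one can be copied into the form. The function name `list_macs` and its optional directory argument are illustrative and not part of UFM. Note that IPoIB interfaces report a 20-byte hardware address rather than a 6-byte MAC.

```shell
#!/bin/sh
# List each network interface with its hardware address, skipping loopback.
# The optional argument (default /sys/class/net) exists only so the function
# can be pointed at another directory; it is an illustrative convenience.
list_macs() {
    sysdir=${1:-/sys/class/net}
    for dev in "$sysdir"/*; do
        # Skip entries without a readable address file (or an unmatched glob).
        if [ ! -r "$dev/address" ]; then
            continue
        fi
        name=${dev##*/}
        if [ "$name" = "lo" ]; then
            continue
        fi
        printf '%s %s\n' "$name" "$(cat "$dev/address")"
    done
}

# Usage: list_macs
```

Each output line has the form "interface address"; per the step above, the address of any one Ethernet NIC can be used.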
3. Download UFM Software
In the NVIDIA LICENSING PORTAL, open SOFTWARE DOWNLOAD, select the package
that matches your OS version, and accept the terms to download it.
4. Install UFM Software
The default UFM installation path is /opt/ufm.
The UFM deployment options are:
Standalone Deployment
High Availability (HA) Deployment
Container Deployment
Some services may be affected during the installation process:
httpd (apache2 on Ubuntu)
dhcpd
After the installation is complete, you need to activate the License and perform the initial configuration.
To activate the license, you only need to copy the license file above to the path: /opt/ufm/files/licenses
[root@localhost conf]# ls /opt/ufm/files/licenses
mlnx-ufm-mdpagop65w-2dp0y6hdq4-9j5knopcmf-20221009063602.lic
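The license activation step above amounts to copying one file into /opt/ufm/files/licenses. A minimal sketch of that step follows; the function name `install_license` and its optional second argument are hypothetical helpers introduced here, and the destination directory is assumed to exist already because UFM has been installed.

```shell
#!/bin/sh
# Copy a license file into the UFM license directory and list the result.
# The destination defaults to the path given in this guide; the second
# argument is only a convenience for trying the function elsewhere.
install_license() {
    lic=$1
    dest=${2:-/opt/ufm/files/licenses}
    if [ ! -f "$lic" ]; then
        echo "license file not found: $lic" >&2
        return 1
    fi
    case $lic in
        *.lic) ;;
        *) echo "warning: $lic does not end in .lic" >&2 ;;
    esac
    # The directory is not created here; a missing one makes cp fail.
    cp -- "$lic" "$dest"/ && ls "$dest"
}

# Usage (as root): install_license /path/to/downloaded-license.lic
```

Listing the directory afterwards mirrors the `ls` check shown above and confirms the file landed where UFM looks for it.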
For UFM software dependencies, refer to "Prerequisites for UFM Server Software Installation" in the
online documentation.
Unpack the downloaded zip archive, then decompress the ufm-6.10.0-3.rhel7.mofed5.tgz package it contains.
Run the install.sh script and follow the printed messages to resolve any dependency or
conflicting-software issues.
After the installation completes, follow the prompts to enable and start UFM.
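The unpack step can be sketched as a small helper that extracts the .tgz package and reports where install.sh ended up, without running it. The function name `unpack_ufm` is hypothetical, and the sketch assumes the archive unpacks into a directory named after the package, which matches the shell prompt in the transcript that follows.

```shell
#!/bin/sh
# Extract a UFM .tgz package into a work directory and print the path of
# the install.sh it contains. Nothing is executed here.
unpack_ufm() {
    pkg=$1
    workdir=${2:-.}
    tar -xzf "$pkg" -C "$workdir" || return 1
    # Assumption: the archive unpacks into a directory named after the package.
    dir=$workdir/$(basename "$pkg" .tgz)
    if [ -x "$dir/install.sh" ]; then
        printf '%s\n' "$dir/install.sh"
    else
        echo "install.sh not found under $dir" >&2
        return 1
    fi
}

# Usage (as root):
#   installer=$(unpack_ufm ufm-6.10.0-3.rhel7.mofed5.tgz) || exit 1
#   cd "$(dirname "$installer")" && ./install.sh
```

Keeping extraction separate from execution leaves room to review the package contents before the installer changes anything on the server.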
[root@localhost ufm-6.10.0-3.rhel7.mofed5]# ./install.sh
Do you want to install UFM server [Y|n]? Y
UFM IB PREREQUISITE TEST
Installed distribution [OK]
Server architecture [OK]
OFED version [OK]
Other SM [OK]
Timezone cofiguration [OK]
Python version [OK]
IPtables service [OK]
Required RPM(s) [OK]
Required Python Packages [OK]
Sudoers directory existence [OK]
Sudoers directory inclusion [OK]
Conflicting RPM(s) [OK]
Conflicting unhandled packages(s) [OK]
IB interface [OK]
Localhost resolving [OK]
Hostname resolving [OK]
SELinux disabled [OK]
Available disk space [OK]
Write permissions on /tmp for other [OK]
Virtual IP Port [OK]
Ufmapp user definitions [OK]
Checking that all required ports are available
Checking tcp ports
Checking state of port 3307
Port 3307 is free
Checking state of port 2222
Port 2222 is free
Checking state of port 8088
Port 8088 is free
Checking state of port 8080
Port 8080 is free
Checking state of port 8081
Port 8081 is free
Checking state of port 8082
Port 8082 is free
Checking state of port 8083
Port 8083 is free
Checking state of port 8089
Port 8089 is free
Checking udp ports
Checking state of port 6306
Port 6306 is free
Checking state of port 8005
Port 8005 is free
Checking tcp ports allowed for httpd
Checking state of port 443
Port 443 is free
Checking state of port 80
Port 80 is free
localhost.localdomain: All prerequisite tests passed. See /tmp/ufm_prereq.log for more details
Installing UFM...
Installing UFM related mft utilities
[*] Restoring HA flags...
[*] UFM installation log : /tmp/ufm_install_5874.log
[*] UFM Installation finished successfully.
[*] To enable UFM on startup run:
systemctl enable ufm-enterprise.service
[*] To Start UFM Please run:
systemctl start ufm-enterprise.service
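The installer output above checks that a set of TCP ports is free. To spot a conflict before running the installer, one option is to filter the listening-socket table for those ports. This is a sketch only: `busy_tcp_ports` is a hypothetical helper, it reads `ss -ltn` style output on stdin, and it covers only the TCP ports, not the UDP ports 6306 and 8005.

```shell
#!/bin/sh
# Read `ss -ltn` style output on stdin and print every port from the
# space-separated list in $1 that already has a TCP listener.
busy_tcp_ports() {
    awk -v want="$1" '
        BEGIN { n = split(want, p, " "); for (i = 1; i <= n; i++) need[p[i]] = 1 }
        $1 == "LISTEN" {
            k = split($4, a, ":")      # port is the last ":"-separated piece
            # Print each matching port once, even if bound on IPv4 and IPv6.
            if ((a[k] in need) && !seen[a[k]]++) print a[k]
        }
    '
}

# Usage, with the TCP ports taken from the installer output:
#   ss -ltn | busy_tcp_ports "3307 2222 8088 8080 8081 8082 8083 8089 443 80"
```

No output means none of the listed ports currently has a listener; any port printed would need to be freed before the prerequisite test can pass.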
[root@localhost ufm-6.10.0-3.rhel7.mofed5]# systemctl enable ufm-enterprise.service
Created symlink from /etc/systemd/system/multi-user.target.wants/ufm-enterprise.service to /usr/lib/systemd/system/ufm-enterprise.service.
[root@localhost licenses]# systemctl start ufm-enterprise.service
[root@localhost licenses]# systemctl status ufm-enterprise.service
● ufm-enterprise.service - UFM Enterprise
Loaded: loaded (/usr/lib/systemd/system/ufm-enterprise.service; enabled; vendor preset: disabled)
Active: active (exited) since Sun 2022-10-09 17:06:08 CST; 8s ago
Process: 7477 ExecStart=/etc/init.d/ufmd start (code=exited, status=0/SUCCESS)
Main PID: 7477 (code=exited, status=0/SUCCESS)
Tasks: 167
CGroup: /system.slice/ufm-enterprise.service
├─7758 tail -f /opt/ufm/opensm/smc_in
├─7759 /opt/ufm/opensm/sbin/opensm --config /opt/ufm/files/conf/opensm/opensm.conf -q local
├─7768 osm_crashd
├─8392 /usr/bin/python /bin/supervisord -config=/opt/ufm/files/conf/telemetry/supervisord.conf
├─8442 /opt/ufm/venv_ufm/bin/python3 -O /opt/ufm/periodicreport/main/periodic_report_runner.pyc
├─8446 /opt/ufm/telemetry/bin/launch_ibdiagnet --config /opt/ufm/files/conf/telemetry/launch_ibdiagnet_config.ini
├─8447 /opt/ufm/telemetry/bin/watcher --config /opt/ufm/files/conf/telemetry/launch_ibdiagnet_config.ini
├─8448 /opt/ufm/telemetry/bin/watcher --config /opt/ufm/files/conf/telemetry/launch_ibdiagnet_config.ini
├─8455 /opt/ufm/telemetry/bin/launch_ibdiagnet --config /opt/ufm/files/conf/telemetry/launch_ibdiagnet_config.ini
├─8496 /opt/ufm/venv_ufm/bin/python3 -O /opt/ufm/unhealthyports/upcore/unhealthy_ports_main.pyc
├─8498 timeout 10010 /opt/ufm/telemetry/bin/ibdiagnet --long_run_timeout 1000 --long_run_iteration 10000 -o /opt/ufm/files/log -i mlx5_0 --skip dup_guids --config_file /opt/ufm/conf/opensm/ibd...
├─8499 /opt/ufm/telemetry/bin/ibdiagnet --long_run_timeout 1000 --long_run_iteration 10000 -o /opt/ufm/files/log -i mlx5_0 --skip dup_guids --config_file /opt/ufm/conf/opensm/ibdiag.conf --ski...
└─8520 /opt/ufm/venv_ufm/bin/python3 /opt/ufm/ufmhealth/UfmHealthRunner.pyc /opt/ufm/files/conf/UFMHealthConfiguration.xml
Oct 09 17:06:05 localhost.localdomain ufmd[7477]: Starting Daily Report: [ OK ]
Oct 09 17:06:07 localhost.localdomain ufmd[7477]: Starting UnhealthyPorts: [ OK ]
Oct 09 17:06:08 localhost.localdomain systemd[1]: Started UFM Enterprise.
Oct 09 17:06:12 localhost.localdomain ibdiagnet[8657]: No scope files. Total switches/ports [1/41], CAs/ports [2/2]
Oct 09 17:06:12 localhost.localdomain ibdiagnet[8657]: Stage "Discovery" real: 0.031801 user: 0.003511, sys: 0.009811
Oct 09 17:06:13 localhost.localdomain ibdiagnet[8657]: Stage "Port Counters" real: 1.009172 user: 0.001860, sys: 0.000648
Oct 09 17:06:13 localhost.localdomain ibdiagnet[8657]: Stage "Virtualization" real: 0.000449 user: 0.000197, sys: 0.000251
Oct 09 17:06:13 localhost.localdomain ibdiagnet[8657]: Stage "Temperature Sensing" real: 0.000476 user: 0.000085, sys: 0.000108
Oct 09 17:06:13 localhost.localdomain ibdiagnet[8657]: Stage "Routers" real: 0.000113 user: 0.000049, sys: 0.000062
Oct 09 17:06:13 localhost.localdomain ibdiagnet[8657]: Stage "Post Reports Generation" real: 0.000245 user: 0.000108, sys: 0.000137
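For a scripted check instead of reading the full status output, the unit state can be queried with `systemctl is-active`. The sketch below wraps that call; `ufm_state` is a hypothetical helper name, while the unit name ufm-enterprise.service is the one used throughout this guide.

```shell
#!/bin/sh
# Print the unit's state and return success only if it reports "active".
ufm_state() {
    unit=${1:-ufm-enterprise.service}
    # is-active exits non-zero for any state other than active, so the
    # "|| true" keeps the assignment from aborting under `set -e`.
    state=$(systemctl is-active "$unit" 2>/dev/null) || true
    printf '%s: %s\n' "$unit" "${state:-unknown}"
    [ "$state" = "active" ]
}

# Usage: show recent journal entries if the service is not active.
#   ufm_state || journalctl -u ufm-enterprise.service -n 50 --no-pager
```

A unit shown as "active (exited)", as in the status output above, still reports "active" here, so the helper succeeds in that case.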