7.2.5. Performance Monitoring
This section describes how to monitor the performance of the fabric using data from the Performance Monitoring tool and the Performance Manager parameters.
7.2.5.1. Monitoring Fabric Performance
The opatop command allows you to start the Fabric Performance Monitoring TUI so that you can monitor the performance of the fabric.
The Fabric Performance Monitor TUI displays performance, congestion, and statistical information about a fabric. Fabric information is divided into two main starting points for analyzing fabric traffic:
Performance (bandwidth utilization): Can identify over-utilized areas (bottlenecks) and under-utilized areas (potentially misconfigured).
Statistics: Can identify problems in fabric hardware or configuration, as well as congestion and other performance situations.
This section describes:
The TUI menus used to gather Fabric Performance data.
What to do with the data you have gathered.
7.2.5.1.1. Accessing the Fabric Performance Monitor
The Fabric Performance Monitor allows you to monitor performance, congestion, and statistics information in a fabric.
Using the opatop Command
To start up the Fabric Performance Monitor from the command prompt, perform the following steps:
Log in to the server as root.
At the command prompt, enter opatop. The Fabric Performance Monitor Summary screen is displayed.
opatop: Img: 10s @ Wed Sep 14 11:29:52 2016, Live Summary: SW: 0 Ports: SW: 0 HFI: 2 Link: 1 SM: 1 Node NRsp: 0 Skip: 0 Port NRsp: 0 Skip: 0 AvgMBps MinMBps MaxMBps AvgKPps MinKPps MaxKPps 0 All Int 0 0 0 0 0 0 Integ:min Congst:min SmaCong:min Bubble:min Secure:min Routing:min 1 HFIs Int 0 0 0 0 0 0 Integ:min Congst:min SmaCong:min Bubble:min Secure:min Routing:min 2 SWs No ports in group Master-SM: LID: 0x0001 Port: 1 Priority: 0 State: Master Name: phcppriv10 hfi1_0 PortGUID: 0x0011750101575300 Secondary-SM: none Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | sS Pmcfg Imginfo View 0-n:
From the Cornelis FastFabric OPA Tools Menu
To start up the Fabric Performance Monitor menu from the Cornelis FastFabric OPA Tools menu, perform the following steps:
Log in to the server as root.
At the command prompt, enter opafastfabric. The Cornelis FastFabric OPA Tools menu is displayed.
Cornelis FastFabric OPA Tools
Version: X.X.X.X.X
1) Chassis Setup/Admin
2) Externally Managed Switch Setup/Admin
3) Host Setup
4) Host Verification/Admin
5) Fabric Monitoring
X) Exit (or Q)
At the cursor, type 5. The FastFabric OPA Fabric Monitoring menu is displayed.
FastFabric OPA Fabric Monitoring Menu
0) Fabric Performance Monitoring [ Skip ]
P) Perform the Selected Actions
N) Select None
X) Return to Previous Menu (or ESC or Q)
Table 19. FastFabric OPA Fabric Monitoring Menu Descriptions
Menu Item | Description |
|---|---|
0) Fabric Performance Monitoring | Allows you to access the TUI that monitors the performance, congestion, and statistics information about a fabric. Associated CLI Command: opatop |
Type 0 to toggle to the [Perform] option.
Type P to perform the operation. The Fabric Performance Monitor information is displayed.
opatop: Img: 10s @ Fri Sep 16 11:35:24 2016, Live Summary: SW: 0 Ports: SW: 0 HFI: 2 Link: 1 SM: 1 Node NRsp: 0 Skip: 0 Port NRsp: 0 Skip: 0 AvgMBps MinMBps MaxMBps AvgKPps MinKPps MaxKPps 0 All Int 0 0 0 0 0 0 Integ:min Congst:min SmaCong:min Bubble:min Secure:min Routing:min 1 HFIs Int 0 0 0 0 0 0 Integ:min Congst:min SmaCong:min Bubble:min Secure:min Routing:min 2 SWs No ports in group Master-SM: LID: 0x0001 Port: 1 Priority: 0 State: Master Name: phcppriv10 hfi1_0 PortGUID: 0x0011750101575300 Secondary-SM: none Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | sS Pmcfg Imginfo View 0-n:
7.2.5.1.2. How to Use the Fabric Performance Monitor TUI
The Fabric Performance Monitor TUI allows you to view and interact with live performance data.
Reading the TUI Screens
The figure below shows the major sections common to all Fabric Performance Monitor TUI screens.
Section of Screen | Description |
|---|---|
| Refers to the CLI command that initiates the Fabric Performance Monitoring TUI. NOTE: |
Image Identification | Displays the following image (Img) information:
|
Screen-Specific Information | Displays information and layout of the selected screen. NOTE: Each screen is different and will be discussed in subsequent sections. |
Common Input Commands | Displays the common input commands, which appear on every screen and perform the same action on each.
Commands are case insensitive except where specifically noted otherwise. The ENTER key must be pressed after multi-character commands and for |
Screen-Specific Input Commands | Displays the screen-specific commands. |
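If you capture the text of a screen (for example, from a terminal log), the Image Identification line can be split into its parts for later analysis. The following is a minimal sketch only: the line layout is inferred from the example screens in this section, and the function name `parse_image_line` and the mode labels it accepts (Live, Hist, Bkmk) are assumptions for illustration, not part of opatop.

```python
import re
from datetime import datetime

# Assumed layout of the first screen line, inferred from the examples in
# this section: "opatop: Img: <N>s @ <timestamp>, <mode> [Now:<timestamp>]".
HEADER_RE = re.compile(
    r"^opatop: Img:\s*(?P<interval>\d+)s @ (?P<image_time>.+?), "
    r"(?P<mode>Live|Hist|Bkmk)\s*(?:Now:\s*(?P<now>.+))?$"
)
TIME_FORMAT = "%a %b %d %H:%M:%S %Y"  # for example "Wed Sep 14 11:29:52 2016"


def parse_image_line(line):
    """Split a captured Image Identification line into its parts."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an opatop image line: {line!r}")
    now = match.group("now")
    return {
        "interval_s": int(match.group("interval")),
        "image_time": datetime.strptime(match.group("image_time"), TIME_FORMAT),
        "mode": match.group("mode"),
        # "Now:" is only present when the image is not live.
        "now": datetime.strptime(now.strip(), TIME_FORMAT) if now else None,
    }


info = parse_image_line("opatop: Img: 10s @ Wed Sep 14 11:29:52 2016, Live")
print(info["interval_s"], info["mode"], info["image_time"].isoformat())
```

The timestamp format relies on English day and month abbreviations, so the sketch assumes the default C locale.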
Navigating the Screens
The Fabric Performance Monitoring TUI allows you to access various screens in a hierarchical manner to examine the state of a fabric. Through the screen-specific commands, each screen will provide access to the next screen or back to the parent screen.
The Fabric Performance Monitoring TUI screen navigational hierarchy is shown below.
As an example, if you want to navigate from the Group Info Sel screen to the Group BW Stats screen, perform the following steps:
The Group Info Sel screen is shown below.
opatop: Img: 10s @ Thu Sep 22 15:44:47 2016, Live Group Info Sel: HFIs Int NumPorts: 2 Rate Min: 100g Max: 100g Ext NumPorts: 0 Group Performance (P) Group Statistics (S) Group Config (C) Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | P S C:
The selections for the next level of screens are displayed as:
Group Performance (P) Group Statistics (S) Group Config (C)
The menu options are shown in the screen-specific commands as:
Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | P S C:
From the Group Info Sel screen, enter P. The Group BW Stats screen is displayed.
opatop: Img: 10s @ Thu Sep 22 15:52:27 2016, Live Group Performance: HFIs Criteria: Util-High Number: 10 Int: TotMBps AvgMBps MinMBps MaxMBps TotKPps AvgKPps MinKPps MaxKPps 0 0 0 0 0 0 0 0 Buckt 0+% 10+% 20+% 30+% 40+% 50+% 60+% 70+% 80+% 90+% 2 0 0 0 0 0 0 0 0 0 NoResp Int Ports: PMA: 0 Topo: 0 Max 0+% 25+% 50+% 75+% 100+% Int Congestion 0 2 0 0 0 0 Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | cC N0-n Detail:
Type u (lowercase) to return to the Group Info Sel screen.
Type u (lowercase) to return to the Summary screen.
Important
To switch between Port and Virtual Fabric Grouping screens, press V at the Summary screen and navigate through the hierarchy.
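The parent/child relationship between screens can be pictured as a small lookup table. The sketch below is only an illustrative model built from the screen names and keys used in this section; the table `CHILDREN` and the function `navigate` are hypothetical and do not drive opatop. It also assumes that typing u on the Summary screen has no effect.

```python
# Keys that open a child screen, per parent screen (names as used in this
# section). This is a model for illustration, not opatop code.
CHILDREN = {
    "Summary": {"P": "PM Config", "I": "Image Info"},
    "Group Info Sel": {
        "P": "Group BW Stats",
        "S": "Group Ctg Stats",
        "C": "Group Config",
    },
    "Group BW Stats": {"D": "Group Focus"},
    "Group Ctg Stats": {"D": "Group Focus"},
    "Group Config": {"P": "Port Stats"},
    "Group Focus": {"P": "Port Stats"},
}


def navigate(keys, start="Summary"):
    """Return the path of screens visited after typing each key in turn."""
    path = [start]
    for key in keys:
        screen = path[-1]
        if key == "u":
            # 'u' returns to the parent; assumed to be a no-op at the top.
            if len(path) > 1:
                path.pop()
        elif screen == "Summary" and key.isdigit():
            # A group number (0-n) on the Summary screen selects a group.
            path.append("Group Info Sel")
        elif key in CHILDREN.get(screen, {}):
            path.append(CHILDREN[screen][key])
        else:
            raise ValueError(f"{key!r} is not available on the {screen} screen")
    return path


print(navigate(["1", "P"]))
print(navigate(["1", "P", "u", "u"]))
```

The second call mirrors the example above: after viewing the Group BW Stats screen, typing u twice leads back to the Summary screen.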
Viewing the Fabric Performance Monitoring Summary Screen
The top-level Summary screen shows the basic fabric configuration information as well as performance and statistics information. This is the initial screen you see when you start up the TUI.
After looking at the Summary screen you can decide which area of the fabric (performance or statistics) and which port group or virtual fabric most warrants investigation, and can then drill down into that area.
To view the Fabric Performance Monitoring Summary screen, perform the following steps:
Log in to the server as root.
At the command prompt, enter opatop. The Summary screen is displayed.
opatop: Img: 10s @ Wed Sep 14 11:29:52 2016, Live Summary: SW: 0 Ports: SW: 0 HFI: 2 Link: 1 SM: 1 Node NRsp: 0 Skip: 0 Port NRsp: 0 Skip: 0 AvgMBps MinMBps MaxMBps AvgKPps MinKPps MaxKPps 0 All Int 0 0 0 0 0 0 Integ:min Congst:min SmaCong:min Bubble:min Secure:min Routing:min 1 HFIs Int 0 0 0 0 0 0 Integ:min Congst:min SmaCong:min Bubble:min Secure:min Routing:min 2 SWs No ports in group Master-SM: LID: 0x0001 Port: 1 Priority: 0 State: Master Name: phcppriv10 hfi1_0 PortGUID: 0x0011750101575300 Secondary-SM: none Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | sS Pmcfg Imginfo View 0-n:
To change to the Virtual Fabrics (VF) Summary screen, type V. The VF Summary screen is displayed as shown in the example below.
opatop: Img: 10s @ Thu Sep 22 15:20:07 2016, Live Summary: SW: 0 Ports: SW: 0 HFI: 2 Link: 1 SM: 1 Node NRsp: 0 Skip: 0 Port NRsp: 0 Skip: 0 AvgMBps MinMBps MaxMBps AvgKPps MinKPps MaxKPps 0 Admin Int 0 0 0 0 0 0 Integ:min Congst:min SmaCong:min Bubble:min Secure:min Routing:min Master-SM: LID: 0x0001 Port: 1 Priority: 0 State: Master Name: phcppriv10 hfi1_0 PortGUID: 0x0011750101575300 Secondary-SM: none Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | sS Pmcfg Imginfo View 0-n:
The table below describes the Summary screen fields.
Field | Description |
|---|---|
Fabric Configuration Information | Fabric configuration information includes:
|
Performance and Statistics for Each Port Group | Fabric performance and statistics are presented based on port groupings and virtual fabrics grouping: For Port Groups:
For Virtual Fabrics Group:
These groups provide a natural subdivision of the ports in a fabric for analysis. For each group, the following statistics are reported:
|
Performance Utilization | Performance Utilization for each port group is divided into up to three subgroups based on whether a port's neighbor port is in its group:
|
Statistics Categories | The statistics categories are:
Statistics categories are each based on one or more port counters. Each statistics category’s status indicator is shown at one of five values/colors based on the category value as compared to a threshold value:
|
Viewing the PM Configuration
The PM Configuration screen displays information as provided by the PM.
Note
The PM Configuration screen is the same for VF and non-VF.
The PM Configuration screen has no screen-specific input commands.
To view PM Configuration, perform the following steps:
Log in to the server as root.
At the command prompt, enter opatop. The Summary screen is displayed.
Type p. The PM Configuration screen is displayed as shown in the example below.
opatop: Img: 10s @ Thu Sep 22 15:23:17 2016, Live PM Config: Sweep Interval: 10 sec PM Flags(0x33): ProcessHFICntrs=On ProcessVLCntrs=On ClrDataCntrs=Off Clr64bitErrCntrs=Off Clr32bitErrCntrs=On Clr8bitErrCntrs=On Max Clients: 3 Total Images: 10 Freeze Images: 5 Freeze Lease: 60 seconds Ctg Thresholds: Integrity: 100 Congestion: 100 SmaCongest: 100 Bubble: 100 Security: 10 Routing: 100 Integrity Wts: Link Qual: 40 Uncorrectable: 100 Link Downed: 25 Rcv Errors: 100 Excs Bfr Ovrn: 100 FM Config Err: 100 Link Err Reco: 100 Loc Link Integ: 0 Lnk Wdth Dngd: 100 Congest Wts: Cong Discards: 100 Rcv FECN: 5 Rcv BECN: 1 Mark FECN: 25 Xmit Time Cong 25 Xmit Wait: 10 PM Memory Size: 169 MB (169295080 bytes) PMA MADs: MaxAttempts: 3 MinRespTimeout: 35 RespTimeout: 250 Sweep: MaxParallelNodes: 10 PmaBatchSize: 2 ErrorClear: 7 Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help |
Type u (lowercase) to return to the Summary screen.
The table below describes the PM Configuration screen fields.
Field | Description |
|---|---|
Sweep Interval | The time period over which the image data is relevant. Default is 10 seconds. NOTE: Normally, the opatop interval should be set to a value ≥ Sweep Interval. |
PM Flags | Shows whether PM Flags are On or Off for:
|
Max Clients | Maximum clients. |
Total Images |
|
Ctg Thresholds | Category thresholds:
|
Integrity Wts | Integrity weights:
|
Congest Wts | Congestion weights: |
PM Memory Size | Size of the PM memory footprint in MB and bytes. |
PMA MADs |
|
Sweep | Sweep information:
|
Viewing Image Information
The Image Information screen shows the image information as provided by the PM.
Note
The Image Information screen is the same for VF and non-VF.
The Image Information screen has no screen-specific input commands.
To view Image Information, perform the following steps:
Log in to the server as root.
At the command prompt, enter opatop. The Summary screen is displayed.
Type I. The Image Info screen is displayed as shown in the example below.
opatop: Img: 10s @ Thu Sep 22 16:51:58 2016, Live Image Info: Sweep Start: Thu Sep 22 16:51:58 2016 Sweep Duration: 0.001 Seconds Image Interval: 10 Seconds Num SW-Ports: 0 HFI-Ports: 2 Num SWs: 0 Num Links: 1 Num SMs: 1 Num NRsp Nodes: 0 Ports: 0 Unexpected Clear Ports: 0 Num Skip Nodes: 0 Ports: 0 Master-SM: LID: 0x0001 Port: 1 Priority: 0 State: Master Name: phcppriv10 hfi1_0 PortGUID: 0x0011750101575300 Secondary-SM: none Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help |
Type u (lowercase) to return to the Summary screen.
The following table describes the Image Information screen fields.
Field | Description |
|---|---|
Sweep Start | Timestamp for the start of the sweep |
Sweep Duration | Length of time for the sweep |
Image Interval | The time period over which the image data is relevant. Default is 10 seconds. |
Num [Ports] | Number of ports in each group:
|
Num SWs | Number of switches |
Node Information | Node information including:
|
Port Information | Port information including:
|
SM Information | Primary and secondary SM details
|
Viewing Bandwidth Utilization
For each valid performance data subgroup, the Bandwidth Utilization screen displays the total, average, minimum, and maximum MBps and KPps. For each subgroup, ten performance 'buckets' count the number of ports whose 'MBps compared to link rate' value corresponds to that bucket. This provides an indication of how the data rate of the group compares to its potential.
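The bucket counting described above can be sketched as follows. This is an illustration only: the function names are hypothetical, and the link rate of 12500 MBps used for a 100g port is an assumed conversion (100 Gbit/s divided by 8) that may not match how the PM derives its link rate.

```python
def utilization_bucket(mbps, link_rate_mbps):
    """Return the bucket index (0 for 0+% through 9 for 90+%) for one port."""
    if link_rate_mbps <= 0:
        raise ValueError("link rate must be positive")
    percent = 100.0 * mbps / link_rate_mbps
    # Ten buckets in 10% increments; anything at or above 90% lands in the last.
    return min(max(int(percent // 10), 0), 9)


def bucket_counts(ports):
    """Count ports per bucket; ports is an iterable of (mbps, link_rate_mbps)."""
    counts = [0] * 10
    for mbps, link_rate_mbps in ports:
        counts[utilization_bucket(mbps, link_rate_mbps)] += 1
    return counts


# Two idle ports at an assumed 12500 MBps link rate.
print(bucket_counts([(0, 12500), (0, 12500)]))
```

With two idle ports, both fall into the 0+% bucket, which corresponds to the `2 0 0 0 0 0 0 0 0 0` row under Buckt in the example screens.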
To view bandwidth utilization, perform the following steps:
Log in to the server as root.
At the command prompt, enter opatop. The Summary screen is displayed.
Determine which set of statistics you want to view:
To view Group information, continue to the next step.
To view VF information, type V.
Type the number for the specific group statistics that you want to view:
For Port Group:
0 – All
1 – SuperNICs
2 – SWs
For VF Group:
0 – Default
1 – Admin
The Info Select screen is displayed as shown in the following example.
opatop: Img: 10s @ Fri Sep 23 09:44:49 2016, Live Group Info Sel: HFIs Int NumPorts: 2 Rate Min: 100g Max: 100g Ext NumPorts: 0 Group Performance (P) Group Statistics (S) Group Config (C) Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | P S C:
Type P. The Bandwidth (BW) Util screen is displayed as shown in the following example.
opatop: Img: 10s @ Fri Sep 23 09:46:09 2016, Live Group BW Util: HFIs Criteria: Util-High Number: 10 Int: TotMBps AvgMBps MinMBps MaxMBps TotKPps AvgKPps MinKPps MaxKPps 0 0 0 0 0 0 0 0 Buckt 0+% 10+% 20+% 30+% 40+% 50+% 60+% 70+% 80+% 90+% 2 0 0 0 0 0 0 0 0 0 NoResp Int Ports: PMA: 0 Topo: 0 Max 0+% 25+% 50+% 75+% 100+% Int Congestion 0 2 0 0 0 0 Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | cC N0-n Detail:
To set the BW stats Criteria for the focus query, type c (lowercase) to scroll forward or C (uppercase) to scroll in reverse to select one of the following choices:
Util-High – Bandwidth Utilization (highest first)
UtlPkt-Hi – Packet Utilization (highest first)
Util-Low – Bandwidth Utilization (lowest first)
VF-Ut-Hi – VF Bandwidth Utilization (highest first)
VF-Pkt-Hi – VF Packet Utilization (highest first)
VF-Ut-Low – VF Bandwidth Utilization (lowest first)
To change the Number of entries in the BW stats list, type N and enter the target number of entries; then press Enter.
Type D to initiate the group focus query and access the detailed Group Focus screen (refer to Viewing Focus Information).
Type u (lowercase) for each screen you've accessed until you are back to the screen you want.
The table below describes the Bandwidth Utilization screen fields.
Field | Description |
|---|---|
Group Name | Name of the group examined. For Port Groups:
For Virtual Fabrics Group:
|
Criteria | Focus criterion for Group Focus screen:
Focus criterion for VF Group Focus screen:
|
Number | Number of ports for a group focus query. |
Performance Data Subgroup | Performance statistics for each port group are further divided into up to three subgroups based on whether a port's neighbor port is in its group:
|
Statistics | For each group, the following statistics are reported:
|
Performance Buckets | Count the number of ports whose 'MBps compared to link rate' value corresponds to that bucket. This provides an indication of how the data rate of the group compares to its potential. Ten buckets from 0+% to 90+%, in 10% increments |
NoResp Ports | No Response Ports per subgroup: |
Congestion buckets | Provides context (from the Statistics Screen)
|
Viewing Statistics Category
The Statistics Category screen displays statistics for a port group.
To view the statistics category, perform the following steps:
Log in to the server as root.
At the command prompt, enter opatop. The Summary screen is displayed.
Determine which set of statistics you want to view:
To view Group information, continue to the next step.
To view VF information, type V.
Type the number for the specific group statistics that you want to view:
For Port Group:
0 – All
1 – SuperNICs
2 – SWs
For VF Group:
0 – Default
1 – Admin
The Info Select screen is displayed as shown in the following example.
opatop: Img: 10s @ Fri Sep 23 09:44:49 2016, Live Group Info Sel: HFIs Int NumPorts: 2 Rate Min: 100g Max: 100g Ext NumPorts: 0 Group Performance (P) Group Statistics (S) Group Config (C) Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | P S C:
Type S. The Category (Ctg) Stats screen is displayed as shown in the following example.
opatop: Img: 10s @ Fri Sep 23 11:55:09 2016, Live Group Ctg Stats: HFIs Criteria: Integ Number: 10 Int Max 0+% 25+% 50+% 75+% 100+% Integrity 0 2 0 0 0 0 Congestion 0 2 0 0 0 0 SmaCongest 0 2 0 0 0 0 Bubble 0 2 0 0 0 0 Security 0 2 0 0 0 0 Routing 0 2 0 0 0 0 Utilization: 0.0% Discards: 0.0% Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | cC N0-n Detail:
To set the category stats Criteria for the focus query, type c (lowercase) to scroll forward or C (uppercase) to scroll in reverse to select one of the following choices:
Integrity category (highest first)
Congestion category (highest first)
SmaCongestion category (highest first)
Bubble category (highest first)
Security category (highest first)
Routing category (highest first)
VF Congestion category (highest first)
VF Bubble category (highest first)
To change the Number of entries in the Err Stats list, type N and enter the target number of entries; then press Enter.
Type D to initiate the group focus query and access the detailed Group Focus screen (refer to Viewing Focus Information).
Type u (lowercase) for each screen you've accessed until you are back to the screen you want.
The following table describes the Statistics Category screen fields.
Field | Description |
|---|---|
Group Name | Name of the group examined. For Port Groups:
For Virtual Fabrics Group:
|
Criteria (Statistics Categories) | Focus criteria/statistics categories:
The integrity and congestion error values are calculated by using a weighted sum. The weights for each and the threshold value for each error category can be seen in the PM Configuration screen (Viewing the PM Configuration). |
Number | Number of entries for a group focus query. |
Performance Data Subgroup | Performance statistics for each port group are further divided into up to three subgroups based on whether a port's neighbor port is in its group:
|
Int or Ext | Location of the port in relation to the group.
|
Category buckets | For each subgroup within a category, there are five histogram buckets. Each bucket has a width of 25% (0+%, 25+%, etc.) with the last bucket width for beyond the threshold (100+%). A bucket is used to measure the number of ports whose category value, when compared to the threshold, falls within the range of the bucket. This provides an indication of how counter rates compare to their thresholds.
|
Utilization | Percent of error utilization; aids congestion analysis. |
Discards | Percent of errors discarded; aids congestion analysis. |
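The weighted sum and the five category buckets described in the table above can be illustrated with a short sketch. The exact calculation used by the PM is not given in this section, so the code below makes a simplifying assumption: each weight is treated as a percentage applied to its counter value, and the result is compared to the category threshold. The function names are hypothetical; the weights and threshold are the Congestion values from the example PM Configuration screen.

```python
BUCKET_LABELS = ["0+%", "25+%", "50+%", "75+%", "100+%"]

# Congestion weights and threshold as shown in the example PM Config screen.
CONGESTION_WEIGHTS = {
    "Cong Discards": 100,
    "Rcv FECN": 5,
    "Rcv BECN": 1,
    "Mark FECN": 25,
    "Xmit Time Cong": 25,
    "Xmit Wait": 10,
}
CONGESTION_THRESHOLD = 100


def category_value(counters, weights):
    """Weighted sum of counters; weights are ASSUMED to be percentages."""
    return sum(counters.get(name, 0) * weight / 100 for name, weight in weights.items())


def category_bucket(value, threshold):
    """Index into BUCKET_LABELS for a category value relative to its threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    percent = 100.0 * value / threshold
    # Four buckets of 25% width, plus a last bucket for values at or beyond
    # the threshold (100+%).
    return min(max(int(percent // 25), 0), 4)


# Hypothetical counter values for one port.
counters = {"Cong Discards": 40, "Xmit Wait": 200}
value = category_value(counters, CONGESTION_WEIGHTS)
print(value, BUCKET_LABELS[category_bucket(value, CONGESTION_THRESHOLD)])
```

A port whose weighted value reaches or exceeds the threshold falls into the 100+% bucket, which is why that bucket is the one to check first when looking for problem ports.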
Viewing Configuration Information
The Configuration screen displays a list of the ports in a group, including the LID, port number, port GUID, and NodeDesc for each.
To view configuration information, perform the following steps:
Log in to the server as root.
At the command prompt, enter opatop. The Summary screen is displayed.
Determine which set of statistics you want to view:
To view Group information, continue to the next step.
To view VF information, type V.
Type the number for the specific group statistics that you want to view:
For Port Group:
0 – All
1 – SuperNICs
2 – SWs
For VF Group:
0 – Default
1 – Admin
The Info Select screen is displayed as shown in the example below.
opatop: Img: 10s @ Fri Sep 23 09:44:49 2016, Live Group Info Sel: HFIs Int NumPorts: 2 Rate Min: 100g Max: 100g Ext NumPorts: 0 Group Performance (P) Group Statistics (S) Group Config (C) Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | P S C:
Type C. The Config screen is displayed as shown in the example below.
opatop: Img: 10s @ Fri Sep 23 12:07:29 2016, Live Group Config: HFIs NumPorts: 2 Ix LIDx Port Node GUID 0x NodeDesc 0 0001 1 0011750101575300 phcppriv10 hfi1_0 1 0002 1 001175010157E443 phcppriv11 hfi1_0 Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | sS P0-n:
Type s (lowercase) to scroll forward or S (uppercase) to scroll backward through multiple screens of a long port list.
Type P and enter the target Ix number; then press Enter to view the Port Stats screen for the specified Ix (refer to Viewing Port Statistics).
Type u (lowercase) for each screen you've accessed until you are back to the screen you want.
The table below describes the Configuration screen fields.
Field | Description |
|---|---|
Group Name | Name of the group examined. For Port Groups:
For Virtual Fabrics Group:
|
NumPorts | Number of ports returned in the group configuration query. |
Ix | An index value that is used to select a port to view in the Port Stats screen. |
LIDx | LID information |
Port | Port Index. |
Node GUID 0x | Global Unique Identifier (GUID) for the Node. |
NodeDesc | Description of the node. |
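For readers who save the port list as text, the following sketch shows one way to split a row of the Config screen into the fields described above. It assumes whitespace-separated columns in the order shown in the example screen (Ix, LIDx, Port, Node GUID, NodeDesc) and that LIDx and Node GUID are hexadecimal, as the "x" and "0x" in the column headings suggest. The names `GroupConfigRow` and `parse_config_row` are hypothetical.

```python
from typing import NamedTuple


class GroupConfigRow(NamedTuple):
    ix: int
    lid: int
    port: int
    node_guid: int
    node_desc: str


def parse_config_row(line):
    """Parse one port row such as '0 0001 1 0011750101575300 phcppriv10 hfi1_0'."""
    # Split into at most five fields so that NodeDesc may contain spaces.
    ix, lid, port, guid, desc = line.split(None, 4)
    return GroupConfigRow(
        ix=int(ix),
        lid=int(lid, 16),        # LIDx: hexadecimal without a prefix (assumed)
        port=int(port),
        node_guid=int(guid, 16),  # Node GUID 0x: hexadecimal (assumed)
        node_desc=desc.strip(),
    )


row = parse_config_row("1 0002 1 001175010157E443 phcppriv11 hfi1_0")
print(row.ix, hex(row.lid), row.port, f"0x{row.node_guid:016X}", row.node_desc)
```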
Viewing Focus Information
The Focus information screen displays a list of the ports within a group, including the LID, port number, focus criterion, port GUID, and NodeDesc of each. If the port has a neighbor port, the same information is displayed for the neighbor.
Note
The Focus information screen is the same for VF and non-VF.
To view focus information, perform the following steps:
Log in to the server as root.
At the command prompt, enter opatop. The Summary screen is displayed.
Determine which set of statistics you want to view:
To view Group information, continue to the next step.
To view VF information, type V.
Type the number for the specific group statistics that you want to view:
For Port Group:
0 – All
1 – SuperNICs
2 – SWs
For VF Group:
0 – Default
1 – Admin
The Info Select screen is displayed.
Determine the Information Select menu to access:
To view the Focus information screen for BW Summary, type P.
To view the Focus information screen for Err Summary, type S.
Determine the Criteria for the focus query:
To set the BW stats Criteria for the focus query, type c (lowercase) to scroll forward or C (uppercase) to scroll in reverse to select one of the following choices:
Util-High – Bandwidth Utilization (highest first)
UtlPkt-Hi – Packet Utilization (highest first)
Util-Low – Bandwidth Utilization (lowest first)
VF-Ut-Hi – VF Bandwidth Utilization (highest first)
VF-Pkt-Hi – VF Packet Utilization (highest first)
VF-Ut-Low – VF Bandwidth Utilization (lowest first)
To set the category stats Criteria for the focus query, type c (lowercase) to scroll forward or C (uppercase) to scroll in reverse to select one of the following choices:
Integrity category (highest first)
Congestion category (highest first)
SmaCongestion category (highest first)
Bubble category (highest first)
Security category (highest first)
Routing category (highest first)
VF Congestion category (highest first)
VF Bubble category (highest first)
Type D. The Focus information screen is displayed as shown in the example below.
opatop: Img: 10s @ Fri Sep 23 13:03:09 2016, Live Group Focus: HFIs GrpNumPorts: 2 NumPorts: 1 Number: 10 Ix Util-High LIDx Port Node GUID 0x NodeDesc 0 0.0 0001 1 0011750101575300 phcppriv10 hfi1_0 <-> 0.0 0002 1 001175010157E443 phcppriv11 hfi1_0 Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | sS cC N0-n P0-n:
To change the criteria after accessing this screen, type c (lowercase) to scroll forward or C (uppercase) to scroll in reverse to select one of the following choices:
Util-High – Bandwidth Utilization (highest first)
UtlPkt-Hi – Packet Utilization (highest first)
Util-Low – Bandwidth Utilization (lowest first)
VF-Ut-Hi – VF Bandwidth Utilization (highest first)
VF-Pkt-Hi – VF Packet Utilization (highest first)
VF-Ut-Low – VF Bandwidth Utilization (lowest first)
Integrity category (highest first)
Congestion category (highest first)
SmaCongestion category (highest first)
Bubble category (highest first)
Security category (highest first)
Routing category (highest first)
VF Congestion category (highest first)
VF Bubble category (highest first)
To change the Number of entries in the focus list, type N and enter the target number of entries; then press Enter.
Type s (lowercase) to scroll forward or S (uppercase) to scroll backward through multiple screens of a long port list.
Type P and enter the target Ix number; then press Enter to view the detailed Port Stats screen (refer to Viewing Port Statistics).
Type u (lowercase) for each screen you've accessed until you are back to the screen you want.
The table below describes the Focus screen fields.
Field | Description |
|---|---|
Group Name | Name of the group examined. For Port Groups:
For Virtual Fabrics Group:
|
GrpNumPorts | Number of ports selected, as determined by the combination of group, criteria, and requested ports. |
NumPorts | Number of ports returned in the group configuration query. |
Number | Number of ports for a group focus query. |
Ix | An index value that is used to select a port to view in the Port Stats screen. |
Criteria | Limits the focus to specific port statistics. For BW stats (bandwidth statistics):
For Ctg stats (category statistics):
|
LIDx | LID information. |
Port | Port Index. NOTE: A symbol may be present as the first character of each line related to a port. This symbol indicates that a non-ideal condition was observed when calculating the relevant port's data. The possible conditions are: the PM was told to ignore this port ('~'), the PM failed to query this port ('!'), and the PM topology does not know this port’s identity ('?'). |
Node GUID 0x | Global Unique Identifier (GUID) for the Node. |
NodeDesc | Description of the node. |
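When reviewing captured Focus screen text, the leading condition symbols described in the Port field above can be separated from the rest of a row. The sketch below is illustrative only; `PORT_FLAGS` and `split_port_flag` are hypothetical names, and the sample row is made up by prefixing a row from the example screen with a symbol.

```python
# Condition symbols that may appear as the first character of a port line,
# as listed in the Port field description.
PORT_FLAGS = {
    "~": "the PM was told to ignore this port",
    "!": "the PM failed to query this port",
    "?": "the PM topology does not know this port's identity",
}


def split_port_flag(line):
    """Return (condition or None, remainder of the line without the symbol)."""
    flag = line[:1]
    if flag in PORT_FLAGS:
        return PORT_FLAGS[flag], line[1:].lstrip()
    return None, line


condition, rest = split_port_flag("!0 0.0 0001 1 0011750101575300 phcppriv10 hfi1_0")
print(condition)
print(rest)
```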
Viewing Port Statistics
The Port Statistics screen displays the performance and statistics counters for a specific LID and port.
To view port statistics, perform the following steps:
Log in to the server as root.
At the command prompt, enter opatop. The Summary screen is displayed.
Determine which set of statistics you want to view:
To view Group information, continue to the next step.
To view VF information, type V.
Type the number for the specific group statistics that you want to view:
For Port Group:
0 – All
1 – SuperNICs
2 – SWs
For VF Group:
0 – Default
1 – Admin
The Info Select screen is displayed.
Determine the Information Select menu to access:
To view the Port Stats screen for BW Summary, type P.
To view the Port Stats screen for Err Summary, type S.
To view the Port Stats screen for Configuration information, type C. If you are accessing the Port Stats screen from the Configuration information screen, skip to Step 7.
Determine the Criteria for the focus query as described in Viewing Bandwidth Utilization or Viewing Statistics Category.
Type D to access the Focus information screen. To make changes to the Focus information prior to accessing the Port Stats screen, refer to Viewing Focus Information.
Type P and enter the target Ix number; then press Enter to view the detailed Port Stats screen. The Port Stats screen is displayed.
Note
Neighbor port and link information is available only when the Port Stats screen is accessed through the Focus Information screen. It is not available through the Configuration information screen.
opatop: Img: 10s @ Fri Sep 23 14:07:40 2016, Live Port Stats: HFIs LID: 0x2 PortNum: 1 Rate: 100g MTU: 4096 NodeDesc: phcppriv11 hfi1_0 NodeGUID: 0x001175010157E443 Neighbor: phcppriv10 hfi1_0 LID: 0x1 PortNum: 1 Xmit: Data: 0 MB ( 63 Flits) Pkts: 1 Recv: Data: 0 MB ( 10 Flits) Pkts: 1 Multicast: Xmit Pkts: 0 Recv Pkts: 0 Integrity: | Congestion: Link Quality: 5 | Cong Discards: 0 Uncorrectable: 0 | Rcv FECN*: 0 Link Downed: 0 | Rcv BECN: 0 Lanes Down: 0 | Mark FECN: 0 Rcv Errors: 0 | Xmit Time Cong: 0 Excs Bfr Ovrn*: 0 | Xmit Wait: 0 FM Conf Err: 0 | Routing and Others: Lnk Err Recov: 0 | Rcv Sw Relay: 0 Loc Lnk Integ: 0 | Xmit Discards: 0 Security: | Bubble: Xmit Constrain: 0 | Xmit Wasted BW: 0 Rcv Constrain*: 0 | Xmit Wait Data: 0 SmaCongestion (VL15): | Rcv Bubble*: 0 Cong Discards: 0 Xmit Wait: 0 Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | Neighbor |
Type V to toggle between Port Stats and VF Port Stats screens.
Note
VF Port Stats information can only be accessed when you are viewing VF statistics (selected in Step 3).
opatop: Img: 10s @ Wed May 23 15:01:45 2018, Bkmk Now:Wed May 23 15:36:50 2018 VF Port Stats: Admin LID: 0x1 PortNum: 1 NodeDesc: hdwtpriv35.hd.intel.com NodeGUID: 0x001175010165AE75 Xmit: Data: 0 MB ( 1575 Flits) Pkts: 155 Recv: Data: 0 MB ( 19179 Flits) Pkts: 154 Congestion: Cong Discards: 0 | Rcv FECN*: 0 Mark FECN: 0 | Rcv BECN: 0 Xmit Wait: 0 | Xmit Time Cong: 0 Bubble: Xmit Wasted BW: 0 | Rcv Bubble*: 0 Xmit Wait Data: 0 | Routing and Others: Xmit Discards: 0 Counters may be shared between Virtual Fabrics Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | vV:
Type N to switch between statistics for the port and its neighbor port.
Type u (lowercase) for each screen you've accessed until you are back to the screen you want.
The table below describes the Port Statistics screen fields.
Field | Description |
|---|---|
Group Name | Name of the group examined. For Port Groups:
For Virtual Fabrics Group:
|
LIDx | LID information for the node. |
PortNum | Port number of the node. |
Rate | Link rate. |
MTU | MTU, if available. |
NodeDesc | Description of the node. |
NodeGUID | Global Unique Identifier (GUID) for the node. |
Neighbor | Description of the neighboring node. |
Xmit Data | Size of the data transmitted in MB and Flits and the number of packets. |
Recv Data | Size of the data received in MB and Flits and the number of packets. |
Multicast: Xmit Pkts | Number of multicast packets transmitted. |
Multicast: Recv Pkts | Number of multicast packets received. |
Statistics Counters |
A trailing asterisk (*) on the counter name indicates the count will be used in computing Statistics Category information for the neighbor port. |
Navigating PM Sweeps
The Fabric Performance Monitoring TUI allows you to access statistics from sequential PM sweeps (the PM keeps a history of previous sweep images) and queries the PM at a user-specified interval (10 seconds by default). Sweeps are accessed from the short term history database being recorded by the PM. This allows access to statistics from up to 24 hours in the past.
When the Fabric Performance Monitoring TUI queries for statistics for the most recent PM sweep, it is in “Live” mode. In Live mode, the data will change, at the opatop interval rate, as opatop queries new PM sweeps. At each screen (summary or detail), the data being displayed is refreshed for the current PM sweep.
A PM sweep can be in “frozen” mode. The data in a frozen sweep will not change, allowing the statistics to be examined in summary and detail screens.
The Fabric Performance Monitoring TUI allows you to navigate the focus to another sweep within the history of sweeps maintained by the PM. For the duration of focus on such a sweep, it will remain frozen. You can examine other screens for the selected image while in "Historic" mode. Navigation can be performed backward or forward, 1 or 5 sweeps at a time, to a specific time, to a bookmarked time, or back to live data.
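The backward and forward movement through the sweep history can be modeled as a cursor over an ordered list of sweeps. The sketch below is a conceptual model only, not opatop's implementation: the class `SweepCursor` is hypothetical, the keys follow the common input commands shown on the screens (r, R, f, F, and L), and stopping at the oldest or newest sweep instead of moving past it is an assumption.

```python
# Number of sweeps moved by each navigation key.
STEPS = {"r": -1, "R": -5, "f": 1, "F": 5}


class SweepCursor:
    """Position within a history of sweeps, oldest at index 0."""

    def __init__(self, sweep_count):
        if sweep_count < 1:
            raise ValueError("history must contain at least one sweep")
        self.newest = sweep_count - 1
        self.index = self.newest  # start on the most recent sweep
        self.live = True

    def press(self, key):
        """Apply one navigation key and return the resulting sweep index."""
        if key == "L":
            self.index, self.live = self.newest, True
        elif key in STEPS:
            # Assumed behavior: stop at either end of the history.
            self.index = max(0, min(self.newest, self.index + STEPS[key]))
            self.live = False  # a selected historical sweep stays frozen
        else:
            raise ValueError(f"unknown navigation key: {key!r}")
        return self.index


# 24 hours of history at a 10-second interval would be 8640 sweeps.
cursor = SweepCursor(24 * 60 * 60 // 10)
for key in ["r", "R", "f", "L"]:
    print(key, cursor.press(key), "live" if cursor.live else "frozen")
```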
To navigate the historical PM sweeps, perform the following steps:
Navigate to the screen that you want to analyze historically.
The Image Identification line in the example below shows the time of the frozen image (after Img:) and the current, ongoing time (after Now:).
opatop: Img: 10s @ Fri Sep 23 17:32:32 2016, Hist Now:Fri Sep 23 17:33:08 2016
Port Stats: HFIs LID: 0x1 PortNum: 1 Rate: 100g MTU: 4096
NodeDesc: phcppriv10 hfi1_0 NodeGUID: 0x0011750101575300
Neighbor: phcppriv11 hfi1_0 LID: 0x2 PortNum: 1
Xmit: Data: 0 MB ( 10 Flits) Pkts: 1
Recv: Data: 0 MB ( 63 Flits) Pkts: 1
Multicast: Xmit Pkts: 0 Recv Pkts: 0
Integrity: | Congestion:
Link Quality: 5 | Cong Discards: 0
Uncorrectable: 0 | Rcv FECN*: 0
Link Downed: 0 | Rcv BECN: 0
Lanes Down: 0 | Mark FECN: 0
Rcv Errors: 0 | Xmit Time Cong: 0
Excs Bfr Ovrn*: 0 | Xmit Wait: 0
FM Conf Err: 0 | Routing and Others:
Lnk Err Recov: 0 | Rcv Sw Relay: 0
Loc Lnk Integ: 0 | Xmit Discards: 0
Security: | Bubble:
Xmit Constrain: 0 | Xmit Wasted BW: 0
Rcv Constrain*: 0 | Xmit Wait Data: 0
SmaCongestion (VL15): | Rcv Bubble*: 0
Cong Discards: 0
Xmit Wait: 0
Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | Neighbor |
Choose from the following options:
Type r (lowercase) to go back one sweep at a time.
Type R (uppercase) to go back five sweeps at a time.
Type f (lowercase) to move ahead one sweep at a time.
Type F (uppercase) to move ahead five sweeps at a time.
Type t<time> and press Enter to go to a sweep at a specific time. Allowed input formats for <time> include:
# [hour(s)/minute(s)/second(s)] ago (for example, t 1 hour ago)
YYYY:MM:DD HH:MM:SS (for example, t 2019:07:11 12:00:12)
Type b to move to the most recently bookmarked image.
Type L to return to the Live data.
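The two <time> input formats listed above can be illustrated with a short Python sketch. This is not opatop's own parser: the function name parse_sweep_time and the decision to accept only singular or plural hour, minute, and second units are assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta

# Maps the unit word typed by the user to the matching timedelta keyword.
_UNITS = {"hour": "hours", "minute": "minutes", "second": "seconds"}


def parse_sweep_time(text, now):
    """Return the datetime named by a <time> string (hypothetical helper).

    Accepts "# [hour(s)/minute(s)/second(s)] ago" or "YYYY:MM:DD HH:MM:SS".
    Raises ValueError if the text matches neither format.
    """
    text = text.strip()
    # Relative form, for example "1 hour ago" or "30 seconds ago".
    match = re.fullmatch(r"(\d+)\s+(hour|minute|second)s?\s+ago", text)
    if match:
        amount = int(match.group(1))
        unit = _UNITS[match.group(2)]
        return now - timedelta(**{unit: amount})
    # Absolute form, for example "2019:07:11 12:00:12".
    return datetime.strptime(text, "%Y:%m:%d %H:%M:%S")
```

Both example inputs from the list, 1 hour ago and 2019:07:11 12:00:12, resolve to the same instant when the current time is 13:00:12 on 11 July 2019.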
Bookmarking a Sweep
The Fabric Performance Monitoring TUI allows you to bookmark a sweep to review the information. For the duration of the Bookmark, all information is frozen. You can navigate through the various screens to review the frozen information. The sweep will remain frozen until you explicitly "Unbookmark" it.
Note
opatop allows only one sweep at a time to be bookmarked.
To bookmark a PM sweep, perform the following steps:
Navigate to the screen you want to capture and analyze.
Type B (uppercase) to bookmark the screen.
In the Image Identification line (line 1), the Live image changes to Bkmk (bookmark) as highlighted in bold in the example screen below.
opatop: Img: 10s @ Fri Sep 23 16:44:42 2016, Bkmk Now:Fri Sep 23 16:44:53 2016
Summary: SW: 0 Ports: SW: 0 HFI: 2 Link: 1
SM: 1 Node NRsp: 0 Skip: 0 Port NRsp: 0 Skip: 0
AvgMBps MinMBps MaxMBps AvgKPps MinKPps MaxKPps
0 All Int 0 0 0 0 0 0
Integ:min Congst:min SmaCong:min Bubble:min Secure:min Routing:min
1 HFIs Int 0 0 0 0 0 0
Integ:min Congst:min SmaCong:min Bubble:min Secure:min Routing:min
2 SWs No ports in group
Master-SM: LID: 0x0001 Port: 1 Priority: 0 State: Master
Name: phcppriv10 hfi1_0
PortGUID: 0x0011750101575300
Secondary-SM: none
Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | sS Pmcfg Imginfo View 0-n:
The bookmark will remain until you explicitly remove it.
Type L to return to the Live data.
Type b (lowercase) to return to the bookmarked image.
To remove a bookmark from a PM sweep, perform the following steps:
Type b (lowercase) to return to the bookmarked image.
Type U (uppercase).
In the Image Identification line (line 1), the Bkmk image changes back to Live, as highlighted in bold in the example screen below.
opatop: Img: 10s @ Fri Sep 23 16:49:52 2016, Live
Summary: SW: 0 Ports: SW: 0 HFI: 2 Link: 1
SM: 1 Node NRsp: 0 Skip: 0 Port NRsp: 0 Skip: 0
AvgMBps MinMBps MaxMBps AvgKPps MinKPps MaxKPps
0 All Int 0 0 0 0 0 0
Integ:min Congst:min SmaCong:min Bubble:min Secure:min Routing:min
1 HFIs Int 0 0 0 0 0 0
Integ:min Congst:min SmaCong:min Bubble:min Secure:min Routing:min
2 SWs No ports in group
Master-SM: LID: 0x0001 Port: 1 Priority: 0 State: Master
Name: phcppriv10 hfi1_0
PortGUID: 0x0011750101575300
Secondary-SM: none
Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | sS Pmcfg Imginfo View 0-n:
Using the opatop Command Line Options
When you start the Fabric Performance Monitoring TUI with opatop, you can use the following command line options:
opatop [-v] [-q] [-h hfi] [-p port] [--timeout] [-i seconds]
--help: Produces full help text.
-v/--verbose level: Specifies the verbose output level. Value is additive and includes:
1: Screen
4: STDERR opatop
16: STDERR PaClient
-q/--quiet: Disables progress reports.
-h/--hfi hfi: Specifies the SuperNIC, numbered 1..n. Using 0 specifies that the -p port value is a system-wide port number. Default is 0.
-p/--port port: Specifies the port, numbered 1..n. Using 0 specifies the first active port. Default is 0.
--timeout: Specifies the timeout (response wait time) in ms. Default is 1000 ms.
-i/--interval seconds: Interval in seconds at which PA queries are performed to refresh to the latest PA image. Default is 10 seconds.
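Because the verbose level is additive, a value such as 5 combines Screen (1) and STDERR opatop (4). The Python sketch below decodes a level into its components. The helper name decode_verbose is hypothetical, and treating the level as a bitmask is an assumption that follows from the listed values being distinct powers of two.

```python
# Verbose components as listed for the -v/--verbose option.
VERBOSE_FLAGS = {1: "Screen", 4: "STDERR opatop", 16: "STDERR PaClient"}


def decode_verbose(level):
    """Return the names of the outputs enabled by an additive verbose level."""
    # Assumption: each listed value is an independent bit in the level.
    return [name for bit, name in sorted(VERBOSE_FLAGS.items()) if level & bit]
```

For example, decode_verbose(21) lists all three outputs, since 21 = 1 + 4 + 16.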
The -h and -p options permit a variety of selections:
-h 0: First active port in the system (Default).
-h 0 -p 0: First active port in the system.
-h x: First active port on SuperNIC x.
-h x -p 0: First active port on SuperNIC x.
-h 0 -p y: Port y within the system (no matter which ports are active).
-h x -p y: SuperNIC x, port y.
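The selection rules above can be modeled with a small Python sketch. It is only an illustration of the rules and not the tool's implementation. The function name resolve_port, the list-of-lists input, and the assumption that a system-wide port number counts ports across SuperNICs in order are all hypothetical.

```python
def resolve_port(hfis, hfi=0, port=0):
    """Return (hfi, port), both 1-based, or None if nothing matches.

    hfis is a list with one entry per SuperNIC; each entry is a list of
    booleans saying whether that port is active (illustrative model only).
    """
    def first_active(index):
        # First active port on SuperNIC number `index` (1-based).
        for p, active in enumerate(hfis[index - 1], start=1):
            if active:
                return (index, p)
        return None

    if hfi == 0 and port == 0:
        # -h 0 [-p 0]: first active port in the system.
        for h in range(1, len(hfis) + 1):
            found = first_active(h)
            if found:
                return found
        return None
    if hfi == 0:
        # -h 0 -p y: system-wide port y, regardless of port state.
        # Assumption: ports are counted across SuperNICs in order.
        remaining = port
        for h, ports in enumerate(hfis, start=1):
            if remaining <= len(ports):
                return (h, remaining)
            remaining -= len(ports)
        return None
    if not 1 <= hfi <= len(hfis):
        return None
    if port == 0:
        # -h x [-p 0]: first active port on SuperNIC x.
        return first_active(hfi)
    # -h x -p y: exactly SuperNIC x, port y.
    return (hfi, port) if 1 <= port <= len(hfis[hfi - 1]) else None
```

With two dual-port SuperNICs where only the first port of the first SuperNIC is inactive, the defaults select SuperNIC 1, port 2.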
7.2.5.1.3. Top Level Data
Top-level data is the high-level view of the data you can gather from PA attributes. You can "drill down" to get information that is more specific, such as a list of ports.
Fabric Configuration and PM Image Information
From the PA, you can access fabric configuration and PM Image information. This data shows general information about the PM Image, including a unique 64-bit ID that you can use to access all the data collected for this PM Image.
The ImageInfo query can provide additional information such as:
Basic topology information includes the number of SuperNICs, Switch Nodes, and Ports. Also, you can view topology information about the Primary (Main) SM and the Secondary (Standby) SM, if present.
PM sweep data includes the start time, sweep duration, and the time over which this Image is valid, as well as the number of ports and nodes that had failures and for which data was not gathered.
If you see No Response Nodes and No Response Ports, review the FM's Log to find out what failures are causing these ports to be unsuccessful.
Likely reasons for No Response Nodes and No Response Ports are timeouts due to port bounces or reboots. This happens because the PM sweep may already be underway and is using the most recently completed SM sweep's topology data, which may not include the bounced port's new port state. If this happens only once, it can usually be ignored. If it is an intermittent or recurring issue, that port or its neighbor will likely appear to have integrity issues, and you should drill down and get more data on the offending ports.
If you see Unexpected Clears, review the FM's Log to find out what ports and what counters are being unexpectedly cleared.
CLI Tools such as opapmaquery and opareport can clear PMA counters and can trigger this, so check with other users first. Additionally, a reboot of the node may also reset the counters.
PM Port Group’s Performance Utilization and Statistical Data
From the PA, you can access PM Port Group performance utilization and statistical data, which aggregates the data of all the links within a PM Port Group.
A port's Performance data falls within one of three subgroups, depending on which ends of the link are within the PM Port Group: both the port and its neighbor (Internal subgroup), only the port itself (Send subgroup), or only its neighbor (Receive subgroup). The performance subgroup data has three subsections: the ten-bucket utilization percentage histogram, the performance statistics, and the no response ports counters.
The Statistical data available is divided into two subgroups: Internal and External. Each subgroup has the following subsections: a five-bucket histogram and a maximum value field for each of the six PA Categories.
If you see ports in the higher percentage buckets, it means that those ports are experiencing high values of that Category. The values for each bucket represent the number of ports that are "binned" within that percentage range (bucket).
If you see ports in the higher buckets for the Integrity category, then you will need to drill down further to find out what ports are experiencing Integrity issues. Note that reboots and general fabric maintenance (such as moving systems, replacing cables, etc.) can create false positives. You may want to verify if this issue recurs after a planned interruption is over before continuing to drill down to gather more data.
If you see ports in the higher percentage buckets for the Congestion category, check whether a node is being overloaded by the jobs it is running or by a lack of allocated resources. Also, make sure you are using an appropriate Routing algorithm for your fabric. In a more serious situation, you may have to investigate the traffic pattern of the application, ISL resources, or over-subscription in the fabric. You can drill down to find out which ports are congested and identify which resources may be over-utilized and need to be redistributed.
If there are ports in the SMA Congestion higher percentage buckets, then you should check the SM and verify the configuration. SMA congestion is congestion specific to SM-only traffic and should happen only under extreme conditions.
PMA No Response Ports are the same No Response Ports from the fabric configuration data, only limited in scope to the specific PM Port Group. PMA No Response ports are usually one-offs and follow the same steps as no response ports from fabric configuration (refer to section Fabric Configuration and PM Image Information).
Topology Incomplete ports are extremely rare. These are ports that should have had an active neighbor (all but Switch Port Zero), but do not. This is usually an indication that the SM has an inaccurate topology. Forcing an SM re-sweep may clear this error if no other errors are occurring. Otherwise, you will have to drill down, find the Neighbor Port information, and manually bounce the link.
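The histograms described in this section count how many ports fall into each percentage range. The Python sketch below shows one way such binning could work. The bucket boundaries are assumptions: ten equal 10% buckets for utilization and five buckets (0-24%, 25-49%, 50-74%, 75-99%, and 100% or more of a threshold) for a PA Category. The function names and the threshold parameter are hypothetical.

```python
def utilization_bucket(percent):
    """Index (0..9) of the 10%-wide bucket holding a utilization percentage."""
    return min(int(percent // 10), 9)


def category_bucket(value, threshold):
    """Index (0..4) of the bucket for a category value.

    Assumption: buckets cover 0-24%, 25-49%, 50-74%, 75-99%, and 100%+
    of the threshold configured for the category.
    """
    percent = 100 * value / threshold
    return min(int(percent // 25), 4)


def histogram(bucket_indexes, size):
    """Count how many ports landed in each bucket."""
    counts = [0] * size
    for index in bucket_indexes:
        counts[index] += 1
    return counts
```

A port count in the last bucket of a category histogram is the signal to drill down, as described above for the Integrity and Congestion categories.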
7.2.5.1.4. Mid-Tier Data
Mid-tier data contains statistical information for the link-level perspective. The data provides a sorted list of links that you can use to drill down further to get each port's exact Port Counter values (if available). Links are sorted based on criteria such as PA Category values and utilization metrics.
Sorted Lists of Links based on Statistical Criteria
After choosing a criterion (Utilization or PA Category), a sorted list of links is formed, which is ordered by the value of the criterion for both ports of the link. Switch Port Zero evaluates as a Link with no neighbor.
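As a rough illustration of this ordering, the Python sketch below sorts links by the larger of the two port values for the chosen criterion and treats a port without a neighbor (such as Switch Port Zero) as a link with one value. Using the maximum of the two ports as the sort key is an assumption, and the dictionary layout and the name sort_links are hypothetical.

```python
def sort_links(links, criterion):
    """Return links ordered from highest to lowest criterion value.

    Each link is a (port, neighbor) pair of dictionaries; neighbor is None
    for a port with no neighbor, such as Switch Port Zero.
    """
    def link_value(link):
        port, neighbor = link
        values = [port[criterion]]
        if neighbor is not None:
            values.append(neighbor[criterion])
        # Assumption: a link ranks by the worse of its two ports.
        return max(values)

    return sorted(links, key=link_value, reverse=True)
```

The links at the top of the resulting list are the first candidates for drilling down to per-port counters.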
Depending on the type of criterion, the number of offending ports, and their location in the fabric, several conditions are possible. While not always needed, more specific port counter values can be gathered at the per-port level when drilling down to the bottom tier.
Having just one link with an integrity issue may have several causes.
One of the more common and benign causes is that the other side of the link bounced during the PM Sweep and appeared to be down (LinkQualityIndicator = 0 [Down]). This can also be seen in the No Response Ports values described in the sections Fabric Configuration and PM Image Information and PM Port Group’s Performance Utilization and Statistical Data. As this is usually a one-off, it can be ignored.
If this issue is not planned and is recurring, the port may be experiencing a Signal Integrity issue. Degradation of Signal Integrity has several possible causes. A good first step is to check the cable's connections and the quality of the cable itself.
Several FastFabric tools, such as opalinkanalysis, can help you identify links that may be misconfigured or operating at slower speeds. If one end of the link is attached to a SuperNIC, you can try to access the node's syslog and see if the SuperNIC is reporting any errors. In addition, you can "drill down" further to view the individual Port Counter values.
When multiple links have a non-zero value, you will need to determine their location in the fabric and group the links by proximity and purpose (compute, storage, etc.). For any links that cannot be grouped, you can follow the same process as described in If You See One Link With A Non-Zero Integrity Value. However, links that are in close proximity or share a similar purpose may have related causes and may require a slight change to the manner in which the shared issue is debugged.
An example would be if the group of links was attached to the same switch, then there may be an issue with the connections going to and from the switch, or, more likely, the environment around the switch is contributing in some way (power, heating, etc.). In this case, you should first verify the switch's physical state is as expected. You can use a tool such as opaswitchadmin or opachassisadmin to gather data such as power supply status and temperature. If the group was all storage nodes, then perhaps the issue is related to the storage devices or software.
If you see a single link with a high congestion value and the link is an ISL, then you may need to investigate the application's design, configuration, and placement in the fabric. Next, you should check the over-subscription ratio and verify the topology of the fabric. Tools such as opareport, opaextractmissinglinks, and opaxlattopology can be used to verify the current topology against a predefined topology configuration file. If the link is attached to a SuperNIC, you may need to alter what jobs are being run on that node and see if it may be overloaded. The Congestion value for a port is adjusted based on the Utilization of the link to give a more accurate display of Congestion on the port.
However, if the link does not have a high Utilization but still has high Congestion, the link may be experiencing a more serious issue. You can "drill down" further to view the individual Port Counter values to better identify the issue.
7.2.5.1.5. Lowest Tier Data
PA Port Counters are the lowest tier of data available through the PA User interface.
Individual Port Counter Data
The lowest level in the PA is the Port Counters Data. Depending on the options of the request, the response values can be either the delta between this Image and the previous Image or the raw counter values obtained during a PM sweep. From these values, you can see the individual Port Counters that were used to compute the PA Categories at the higher levels.
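The difference between delta and raw values can be shown with a minimal Python sketch. It subtracts the raw counters of the previous Image from those of the current one. The function name and the dictionary layout are assumptions, and the sketch does not handle counters that were cleared or that wrapped between sweeps.

```python
def counter_delta(previous, current):
    """Return per-counter deltas between two raw counter snapshots.

    previous and current map counter names to raw values from two
    consecutive Images. Counters missing from previous count from zero.
    """
    return {name: value - previous.get(name, 0) for name, value in current.items()}
```

For example, raw Xmit Pkts values of 100 and then 130 in consecutive Images give a delta of 30 for the later Image.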
Refer to the CN5000 Maintenance and Troubleshooting Guide, which provides information on all of the counters, their categories, and how they are computed into each category. This document also includes the rationale behind the computation and inclusion of the counters in their respective categories.
