Cornelis Technical Documentation

7.2.5. Performance Monitoring

This section provides information on the performance of the fabric using data from the Performance Monitoring tool as well as the Performance Manager parameters.

7.2.5.1. Monitoring Fabric Performance

The opatop command starts the Fabric Performance Monitoring TUI so that you can monitor the performance of the fabric.

The Fabric Performance Monitor TUI displays performance, congestion, and statistical information about a fabric. Fabric information is divided into two main starting points for analyzing fabric traffic:

  • Performance (bandwidth utilization): Can identify over-utilized areas (bottlenecks) and under-utilized areas (potentially misconfigured).

  • Statistics: Can identify problems in fabric hardware or configuration, as well as congestion and other performance situations.

This section describes:

  • The TUI menus used to gather Fabric Performance data.

  • What to do with the data you have gathered.

7.2.5.1.1. Accessing the Fabric Performance Monitor

The Fabric Performance Monitor allows you to monitor performance, congestion, and statistics information in a fabric.

Using the opatop Command

To start up the Fabric Performance Monitor from the command prompt, perform the following steps:

  1. Log in to the server as root.

  2. At the command prompt, enter opatop.

    The Fabric Performance Monitor Summary screen is displayed.

    opatop: Img: 10s @ Wed Sep 14 11:29:52 2016, Live
    Summary:  SW:     0 Ports: SW:     0  HFI:     2       Link:     1
              SM:     1 Node NRsp:     0 Skip:     0 Port NRsp:     0 Skip:     0
                        AvgMBps   MinMBps   MaxMBps   AvgKPps   MinKPps   MaxKPps
    0 All         Int         0         0         0         0         0         0
          Integ:min  Congst:min  SmaCong:min  Bubble:min  Secure:min  Routing:min
    1 HFIs        Int         0         0         0         0         0         0
          Integ:min  Congst:min  SmaCong:min  Bubble:min  Secure:min  Routing:min
    2 SWs         No ports in group
    
    
    
    
    
        Master-SM: LID: 0x0001 Port: 1   Priority: 0  State: Master
                   Name: phcppriv10 hfi1_0
                   PortGUID: 0x0011750101575300
     Secondary-SM: none
    
    
    
    Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help |
    sS Pmcfg Imginfo View 0-n:
    
From the Cornelis FastFabric OPA Tools Menu

To start up the Fabric Performance Monitor menu from the Cornelis FastFabric OPA Tools menu, perform the following steps:

  1. Log in to the server as root.

  2. At the command prompt, enter opafastfabric.

    The Cornelis FastFabric OPA Tools menu is displayed.

    Cornelis FastFabric OPA Tools
    Version: X.X.X.X.X
    
       1) Chassis Setup/Admin
       2) Externally Managed Switch Setup/Admin
       3) Host Setup
       4) Host Verification/Admin
       5) Fabric Monitoring
    
       X) Exit (or Q)
    
  3. At the cursor, type 5.

    The FastFabric OPA Fabric Monitoring menu is displayed.

     FastFabric OPA Fabric Monitoring Menu
    
    0) Fabric Performance Monitoring             [ Skip  ]
    
    P) Perform the Selected Actions              N) Select None
    X) Return to Previous Menu (or ESC or Q)
    
    Table 19. FastFabric OPA Fabric Monitoring Menu Descriptions

    Menu Item

    Description

    0) Fabric Performance Monitoring

    Allows you to access the TUI that monitors the performance, congestion, and statistics information about a fabric.

    Associated CLI Command: opatop



  4. Type 0 to toggle to the [Perform] option.

  5. Type P to perform the operation.

    The Fabric Performance Monitor information is displayed.

    opatop: Img: 10s @ Fri Sep 16 11:35:24 2016, Live
    Summary:  SW:     0 Ports: SW:     0  HFI:     2       Link:     1
              SM:     1 Node NRsp:     0 Skip:     0 Port NRsp:     0 Skip:     0
                        AvgMBps   MinMBps   MaxMBps   AvgKPps   MinKPps   MaxKPps
    0 All         Int         0         0         0         0         0         0
          Integ:min  Congst:min  SmaCong:min  Bubble:min  Secure:min  Routing:min
    1 HFIs        Int         0         0         0         0         0         0
          Integ:min  Congst:min  SmaCong:min  Bubble:min  Secure:min  Routing:min
    2 SWs         No ports in group
    
    
    
    
    
        Master-SM: LID: 0x0001 Port: 1   Priority: 0  State: Master
                   Name: phcppriv10 hfi1_0
                   PortGUID: 0x0011750101575300
     Secondary-SM: none
    
    
    
    Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help |
    sS Pmcfg Imginfo View 0-n:
    
7.2.5.1.2. How to Use the Fabric Performance Monitor TUI

The Fabric Performance Monitor TUI allows you to view and interact with live performance data.

Reading the TUI Screens

The figure below shows the major sections common to all Fabric Performance Monitor TUI screens.

Figure 83. Fabric Performance Monitor TUI Screen (Example)


Table 20. Fabric Performance Monitor TUI Descriptions

Section of Screen

Description

opatop

Refers to the CLI command that initiates the Fabric Performance Monitoring TUI.

NOTE: opatop may be used interchangeably with Fabric Performance Monitoring TUI within this manual.

Image Identification

Displays the following image (Img) information:

  • Image interval (II): The time over which this image data is relevant.

    • For in-memory images, this value is equal to the PM Sweep Interval.

    • For images stored on disk (Short Term History), the interval is equal to the sum of all the intervals for each image compounded into the composite (disk) image.

    NOTE: The interval can change when transitioning between images stored in memory and images stored on disk.

  • Timestamp for the image being displayed, in the format Day Month Date HR:MIN:SEC YYYY (for example, Wed Sep 14 11:29:52 2016).

    If a Live image is not being displayed, the current time ('Now:') is also shown.

  • Type of image

    • Live

    • Hist (History)

    • Bkmk (Bookmark)

Screen-Specific Information

Displays information and layout of the selected screen.

NOTE: Each screen is different and will be discussed in subsequent sections.

Common Input Commands

Displays the common input commands, which appear on every screen and perform the same action on each.

  • Q/q – Quit program

  • u – Up to the previous screen

  • L – Select Live image

  • r – Navigate reverse 1 sweep

  • R – Navigate reverse 5 sweeps

  • f – Navigate forward 1 sweep

  • F – Navigate forward 5 sweeps

  • t – Navigate to a specific time

  • b – Select (previously) bookmarked image

  • B – Bookmark the currently selected image

  • U – Unbookmark the image

  • ? – Help provides information about the screen contents and input commands.

Commands are case insensitive except where specifically noted otherwise.

The ENTER key must be pressed after multi-character commands and for Quit.

Screen-Specific Input Commands

Displays the screen-specific commands.
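The Image Identification fields described above can be read programmatically from screen text that has been captured from a terminal. The following is a minimal Python sketch, not part of opatop: the helper name parse_image_header and the regular expression are assumptions derived only from the example screens in this section, and the header of Hist or Bkmk images may carry additional text (such as the 'Now:' time) that this sketch ignores.

```python
import re
from datetime import datetime

# Hypothetical helper (not part of opatop). It parses the first line of a
# captured screen, for example:
#   opatop: Img: 10s @ Wed Sep 14 11:29:52 2016, Live
HEADER_RE = re.compile(
    r"^opatop: Img:\s*(?P<interval>\d+)s @ (?P<stamp>.+?), (?P<kind>Live|Hist|Bkmk)\b"
)


def parse_image_header(line):
    """Return the image interval, timestamp, and image type from a header line."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an opatop image header: {line!r}")
    return {
        # Image interval (II) in seconds.
        "interval_s": int(match.group("interval")),
        # Timestamp in the Day Month Date HR:MIN:SEC YYYY format.
        "timestamp": datetime.strptime(match.group("stamp"), "%a %b %d %H:%M:%S %Y"),
        # Type of image: Live, Hist, or Bkmk.
        "type": match.group("kind"),
    }


header = parse_image_header("opatop: Img: 10s @ Wed Sep 14 11:29:52 2016, Live")
print(header["interval_s"], header["timestamp"].isoformat(), header["type"])
```

The day and month names are parsed with the default (English) locale names, so the sketch assumes the capture was taken in that locale.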



Navigating the Screens

The Fabric Performance Monitoring TUI allows you to access various screens in a hierarchical manner to examine the state of a fabric. Through the screen-specific commands, each screen will provide access to the next screen or back to the parent screen.

The Fabric Performance Monitoring TUI screen navigational hierarchy is shown below.

Figure 84. Fabric Performance Monitoring TUI Navigation


As an example, if you want to navigate from the Group Info Sel screen to the Group BW Stats screen, perform the following steps:

  1. The Group Info Sel screen is shown below.

    opatop: Img: 10s @ Thu Sep 22 15:44:47 2016, Live
    Group Info Sel: HFIs
    Int NumPorts: 2  Rate Min: 100g  Max: 100g
    Ext NumPorts: 0
      Group Performance (P)
      Group Statistics (S)
      Group Config (C)
    
    
    
    Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | P S C:
    

    The selections for the next level of screens are displayed as:

      Group Performance (P)
      Group Statistics (S)
      Group Config (C)
    

    The menu options are shown in the screen-specific commands as:

    Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | P S C:
  2. From the Group Info Sel screen, enter P.

    The Group BW Stats screen is displayed.

    opatop: Img: 10s @ Thu Sep 22 15:52:27 2016, Live
    Group Performance: HFIs   Criteria: Util-High  Number: 10
    Int:  TotMBps  AvgMBps  MinMBps  MaxMBps      TotKPps  AvgKPps  MinKPps  MaxKPps
                0        0        0        0            0        0        0        0
         Buckt 0+%   10+%   20+%   30+%   40+%   50+%   60+%   70+%   80+%   90+%
                 2      0      0      0      0      0      0      0      0      0
         NoResp Int Ports: PMA:      0  Topo:      0
    
                          Max       0+%      25+%      50+%      75+%     100+%
    Int Congestion          0         2         0         0         0         0
    
    
    
    
    Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | cC N0-n Detail:
    
  3. Type u (lowercase) to return to the Group Info Sel screen.

  4. Type u (lowercase) to return to the Summary screen.

Important

To switch between Port and Virtual Fabric Grouping screens, press V at the Summary screen and navigate through the hierarchy.

Viewing the Fabric Performance Monitoring Summary Screen

The top-level Summary screen shows the basic fabric configuration information as well as performance and statistics information. This is the initial screen you see when you start up the TUI.

After looking at the Summary screen you can decide which area of the fabric (performance or statistics) and which port group or virtual fabric most warrants investigation, and can then drill down into that area.

To view the Fabric Performance Monitoring Summary screen, perform the following steps:

  1. Log in to the server as root.

  2. At the command prompt, enter opatop.

    The Summary screen is displayed.

    opatop: Img: 10s @ Wed Sep 14 11:29:52 2016, Live
    Summary:  SW:     0 Ports: SW:     0  HFI:     2       Link:     1
              SM:     1 Node NRsp:     0 Skip:     0 Port NRsp:     0 Skip:     0
                        AvgMBps   MinMBps   MaxMBps   AvgKPps   MinKPps   MaxKPps
    0 All         Int         0         0         0         0         0         0
          Integ:min  Congst:min  SmaCong:min  Bubble:min  Secure:min  Routing:min
    1 HFIs        Int         0         0         0         0         0         0
          Integ:min  Congst:min  SmaCong:min  Bubble:min  Secure:min  Routing:min
    2 SWs         No ports in group
    
    
    
    
    
        Master-SM: LID: 0x0001 Port: 1   Priority: 0  State: Master
                   Name: phcppriv10 hfi1_0
                   PortGUID: 0x0011750101575300
     Secondary-SM: none
    
    
    
    Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help |
    sS Pmcfg Imginfo View 0-n:
    
  3. To change to the Virtual Fabrics (VF) Summary screen, type V.

    The VF Summary screen is shown as in the example below.

    opatop: Img: 10s @ Thu Sep 22 15:20:07 2016, Live
    Summary:  SW:     0 Ports: SW:     0  HFI:     2       Link:     1
              SM:     1 Node NRsp:     0 Skip:     0 Port NRsp:     0 Skip:     0
                        AvgMBps   MinMBps   MaxMBps   AvgKPps   MinKPps   MaxKPps
    0 Admin       Int         0         0         0         0         0         0
         Integ:min  Congst:min  SmaCong:min  Bubble:min  Secure:min  Routing:min
    
    
    
    
    
        Master-SM: LID: 0x0001 Port: 1   Priority: 0  State: Master
                   Name: phcppriv10 hfi1_0
                   PortGUID: 0x0011750101575300
     Secondary-SM: none
    
    
    
    Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help |
    sS Pmcfg Imginfo View 0-n:
    
Summary Screen Field Descriptions

The table below describes the fields of the Summary screen.

Table 21. Summary Screen Field Descriptions

Field

Description

Fabric Configuration Information

Fabric configuration information includes:

  • Numbers of links

  • Numbers of switches (SW)

  • Numbers of SMs

  • Numbers of ports

  • Primary SM details

  • Secondary SM details (if present)

Performance and Statistics for Each Port Group

Fabric performance and statistics are presented based on port groupings and virtual fabrics grouping:

For Port Groups:

  • All - In the All group, all ports are Internal because, by definition, the neighbor port must be in the All group.

  • SuperNICs - In the SuperNICs groups, all neighbor ports are outside the group, so statistics are contained in the Send and Receive subgroups.

  • SWs - In the SWs group, neighbor ports are either outside the group (SuperNIC) or inside the group (another switch), so statistics are contained in all three subgroups. A special case for a switch port is the special switch port 0, which is always considered internal to the SWs group.

For Virtual Fabrics Group:

  • Admin

  • Default

These groups provide a natural subdivision of the ports in a fabric for analysis.

For each group, the following statistics are reported:

  • Average MBps (megabytes per second)

  • Minimum MBps

  • Maximum MBps

  • Average KPps (kilopackets per second)

  • Minimum KPps

  • Maximum KPps

  • Status indicator

Performance Utilization

Performance Utilization for each port group is divided into up to three subgroups based on whether a port's neighbor port is in its group:

  • Internal - If a port's neighbor port is in its group, all performance statistics are contained in the Internal subgroup.

  • Send - If a port's neighbor is not in its group, statistics for data leaving the port (group) are contained in the Send subgroup.

  • Receive - If a port's neighbor is not in its group, statistics for data entering the port are contained in the Receive subgroup.

Statistics Categories

The statistics categories are:

  • Integ – Integrity

  • Congst – Congestion

  • Bubble – Idles due to congestion

  • SmaCong – SMA Congestion

  • Secure – Security

  • Routing – Routing

Statistics categories are each based on one or more port counters. Each statistics category’s status indicator is shown at one of five values/colors based on the category value as compared to a threshold value:

  • Minimum – green

  • Low – blue

  • Moderate – cyan

  • Warning – yellow

  • OVER – red



Viewing the PM Configuration

The PM Configuration screen displays information as provided by the PM.

Note

  • The PM Configuration screen is the same for VF and non-VF.

  • The PM Configuration screen has no screen-specific input commands.

To view PM Configuration, perform the following steps:

  1. Log in to the server as root.

  2. At the command prompt, enter opatop.

    The Summary screen is displayed.

  3. Type p.

    The PM Configuration screen is displayed as shown in the example below.

    opatop: Img: 10s @ Thu Sep 22 15:23:17 2016, Live
    PM Config:
     Sweep Interval: 10 sec  PM Flags(0x33):
       ProcessHFICntrs=On ProcessVLCntrs=On ClrDataCntrs=Off Clr64bitErrCntrs=Off
       Clr32bitErrCntrs=On Clr8bitErrCntrs=On
     Max Clients:        3
     Total Images:      10   Freeze Images: 5         Freeze Lease: 60 seconds
     Ctg Thresholds: Integrity:       100  Congestion:     100
                     SmaCongest:      100  Bubble:         100
                     Security:         10  Routing:        100
     Integrity Wts:  Link Qual:        40  Uncorrectable:  100
                     Link Downed:      25  Rcv Errors:     100
                     Excs Bfr Ovrn:   100  FM Config Err:  100
                     Link Err Reco:   100  Loc Link Integ:   0
                     Lnk Wdth Dngd:   100
     Congest Wts:    Cong Discards:   100  Rcv FECN:         5
                     Rcv BECN:          1  Mark FECN:       25
                     Xmit Time Cong    25  Xmit Wait:       10
     PM Memory Size: 169 MB (169295080 bytes)
     PMA MADs: MaxAttempts:     3 MinRespTimeout:  35 RespTimeout:  250
     Sweep: MaxParallelNodes:  10 PmaBatchSize:     2 ErrorClear:     7
    
    Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help |
    
  4. Type u (lowercase) to return to the Summary Screen.

PM Configuration Screen Field Descriptions

The table below describes the fields of the PM Configuration screen.

Table 22. PM Configuration Field Descriptions

Field

Description

Sweep Interval

The time over which the image data is relevant. Default is 10 seconds.

NOTE: Normally, the opatop interval should be set to a value ≥ Sweep Interval.

PM Flags

Shows whether PM Flags are On or Off for:

  • ProcessHFICntrs

  • ProcessVLCntrs

  • ClrDataCntrs

  • Clr64bitErrCntrs

  • Clr32bitErrCntrs

  • Clr8bitErrCntrs

Max Clients

Maximum clients.

Total Images

  • Freeze Images

  • Freeze Lease time

Ctg Thresholds

Category thresholds:

  • Integrity – Integrity

  • Congestion – Congestion

  • Bubble - Idles due to congestion

  • SmaCongest – SMA Congestion

  • Security – Security

  • Routing – Routing

Integrity Wts

Integrity weights:

  • Link Qual

  • Uncorrectable

  • Link Downed

  • Rcv Errors

  • Excs Bfr Ovrn

  • FM Config Err

  • Link Err Reco

  • Loc Link Integ

  • Lnk Wdth Dngd

Congest Wts

Congestion weights:

  • Cong Discards

  • Rcv FECN

  • Rcv BECN

  • Mark FECN

  • Xmit Time Cong

  • Xmit Wait

PM Memory Size

Size of the PM memory footprint in MB and bytes.

PMA MADs

PMA MADs retry/timeout:

  • MaxAttempts

  • MinRespTimeout

  • RespTimeout

Sweep

Sweep information:

  • MaxParallelNodes

  • PmaBatchSize

  • ErrorClear
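The PM Flags and Sweep Interval fields listed above appear on the PM Configuration screen as plain text, so they can be extracted from a captured screen. The sketch below is illustrative only: parse_pm_flags and parse_sweep_interval are hypothetical helper names, and the patterns are inferred from the example screen in this section rather than from any documented output format.

```python
import re

# Text copied from the example PM Configuration screen in this section.
PM_CONFIG_SNIPPET = """\
 Sweep Interval: 10 sec  PM Flags(0x33):
   ProcessHFICntrs=On ProcessVLCntrs=On ClrDataCntrs=Off Clr64bitErrCntrs=Off
   Clr32bitErrCntrs=On Clr8bitErrCntrs=On
"""


def parse_pm_flags(text):
    """Map each 'Name=On' or 'Name=Off' token to a boolean (hypothetical helper)."""
    return {name: state == "On" for name, state in re.findall(r"(\w+)=(On|Off)\b", text)}


def parse_sweep_interval(text):
    """Return the sweep interval in seconds, or None if the field is absent."""
    match = re.search(r"Sweep Interval:\s*(\d+)\s*sec", text)
    return int(match.group(1)) if match else None


flags = parse_pm_flags(PM_CONFIG_SNIPPET)
sweep_interval = parse_sweep_interval(PM_CONFIG_SNIPPET)
print(sweep_interval, flags)
```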



Viewing Image Information

The Image Information screen shows the image information as provided by the PM.

Note

  • The Image Information screen is the same for VF and non-VF.

  • The Image Information screen has no screen-specific input commands.

To view Image Information, perform the following steps:

  1. Log in to the server as root.

  2. At the command prompt, enter opatop.

    The Summary screen is displayed.

  3. Type I.

    The Image Info screen is displayed as shown in the example below.

    opatop: Img: 10s @ Thu Sep 22 16:51:58 2016, Live
    Image Info:
     Sweep Start: Thu Sep 22 16:51:58 2016
     Sweep Duration: 0.001 Seconds
     Image Interval: 10 Seconds
    
     Num SW-Ports:       0  HFI-Ports:       2
     Num SWs:            0  Num Links:       1  Num SMs:         1
    
     Num NRsp Nodes:       0  Ports:       0  Unexpected Clear Ports: 0
     Num Skip Nodes:       0  Ports:       0
    
        Master-SM: LID: 0x0001 Port: 1   Priority: 0  State: Master
                   Name: phcppriv10 hfi1_0
                   PortGUID: 0x0011750101575300
     Secondary-SM: none
    
    
    
    
    
    
    Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help |
    
  4. Type u (lowercase) to return to the Summary Screen.

Image Information Screen Field Descriptions

The following table describes the fields of the Image Information screen.

Table 23. Image Information Field Descriptions

Field

Description

Sweep Start

Timestamp for the start of the sweep

Sweep Duration

Length of time for the sweep

Image Interval

The time over which the image data is relevant. Default is 10 seconds.

Num [Ports]

Number of ports in each group:

  • SW-Ports

  • SuperNIC-Ports

Num SWs

Number of switches

Node Information

Node information including:

  • No response nodes

  • Skipped nodes

Port Information

Port information including:

  • No response ports

  • Skipped ports

  • Unexpected clear ports

SM Information

Primary and secondary SM details

  • LID

  • Port

  • Priority

  • State

  • Name

  • PortGUID



Viewing Bandwidth Utilization

For each valid performance data subgroup, the Bandwidth Utilization screen displays the total, average, minimum, and maximum MBps and KPps. For each subgroup, ten performance 'buckets' count the number of ports whose 'MBps compared to link rate' value corresponds to that bucket. This provides an indication of how the data rate of the group compares to its potential.

To view bandwidth utilization, perform the following steps:

  1. Log in to the server as root.

  2. At the command prompt, enter opatop.

    The Summary screen is displayed.

  3. Determine which set of statistics you want to view:

    • To view Group information, continue to the next step.

    • To view VF information, type V.

  4. Type the number for the specific group statistics that you want to view:

    For Port Group:

    • 0 – All

    • 1 – SuperNICs

    • 2 – SWs

    For VF Group:

    • 0 – Default

    • 1 – Admin

    The Info Select screen is displayed as shown in the following example.

    opatop: Img: 10s @ Fri Sep 23 09:44:49 2016, Live
    Group Info Sel: HFIs
    Int NumPorts: 2  Rate Min: 100g  Max: 100g
    Ext NumPorts: 0
      Group Performance (P)
      Group Statistics (S)
      Group Config (C)
    
    
    
    
    
    Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | P S C:
    
  5. Type P.

    The Bandwidth (BW) Util screen is displayed as shown in the following example.

    opatop: Img: 10s @ Fri Sep 23 09:46:09 2016, Live
    Group BW Util: HFIs   Criteria: Util-High  Number: 10
    Int:  TotMBps  AvgMBps  MinMBps  MaxMBps      TotKPps  AvgKPps  MinKPps  MaxKPps
                0        0        0        0            0        0        0        0
         Buckt 0+%   10+%   20+%   30+%   40+%   50+%   60+%   70+%   80+%   90+%
                 2      0      0      0      0      0      0      0      0      0
         NoResp Int Ports: PMA:      0  Topo:      0
    
                          Max       0+%      25+%      50+%      75+%     100+%
    Int Congestion          0         2         0         0         0         0
    
    
    
    
    
    Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | cC N0-n Detail:
    
  6. To set the BW stats Criteria for the focus query, type c (lowercase) to scroll forward or C (uppercase) to scroll in reverse to select one of the following choices:

    • Util-High – Bandwidth Utilization (highest first)

    • UtlPkt-Hi – Packet Utilization (highest first)

    • Util-Low – Bandwidth Utilization (lowest first)

    • VF-Ut-Hi – VF Bandwidth Utilization (highest first)

    • VF-Pkt-Hi – VF Packet Utilization (highest first)

    • VF-Ut-Low – VF Bandwidth Utilization (lowest first)

  7. To change the Number of entries in the BW stats list, type N and enter the target number of entries; then press Enter.

  8. Type D to initiate the group focus query and access the detailed Group Focus screen (refer to Viewing Focus Information).

  9. Type u (lowercase) for each screen you've accessed until you are back to the screen you want.

Bandwidth Statistics Screen Field Descriptions

The table below describes the fields of the Bandwidth Utilization screen.

Table 24. Bandwidth Statistics Field Descriptions

Field

Description

Group Name

Name of the group examined.

For Port Groups:

  • All - In the All group, all ports are Internal because, by definition, the neighbor port must be in the All group.

  • SuperNICs - In the SuperNICs groups, all neighbor ports are outside the group, so statistics are contained in the Send and Receive subgroups.

  • SWs - In the SWs group, neighbor ports are either outside the group (SuperNIC) or inside the group (another switch), so statistics are contained in all three subgroups. A special case for a switch port is the special switch port 0, which is always considered internal to the SWs group.

For Virtual Fabrics Group:

  • Admin

  • Default

Criteria

Focus criterion for Group Focus screen:

  • Util-High – Bandwidth Utilization (highest first)

  • UtlPkt-Hi – Packet Utilization (highest first)

  • Util-Low – Bandwidth Utilization (lowest first)

Focus criterion for VF Group Focus screen:

  • VF-Ut-Hi – VF Bandwidth Utilization (highest first)

  • VF-Pkt-Hi – VF Packet Utilization (highest first)

  • VF-Ut-Low – VF Bandwidth Utilization (lowest first)

Number

Number of ports for a group focus query.

Performance Data Subgroup

Performance statistics for each port group are further divided into up to three subgroups based on whether a port's neighbor port is in its group:

  • Internal - If a port's neighbor port is in its group, all performance statistics are contained in the Internal subgroup.

  • Send - If a port's neighbor is not in its group, statistics for data leaving the port (group) are contained in the Send subgroup.

  • Receive - If a port's neighbor is not in its group, statistics for data entering the port are contained in the Receive subgroup.

Statistics

For each group, the following statistics are reported:

  • Average MBps

  • Minimum MBps

  • Maximum MBps

  • Average KPps

  • Minimum KPps

  • Maximum KPps

  • Status indicator

Performance Buckets

Count the number of ports whose 'MBps compared to link rate' value corresponds to that bucket. This provides an indication of how the data rate of the group compares to its potential.

Ten buckets from 0+% to 90+%, in 10% increments

NoResp Ports

No Response Ports per subgroup:

  • PMA - PMA failures are port counter query failures during the PM Sweep.

  • Topo - Topology errors are failures caused by encountering missing neighbor information in the topology.

Congestion buckets

Provides context (from the Statistics Screen)

  • Max

  • 0+%

  • 25+%

  • 50+%

  • 75+%

  • 100+%
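The Performance Buckets row above describes ten buckets, from 0+% to 90+% in 10% increments, that count ports by their MBps relative to link rate. The sketch below shows one way that counting could work. It is an illustration under stated assumptions, not the PM's actual implementation: the function names are hypothetical, the link rate is supplied by the caller in MBps, and the figure of 12500 MBps for a 100g link (100 Gbps divided by 8) is an assumption rather than a value taken from this manual.

```python
def utilization_bucket(mbps, link_rate_mbps):
    """Return the bucket index 0..9 for a port (0 means 0+%, 9 means 90+%)."""
    if link_rate_mbps <= 0:
        raise ValueError("link rate must be positive")
    if mbps < 0:
        raise ValueError("data rate cannot be negative")
    percent = 100.0 * mbps / link_rate_mbps
    # Anything at or above 90% of link rate falls into the last bucket.
    return min(int(percent // 10), 9)


def bucket_histogram(port_mbps, link_rate_mbps):
    """Count how many ports fall into each of the ten utilization buckets."""
    counts = [0] * 10
    for mbps in port_mbps:
        counts[utilization_bucket(mbps, link_rate_mbps)] += 1
    return counts


ASSUMED_100G_MBPS = 12500.0  # assumption: 100 Gbps / 8, not taken from the manual

# Two idle ports, as in the example Group BW Util screen.
idle = bucket_histogram([0, 0], ASSUMED_100G_MBPS)
print(idle)
```

With two idle ports, both land in the 0+% bucket, which matches the bucket row of the example screen (2 followed by nine zeros).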



Viewing Statistics Category

The Statistics Category screen displays statistics for a port group.

To view the statistics category, perform the following steps:

  1. Log in to the server as root.

  2. At the command prompt, enter opatop.

    The Summary screen is displayed.

  3. Determine which set of statistics you want to view:

    • To view Group information, continue to the next step.

    • To view VF information, type V.

  4. Type the number for the specific group statistics that you want to view:

    For Port Group:

    • 0 – All

    • 1 – SuperNICs

    • 2 – SWs

    For VF Group:

    • 0 – Default

    • 1 – Admin

    The Info Select screen is displayed as shown in the following example.

    opatop: Img: 10s @ Fri Sep 23 09:44:49 2016, Live
    Group Info Sel: HFIs
    Int NumPorts: 2  Rate Min: 100g  Max: 100g
    Ext NumPorts: 0
      Group Performance (P)
      Group Statistics (S)
      Group Config (C)
    
    
    
    
    
    Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | P S C:
    
  5. Type S.

    The Category (Ctg) Stats screen is displayed as shown in the following example.

    opatop: Img: 10s @ Fri Sep 23 11:55:09 2016, Live
    Group Ctg Stats: HFIs  Criteria: Integ  Number: 10
    Int                   Max       0+%      25+%      50+%      75+%     100+%
        Integrity           0         2         0         0         0         0
        Congestion          0         2         0         0         0         0
        SmaCongest          0         2         0         0         0         0
        Bubble              0         2         0         0         0         0
        Security            0         2         0         0         0         0
        Routing             0         2         0         0         0         0
        Utilization:     0.0%  Discards:   0.0%
    
    
    
    
    
    Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | cC N0-n Detail:
    
  6. To set the category stats Criteria for the focus query, type c (lowercase) to scroll forward or C (uppercase) to scroll in reverse to select one of the following choices:

    • Integrity category (highest first)

    • Congestion category (highest first)

    • SmaCongestion category (highest first)

    • Bubble category (highest first)

    • Security category (highest first)

    • Routing category (highest first)

    • VF Congestion category (highest first)

    • VF Bubble category (highest first)

  7. To change the Number of entries in the Err Stats list, type N and enter the target number of entries; then press Enter.

  8. Type D to initiate the group focus query and access the detailed Group Focus screen (refer to Viewing Focus Information).

  9. Type u (lowercase) for each screen you've accessed until you are back to the screen you want.

Statistics Screen Field Descriptions

The following table describes the fields of the Statistics Category screen.

Table 25. Statistics Field Descriptions

Field

Description

Group Name

Name of the group examined.

For Port Groups:

  • All - In the All group, all ports are Internal because, by definition, the neighbor port must be in the All group.

    All ports are Internal.

  • SuperNICs - In the SuperNICs groups, all neighbor ports are outside the group, so statistics are contained in the Send and Receive subgroups.

    All ports are External.

  • SWs - In the SWs group, neighbor ports are either outside the group (SuperNIC) or inside the group (another switch), so statistics are contained in all three subgroups. A special case for a switch port is the special switch port 0, which is always considered internal to the SWs group.

    Ports are Internal and External.

For Virtual Fabrics Group:

  • Admin

  • Default

Criteria (Statistics Categories)

Focus criteria/statistics categories:

  • Integrity

    • Link Quality Indicator

    • Link Width Downgrade

    • Local Link Integrity Errors

    • Port Receive Errors

    • Excessive Buffer Overrun Errors (neighbor port)

    • Link Error Recovery

    • Link Downed

    • Uncorrectable Errors

    • FM Config Errors

  • Congestion/VF Congestion

    • Port Transmit Wait

    • Switch Port Congestion

    • Port Receive FECN (neighbor port)

    • Port Receive BECN (only from FIs)

    • Port Transmit Time Congestion

    • Port Mark FECN

  • SmaCongestion - The counters included in the SMA Congestion category are the VL 15 counters equivalent to the port counters in the Congestion category.

  • Bubble/VF Bubble

    • Port Transmit Wasted Bandwidth

    • Port Transmit Wait Data

    • Port Receive Bubble (neighbor port)

  • Security

    • Port Receive Constraint Errors (neighbor port)

    • Port Transmit Constraint Errors

  • Routing

    • Port Receive Switch Relay Errors

The integrity and congestion error values are calculated by using a weighted sum. The weights for each and the threshold value for each error category can be seen in the PM Configuration screen (Viewing the PM Configuration).

Number

Number of entries for a group focus query.

Performance Data Subgroup

Performance statistics for each port group are further divided into up to three subgroups based on whether a port's neighbor port is in its group:

  • Internal - If a port's neighbor port is in its group, all performance statistics are contained in the Internal subgroup.

  • Send - If a port's neighbor is not in its group, statistics for data leaving the port (group) are contained in the Send subgroup.

  • Receive - If a port's neighbor is not in its group, statistics for data entering the port are contained in the Receive subgroup.

Int or Ext

Location of the port in relation to the group.

  • Int – The port's neighbor port is in its group (internal).

  • Ext – The port's neighbor port is not in its group (external).

Category buckets

For each subgroup within a category, there are five histogram buckets. Each bucket has a width of 25% (0+%, 25+%, etc.) with the last bucket width for beyond the threshold (100+%). A bucket is used to measure the number of ports whose category value, when compared to the threshold, falls within the range of the bucket. This provides an indication of how counter rates compare to their thresholds.

  • Max

  • 0+%

  • 25+%

  • 50+%

  • 75+%

  • 100+%

Utilization

Percent of error utilization; aids congestion analysis.

Discards

Percent of errors discarded; aids congestion analysis.
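The weighted sum and the Category buckets described in the table above can be sketched as follows. This is a minimal illustration, not the PM's implementation: the function names, the weights, and the sample values are hypothetical. Only the five bucket labels and the 25% bucket width come from the table.

```python
from collections import Counter

# Bucket labels as listed under "Category buckets".
BUCKETS = ["0+%", "25+%", "50+%", "75+%", "100+%"]


def category_value(counters, weights):
    """Weighted sum of counter values (the weights passed in are hypothetical)."""
    return sum(weights[name] * counters.get(name, 0) for name in weights)


def bucket_for(value, threshold):
    """Return the bucket label for a category value compared to its threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    percent = 100.0 * max(value, 0) / threshold
    # Everything at or beyond the threshold lands in the last bucket.
    index = min(int(percent // 25), len(BUCKETS) - 1)
    return BUCKETS[index]


def histogram(values, threshold):
    """Count how many ports fall into each bucket."""
    counts = Counter(bucket_for(v, threshold) for v in values)
    return {label: counts[label] for label in BUCKETS}


if __name__ == "__main__":
    # Hypothetical per-port category values and threshold.
    sample = [0, 10, 30, 60, 99, 100, 250]
    print(histogram(sample, threshold=100))
    print("Max:", max(sample))
```

Ports counted in the 75+% and 100+% buckets are the ones worth drilling down into first.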



Viewing Configuration Information

The Configuration screen displays a list of the ports in a group, including the LID, port number, port GUID, and NodeDesc for each.

To view configuration information, perform the following steps:

  1. Log in to the server as root.

  2. At the command prompt, enter opatop.

    The Summary screen is displayed.

  3. Determine which set of statistics you want to view:

    • To view Group information, continue to the next step.

    • To view VF information, type V.

  4. Type the number for the specific group statistics that you want to view:

    For Port Group:

    • 0 – All

    • 1 – SuperNICs

    • 2 – SWs

    For VF Group:

    • 0 – Default

    • 1 – Admin

    The Info Select screen is displayed as shown in the example below.

    opatop: Img: 10s @ Fri Sep 23 09:44:49 2016, Live
    Group Info Sel: HFIs
    Int NumPorts: 2  Rate Min: 100g  Max: 100g
    Ext NumPorts: 0
      Group Performance (P)
      Group Statistics (S)
      Group Config (C)
    
    
    
    
    
    Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | P S C:
    
  5. Type C.

    The Config screen is displayed as shown in the example below.

    opatop: Img: 10s @ Fri Sep 23 12:07:29 2016, Live
    Group Config: HFIs  NumPorts: 2
      Ix  LIDx Port   Node GUID 0x   NodeDesc
        0 0001   1  0011750101575300 phcppriv10 hfi1_0
        1 0002   1  001175010157E443 phcppriv11 hfi1_0
    
    
    
    
    
    Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | sS P0-n:
    
    
  6. Type s (lowercase) to scroll forward or S (uppercase) to scroll backward through multiple screens of a long port list.

  7. Type P and enter the target Ix number; then press Enter to view the Port Stats screen for the specified Ix (refer to Viewing Port Statistics).

  8. Type u (lowercase) for each screen you've accessed until you are back to the screen you want.

Configuration Information Screen Field Descriptions

The table below describes the fields on the Configuration screen.

Table 26. Configuration Information Field Descriptions

Field

Description

Group Name

Name of the group examined.

For Port Groups:

  • All - In the All group, all ports are Internal because, by definition, the neighbor port must be in the All group.

  • SuperNICs - In the SuperNICs group, all neighbor ports are outside the group, so statistics are contained in the Send and Receive subgroups.

  • SWs - In the SWs group, neighbor ports are either outside the group (SuperNIC) or inside the group (another switch), so statistics are contained in all three subgroups. A special case for a switch port is the special switch port 0, which is always considered internal to the SWs group.

For Virtual Fabrics Group:

  • Admin

  • Default

NumPorts

Number of ports returned in the group configuration query.

Ix

An index value that is used to select a port to view in the Port Stats screen.

LIDx

LID information.

Port

Port Index.

Node GUID 0x

Global Unique Identifier (GUID) for the Node.

NodeDesc

Description of the node.



Viewing Focus Information

The Focus information screen displays a list of the ports within a group, including the LID, port number, focus criterion, port GUID, and NodeDesc of each. If the port has a neighbor port, the same information is displayed for the neighbor.

Note

The Focus information screen is the same for VF and non-VF.

To view focus information, perform the following steps:

  1. Log in to the server as root.

  2. At the command prompt, enter opatop.

    The Summary screen is displayed.

  3. Determine which set of statistics you want to view:

    • To view Group information, continue to the next step.

    • To view VF information, type V.

  4. Type the number for the specific group statistics that you want to view:

    For Port Group:

    • 0 – All

    • 1 – SuperNICs

    • 2 – SWs

    For VF Group:

    • 0 – Default

    • 1 – Admin

    The Info Select screen is displayed.

  5. Determine the Information Select menu to access:

    • To view the Focus information screen for BW Summary, type P.

    • To view the Focus information screen for Err Summary, type S.

  6. Determine the Criteria for the focus query:

    • To set the BW stats Criteria for the focus query, type c (lowercase) to scroll forward or C (uppercase) to scroll in reverse to select one of the following choices:

      • Util-High – Bandwidth Utilization (highest first)

      • UtlPkt-Hi – Packet Utilization (highest first)

      • Util-Low – Bandwidth Utilization (lowest first)

      • VF-Ut-Hi – VF Bandwidth Utilization (highest first)

      • VF-Pkt-Hi – VF Packet Utilization (highest first)

      • VF-Ut-Low – VF Bandwidth Utilization (lowest first)

    • To set the category stats Criteria for the focus query, type c (lowercase) to scroll forward or C (uppercase) to scroll in reverse to select one of the following choices:

      • Integrity category (highest first)

      • Congestion category (highest first)

      • SmaCongestion category (highest first)

      • Bubble category (highest first)

      • Security category (highest first)

      • Routing category (highest first)

      • VF Congestion category (highest first)

      • VF Bubble category (highest first)

  7. Type D.

    The Focus information screen is displayed as shown in the example below.

    opatop: Img: 10s @ Fri Sep 23 13:03:09 2016, Live
    Group Focus: HFIs   GrpNumPorts: 2  NumPorts: 1  Number: 10
      Ix  Util-High LIDx Port   Node GUID 0x   NodeDesc
        0       0.0 0001   1  0011750101575300 phcppriv10 hfi1_0
      <->       0.0 0002   1  001175010157E443 phcppriv11 hfi1_0
    
    
    
    
    Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | sS cC N0-n P0-n:
    
  8. To change the criteria after accessing this screen, type c (lowercase) to scroll forward or C (uppercase) to scroll in reverse to select one of the following choices:

    • Util-High – Bandwidth Utilization (highest first)

    • UtlPkt-Hi – Packet Utilization (highest first)

    • Util-Low – Bandwidth Utilization (lowest first)

    • VF-Ut-Hi – VF Bandwidth Utilization (highest first)

    • VF-Pkt-Hi – VF Packet Utilization (highest first)

    • VF-Ut-Low – VF Bandwidth Utilization (lowest first)

    • Integrity category (highest first)

    • Congestion category (highest first)

    • SmaCongestion category (highest first)

    • Bubble category (highest first)

    • Security category (highest first)

    • Routing category (highest first)

    • VF Congestion category (highest first)

    • VF Bubble category (highest first)

  9. To change the Number of entries in the focus list, type N and enter the target number of entries; then press Enter.

  10. Type s (lowercase) to scroll forward or S (uppercase) to scroll backward through multiple screens of a long port list.

  11. Type P and enter the target Ix number; then press Enter to view the detailed Port Stats screen (refer to Viewing Port Statistics).

  12. Type u (lowercase) for each screen you've accessed until you are back to the screen you want.

Focus Information Screen Field Descriptions

The table below describes the fields on the Focus screen.

Table 27. Focus Information Field Descriptions

Field

Description

Group Name

Name of the group examined.

For Port Groups:

  • All - In the All group, all ports are Internal because, by definition, the neighbor port must be in the All group.

  • SuperNICs - In the SuperNICs group, all neighbor ports are outside the group, so statistics are contained in the Send and Receive subgroups.

  • SWs - In the SWs group, neighbor ports are either outside the group (SuperNIC) or inside the group (another switch), so statistics are contained in all three subgroups. A special case for a switch port is the special switch port 0, which is always considered internal to the SWs group.

For Virtual Fabrics Group:

  • Admin

  • Default

GrpNumPorts

Number of ports selected, as determined by the combination of group, criteria, and requested ports.

NumPorts

Number of ports returned in the group configuration query.

Number

Number of ports for a group focus query.

Ix

An index value that is used to select a port to view in the Port Stats screen.

Criteria

Limits the focus to specific port statistics.

For BW stats (bandwidth statistics):

  • Util-High – Bandwidth Utilization (highest first)

  • UtlPkt-Hi – Packet Utilization (highest first)

  • Util-Low – Bandwidth Utilization (lowest first)

  • VF-Ut-Hi – VF Bandwidth Utilization (highest first)

  • VF-Pkt-Hi – VF Packet Utilization (highest first)

  • VF-Ut-Low – VF Bandwidth Utilization (lowest first)

For Ctg stats (category statistics):

  • Integrity category (highest first)

  • Congestion category (highest first)

  • SmaCongestion category (highest first)

  • Bubble category (highest first)

  • Security category (highest first)

  • Routing category (highest first)

  • VF Congestion category (highest first)

  • VF Bubble category (highest first)

LIDx

LID information.

Port

Port Index.

NOTE: A symbol may be present as the first character of each line related to a port. This symbol indicates that a non-ideal condition was observed when calculating the relevant port's data. The possible conditions are: the PM was told to ignore this port ('~'), the PM failed to query this port ('!'), or the PM topology does not know this port’s identity ('?').

Node GUID 0x

Global Unique Identifier (GUID) for the Node.

NodeDesc

Description of the node.



Viewing Port Statistics

The Port Statistics screen displays the performance and statistics counters for a specific LID and port.

To view port statistics, perform the following steps:

  1. Log in to the server as root.

  2. At the command prompt, enter opatop.

    The Summary screen is displayed.

  3. Determine which set of statistics you want to view:

    • To view Group information, continue to the next step.

    • To view VF information, type V.

  4. Type the number for the specific group statistics that you want to view:

    For Port Group:

    • 0 – All

    • 1 – SuperNICs

    • 2 – SWs

    For VF Group:

    • 0 – Default

    • 1 – Admin

    The Info Select screen is displayed.

  5. Determine the Information Select menu to access:

    • To view the Port Stats screen for BW Summary, type P.

    • To view the Port Stats screen for Err Summary, type S.

    • To view the Port Stats screen for Configuration information, type C.

      If you are accessing the Port Stats screen from the Configuration information screen, skip to Step 8.

  6. Determine the Criteria for the focus query as described in Viewing Bandwidth Utilization or Viewing Statistics Category.

  7. Type D to access the Focus information screen.

    To make changes to the Focus information prior to accessing the Port Stats screen, refer to Viewing Focus Information.

  8. Type P and enter the target Ix number; then press Enter to view the detailed Port Stats screen.

    The Port Stats screen is displayed.

    Note

    Neighbor port and link information is available only when the Port Stats screen is accessed through the Focus information screen. It is not available through the Configuration information screen.

    opatop: Img: 10s @ Fri Sep 23 14:07:40 2016, Live
    Port Stats: HFIs  LID: 0x2 PortNum: 1 Rate: 100g MTU: 4096
    NodeDesc: phcppriv11 hfi1_0  NodeGUID: 0x001175010157E443
    Neighbor: phcppriv10 hfi1_0  LID: 0x1 PortNum: 1
     Xmit: Data:          0 MB (        63 Flits) Pkts:          1
     Recv: Data:          0 MB (        10 Flits) Pkts:          1
     Multicast: Xmit Pkts: 0           Recv Pkts: 0
     Integrity:                   | Congestion:
      Link Quality:             5 |  Cong Discards:            0
      Uncorrectable:            0 |  Rcv FECN*:                0
      Link Downed:              0 |  Rcv BECN:                 0
      Lanes Down:               0 |  Mark FECN:                0
      Rcv Errors:               0 |  Xmit Time Cong:           0
      Excs Bfr Ovrn*:           0 |  Xmit Wait:                0
      FM Conf Err:              0 | Routing and Others:
      Lnk Err Recov:            0 |  Rcv Sw Relay:             0
      Loc Lnk Integ:            0 |  Xmit Discards:            0
     Security:                    | Bubble:
      Xmit Constrain:           0 |  Xmit Wasted BW:           0
      Rcv Constrain*:           0 |  Xmit Wait Data:           0
     SmaCongestion (VL15):        |  Rcv Bubble*:              0
      Cong Discards:            0
      Xmit Wait:                0
    Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | Neighbor |
    
  9. Type V to toggle between Port Stats and VF Port Stats screens.

    Note

    VF Port Stats information can only be accessed when you are viewing VF statistics (selected in Step 3).

    opatop: Img: 10s @ Wed May 23 15:01:45 2018, Bkmk Now:Wed May 23 15:36:50 2018
    VF Port Stats: Admin  LID: 0x1 PortNum: 1
    NodeDesc: hdwtpriv35.hd.intel.com  NodeGUID: 0x001175010165AE75
    
     Xmit: Data:          0 MB (      1575 Flits) Pkts:        155
     Recv: Data:          0 MB (     19179 Flits) Pkts:        154
    
    
     Congestion:
      Cong Discards:            0 | Rcv FECN*:                 0
      Mark FECN:                0 | Rcv BECN:                  0
      Xmit Wait:                0 | Xmit Time Cong:            0
     Bubble:
      Xmit Wasted BW:           0 | Rcv Bubble*:               0
      Xmit Wait Data:           0 |
     Routing and Others:
      Xmit Discards:            0
    
    Counters may be shared between Virtual Fabrics
    
    
    
    Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | vV:
    
  10. Type N to switch between statistics for the port and its neighbor port.

  11. Type u (lowercase) for each screen you've accessed until you are back to the screen you want.

Port Statistics Screen Field Descriptions

The table below describes the fields on the Port Statistics screen.

Table 28. Port Statistics Field Descriptions

Field

Description

Group Name

Name of the group examined.

For Port Groups:

  • All - In the All group, all ports are Internal because, by definition, the neighbor port must be in the All group.

  • SuperNICs - In the SuperNICs group, all neighbor ports are outside the group, so statistics are contained in the Send and Receive subgroups.

  • SWs - In the SWs group, neighbor ports are either outside the group (SuperNIC) or inside the group (another switch), so statistics are contained in all three subgroups. A special case for a switch port is the special switch port 0, which is always considered internal to the SWs group.

For Virtual Fabrics Group:

  • Admin

  • Default

LIDx

LID information for the node.

PortNum

Port number of the node.

Rate

Link rate.

MTU

MTU, if available.

NodeDesc

Description of the node.

NodeGUID

Global Unique Identifier (GUID) for the node.

Neighbor

Description of the neighboring node.

Xmit Data

Size of the data transmitted in MB and Flits and the number of packets.

Recv Data

Size of the data received in MB and Flits and the number of packets.

Multicast: Xmit Pkts

Number of multicast packets transmitted.

Multicast: Recv Pkts

Number of multicast packets received.

Statistics Counters

  • Integrity:

    • Link Quality

    • Uncorrectable

    • Link Downed

    • Lanes Down

    • Receive Errors

    • Excessive Buffer Overrun*

    • FM Config Errors

    • Link Error Recovery

    • Local Link Integrity

  • Security:

    • Transmit Constraint

    • Receive Constraint*

  • SmaCongestion - The counters included in the SMA Congestion category are the VL 15 counters equivalent to the port counters in the Congestion category.

    • Cong Discards

    • Xmit Wait

  • Congestion:

    • Cong Discards

    • Receive FECN*

    • Receive BECN

    • Mark FECN

    • Transmit Time Congestion

    • Transmit Wait

  • Routing and Others:

    • Receive Sw Relay

    • Transmit Discards

  • Bubble:

    • Transmit Wasted Bandwidth

    • Transmit Wait Data

    • Receive Bubble*

A trailing asterisk (*) on the counter name indicates the count will be used in computing Statistics Category information for the neighbor port.



Navigating PM Sweeps

The Fabric Performance Monitoring TUI allows you to access statistics from sequential PM sweeps (the PM keeps a history of previous sweep images) and queries the PM at a user-specified interval (10 seconds by default). Sweeps are read from the short-term history database recorded by the PM, which gives access to statistics from up to 24 hours in the past.

When the Fabric Performance Monitoring TUI queries for statistics for the most recent PM sweep, it is in “Live” mode. In Live mode, the data will change, at the opatop interval rate, as opatop queries new PM sweeps. At each screen (summary or detail), the data being displayed is refreshed for the current PM sweep.

A PM sweep can be in “frozen” mode. The data in a frozen sweep will not change, allowing the statistics to be examined in summary and detail screens.

The Fabric Performance Monitoring TUI allows you to navigate the focus to another sweep within the history of sweeps maintained by the PM. For the duration of focus on such a sweep, it will remain frozen. You can examine other screens for the selected image while in "Historic" mode. Navigation can be performed backward or forward, 1 or 5 sweeps at a time, to a specific time, to a bookmarked time, or back to live data.

To navigate the historical PM sweeps, perform the following steps:

  1. Navigate to the screen that you want to analyze historically.

    The Image Identification line (line 1) in the example below shows the time of the frozen image, followed by the current time after Now:.

    opatop: Img: 10s @ Fri Sep 23 17:32:32 2016, HistNow:Fri Sep 23 17:33:08 2016
    Port Stats: HFIs  LID: 0x1 PortNum: 1 Rate: 100g MTU: 4096
    NodeDesc: phcppriv10 hfi1_0  NodeGUID: 0x0011750101575300
    Neighbor: phcppriv11 hfi1_0  LID: 0x2 PortNum: 1
     Xmit: Data:          0 MB (        10 Flits) Pkts:          1
     Recv: Data:          0 MB (        63 Flits) Pkts:          1
     Multicast: Xmit Pkts: 0           Recv Pkts: 0
     Integrity:                   | Congestion:
      Link Quality:             5 |  Cong Discards:            0
      Uncorrectable:            0 |  Rcv FECN*:                0
      Link Downed:              0 |  Rcv BECN:                 0
      Lanes Down:               0 |  Mark FECN:                0
      Rcv Errors:               0 |  Xmit Time Cong:           0
      Excs Bfr Ovrn*:           0 |  Xmit Wait:                0
      FM Conf Err:              0 | Routing and Others:
      Lnk Err Recov:            0 |  Rcv Sw Relay:             0
      Loc Lnk Integ:            0 |  Xmit Discards:            0
     Security:                    | Bubble:
      Xmit Constrain:           0 |  Xmit Wasted BW:           0
      Rcv Constrain*:           0 |  Xmit Wait Data:           0
     SmaCongestion (VL15):        |  Rcv Bubble*:              0
      Cong Discards:            0
      Xmit Wait:                0
    Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help | Neighbor |
    
  2. Choose from the following options:

    • Type r (lowercase) to go back one sweep at a time.

    • Type R (uppercase) to go back five sweeps at a time.

    • Type f (lowercase) to move ahead one sweep at a time.

    • Type F (uppercase) to move ahead five sweeps at a time.

    • Type t<time> and press Enter to go to a sweep at a specific time.

      Allowed input formats for <time> include:

      • # [hour(s)/minute(s)/second(s)] ago (for example, t 1 hour ago)

      • YYYY:MM:DD HH:MM:SS (for example, t 2019:07:11 12:00:12)

    • Type b to move to the most recently bookmarked image.

    • Type L to return to the Live data.
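The two <time> input formats listed above can be interpreted as in the following sketch, which may be useful when scripting around sweep timestamps. It is not opatop's own parser: the function name parse_sweep_time and the strictness of the pattern are assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches inputs such as "1 hour ago" or "30 seconds ago".
_RELATIVE = re.compile(r"^(\d+)\s+(hour|minute|second)s?\s+ago$")


def parse_sweep_time(text, now):
    """Convert a <time> string into an absolute datetime.

    Accepts "# hour(s)/minute(s)/second(s) ago" or "YYYY:MM:DD HH:MM:SS".
    Raises ValueError for anything else.
    """
    text = text.strip()
    match = _RELATIVE.match(text)
    if match:
        amount, unit = int(match.group(1)), match.group(2)
        # timedelta accepts hours=, minutes=, and seconds= keywords.
        return now - timedelta(**{unit + "s": amount})
    return datetime.strptime(text, "%Y:%m:%d %H:%M:%S")


if __name__ == "__main__":
    now = datetime(2019, 7, 11, 13, 0, 12)
    print(parse_sweep_time("1 hour ago", now))
    print(parse_sweep_time("2019:07:11 12:00:12", now))
```

Note that the absolute format uses colons, not hyphens, between the year, month, and day.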

Bookmarking a Sweep

The Fabric Performance Monitoring TUI allows you to bookmark a sweep to review the information. For the duration of the Bookmark, all information is frozen. You can navigate through the various screens to review the frozen information. The sweep will remain frozen until you explicitly "Unbookmark" it.

Adding a Bookmark

Note

opatop allows only one sweep at a time to be bookmarked.

To bookmark a PM sweep, perform the following steps:

  1. Navigate to the screen you want to capture and analyze.

  2. Type B (uppercase) to bookmark the screen.

    In the Image Identification line (line 1), Live changes to Bkmk (bookmark), as shown in the example screen below.

    opatop: Img: 10s @ Fri Sep 23 16:44:42 2016, Bkmk Now:Fri Sep 23 16:44:53 2016
    Summary:  SW:     0 Ports: SW:     0  HFI:     2       Link:     1
              SM:     1 Node NRsp:     0 Skip:     0 Port NRsp:     0 Skip:     0
                        AvgMBps   MinMBps   MaxMBps   AvgKPps   MinKPps   MaxKPps
    0 All         Int         0         0         0         0         0         0
          Integ:min  Congst:min  SmaCong:min  Bubble:min  Secure:min  Routing:min
    1 HFIs        Int         0         0         0         0         0         0
          Integ:min  Congst:min  SmaCong:min  Bubble:min  Secure:min  Routing:min
    2 SWs         No ports in group
    
    
    
    
    
        Master-SM: LID: 0x0001 Port: 1   Priority: 0  State: Master
                   Name: phcppriv10 hfi1_0
                   PortGUID: 0x0011750101575300
     Secondary-SM: none
    
    
    
    Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help |
    sS Pmcfg Imginfo View 0-n:
    

    The bookmark will remain until you explicitly remove it.

  3. Type L to return to the Live data.

  4. Type b (lowercase) to return to the bookmarked image.

Removing a Bookmark

To remove a bookmark from a PM sweep, perform the following steps:

  1. Type b (lowercase) to return to the bookmarked image.

  2. Type U (uppercase).

    In the Image Identification line (line 1), Bkmk changes back to Live, as shown in the example screen below.

    opatop: Img: 10s @ Fri Sep 23 16:49:52 2016, Live
    Summary:  SW:     0 Ports: SW:     0  HFI:     2       Link:     1
              SM:     1 Node NRsp:     0 Skip:     0 Port NRsp:     0 Skip:     0
                        AvgMBps   MinMBps   MaxMBps   AvgKPps   MinKPps   MaxKPps
    0 All         Int         0         0         0         0         0         0
          Integ:min  Congst:min  SmaCong:min  Bubble:min  Secure:min  Routing:min
    1 HFIs        Int         0         0         0         0         0         0
          Integ:min  Congst:min  SmaCong:min  Bubble:min  Secure:min  Routing:min
    2 SWs         No ports in group
    
    
    
    
    
        Master-SM: LID: 0x0001 Port: 1   Priority: 0  State: Master
                   Name: phcppriv10 hfi1_0
                   PortGUID: 0x0011750101575300
     Secondary-SM: none
    
    
    
    Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help |
    sS Pmcfg Imginfo View 0-n:
    
Using the opatop Command Line Options

When you start the Fabric Performance Monitoring TUI with opatop, you can use the command line options shown below:

Syntax
opatop [-v] [-q] [-h hfi] [-p port] [--timeout] [-i seconds]
Options
--help

Produces full help text.

-v/--verbose level

Specifies the verbose output level. Value is additive and includes:

1

Screen

4

STDERR opatop

16

STDERR PaClient

-q/--quiet

Disables progress reports.

-h/--hfi hfi

Specifies the SuperNIC, numbered 1..n. Using 0 specifies that the port given with -p is a system-wide port number. Default is 0.

-p/--port port

Specifies the port, numbered 1..n. Using 0 specifies the first active port. Default is 0.

--timeout

Specifies the timeout (response wait time) in ms. Default is 1000 ms.

-i/--interval seconds

Interval in seconds at which PA queries are performed to refresh to the latest PA image. Default is 10 seconds.

-h and -p options permit a variety of selections:

-h 0

First active port in the system (Default).

-h 0 -p 0

First active port in the system.

-h x

First active port on SuperNIC x.

-h x -p 0

First active port on SuperNIC x.

-h 0 -p y

Port y within the system (no matter which ports are active).

-h x -p y

SuperNIC x, port y.
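The -h and -p combinations listed above can be summarized in a short sketch. The helper names describe_selection and opatop_argv are hypothetical and exist only for this illustration; the -h, -p, and -i options are the ones documented in this section.

```python
def describe_selection(hfi=0, port=0):
    """Describe which port a given -h/-p combination selects."""
    if hfi == 0 and port == 0:
        return "first active port in the system"
    if hfi == 0:
        return f"port {port} within the system"
    if port == 0:
        return f"first active port on SuperNIC {hfi}"
    return f"SuperNIC {hfi}, port {port}"


def opatop_argv(hfi=0, port=0, interval=10):
    """Build an argument list for opatop using the documented options."""
    return ["opatop", "-h", str(hfi), "-p", str(port), "-i", str(interval)]


if __name__ == "__main__":
    for hfi, port in [(0, 0), (2, 0), (0, 3), (2, 1)]:
        print(opatop_argv(hfi, port), "->", describe_selection(hfi, port))
```

The defaults (0 for both options and a 10-second interval) match the documented defaults, so omitting the options has the same effect as passing them explicitly.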

7.2.5.1.3. Top Level Data

Top-level data is the high-level view of the data you can gather from PA attributes. You can "drill down" to get more specific information, such as a list of ports.

Fabric Configuration and PM Image Information

From the PA, you can access fabric configuration and PM Image information. This data shows general information about the PM Image, including a unique 64-bit ID that you can use to access all the data collected for this PM Image.

The ImageInfo query can provide additional information such as:

  • Basic topology information includes the number of SuperNICs, Switch Nodes, and Ports. Also, you can view topology information about the Primary (Main) SM and the Secondary (Standby) SM, if present.

  • PM sweep data includes the start time, sweep duration, and the time over which this Image is valid, as well as the number of ports and nodes that had failures and for which data was not gathered.

If You See No Response Nodes and No Response Ports

If you see No Response Nodes and No Response Ports, review the FM's Log to find out what failures are causing these ports to be unsuccessful.

Likely reasons for No Response Nodes and No Response Ports are timeouts due to port bounces or reboots. This happens because the PM sweep may already be underway and is using the most recently completed SM sweep's topology data, which may not include the bounced port's new port state. If this happens only once, it can usually be ignored. If it persists or recurs, the port or its neighbor will likely also appear to have integrity issues, and you should drill down to get more data on the offending ports.

If You See Unexpected Clears

If you see Unexpected Clears, review the FM's Log to find out what ports and what counters are being unexpectedly cleared.

CLI Tools such as opapmaquery and opareport can clear PMA counters and can trigger this, so check with other users first. Additionally, a reboot of the node may also reset the counters.

PM Port Group’s Performance Utilization and Statistical Data

From the PA, you can access PM Port Group performance utilization and statistical data that provides conglomerated data of all the links within a PM Port Group.

A port's Performance data will fall within one of three subgroups based upon whether both (Internal subgroup), only itself (Send subgroup), or just its neighbor (Receive subgroup) is within the PM Port Group. The performance subgroup data has three subsections: the ten-bucket utilization percentage histogram, the performance statistics, and the no response ports counters.

The Statistical data available is divided into two subgroups: Internal and External. Each subgroup has the following subsections: a five-bucket histogram and a maximum value field for each of the six PA Categories.
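The subgroup rules described above can be expressed as a small sketch. The function names and the set-based group membership are assumptions made for illustration; Switch Port Zero, which is always treated as internal, is not modeled here.

```python
def performance_subgroup(port_in_group, neighbor_in_group):
    """Pick the performance subgroup for a port relative to a PM Port Group."""
    if port_in_group and neighbor_in_group:
        return "Internal"   # both ends of the link are in the group
    if port_in_group:
        return "Send"       # only the port itself is in the group
    if neighbor_in_group:
        return "Receive"    # only the neighbor is in the group
    return None             # the link does not touch the group


def statistics_subgroup(neighbor_in_group):
    """Pick the statistics subgroup for a port that is in the group."""
    return "Internal" if neighbor_in_group else "External"


if __name__ == "__main__":
    group = {"sw1:1", "sw1:2", "sw2:1"}          # hypothetical port names
    links = [("sw1:1", "sw2:1"), ("sw1:2", "hfi1:1")]
    for port, neighbor in links:
        print(port, neighbor,
              performance_subgroup(port in group, neighbor in group),
              statistics_subgroup(neighbor in group))
```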

If You See Ports in the Higher Percentage Buckets

If you see ports in the higher percentage buckets, it means that those ports are experiencing high values of that Category. The values for each bucket represent the number of ports that are "binned" within that percentage range (bucket).

If you see ports in the higher buckets for the Integrity category, then you will need to drill down further to find out what ports are experiencing Integrity issues. Note that reboots and general fabric maintenance (such as moving systems, replacing cables, etc.) can create false positives. You may want to verify if this issue recurs after a planned interruption is over before continuing to drill down to gather more data.

If you see ports in the higher percentage buckets for the Congestion category, check whether a node is being overloaded by the jobs it is running or by a lack of allocated resources. Also, make sure you are using an appropriate routing algorithm for your fabric. In a more serious situation, you may have to investigate the traffic pattern of the application, ISL resources, or over-subscription in the fabric. You can drill down to find out which ports are congested and identify which resources may be over-utilized and need to be redistributed.

If there are ports in the SMA Congestion higher percentage buckets, then you should check the SM and verify the configuration. SMA congestion is congestion specific to SM-only traffic and should happen only under extreme conditions.

If You See PMA or Topology No Response Ports

PMA No Response Ports are the same No Response Ports reported in the fabric configuration data, limited in scope to the specific PM Port Group. PMA No Response Ports are usually one-offs; handle them with the same steps as No Response Ports from the fabric configuration (refer to section Fabric Configuration and PM Image Information).

Topology Incomplete ports are extremely rare. These are ports that should have had an active neighbor (all but Switch Port Zero), but do not. This is usually an indication that the SM has an inaccurate topology. Forcing an SM re-sweep may clear this error if no other errors are occurring. Otherwise, you will have to drill down, find the Neighbor Port information, and manually bounce the link.

7.2.5.1.4. Mid-Tier Data

Mid-tier data contains statistical information for the link-level perspective. The data provides a sorted list of links that you can use to drill down further to get each port's exact Port Counter values (if available). Links are sorted based on criteria such as PA Category values and utilization metrics.

Sorted Lists of Links based on Statistical Criteria

After choosing a criterion (Utilization or PA Category), a sorted list of links is formed, which is ordered by the value of the criterion for both ports of the link. Switch Port Zero evaluates as a Link with no neighbor.

Depending on the type of criterion, the number of offending ports, and their location in the fabric, several conditions are possible. While not always needed, more specific port counter values can be gathered at the per-port level when drilling down to the bottom tier.
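The sorted list of links can be pictured with the sketch below. The source does not state how the values of the two ports are combined, so this example assumes the larger of the two is used as the sort key. The Link class, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Link:
    name: str
    port_value: float
    # None models Switch Port Zero, which evaluates as a link with no neighbor.
    neighbor_value: Optional[float] = None


def link_key(link):
    """Sort key: the larger criterion value of the link's ports (an assumption)."""
    values = [link.port_value]
    if link.neighbor_value is not None:
        values.append(link.neighbor_value)
    return max(values)


def focus_list(links, number=10, highest_first=True):
    """Return up to `number` links ordered by the chosen criterion."""
    return sorted(links, key=link_key, reverse=highest_first)[:number]


if __name__ == "__main__":
    links = [
        Link("hfi1<->sw1:3", 12.5, 40.0),
        Link("sw1:0", 1.0),                 # Switch Port Zero
        Link("sw1:7<->sw2:7", 85.0, 80.0),
    ]
    for link in focus_list(links, number=2):
        print(link.name, link_key(link))
```

Passing highest_first=False gives the ordering used by "lowest first" criteria such as Util-Low.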

If You See One Link With A Non-Zero Integrity Value

Having just one link with an integrity issue may have several causes.

One of the more common and benign causes is that the other side of the link bounced during the PM Sweep and appeared to be down (LinkQualityIndicator = 0 [Down]). This can also be seen in the No Response Ports values described in the sections, Fabric Configuration and PM Image Information and PM Port Group’s Performance Utilization and Statistical Data. As this is usually a one-off, it can be ignored.

If this issue is not planned and is recurring, the port may be experiencing a Signal Integrity issue. Degradation of Signal Integrity may have several possible causes. A quick check of the cable's connections or of the quality of the cable itself is a good first step.

Several FastFabric tools, such as opalinkanalysis, can help you identify links that may be misconfigured or operating at slower speeds. If one end of the link is attached to a SuperNIC, you can try to access the node's syslog and see if the SuperNIC is reporting any errors. In addition, you can "drill down" further to view the individual Port Counter values.

If You See Multiple Links With A Non-Zero Integrity Value

When multiple links have a non-zero value, you will need to determine their location in the fabric and group the links by proximity and purpose (compute, storage, etc.). For any links that cannot be grouped, you can follow the same process as described in If You See One Link With A Non-Zero Integrity Value. However, links that are in close proximity or share a similar purpose may have related causes and may require a slight change to the manner in which the shared issue is debugged.

For example, if all of the links in a group are attached to the same switch, there may be an issue with the connections going to and from the switch or, more likely, the environment around the switch (power, heating, etc.) is contributing in some way. In this case, first verify that the switch's physical state is as expected. You can use a tool such as opaswitchadmin or opachassisadmin to gather data such as power supply status and temperature. If all of the links in a group belong to storage nodes, the issue may instead be related to the storage devices or software.
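The grouping step can be sketched as follows, assuming the affected links have already been collected into simple records. The record fields (`name`, `switch`, `integrity`) and the sample values are hypothetical and do not reflect the output format of any tool.

```python
from collections import defaultdict


def group_by_switch(links):
    """Split links with a non-zero Integrity value into shared and isolated sets.

    links: list of dicts with hypothetical keys 'name', 'switch', 'integrity'.
    Returns (shared, isolated): shared maps a switch to its affected links when
    more than one link on that switch is affected; isolated lists the rest.
    """
    groups = defaultdict(list)
    for link in links:
        if link["integrity"] > 0:
            groups[link["switch"]].append(link["name"])

    shared = {}
    isolated = []
    for switch, names in groups.items():
        if switch is not None and len(names) > 1:
            shared[switch] = names      # likely a common cause at this switch
        else:
            isolated.extend(names)      # debug these as single links
    return shared, isolated


sample = [
    {"name": "sw1:p1", "switch": "sw1", "integrity": 4},
    {"name": "sw1:p2", "switch": "sw1", "integrity": 2},
    {"name": "sw2:p5", "switch": "sw2", "integrity": 1},
    {"name": "sw2:p6", "switch": "sw2", "integrity": 0},
]
shared, isolated = group_by_switch(sample)
print(shared, isolated)
```

The same approach works for grouping by purpose (compute, storage, etc.) by keying on a different field.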

If You See One Link with a High Congestion Value

If you see a single link with a high congestion value and the link is an ISL, you may need to investigate the application's design, configuration, and placement in the fabric. Next, check the over-subscription ratio and verify the topology of the fabric. Tools such as opareport, opaextractmissinglinks, and opaxlattopology can be used to verify the current topology against a predefined topology configuration file. If the link is attached to a SuperNIC, check whether that node is overloaded; you may need to change which jobs run on it. The Congestion value for a port is adjusted based on the Utilization of the link to give a more accurate display of Congestion on the port.

However, if the link does not have a high Utilization but still has high Congestion, the link may be experiencing a more serious issue. You can "drill down" further to view the individual Port Counter values to better identify the issue.
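The decision path for a single congested link can be summarized in a short sketch. The function name, the threshold defaults, and the returned messages are all hypothetical; the text does not define numeric limits for "high" Utilization or Congestion, so choose values that suit your fabric.

```python
def triage_link(utilization_pct, congestion, is_isl,
                high_util=75.0, high_cong=20.0):
    """Suggest a next step for one link (thresholds are illustrative only)."""
    if congestion < high_cong:
        return "ok"
    if utilization_pct < high_util:
        # High Congestion without high Utilization points to a more serious issue.
        return "drill down to port counters"
    if is_isl:
        return "review application placement, over-subscription, and topology"
    return "review jobs running on the attached node"


print(triage_link(utilization_pct=90.0, congestion=35.0, is_isl=True))
print(triage_link(utilization_pct=10.0, congestion=35.0, is_isl=False))
```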

7.2.5.1.5. Lowest Tier Data

PA Port Counters are the lowest tier of data available through the PA User interface.

Individual Port Counter Data

The lowest level in the PA is the Port Counters Data. Depending on the options of the request, the response values are either the delta between the requested Image and the previous Image or the raw counter values obtained during a PM sweep. From these values, you can see the individual Port Counters that were used to compute the PA Categories at the higher levels.
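The difference between raw and delta values can be illustrated with a minimal sketch. It assumes each Image is available as a mapping from counter name to raw value; the counter names and numbers below are placeholders, and counter wraparound or clearing is not handled.

```python
def counter_delta(current, previous):
    """Return per-counter change between two Images of raw counter values.

    A counter missing from the previous Image is treated as starting at zero.
    """
    return {name: value - previous.get(name, 0) for name, value in current.items()}


# Placeholder raw values for one port in two consecutive Images.
previous_image = {"XmitData": 1000, "RcvData": 400, "LinkDowned": 3}
current_image = {"XmitData": 1500, "RcvData": 900, "LinkDowned": 3}

delta = counter_delta(current_image, previous_image)
print(delta)
```

A delta of zero for an error counter means no new events occurred during the interval, even if the raw value is non-zero.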

Refer to the CN5000 Maintenance and Troubleshooting Guide, which provides information on all of the counters, their categories, and how they are computed into each category. This document also includes the rationale behind the computation and inclusion of the counters in their respective categories.