
5.2.1.5. Defining a Multi-Tenant Fabric

This section provides instructions for setting up an example multi-tenant fabric on a CN5000 Omni-Path switch.

In this example, the multi-tenant fabric belongs to a cloud service provider (CSP). The CSP needs to provide groups of servers to different customers that share a common fabric. Each customer needs full root access to its servers, and no customer should be able to read, write, or even see any components belonging to another customer.

For more details, refer to Virtual Fabrics for Multi-Tenancy.

5.2.1.5.1. Create Tenant vFabrics

Note

This procedure assumes that you have already created your QOSGroups.

  1. Create the Admin nodes.

    You can have as many Admin nodes as you like; if there is more than one, the main Admin node acts as the primary Fabric Manager.

    Note

    Commands are run from the main Admin node unless otherwise specified.
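
    If you want to confirm that the opafm service is active on an Admin node, a quick check (a sketch; the service name is taken from the systemctl commands used later in this procedure) is:

    [root@FM_host ~]# systemctl is-active opafm
    active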

  2. Check existing vFabric information from the main Admin node.

    [root@FM_host ~]# opareport -qQ -o vfinfo
       vFabrics:
       vFabric Index: 0   Name: Default
       PKey: 0x8001   SL: 0 Select: 0x0   PktLifeTimeMult: 1
       MaxMtu: unlimited  MaxRate: unlimited   Options: 0x00 
       QOS: Disabled  PreemptionRank: 0  HoQLife:   8 ms
       vFabric Index: 1   Name: Admin
       PKey: 0x7fff   SL: 0 Select: 0x1: PKEY  PktLifeTimeMult: 1
       MaxMtu: unlimited  MaxRate: unlimited   Options: 0x01: Security
       QOS: Disabled  PreemptionRank: 0  HoQLife:   8 ms
       2 VFs

    Note

    The Default vFabric has a PKey of 0x8001.

  3. List the contents of the /etc/opa-fm sub-directories dgs (device groups) and vfs (virtual fabrics).

    [root@FM_host opa-fm]# ls dgs/
    [root@FM_host opa-fm]# ls vfs/
    

    Since no vFabrics have been created, both directories are currently empty.

  4. Select nodes and PKeys for each tenant vFabric to be created. 

    In this example, two tenant vFabrics, Tenant1 and Tenant2, are created, each with two nodes. The PKeys for these vFabrics are 0x11 and 0x12.

    Note

    • You can create as few or as many vFabrics with as many member nodes as required.

    • All nodes in this section are given generic names, such as Tenant1_Host1, Storage_Host1, and FM_host. FM_host refers to the main Admin host.

    For more information on PKeys, refer to PKeys in vFabrics.

  5. Back up the /etc/opa-fm/opafm.xml file by copying it to /etc/opa-fm/opafm.xml.orig.
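
    For example:

    [root@FM_host ~]# cp /etc/opa-fm/opafm.xml /etc/opa-fm/opafm.xml.orig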

  6. Stop any Standby Fabric Managers by logging into them and running systemctl stop opafm.

    Note

    If you want to use standby Fabric Managers (a best practice), identify them as Admin nodes, copy the opafm.xml file from the primary Fabric Manager to each standby, and restart the service.
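
    For example, to stop a standby Fabric Manager named FM_standby (a placeholder name) from the main Admin node:

    [root@FM_host ~]# ssh FM_standby systemctl stop opafm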

  7. (OPTIONAL) Look at the opafmvf command usage.

    opafmvf [-v] command [arg...]
    -v/--verbose - verbose output
    
    Commands:
        create                    - create empty virtual fabric configuration
        delete                    - delete virtual fabric configuration
        add                       - add ports to virtual fabric configuration
        remove                    - remove ports from virtual fabric configuration
        reset                     - delete all auxiliary opafmvf configuration files
        commit                    - generate new configuration for OPA FM
        reload                    - inform OPA FM to read new configuration
        restart                   - stop and then start OPA FM
        exist                     - query if virtual fabric exists in the fabric
        ismember                  - query if ports are members of virtual fabric
        isnotmember               - query if ports are not members of virtual fabric
    
  8. Create vFabrics Tenant1 and Tenant2 using the opafmvf command.

    For Tenant1:

    [root@FM_host opa-fm]# opafmvf create --pkey=0x11 Tenant1

    For Tenant2:

    [root@FM_host opa-fm]# opafmvf create --pkey=0x12 Tenant2
  9. List the /etc/opa-fm/dgs and /etc/opa-fm/vfs sub-directories to ensure the XML files have been created.

    [root@FM_host]# cd /etc/opa-fm
    [root@FM_host opa-fm]# ls dgs
    opafm_dg_Tenant1.xml opafm_dg_Tenant2.xml
    
    [root@FM_host opa-fm]# ls vfs
    opafm_vf_Tenant1.xml  opafm_vf_Tenant2.xml

5.2.1.5.2. Display and Edit the Contents of vFabric XML Files
  1. View the contents of Tenant1 and Tenant2 XML files.

    [root@FM_host opa-fm]# cat /etc/opa-fm/vfs/opafm_vf_Tenant1.xml
    <VirtualFabric>
        <Name>Tenant1</Name>
        <Enable>1</Enable>
        <PKey>0x0011</PKey>
        <Security>1</Security>
        <Member>Tenant1</Member>
        <Application>Tenant_Apps</Application>
        <QOSGroup>Tenant_QOS</QOSGroup>
    </VirtualFabric>
    
    [root@FM_host opa-fm]# cat /etc/opa-fm/vfs/opafm_vf_Tenant2.xml
    <VirtualFabric>
        <Name>Tenant2</Name>
        <Enable>1</Enable>
        <PKey>0x0012</PKey>
        <Security>1</Security>
        <Member>Tenant2</Member>
        <Application>Tenant_Apps</Application>
        <QOSGroup>Tenant_QOS</QOSGroup>
    </VirtualFabric>

    Note

    The code example displays the PKeys, device group membership, and application information. It also shows that the vFabrics and security are enabled (1).

  2. Edit the /etc/opa-fm/opafm_pp.xml file to add a line for Networking.

    Note

    For vFabrics, do not edit the opafm.xml file directly. Instead, use the pre-processor opafm_pp.xml file.

    Tip

    Search for Tenant_Apps.

    [root@FM_host opa-fm]# vi /etc/opa-fm/opafm_pp.xml
    
    <!-- default application used by VFs created by opafmvf tool -->
        <Application>
          <Name>Tenant_Apps</Name>
          <IncludeApplication>Compute</IncludeApplication>
          <IncludeApplication>Networking</IncludeApplication>  (Add this line)
        </Application>

5.2.1.5.3. Add Nodes to Tenant vFabric Configurations Using Port GUIDs

Note

The LIDs and GUIDs used in this document are examples only.

  1. Find the port GUIDs of the host nodes that you want to be members of the tenant vFabrics.

    Tenant1

    [root@FM_host vfs]# opaextractlids -qQ | grep Tenant1_Host1
    0x001175010176be56;1;FI;Tenant1_Host1 hfi1_0;0x001c
    [root@FM_host vfs]# opaextractlids -qQ | grep Tenant1_Host2
    0x0011750101744e5e;1;FI;Tenant1_Host2 hfi1_0;0x000e
    

    Tenant2

    [root@FM_host opa-fm]# opaextractlids -qQ | grep Tenant2_Host1
    0x0011750101744e4e;1;FI;Tenant2_Host1 hfi1_0;0x000f
    [root@FM_host opa-fm]# opaextractlids -qQ | grep Tenant2_Host2
    0x001175010174428a;1;FI;Tenant2_Host2 hfi1_0;0x0003
  2. Add the port GUIDs of the nodes as members of the tenant vFabrics.

    Add ports to Tenant1 vFabric.

    [root@FM_host opa-fm]# opafmvf add Tenant1 0x001175010176be56 0x0011750101744e5e

    Add ports to the Tenant2 vFabric.

    [root@FM_host opa-fm]# opafmvf add Tenant2 0x0011750101744e4e 0x001175010174428a
  3. Rebuild the opafm.xml file and reload the FM.

    [root@FM_host opa-fm]# opafmvf commit  
    /etc/opa-fm/opafm.xml will be overwritten!
    Do you want to continue? [y/N] y
    Processing files in  /etc/opa-fm/dgs
    Processing files in  /etc/opa-fm/vfs
    Config Check Passed!
    
    Generated new configuration for fabric manager
    
    [root@FM_host dgs]# opafmvf reload
    Reloaded fabric manager
    
  4. From the Admin node, verify the existing vFabric information.

    [root@FM_host dgs]# opareport -qQ -o vfinfo
    vFabrics:
    vFabric Index: 0   Name: Admin
    PKey: 0x7fff   SL: 0 Select: 0x3: PKEY SL  PktLifeTimeMult: 1
    MaxMtu: unlimited  MaxRate: unlimited   Options: 0x03: Security QoS
    QOS: Bandwidth:  50% PreemptionRank: 0  HoQLife:    8 ms
    
    vFabric Index: 1   Name: Tenant1
    PKey: 0x11   SL: 1 Select: 0x3: PKEY SL  PktLifeTimeMult: 1
    MaxMtu: unlimited  MaxRate: unlimited   Options: 0x03: Security QoS
    QOS: Bandwidth:  50% PreemptionRank: 0  HoQLife:    8 ms
    
    vFabric Index: 2   Name: Tenant2
    PKey: 0x12   SL: 1 Select: 0x3: PKEY SL  PktLifeTimeMult: 1
    MaxMtu: unlimited  MaxRate: unlimited   Options: 0x03: Security QoS
    QOS: Bandwidth:  50%  PreemptionRank: 0  HoQLife:   8 ms
    3 VFs

    The results show the two tenant vFabrics as well as the Admin vFabric. The Default vFabric is not shown as it is not created by opafmvf.

  5. (OPTIONAL) List the contents of the Device Group (dgs) XML files for the tenant vFabrics and display the file contents.

    [root@FM_host opa-fm]# cd dgs/
    [root@FM_host dgs]# ls
    opafm_dg_Tenant1.xml  opafm_dg_Tenant2.xml  
    
    [root@FM_host dgs]# cat opafm_dg_Tenant1.xml
    <DeviceGroup>
      <Name>Tenant1</Name>
        <PortGUID>0x0011750101744e5e</PortGUID>
        <PortGUID>0x001175010176be56</PortGUID>
    </DeviceGroup>
    
    [root@FM_host dgs]# cat opafm_dg_Tenant2.xml
    <DeviceGroup>
      <Name>Tenant2</Name>
        <PortGUID>0x001175010174428a</PortGUID>
        <PortGUID>0x0011750101744e4e</PortGUID>
    </DeviceGroup>
    

    The Device Groups are populated with the host node GUIDs.

  6. (OPTIONAL) Find the LIDs for the Admin node and one host node from each of the tenant vFabrics.

    Admin Node - FM

    [root@FM_host vfs]# opareport -qQ -o lids | grep FM_host
    0x0001        0x0011750101743e91   1 FI FM_host hfi1_0
    

    Tenant1 Member

    [root@FM_host vfs]# opareport -qQ -o lids | grep Tenant1_Host1
    0x001c        0x001175010176be56   1 FI Tenant1_Host1 hfi1_0
    

    Tenant2 Member

    [root@FM_host vfs]# opareport -qQ -o lids | grep Tenant2_Host1
    0x000f        0x0011750101744e4e   1 FI Tenant2_Host1 hfi1_0
  7. (OPTIONAL) View the PKey information for the Admin node and one member of each tenant vFabric using the LIDs found in the previous step.

    Note

    The following example shows 32 PKeys for Omni-Path 100. CN5000 has 1024 PKeys.

    Admin Node - FM_host

    [root@FM_host vfs]# opasaquery -o pkey -l 0x01
    LID: 0x00000001 PortNum:  1 BlockNum: 0 
          0-  7:  0x0000  0x7fff  0xffff  0x0000  0x0000  0x0000  0x0000  0x0000
          8- 15:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
         16- 23:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
         24-  31: 0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
    

    Tenant1_Host1

    [root@FM_host vfs]# opasaquery -o pkey -l 0x1c
    LID: 0x0000000b PortNum:  1 BlockNum:  0 
          0-  7:  0x8011  0x7fff  0xffff  0x0000  0x0000  0x0000  0x0000  0x0000
          8- 15:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
         16- 23:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
         24-  31: 0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
    

    Tenant2_Host1

    [root@FM_host vfs]# opasaquery -o pkey -l 0x0f
    LID: 0x0000000a PortNum:  1 BlockNum:  0 
          0-  7:  0x8012  0x7fff  0xffff  0x0000  0x0000  0x0000  0x0000  0x0000
          8- 15:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
         16- 23:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
         24- 31:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
    

    At this point, the tenant vFabric host nodes are Full Members of their own vFabrics (PKey with the most significant bit set, 0x80xx) and of the Admin vFabric (0xffff). Note that:

    • Full members (0xffff) override limited members (0x7fff).

    • Tenant PKeys appear first in the table and are therefore the default PKeys for those vFabrics.
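
    The relationship between a base PKey and its full-membership form can be verified with simple shell arithmetic (an illustrative sketch using the Tenant1 values from this example):

    [root@FM_host vfs]# printf "0x%04x\n" $((0x0011 | 0x8000))
    0x8011
    [root@FM_host vfs]# printf "0x%04x\n" $((0x8011 & 0x7fff))
    0x0011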

5.2.1.5.4. Create the Storage vFabric

To create a Storage vFabric, associate host node port GUIDs with the vFabric configuration and then assign Limited Membership.

  1. Create the storage vFabric configuration with a PKey.

    Tip

    Make the storage vFabric PKey different from the tenant vFabric PKeys. In this example, the storage PKey is 0x0041, whereas the tenant PKeys are 0x0011 and 0x0012.

    [root@FM_host vfs]# opafmvf create --pkey=0x41 Storage
    Created virtual fabric configuration 'Storage'
    
  2. Verify that storage XML files appear in the Device Groups (dgs) and vFabrics (vfs) directories.

    [root@FM_host opa-fm]# ls dgs
    opafm_dg_Storage.xml  opafm_dg_Tenant1.xml  opafm_dg_Tenant2.xml
    [root@FM_host opa-fm]# ls vfs
    opafm_vf_Storage.xml  opafm_vf_Tenant1.xml  opafm_vf_Tenant2.xml 
    
  3. Find the GUIDs for the host nodes that you want to use for the storage vFabric.

    [root@FM_host vfs]# opaextractlids -qQ | grep Storage_Host1
    0x0011750101743ed0;1;FI;Storage_Host1 hfi1_0;0x0004
    [root@FM_host vfs]# opaextractlids -qQ | grep Storage_Host2
    0x00117501017443a8;1;FI;Storage_Host2 hfi1_0;0x0005
    
  4. Add the node port GUIDs to the storage vFabric configuration.

    [root@FM_host opa-fm]# opafmvf add Storage 0x0011750101743ed0 0x00117501017443a8
    Added ports to virtual fabric configuration 'Storage'
  5. Add the tenant nodes as Limited Members of the storage vFabric.

    1. Navigate to the vFabrics (/etc/opa-fm/vfs) directory.

    2. Edit the storage vFabric file opafm_vf_Storage.xml.

    3. Add Tenant1 and Tenant2 as Limited Members.

      [root@FM_host opa-fm]# cd /etc/opa-fm/vfs
      [root@FM_host vfs]# ls
      opafm_vf_Storage.xml  opafm_vf_Tenant1.xml  opafm_vf_Tenant2.xml  
      [root@FM_host vfs]# vi opafm_vf_Storage.xml
      <VirtualFabric>
          <Name>Storage</Name>
          <Enable>1</Enable>
          <PKey>0x0041</PKey>
          <Security>1</Security>
          <Member>Storage</Member>
          <LimitedMember>Tenant1</LimitedMember> 
          <LimitedMember>Tenant2</LimitedMember> 
          <Application>Tenant_Apps</Application>
          <QOSGroup>Tenant_QOS</QOSGroup>
      </VirtualFabric>

      In the code example, the storage vFabric is enabled (1) and has a PKey of 0x0041. Security is enabled, the Storage device group is a Full Member, and the Tenant1 and Tenant2 device groups are Limited Members.

      Note

      Limited Members in a vFabric can only communicate with Full Members, but not with other Limited Members.

    4. Save and exit the file.

  6. Rebuild the opafm.xml file and reload the FM.

    [root@FM_host vfs]# opafmvf commit
    /etc/opa-fm/opafm.xml will be overwritten!
    Do you want to continue? [y/N] y
    Processing files in  /etc/opa-fm/dgs
    Processing files in  /etc/opa-fm/vfs
    Config Check Passed!
    Generated new configuration for fabric manager
    
    [root@FM_host dgs]# opafmvf reload
    Reloaded fabric manager  
    
  7. From the Admin node, verify that the storage vFabric is displayed in the existing vFabric information.

    [root@FM_host vfs]# opareport -qQ -o vfinfo
    vFabrics:
    vFabric Index: 0   Name: Admin
    PKey: 0x7fff   SL: 0 Select: 0x3: PKEY SL  PktLifeTimeMult: 1
    MaxMtu: unlimited  MaxRate: unlimited   Options: 0x03: Security QoS
    QOS: Bandwidth:  50% PreemptionRank: 0  HoQLife:    8 ms
    
    vFabric Index: 1   Name: Tenant1
    PKey: 0x11   SL: 1 Select: 0x3: PKEY SL  PktLifeTimeMult: 1
    MaxMtu: unlimited  MaxRate: unlimited   Options: 0x03: Security QoS
    QOS: Bandwidth:  50% PreemptionRank: 0  HoQLife:    8 ms
    
    vFabric Index: 2   Name: Tenant2
    PKey: 0x12   SL: 1 Select: 0x3: PKEY SL  PktLifeTimeMult: 1
    MaxMtu: unlimited  MaxRate: unlimited   Options: 0x03: Security QoS
    QOS: Bandwidth:  50%  PreemptionRank:0  HoQLife:    8 ms
     
    vFabric Index: 3   Name: Storage
    PKey: 0x41   SL: 1 Select: 0x3: PKEY SL  PktLifeTimeMult: 1
    MaxMtu: unlimited  MaxRate: unlimited   Options: 0x03: Security QoS
    QOS: Bandwidth:  50% PreemptionRank: 0  HoQLife:    8 ms
    
    4 VFs
  8. (OPTIONAL) View the PKey information for the storage vFabric host nodes.

    1. Find the LIDs and view PKey information for the storage vFabric host nodes Storage_Host1 and Storage_Host2.

      [root@FM_host vfs]# opaextractlids -qQ | grep Storage_Host1
      0x0011750101743ed0;1;FI;Storage_Host1 hfi1_0;0x0004
      [root@FM_host vfs]# opaextractlids -qQ | grep Storage_Host2
      0x00117501017443a8;1;FI;Storage_Host2 hfi1_0;0x0005
      
    2. Using the LIDs found in the previous step, view the PKey information for the storage vFabric hosts Storage_Host1 and Storage_Host2.

      [root@FM_host vfs]# opasaquery -o pkey -l 0x04
      LID: 0x00000003 PortNum:  1 BlockNum:  0
            0-   7:  0x8041  0x7fff  0xffff  0x0000  0x0000  0x0000  0x0000  0x0000
            8-  15:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
           16-  23:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
           24-  31:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
      
      [root@FM_host vfs]# opasaquery -o pkey -l 0x05
      LID: 0x00000004 PortNum:  1 BlockNum:  0
            0-   7:  0x8041  0x7fff  0xffff  0x0000  0x0000  0x0000  0x0000  0x0000
            8-  15:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
           16-  23:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
           24-  31:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
      

      At this time, both Storage_Host1 and Storage_Host2 are Full Members of the storage vFabric as well as Full Members of the Admin vFabric.

5.2.1.5.5. Change the IP Address for IB Interfaces on Storage Nodes

Note

This procedure is provided as part of this example. Normally, IB interfaces are set up during fabric installation. Refer to the CN5000 Fabric Installation Guide.

Note

The IP addresses specified in this example are defined as follows and may differ from your own.

Note that the storage and non-storage nodes should be on two different IP subnets.

  • For all non-storage nodes, the ib0 IP addresses are xxx.yyy.10.zzz.

  • For storage nodes, the ib0 IP addresses are xxx.yyy.41.zzz.

This procedure is OS-dependent.

  1. On the storage host nodes (Storage_Host1 and Storage_Host2), edit /etc/sysconfig/network-scripts/ifcfg-ib0 and change the IP addresses from XXX.YYY.10.x to XXX.YYY.41.x.

    # IPoIB Integration test configuration for Storage_Host1
    DEVICE=ib0
    TYPE=Infiniband
    BOOTPROTO=static
    IPADDR=XXX.YYY.41.217
    NETMASK=255.255.255.0
    NETWORK=XXX.YYY.41.0
    BROADCAST=XXX.YYY.41.255
    ONBOOT=yes
    CONNECTED_MODE=no 
    [root@Storage_Host1 network-scripts]#
    
    # IPoIB Integration test configuration for Storage_Host2
    DEVICE=ib0
    TYPE=Infiniband
    BOOTPROTO=static
    IPADDR=XXX.YYY.41.218
    NETMASK=255.255.255.0
    NETWORK=XXX.YYY.41.0
    BROADCAST=XXX.YYY.41.255
    ONBOOT=yes
    CONNECTED_MODE=no 
    [root@Storage_Host2 network-scripts]#
  2. Check the interfaces on both storage vFabric host nodes with the Linux ip a command.

    [root@FM_host vfs]# ssh Storage_Host1 ip a | grep ib0
    4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group
    default qlen 256
        inet XXX.YYY.41.217/24 brd XXX.YYY.41.255 scope global noprefixroute ib0
    
    [root@FM_host vfs]# ssh Storage_Host2 ip a | grep ib0
    4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group
    default qlen 256
        inet XXX.YYY.41.218/24 brd XXX.YYY.41.255 scope global noprefixroute ib0
    

Note

For the changes to take effect, both hosts may require the command systemctl restart NetworkManager.service or a reboot. This may be disruptive to the network.
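
For example, from the Admin node (a sketch using the same ssh pattern as the previous step):

    [root@FM_host vfs]# ssh Storage_Host1 systemctl restart NetworkManager.service
    [root@FM_host vfs]# ssh Storage_Host2 systemctl restart NetworkManager.service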

5.2.1.5.6. Create Additional IB Interfaces on Tenant vFabric Nodes

Note

This procedure is provided as part of this example. Normally, IB interfaces are set up during fabric installation. Refer to the CN5000 Fabric Installation Guide.

  1. Configure an additional interface, ib0.8041, on each host node member of the tenant vFabrics for communication with the storage vFabric by running the following four commands on all of them.

    Note

    The name ib0.8041 is used because 0x8041 is the full-membership form of the storage vFabric PKey (0x41). You can choose any other name for the interface.

    Example on Tenant1_Host1

    [root@Tenant1_Host1 ~]# nmcli connection add type infiniband con-name ib0.8041 ifname ib0
    Connection 'ib0.8041' (d0573b67-3641-4154-931a-5d93f530e204) successfully added.
    
    [root@Tenant1_Host1 ~]# nmcli connection modify ib0.8041 connection.interface-name 
    ib0.8041 parent ib0 infiniband.p-key 0x8041
    
    [root@Tenant1_Host1 ~]# nmcli connection modify ib0.8041 ipv4.addresses XXX.YYY.41.213/24
    
    [root@Tenant1_Host1 ~]# nmcli connection modify ib0.8041 ipv4.method manual
    
  2. Verify that the interfaces were created on each host by running the Linux command ip a.

    Note

    If you have a utility that can run remote commands in parallel, such as pdsh, you can verify them all at once.

    [root@Tenant1_Host1 pt2pt]# ip a | grep 8041
    6: ib0.8041@ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group
    default qlen 256
        inet XXX.YYY.41.213/24 brd XXX.YYY.41.255 scope global noprefixroute ib0.8041
    

    It may take up to 30 seconds for NetworkManager to synchronize and for the IP address to display with the ip a command. If ib0.8041 is not present after a minute, run the following:

    # systemctl restart NetworkManager.service

    If that does not work, a reboot may be necessary.

    Note

    If you make a mistake, delete the interface as follows and start again:

    [root@Tenant1_Host1 ~]# nmcli connection delete ib0.8041
    Connection 'ib0.8041' (6c0e04dc-aff1-4ea1-b42b-1df67553b04e) successfully 
    deleted.
  3. Ensure all ib0.8041 (storage) interfaces on tenant hosts are connected.

    Note

    If you do not have a utility that can issue commands to groups of hosts in parallel, such as pdsh, run the command on one host at a time.

    [root@FM_host vfs]# pdsh -w root@common_hostname2[13-16] nmcli connection show | 
    grep 8041 | sort
    Tenant1_Host1: ib0.8041     ea2dc7d7-313b-4fdb-977d-7efa50029861  infiniband ib0.8041
    Tenant1_Host2: ib0.8041     0aeaefbf-ec7e-42ee-9683-d588837ce054  infiniband ib0.8041
    Tenant2_Host1: ib0.8041     e2c4547d-ad41-4bfe-b587-3b4cf6a17c76  infiniband ib0.8041
    Tenant2_Host2: ib0.8041     110c2228-9930-45fe-8af0-3cc911fa4dc3  infiniband ib0.8041
    
  4. Ensure all interfaces were created, are up, and have IP addresses.

    [root@FM_host vfs]# pdsh -w root@common_hostname2[13-16] ip a | grep
    ib0.8041 | sort
    Tenant1_Host1: 6: ib0.8041@ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state 
    UP group default qlen 256
    Tenant1_Host1:    inet XXX.YYY.41.213/24 brd XXX.YYY.41.255 scope global noprefixroute 
    ib0.8041
    Tenant1_Host2: 5: ib0.8041@ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state 
    UP group default qlen 256
    Tenant1_Host2:    inet XXX.YYY.41.214/24 brd XXX.YYY.41.255 scope global noprefixroute 
    ib0.8041
    Tenant2_Host1: 5: ib0.8041@ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state 
    UP group default qlen 256
    Tenant2_Host1:    inet XXX.YYY.41.215/24 brd XXX.YYY.41.255 scope global noprefixroute 
    ib0.8041
    Tenant2_Host2: 5: ib0.8041@ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state 
    UP group default qlen 256
    Tenant2_Host2:     inet XXX.YYY.41.216/24 brd XXX.YYY.41.255 scope global noprefixroute 
    ib0.8041
    

Each tenant node now has the following:

  • An IPoIB interface, ib0, which uses the node's default PKey and is therefore on its own tenant vFabric.

  • An interface on the storage vFabric, which is ib0.8041.

This is why the storage nodes need to be on their own IP subnet of xxx.yyy.41.zzz. Linux needs to know that traffic for the storage nodes must be routed through ib0.8041, and it knows this because the storage nodes' interfaces have addresses of the form xxx.yyy.41.zzz, whereas the tenant nodes' ib0 interfaces use xxx.yyy.10.zzz.
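
One way to see this on a tenant host is to inspect the kernel routing table; the connected routes are created automatically when the interface addresses are assigned (an illustrative sketch for Tenant1_Host1 using the example addresses):

    [root@Tenant1_Host1 ~]# ip route | grep ib0
    XXX.YYY.10.0/24 dev ib0 proto kernel scope link src XXX.YYY.10.213
    XXX.YYY.41.0/24 dev ib0.8041 proto kernel scope link src XXX.YYY.41.213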

5.2.1.5.7. Secure the Admin vFabric
  1. Find the GUID for the host node you chose to be the main Admin node.

    [root@FM_host vfs]# opareport -qQ -o lids | grep FM_host
    0x0001        0x0011750101743e91   1 FI FM_host hfi1_0
    
  2. Create the AdminNodes device group in the /etc/opa-fm/dgs (device groups) directory.

    1. Copy the Storage device group to AdminNodes.

      Note

      This device group can contain multiple Admin nodes but, for simplicity, we are using only one.

      [root@FM_host dgs]# cd /etc/opa-fm/dgs
      [root@FM_host dgs]# ls
      opafm_dg_Storage.xml  opafm_dg_Tenant1.xml  opafm_dg_Tenant2.xml  
      [root@FM_host dgs]# cp opafm_dg_Storage.xml opafm_dg_AdminNodes.xml
      [root@FM_host dgs]# ls
      opafm_dg_AdminNodes.xml  opafm_dg_Storage.xml  opafm_dg_Tenant1.xml  
      opafm_dg_Tenant2.xml
    2. Edit the file, replace the Storage device group host node GUIDs with the Admin host node GUID(s), and change the name from Storage to AdminNodes.

      Again, you can have as many Admin nodes as needed; this example uses only one.

      Original file copied from Storage:

      [root@FM_host dgs]# vi opafm_dg_AdminNodes.xml
      <DeviceGroup>
          <Name>Storage</Name>
          <PortGUID>0x0011750101743ed0</PortGUID> 
          <PortGUID>0x00117501017443a8</PortGUID>
      </DeviceGroup>
      

      Modified AdminNodes file:

      <DeviceGroup>
          <Name>AdminNodes</Name>
          <PortGUID>0x0011750101743e91</PortGUID> 
      </DeviceGroup>  
      
  3. Rebuild the opafm.xml file and reload the FM.

    [root@FM_host dgs]# opafmvf commit
    /etc/opa-fm/opafm.xml will be overwritten!
    Do you want to continue? [y/N] y
    Processing files in  /etc/opa-fm/dgs
    Processing files in  /etc/opa-fm/vfs
    Config Check Passed!
    Generated new configuration for fabric manager
    [root@FM_host dgs]# opafmvf reload

    The AdminNodes device group has been created and loaded.

  4. Edit the Admin partition section in the opafm_pp.xml file.

    Note

    Do not edit the opafm.xml file directly. Edit the preprocessor file instead.

    1. In a text editor, search for the Admin partition (/Admin partition).

    2. Comment out <Member>AllMgmtAllowed</Member> and uncomment <Member>AdminNodes</Member>.

    3. Remove any leftover comment text on the AdminNodes line (for example, "add more FF/admin nodes if desired").

    4. Save and close the file.

    The following code text shows an example of the process.

    [root@FM_host dgs]# cd /etc/opa-fm
    [root@FM_host opa-fm]# ls
    dgs  opafm.orig  opafm_pp.xml  opafm.xml  opafm.xml.check  opafm.xml.orig  vfs
    [root@FM_host opa-fm]# vi opafm_pp.xml

    Original File:

    <!-- The Admin partition -->                                                            
        <!-- This gives non-SMs limited privileges -->
        <VirtualFabric>
          <Name>Admin</Name>
          <Enable>1</Enable>
          <PKey>0x7fff</PKey> <!-- must be OPA Management PKey -->
          <Security>1</Security>
          <QOSGroup>LowPriority</QOSGroup>
          <!-- <QOSGroup>HighPriority</QOSGroup> can make admin High priority -->
          <Member>HFIDirectConnect</Member> <!-- Both HFIs directly connected -->
          <Member>AllMgmtAllowed</Member> 
          <Member>AllSWE0s</Member> <!-- so chassis CMU can access leafs & spines-->
          <!-- <Member>AdminNodes</Member> add more FF/admin nodes if desired -->    
          <LimitedMember>All</LimitedMember>  
          <Application>SA</Application>
          <Application>PA</Application>
          <Application>PM</Application>
          <!-- add other applications if desired -->
          <!-- <MaxMTU>2048</MaxMTU> can reduce MTU, SA only uses MADs -->
        </VirtualFabric>

    Modified File:

    <!-- The Admin partition -->                                                             
        <!-- This gives non-SMs limited privileges -->
        <VirtualFabric>
          <Name>Admin</Name>
          <Enable>1</Enable>
          <PKey>0x7fff</PKey> <!-- must be OPA Management PKey -->
          <Security>1</Security>
          <QOSGroup>LowPriority</QOSGroup>
          <!-- <QOSGroup>HighPriority</QOSGroup> can make admin High priority -->
          <Member>HFIDirectConnect</Member> <!-- Both HFIs directly connected -->
          <!-- <Member>AllMgmtAllowed</Member> -->  
          <Member>AllSWE0s</Member> <!-- so chassis CMU can access leafs & spines-->
          <Member>AdminNodes</Member>               
          <LimitedMember>All</LimitedMember>                                              
          <Application>SA</Application>
          <Application>PA</Application>
          <Application>PM</Application>
          <!-- add other applications if desired -->
          <!-- <MaxMTU>2048</MaxMTU> can reduce MTU, SA only uses MADs -->
        </VirtualFabric>
  5. Rebuild the opafm.xml file and reload the FM.

    [root@FM_host dgs]# opafmvf commit
    /etc/opa-fm/opafm.xml will be overwritten!
    Do you want to continue? [y/N] y
    Processing files in  /etc/opa-fm/dgs
    Processing files in  /etc/opa-fm/vfs
    Config Check Passed!
    Generated new configuration for fabric manager
    [root@FM_host dgs]# opafmvf reload

    Once the configuration is committed and the FM is reloaded, only members of the AdminNodes device group are Full Members of the Admin vFabric and have full administrative privileges.

5.2.1.5.8. Test vFabrics for Multi-Tenants
  1. Get the LIDs for the Admin node, one member of each tenant vFabric node, and one from the storage vFabric node.

    [root@FM_host dgs]# opareport -qQ -o lids | grep FM_host
    0x0001        0x0011750101743e91   1 FI FM_host hfi1_0
    [root@FM_host dgs]# opareport -qQ -o lids | grep Tenant1_Host2   
    0x000e        0x0011750101744e5e   1 FI Tenant1_Host2 hfi1_0
    [root@FM_host dgs]# opareport -qQ -o lids | grep Tenant2_Host2
    0x0003        0x001175010174428a   1 FI Tenant2_Host2 hfi1_0
    [root@FM_host dgs]# opareport -qQ -o lids | grep Storage_Host2
    0x0005        0x00117501017443a8   1 FI Storage_Host2 hfi1_0
    
  2. View the PKey information for the nodes using the LIDs from the previous step and pay special attention to memberships.

    [root@FM_host dgs]# opasaquery -o pkey -l 0x01           (Admin FM_Host)
    LID: 0x00000001 PortNum:  1 BlockNum:  0
       0-  7:  0x0000  0x7fff  0xffff  0x0000  0x0000  0x0000  0x0000  0x0000 (Admin Full)
       8- 15:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
      16- 23:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
      24- 31:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
    [root@FM_host dgs]# opasaquery -o pkey -l 0x0e           (Tenant1 VFabric Host)
    LID: 0x0000000e PortNum:  1 BlockNum:  0
       0-  7:  0x8011  0x7fff  0x0000  0x0041  0x0000  0x0000  0x0000  0x0000 (Tenant1 Full) 
       8- 15:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000 (Admin Limited) 
      16- 23:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000 (Storage Limited)
      24- 31:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
    [root@FM_host dgs]# opasaquery -o pkey -l 0x03           (Tenant2 VFabric Host)
    LID: 0x00000003 PortNum:  1 BlockNum:  0
       0-  7:  0x8012  0x7fff  0x0000  0x0041  0x0000  0x0000  0x0000  0x0000 (Tenant2 Full)
       8- 15:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000 (Admin Limited)
      16- 23:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000 (Storage Limited)
      24- 31:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
    [root@FM_host dgs]# opasaquery -o pkey -l 0x05           (Storage VFabric Host)
    LID: 0x00000005 PortNum:  1 BlockNum:  0
       0-  7:  0x8041  0x7fff  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000 (Storage Full)
       8- 15:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000 (Admin Limited)
      16- 23:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
      24- 31:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
  3. View one host node from each of the Admin, tenant, and storage vFabrics, as well as one that does not belong to any vFabric.

    If Mgmt is True (enabled) on any node other than an Admin node, the fabric is not secure.

    [root@FM_host opa-fm]# opainfo | grep Mgmt
       LCRC       Act: 14-bit    En: 14-bit,16-bit,48-bit   Mgmt: True   (Admin Node)
    [root@FM_host opa-fm]# ssh Tenant1_Host1 opainfo | grep Mgmt
       LCRC       Act: 14-bit    En: 14-bit,16-bit,48-bit   Mgmt: False  (Tenant1 Node)
    [root@FM_host opa-fm]# ssh Tenant2_Host2 opainfo | grep Mgmt
       LCRC       Act: 14-bit    En: 14-bit,16-bit,48-bit   Mgmt: False  (Tenant2 Node)
    [root@FM_host opa-fm]# ssh Storage_Host1 opainfo | grep Mgmt
       LCRC       Act: 14-bit    En: 14-bit,16-bit,48-bit   Mgmt: False  (Storage Node)
    [root@FM_host opa-fm]# ssh Non-VFabric host opainfo | grep Mgmt
       LCRC       Act: 14-bit    En: 14-bit,16-bit,48-bit   Mgmt: False  (Non-VF Node) 
  4. From a tenant node, run an opa command to access fabric information.

    [root@Tenant1_Host1 ~]# opaextractlids
    Getting All Node Records...
    Processed      0 of    
    19 Nodes...                                             
    SA PortInfo query Failed: FPROTECTION
    opaextractlids: Unable to get lids report
    
    Usage: opaextractlids [--help]|[opareport options]
       --help - produce full help text
       [opareport options] - options will be passed to opareport.

    As expected, the command failed: the tenant node cannot query the fabric, so it cannot see the names of nodes belonging to other tenants. Only the Admin node, with Mgmt set to True, can run every command in the fabric. All other nodes are secured and can run only specific commands and talk to specific hosts.

Run Ping Tests Between vFabrics
  1. Ping the Storage vFabric host nodes from a Tenant2 member over the ib0.8041 interface.

    [root@Tenant2_Host2 ~]# ping XXX.YYY.41.218        (Storage Host2 IP Address)
    PING XXX.YYY.41.218 (XXX.YYY.41.218) 56(84) bytes of data.
    64 bytes from XXX.YYY.41.218: icmp_seq=1 ttl=64 time=1.09 ms
    64 bytes from XXX.YYY.41.218: icmp_seq=2 ttl=64 time=0.134 ms
    64 bytes from XXX.YYY.41.218: icmp_seq=3 ttl=64 time=0.157 ms
    ^C
    --- XXX.YYY.41.218 ping statistics ---
    3 packets transmitted, 3 received, 0% packet loss, time 2064ms
    rtt min/avg/max/mdev = 0.134/0.459/1.086/0.443 ms
    [root@Tenant2_Host2 ~]# ping XXX.YYY.41.217        (Storage Host1 IP Address)
    PING XXX.YYY.41.217 (XXX.YYY.41.217) 56(84) bytes of data.
    64 bytes from XXX.YYY.41.217: icmp_seq=1 ttl=64 time=0.925 ms
    64 bytes from XXX.YYY.41.217: icmp_seq=2 ttl=64 time=0.120 ms
    64 bytes from XXX.YYY.41.217: icmp_seq=3 ttl=64 time=0.146 ms
    64 bytes from XXX.YYY.41.217: icmp_seq=4 ttl=64 time=0.133 ms
    ^C
    --- XXX.YYY.41.217 ping statistics ---
    4 packets transmitted, 4 received, 0% packet loss, time 3090ms
    rtt min/avg/max/mdev = 0.120/0.331/0.925/0.343 ms
    

    The pings were successful because the tenant device groups are Limited Members of the Storage vFabric and, as such, can talk to Full Members.

  2. Ping a member of the Tenant1 vFabric from another member of the Tenant1 vFabric over the regular ib0 interface.

    [root@Tenant1_Host1 ~]# ping XXX.YYY.10.214
    PING XXX.YYY.10.214 (XXX.YYY.10.214) 56(84) bytes of data.
    64 bytes from XXX.YYY.10.214: icmp_seq=1 ttl=64 time=0.344 ms
    64 bytes from XXX.YYY.10.214: icmp_seq=2 ttl=64 time=0.195 ms
    64 bytes from XXX.YYY.10.214: icmp_seq=3 ttl=64 time=0.137 ms
    64 bytes from XXX.YYY.10.214: icmp_seq=4 ttl=64 time=0.201 ms
    ^C
    --- XXX.YYY.10.214 ping statistics ---
    4 packets transmitted, 4 received, 0% packet loss, time 3088ms
    rtt min/avg/max/mdev = 0.137/0.219/0.344/0.076 ms

    The ping was successful because both members are Full Members of the same vFabric and share the same PKey.

  3. Ping a member of the Tenant2 vFabric from a member of the Tenant1 vFabric over the ib0 (.10) interface.

    [root@Tenant1_Host1 ~]# ping XXX.YYY.10.215
    PING XXX.YYY.10.215 (XXX.YYY.10.215) 56(84) bytes of data.
    From XXX.YYY.10.213 icmp_seq=1 Destination Host Unreachable
    From XXX.YYY.10.213 icmp_seq=2 Destination Host Unreachable
    From XXX.YYY.10.213 icmp_seq=3 Destination Host Unreachable
    ^C
    --- XXX.YYY.10.215 ping statistics ---
    6 packets transmitted, 0 received, +3 errors, 100% packet loss, time 5159ms
    pipe 3

    The ping was unsuccessful because Tenant1 and Tenant2 are different vFabrics and have different PKeys.

  4. Ping one Tenant1 host node from the Tenant2 host node over the Storage vFabric in which both are Limited Members.

    [root@Tenant2_Host2 ~]# ping XXX.YYY.41.213
    PING XXX.YYY.41.213 (XXX.YYY.41.213) 56(84) bytes of data.
    From XXX.YYY.41.216 icmp_seq=1 Destination Host Unreachable
    From XXX.YYY.41.216 icmp_seq=2 Destination Host Unreachable
    From XXX.YYY.41.216 icmp_seq=3 Destination Host Unreachable
    ^C
    --- XXX.YYY.41.213 ping statistics ---
    6 packets transmitted, 0 received, +3 errors, 100% packet loss, time 5154ms
    pipe 3

    The ping over the Storage vFabric was unsuccessful because Tenant1 and Tenant2 are different vFabrics with different PKeys. Also, both device groups are Limited Members of the Storage vFabric and cannot talk to one another.

  5. Ping the Tenant2_Host1 node from Tenant2_Host2 over the Storage vFabric.

    [root@Tenant2_Host2 ~]# ping XXX.YYY.41.215
    PING XXX.YYY.41.215 (XXX.YYY.41.215) 56(84) bytes of data.
    From XXX.YYY.41.216 icmp_seq=1 Destination Host Unreachable
    From XXX.YYY.41.216 icmp_seq=2 Destination Host Unreachable
    From XXX.YYY.41.216 icmp_seq=3 Destination Host Unreachable
    From XXX.YYY.41.216 icmp_seq=4 Destination Host Unreachable
    ^C
    --- XXX.YYY.41.215 ping statistics ---
    7 packets transmitted, 0 received, +6 errors, 100% packet loss, time 6184ms
    pipe 3

    The ping over the Storage vFabric was unsuccessful because both members of the Tenant2 vFabric are Limited Members of the Storage vFabric and cannot talk to one another over the Storage vFabric even though they are Full Members of the same Tenant2 vFabric.

Run MPI Tests

Note

To run these tests, you must have an MPI implementation and a set of benchmarks installed.

The following example uses OpenMPI (version 4.1) and OSU Benchmarks.
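
Before running the tests, you can confirm that the MPI environment is available. A minimal check (paths taken from the commands in the next step; version output abbreviated and illustrative):

    [root@FM_host ~]# source /usr/mpi/gcc/openmpi-*-hfi/bin/mpivars.sh
    [root@FM_host ~]# mpirun --version
    mpirun (Open MPI) 4.1.x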

  1. Run a point-to-point latency MPI test between two hosts in the Tenant2 vFabric using the correct Tenant2 PKey.

    [root@FM_host ~]# source /usr/mpi/gcc/openmpi-*-hfi/bin/mpivars.sh
    [root@FM_host ~]# cd /usr/mpi/gcc/openmpi-*-hfi/tests/osu-micro-benchmarks-*/mpi/pt2pt
    [root@FM_host pt2pt]# mpirun -x FI_PROVIDER=opx -map-by node -host Tenant2_Host1, 
    Tenant2_Host2 --allow-run-as-root -np 2  --mca mtl ofi --mca btl self,vader 
    --bind-to core -x FI_OPX_PKEY=0x8012 ./osu_latency
    # OSU MPI Latency Test v3.8
    # Size          Latency (us)
    0                       0.94
    1                       0.93
    2                       0.92
    4                       0.90
    8                       0.89
    16                      0.89
    32                      0.93
    64                      0.94
    128                     1.01
    256                     0.99
    512                     1.05
    1024                    1.1

    This test passed because it was run between two Full Members of the same vFabric sharing the same PKey.

  2. Run a point-to-point bandwidth MPI job between the same two hosts from the previous step using the correct Tenant2 PKey.

    Note

    The following example shows performance numbers for Omni-Path 100. CN5000 performance numbers are higher.

    [root@FM_host pt2pt]# mpirun -x FI_PROVIDER=opx -map-by node -host 
    Tenant2_Host1,Tenant2_Host2 --allow-run-as-root -np 2  --mca mtl ofi 
    --mca btl self,vader --bind-to core -x FI_OPX_PKEY=0x8012 ./osu_bw
    # OSU MPI Bandwidth Test v3.8
    # Size      Bandwidth (MB/s)
    1                       7.52
    2                      14.20
    4                      28.26
    8                      57.04
    16                    116.23
    32                    200.60
    64                    383.48
    128                   754.77
    256                  1370.03
    512                  2499.68
    1024                 3467.85
    2048                 4466.57
    4096                 5700.89
    8192                 6802.12
    16384                6081.35
    32768               10130.56
    65536               11027.14
    131072              11491.86
    262144              11840.59
    524288              12095.70
    1048576             12236.88
    2097152             12314.66
    4194304             12349.98 

    This test passed because it was run between two Full Members of the same vFabric sharing the same PKey.

  3. Run a point-to-point latency MPI test between a host node from Tenant2 and a host from Tenant1 using the PKey from Tenant2.

    [root@FM_host pt2pt]# mpirun -x FI_PROVIDER=opx -map-by node -host 
    Tenant2_Host1,Tenant1_Host2 --allow-run-as-root -np 2 --mca mtl ofi 
    --mca btl self,vader --bind-to core -x FI_OPX_PKEY=0x8012 ./osu_latency
    
    [Tenant1_Host2:04541] [[6040,1],1] selected pml ob1, but peer [[6040,1],0] 
    on Tenant2_Host2 selected pml cm
    --------------------------------------------------------------------------
    MPI_INIT has failed because at least one MPI process is unreachable
    from another.  This *usually* means that an underlying communication
    plugin -- such as a BTL or an MTL -- has either not loaded or not
    allowed itself to be used.  Your MPI job will now abort.
    
    You may wish to try to narrow down the problem;
    
     * Check the output of ompi_info to see which BTL/MTL plugins are
       available.
     * Run your application with MPI_THREAD_SINGLE.
     * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
       if using MTL-based communications) to see exactly which
       communication plugins were considered and/or discarded.
    --------------------------------------------------------------------------
    [Tenant1_host1:04541] *** An error occurred in MPI_Init
    [Tenant1_host1:04541] *** reported by process [395837441,1]
    [Tenant1_host1:04541] *** on a NULL communicator
    [Tenant1_host1:04541] *** Unknown error
    [Tenant1_host1:04541] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
    [Tenant1_host1:04541] ***   will now abort, and potentially your MPI job)
    

    This test failed because it was run between one Full Member of the Tenant2 vFabric (Tenant2_Host1), which has the correct PKey (0x8012), and one host from Tenant1 (Tenant1_Host2), which does not.

  4. Run a point-to-point MPI latency test between two hosts in the Tenant2 vFabric without specifying a PKey.

    [root@FM_host pt2pt]# mpirun -x FI_PROVIDER=opx -map-by node -host Tenant2_Host1,
    Tenant2_Host2 --allow-run-as-root -np 2 --mca mtl ofi --mca btl self,vader 
    --bind-to core ./osu_latency
    # OSU MPI Latency Test v3.8
    # Size          Latency (us)
    --------------------------------------------------------------------------
    At least one pair of MPI processes are unable to reach each other for
    MPI communications.  This means that no Open MPI device has indicated
    that it can be used to communicate between these processes. This is
    an error; Open MPI requires that all MPI processes be able to reach
    each other.  This error can sometimes be the result of forgetting to
    specify the "self" BTL.
      Process 1 ([[5632,1],0]) is on host: Tenant2_host1
      Process 2 ([[5632,1],1]) is on host: Tenant2_host2
      BTLs attempted: self
    Your MPI job is now going to abort; sorry.
    --------------------------------------------------------------------------
    [Tenant2_host1:06048] *** An error occurred in MPI_Barrier
    [Tenant2_host1:06048] *** reported by process [369098753,0]
    [Tenant2_host1:06048] *** on communicator MPI_COMM_WORLD
    [Tenant2_host1:06048] *** MPI_ERR_INTERN: internal error
    [Tenant2_host1:06048] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
    [Tenant2_host1:06048] ***    will now abort, and potentially your MPI job)
    [Tenant2_host2:05291] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
    [FM_host:199519] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
    [FM_host:199519] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error
    messages

    This test failed because Omni-Path looks for the default PKey, FI_OPX_PKEY=0x8001, and it is not available on these nodes. The PKey must be explicitly called out using -x FI_OPX_PKEY=0x8012.
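
    For reference, a corrected invocation for this step would add the PKey explicitly (the rest of the command line is unchanged from above):

    [root@FM_host pt2pt]# mpirun -x FI_PROVIDER=opx -map-by node -host Tenant2_Host1,
    Tenant2_Host2 --allow-run-as-root -np 2 --mca mtl ofi --mca btl self,vader 
    --bind-to core -x FI_OPX_PKEY=0x8012 ./osu_latency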

  5. Run a point-to-point MPI latency test between two hosts in the Storage vFabric.

    [root@FM_host pt2pt]# mpirun -x FI_PROVIDER=opx -map-by node -host Storage_Host1,
    Storage_Host2 --allow-run-as-root -np 2 --mca mtl ofi --mca btl self,vader 
    --bind-to core -x FI_OPX_PKEY=0x8041 ./osu_latency
    # OSU MPI Latency Test v3.8
    # Size          Latency (us)
    0                       1.44
    1                       1.43
    2                       1.41
    4                       1.41
    8                       1.39
    16                      1.39
    32                      1.44
    64                      1.44
    128                     1.47
    256                     1.49
    512                     1.55
    1024                    1.67

    This test passed because it was run between two Full Members of the same Storage vFabric that share the PKey.

  6. Run a point-to-point MPI latency test between one host in the Storage vFabric and one host in the Tenant1 vFabric.

    [root@FM_host pt2pt]# mpirun -x FI_PROVIDER=opx -map-by node -host Storage_host1,
    Tenant1_host1 --allow-run-as-root -np 2 --mca mtl ofi --mca btl self,vader 
    --bind-to core -x FI_OPX_PKEY=0x8041 ./osu_latency
    [Tenant1_host1:04675] [[4241,1],1] selected pml ob1, but peer [[4241,1],0] on 
    Storage_host1 selected pml cm
    --------------------------------------------------------------------------
    MPI_INIT has failed because at least one MPI process is unreachable
    from another.  This *usually* means that an underlying communication
    plugin -- such as a BTL or an MTL -- has either not loaded or not
    allowed itself to be used.  Your MPI job will now abort.
    You may wish to try to narrow down the problem;
     * Check the output of ompi_info to see which BTL/MTL plugins are
       available.
     * Run your application with MPI_THREAD_SINGLE.
     * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
       if using MTL-based communications) to see exactly which
       communication plugins were considered and/or discarded.
    --------------------------------------------------------------------------
    [Tenant1_host1:04675] *** An error occurred in MPI_Init
    [Tenant1_host1:04675] *** reported by process [277938177,1]
    [Tenant1_host1:04675] *** on a NULL communicator
    [Tenant1_host1:04675] *** Unknown error
    [Tenant1_host1:04675] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
    [Tenant1_host1:04675] ***    will now abort, and potentially your MPI job)
    

    This test failed because only Storage_Host1 and Storage_Host2 are Full Members of the Storage vFabric. All tenant device groups are Limited Members of the Storage vFabric; they are not allowed to be part of an MPI run with that PKey.

    Note

    mpirun uses SSH (typically over Ethernet) to launch the processes on each node in the job.

    In a highly secure site, MPI would probably fail earlier: SSH would fail because of SSH key restrictions or, more likely, because of Ethernet VLANs configured to further isolate the tenants.

    Note

    Since you stopped the FM service on the standby FM(s) at the start of this procedure, take the time to reconfigure them in your fabric.

    1. Copy the /etc/opa-fm/opafm.xml file from the primary FM to the standby node(s) that you configured as Admin nodes in Secure the Admin vFabric.

    2. Log in to the standby node(s) and run systemctl start opafm to start the standby FM service.
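
    A minimal sketch of these two steps, assuming a single standby node named FM_standby (a placeholder name):

      [root@FM_host ~]# scp /etc/opa-fm/opafm.xml FM_standby:/etc/opa-fm/opafm.xml
      [root@FM_host ~]# ssh FM_standby systemctl start opafm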