Solaris InfiniBand SW stack short summary

I’ve been investigating InfiniBand (RDMA) on Solaris 10/11.

My ultimate goal is to build fast and reliable InfiniBand + ZFS storage on top of (Oracle) Solaris 11 or OpenIndiana.

What follows is a memo from my survey of the InfiniBand stack status on Solaris 10/11.

At this time, OpenIndiana and Nexenta are not based on the Solaris 11 kernel/kernel modules (and never will be), so many things fall back to the Solaris 10 case.

OFED

– OFED ported to Solaris 11 is based on OFED 1.5.3
– OFED ported to OpenIndiana seems to be based on OFED 1.3. The OFED 1.5.3 packages grabbed from Oracle Solaris 11 don’t work on OpenIndiana 151a.

Kernel/kernel module components (10/11)

– IPoIB
– SDP
– SRP
– uDAPL?
– umad, uverbs, ucma

All of these components are kernel components, so you don’t need to install the open-fabrics package (the OFED upper-layer libraries ported to Solaris) to use them.
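
For example, here is a quick way to confirm the in-kernel IPoIB pieces are there (a rough sketch; the ibd0 link name depends on your HCA, and newer releases can use ipadm instead of ifconfig):

$ dladm show-link                                            # IPoIB links show up as ibd0, ibd1, ...
$ ifconfig ibd0 plumb 192.168.10.1 netmask 255.255.255.0 up  # bring up IPoIB with a placeholder address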

IPoIB performance on OpenIndiana 151a

Measured with netperf on an AMD Athlon II Neo + IB SDR:

1 GbE : 110 MB/s
IB SDR : 620 MB/s

The theoretical peak of IB SDR is around 900 MB/s, so the IB SDR number should improve with a faster CPU.
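
For reference, the measurement was roughly like the following (a sketch; 192.168.10.1 stands in for the server’s IPoIB address):

$ netserver                                      # on the server
$ netperf -H 192.168.10.1 -t TCP_STREAM -l 30    # on the client, 30-second TCP stream test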

SRP performance on OpenIndiana 151a

Measured with hdparm against a file created on the /tmp filesystem (ramdisk), on an AMD Athlon II Neo + IB SDR:

IB SDR + SRP : 558.94 MB/s

IB SRP seems slower than IPoIB, even though the measurement conditions are not the same.
This will need further investigation.
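
For reference, the read number above comes from something like the following on the (Linux) initiator side; /dev/sdb is a placeholder for whatever device the SRP LU shows up as:

$ hdparm -t /dev/sdb    # buffered sequential read test against the SRP LU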

Solaris as an InfiniBand-ready storage server

On the current OpenIndiana 151a, you can’t use the OFED upper-layer tools, e.g. ibstat or ib_read_bw.
Also, you can’t do RDMA programming through RDMA-CM with the same programming API as on Linux.
But you can use IPoIB and SRP.
SDP might also work, but I haven’t confirmed it yet.

Thus, to use OpenIndiana as an InfiniBand + ZFS storage box, the current solution is to deploy a storage system with IPoIB or SRP, roughly as sketched below.
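
A minimal sketch of such an SRP target setup with COMSTAR (tank/ibvol is a placeholder zvol name; adjust names and sizes for your pool):

$ zfs create -V 100G tank/ibvol
$ svcadm enable stmf
$ svcadm enable -r ibsrp/target
$ sbdadm create-lu /dev/zvol/rdsk/tank/ibvol   # prints the GUID of the new LU
$ stmfadm add-view <GUID printed by sbdadm>    # no host/target group = visible to all initiators
$ stmfadm list-target -v                       # the SRP target should show up as online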

You might not be able to use IPoIB-CM (connected mode) to get better network performance.

33 thoughts on “Solaris InfiniBand SW stack short summary”

  1. Hi,
    your results are good; I can not get more than 400-450 MB/s of throughput using OpenIndiana 151a + a Linux SRP initiator (OFED 1.5.3).
    Local writes to the zfs pool are 2.5-2.8 GB/s. Each host (both target and initiator) has a Mellanox 4x DDR 20 Gbps adapter, currently connected back-to-back (I tried a TopSpin 7000D DDR switch with the same results). I tried both datagram and connected modes; the results are the same. It looks like the IB hardware is not working at full speed. Any idea what the problem could be?

    Thanks in advance!

    1. Regarding the 400-450 MB/s: yes, on DDR you should get at least 1.4 GB/s or more.

      But unfortunately I have no idea why SRP is so slow…

      What is the number if you use a ramdisk? Do you mean the “2.5-2.8 GB/s local write” was measured by creating a ZFS volume on top of a ramdisk?

      Do you have enough memory?

      Also, I’ve heard that the SRP initiator in OFED 1.5.3.x is not stable. If you update to OFED 1.5.4 you might get better SRP performance.
      (Just FYI, I have experienced some SRP initiator stability problems myself: sending over 1 GB of data frequently causes an SRP disconnection. I’m using OFED 1.5.3.2 on the Linux client.)

      1. No, the 2.6-2.8 GB/s is on a 12-drive raidz2 + SSD (ZIL/L2ARC).
        The results using a ramdisk are the same (400-450 MB/s), so that’s why I suspect SRP is somehow not fully working.
        Memory in my storage node is 16 GB ECC DDR3.

        Will try to compile the OpenFabrics OFED 1.5.4.

      2. Hmm… that’s curious…

        How about IPoIB performance?

        Another thing that comes to mind is that the SRP implementation or the IB HCA driver in OpenIndiana (Illumos) is not mature, or might have a bug. IB kernel development has been stopped since 2010 (as you know, the death of OpenSolaris). Since then, there have been no IB kernel/driver updates from the OpenIndiana/Illumos community…

  2. I have no rdma_bw and rdma_lat test suites on OpenIndiana; how can I test IPoIB performance? netperf / iperf?

      1. Just tested IPoIB with iperf: iperf -s on the storage box (1.1.1.2), iperf -c 1.1.1.2 on the client:

        Client connecting to 1.1.1.2, TCP port 5001
        TCP window size: 193 KByte (default)
        ————————————————————
        [ 3] local 1.1.1.1 port 47357 connected with 1.1.1.2 port 5001
        [ ID] Interval Transfer Bandwidth
        [ 3] 0.0-10.0 sec 5.14 GBytes 4.41 Gbits/sec

        Strange, but this could be the problem. I can not pull even 8 Gbit/s of effective IB bandwidth, even allowing for the TCP overhead.

  3. Hmm… it seems your IB HCA’s PCIe link is limited to around 500 MB/s under Solaris.

    Do you get the same performance if you use Linux on the machine?

    Could you please check the PCIe link status with a tool like lspci -vvv in Linux? (I don’t know of a specific tool to check PCIe device status in Solaris.)

    Also, please try to update your mobo’s BIOS and IB HCA’s firmware.

    1. It seems I’ve found one problem. I was in the data center today and noticed something interesting: when I write 20 GB to zfs locally with dd, I see no LED blinking even though the local speed is 2.5 GB/s, which means nothing is actually being written to the disks. Then I saw that my pool is configured to use compression. When I switched it off, the speed dropped to 750 MB/s and all the LEDs on the array blinked like crazy 🙂 So my real local speed is 700-750 MB/s, not 2.5 GB/s. Assuming one of my drives can write 100 MB/s, the effective data drives are 9 (12 drives, raidz2 + 1 spare), so I can not achieve 2.5 GB/s even with a blazing fast ZIL.

      Now, if my local speed is 750 MB/s, I should still be able to achieve not less than 750 MB/s over IB.

      lspci -vvv output:

      0d:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s – IB DDR / 10GigE] (rev a0)
      Subsystem: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s – IB DDR / 10GigE]
      Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
      Latency: 0, Cache Line Size: 64 bytes
      Interrupt: pin A routed to IRQ 16
      Region 0: Memory at fc300000 (64-bit, non-prefetchable) [size=1M]
      Region 2: Memory at d8000000 (64-bit, prefetchable) [size=8M]
      Capabilities: [40] Power Management version 3
      Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
      Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
      Capabilities: [48] Vital Product Data
      Not readable
      Capabilities: [9c] MSI-X: Enable+ Count=256 Masked-
      Vector table: BAR=0 offset=0007c000
      PBA: BAR=0 offset=0007d000
      Capabilities: [60] Express (v2) Endpoint, MSI 00
      DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
      ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
      DevCtl: Report errors: Correctable- Non-Fatal- Fatal+ Unsupported-
      RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
      MaxPayload 256 bytes, MaxReadReq 512 bytes
      DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
      LnkCap: Port #8, Speed 2.5GT/s, Width x8, ASPM L0s, Latency L0 unlimited, L1 unlimited
      ClockPM- Surprise- LLActRep- BwNot-
      LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
      ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
      LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
      DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
      DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
      LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
      Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
      Compliance De-emphasis: -6dB
      LnkSta2: Current De-emphasis Level: -6dB
      Kernel driver in use: mlx4_core
      Kernel modules: mlx4_en, mlx4_core

      Does 2.5GT/s mean only 2.5 Gbps over this PCIe lane?

      1. 2.5 GT/s per lane x 8 lanes = 20 Gbps raw (about 16 Gbps of data after 8b/10b encoding), which matches 4x DDR IB. So the link speed/width looks good.

        I will prepare a binary of rdma_bw for OpenIndiana so you can test raw IB performance in your environment. Please wait a bit.

        Also, please try multiple parallel writes/reads, e.g.:

        $ dd if=/dev/zero of=1G.1 bs=1G count=1 &
        $ dd if=/dev/zero of=1G.2 bs=1G count=1 &
        $ dd if=/dev/zero of=1G.3 bs=1G count=1 &
        $ dd if=/dev/zero of=1G.4 bs=1G count=1 &

        The consolidated number should be better than the number from a single dd, and in most cases the consolidated number should be close to the peak bandwidth of IB.

  4. Thanks for the great blog.

    I was just about to have a go at setting up IB on OpenSolaris and see whether an SRP target or iSCSI via COMSTAR would be possible. (IB support under FreeBSD is rather limited regarding SRP etc., even though ZFS works like a charm.) But I guess I’d better try Nexenta with a howto at hand.

    1. Hello Pntagruel,

      Good to hear.

      Many people have already confirmed that OpenIndiana + SRP and OpenIndiana + COMSTAR + (IPoIB or iSER) configurations work well. So you should succeed as well.

      1. OpenIndiana + SRP is up and running ;).
        Too bad the old beat-up PATA disk is showing its age, but it serves perfectly as a proof of principle (20 MB/sec write and 63 MB/sec read).
        Next stop: a 4-disk (WDAC SATA 320 GB) RAID0 ZFS pool.

  5. To find out the max. throughput for the IB HCA under OpenIndiana in conjunction with SRP target, I created a RAMdisk and used sbdadm/stmfadm to present it.
    Max. read is on par with a prior test on a Linux SRP RAMdisk target (900+ MB/sec); write, however, is only about 460 MB/sec on OpenIndiana, which is roughly half of what the Linux target was able to do (900+ MB/sec as well).
    Any suggestions regarding tuning?
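
    (For reference, the RAMdisk LU was created roughly like this; rd1 and the 4g size are placeholders:)

    $ ramdiskadm -a rd1 4g                   # create a 4 GB ramdisk device
    $ sbdadm create-lu /dev/ramdisk/rd1      # prints the GUID of the new LU
    $ stmfadm add-view <GUID from sbdadm>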

    1. The same was true in my tests. Linux SRP initiator to Linux SRP target gave 800-900 MB/s (SDR IB), whereas an OpenIndiana target (DDR IB) with a Linux initiator (DDR IB) gave 380-400 MB/s.

      1. Hmm… I have no idea… How much memory do you guys have?

        Also, please try OpenIndiana + iSER. Recent Linux OFED has an iSER initiator, so you can try an OI iSER target with a Linux iSER initiator, roughly like the sketch below.
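
        A rough sketch of such a setup (placeholder names throughout; this assumes the COMSTAR iSCSI target on the OI side and open-iscsi with the iSER transport on the Linux side, where you may first need to define an iser iface with iscsiadm -m iface):

        OI$    svcadm enable -r iscsi/target
        OI$    itadm create-target                                            # prints the target IQN
        OI$    stmfadm add-view <GUID of the LU>
        Linux$ iscsiadm -m discovery -t sendtargets -p 192.168.10.1           # IPoIB address of the target
        Linux$ iscsiadm -m node -T <target IQN> -p 192.168.10.1 -I iser --login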

  6. Sorry for the delay.

    The test rig has 8 GiB of RAM (GA-EX38-DQ6, E7400).
    The ZFS server (ZFSGuru) has 16 GiB of RAM (X9SCM-F, E3-1230).

    Due to other pressing matters I have not tried iSER, nor searched for an OI + iSER howto/user-experience blog.

  7. Today I tested with another machine with only 4 GiB of RAM (Asus P5N7A-VM, E6600), just for the fun of it.

    – Win 7 x64 connected to a Linux (Ubuntu) SRP target: read and write at least 900 MB/sec and above.

    – Win 7 x64 connected to an OpenIndiana SRP target: read is at least 900 MB/sec and above; write, however, is limited to approx. 450~460 MB/sec.

    The SRP target was a 2 GiB file created in a RAMdisk, and sbdadm/stmfadm were used to present it. The IB HCA used in the SRP target machine was an MHEA28-2TC with 256 MB of on-board RAM.
    The IB HCA used in the windows machine was a MHEA28-XTC.

    While googling for howtos and info regarding OI and iSER (which gave little info regarding how to set up iSER), I found out that iSER will most likely be of no real use to me. Apparently Windows does not support iSER and is limited to SRP.
    I guess I’ll have to see if Windows 7 supports NFSoRDMA and get OI to do NFSoRDMA.

    1. Thanks for the detailed report. But I still cannot figure out why the performance of the Solaris target is slow…

      An iSER driver is not provided in WinOFED. What’s more, the recent WinOFED 3.0 package drops support for the SRP initiator. You might need to compile the SRP initiator from source code if you use WinOFED 3.0.

      I don’t know whether NFSoRDMA is supported in Win7.

      1. I’d have to check using Linux. Windows doesn’t do multiple streams, nor does it have an easy way to set up a RAMdisk that provides adequate bandwidth.

      2. Getting things to run the way I wanted, and ruling out limitations imposed by HDDs, was somewhat of a challenge. Out of the box OI will only let you use 25% of your RAM as a ramdisk, which is quite limiting. As before, the Solaris manuals and a bit of googling helped to get around this problem and persuade OI, through some config editing, to allow a bigger chunk of RAM.
        SRP target functionality was checked from a Win7 machine; final testing was done from an Ubuntu 11.10 machine (Asus M3A, AMD 64 X2 5000+ BE, 4 GiB RAM). SRP initiator setup on the Linux box took some file editing but was easy to do: dmesg saw the newly added device, and after some fdisk-ing and an mkfs, mount added a 4 GB SRP target to the local storage.
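
        (The file editing was essentially the standard OFED SRP initiator setup, sketched below; the srp device name under /sys depends on the HCA driver, and ibsrpdm prints the actual parameter string for your target:)

        $ modprobe ib_srp
        $ ibsrpdm -c        # prints ready-to-use target description strings
        $ echo "id_ext=...,ioc_guid=...,dgid=...,pkey=ffff,service_id=..." > /sys/class/infiniband_srp/<srp device>/add_target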

        Write testing:
        A single thread of dd:
        dd if=/dev/zero of=/SRP/file1.img bs=450M count=1 &

        1+0 records in
        1+0 records out
        471859200 bytes (472 MB) copied, 1.35134 s, 349 MB/s

        Two threads of dd:

        1+0 records in
        1+0 records out
        471859200 bytes (472 MB) copied, 1.62561 s, 290 MB/s
        1+0 records in
        1+0 records out
        471859200 bytes (472 MB) copied, 1.64829 s, 286 MB/s
        ——————
        sum: 576 MB/s

        Four threads of dd:
        1+0 records in
        1+0 records out
        471859200 bytes (472 MB) copied, 4.68419 s, 101 MB/s
        1+0 records in
        1+0 records out
        471859200 bytes (472 MB) copied, 4.69119 s, 101 MB/s
        1+0 records in
        1+0 records out
        471859200 bytes (472 MB) copied, 4.89475 s, 96.4 MB/s
        1+0 records in
        1+0 records out
        471859200 bytes (472 MB) copied, 4.90767 s, 96.1 MB/s
        ——————
        sum: 394.5 MB/s

        Eight threads of dd: roughly 340 MB/s.

        So Linux write speed can be slightly better than Windows but is mostly on par. Multiple threads of dd really do not give the speed boost I wanted.
        I guess OI really has a write ceiling of some 500 to 550 MB/sec.

        Read testing was fun 😉
        A single Read thread of dd:

        dd if=/SRP/file1.img of=/dev/null bs=1M
        450+0 records in
        450+0 records out
        471859200 bytes (472 MB) copied, 0.5343 s, 883 MB/s

        Eight threads of dd read:
        8 times
        450+0 records in
        450+0 records out
        471859200 bytes (472 MB) copied, 1.91815 s, 246 MB/s
        471859200 bytes (472 MB) copied, 2.03162 s, 232 MB/s
        471859200 bytes (472 MB) copied, 2.03728 s, 232 MB/s
        471859200 bytes (472 MB) copied, 2.09143 s, 226 MB/s
        471859200 bytes (472 MB) copied, 2.11844 s, 223 MB/s
        471859200 bytes (472 MB) copied, 2.13628 s, 221 MB/s
        471859200 bytes (472 MB) copied, 2.15088 s, 219 MB/s
        471859200 bytes (472 MB) copied, 2.14709 s, 220 MB/s
        ——————
        sum: 1819 MB/s

        Bottom line: multiple dd reads from the SRP target file approach the theoretical throughput of 20 Gbit, which is the max for the MHEA28-XTC / MHEA28-2TC IB HCAs used.
        So I should be a happy camper, but I am not; write speed is still weak.
        Some would say 500 MB/sec is quite respectable and hard to achieve with an HDD-based ZFS storage pool. True, you have a point, but I still feel cheated knowing I am not using the IB HCA to its fullest potential.

      3. To rule out any problem with the MHEA28-2TC IB HCA, I installed Ubuntu 11.10 on the former OI test rig.

        It turns out Linux is perfectly able to deliver 900+ MB/sec read and write from an MHEA28-2TC SRP target to a Win7 x64 box (MHEA28-XTC IB HCA).

        Conclusion:
        OpenIndiana has to be the limiting factor with regard to an IB-presented SRP target.

  8. Thanks for the extended info.

    It seems like it was the wrong bet to get into an IB setup (at least as a dedicated link between the file server and the Windows box, which is my intention [video processing purposes]). At least I didn’t blow loads of cash on the setup. Time to switch to FC or 10 Gbit Ethernet.

    I don’t know off the top of my head which WinOFED version I am running. I’ll check, and perhaps compile the SRP initiator by hand if I have to switch to it (I guess I am using an older version, since SRP is working without a hitch). I’ll have to do some reading to confirm whether or not an SRP target will work on a ZFS pool (as in, are there any problems regarding snapshots’ file integrity, etc.).

    But let’s not get too gloomy, SMB 2.2 might sport SMBoRDMA (according to this link: http://www.google.nl/url?sa=t&rct=j&q=%22windows+7%22+%22NFS+over+RDMA%22&source=web&cd=12&ved=0CCsQFjABOAo&url=http%3A%2F%2Fpop.cmg.org%2Fregions%2Fmspcmg%2FAgendaSpringMar2012files%2FSMB2CMG2012.pdf&ei=aRt6T_X5A8SEOof02PEN&usg=AFQjCNHjaFskz6vSziQ0Mu6LoZsXYVj4pQ&cad=rja ) so we only have to wait for it to show up in OI and switch to the ‘lovely’ Win8 (shrugs in dismay at the memory of the open beta of Win8’s GUI, YUCK).

    1. FYI, SMB 2.2/RDMA is only supported in the server edition of Windows 8. If you want to use Win8 as a client PC connected to an RDMA storage server, you still need the server edition of Win8 for the client PC.

      SMB 2.2 seems to be an ‘open’ spec, so someone (e.g. the samba.org team) might contribute an SMB 2.2/RDMA implementation for a Win8 client and a Linux/Solaris target, but only God knows…

  9. Help!
    I installed and can see the driver, and managed to install it, but now the MTU is wrong and I can’t get the card to work with IPoIB. I can set a static IP but it won’t work. Every light is on, it says the link is up, and I can see the GUID on the fabric from the other nodes. But for the life of me I can’t get it to work. Any help or docs on install/setup on OpenIndiana or Solaris 11 would be great… anyone…? Please?
