Loop with explicit predication, that eventually uses implicit predication, runs slower than the same loop incorrectly written as using implicit predication


    • Type: Enhancement
    • Resolution: Unresolved
    • Priority: Not Prioritized
    • Code Generation Tools
    • CODEGEN-12812
    • C7000_4.1.0.LTS
    • default
    • When made runnable, the version claimed to be "slower" runs faster. Perhaps another input could expose the issue, but we need that to be provided.

      The attached source test case has a loop where the innermost statements are ...

      #if defined(FAST)
                  // Non-valid segments will be set to 0
                  *__SA0ADV(float16, out1) = __vload_pred(valid, (const float16*) &in1[offset0]);
                  *__SA0ADV(float16, out1) = __vload_pred(valid, (const float16*) &in1[offset1]);
                  *__SA1ADV(float16, out2) = __vload_pred(valid, (const float16*) &in2[offset0]);
                  *__SA1ADV(float16, out2) = __vload_pred(valid, (const float16*) &in2[offset1]);
      #elif defined(SLOW)
                  vpred sa0_pred = __SA0_VPRED(float16);
                  __vstore_pred(sa0_pred, __SA0ADV(float16, out1), __vload_pred(valid, (const float16*) &in1[offset0]));
      
                  sa0_pred = __SA0_VPRED(float16);
                  __vstore_pred(sa0_pred, __SA0ADV(float16, out1), __vload_pred(valid, (const float16*) &in1[offset1]));
      
                  vpred sa1_pred = __SA1_VPRED(float16);
                  __vstore_pred(sa1_pred, __SA1ADV(float16, out2), __vload_pred(valid, (const float16*) &in2[offset0]));
      
                  sa1_pred = __SA1_VPRED(float16);
                  __vstore_pred(sa1_pred, __SA1ADV(float16, out2), __vload_pred(valid, (const float16*) &in2[offset1]));
      #else
      #error FAST or SLOW must be defined
      #endif
      

      The FAST version is written incorrectly: it stores through __SA0ADV and __SA1ADV with no store predicate and relies on the implicit predication of the C7000 SA0 and SA1. The SLOW version is written correctly: it makes the predication explicit with __SA0_VPRED, __SA1_VPRED, and __vstore_pred, and relies on the compiler to map that onto the implicit predication of SA0 and SA1.
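
      To make the distinction concrete, here is a minimal host-side sketch in portable C of the semantics the SLOW form spells out. It is not the TI API: vec16f, pred16, first_n_lanes, model_load_pred, model_store, and model_store_pred are hypothetical stand-ins invented for illustration, and the model assumes that a predicated load zeroes disabled lanes (as the comment in the test case states) and that a predicated store leaves disabled lanes of the destination untouched. It makes no claim about what the hardware does for the FAST form.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16 /* assumption: float16 in the test case is a 16-lane float vector */

/* Hypothetical host-side stand-ins; these are NOT the C7000 types or intrinsics. */
typedef struct { float lane[LANES]; } vec16f;
typedef uint16_t pred16; /* bit i set means lane i is enabled */

/* Build a predicate that enables only the first n lanes (a partial final vector). */
static pred16 first_n_lanes(unsigned n)
{
    assert(n <= LANES);
    return (n == LANES) ? (pred16)0xFFFFu : (pred16)((1u << n) - 1u);
}

/* Model of a predicated load: disabled lanes are not read and come back as 0. */
static vec16f model_load_pred(pred16 p, const float *src)
{
    vec16f v;
    for (int i = 0; i < LANES; ++i)
        v.lane[i] = ((p >> i) & 1u) ? src[i] : 0.0f;
    return v;
}

/* Model of a plain vector store: every lane of the destination is written. */
static void model_store(float *dst, vec16f v)
{
    for (int i = 0; i < LANES; ++i)
        dst[i] = v.lane[i];
}

/* Model of a predicated store: only enabled lanes are written. */
static void model_store_pred(pred16 p, float *dst, vec16f v)
{
    for (int i = 0; i < LANES; ++i)
        if ((p >> i) & 1u)
            dst[i] = v.lane[i];
}
```

      In this model, storing a full vector through model_store overwrites all 16 destination elements, while model_store_pred with first_n_lanes(10) leaves elements 10 through 15 as they were. That difference is what the explicit store predicate in the SLOW form expresses in the source.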

      For the FAST version, the compiler completely unrolls the inner loop. The resulting loop contains too many instructions to software pipeline. For the SLOW version, the compiler does not unroll the inner loop but does software pipeline it, and it uses the implicit predication of the C7000 SA0 and SA1. The net effect is that the FAST version runs faster than the SLOW version.

            Assignee:
            TI User
            Reporter:
            TI User