Type: Enhancement
Resolution: Unresolved
Priority: Not Prioritized

Product:
Code Generation Tools
Internal ID:
CODEGEN-12812
Forum URL:
https://e2e.ti.com/support/processors-group/processors/f/791/t/1402115
Found In Release:
C7000_4.1.0.LTS
Affected Platform/Device:
default
Decline Reason:
When made runnable, the version claimed to be "slower" runs faster. Perhaps another input could expose the issue, but we need that to be provided.

The attached source test case has a loop where the innermost statements are ...

#if defined(FAST)
            // Non-valid segments will be set to 0
            *__SA0ADV(float16, out1) = __vload_pred(valid, (const float16*) &in1[offset0]);
            *__SA0ADV(float16, out1) = __vload_pred(valid, (const float16*) &in1[offset1]);
            *__SA1ADV(float16, out2) = __vload_pred(valid, (const float16*) &in2[offset0]);
            *__SA1ADV(float16, out2) = __vload_pred(valid, (const float16*) &in2[offset1]);
#elif defined(SLOW)
            vpred sa0_pred = __SA0_VPRED(float16);
            __vstore_pred(sa0_pred, __SA0ADV(float16, out1), __vload_pred(valid, (const float16*) &in1[offset0]));

            sa0_pred = __SA0_VPRED(float16);
            __vstore_pred(sa0_pred, __SA0ADV(float16, out1), __vload_pred(valid, (const float16*) &in1[offset1]));

            vpred sa1_pred = __SA1_VPRED(float16);
            __vstore_pred(sa1_pred, __SA1ADV(float16, out2), __vload_pred(valid, (const float16*) &in2[offset0]));

            sa1_pred = __SA1_VPRED(float16);
            __vstore_pred(sa1_pred, __SA1ADV(float16, out2), __vload_pred(valid, (const float16*) &in2[offset1]));
#else
#error FAST or SLOW must be defined
#endif

The FAST version is written incorrectly: it relies on the implicit predication of the C7000 streaming address generators SA0 and SA1. The SLOW version is written correctly: the predication is made explicit, and the code relies on the compiler to use the implicit predication of SA0 and SA1.
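To make the difference between a predicated load and a predicated store concrete, here is a minimal host-side sketch in plain C. It is not TI code and does not use the C7000 intrinsics. The names `model_load_pred`, `model_store_pred`, and `model_tail_pred` are hypothetical, `LANES` assumes that `float16` is a 16-lane float vector, and the "first N lanes enabled" tail predicate is only an assumed stand-in for whatever predicate SA0 or SA1 produces. The zero-fill behavior of the load follows the comment in the FAST snippet; the store behavior (disabled lanes left untouched) is an assumption about what explicit store predication is meant to achieve.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LANES 16 /* assumption: float16 is a vector of 16 floats */

/* Hypothetical model of a predicated load: lanes whose predicate is
   false are not read from src and are set to 0 in dst. */
static void model_load_pred(const bool pred[LANES], const float *src,
                            float dst[LANES])
{
    assert(pred != NULL && src != NULL && dst != NULL);
    for (size_t i = 0; i < LANES; ++i)
        dst[i] = pred[i] ? src[i] : 0.0f;
}

/* Hypothetical model of a predicated store: only lanes whose predicate
   is true are written; the other lanes of dst keep their old values. */
static void model_store_pred(const bool pred[LANES], float *dst,
                             const float src[LANES])
{
    assert(pred != NULL && dst != NULL && src != NULL);
    for (size_t i = 0; i < LANES; ++i)
        if (pred[i])
            dst[i] = src[i];
}

/* Assumed stand-in for an address-generator predicate on a partial
   vector: the first `remaining` lanes are enabled, the rest disabled. */
static void model_tail_pred(size_t remaining, bool pred[LANES])
{
    assert(pred != NULL);
    for (size_t i = 0; i < LANES; ++i)
        pred[i] = (i < remaining);
}
```

In this model, chaining `model_load_pred` into `model_store_pred` mirrors the shape of one SLOW statement: the load predicate decides which input lanes are read (the rest become 0), and the store predicate separately decides which output lanes are overwritten.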

For the FAST version, the compiler completely unrolls the inner loop. The resulting loop does not software pipeline because it contains too many instructions. For the SLOW version, the compiler does not unroll the inner loop, but it does software pipeline it. The implicit predication of the C7000 SA0 and SA1 is used. The net effect is that the FAST version runs faster than the SLOW version.

Assignee: TI User
Reporter: TI User
Votes: 0
Watchers: 2

Created: 03/Sep/24 2:36 PM
Updated: 12/Sep/24 4:08 PM

Connection: Intermediate to External PROD System

EXTSYNC-4787 - Loop with explicit predication, tha...
SYNCHRONIZED

Last Sync Date:

12/Sep/24 16:08 PM
