Why this code is not efficient?

Question

asked Feb 18, 2022 in Education by JackTerrance

I want to improve the following code, which calculates the mean and standard deviation of an 8x8 patch:

    void calculateMeanStDev8x8Aux(cv::Mat* patch, int sx, int sy, int& mean, float& stdev)
    {
        unsigned sum=0;
        unsigned sqsum=0;
        const unsigned char* aux=patch->data + sy*patch->step + sx;
        for (int j=0; j< 8; j++) {
            const unsigned char* p = (const unsigned char*)(j*patch->step + aux ); // Pointer to the start of the matrix
            for (int i=0; i<8; i++) {
                unsigned f = *p++;
                sum += f;
                sqsum += f*f;
            }
        }
        mean = sum >> 6;
        int r = (sum*sum) >> 6;
        stdev = sqrtf(sqsum - r);
        if (stdev < .1) {
            stdev=0;
        }
    }

I also tried to improve the inner loop with NEON intrinsics. This is the original loop:

    for (int i=0; i<8; i++) {
        unsigned f = *p++;
        sum += f;
        sqsum += f*f;
    }

This is the NEON version of that loop:

    int32x4_t vsum= { 0 };
    int32x4_t vsum2= { 0 };
    int32x4_t vsumll = { 0 };
    int32x4_t vsumlh = { 0 };
    int32x4_t vsumll2 = { 0 };
    int32x4_t vsumlh2 = { 0 };
    uint8x8_t f= vld1_u8(p); // VLD1.8 {d0}, [r0]
    // 16-bit ints / 8 elements
    int16x8_t val = (int16x8_t)vmovl_u8(f);
    // 32-bit ints / 4 elements * 2
    int32x4_t vall = vmovl_s16(vget_low_s16(val));
    int32x4_t valh = vmovl_s16(vget_high_s16(val));
    // update 4 partial sum of products vectors
    vsumll2 = vmlaq_s32(vsumll2, vall, vall);
    vsumlh2 = vmlaq_s32(vsumlh2, valh, valh);
    // sum 4 partial sum of product vectors
    vsum = vaddq_s32(vall, valh);
    vsum2 = vaddq_s32(vsumll2, vsumlh2);
    // do scalar horizontal sum across final vector
    sum += vgetq_lane_s32(vsum, 0);
    sum += vgetq_lane_s32(vsum, 1);
    sum += vgetq_lane_s32(vsum, 2);
    sum += vgetq_lane_s32(vsum, 3);
    sqsum += vgetq_lane_s32(vsum2, 0);
    sqsum += vgetq_lane_s32(vsum2, 1);
    sqsum += vgetq_lane_s32(vsum2, 2);
    sqsum += vgetq_lane_s32(vsum2, 3);

But it is about 30 ms slower. Does anyone know why? All the code works correctly.

1 Answer

JackTerrance · Answer 1 · 2022-02-18T04:29:59+0000

Adding to Lundin's answer. Yes, on instruction sets like ARM, where a load can use a register-based index or an immediate index with some reach, you might benefit from encouraging the compiler to use indexing. ARM, for example, can also increment its pointer register in the load instruction, so *p++ is basically one instruction. It is always a toss-up between p[i] or p[i++] and *p or *p++; on some instruction sets it is much more obvious which path to take.

Likewise your index: if you are not otherwise using it, counting down instead of up can save an instruction per loop, maybe more. Counting up, some might do this:

    inc reg
    cmp reg,#7
    bne loop_top

If you were counting down, though, you might save an instruction per loop:

    dec reg
    bne loop_top

or even, on one processor I know of:

    decrement_and_jump_if_not_zero loop_top

Compilers usually know this and you don't have to encourage them. BUT if you use the p[i] form where the memory read order is important, then the compiler can't, or at least should not, arbitrarily change the order of the reads. So for that case you would want to have the code count down.
So I tried all of these things:

    unsigned fun1 ( const unsigned char *p, unsigned *x )
    {
        unsigned sum;
        unsigned sqsum;
        int i;
        unsigned f;

        sum = 0;
        sqsum = 0;
        for(i=0; i<8; i++)
        {
            f = *p++;
            sum += f;
            sqsum += f*f;
        }
        //to keep the compiler from optimizing
        //stuff out
        x[0]=sum;
        return(sqsum);
    }

    unsigned fun2 ( const unsigned char *p, unsigned *x )
    {
        unsigned sum;
        unsigned sqsum;
        int i;
        unsigned f;

        sum = 0;
        sqsum = 0;
        for(i=8;i--;)
        {
            f = *p++;
            sum += f;
            sqsum += f*f;
        }
        //to keep the compiler from optimizing
        //stuff out
        x[0]=sum;
        return(sqsum);
    }

    unsigned fun3 ( const unsigned char *p, unsigned *x )
    {
        unsigned sum;
        unsigned sqsum;
        int i;

        sum = 0;
        sqsum = 0;
        for(i=0; i<8; i++)
        {
            sum += (unsigned)p[i];
            sqsum += ((unsigned)p[i])*((unsigned)p[i]);
        }
        //to keep the compiler from optimizing
        //stuff out
        x[0]=sum;
        return(sqsum);
    }

    unsigned fun4 ( const unsigned char *p, unsigned *x )
    {
        unsigned sum;
        unsigned sqsum;
        int i;

        sum = 0;
        sqsum = 0;
        for(i=8; i;i--)
        {
            sum += (unsigned)p[i-1];
            sqsum += ((unsigned)p[i-1])*((unsigned)p[i-1]);
        }
        //to keep the compiler from optimizing
        //stuff out
        x[0]=sum;
        return(sqsum);
    }

I built these with both gcc and llvm (clang). Of course both unrolled the loop, since the trip count is a constant. gcc produced the same code for each of the experiments, in some cases with a subtle change in register usage. I would argue there is a bug, since in at least one of them the reads were not in the order described by the code. The gcc output for all four was the following, with some read reordering; notice the reads being out of order relative to the source code. If this were against hardware/logic that relied on the reads being in the order described by the code, you would have a big problem.
    00000000 <fun1>:
       0: e92d05f0  push {r4, r5, r6, r7, r8, sl}
       4: e5d06001  ldrb r6, [r0, #1]
       8: e00a0696  mul sl, r6, r6
       c: e4d07001  ldrb r7, [r0], #1
      10: e02aa797  mla sl, r7, r7, sl
      14: e5d05001  ldrb r5, [r0, #1]
      18: e02aa595  mla sl, r5, r5, sl
      1c: e5d04002  ldrb r4, [r0, #2]
      20: e02aa494  mla sl, r4, r4, sl
      24: e5d0c003  ldrb ip, [r0, #3]
      28: e02aac9c  mla sl, ip, ip, sl
      2c: e5d02004  ldrb r2, [r0, #4]
      30: e02aa292  mla sl, r2, r2, sl
      34: e5d03005  ldrb r3, [r0, #5]
      38: e02aa393  mla sl, r3, r3, sl
      3c: e0876006  add r6, r7, r6
      40: e0865005  add r5, r6, r5
      44: e0854004  add r4, r5, r4
      48: e5d00006  ldrb r0, [r0, #6]
      4c: e084c00c  add ip, r4, ip
      50: e08c2002  add r2, ip, r2
      54: e082c003  add ip, r2, r3
      58: e023a090  mla r3, r0, r0, sl
      5c: e080200c  add r2, r0, ip
      60: e5812000  str r2, [r1]
      64: e1a00003  mov r0, r3
      68: e8bd05f0  pop {r4, r5, r6, r7, r8, sl}
      6c: e12fff1e  bx lr

The index for the loads and subtle register mixing was the only difference between functions from gcc; all of the operations were the same, in the same order.

llvm/clang:

    00000000 :
       0: e92d41f0  push {r4, r5, r6, r7, r8, lr}
       4: e5d0e000  ldrb lr, [r0]
       8: e5d0c001  ldrb ip, [r0, #1]
       c: e5d03002  ldrb r3, [r0, #2]
      10: e5d08003  ldrb r8, [r0, #3]
      14: e5d04004  ldrb r4, [r0, #4]
      18: e5d05005  ldrb r5, [r0, #5]
      1c: e5d06006  ldrb r6, [r0, #6]
      20: e5d07007  ldrb r7, [r0, #7]
      24: e08c200e  add r2, ip, lr
      28: e0832002  add r2, r3, r2
      2c: e0882002  add r2, r8, r2
      30: e0842002  add r2, r4, r2
      34: e0852002  add r2, r5, r2
      38: e0862002  add r2, r6, r2
      3c: e0870002  add r0, r7, r2
      40: e5810000  str r0, [r1]
      44: e0010e9e  mul r1, lr, lr
      48: e0201c9c  mla r0, ip, ip, r1
      4c: e0210393  mla r1, r3, r3, r0
      50: e0201898  mla r0, r8, r8, r1
      54: e0210494  mla r1, r4, r4, r0
      58: e0201595  mla r0, r5, r5, r1
      5c: e0210696  mla r1, r6, r6, r0
      60: e0201797  mla r0, r7, r7, r1
      64: e8bd41f0  pop {r4, r5, r6, r7, r8, lr}
      68: e1a0f00e  mov pc, lr

Much easier to read and follow, perhaps thinking about a cache and getting the reads all in one shot.
llvm in at least one case got the reads out of order as well:

    00000144 :
     144: e92d40f0  push {r4, r5, r6, r7, lr}
     148: e5d0c007  ldrb ip, [r0, #7]
     14c: e5d03006  ldrb r3, [r0, #6]
     150: e5d02005  ldrb r2, [r0, #5]
     154: e5d05004  ldrb r5, [r0, #4]
     158: e5d0e000  ldrb lr, [r0]
     15c: e5d04001  ldrb r4, [r0, #1]
     160: e5d06002  ldrb r6, [r0, #2]
     164: e5d00003  ldrb r0, [r0, #3]

Yes, for averaging some values from RAM, order is not an issue. Moving on.

So the compilers chose the unrolled path and didn't care about the micro-optimizations. Because of the size of the loop, both chose to burn a bunch of registers, holding one loaded value per iteration, and then performed either the adds or the multiplies from those temporary reads. If we increase the size of the loop a little, I would expect to see sum and sqsum accumulations within the unrolled loop as it runs out of registers, or the threshold will be reached where they choose not to unroll the loop.

If I pass the length in, and replace the 8's in the code above with that passed-in length, the compiler is forced to make a loop out of this. You sort of see the optimizations; instructions like this are used:

    a4: e4d35001  ldrb r5, [r3], #1

And, this being ARM, they modify the loop register in one place and branch-if-not-equal a number of instructions later, because they can.

Granted, this is a math function, but using float is painful. Using multiplies is painful too, and divides are much worse; fortunately a shift was used. And fortunately this was unsigned, so that you could use the shift (the compiler would/should have known to use an arithmetic shift, if available, if you had used a divide against a signed number).
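The variable-length experiment described above (passing the length in instead of hard-coding 8) might look roughly like this. This is only a sketch: the name `fun_len` and the `size_t len` parameter are made up for illustration and are not from the original test code, and whether a given compiler emits a post-indexed `ldrb` for the `*p++` depends on the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant of fun1: the trip count is a run-time argument,
   so the compiler cannot fully unroll against a constant 8. */
unsigned fun_len(const unsigned char *p, size_t len, unsigned *x)
{
    unsigned sum = 0;
    unsigned sqsum = 0;

    assert(x != NULL);

    /* Counting down to zero: the loop test is just "is len nonzero". */
    while (len--) {
        unsigned f = *p++;  /* load a byte and advance the pointer */
        sum += f;
        sqsum += f * f;
    }

    /* Store sum through x so the compiler cannot optimize it away. */
    x[0] = sum;
    return sqsum;
}
```

Compiling this to assembly (for example with the compiler's `-S` option) and comparing it against the fixed-length versions is one way to see which loop form the compiler actually picked.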
So basically, focus on micro-optimizations of the inner loop, since it gets run multiple times. If it can be changed so that it becomes shifts and adds, or if the data can be arranged so that this computation can be taken out of the loop (if possible; don't waste other copy loops elsewhere to do this):

    const unsigned char* p = (const unsigned char*)(j*patch->step + aux );

you could get some speed. I didn't try it, but because it is a loop in a loop, the compiler probably won't unroll that loop.

Long story short, you might get some gains, depending on the instruction set, against a dumber compiler, but this code is not really bad, so the compiler can optimize it as well as you can.
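One way to take the `j*patch->step` multiply out of the outer loop is to keep a row pointer and add the stride once per row. The sketch below uses plain C stand-ins for the `cv::Mat` fields (`data` and `step` are assumed to mean what they mean in the question; the function name `sum_8x8` is made up). A decent compiler may well do this transformation on its own, so treat it as something to measure, not a guaranteed win.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the question's function: data points at the
   first pixel of the image, step is the row stride in bytes, and
   (sx, sy) is the top-left corner of the 8x8 patch.
   Returns the sum of the 64 pixels; the sum of squares goes to *sqsum_out. */
unsigned sum_8x8(const unsigned char *data, size_t step,
                 size_t sx, size_t sy, unsigned *sqsum_out)
{
    unsigned sum = 0;
    unsigned sqsum = 0;
    const unsigned char *row;

    assert(data != NULL && sqsum_out != NULL);

    row = data + sy * step + sx;   /* computed once, outside both loops */
    for (int j = 0; j < 8; j++) {
        const unsigned char *p = row;
        for (int i = 0; i < 8; i++) {
            unsigned f = *p++;
            sum += f;
            sqsum += f * f;
        }
        row += step;               /* one add per row instead of j * step */
    }

    *sqsum_out = sqsum;
    return sum;
}
```

The mean then follows as `sum >> 6`, as in the question, since the patch has 64 pixels and the sum is unsigned.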