Discussion:
Benchmarking Genode TrustZone
Tiago Brito
2016-06-21 13:30:36 UTC
Permalink
Hi, I want to benchmark the execution of a function running in the secure
world of the TZ_VMM scenario on the i.MX53 QSB.

I have added a syscall to Linux which allows me to trigger a world switch
from a user program running in Linux. In this program I have a function
which allocates a buffer and processes it (each buffer position is changed
in some way). This same function is coded inside TZ_VMM.

This is what I'm testing:

1. Inside my user program in Linux I call gettimeofday before and after
the execution of the function to get the number of milliseconds elapsed in
between. This is my NW test.
2. Inside my user program in Linux I call gettimeofday to get the start
time, then I execute the syscall, which in turn does a world switch. The
function is then executed inside the SW and control returns to the user
program inside Linux. After this I call gettimeofday again to get the
number of milliseconds of execution.
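
The two measurements described above can be sketched roughly as follows. This is a minimal sketch, not the actual test program: `process_buffer` is a hypothetical stand-in for the function under test, and for test 2 its call would be replaced by the custom world-switch syscall, whose number and arguments are not given in the thread.

```c
#include <stddef.h>
#include <sys/time.h>

/* Milliseconds between two gettimeofday() samples. */
static double elapsed_ms(const struct timeval *start, const struct timeval *end)
{
    return (end->tv_sec - start->tv_sec) * 1000.0
         + (end->tv_usec - start->tv_usec) / 1000.0;
}

/* Hypothetical stand-in for the buffer-processing function under test. */
static void process_buffer(unsigned *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = buf[i] * 3u + 1u;
}

/* Test 1: sample the clock, run the function in the NW, sample again. */
static double time_process_buffer(unsigned *buf, size_t len)
{
    struct timeval start, end;

    gettimeofday(&start, NULL);
    process_buffer(buf, len);   /* test 2 would issue the SMC syscall here */
    gettimeofday(&end, NULL);

    return elapsed_ms(&start, &end);
}
```

Both samples come from the Linux clock, so the result is only as trustworthy as Linux's timekeeping during the measured interval.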

The problem is that test 1 is giving me about 90 ms of real time execution,
but test 2 gives me about 40 ms.

I suspect it might be a problem with Linux virtualization in the TZ_VMM
example, which may be causing a drift in Linux's clock once it loses
control to the SW. What I mean is, when there isn't a syscall triggering
the SMC, Linux can count time just fine, but once the control is lost to
the secure world the clock inside Linux becomes inconsistent and doesn't
count time while the secure world is executing. Is this right?

Since I really need to benchmark a scenario similar to this, I think that
the best alternative is to offload the time functionality to Genode (SW). I
would create another syscall which is responsible for starting a timer inside
Genode, then call the SMC syscall which processes the buffer in the SW,
then call the time syscall again and check the difference. To benchmark the
NW function I would follow the same steps as before. Will this work
as intended?

I'm thinking that this alternative may suffer from the same problem as
before if Genode's time clock becomes inconsistent whenever Linux is being
executed in NW.

Do you know any other way to benchmark a world switch + processing + world
switch scenario? Is there any timer I can execute inside TZ_VMM?

Thanks in advance, Tiago
Stefan Kalkowski
2016-06-23 08:33:02 UTC
Permalink
Hello Tiago,
Post by Tiago Brito
Hi, I want to benchmark the execution of a function running in the secure
world of the TZ_VMM scenario in the i.MX53 QSB.
I have added a syscall to Linux which allows me to trigger a world switch
from a user program running in Linux. In this program I have a function
which allocates a buffer and processes it (each buffer position is changed
in some way). This same function is coded inside TZ_VMM.
1. Inside my user program in Linux I use gettimeofday before and after
the execution of the function in order to get the amount of milliseconds in
between. This is my NW test.
2. Inside my user program in Linux I use gettimeofday to get the start
time, then I execute the syscall which in turn does a world switch. Then
the function is executed inside the SW and it returns to the user program
inside Linux. After this I call another gettimeofday in order to get the
amount of milliseconds of execution.
The problem is that test 1 is giving me about 90 ms of real time execution,
but test 2 gives me about 40 ms.
Well, I do not know how big your buffer is or how compute-intensive
the operation is, but in general it is not unreasonable that a
compute-intensive task completes faster in the secure world than in
the normal world, given our experimental TrustZone VMM/hypervisor. The
secure world immediately receives any secure IRQ, even during the normal
world's buffer processing, and each one may cause an expensive world
switch. In contrast, when the secure world is executing, it is not
"disturbed" by normal-world IRQs, which means no additional world switches.
Nevertheless, this probably does not explain the large gap of 50 ms.
Post by Tiago Brito
I suspect it might be a problem with Linux virtualization in the TZ_VMM
example, which may be causing a drift in Linux's clock once it loses
control to the SW. What I mean is, when there isn't a syscall triggering
the SMC, Linux can count time just fine, but once the control is lost to
the secure world the clock inside Linux becomes inconsistent and doesn't
count time while the secure world is executing. Is this right?
That is totally right. As described above, Linux won't get any IRQs
as long as the secure world is executing.
Post by Tiago Brito
Since I really need to benchmark a scenario similar to this I think that
the best alternative is to offload the time functionality to Genode (SW). I
create another syscall which is responsible for starting a timer inside
Genode, then I call the SMC syscall which processes the buffer in the SW,
then I call the time syscall again and check the difference. When I want to
benchmark the NW function I follow the same steps as before. Will this work
as intended?
It sounds quite expensive, but should work in general.
Post by Tiago Brito
I'm thinking that this alternative may suffer from the same problem as
before if Genode's time clock becomes inconsistent whenever Linux is being
executed in NW.
No, Genode's timer service will work consistently, because its secure IRQ
is prioritized higher than Linux's normal-world IRQs.
Post by Tiago Brito
Do you know any other way to benchmark a world switch + processing + world
switch scenario? Is there any timer I can execute inside TZ_VMM?
Well, in theory, if you need a specified IRQ latency in the normal
world, you need to guarantee that the normal world is executed regularly.
Therefore, you would need to turn your synchronous secure-world call
into an asynchronous one. Currently, the normal world is not executed
until the call returns. In the asynchronous case, the "SMC" call would
return immediately, and the VMM would instead inject an IRQ into the
normal world to deliver the response.
Moreover, the normal world's execution context must not be prioritized
lower than the secure-world component that does the buffer processing.
However, this would turn the whole scenario into a fundamentally
different execution model with many implications for security and
liveness. For example, the VMM can no longer rely on the shared memory's
consistency because the normal world executes in parallel, and a
higher priority for the VM can starve secure components.

To sum it up, if it's "just" for the measurements, I would not change the
fundamental setup if I were in your position.

Regards
Stefan
--
Stefan Kalkowski
Genode Labs

http://www.genode-labs.com/ · http://genode.org/
Norman Feske
2016-06-23 09:16:00 UTC
Permalink
Hi Tiago,
Post by Tiago Brito
I'm thinking that this alternative may suffer from the same problem as
before if Genode's time clock becomes inconsistent whenever Linux is
being executed in NW.
Do you know any other way to benchmark a world switch + processing +
world switch scenario? Is there any timer I can execute inside TZ_VMM?
have you considered the use of a performance counter for measuring
low-level code paths? For reference, you may take a look at the
'timestamp' function for ARM:


https://github.com/genodelabs/genode/blob/master/repos/os/include/spec/arm_v7/trace/timestamp.h

Compared to the other time sources, the counter is precise while having
very little overhead. The exact meaning of the counter value may depend
on the platform. E.g., on the Raspberry Pi where I used it, the counter
increases every 64 clock cycles.

As far as I know, the feature must be explicitly enabled by adding the
following line to your <build-dir>/etc/specs.conf:

SPECS += perf_counter

Be aware that further (TZ configuration) steps may be required to expose
the counter to the normal world.
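
For illustration, reading the cycle counter could look roughly like the sketch below. It assumes a 32-bit ARMv7 target where the performance counter has already been enabled and made accessible at the current privilege level; otherwise the `mrc` instruction traps. The names `read_timestamp` and `timestamp_diff` are hypothetical, and the `clock()` fallback exists only so that the sketch compiles on other architectures.

```c
#include <stdint.h>
#include <time.h>

/* Returns a raw timestamp: PMCCNTR on 32-bit ARM, clock() elsewhere. */
static inline uint32_t read_timestamp(void)
{
#if defined(__arm__)
    uint32_t t;
    /* Read the PMU cycle count register (CP15 c9, c13, 0). */
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(t));
    return t;
#else
    /* Assumption: portable fallback for illustration only. */
    return (uint32_t)clock();
#endif
}

/* Unsigned subtraction yields the right delta across one wrap-around. */
static inline uint32_t timestamp_diff(uint32_t start, uint32_t end)
{
    return end - start;
}
```

A measurement would sample `read_timestamp()` before and after the code path and pass both values to `timestamp_diff()`. How the resulting ticks map to time depends on the platform and counter configuration, as noted above.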

Cheers
Norman
--
Dr.-Ing. Norman Feske
Genode Labs

http://www.genode-labs.com · http://genode.org

Genode Labs GmbH · Amtsgericht Dresden · HRB 28424 · Sitz Dresden
Geschäftsführer: Dr.-Ing. Norman Feske, Christian Helmuth
Tiago Brito
2016-06-23 13:40:10 UTC
Permalink
Thanks for the replies, they were helpful!

I wasn't using the optimization flag -O3 for either the code running in the
NW or the code running in the SW. Now I am, and the times are pretty similar
between the NW execution and the SW execution for the example I was testing.

Now I'm testing another example and I'm getting some interesting results.
The code below represents an image transformation. I'm going through every
position in an array of integers and setting the new array's values to a
slight modification of the old values:

// start timer here
for (i = 0; i < size; i++) {
    color = oldp[i];
    alpha = (color >> 24) & 0xff;
    red   = (color >> 16) & 0xff;
    green = (color >> 8) & 0xff;
    blue  = color & 0xff;
    lum   = (int) (red * 0.299 + green * 0.587 + blue * 0.114);
    newp[i] = (alpha << 24) | (lum << 16) | (lum << 8) | lum;
}
// end timer here
// check timer diff and print result

I'm testing this same exact code on both the Secure and Nonsecure domains.
In the NW I'm getting about 155 ms of execution time, which for that buffer
and transformation seems ok. On the other hand, the SW is giving me about
610 ms of execution time.

I can't seem to find a reasonable explanation for this time difference,
since the code running in both scenarios is exactly the same. The secure
code is running inside the TZ_VMM example.

Do you have any idea what might be happening here?

Thanks in advance, Tiago
Christian Helmuth
2016-06-23 14:23:29 UTC
Permalink
Hello Tiago,
Post by Tiago Brito
// start timer here
for(i = 0; i < size; i++) {
color = oldp[i];
alpha = (color >> 24) & 0xff;
red = (color >> 16) & 0xff;
green = (color >> 8) & 0xff;
blue = color & 0xff;
lum = (int) (red * 0.299 + green * 0.587 + blue * 0.114);
newp[i] = (alpha << 24) | (lum << 16) | (lum << 8) | lum;
}
// end timer here
// check timer diff and print result
I'm testing this same exact code on both the Secure and Nonsecure domains.
In the NW I'm getting about 155 ms of execution time, which for that buffer
and transformation seems ok. On the other hand, the SW is giving me about
610 ms of execution time.
I can't seem to find a reasonable explanation for this time difference,
since the code running in both scenarios is exactly the same. The secure
code is running inside the TZ_VMM example.
Did you check that the generated binary code is similar? Did you try
to measure only the run time of the for-loop in both worlds?

Regards
--
Christian Helmuth
Genode Labs

http://www.genode-labs.com/ · http://genode.org/
https://twitter.com/GenodeLabs · /ˈdʒiː.nəʊd/

Genode Labs GmbH · Amtsgericht Dresden · HRB 28424 · Sitz Dresden
Geschäftsführer: Dr.-Ing. Norman Feske, Christian Helmuth
Tiago Brito
2016-06-23 15:32:18 UTC
Permalink
I did not check whether the binary code is similar, but I did measure just
the for-loop in both worlds and the times are those I described previously.

On the other hand, this code, which I used as test code before (previous
messages in this thread), does have a similar execution time in both worlds:

void bench(int n) {
    int buf[1024];
    int i, j, k, r = 0;

    for (i = 0; i < 1024; i++) {
        buf[i] = 0;
    }

    for (j = 0; j < n; j++)
        for (i = 0; i < 1024; i++)
            for (k = 0; k < 1024; k++)
                buf[i] = buf[i] + j + k;

    for (i = 0; i < 1024; i++) {
        r += buf[i];
    }
    PINF("Ended Bench %d - %d", (int)buf[0], r);
}

I tested this with n = 100000 and it showed an execution time of about 500
ms in both worlds.

Christian Helmuth
2016-06-23 16:20:56 UTC
Permalink
Hello Tiago,
Post by Tiago Brito
I did not check if the binary code is similar, but I did measure just the
for-loop in both worlds and the times are those I described previously.
You really should compare the binary code, as the example that's slower in
SW uses floating-point arithmetic unless I'm mistaken. If the code is
similar and the execution time differs much, there may be an issue
with FPU handling in SW.
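
One way to take floating point out of the comparison entirely is an integer-only variant of the loop body. The sketch below is a swapped-in fixed-point technique, not code from the thread: it uses the same weights scaled by 1000, and the helper names `luminance` and `to_grey` are hypothetical. Its results can differ by one from the floating-point version because of rounding.

```c
#include <stdint.h>

/* Integer luminance with weights 299/587/114 per mille, no floating point. */
static inline uint32_t luminance(uint32_t red, uint32_t green, uint32_t blue)
{
    return (red * 299u + green * 587u + blue * 114u) / 1000u;
}

/* Converts one ARGB pixel to grey while keeping its alpha channel. */
static inline uint32_t to_grey(uint32_t color)
{
    uint32_t alpha = (color >> 24) & 0xffu;
    uint32_t lum   = luminance((color >> 16) & 0xffu,
                               (color >> 8) & 0xffu,
                               color & 0xffu);

    return (alpha << 24) | (lum << 16) | (lum << 8) | lum;
}
```

If both worlds show similar times for this variant but not for the floating-point one, that would point towards the floating-point code generation or FPU handling rather than the loop itself.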

Greets
Tiago Brito
2016-06-27 11:16:34 UTC
Permalink
Hi again, I'm using Linaro's arm-linux-gnueabihf 5.3 toolchain to
compile the application running on top of the NW Linux, but I want to try
compiling it with the same compiler version used for Genode, since the
use of different compilers is probably interfering with my benchmark
measurements (it's the only variable).

I tried using the GCC from Genode's toolchain, but I get several
"No such file or directory" errors for the stdio.h and string.h includes. My
application also uses sockets, so the corresponding include files are needed
as well and may cause similar errors during compilation.

Do you have any suggestion on how to solve this and get the same basic
for-loop to show similar performance results in both the Normal and Secure
World execution contexts?

Thanks, Tiago

Tiago Brito
2016-06-28 09:55:33 UTC
Permalink
Hi, so after comparing the binaries I realized that my NW application is
using hard-float instructions, unlike its SW counterpart.

I switched my NW application to a toolchain that supports soft float in
order to check whether my measurements are consistent. The execution time
gap between both for-loops decreased significantly, but there's still a
gap of about 100 ms between both execution times.

Before the NW was measuring 155 ms and the SW was measuring 610 ms. Now the
NW is measuring 500 ms whilst the SW is measuring the same 610 ms as before.
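
A mismatch like this can also be made visible at build time. The sketch below is an assumption-laden illustration rather than part of the original benchmark: it relies on the predefined macros `__ARM_PCS_VFP` and `__SOFTFP__` that GCC sets for ARM targets, and `float_abi` is a hypothetical helper name.

```c
/* Reports the floating-point configuration this translation unit was built for. */
static const char *float_abi(void)
{
#if defined(__ARM_PCS_VFP)
    return "hard";     /* hard-float calling convention, FPU instructions */
#elif defined(__SOFTFP__)
    return "soft";     /* floating point emulated in software */
#elif defined(__arm__)
    return "softfp";   /* FPU instructions, soft-float calling convention */
#else
    return "non-ARM";  /* macros above are ARM-specific */
#endif
}
```

Printing the returned string from both the NW program and the SW component would show whether the two builds agree before any timing is compared.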

My theory is that Genode's scheduling might be delaying the SW execution. I
say this because I added a print in the resume function of the thread
scheduler and it prints several times when the SW for-loop is executing.

What I want to ask is, is my theory plausible? Would the SW scheduler delay
the execution by 100 ms? It seems a bit too much time...
What can I do to shorten this time gap between both executions?

Thanks in advance, Tiago
Stefan Kalkowski
2016-06-28 10:45:14 UTC
Permalink
Hi Tiago,
Post by Tiago Brito
Hi, so after comparing the binaries I realized that my NW application is
using hard float instruction, unlike it's SW counter part.
I changed my NW application toolchain for one which support soft float in
order to check if my measurements are consistent. And in some way the
execution time gap between both for-loops decreased significantly, but
there's still a 100 ms gap between both execution times.
Before the NW was measuring 155 ms and the SW was measuring 610 ms. Now the
NW is measuring 500 ms whilst the SW is measuring the same 610 ms as before.
My theory is that Genode's scheduling might be delaying the SW execution. I
say this because I added a print in the resume function of the thread
scheduler and it prints several times when the SW for-loop is executing.
What I want to ask is, is my theory plausible? Would the SW scheduler delay
the execution by 100 ms? It seems a bit too much time...
What can I do to shorten this time gap between both executions?
By default, if you do not configure any CPU quota and priority, the
kernel will schedule round-robin. As long as the "normal" world is not
stopped during the calculation within your "secure" Genode component, it
will be executed side-by-side. But I wonder why it is not stopped - did
you change the execution model?

If you want to ensure that your specific calculation routine is always
executed when it is runnable, you have to add a:

<resource name="CPU" quantum="100"/>

in its start node within the XML configuration of the init component.
This will give 100% of the CPU quota to your component. But be aware
that you can easily starve other components this way, as long as your
component never blocks.

Regards
Stefan