Discussion:
Page faults in managed dataspaces
Denis Huber
2016-09-19 13:01:05 UTC
Dear Genode Community,

I want to implement a mechanism to monitor the access of a component to
its address space.

My idea is to implement a monitoring component which provides managed
dataspaces to a target component. Each managed dataspace has several
designated dataspaces (allocated, but not attached, and with a fixed
location in the managed dataspace). I want to use several dataspaces to
control the access range of the target component.

Whenever the target component accesses an address in the managed
dataspace, a page fault is triggered, because the managed dataspace has
no dataspaces attached to it. The page fault is caught by a custom page
fault handler. The page fault handler attaches the designated dataspace
into the faulting managed dataspace and resolves the page fault.
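
In rough code, the idea looks like this (a simplified sketch against the
Genode 16.05 Region_map/Signal_receiver API, reduced to a single designated
dataspace and a fixed 4 KiB page size; the actual prototype is linked below
and may differ in detail):

  /*
   * Sketch of the fault-handler loop of the monitoring component.
   * A managed dataspace is a Region_map created via an Rm_connection;
   * its faults are delivered as signals and resolved by attaching the
   * designated dataspace at the faulting offset.
   */
  #include <base/signal.h>
  #include <region_map/client.h>
  #include <dataspace/capability.h>

  void handle_faults(Genode::Region_map_client   &managed_rm,
                     Genode::Dataspace_capability designated_ds,
                     Genode::Signal_receiver     &sig_rec,
                     Genode::Signal_context      &fault_context)
  {
      using namespace Genode;

      /* let the managed dataspace report its faults to our signal context */
      managed_rm.fault_handler(sig_rec.manage(&fault_context));

      while (true) {

          /* block until the target faults inside the managed dataspace */
          sig_rec.wait_for_signal();

          Region_map::State state = managed_rm.state();
          if (state.type == Region_map::State::READY)
              continue;

          /* attach the designated dataspace at the faulting page */
          managed_rm.attach_at(designated_ds, state.addr & ~(4096UL - 1));
      }
  }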

To test my concept I implemented a prototypical system with a monitoring
component (called "ckpt") [1] and a target component [2].

[1]
https://github.com/702nADOS/genode-CheckpointRestore-SharedMemory/blob/b502ffd962a87a5f9f790808b13554d6568f6d0b/src/test/concept_session_rm/server/main.cc
[2]
https://github.com/702nADOS/genode-CheckpointRestore-SharedMemory/blob/b502ffd962a87a5f9f790808b13554d6568f6d0b/src/test/concept_session_rm/client/main.cc

The monitoring component provides a service [3] through which it receives a
Thread capability (used to pause the target component before detaching a
dataspace and to resume it afterwards) and through which it hands out the
managed dataspace to the client.

[3]
https://github.com/702nADOS/genode-CheckpointRestore-SharedMemory/tree/b502ffd962a87a5f9f790808b13554d6568f6d0b/include/resource_session
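
Just to illustrate the shape of that session (the authoritative declarations
are in [3]; all names in the following sketch are invented and do not
necessarily match the real interface):

  /* Hypothetical sketch of a Resource-session-like RPC interface */
  #include <session/session.h>
  #include <base/rpc.h>
  #include <cpu_session/cpu_session.h>   /* Thread_capability */
  #include <dataspace/capability.h>

  struct Resource_session : Genode::Session
  {
      static const char *service_name() { return "Resource"; }

      /* client hands over its main-thread capability for pause/resume */
      virtual void provide_thread(Genode::Thread_capability thread) = 0;

      /* client obtains the managed dataspace for its address space */
      virtual Genode::Dataspace_capability dataspace() = 0;

      GENODE_RPC(Rpc_provide_thread, void, provide_thread,
                 Genode::Thread_capability);
      GENODE_RPC(Rpc_dataspace, Genode::Dataspace_capability, dataspace);
      GENODE_RPC_INTERFACE(Rpc_provide_thread, Rpc_dataspace);
  };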

The monitoring component runs a main loop which pauses the client's main
thread and detaches all attached dataspaces from the managed dataspace.
The target component also runs a main loop which prints (reads) a number
from the managed dataspace to the console and increments (writes) it in
the managed dataspace.
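
Stripped down, the target's loop is essentially the following (an
illustrative sketch, not the code from [2]; the delay and variable names
are placeholders):

  /* Sketch of the target's main loop: read, print, and increment a
   * counter that lives at the start of the managed dataspace. */
  #include <base/printf.h>
  #include <region_map/region_map.h>
  #include <dataspace/capability.h>
  #include <timer_session/connection.h>

  void target_loop(Genode::Region_map          &address_space,
                   Genode::Dataspace_capability managed_ds)
  {
      /* attach the managed dataspace; the first access will page-fault */
      unsigned volatile *counter = address_space.attach(managed_ds);

      Timer::Connection timer;

      while (true) {
          unsigned const value = *counter;    /* read  -> may fault */
          Genode::printf("%u\n", value);
          *counter = value + 1;               /* write -> may fault */
          timer.msleep(1000);
      }
  }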

The run script is found here [4].

[4]
https://github.com/702nADOS/genode-CheckpointRestore-SharedMemory/blob/b502ffd962a87a5f9f790808b13554d6568f6d0b/run/concept_session_rm.run

The scenario works for the first two iterations of the monitoring
component: every 4 seconds it detaches the dataspaces from the managed
dataspace and afterwards resolves the resulting page faults by attaching
the dataspaces back. After the second iteration, however, the target
component accesses the theoretically empty managed dataspace but does not
trigger a page fault. In fact, it reads and writes to the designated
dataspace as if it were still attached.

Running the run script produces the following output:
[init -> target] Initialization started
[init -> target] Requesting session to Resource service
[init -> ckpt] Initialization started
[init -> ckpt] Creating page fault handler thread
[init -> ckpt] Announcing Resource service
[init -> target] Sending main thread cap
[init -> target] Requesting dataspace cap
[init -> target] Attaching dataspace cap
[init -> target] Initialization ended
[init -> target] Starting main loop
Genode::Pager_entrypoint::entry()::<lambda(Genode::Pager_object*)>:Could
not resolve pf=6000 ip=10034bc
[init -> ckpt] Initialization ended
[init -> ckpt] Starting main loop
[init -> ckpt] Waiting for page faults
[init -> ckpt] Handling page fault: READ_FAULT pf_addr=0x00000000
[init -> ckpt] attached sub_ds0 at address 0x00000000
[init -> ckpt] Waiting for page faults
[init -> target] 0
[init -> target] 1
[init -> target] 2
[init -> target] 3
[init -> ckpt] Iteration #0
[init -> ckpt] valid thread
[init -> ckpt] detaching sub_ds_cap0
[init -> ckpt] sub_ds_cap1 already detached
Genode::Pager_entrypoint::entry()::<lambda(Genode::Pager_object*)>:Could
not resolve pf=6000 ip=10034bc
[init -> ckpt] Handling page fault: READ_FAULT pf_addr=0x00000000
[init -> ckpt] attached sub_ds0 at address 0x00000000
[init -> ckpt] Waiting for page faults
[init -> target] 4
[init -> target] 5
[init -> target] 6
[init -> target] 7
[init -> ckpt] Iteration #1
[init -> ckpt] valid thread
[init -> ckpt] detaching sub_ds_cap0
[init -> ckpt] sub_ds_cap1 already detached
[init -> target] 8
[init -> target] 9
[init -> target] 10
[init -> target] 11
[init -> ckpt] Iteration #2
[init -> ckpt] valid thread
[init -> ckpt] sub_ds_cap0 already detached
[init -> ckpt] sub_ds_cap1 already detached
[init -> target] 12
[init -> target] 13

As you can see, after "Iteration #1" ended no page fault occurred,
although the target component still printed and incremented the integer
stored in the managed dataspace.

Could it be that the detach operation was not executed correctly?


Kind regards
Denis

------------------------------------------------------------------------------
Denis Huber
2016-09-24 16:20:03 UTC
Dear Genode Community,

perhaps the wall of text makes the problem a bit discouraging to tackle.
Let me summarize the important facts of the scenario:

* Two components 'ckpt' and 'target'
* ckpt shares a thread capability of target's main thread
* ckpt shares a managed dataspace with target
* this managed dataspace is initially empty

target's behaviour:
* target periodically reads and writes from/to the managed dataspace
* target causes page faults (pf) which are handled by ckpt's pf handler
thread
* pf handler attaches a pre-allocated dataspace to the managed
dataspace and resolves the pf

ckpt's behaviour:
* ckpt periodically detaches all attached dataspaces from the managed
dataspace

Outcome:
After two successful cycles (pf->attach->detach -> pf->attach->detach)
the target does not cause a pf, but reads and writes to the managed
dataspace although it is (theoretically) empty.

I used Genode 16.05 with a foc_pbxa9 build. Can somebody help me with
this? I actually have no idea what the cause could be.


Kind regards,
Denis
------------------------------------------------------------------------------
Sebastian Sumpf
2016-09-26 08:44:59 UTC
Hey Denis,
Post by Denis Huber
Dear Genode Community,
perhaps the wall of text is a bit discouraging to tackle the problem.
* Two components 'ckpt' and 'target'
* ckpt shares a thread capability of target's main thread
* ckpt shares a managed dataspace with target
* this managed dataspace is initially empty
* target periodically reads and writes from/to the managed dataspace
* target causes page faults (pf) which are handled by ckpt's pf handler
thread
* pf handler attaches a pre-allocated dataspace to the managed
dataspace and resolves the pf
* ckpt periodically detaches all attached dataspaces from the managed
dataspace
After two successful cycles (pf->attach->detach -> pf->attach->detach)
the target does not cause a pf, but reads and writes to the managed
dataspace although it is (theoretically) empty.
I used Genode 16.05 with a foc_pbxa9 build. Can somebody help me with my
problem? I actually have no idea what could be the problem.
You are programming on fairly untested ground here. There might still
be bugs or corner cases in this part of the code, so someone might
have to look into it (while we are very busy right now). Your
problem is reproducible with the run script [4], right?

By the way, your way of reporting is exceptional: the more information
and actual test code we have, the better we can debug problems. So,
please keep it this way, even though we might not read all of it at times ;)

Regards, and if I find the time I will look into your issue,

Sebastian
------------------------------------------------------------------------------
Stefan Kalkowski
2016-09-26 09:13:54 UTC
Hi Denis,

I've looked into your code, and what struck me first was that you use
two threads in your server which share data between them
(Resource::Client_resources) without any synchronization.

I've rewritten your example server to use only one thread in a
state-machine-like fashion; have a look here:

https://github.com/skalk/genode-CheckpointRestore-SharedMemory/commit/d9732dcab331cecdfd4fcc5c8948d9ca23d95e84

This way it is thread-safe, simpler (less code), and once you are
accustomed to the style, even easier to understand.
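
The underlying pattern is a single blocking signal receiver that
multiplexes the periodic timer and the region-map faults, roughly like
this (an illustrative sketch with invented names, not the code from the
commit above; pausing/resuming the target's thread is omitted):

  #include <base/signal.h>
  #include <region_map/client.h>
  #include <timer_session/connection.h>

  void server_loop(Genode::Region_map_client   &managed_rm,
                   Genode::Dataspace_capability designated_ds)
  {
      using namespace Genode;

      Signal_receiver sig_rec;
      Signal_context  fault_context, timer_context;

      /* faults of the managed dataspace and timer ticks arrive as signals */
      managed_rm.fault_handler(sig_rec.manage(&fault_context));

      Timer::Connection timer;
      timer.sigh(sig_rec.manage(&timer_context));
      timer.trigger_periodic(4*1000*1000); /* 4-second detach period */

      bool   attached    = false;
      addr_t attached_at = 0;

      while (true) {
          Signal signal = sig_rec.wait_for_signal();

          if (signal.context() == &fault_context) {
              /* resolve the fault by attaching the designated dataspace */
              Region_map::State state = managed_rm.state();
              attached_at = state.addr & ~(4096UL - 1);
              managed_rm.attach_at(designated_ds, attached_at);
              attached = true;

          } else if (signal.context() == &timer_context) {
              /* periodically detach, forcing a fresh fault on the next access */
              if (attached) {
                  managed_rm.detach(attached_at);
                  attached = false;
              }
          }
      }
  }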

Nevertheless, although the possible synchronization problems are
eliminated by design, the problem you describe remains. I'll have a
deeper look into our attach/detach implementation of managed dataspaces,
but I cannot promise that this will happen soon.

Best regards
Stefan
--
Stefan Kalkowski
Genode Labs

https://github.com/skalk · http://genode.org/

------------------------------------------------------------------------------
Stefan Kalkowski
2016-09-26 13:15:55 UTC
Hi Denis,

I further examined the issue. First, I found out that it is specific to
Fiasco.OC; if you run the same test on another kernel, e.g., NOVA, it
succeeds. So I instrumented the core component to always enter
Fiasco.OC's kernel debugger when core unmapped the corresponding managed
dataspace. Looking at the page tables, I could see that the mapping had
been deleted successfully. After that I enabled all kinds of logging
related to page faults and mapping operations. Lo and behold, after
continuing and seeing that the "target" thread kept running, I re-entered
the kernel debugger and realized that the page-table entry had reappeared,
although the kernel did not list any activity regarding page faults or
mappings. To me this is a clear kernel bug.
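
(For reference, the instrumentation itself is basically a one-liner around
Fiasco.OC's enter_kdebug(); where exactly I placed it in core's unmap path
is not shown here:)

  #include <l4/sys/kdebug.h>

  static void stop_in_jdb_after_unmap()
  {
      /* stop the system in Fiasco.OC's kernel debugger (JDB) so the
         page tables can be inspected, e.g. right after core unmapped
         the pages of the managed dataspace */
      enter_kdebug("core unmapped managed dataspace");
  }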

I've tried out my unofficial upgrade to revision r67 of the Fiasco.OC
kernel, and with that version it seemed to work correctly (I only tested
a few rounds).

I fear the currently supported version of Fiasco.OC is buggy with
respect to the unmap call, at least in the way Genode has to use it.

Regards
Stefan
--
Stefan Kalkowski
Genode Labs

https://github.com/skalk · http://genode.org/

------------------------------------------------------------------------------
Denis Huber
2016-09-26 14:48:44 UTC
Hello Stefan,

thank you for your help and for finding the problem :)

Can you tell me how I can obtain your unofficial upgrade of foc and how
I can replace Genode's standard version with it?


Kind regards,
Denis
------------------------------------------------------------------------------
Stefan Kalkowski
2016-09-27 07:17:58 UTC
Hi Denis,
Post by Denis Huber
Hello Stefan,
thank you for your help and finding the problem :)
Can you tell me, how I can obtain your unofficial upgrade of foc and how
I can replace Genode's standard version with it?
But be warned: it is unofficial because I started the upgrade but
stopped at some point due to time constraints. That means certain
problems we have already fixed in the older version might still exist in
the upgrade. Moreover, it is almost completely untested. Having said
this, you can find it in my repository; it is the branch called
foc_update. I've rebased it onto the current master branch of Genode.

Regards
Stefan
--
Stefan Kalkowski
Genode Labs

https://github.com/skalk · http://genode.org/

------------------------------------------------------------------------------
Denis Huber
2016-09-27 08:48:39 UTC
Again, thank you Stefan, you are a big help! :)

The test program ran successfully. Is there a date when Genode's foc
version will be upgraded?


Kind regards,
Denis
Post by Stefan Kalkowski
Hi Dennis,
Post by Denis Huber
Hello Stefan,
thank you for your help and finding the problem :)
Can you tell me, how I can obtain your unofficial upgrade of foc and how
I can replace Genode's standard version with it?
But be warned: it is unofficial, because I started to upgrade but
stopped at some point due to timing constraints. That means certain
problems we already fixed in the older version might still exist in the
upgrade. Moreover, it is almost completely untested. Having said this,
you can find it in my repository, it is the branch called foc_update.
I've rebased it to the current master branch of Genode.
Regards
Stefan
Post by Denis Huber
Kind regards,
Denis
Post by Stefan Kalkowski
Hi Dennis,
I further examined the issue. First, I found out that is is specific to
Fiasco.OC. If you use another kernel, e.g., Nova, with the same test, it
succeeds. So I instrumented the core component to always enter
Fiasco.OC's kernel debugger when core unmapped the corresponding managed
dataspace. When looking at the page-tables I could see that the mapping
was successfully deleted. After that I enabled all kind of loggings
related to page-faults and mapping operations. Lo and behold, after
continuing and seeing that the "target" thread continued, I re-entered
the kernel debugger and realized that the page-table entry reappeared
although the kernel did not list any activity regarding page-faults and
mappings. To me this is a clear kernel bug.
I've tried out my unofficial upgrade to revision r67 of the Fiasco.OC
kernel, and with that version it seemed to work correctly (I just tested
some rounds).
I fear the currently supported version of Fiasco.OC is buggy with
respect to the unmap call, at least the way Genode has to use it.
Regards
Stefan
Post by Stefan Kalkowski
Hi Dennis,
I've looked into your code, and what struck me first was that you use
two threads in your server, which share data in between
(Resource::Client_resources) without synchronization.
I've rewritten your example server to only use one thread in a
https://github.com/skalk/genode-CheckpointRestore-SharedMemory/commit/d9732dcab331cecdfd4fcc5c8948d9ca23d95e84
This way it is thread-safe, simpler (less code), and if you are adapted
to it, it becomes even easier to understand.
Nevertheless, although the possible synchronization problems are
eliminated by design, your described problem remains. I'll have a deeper
look into our attach/detach implementation of managed dataspaces, but I
cannot promise whether this will happen in short time.
Best regards
Stefan
Post by Sebastian Sumpf
Hey Denis,
Post by Denis Huber
Dear Genode Community,
perhaps the wall of text is a bit discouraging to tackle the problem.
* Two components 'ckpt' and 'target'
* ckpt shares a thread capability of target's main thread
* ckpt shares a managed dataspace with target
* this managed dataspace is initially empty
* target periodically reads and writes from/to the managed dataspace
* target causes page faults (pf) which are handled by ckpt's pf handler
thread
* pf handler attaches a pre-allocated dataspace to the managed
dataspace and resolves the pf
* ckpt periodically detaches all attached dataspaces from the managed
dataspace
After two successful cycles (pf->attach->detach -> pf->attach->detach)
the target does not cause a pf, but reads and writes to the managed
dataspace although it is (theoretically) empty.
I used Genode 16.05 with a foc_pbxa9 build. Can somebody help me with my
problem? I actually have no idea what could be the problem.
You are programming against fairly untested grounds here. There still
might be bugs or corner cases in this line of code. So, someone might
have to look into things (while we are very busy right now). Your
problem is reproducible with [4] right?
By the way, your way of reporting is exceptional, the more information
and actual test code we have, the better we can debug problems. So,
please keep it this way, even though we might not read all of it at times ;)
Regards and if I find the time, I will look into your issue,
Sebastian
Stefan Kalkowski
2016-09-27 09:51:46 UTC
Permalink
Hi Denis,
Post by Denis Huber
Again, thank you Stefan, you are a big help! :)
You are welcome.
Post by Denis Huber
The test program ran successfully. Is there a date when Genode's foc
version will be upgraded?
When you ask this question to the team at Genode Labs, the short answer
is: no, there is no fixed date yet. As you can imagine, there is a lot of
different work to do to make Genode a serious OS alternative to existing,
established ones, quite apart from the kernel component.
For us at Genode Labs it is always a question of priorities with respect
to community efforts, paid projects, and our personal agenda. With
respect to the kernel component and our own agenda: first and foremost,
we currently use the Nova hypervisor to run Genode on our own laptops.
We keep developing our self-written kernel library (ARM, x86, RISC-V)
for core, because it best matches the Genode abstractions, is under our
full control, and is completely understood by us. On the other hand, we
recently ported Genode to the seL4 kernel to attract the community, and
we did the same with the Fiasco.OC kernel in the past. If there is a lot
of usage "pressure" from the community, we have to re-think our
priorities.

Nevertheless, Genode is an open-source _community_ project. We always
try to motivate people to contribute to the project. I know that in the
past people from Universidad Central "Marta Abreu" de Las Villas in Cuba
used the unofficial update branch of Fiasco.OC too, but sadly I cannot
find their repository anymore, so I'm unsure whether they discontinued
their work and how far they got. I do not know whether there are still
other people using Fiasco.OC/Genode, or whether they will contribute an
updated version soon.
From my perspective, it would be best if people who use such components
anyway contributed improvements like these back to the overall project.
Otherwise, it is obvious that our small team cannot provide more and
more functionality and simultaneously keep all 3rd-party projects up to
date.

To sum it up: I cannot give you a specific date, but I would encourage
you and other people currently using the Fiasco.OC kernel to attempt the
update yourselves (using the unofficial work as ground work) and, at
best, to contribute the results to mainline Genode. Of course, we (at
Genode Labs) would always try to help you in such a process.
Alternatively, you can use another kernel, or try to change our
priorities in favor of Fiasco.OC by financial help or persuasion ;-).

Best regards
Stefan
Post by Denis Huber
Kind regards,
Denis
Post by Stefan Kalkowski
Hi Denis,
Post by Denis Huber
Hello Stefan,
thank you for your help and for finding the problem :)
Can you tell me how I can obtain your unofficial upgrade of foc and how
I can replace Genode's standard version with it?
But be warned: it is unofficial because I started the upgrade but
stopped at some point due to time constraints. That means certain
problems we already fixed in the older version might still exist in the
upgrade. Moreover, it is almost completely untested. Having said this,
you can find it in my repository; it is the branch called foc_update.
I've rebased it onto the current master branch of Genode.
Regards
Stefan
Post by Denis Huber
Kind regards,
Denis
--
Stefan Kalkowski
Genode Labs

https://github.com/skalk · http://genode.org/