Denis Huber
2016-12-03 10:51:38 UTC
Dear Genode community,
thanks to you [1], I was able to implement my Checkpoint/Restore mechanism on
Genode/Fiasco.OC. I also added the incremental checkpoint optimization,
which stores only the memory regions that changed since the last checkpoint
(although this does not work reliably yet due to a Fiasco.OC bug, which
Stefan Kalkowski found for me [2]). I also managed to checkpoint the
capability map, restore it with new badges, and insert the missing
capabilities into the capability space of Fiasco.OC.
[1] https://sourceforge.net/p/genode/mailman/message/35322604/
[2] https://sourceforge.net/p/genode/mailman/message/35377269/
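To illustrate the incremental idea mentioned above, here is a minimal,
self-contained sketch. It is not the actual implementation and uses no Genode
API: the names Region, Checkpoint, REGION_SIZE and incremental_checkpoint are
hypothetical, and change detection is done by a plain byte comparison, which is
only an assumption made for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical region granularity, chosen only for this sketch.
constexpr std::size_t REGION_SIZE = 4096;
using Region = std::vector<std::uint8_t>;

struct Checkpoint {
    // region index -> copy of that region as of the last checkpoint
    std::map<std::size_t, Region> regions;
};

// Stores in 'cp' only those regions of 'mem' that differ from the copy
// already held in 'cp'. Returns the number of regions that were stored.
std::size_t incremental_checkpoint(const std::vector<std::uint8_t>& mem,
                                   Checkpoint& cp)
{
    std::size_t copied = 0;
    for (std::size_t off = 0; off < mem.size(); off += REGION_SIZE) {
        std::size_t len = std::min(REGION_SIZE, mem.size() - off);
        Region current(mem.begin() + off, mem.begin() + off + len);
        std::size_t index = off / REGION_SIZE;

        auto it = cp.regions.find(index);
        if (it == cp.regions.end() || it->second != current) {
            cp.regions[index] = current;  // new or changed: store it
            ++copied;
        }
    }
    return copied;
}
```

The first call stores every region; later calls store only regions whose
content changed in the meantime.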
My problem is that, although I restore the state of all RPC objects (in
particular the instruction and stack pointer of the main thread) as well as
the capability map and space, the target component simply starts executing
from the beginning of its Component::construct function.
My approach:
For the restore phase, I use Genode's native bootstrap mechanism (i.e. I
create a Genode::Child object) until it requests a LOG session from my
Checkpoint/Restore component. I force a LOG session request in
::Constructor_component::construct() just before
"Genode::call_component_construct(env);" in
https://github.com/genodelabs/genode/blob/16.08/repos/base/src/lib/base/entrypoint.cc#L154
Up to the session request, several RAM dataspaces are created, among
other RPC objects, and attached to the address space. In my restore
mechanism, I identify the RPC objects that were created by the
bootstrap/startup mechanism and only restore their state. After that
point, I recreate and restore the state of all other RPC objects that
are known to the child component. Finally, I restore the capability map
and space.
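The distinction between the two kinds of objects can be sketched as follows.
This is only an illustration under stated assumptions, not the code in the
repository: Stored_object, Live_object and restore_objects are hypothetical
names, objects are matched by name purely for simplicity, and the object state
is reduced to a single integer.

```cpp
#include <string>
#include <vector>

// All types and names below are hypothetical and exist only for this sketch.
struct Stored_object { std::string name; int state; };
struct Live_object   { std::string name; int state; bool recreated; };

// 'live' initially holds the objects the bootstrap mechanism has already
// created. A checkpointed object found there only gets its state
// overwritten; every other checkpointed object is recreated and appended.
void restore_objects(const std::vector<Stored_object>& stored,
                     std::vector<Live_object>& live)
{
    for (const Stored_object& s : stored) {
        bool found = false;
        for (Live_object& l : live) {
            if (l.name == s.name) {
                l.state = s.state;  // created by bootstrap: restore state only
                found = true;
                break;
            }
        }
        if (!found)
            live.push_back(Live_object{s.name, s.state, true});  // recreate
    }
}
```

Objects that exist after bootstrap but are absent from the checkpoint are left
untouched in this sketch.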
During that process, the mandatory CPU threads are identified (three of
them: "ep", "signal_handler", and "childs_rom_name") and restored to
their checkpointed state, in particular the ip and sp registers. I did that
via Cpu_thread::state(Thread_state), but without luck.
Although I know that the CPU threads had already been started, I also tried
calling Cpu_thread::start(ip, sp), again without success.
After the restoration, which happens entirely during the LOG session
request of the child, my component returns a valid session object
to the child. Now the child should continue its work from the point
where it was checkpointed, but it continues its execution right after
the LOG session request, ignoring the instruction pointer I set.
The source code that restores the CPU thread state can be found in [3]. I used
run script [4] for the tests.
[3]
https://github.com/702nADOS/genode-CheckpointRestore-SharedMemory/blob/660a865084a4fe8524a0ccacc4bfb97f728482c9/src/rtcr/restorer.cc#L791
[4]
https://github.com/702nADOS/genode-CheckpointRestore-SharedMemory/blob/660a865084a4fe8524a0ccacc4bfb97f728482c9/run/rtcr_restore_child.run
Curiously, the child runs just as if nothing had happened, although its stack
area was also manipulated.
Perhaps my approach of reusing the bootstrap/startup mechanism is not
destined to work, or maybe I have missed some important points in this
mechanism. If so, please point me to the problem.
I would also consider other restoration approaches, for example,
recreating all RPC objects manually and inserting them into the capability
map/space.
What are your thoughts on my approach? Can it work? Does something else
work better?
Kind regards,
Denis