Adrian Böckenkamp
committed
=========================================================================================
Steps to run a boot image in Fail* using the Bochs simulator backend:
=========================================================================================
Follow the Bochs documentation, and start your own "bochsrc" configuration file
based on the "${PREFIX}/share/doc/bochs/bochsrc-sample.txt" template (or
"/usr/share/doc/bochs/examples/bochsrc.gz" on Debian systems with Bochs installed).
1. Add your floppy/cdrom/hdd image in the floppya/ata0-master/ata0-slave
sections; configure the boot: section appropriately.
2. Comment out com1 and parport1.
3. The following Bochs configuration settings (managed in the "bochsrc" file) might
be helpful, depending on your needs:
- For "headless" experiments:
config_interface: textconfig
display_library: nogui
- For an X11 GUI:
config_interface: textconfig
display_library: x
- For a wxWidgets GUI (does not play well with Fail*'s "restore" feature):
config_interface: wx
display_library: wx
- Reduce the guest system's RAM to a minimum to reduce Fail*'s memory footprint
and save/restore overhead, e.g.:
memory: guest=16, host=16
- If you want to redirect FailBochs's output to a file using the shell's
redirection operator '>', make sure "/dev/stdout" is not used as a target
file for logging. (The Debian "bochsrc" template unfortunately does this
in two places. It suffices to comment out these entries.)
- To make Fail* terminate if something unexpected happens in a larger
campaign, be sure it doesn't "ask" in these cases, e.g.:
panic: action=fatal
error: action=fatal
info: action=ignore
debug: action=ignore
pass: action=ignore
- If you need a quick-and-dirty way to pass data from the guest system to the
outside world, and you don't want to write an experiment utilizing
GuestEvents, you can use the "port e9 hack" that prints all outbs to port
0xe9 to the console:
port_e9_hack: enabled=1
- Determinism: (Fail)Bochs is deterministic regarding timer interrupts,
i.e., two experiment runs after calling simulator.restore() will count
the same number of instructions between two interrupts. However, you
need to be careful when running (Fail)Bochs with a GUI enabled: Typing
fail-client -q<return>
on the command line may lead to the GUI window receiving a "return key
released" event, resulting in a keyboard interrupt for the guest system.
This can be avoided by starting Bochs with "sleep 1; fail-client -q", by
suppressing keyboard input (CONFIG_DISABLE_KEYB_INTERRUPTS setting in
the CMake configuration), or disabling the GUI (see "headless
experiments" above).
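The settings listed above for unattended campaigns can be collected in one
place. The following shell sketch writes them to a bochsrc fragment; the helper
name write_headless_bochsrc and the output file name are hypothetical, and the
fragment still lacks the image, boot and BIOS entries that a complete "bochsrc"
needs.

```shell
#!/bin/sh
# Sketch: emit the headless, campaign-friendly bochsrc settings listed above.
# write_headless_bochsrc is an illustrative helper, not part of Fail*.
write_headless_bochsrc() {
    cat > "$1" <<'EOF'
# headless operation
config_interface: textconfig
display_library: nogui
# small guest RAM keeps the memory footprint and save/restore overhead low
memory: guest=16, host=16
# terminate instead of asking when something unexpected happens
panic: action=fatal
error: action=fatal
info: action=ignore
debug: action=ignore
pass: action=ignore
# print the guest's outb writes to port 0xe9 on the console
port_e9_hack: enabled=1
EOF
}

# Usage (the file name is an assumption):
#   write_headless_bochsrc bochsrc.headless
```

Append the generated lines to your own "bochsrc" rather than using the fragment
on its own.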

=========================================================================================
Example experiments and code snippets
=========================================================================================
Experiment "hsc-simple":
**********************************************************************
A simple standalone experiment (without a separate campaign). To compile this
experiment, the following steps are required:
1. Add "hsc-simple" to ccmake's EXPERIMENTS_ACTIVATED.
2. Enable CONFIG_EVENT_BREAKPOINTS, CONFIG_SR_RESTORE and CONFIG_SR_SAVE.
3. Build Fail* and Bochs, see "how-to-build.txt" for details.
4. Enter experiment_targets/hscsimple/ and run "bunzip2 -k *.bz2".
5. Start the Bochs simulator by typing
$ fail-client -q
After successfully booting the eCos/hello world example, the console shows
"[HSC] breakpoint reached, saving", and a hello.state/ subdirectory appears.
You probably need to adjust the bochsrc's paths to romimage/vgaromimage.
These by default point to the locations installed by the Debian packages
"bochsbios" and "vgabios"; for example, you alternatively may use the
BIOSes supplied in "${FAIL_DIR}/simulators/bochs/bios/".
6. Compile the experiment's second step: edit
fail/src/experiments/hsc-simple/experiment.cc, and change the first "#if 1"
into "#if 0". Make an incremental build, e.g., by running
"${FAIL_DIR}/scripts/rebuild-bochs.sh -" from your ${BUILD_DIR}.
7. Go back to ../experiment_targets/hscsimple/ (assuming you are in ${FAIL_DIR}) and
again run
$ fail-client -q
After restoring the state, the hello world program's calculation should
yield a different result.
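Before running the second step, it can be useful to check that the first run
really left a saved state behind. The following shell sketch tests for the
hello.state/ subdirectory mentioned above; have_saved_state is a hypothetical
helper and not part of Fail*.

```shell
#!/bin/sh
# Sketch: succeed only if the given directory contains a hello.state/ subdirectory.
# have_saved_state is an illustrative helper, not part of Fail*.
have_saved_state() {
    target_dir=${1:-.}
    [ -d "$target_dir/hello.state" ]
}

# Usage, from within experiment_targets/hscsimple/:
#   have_saved_state . && fail-client -q
```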
Experiment "coolchecksum":
**********************************************************************
An example for separate campaign/experiment implementations. To compile this
experiment, the following steps are required:
1. Run step #1 (and if you're curious how COOL_ECC_NUMINSTR in
experimentInfo.hpp was figured out, then step #2) of the experiment
(analogous to what had to be done for the "hsc-simple" experiment,
see above). The experiment's target guest system can be found under
../experiment_targets/coolchecksum/.
(If you want to enable COOL_FAULTSPACE_PRUNING, step #2 is mandatory because
it generates the instruction/memory access trace needed for pruning.)
2. Build the campaign server (if it wasn't already built automatically):
$ make coolchecksum-server
3. Run the campaign server: bin/coolchecksum-server
4. In another terminal, run step #3 of the experiment ("fail-client -q").

Step #3 of the experiment currently runs 2000 experiment iterations and then
terminates, because Bochs has some memory leak issues. You need to re-run
fail-client for the next 2k experiments.

The experiments can be significantly sped up by
a) parallelization (run more FailBochs clients) and
b) a headless (and more optimized) Fail* configuration (see above).
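Because the client stops after 2000 iterations, it has to be started again for
each further batch. A minimal shell sketch of such a loop follows; run_rounds is
a hypothetical helper, and the sketch assumes that the client exits with a
nonzero status on failure, which this document does not state. For real setups,
see scripts/client.sh (described under "Parallelization"), which serves a
similar purpose.

```shell
#!/bin/sh
# Sketch: run a command a fixed number of times, stopping at the first failure.
# run_rounds is an illustrative helper, not part of Fail*.
run_rounds() {
    rounds=$1
    shift
    i=0
    while [ "$i" -lt "$rounds" ]; do
        "$@" || return 1    # assumption: nonzero exit status signals failure
        i=$((i + 1))
    done
}

# Usage (campaign server already running), e.g. five batches of 2000 experiments:
#   run_rounds 5 fail-client -q
```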
Experiment "MHTestCampaign":
**********************************************************************
An example for separate campaign/experiment implementations.
1. Execute campaign (job server): ${BUILD_DIR}/bin/MHTestCampaign-server
2. Run the FailBochs instance in a properly defined environment:

=========================================================================================
Parallelization
=========================================================================================
Fail* is designed to parallelize experiment execution, which reduces the time needed to
run the experiments on a (larger) set of experiment data (i.e., input parameters for the
experiment execution, e.g., instruction pointer, registers, bit numbers, ...). We call
such experiment data "parameter sets". The so-called "campaign" manages the parameter
sets (i.e., the data to be used by the experiment flows) and hands them out to the
clients on request. Consequently, the campaign runs on the server side and the
experiment flows run on the (distributed) clients.

First of all, the Fail* instances (and other required files, e.g. saved state) are
distributed to the clients. In the second step the campaign(-server) is started, preparing
its parameter sets in order to be able to answer the requests from the clients. (Once
there are available parameter sets, the clients can request them.) In the final step,
the distributed Fail* clients have to be started. As soon as this setup is finished,
the clients request new parameter sets, execute their experiment code and return their
results to the server (aka campaign) in an iterative way, until all parameter sets have
been processed successfully. If all (new) parameter sets have been distributed, the
campaign starts to re-send unfinished parameter sets to requesting clients in order to
speed up the overall campaign execution. Additionally, this ensures that all parameter
sets will produce a corresponding result set. (If, for example, a client terminates
abnormally, no result is sent back. This scenario is dealt with by this mechanism, too.)
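
The hand-out policy described above (new parameter sets first, then re-sending
unfinished ones) can be illustrated with a toy model in shell. This is only a
sketch of the policy, not the Fail* job server; all function and variable names
are hypothetical, and parameter sets are represented by plain IDs.

```shell
#!/bin/sh
# Toy model of the hand-out policy described above; not Fail* code.
PENDING=""     # parameter sets that were never handed out
INFLIGHT=""    # parameter sets handed out whose result is still missing

campaign_init() { PENDING="$*"; INFLIGHT=""; }

# Sets NEXT to the parameter set for a requesting client ("" if all are done).
next_param_set() {
    if [ -n "$PENDING" ]; then
        set -- $PENDING               # new sets are handed out first
        NEXT=$1
        shift
        PENDING="$*"
        INFLIGHT="$INFLIGHT $NEXT"
    elif [ -n "$INFLIGHT" ]; then
        set -- $INFLIGHT              # afterwards, re-send unfinished sets
        NEXT=$1
        shift
        INFLIGHT="$* $NEXT"           # rotate so the re-sends take turns
    else
        NEXT=""
    fi
}

# Records that a result for parameter set $1 has arrived.
submit_result() {
    remaining=""
    for id in $INFLIGHT; do
        [ "$id" = "$1" ] || remaining="$remaining $id"
    done
    INFLIGHT=$remaining
}
```

A client that terminates abnormally never calls submit_result, so its parameter
set stays in INFLIGHT and is eventually handed to another client.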

Shell scripts supporting experiment distribution:
**********************************************************************
These can be found in ${FAIL_DIR}/scripts/ (for now have a look at the script files
themselves, they contain some documentation):
- fail-env.sh: Environment variables for distribution/parallelization host
lists etc.; don't modify in-place but edit your own copy!
- distribute-experiment.sh: Distribute necessary FailBochs ingredients to
experiment hosts.
- runcampaign.sh: Locally run a campaign server and a large number of
clients on the experiment hosts.
- multiple-clients.sh: Is run on an experiment host by runcampaign.sh,
starts several instances of client.sh in a tmux session.
- client.sh: (Repeatedly) Runs a single fail-client instance.

Some useful things to note:
**********************************************************************
- Using the distribute-experiment.sh script causes the local fail-client binary to
be copied to the hosts. If the binary is not present in the current directory,
the default fail-client binary (-> $ which fail-client) will be used. If you
have modified some of your experiment code (i.e., your fail-client binary will
change), don't forget to delete the local fail-client binary in order to
distribute the *new* binary.
- The runcampaign.sh script prints some status information about the clients
recently started. In addition, there will be a few error messages concerning
ssh, tmux and so on. They can be ignored for now.
- The runcampaign.sh script starts the coolchecksum-server. Note that the server
instance will terminate immediately (without notice), if there is still an
existing coolcampaign.csv file.
- In order to make the performance gains (mentioned above) take effect, a "workload
balancing" between the server and the clients is mandatory. This means that
the communication overhead (client <-> server) and the time needed to execute
the experiment code on the client side should be in reasonable proportion. More
specifically, exactly two TCP connections are established for each experiment
(send parameter set to client, send result to server). Therefore,
you should ensure that the jobs you distribute take long enough not to
overwhelm the server with requests. You may need to bundle parameters for
more than one experiment if a single experiment only takes a few hundred
milliseconds. (See existing experiments for examples.)
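As a rough aid for choosing a bundle size, the following shell sketch computes
how many experiments to pack into one job so that a job runs for at least a
target duration. bundle_size is a hypothetical helper, and the numbers in the
usage line (200 ms per experiment, a 5 s target) are purely illustrative, not
measurements or recommendations from Fail*.

```shell
#!/bin/sh
# Sketch: number of experiments per job so that one job runs for at least
# target_ms milliseconds. bundle_size is an illustrative helper, not part of Fail*.
bundle_size() {
    per_experiment_ms=$1
    target_ms=$2
    # integer ceiling of target_ms / per_experiment_ms
    echo $(( (target_ms + per_experiment_ms - 1) / per_experiment_ms ))
}

bundle_size 200 5000    # prints 25
```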
=========================================================================================
Steps to run an experiment with gem5:
=========================================================================================
1. Create a directory that will serve as the gem5 system directory (it
will contain the guest system and boot image). It is called $SYSTEM below.
2. Create two directories, $SYSTEM/binaries and $SYSTEM/disks.
3. Put the guest system kernel into $SYSTEM/binaries and the boot image into $SYSTEM/disks.
For ARM targets, you can use the "linux-arm-ael.img" image contained in
http://www.gem5.org/dist/current/arm/arm-system-2011-08.tar.bz2
As an example, the resulting directory structure might look like this:
boecke@kos:~/$FAIL_DIR/build/gem5sys$ find
./binaries/abo-simple-arm.elf # your experiment binary (!= gem5)
./disks/linux-arm-ael.img # the ARM image (FIXME: what's this exactly?)
./disks/boot.arm # the ARM bootloader (FIXME: ditto)
4. Run gem5 in $FAIL_DIR/simulators/gem5/ with:
$ M5_PATH=$SYSTEM build/ARM/gem5.debug configs/example/fs.py --bare-metal --kernel kernelname
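Steps 1 to 3 can be scripted. The following shell sketch creates the directory
layout and copies a kernel and a boot image into place; prepare_gem5_system is a
hypothetical helper, and the file names in the usage line are taken from the
example listing above.

```shell
#!/bin/sh
# Sketch of steps 1-3: create the gem5 system directory and populate it.
# prepare_gem5_system is an illustrative helper, not part of Fail* or gem5.
prepare_gem5_system() {
    system_dir=$1
    kernel=$2
    image=$3
    mkdir -p "$system_dir/binaries" "$system_dir/disks" || return 1
    cp "$kernel" "$system_dir/binaries/" || return 1
    cp "$image" "$system_dir/disks/"
}

# Usage:
#   prepare_gem5_system "$SYSTEM" abo-simple-arm.elf linux-arm-ael.img
```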