Commits · d307dd2ecb9c83e21a764a7d2e312b1fb7ed1921 · E-EXK4 - Operating System Group / projects / fail

Jan 27, 2014

dciao-kernelstructs: reuse sobres experiment for ISORC2014 · d307dd2e

Christian Dietrich authored 11 years ago

Differences:

- The task activation order is determined in the faulty experiment as
  well as in the golden run (which is now done by
  fail-generic-tracing) by observing a variable fail_virtual_port.
- A panic value is read from fail_virtual_port.
- The golden run task activation is determined by giving an extended
  trace to task_activation.py. The script collects all writes to
  fail_virtual_port, and determines the activation from this.
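
As a rough illustration of the last point, the following Python sketch collects all writes to one watched address from a memory-access trace and returns the written values in trace order. The trace record layout, the port address, and the function name are assumptions made for this example; they are not taken from task_activation.py, and treating each written value as a task identifier is likewise an assumption.

```python
from typing import Iterable, List, Tuple

# Hypothetical trace record: (address, access kind "R" or "W", value).
TraceEvent = Tuple[int, str, int]

# Placeholder address standing in for fail_virtual_port (assumed for this example).
FAIL_VIRTUAL_PORT = 0x1000


def activation_order(trace: Iterable[TraceEvent], port: int = FAIL_VIRTUAL_PORT) -> List[int]:
    """Return the values written to `port`, in trace order."""
    return [value for addr, kind, value in trace if addr == port and kind == "W"]


if __name__ == "__main__":
    trace = [
        (0x2000, "W", 7),  # write to an unrelated address, ignored
        (0x1000, "W", 1),  # write to the port
        (0x1000, "R", 1),  # reads of the port are ignored
        (0x1000, "W", 3),  # another write to the port
    ]
    print(activation_order(trace))
```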

Change-Id: Id401b78933b45a4b2cf031fc0a8b5ac90151ec24

d307dd2e

Jan 24, 2014

util/WallclockTimer: bugfix: include ostream · c48c7296

Horst Schirmeier authored 11 years ago

This only compiled everywhere because all users included (i)ostream.

Change-Id: I29b0fb13a01606fdffd8ebdb9701eff652065916

c48c7296

Merge branch 'ubuntu-saucy-fixes' · 85e39112
Horst Schirmeier authored 11 years ago

85e39112

Jan 23, 2014

cpn: needs comm and MySQL at link time · 17e76c14

Horst Schirmeier authored 11 years ago

The dependency on fail-comm exists not only at compile time (where it
is due to protobuf header generation) but also at link time.

Change-Id: I2bae51e763d9a385bda94e77df3e88619fa28a30

17e76c14

Jan 22, 2014

formatting, typos, comments, details · 4cb97a7f

Horst Schirmeier authored 11 years ago

Change-Id: Iae5f1acb653a694622e9ac2bad93efcfca588f3a

4cb97a7f

Merge branch 'jobclientserver-fixes' · 7591c9ed
Horst Schirmeier authored 11 years ago

7591c9ed

Merge "prune-trace: use the first write pilot instead of any" · e37f2db4
Michael Lenz authored 11 years ago

e37f2db4

prune-trace: use the first write pilot instead of any · 4ccddeb1

Michael Lenz authored 11 years ago

In some cases the write pilot is located at the upper boundary of the
experiment and is thus in a race with the experiment's end.
If the experiment's end occurs first, the campaign ends and complains
about missing data; otherwise everything is fine.
This patch works around the problem by using the first write pilot.
The race still occurs if (and only if) the only write is located at the
experiment's end, but cleverly written experiment code can, according to
hsc, circumvent it.

Change-Id: I6a27a8c4770c04ea8dcaef8aa7bd85d18f43f0b5

4ccddeb1

Jan 21, 2014

gem5: TrapListener implemented · 5fbf13d0

Richard Hellwig authored 11 years ago

The TrapListener works as it does in Bochs, but for GEM5 the trap's
offset is returned instead of a trap number.
See:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0211h/Babfeega.html

Conflicts:
	simulators/gem5/src/cpu/simple/atomic.cc

Change-Id: Ia8b2083e3c16315d9c577150f14f16995494b2e6

5fbf13d0

Merge "core/sal: Added features that indicate whether FAIL* is initialized" · fa1690bd
Richard Hellwig authored 11 years ago

fa1690bd

util: boost::thread 1.53 depends on boost::system · 81341498

Horst Schirmeier authored 11 years ago

Unfortunately this implicit dependency is currently not resolved anywhere
else (e.g., FindBoost.cmake), although the 'net heavily discusses this
issue.

Change-Id: I8a7c8518394cdba27e591fed250623011d988067

81341498

cpn: use strtoul for conversion of unsigned ints · 4e21b423

Lars Rademacher authored 11 years ago

On 32-bit libc6, atoi() caps values bigger than 2^31-1 (instead of
just letting them overflow to the corresponding negative value, as on
x86_64). It therefore must not be used for unsigned ints, and
especially not for the conversion of 32-bit pointers.

Change-Id: Ie0821a6f4cd04aebd37ea3d4028b63a05373810f

4e21b423

use uint32 for addresses in protobuf msgs · 122eb8c9

Horst Schirmeier authored 11 years ago

This prevents integer overflows when using addresses > 2GiB, which are
common for x86 operating systems with paging (Linux, Fiasco.OC) or
some test cases on the PandaBoard.
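
The overflow can be reproduced without protobuf. The Python sketch below stores an address above 2 GiB in 32 bits and reads it back once as a signed and once as an unsigned integer. It only illustrates the arithmetic, assuming the previous field type was a signed 32-bit integer; the helper names are made up for this example.

```python
import struct


def as_int32(address: int) -> int:
    """Reinterpret the low 32 bits of `address` as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", address & 0xFFFFFFFF))[0]


def as_uint32(address: int) -> int:
    """Keep the low 32 bits of `address` as an unsigned 32-bit integer."""
    return address & 0xFFFFFFFF


if __name__ == "__main__":
    kernel_address = 0xC0000000  # example address above 2 GiB
    print(as_int32(kernel_address))        # negative when read as signed
    print(hex(as_uint32(kernel_address)))  # unchanged when read as unsigned
```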

Note that this results in slightly different result table definitions
when automatically translating an experiment's protobuf message in the
DatabaseCampaign.

This change affects all existing protobuf messages to prevent
copy/paste propagation of this issue.

Change-Id: I09ec4b9d45eddd67a7a24c8b101e8b2b258df5e2

122eb8c9

Jan 20, 2014

jobclient: use initializer list · de39bf61

Horst Schirmeier authored 11 years ago

Change-Id: I7eb42f947bbabd61e1aad9224cedd7ffceec4f10

de39bf61

jobclient: initial number of jobs configurable · 5ffcb821

Horst Schirmeier authored 11 years ago

The new CLIENT_JOB_INITIAL configuration option lets the client
request more than one job in the first request round.
If a reasonable initial value is chosen, this removes the job ramp-up
after each fail-client restart, and slightly improves overall
throughput.

Change-Id: Idac2721264ec264c520d341fac64a8311a974708

5ffcb821

jobclient: expect communication failures · 2c31bf79

Horst Schirmeier authored 11 years ago

This change makes the JobClient act properly on communication aborts.

Change-Id: I0a76489f117e9721546215e3b627002605e25452

2c31bf79

jobclient: bugfix: faster shutdown at campaign end · 882d4f38

Horst Schirmeier authored 11 years ago

The JobClient currently waits a LONG time until it really shuts down
after failing to reach the server in sendResultsToServer() (which is
unfortunately by far the most probable point in the code to detect
this):

 -  A different bug (fixed in the previous commit) caused far too
    many jobs to be fetched beforehand.
 -  sendResult() (called after each experiment iteration) realized
    that CLIENT_JOB_REQUEST_SEC seconds are over, and tried to
    prematurely call home to send first results (without planning to
    get new jobs yet).
 -  If the server was gone (done, or aborted), connect in
    sendResultsToServer() failed after several retries and timeouts.
 -  All subsequent calls to sendResult() retried connecting to the
    server (again, with retries and timeouts), once for each remaining
    job.
 -  When all jobs were done, getParam() tried to connect one last time,
    finally telling the experiment that nobody's home.

This resulted in client shutdown times of up to four hours (for the
default CLIENT_JOB_LIMIT of 1000) after the campaign server
terminated.  This change solves the issue by not handing out new
(cached) jobs after the connect failed once, making the experiment
terminate quickly.
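
The idea of the fix can be sketched with a toy model in Python. This is not the JobClient code: the class, its methods, and the `send` callback are hypothetical, and the real client additionally deals with retries, timeouts, and request intervals. The sketch only shows that remembering a single failed connect stops both further connection attempts and the hand-out of cached jobs.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Optional


class CachedJobClient:
    """Toy model of a client that caches jobs and reports results to a server."""

    def __init__(self, jobs: Iterable[int], send: Callable[[int], bool]) -> None:
        self._cache: Deque[int] = deque(jobs)  # jobs fetched earlier
        self._send = send                      # returns False if the server is unreachable
        self._server_gone = False
        self.connect_attempts = 0

    def next_job(self) -> Optional[int]:
        """Hand out a cached job, or None once a connect has failed."""
        if self._server_gone or not self._cache:
            return None
        return self._cache.popleft()

    def send_result(self, result: int) -> bool:
        """Try to deliver a result and remember a failed connect."""
        if self._server_gone:
            return False                       # no further connection attempts
        self.connect_attempts += 1
        if not self._send(result):
            self._server_gone = True
            return False
        return True


if __name__ == "__main__":
    # The server is gone: every send fails.
    client = CachedJobClient(range(1000), send=lambda result: False)
    done = 0
    while (job := client.next_job()) is not None:
        client.send_result(job)
        done += 1
    print(done, client.connect_attempts)
```

In this model the loop ends after the first failed delivery instead of working through all cached jobs.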

Change-Id: I0d8cb2e084d783aca74c51a503fa72eb2b2eb0b7

882d4f38

jobclient: bugfix: initialize timing statistics · ee7bc23d

Horst Schirmeier authored 11 years ago

If we don't properly initialize the job timing statistics, the number
of jobs to be requested in the second request to the server is based
on the wrong timings.  In our test case, CLIENT_JOB_LIMIT jobs were
requested at once.

Change-Id: I7e9d8ab6fe14e4488b3a74baf061d9a07f3a77c4

ee7bc23d

jobserver: bugfix: potential race · 1f6e275e

Horst Schirmeier authored 11 years ago

Delay insertion of to-be-sent jobs into m_runningJobs until they are
really sent, as getMessage() won't work anymore (as in: segfault) if
this job is concurrently re-sent (due to campaign end), its result is
received, and deleted in the campaign. This becomes non-hypothetical
with larger values for CLIENT_JOB_LIMIT and CLIENT_JOB_REQUEST_SEC.

Additionally, reinsert the remaining jobs into the input queue if
communication fails, instead of inefficiently delaying redistribution
until the campaign end.

Change-Id: If85e3c8261deda86beb8d4d93343429223753f22

1f6e275e

jobserver: outgoing jobqueue bounded by default · 128b54b0

Horst Schirmeier authored 11 years ago

Bounding the outgoing queue is always a good idea: If the campaign has
separate threads for outgoing and incoming jobs (true for the
DatabaseCampaign), this keeps memory requirements reasonable. If the
campaign works in a single thread, this is not disadvantageous either.

Change-Id: Ic75272daa8266f051adf7b23e2ffe87f5c965b86

128b54b0

jobserver: use non-blocking accept · 73adc714

Horst Schirmeier authored 11 years ago

To allow the JobServer to shut down properly, the accept() loop in
JobServer::run() needs to regularly check whether we're done.  This
change introduces a timed, non-blocking variant of accept() into
SocketComm to achieve this.
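
A timed accept can be built by waiting for the listening socket to become readable before calling accept(). The Python sketch below shows this pattern with select(); it is an illustration under that assumption, not the SocketComm implementation, and the function names and the `is_done` callback are hypothetical.

```python
import select
import socket
from typing import Callable, Optional, Tuple


def timed_accept(server: socket.socket, timeout_sec: float) -> Optional[Tuple[socket.socket, tuple]]:
    """Wait up to `timeout_sec` for a connection; return None on timeout."""
    readable, _, _ = select.select([server], [], [], timeout_sec)
    if not readable:
        return None
    return server.accept()


def serve(server: socket.socket, is_done: Callable[[], bool], timeout_sec: float = 0.1) -> int:
    """Accept connections until `is_done()` returns True; return the number accepted."""
    accepted = 0
    while not is_done():
        result = timed_accept(server, timeout_sec)
        if result is None:
            continue  # timeout: loop around and re-check is_done()
        conn, _addr = result
        conn.close()
        accepted += 1
    return accepted


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        srv.listen()
        print(timed_accept(srv, 0.2))  # nobody connects, so this times out
```

Because each wait is bounded by the timeout, the loop in `serve()` notices a shutdown request within roughly one timeout period even if no client ever connects.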

Change-Id: Id411096be816c4ed6c7b0b37674410e22152eb22

73adc714

jobserver: join remaining threads on shutdown · 86716690

Horst Schirmeier authored 11 years ago

To avoid accessing destroyed resources in CommThreads talking to clients,
we need to properly join them on shutdown.  The m_CommMutex becomes a
JobServer member to make sure it isn't destroyed before the JobServer
itself.

Change-Id: I35b9fb93ace08a7a9476650f8f5e93597a3a8aa0

86716690

jobserver: synchronization cleanup · 8505ddbb

Horst Schirmeier authored 11 years ago

This change cleans up in/out queue synchronization in the job server.
End-of-jobs conditions are now properly signaled through the
SynchronizedQueue, which allows blocked readers to be resumed and
aborted when no more input is expected.
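
The signaling pattern can be sketched in Python with a condition variable. The class below is a hypothetical stand-in, not FAIL*'s SynchronizedQueue: readers block while the queue is empty and wake up either when an item arrives or when the queue is closed, and an optional bound makes writers wait for free space.

```python
import threading
from collections import deque


class ClosableQueue:
    """Bounded FIFO whose blocked readers wake up once the queue is closed."""

    def __init__(self, maxsize: int = 0) -> None:
        self._items = deque()
        self._maxsize = maxsize  # 0 means unbounded
        self._closed = False
        self._cond = threading.Condition()

    def put(self, item) -> None:
        """Append an item, waiting for free space if the queue is bounded."""
        with self._cond:
            while self._maxsize and len(self._items) >= self._maxsize and not self._closed:
                self._cond.wait()
            if self._closed:
                raise RuntimeError("queue is closed")
            self._items.append(item)
            self._cond.notify_all()

    def get(self):
        """Return (True, item), or (False, None) once closed and drained."""
        with self._cond:
            while not self._items and not self._closed:
                self._cond.wait()
            if self._items:
                item = self._items.popleft()
                self._cond.notify_all()  # wake writers waiting for space
                return True, item
            return False, None

    def close(self) -> None:
        """Signal that no more input is expected."""
        with self._cond:
            self._closed = True
            self._cond.notify_all()


if __name__ == "__main__":
    q = ClosableQueue(maxsize=2)
    results = []

    def reader() -> None:
        while True:
            ok, item = q.get()
            if not ok:
                break  # end of jobs
            results.append(item)

    t = threading.Thread(target=reader)
    t.start()
    for job in range(5):
        q.put(job)
    q.close()
    t.join()
    print(results)
```

Returning a success flag from `get()` lets a reader tell "end of jobs" apart from a regular item without reserving a sentinel value.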

Change-Id: I3eaf37115ccf8c5b5afe3d971c7109cd62b68906

8505ddbb

import-trace: emit warning for malformed traces · 84edd02b

Horst Schirmeier authored 11 years ago

The Fail* tools expect trace events to be ordered in a specific way:
memory-access events are supposed to come *after* the instruction
event for the instruction that caused them. Using a different order
may cause subtle problems with both fault-space pruning and fast
forwarding. This change introduces a warning message when such a
malformed trace is detected (i.e., when the instruction pointer of a
memory-access event does not match the preceding instruction event).
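
The check described above can be sketched as follows in Python. The event record and the function name are hypothetical and do not reflect the import-trace code or the actual trace format. "Preceding instruction event" is interpreted here as the most recent instruction event, so several memory accesses may follow one instruction.

```python
import sys
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Event:
    """Hypothetical trace event: an instruction or a memory access."""
    ip: int              # instruction pointer
    is_mem_access: bool  # False for an instruction event


def count_misordered(events: Iterable[Event]) -> int:
    """Count memory accesses whose IP differs from the latest instruction event."""
    last_instr_ip: Optional[int] = None
    bad = 0
    for ev in events:
        if not ev.is_mem_access:
            last_instr_ip = ev.ip
        elif ev.ip != last_instr_ip:
            bad += 1
    if bad:
        print(f"warning: {bad} memory-access event(s) do not follow "
              "their instruction event", file=sys.stderr)
    return bad


if __name__ == "__main__":
    well_formed = [Event(0x100, False), Event(0x100, True), Event(0x104, False)]
    malformed = [Event(0x100, True), Event(0x100, False)]  # access comes first
    print(count_misordered(well_formed), count_misordered(malformed))
```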

Change-Id: I8ae7420fd8ff26e2574590748bdcc5a63db76490

84edd02b

Merge branch 'mysql-concurrency-fixes' · 5ac108ea
Horst Schirmeier authored 11 years ago

5ac108ea

use libmysqlclient_r to ensure thread safety · 84aac60a

Horst Schirmeier authored 11 years ago

According to
<http://dev.mysql.com/doc/refman/5.5/en/c-api-threaded-clients.html>,
(potentially) threaded clients should use the reentrant
libmysqlclient_r.  This is just a precaution, I haven't seen any
issues with the normal libmysqlclient.

Change-Id: Icb29df6dd54eb666e3b43b73fbda406acccd11cb

84aac60a

DatabaseCampaign: run statistics update when finished · 8f9ee3fd

Horst Schirmeier authored 11 years ago

Change-Id: Ib68e54ba82e988db0d2d74ffafa6dc9bd54cd272

8f9ee3fd

DatabaseCampaign: MySQL / concurrency fixes · 33b63651

Horst Schirmeier authored 11 years ago

According to
<http://dev.mysql.com/doc/refman/5.5/en/c-api-threaded-clients.html>,
a MySQL connection handle must not be used concurrently with an open
result set and mysql_use_result() in one thread
(DatabaseCampaign::run()), and mysql_query() in another
(DatabaseCampaign::collect_result_thread()).  This indeed leads to
crashes when bounding the outgoing job queue (SERVER_OUT_QUEUE_SIZE),
and maybe even more insidious effects in other cases.  The solution is
to create separate connections for both threads.

Additionally, call mysql_library_init() before spawning any threads.

Change-Id: I2981f2fdc67c9a2cbe8781f1a21654418f621aeb

33b63651

Jan 15, 2014

Merge branch 'use_size_prefix-REMOVED' · 0534b503
Michael Lenz authored 11 years ago

0534b503

fail/cpn: (Database)Campaign no longer loses jobs · 9c984b97

Michael Lenz authored 11 years ago

Up until now the JobServer was silently losing jobs and only claiming to be
finished - a workaround for this was to restart the campaign until all jobs
were finished according to the database and the campaign's output.
This change fixes the underlying problem, so a single campaign run suffices
and no longer loses any jobs.
Debugging this was awful and took us quite some time...

Change-Id: Ie6c982cc3b2ce11128941f1f13be563bae22565c

9c984b97

fail/cpn: removed USE_SIZE_PREFIX from SocketComm · abd9decf

Michael Lenz authored 11 years ago

This removes the ability to directly parse protobufs from the socket, because
google::protobuf::Message::ParseFromFileDescriptor() needs an EOF after each message,
which prevents us from sending multiple Message objects over a single socket.
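
A common way to carry several messages over one stream without relying on EOF as a delimiter is to prefix each serialized message with its length. The Python sketch below shows such framing on an in-memory stream. The 4-byte length field in network byte order and all function names are assumptions for illustration; this is not SocketComm's code, and the payloads are plain bytes rather than protobuf messages.

```python
import io
import struct
from typing import BinaryIO, Optional

_LEN = struct.Struct("!I")  # 4-byte unsigned length, network byte order (assumed)


def send_msg(stream: BinaryIO, payload: bytes) -> None:
    """Write one length-prefixed message."""
    stream.write(_LEN.pack(len(payload)))
    stream.write(payload)


def _read_exact(stream: BinaryIO, n: int) -> Optional[bytes]:
    """Read exactly n bytes; return None if the stream ends first."""
    chunks = []
    remaining = n
    while remaining:
        chunk = stream.read(remaining)
        if not chunk:
            return None
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_msg(stream: BinaryIO) -> Optional[bytes]:
    """Read one length-prefixed message; return None at end of stream."""
    header = _read_exact(stream, _LEN.size)
    if header is None:
        return None
    (length,) = _LEN.unpack(header)
    return _read_exact(stream, length)


if __name__ == "__main__":
    stream = io.BytesIO()
    send_msg(stream, b"first job")
    send_msg(stream, b"second job")
    stream.seek(0)
    print(recv_msg(stream), recv_msg(stream), recv_msg(stream))
```

With the length known up front, the receiver can hand exactly that many bytes to the parser and then continue with the next message on the same connection.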

Change-Id: I67c0f631071470d6e0ae597e42848036a6db3656

abd9decf

ecos_kernel_test experiment bugfix: don't resume if 'experiment reached finish() before FI' · 0a5e54e9

Christoph Borchert authored 11 years ago

Change-Id: Id0bb9400b8aa28307ed385a8c32b91b17254ba1c

0a5e54e9

Jan 14, 2014

Merge "gem5: don't count instruction fetch as mem access" · c0fe64ec
Richard Hellwig authored 11 years ago

c0fe64ec

core/sal: Added features that indicate whether FAIL* is initialized · 3c7861ff

Richard Hellwig authored 11 years ago

GEM5 throws a reset trap during initialization. This happens before the
startup function is called. This leads to problems because the startup
function fills the m_CPUs list. m_CPUs is needed for the TrapListener.
Therefore, we only react to traps after initialization. This is needed in
the following commit (see gem5/src/arch/arm/faults.cc).

Change-Id: I9ec6fd453705feb54b4f8a87d024181323a2d7ef

3c7861ff

Merge "sal/gem5: getTimerTicks(), getTimerTicksPerSecond() implemented" · efbb6c68
Richard Hellwig authored 11 years ago

efbb6c68

sal/gem5: getTimerTicks(), getTimerTicksPerSecond() implemented · f3593648

Richard Hellwig authored 11 years ago

Change-Id: I01fdb5e4bdd61fc761e93ef77904c830131c9ed6

f3593648

Jan 06, 2014

gem5: don't count instruction fetch as mem access · f41247b1

Richard Hellwig authored 11 years ago

Change-Id: I6ea9811c132ef7c235d5a03486ca08afc842b51f

f41247b1

Jan 03, 2014

weather-monitor: command-line parameters are now forwarded · 34065fea

Richard Hellwig authored 11 years ago

Parameters that are specified on the command line are now also forwarded.

Change-Id: I0e636f14dba43ef7877ce6e6deca1abb1f00a8a6

34065fea

Dec 11, 2013

weather-monitor: now is a DatabaseCampaign · 0907dfb0

Michael Lenz authored 11 years ago

- "removed" unnecessary memory-mapping ("Step 0")
- cleaned out ExperimentData - now consists only of fsppilot and resultset
- resultset now contains bitoffset, which is part of the result table's primary key
- adapted code to work with msg.fsppilot() instead of ExperimentData values

Change-Id: I3b310e7a71d4b28479028250cd5722b3b2ce9f8c

0907dfb0

Dec 06, 2013

Merge "Coding Guideline: Fixes." · 83991359
Martin Hoffmann authored 11 years ago

83991359