Tuesday, February 16, 2010
To Do - 2/16
- Misbehaving servers:
- Extend the worker to run in 'test' mode, which tests a server for viability. If it fails to accomplish all of the pieces, reject it.
- Profile the code using gprof on one of the misbehaving servers to find out why it is taking so long to do its processing.
- One server is giving all 0 valued latencies, as well as bad absolute time. Find out why.
- Make the analyzer not consider peers outside of the aoi (area of interest).
- √Make a new analyzer mode that outputs data suitable for use with R.
- Figure out how to use R in batch-mode, to automatically generate pngs of the data.
- Need to hook up shell script.
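The AOI filtering is simple enough to sketch in a few lines of awk. This assumes the raw rows look like `peer x y latency_ms` with a circular AOI around the origin; the file names, field layout, and radius here are placeholders, not the real analyzer format:

```shell
#!/bin/sh
# Hypothetical raw analyzer rows: peer x y latency_ms
cat > raw.txt <<'EOF'
peerA 10 10 95
peerB 500 500 2100
peerC 20 5 110
EOF

# Keep only peers within AOI_RADIUS of the observer at (0,0),
# comparing squared distances to avoid a sqrt.
AOI_RADIUS=100
awk -v r="$AOI_RADIUS" '($2*$2 + $3*$3) <= r*r' raw.txt > in_aoi.txt
cat in_aoi.txt
```

peerB at (500,500) falls outside the radius-100 AOI and gets dropped before any latency statistics are computed.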
Monday, February 15, 2010
Buffering or what now?
http://spreadsheets.google.com/ccc?key=0AjbHDjXctNgDdHlUaGpob2ZkT2RkZXV1clN2QW5LVXc&hl=en
This is telling me that something rather strange is going on with my network. Why the spikes? If they simply went up for all eternity, I would suspect some kind of bug or disconnection from the network. But they don't. Rather, some steadily increase up to 2000 ms, and then resume normal 100ms behavior. What's up with that?
To Do - 2/15
Back from my mini-vacation.
- Misbehaving servers:
- Extend the worker to run in 'test' mode, which tests a server for viability. If it fails to accomplish all of the pieces, reject it.
- Profile the code using gprof on one of the misbehaving servers to find out why it is taking so long to do its processing.
- One server is giving all 0 valued latencies, as well as bad absolute time. Find out why.
- Make the analyzer not consider peers outside of the aoi.
- Make a new analyzer mode that outputs data suitable for use with gnuplot & write a shell script that calls the analyzer tool & then gnuplot. Also look into the Google Chart API and R.
- √Got gnuplot installed & working
- Figure out how to use R in batch-mode, to automatically generate pngs of the data.
- Need to hook up shell script.
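The shell script glue I have in mind is roughly this. The `./analyzer -plot` invocation and the file names are assumptions, so the analyzer output is faked here to show the plumbing standalone, and gnuplot is only invoked if it's actually installed:

```shell
#!/bin/sh
# Stand-in for "./analyzer -plot run1.dat > latency.plot" (invocation assumed):
# two columns, time and latency in ms.
printf '0 100\n1 110\n2 2000\n3 105\n' > latency.plot

# Generate the gnuplot script that renders the data to a png.
cat > latency.gp <<'EOF'
set terminal png
set output "latency.png"
set xlabel "time (s)"
set ylabel "latency (ms)"
plot "latency.plot" with lines
EOF

# Only run gnuplot if this machine has it.
if command -v gnuplot >/dev/null 2>&1; then
    gnuplot latency.gp
fi
```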
Tuesday, February 9, 2010
To Do - 2/9
- Misbehaving servers:
- Extend the worker to run in 'test' mode, which tests a server for viability. If it fails to accomplish all of the pieces, reject it.
- Profile the code using gprof on one of the misbehaving servers to find out why it is taking so long to do its processing.
- Make the analyzer not consider peers outside of the aoi.
- √Drop peers that I haven't heard from in over 5 seconds. I do this in the analysis, not the actual network. Maybe I should?
- Make a new analyzer mode that outputs data suitable for use with gnuplot & write a shell script that calls the analyzer tool & then gnuplot.
- √Got gnuplot installed & working
- √Added -plot mode to analyzer.
- Need to hook up shell script.
- √Combine the recommendation messages with the position update message.
- Found some time-related bugs. Fixed them.
- Found some over-zealous asserts. Fixed them.
- Data still isn't nearly as pretty as I'd like.
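The 5-second drop rule on the analysis side amounts to a one-line filter over the lastSeen table. The `peer lastSeen_ms` format and the fixed "now" are made up for illustration:

```shell
#!/bin/sh
# Hypothetical lastSeen table: peer lastSeen_ms
cat > lastseen.txt <<'EOF'
peerA 99000
peerB 92000
peerC 99500
EOF

NOW_MS=100000
# Drop any peer whose lastSeen is more than 5000 ms in the past.
awk -v now="$NOW_MS" '(now - $2) <= 5000' lastseen.txt > live.txt
cat live.txt
```

peerB was last heard from 8 seconds ago, so it is excluded from the analysis; doing the same thing in the network itself would be the equivalent eviction in the peer list.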
Monday, February 8, 2010
To Do - 2/8
- remove the consistently misbehaving servers
- Extend the worker to run in 'test' mode, which tests a server for viability. If it fails to accomplish all of the pieces, reject it.
- √Now that I have a larger data set, I'm seeing strange sections of nothing in the output from the analyzer, implying the lastSeen time is invalid. Fix this. (see http://spreadsheets.google.com/ccc?key=0AjbHDjXctNgDdDNpeEpzaVNzVGoxeUNPSTRjd1c3aHc&hl=en for before. See http://spreadsheets.google.com/ccc?key=0AjbHDjXctNgDdGhEczhqemR5cFVxSmRTcms5MnQ5Y3c&hl=en for after)
- √ I'm going to calculate the skew between the controller & the worker on first contact. Then, I will modify GetNow to use this skew.
- Make the analyzer not consider peers outside of the aoi.
- Drop peers that I haven't heard from in over 5 seconds.
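The skew idea is just arithmetic, but spelled out as a sketch (the numbers and names are illustrative, not the real DOVE code):

```shell
#!/bin/sh
# On first contact: skew = controller clock - worker clock.
CONTROLLER_MS=500000
WORKER_MS=497500
SKEW=$((CONTROLLER_MS - WORKER_MS))   # 2500 ms: this worker's clock runs behind

# Every timestamp the worker reports is then shifted into controller time,
# which is what the modified GetNow would return.
echo "raw=497600 corrected=$((497600 + SKEW))" | tee corrected.txt
```

This ignores network delay in the first-contact exchange, so the skew estimate is only as good as half the round-trip time, but it's plenty to line up 100ms bins across servers.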
Friday, February 5, 2010
To Do - 2/5
Sick yesterday. Still sick today.
- remove the consistently misbehaving servers
- Extend the worker to run in 'test' mode, which tests a server for viability. If it fails to accomplish all of the pieces, reject it.
- Now that I have a larger data set, I'm seeing strange sections of nothing in the output from the analyzer, implying the lastSeen time is invalid. Fix this. (see http://spreadsheets.google.com/ccc?key=0AjbHDjXctNgDdDNpeEpzaVNzVGoxeUNPSTRjd1c3aHc&hl=en)
- I'm going to calculate the skew between the controller & the worker on first contact. Then, I will modify GetNow to use this skew.
- Make the analyzer not consider peers outside of the aoi.
- Drop peers that I haven't heard from in over 5 seconds.
Wednesday, February 3, 2010
To Do - 2/3
- remove the consistently misbehaving servers
- Extend the worker to run in 'test' mode, which tests a server for viability. If it fails to accomplish all of the pieces, reject it.
- √Change the output directory to ~ in order to test whether or not certain servers can't write to /tmp
- Why does this result in 0 byte files for all servers?
- Because we are throwing some exception
- When I run with gdb, we get no exception
- Found a bug when choosing an initial port. This explains the 65535 ports I've been seeing in my output. BONUS: it fixes the 0 byte files, too. Neat!
- Now that I have a larger data set, I'm seeing strange sections of nothing in the output from the analyzer, implying the lastSeen time is invalid. Fix this. (see http://spreadsheets.google.com/ccc?key=0AjbHDjXctNgDdDNpeEpzaVNzVGoxeUNPSTRjd1c3aHc&hl=en)
- Make the analyzer not consider peers outside of the aoi.
- Drop peers that I haven't heard from in over 5 seconds.
Tuesday, February 2, 2010
To Do - 2/2
- √Switch up to node63.txt node set (since only about half of my servers work at any one time, and I want to test scalability, I need to increase the node size)
- remove the consistently misbehaving servers
- Extend the worker to run in 'test' mode, which tests a server for viability. If it fails to accomplish all of the pieces, reject it.
- √Make the QUIT command broadcast to all a peer's neighbors
- Start varying the inputs (number of peers) and see what changes
Monday, February 1, 2010
January Recap
These kinds of posts compare the project at the start of a month to the end of the month.
What's gotten better:
- Reliability has improved significantly due to a variety of debugging tools that have helped me find some bugs in DOVE
- Ease of use has improved due to creation of shell scripts for common tasks
- Flexibility has improved as the analyzer can now output the resulting data in two formats, with the new spreadsheet mode being particularly illuminating
- The analyzer also bins the datapoints into 100ms bins in the spreadsheet mode, which makes comparison across the servers (which have clock skew relative to one another) much easier.
What still needs work:
- The network is still not running successfully on certain computers, and I'm not sure why.
- Some of the computers on the network are really, REALLY underpowered so that the algorithm which is supposed to tick every 100ms is actually ticking every second. This might be due to the processing of the algorithm taking so long or the networking calls blocking due to the send buffer being filled.
- The analysis tool still can't quite do the processing I need
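The 100ms binning mentioned above works roughly like this sketch, assuming rows of `timestamp_ms latency_ms` (the real analyzer does this in C++; this is just the idea in awk):

```shell
#!/bin/sh
# Hypothetical samples: timestamp_ms latency_ms
cat > samples.txt <<'EOF'
12345 90
12399 110
12410 130
12520 100
EOF

# Bin timestamps into 100 ms buckets and average each bucket, which
# lines rows up across servers despite their relative clock skew.
awk '{ bin = int($1 / 100) * 100; sum[bin] += $2; n[bin]++ }
     END { for (b in sum) printf "%d %g\n", b, sum[b] / n[b] }' samples.txt \
  | sort -n > binned.txt
cat binned.txt
```

The two samples at 12345 and 12399 land in the same 12300 bucket and average to 100 ms.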
For next month:
- The network needs to run reliably, or I'm hosed. Find out what's going on with those dropped connections. Perhaps a non-zero exit will email me the output? We'll just save the output every time into some output.txt for later retrieval.
- Identify the exact bottleneck by profiling a worker during execution.
- There's a bug where doing a run and stopping it, and then running immediately after will capture some of the peers that were trying to connect from the previous run. Make sure that this bug is handled correctly.
- Make the analysis tool spit out graphable data
To Do - 2/1
- √see if you can get the non-waiting ssh to return immediate errors anyway. 10 seconds seems to be a magic number. It means it takes minutes to start all the workers, though.
- remove the consistently misbehaving servers
- Extend the worker to run in 'test' mode, which tests a server for viability. If it fails to accomplish all of the pieces, reject it.
- Make the QUIT command broadcast to all a peer's neighbors
- Start varying the inputs (number of peers) and see what changes
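The 'test' mode gate could look something like this. The probe is injectable, since the real probe would be an ssh invocation along the lines of `ssh uic_voronoi@$host ./worker --test` (flag name assumed); a stand-in probe makes the sketch runnable locally:

```shell
#!/bin/sh
# Keep only hosts where the viability probe exits cleanly.
keep_viable() {            # $1 = host list file, $2 = probe command
    : > viable.txt
    while read -r host; do
        $2 "$host" && echo "$host" >> viable.txt
    done < "$1"
}

# Stand-in probe so the sketch runs without a network: a host "passes"
# if a marker file named after it exists.
fake_probe() { test -f "ok.$1"; }

printf 'alpha\nbeta\ngamma\n' > hosts.txt
touch ok.alpha ok.gamma
keep_viable hosts.txt fake_probe
cat viable.txt
```

Swapping fake_probe for the real ssh probe turns this into the reject-misbehaving-servers step, feeding viable.txt back in as the working node list.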
Thursday, January 28, 2010
Temporal Latency
http://spreadsheets.google.com/ccc?key=0AjbHDjXctNgDdDhpWEIzT3JLVnB5WkpZUnRrSlRsWnc&hl=en
This is the average ping between a peer and its neighbors in milliseconds.
To Do - 1/28
I can't believe January is almost over!
Yesterday I didn't get as much done as I'd hoped, as family visited me at work and took the train back with me. So the To Do list is quite similar:
- Try to run with a larger dataset (nodes63), and see if you can get the non-waiting ssh to return immediate errors anyway.
- This seems to be working http://spreadsheets.google.com/ccc?key=0AjbHDjXctNgDdEVPYnp5d1Vna0tfZ0UzbU5zRzRIVGc&hl=en but still has strange issues... why do some nodes not reply with a port? UDP packet shaping? Annoying. Also, some peers aren't writing their file out, implying a crash. I have a mind to just remove these consistently misbehaving servers; were it random, I'd really need to look further, but as I'm just trying to scale and get data, I should probably just remove those servers from my slice/nodelist. In fact:
- Extend the worker to run in 'test' mode, which tests a server for viability. If it fails to accomplish all of the pieces, reject it.
- Make the QUIT command broadcast to all a peer's neighbors
- Start varying the inputs (number of peers) and see what changes
Wednesday, January 27, 2010
A wee bit of analysis
Looking at the data has provided me a world of insight into what my network is really doing:
- Even though the computers are supposed to be ticking every .1s, and hence writing data out, some computers are only ticking every .5s, implying that the work required in the algorithm takes longer than 100ms (this is surprising... those computers must be rather slow). This means I should at least run with -O3 optimized versions, rather than my current -O0 unoptimized builds. If this is still a problem, I need to work on improving the efficiency of the code.
To Do - 1/27
- Try to run with a larger dataset (nodes63), and see if you can get the non-waiting ssh to return immediate errors anyway.
- Start varying the inputs (number of peers) and see what changes
Tuesday, January 26, 2010
To Do - 1/26
Ah, real data. Here's a longer dataset: http://spreadsheets.google.com/ccc?key=0AjbHDjXctNgDdFBVSzFWNmNWbDc3bkxlcklHWnp1T0E&hl=en
- √Find out why bad data is getting written out... even though there are asserts!
- make sure udp messages are working correctly by sending two messages to a worker, and testing that two reads happen...
- √Run the workers repeatedly, trying to identify some of the malformed output errors.
- Done, and fixed.
- √Modify analyzer to spit out unprocessed data as a spreadsheet (peer x time), so as to graph the raw data, rather than just the processed data.
- Start varying the inputs (number of peers) and see what changes
Monday, January 25, 2010
Actual Data!
Got through all of my asserts! Woot!
I have actual data on my machine... this is terribly exciting. I'm able to parse the data without hitting any asserts, implying that my raw data makes some kind of sense. My current analysis techniques are a bit lacking, though, so I'm not going to put up my final data just yet. I'm going to work now on the analyzer, and make it spew out a spreadsheet-like data format first... I'll put it online @ google spreadsheets and put a link here when I do this. Done: http://spreadsheets.google.com/ccc?key=0AjbHDjXctNgDdG8zbVd6LTJtdDdQaDdjQmdXU3ZSa3c&hl=en
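The spreadsheet-like (peer x time) format is just a pivot of the raw rows. A sketch in awk, assuming rows of `peer time_bin latency_ms` (the field layout is made up for illustration):

```shell
#!/bin/sh
# Hypothetical raw rows: peer time_bin latency_ms
cat > raw.txt <<'EOF'
peerA 0 100
peerA 100 105
peerB 0 95
peerB 100 2000
EOF

# Pivot into a peer x time CSV: one row per peer, one column per time bin.
awk '
!($2 in seen)  { seen[$2]; order[++nt] = $2 }
{ v[$1 "," $2] = $3
  if (!($1 in pseen)) { pseen[$1]; prow[++np] = $1 } }
END {
  line = "peer"
  for (i = 1; i <= nt; i++) line = line "," order[i]
  print line
  for (p = 1; p <= np; p++) {
    line = prow[p]
    for (i = 1; i <= nt; i++) line = line "," v[prow[p] "," order[i]]
    print line
  }
}' raw.txt > matrix.csv
cat matrix.csv
```

The resulting CSV pastes straight into a Google spreadsheet, one peer per row, which is exactly the graphable shape.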
To Do - 1/25
Ah... good weekend. Back to work!
- Find out why bad data is getting written out... even though there are asserts!
- √Add this sanity checking to the writing of the data in worker. Add a flag to enable it. Put the functionality for it in an inline function in shader.h
- try running the worker in some kind of test mode on each server
- make sure udp messages are working correctly by sending two messages to a worker, and testing that two reads happen...
- Run the workers repeatedly, trying to identify some of the malformed output errors.
- Modify analyzer to spit out unprocessed data as a spreadsheet (peer x time), so as to graph the raw data, rather than just the processed data.
Thursday, January 21, 2010
Tools roundup - 1/21
The newest detail on the tools I use:
- setenv.sh - sets up handy environment variables for pssh so they don't need to be specified on the command-line
- regenworkingset.sh - takes a list of servers, and creates a file that corresponds to only those servers that are online
- build.sh - copies source files to destination server & compiles there
- deploybin.sh - deploys worker on source server to servers listed in destination file
- gatherdata.sh - copies outputs from workers to a dest server, packages them up and copies them locally
To Do - 1/21
- √Write a handy tool for collecting the resulting data
- Find out why bad data is getting written out... even though there are asserts!
- √Add sanity checking on the files... some kind of preprocess step (a flag to analyzer), identifying the areas that are malformed.
- Add this sanity checking to the writing of the data in worker. Add a flag to enable it. Put the functionality for it in an inline function in shader.h
- Log all of the data that is written out to std out as well, for later sanity checking.
- try running the worker in some kind of test mode on each server
- make sure udp messages are working correctly by sending two messages to a worker, and testing that two reads happen...
- Modify analyzer to spit out unprocessed data as a spreadsheet (peer x time), so as to graph the raw data, rather than just the processed data.
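The "log to stdout as well" item is basically `tee`. A sketch, with a stand-in for the real worker binary:

```shell
#!/bin/sh
# Whatever the worker writes to its data file is also copied to stdout,
# so the run output can be sanity-checked later from the pssh capture.
fake_worker() { printf 'peerA 0 100\npeerA 100 105\n'; }

fake_worker | tee worker.out
```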
Wednesday, January 20, 2010
To Do - 1/20
- Continue to identify which servers are good and which are bad
- √ extend output in controller to include ip when error
- √remove worker from list when error
- Add a status field to the worker info struct so that we know the current state of a worker
- √wait until prepare is finished before letting run happen... at least notify when prepare is done...
- Find out why bad data is getting written out... even though there are asserts!
- Log all of the data that is written out to std out as well, for later sanity checking.
- try running the worker in some kind of test mode on each server
- make sure udp messages are working correctly by sending two messages to a worker, and testing that two reads happen...
- Modify analyzer to spit out unprocessed data as a spreadsheet (peer x time), so as to graph the raw data, rather than just the processed data.
Tuesday, January 19, 2010
Status
I have the network making data, but the data seems to be malformed...
In particular, some of the data streams reference other streams that do not exist (the worker IDs are bad). There is an assert in the code to prevent this, but it's still happening... Very strange.
More about the shell scripts
I created several useful shell scripts that help me get stuff done.
setenv.sh
export PSSH_USER=uic_voronoi
export PSSH_PAR=32
export PSSH_OUTDIR=outputs
export PSSH_OPTIONS="StrictHostKeyChecking=no"

build.sh
if [ "$1" == "" ]
then
    echo "No hostname specified"
    exit 1
fi
cmd1="scp -r *.cpp *.h *.hpp main uic_voronoi@$1:~"
echo "$cmd1"
cmd2="ssh uic_voronoi@$1 g++ -g *.cpp main/worker.cpp -I. -Iinclude -Llib -lboost_system -lboost_thread -lboost_program_options -o worker && g++ -g *.cpp main/controller.cpp -I. -Iinclude -Llib -lboost_system -lboost_thread -lboost_program_options -o controller"
echo $cmd2
$cmd1 && $cmd2

deploybin.sh
cmd="ssh uic_voronoi@$1 pscp -h $2 -p 32 -t 200 -o outputs -e errors -l uic_voronoi -O StrictHostKeyChecking=no worker /home/uic_voronoi/worker"
echo $cmd
$cmd
Detail - 1/19
I am seeing problems wherein certain servers fail to run the worker correctly. Why is that? It might be due to different architectures, unlikely as that is, so I need to check. I also need to make sure I have the ones that are online at any particular moment. So I take my lists of nodes of various sizes and convert them into lists of nodes that are up right now via:
pssh -h nodes31.txt hostname; cat outputs/* > workingnodes31.txt
This generated my node lists. To check the architectures of these machines, we run:
pssh -h workingnodes31.txt uname -a; cat outputs/* > unames.txt
The above command outputs something like:
Linux cs-planetlab3.cs.surrey.sfu.ca 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
Linux cs-planetlab4.cs.surrey.sfu.ca 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
Linux onelab1.info.ucl.ac.be 2.6.22.19-vs2.3.0.34.39.onelab.1 #1 SMP Tue Nov 3 14:49:39 CET 2009 i686 i686 i386 GNU/Linux
Linux onelab2.info.ucl.ac.be 2.6.22.19-vs2.3.0.34.39.onelab.1 #1 SMP Tue Nov 3 14:49:39 CET 2009 i686 i686 i386 GNU/Linux
Linux onelab3.info.ucl.ac.be 2.6.22.19-vs2.3.0.34.39.onelab #1 SMP Tue Jun 16 18:43:37 CEST 2009 i686 i686 i386 GNU/Linux
Linux pl1.eng.monash.edu.au 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 30 09:32:05 UTC 2009 i686 i686 i386 GNU/Linux
Linux pl2.eng.monash.edu.au 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 30 09:32:05 UTC 2009 i686 i686 i386 GNU/Linux
Linux plab1-itec.uni-klu.ac.at 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
Linux plab2-itec.uni-klu.ac.at 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 30 09:32:05 UTC 2009 i686 i686 i386 GNU/Linux
Linux plab2.larc.usp.br 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
Linux planet-lab1.itba.edu.ar 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
Linux planet-lab1.uba.ar 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 athlon i386 GNU/Linux
Linux planet-lab2.itba.edu.ar 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
Linux planet-lab2.ufabc.edu.br 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 30 09:32:05 UTC 2009 i686 i686 i386 GNU/Linux
Linux planetlab1.c3sl.ufpr.br 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
Linux planetlab1.pop-ce.rnp.br 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 30 09:32:05 UTC 2009 i686 i686 i386 GNU/Linux
Linux planetlab1.pop-mg.rnp.br 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
Linux planetlab1.pop-rs.rnp.br 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 30 09:32:05 UTC 2009 i686 i686 i386 GNU/Linux
Linux planetlab1.win.trlabs.ca 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
Linux planetlab2.ani.univie.ac.at 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
Linux planetlab2.c3sl.ufpr.br 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Wed May 27 06:16:01 EDT 2009 i686 i686 i386 GNU/Linux
Linux planetlab2.pop-ce.rnp.br 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 30 09:32:05 UTC 2009 i686 i686 i386 GNU/Linux
Linux planetlab2.pop-mg.rnp.br 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
Linux planetlab2.pop-rs.rnp.br 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 30 09:32:05 UTC 2009 i686 i686 i386 GNU/Linux
Linux planetlab3.ani.univie.ac.at 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
Linux planetlab4.ani.univie.ac.at 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 30 09:32:05 UTC 2009 i686 i686 i386 GNU/Linux
Linux plnode01.cs.mu.oz.au 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
Linux plnode02.cs.mu.oz.au 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
Which is exactly what I would hope for.
To Do - 1/19
To Do List for today:
- Continue to identify which servers are good and which are bad
- √ do a simple pssh hostname test to return the list of valid servers...
- √ store this in workingnodesN.txt
- √ Profile the peers via uname -a. Ensure none are 64-bit. Cull those that are.
- create handy shell scripts that do what I do regularly, to save deploy time.
- √ setenv.sh - sets up pssh env vars so I don't need to type them all out again and again
- √ build.sh - copies src to $1, and builds worker & controller there
- deploybin.sh - copies $1:~/worker to *:/home/uic_voronoi/worker - still issues
- extend output in controller to include ip when error
- remove worker from list when error
- wait until prepare is finished before letting run happen... at least notify when prepare is done...
- Find out why bad data is getting written out... even though there are asserts!
- try running the worker in some kind of test mode on each server
- make sure udp messages are working correctly by sending two messages to a worker, and testing that two reads happen...
- Modify analyzer to spit out unprocessed data as a spreadsheet (peer x time), so as to graph the raw data, rather than just the processed data.
Purpose of the blog
This blog will document my escapades with building and testing Dove. I'll update every day, with a report on what I plan on doing or have done, and I'll go back and edit these posts when I finish a task.
Also, the blog will serve as a place for me to ask questions and for interested parties to respond (via comments).