Dove: February 2010

Tuesday, February 16, 2010

To Do - 2/16

Misbehaving servers:

Extend the worker to run in 'test' mode, which tests a server for viability. If it fails to accomplish all of the pieces, reject it.
Profile the code using gprof on one of the misbehaving servers to find out why it is taking so long to do its processing.
One server is reporting all zero-valued latencies, as well as bad absolute times. Find out why.

Make the analyzer not consider peers outside of the AOI.
√Make a new analyzer mode that outputs data suitable for use with R.

Figure out how to use R in batch mode, to automatically generate PNGs of the data.
Need to hook up shell script.

Monday, February 15, 2010

http://spreadsheets.google.com/ccc?key=0AjbHDjXctNgDdHlUaGpob2ZkT2RkZXV1clN2QW5LVXc&hl=en

This is telling me that something rather strange is going on with my network. Why the spikes? If the latencies simply went up for all eternity, I would suspect some kind of bug or disconnection from the network. But they don't. Rather, some steadily increase up to 2000 ms, and then resume normal 100 ms behavior. What's up with that?

To Do - 2/15

Back from my mini-vacation.

Misbehaving servers:

Extend the worker to run in 'test' mode, which tests a server for viability. If it fails to accomplish all of the pieces, reject it.
Profile the code using gprof on one of the misbehaving servers to find out why it is taking so long to do its processing.
One server is reporting all zero-valued latencies, as well as bad absolute times. Find out why.

Make the analyzer not consider peers outside of the AOI.
Make a new analyzer mode that outputs data suitable for use with gnuplot & write a shell script that calls the analyzer tool & then gnuplot. Other options to consider: the Google Chart API, R.

√Got gnuplot installed & working
Figure out how to use R in batch mode, to automatically generate PNGs of the data.
Need to hook up shell script.

Tuesday, February 9, 2010

To Do - 2/9

Misbehaving servers:

Extend the worker to run in 'test' mode, which tests a server for viability. If it fails to accomplish all of the pieces, reject it.
Profile the code using gprof on one of the misbehaving servers to find out why it is taking so long to do its processing.

Make the analyzer not consider peers outside of the AOI.
√Drop peers that I haven't heard from in over 5 seconds. I do this in the analysis, not the actual network. Maybe I should do it in the network, too?
Make a new analyzer mode that outputs data suitable for use with gnuplot & write a shell script that calls the analyzer tool & then gnuplot.

√Got gnuplot installed & working
√Added -plot mode to analyzer.
Need to hook up shell script.

√Combine the recommendation messages with the position update message.

Found some time-related bugs. Fixed them.
Found some over-zealous asserts. Fixed them.
Data still isn't nearly as pretty as I'd like.

Monday, February 8, 2010

To Do - 2/8

Remove the consistently misbehaving servers

Extend the worker to run in 'test' mode, which tests a server for viability. If it fails to accomplish all of the pieces, reject it.

√Now that I have a larger data set, I'm seeing strange sections of nothing in the output from the analyzer, implying the lastSeen time is invalid. Fix this. (See http://spreadsheets.google.com/ccc?key=0AjbHDjXctNgDdDNpeEpzaVNzVGoxeUNPSTRjd1c3aHc&hl=en for before and http://spreadsheets.google.com/ccc?key=0AjbHDjXctNgDdGhEczhqemR5cFVxSmRTcms5MnQ5Y3c&hl=en for after.)

√ I'm going to calculate the skew between the controller & the worker on first contact. Then, I will modify GetNow to use this skew.
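
The skew idea above can be sketched roughly as follows. This is an illustration in Python, not DOVE's actual code: the class and method names are hypothetical, `get_now` stands in for GetNow, and the sketch assumes the controller sends its own timestamp on first contact and that one-way network delay can be ignored.

```python
import time


class SkewedClock:
    """Worker-side clock aligned to the controller's clock (hypothetical helper)."""

    def __init__(self, local_clock=time.time):
        self._local_clock = local_clock
        self._skew = 0.0  # seconds to add to local time to get controller time

    def on_first_contact(self, controller_time):
        # Assumption: one-way network delay is ignored, so the controller's
        # timestamp is treated as "now" on the controller.
        self._skew = controller_time - self._local_clock()

    def get_now(self):
        # Stand-in for GetNow: local time shifted into the controller's frame.
        return self._local_clock() + self._skew


# Example with a fake local clock that is 2.5 s behind the controller.
fake_local = lambda: 1000.0
clock = SkewedClock(local_clock=fake_local)
clock.on_first_contact(controller_time=1002.5)
print(clock.get_now())  # 1002.5
```

Because the skew is measured once, any later drift between the two clocks is not corrected; re-measuring periodically would be one way to handle that.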

Make the analyzer not consider peers outside of the AOI.
Drop peers that I haven't heard from in over 5 seconds.
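
The two analyzer filters above might look something like this. It is a minimal sketch in Python under stated assumptions: the AOI is treated as a circle of a given radius around the observing peer in 2D, and the function name, the peer record layout, and the radius value are all hypothetical. Only the 5-second threshold comes from the to-do item.

```python
import math

STALE_AFTER_S = 5.0  # from the to-do item: drop peers silent for over 5 seconds


def visible_peers(me, peers, now, aoi_radius):
    """Return ids of peers that are inside the AOI and recently heard from.

    `me` is an (x, y) tuple; `peers` maps peer id -> (x, y, last_seen).
    """
    kept = []
    for peer_id, (x, y, last_seen) in peers.items():
        if now - last_seen > STALE_AFTER_S:
            continue  # not heard from in over 5 seconds
        if math.hypot(x - me[0], y - me[1]) > aoi_radius:
            continue  # outside the (assumed circular) area of interest
        kept.append(peer_id)
    return kept


peers = {
    "a": (1.0, 1.0, 99.0),   # close and fresh
    "b": (50.0, 0.0, 99.5),  # fresh but outside the AOI
    "c": (0.0, 2.0, 90.0),   # close but stale
}
print(visible_peers((0.0, 0.0), peers, now=100.0, aoi_radius=10.0))  # ['a']
```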

Friday, February 5, 2010

To Do - 2/5

Sick yesterday. Still sick today.

Remove the consistently misbehaving servers

Extend the worker to run in 'test' mode, which tests a server for viability. If it fails to accomplish all of the pieces, reject it.

Now that I have a larger data set, I'm seeing strange sections of nothing in the output from the analyzer, implying the lastSeen time is invalid. Fix this. (see http://spreadsheets.google.com/ccc?key=0AjbHDjXctNgDdDNpeEpzaVNzVGoxeUNPSTRjd1c3aHc&hl=en)

I'm going to calculate the skew between the controller & the worker on first contact. Then, I will modify GetNow to use this skew.

Make the analyzer not consider peers outside of the AOI.
Drop peers that I haven't heard from in over 5 seconds.

Wednesday, February 3, 2010

To Do - 2/3

Remove the consistently misbehaving servers

Extend the worker to run in 'test' mode, which tests a server for viability. If it fails to accomplish all of the pieces, reject it.

√Change the output directory to ~ in order to test whether or not certain servers can't write to /tmp

Why does this result in 0-byte files for all servers?

Because we are throwing some exception.
When I run with gdb, we get no exception.
Found a bug when choosing an initial port. This explains the 65535 ports I've been seeing in my output. BONUS: it fixes the 0-byte files, too. Neat!

Now that I have a larger data set, I'm seeing strange sections of nothing in the output from the analyzer, implying the lastSeen time is invalid. Fix this. (see http://spreadsheets.google.com/ccc?key=0AjbHDjXctNgDdDNpeEpzaVNzVGoxeUNPSTRjd1c3aHc&hl=en)
Make the analyzer not consider peers outside of the AOI.
Drop peers that I haven't heard from in over 5 seconds.

Tuesday, February 2, 2010

To Do - 2/2

√Switch up to the node63.txt node set (since only about half of my servers work at any one time, and I want to test scalability, I need to increase the number of nodes)
Remove the consistently misbehaving servers

Extend the worker to run in 'test' mode, which tests a server for viability. If it fails to accomplish all of the pieces, reject it.

√Make the QUIT command broadcast to all of a peer's neighbors

Start varying the inputs (number of peers) and see what changes

Monday, February 1, 2010

January Recap

These kinds of posts compare the project at the start of a month to the end of the month.

What's gotten better:

Reliability has improved significantly due to a variety of debugging tools that have helped me find some bugs in DOVE
Ease of use has improved due to creation of shell scripts for common tasks
Flexibility has improved as the analyzer can now output the resulting data in two formats, with the new spreadsheet mode being particularly illuminating
The analyzer also bins the datapoints into 100ms bins in the spreadsheet mode, which makes comparison across the servers (which have clock skew relative to one another) much easier.
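
The binning in the last item can be sketched as below. This is an illustration in Python, not the analyzer's code: the sample layout and function name are hypothetical, and averaging the values that fall into the same bin is an assumption, since the recap does not say how they are combined. Only the 100 ms bin width comes from the text.

```python
from collections import defaultdict

BIN_MS = 100  # bin width from the recap


def bin_samples(samples, bin_ms=BIN_MS):
    """Group (timestamp_ms, server, value) samples into fixed-width time bins.

    Returns {bin_start_ms: {server: mean value in that bin}}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for timestamp_ms, server, value in samples:
        bin_start = (timestamp_ms // bin_ms) * bin_ms
        grouped[bin_start][server].append(value)
    return {
        bin_start: {srv: sum(vals) / len(vals) for srv, vals in by_server.items()}
        for bin_start, by_server in sorted(grouped.items())
    }


samples = [
    (1003, "s1", 100.0),
    (1042, "s2", 120.0),  # offset from s1's timestamps, but lands in the same bin
    (1090, "s1", 110.0),
    (1130, "s2", 90.0),
]
print(bin_samples(samples))
# {1000: {'s1': 105.0, 's2': 120.0}, 1100: {'s2': 90.0}}
```

Binning only hides skew that is small relative to the bin width; samples skewed by more than 100 ms still land in different rows.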

What is still not great:

The network is still not running successfully on certain computers, and I'm not sure why.
Some of the computers on the network are really, REALLY underpowered, so the algorithm, which is supposed to tick every 100 ms, is actually ticking every second. This might be because the algorithm's processing takes that long, or because the networking calls block when the send buffer fills up.
The analysis tool still can't quite do the processing I need

For next month:

The network needs to run reliably, or I'm hosed. Find out what's going on with those dropped connections. Perhaps a non-zero exit will email me the output? We'll just save the output every time into some output.txt for later retrieval.
Identify the exact bottleneck by profiling a worker during execution.
There's a bug where stopping a run and then immediately starting another captures some of the peers that were still trying to connect from the previous run. Make sure this case is handled correctly.
Make the analysis tool spit out graphable data

To Do - 2/1

√See if you can get the non-waiting ssh to return immediate errors anyway. 10 seconds seems to be a magic number. It means it takes minutes to start all the workers, though.

Remove the consistently misbehaving servers
Extend the worker to run in 'test' mode, which tests a server for viability. If it fails to accomplish all of the pieces, reject it.

Make the QUIT command broadcast to all of a peer's neighbors

Start varying the inputs (number of peers) and see what changes
