Thursday, January 28, 2010

Temporal Latency

http://spreadsheets.google.com/ccc?key=0AjbHDjXctNgDdDhpWEIzT3JLVnB5WkpZUnRrSlRsWnc&hl=en

This is the average ping between a peer and its neighbors in milliseconds.

To Do - 1/28

I can't believe January is almost over!

Yesterday I didn't get as much done as I'd hoped, as family visited me at work and took the train back with me. So the To Do list is quite similar:


  • Try to run with a larger dataset (nodes63), and see if you can get the non-waiting ssh to return immediate errors anyway.
    • This seems to be working (http://spreadsheets.google.com/ccc?key=0AjbHDjXctNgDdEVPYnp5d1Vna0tfZ0UzbU5zRzRIVGc&hl=en), but it still has strange issues... why do some nodes not reply with a port? UDP packet shaping? Annoying. Also, some peers aren't writing their file out, implying a crash. I have a mind to just remove these consistently misbehaving servers; were the failures random, I'd need to look further, but since I'm just trying to scale and get data, I should probably remove those servers from my slice/nodelist. In fact:
    • Extend the worker to run in 'test' mode, which tests a server for viability. If it fails to accomplish all of the pieces, reject it.
  • Make the QUIT command broadcast to all a peer's neighbors
  • Start varying the inputs (number of peers) and see what changes

    Wednesday, January 27, 2010

    A wee bit of analysis

    Looking at the data has provided me a world of insight into what my network is really doing:

    • Even though the computers are supposed to be ticking every 0.1 s, and hence writing data out, some computers are only ticking every 0.5 s, implying that the work required by the algorithm takes longer than 100 ms (this is surprising... those computers must be rather slow). This means I should at least run -O3 optimized builds, rather than my current -O0 unoptimized ones. If the problem persists, I need to work on improving the efficiency of the code.

    To Do - 1/27


    • Try to run with a larger dataset (nodes63), and see if you can get the non-waiting ssh to return immediate errors anyway.
    • Start varying the inputs (number of peers) and see what changes

    Tuesday, January 26, 2010

    To Do - 1/26

    Ah, real data. Here's a longer dataset: http://spreadsheets.google.com/ccc?key=0AjbHDjXctNgDdFBVSzFWNmNWbDc3bkxlcklHWnp1T0E&hl=en


    • √Find out why bad data is getting written out... even though there are asserts!
      • make sure udp messages are working correctly by sending two messages to a worker, and testing that two reads happen...
      • √Run the workers repeatedly, trying to identify some of the malformed output errors.
        • Done, and fixed.
    • √Modify analyzer to spit out unprocessed data as a spreadsheet (peer x time), so as to graph the raw data, rather than just the processed data.
    • Start varying the inputs (number of peers) and see what changes

    Monday, January 25, 2010

    Actual Data!

    Got through all of my asserts! Woot!

    I have actual data on my machine... this is terribly exciting. I'm able to parse the data without hitting any asserts, implying that my raw data makes some kind of sense. My current analysis techniques are a bit lacking, though, so I'm not going to put up my final data just yet. I'm going to work on the analyzer now, and make it emit a spreadsheet-like data format first... I'll put it online at Google Spreadsheets and post a link here when I do. Done: http://spreadsheets.google.com/ccc?key=0AjbHDjXctNgDdG8zbVd6LTJtdDdQaDdjQmdXU3ZSa3c&hl=en

    To Do - 1/25

    Ah... good weekend. Back to work!


    • Find out why bad data is getting written out... even though there are asserts!
      • √Add this sanity checking to the writing of the data in worker. Add a flag to enable it. Put the functionality for it in an inline function in shader.h
      • try running the worker in some kind of test mode on each server
      • make sure udp messages are working correctly by sending two messages to a worker, and testing that two reads happen...
      • Run the workers repeatedly, trying to identify some of the malformed output errors.
    • Modify analyzer to spit out unprocessed data as a spreadsheet (peer x time), so as to graph the raw data, rather than just the processed data.

    Thursday, January 21, 2010

    Tools roundup - 1/21

    The newest detail on the tools I use:


    • setenv.sh - sets up handy environment variables for pssh so they don't need to be specified on the command-line
    • regenworkingset.sh - takes a list of servers, and creates a file that corresponds to only those servers that are online
    • build.sh - copies source files to destination server & compiles there
    • deploybin.sh - deploys worker on source server to servers listed in destination file
    • gatherdata.sh - copies outputs from workers to a dest server, packages them up and copies them locally

    To Do - 1/21


    • √Write a handy tool for collecting the resulting data
    • Find out why bad data is getting written out... even though there are asserts!
      • √Add sanity checking on the files... some kind of preprocess step (a flag to analyzer), identifying the areas that are malformed.
      • Add this sanity checking to the writing of the data in worker. Add a flag to enable it. Put the functionality for it in an inline function in shader.h
      • Log all of the data that is written out to std out as well, for later sanity checking.
      • try running the worker in some kind of test mode on each server
      • make sure udp messages are working correctly by sending two messages to a worker, and testing that two reads happen...
    • Modify analyzer to spit out unprocessed data as a spreadsheet (peer x time), so as to graph the raw data, rather than just the processed data.

    Wednesday, January 20, 2010

    To Do - 1/20


    • Continue to identify which servers are good and which are bad
      • √ extend output in controller to include ip when error
      • √ remove worker from list when error
      • Add a status field to the worker info struct so that we know the current state of a worker
        • √ wait until prepare is finished before letting run happen... at least notify when prepare is done...
      • Find out why bad data is getting written out... even though there are asserts!
        • Log all of the data that is written out to std out as well, for later sanity checking.
        • try running the worker in some kind of test mode on each server
        • make sure udp messages are working correctly by sending two messages to a worker, and testing that two reads happen...
    • Modify analyzer to spit out unprocessed data as a spreadsheet (peer x time), so as to graph the raw data, rather than just the processed data.

    Tuesday, January 19, 2010

    Status

    I have the network making data, but the data seems to be malformed...

    In particular, some of the data streams reference other streams that do not exist (the worker IDs are bad). There is an assert in the code to prevent this, but it's still happening... Very strange.

    More about the shell scripts

    I created several useful shell scripts that help me get stuff done.

    setenv.sh:

    export PSSH_USER=uic_voronoi
    export PSSH_PAR=32
    export PSSH_OUTDIR=outputs
    export PSSH_OPTIONS="StrictHostKeyChecking=no"

    build.sh:

    if [ "$1" == "" ]
    then
        echo "No hostname specified"
        exit 1
    fi

    cmd1="scp -r *.cpp *.h *.hpp main uic_voronoi@$1:~"
    echo "$cmd1"

    cmd2="ssh uic_voronoi@$1 g++ -g *.cpp main/worker.cpp -I. -Iinclude -Llib -lboost_system -lboost_thread -lboost_program_options -o worker && g++ -g *.cpp main/controller.cpp -I. -Iinclude -Llib -lboost_system -lboost_thread -lboost_program_options -o controller"
    echo $cmd2
    $cmd1 && $cmd2

    deploybin.sh:

    cmd="ssh uic_voronoi@$1 pscp -h $2 -p 32 -t 200 -o outputs -e errors -l uic_voronoi -O StrictHostKeyChecking=no worker /home/uic_voronoi/worker"
    echo $cmd
    $cmd

    Detail - 1/19

    I am seeing problems wherein certain servers fail to run the worker correctly. Why is that? It might be due to different architectures (unlikely as that is), so I need to check. I also need to make sure I have the ones that are online at any particular moment:


    So I take my lists of nodes of various sizes and convert them into lists of nodes that are up right now via:
    pssh -h nodes31.txt hostname; cat outputs/* > workingnodes31.txt
    This generates my node lists. To check the architectures of these machines, I run:
    pssh -h workingnodes31.txt uname -a; cat outputs/* > unames.txt
    The above command outputs something like:
    Linux cs-planetlab3.cs.surrey.sfu.ca 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux  
    Linux cs-planetlab4.cs.surrey.sfu.ca 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
    Linux onelab1.info.ucl.ac.be 2.6.22.19-vs2.3.0.34.39.onelab.1 #1 SMP Tue Nov 3 14:49:39 CET 2009 i686 i686 i386 GNU/Linux
    Linux onelab2.info.ucl.ac.be 2.6.22.19-vs2.3.0.34.39.onelab.1 #1 SMP Tue Nov 3 14:49:39 CET 2009 i686 i686 i386 GNU/Linux
    Linux onelab3.info.ucl.ac.be 2.6.22.19-vs2.3.0.34.39.onelab #1 SMP Tue Jun 16 18:43:37 CEST 2009 i686 i686 i386 GNU/Linux
    Linux pl1.eng.monash.edu.au 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 30 09:32:05 UTC 2009 i686 i686 i386 GNU/Linux
    Linux pl2.eng.monash.edu.au 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 30 09:32:05 UTC 2009 i686 i686 i386 GNU/Linux
    Linux plab1-itec.uni-klu.ac.at 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
    Linux plab2-itec.uni-klu.ac.at 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 30 09:32:05 UTC 2009 i686 i686 i386 GNU/Linux
    Linux plab2.larc.usp.br 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
    Linux planet-lab1.itba.edu.ar 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
    Linux planet-lab1.uba.ar 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 athlon i386 GNU/Linux
    Linux planet-lab2.itba.edu.ar 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
    Linux planet-lab2.ufabc.edu.br 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 30 09:32:05 UTC 2009 i686 i686 i386 GNU/Linux
    Linux planetlab1.c3sl.ufpr.br 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
    Linux planetlab1.pop-ce.rnp.br 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 30 09:32:05 UTC 2009 i686 i686 i386 GNU/Linux
    Linux planetlab1.pop-mg.rnp.br 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
    Linux planetlab1.pop-rs.rnp.br 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 30 09:32:05 UTC 2009 i686 i686 i386 GNU/Linux
    Linux planetlab1.win.trlabs.ca 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
    Linux planetlab2.ani.univie.ac.at 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
    Linux planetlab2.c3sl.ufpr.br 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Wed May 27 06:16:01 EDT 2009 i686 i686 i386 GNU/Linux
    Linux planetlab2.pop-ce.rnp.br 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 30 09:32:05 UTC 2009 i686 i686 i386 GNU/Linux
    Linux planetlab2.pop-mg.rnp.br 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
    Linux planetlab2.pop-rs.rnp.br 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 30 09:32:05 UTC 2009 i686 i686 i386 GNU/Linux
    Linux planetlab3.ani.univie.ac.at 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
    Linux planetlab4.ani.univie.ac.at 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 30 09:32:05 UTC 2009 i686 i686 i386 GNU/Linux
    Linux plnode01.cs.mu.oz.au 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
    Linux plnode02.cs.mu.oz.au 2.6.22.19-vs2.3.0.34.39.planetlab #1 SMP Tue Jun 23 18:27:24 UTC 2009 i686 i686 i386 GNU/Linux
    That's exactly what I would hope for: every machine reports a 32-bit architecture.

    To Do - 1/19

    To Do List for today:
    • Continue to identify which servers are good and which are bad
      • √ do a simple pssh hostname test to return the list of valid servers...
        • √ store this in workingnodesN.txt
        • √ Profile the peers via uname -a. Ensure none are 64-bit. Cull those that are.
      • create handy shell scripts that do what I do regularly, to save deploy time.
        • √ setenv.sh - sets up pssh env vars so I don't need to type them all out again and again
        • √ build.sh - copies src to $1, and builds worker & controller there
        • deploybin.sh - copies $1:~/worker to *:/home/uic_voronoi/worker - still issues
      • extend output in controller to include ip when error
        • remove worker from list when error
        • wait until prepare is finished before letting run happen... at least notify when prepare is done...
      • Find out why bad data is getting written out... even though there are asserts!
        • try running the worker in some kind of test mode on each server
        • make sure udp messages are working correctly by sending two messages to a worker, and testing that two reads happen...
    • Modify analyzer to spit out unprocessed data as a spreadsheet (peer x time), so as to graph the raw data, rather than just the processed data.


    Purpose of the blog

    This blog will document my escapades with building and testing Dove. I'll update every day with a report on what I plan to do or have done, and I'll go back and edit these posts when I finish a task.

    Also, the blog will serve as a place for me to ask questions and for interested parties to respond (via comments).