NotesToSelf

DK  //  Factoids and occasional bits of useful information.

Nov 2 / 6:40pm

ggplot2, plyr, and your.flowingdata

The previous post described how I went about cleaning up some yfd data using Python and numpy. I have no doubt it can be done in fewer lines of code, but I think the post described how useful it can be to manipulate arrays rather than looping through everything. With the data cleaned up, I hoped to visualize my newborn son's sleep schedule. I recently received an example that does the same thing as my python code, but in 3 lines! It uses R, ggplot2, and plyr. A few more lines can generate pretty plots like this (box plot of sleep length in hrs vs. start time):


As the plot above shows, my son doesn't sleep a helluva lot during the day. The boxplot also illustrates how volatile his night sleeping has been. This tells me I need to do a better job of getting the boy to nap during the day in hopes of producing longer and more restful sleep periods at night.

While Python has been my gateway drug into the world of programming, I've been itching to try out ggplot2, a plotting package for R. R is a popular language in the statistics community that has enjoyed some good press recently. Anyway, my little sleep duration project seemed perfect for some R exploration.

After searching around on the Interweb, I managed to write some broken R code that didn't really do what I wanted. Luckily, Hadley Wickham (the author of plyr and ggplot2) took pity on me and offered up some example code to point me in the right direction. I was shocked at the efficiency of the example, particularly given all the wrangling I had to do in python. Now, just for the record, I'm not making any statements about R vs. Python. Hadley obviously created plyr and ggplot2 to make R easier to use, and I imagine the same could be (or already has been) done for python. I just lack the experience and education to know!

Anyway, plyr and ggplot2 are very nice libraries that offer yet more reasons to learn R. Thank you Professor Wickham! Between python and R, I've got to believe one can slice and dice almost anything. If I could only get rpy2 working...
Filed under  //  life   python   R  

Comments (0)

Oct 28 / 10:46pm

Use numpy to flog your.flowingdata

As noted in a previous post, your.flowingdata.com (yfd) is a handy way to collect personal data. I've been collecting sleep, diaper, etc. data on my newborn son. Although yfd now allows users to calculate durations between specified events, the visualization of the information isn't quite to my liking and it's clear that errors in the data can make for some odd durations (e.g., my son slept for two days!). Numpy to the rescue!

For those of you who don't know, numpy is python's powerful array package. Rather than loop myself to death, I thought it made more sense to use numpy's slicing and masking features to clean up the data. These features make it easy to find data entry errors.

I use the Enthought python distribution for convenience's sake (and because I can't resist all those libraries -- most of which I'll never use). Below you'll find some screenshots that step through my little script. Refer to the complete code here. (Well, it's just a start really). The code is probably a bit verbose for what it does, but we all start somewhere.

The first step is getting the data into an array you can manipulate. For your reference, your.flowingdata yields data that looks like this:


As you can see, it's basically just events and timestamps (I'm not really making full use of the data types yfd offers, as shown by all the empty fields).

The code below creates a structured array. Typically, numpy arrays are made up of items of the same type; a structured array lets each column (field) have its own type. It occurs to me that this example isn't so great because I ended up sticking with strings (S10 = a ten character string), but you get the general idea. If you imagine a 2D array, you can define one column as floats, another as strings, yet another as ints, etc. I'm mostly interested in how much the little guy is sleeping, so the 'sleep_mask' variable creates a boolean mask of all the 'gnight' and 'gmorning' events (since they are mixed in with diaper changes and other random events).
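
A minimal sketch of the idea, not the original script: the rows are made up, and the field names ('event', 'value', 'unit', 'stamp') are my own labels for the four columns in the export. It uses U (unicode) strings instead of S10 so it runs unchanged on Python 3.

```python
import numpy as np

# Hypothetical rows in the shape of the yfd export: event, two unused
# fields, and a UTC timestamp.
rows = [
    ("gnight",   "", "", "2009-09-22 21:05:10"),
    ("diaper",   "", "", "2009-09-23 02:30:00"),
    ("gmorning", "", "", "2009-09-23 06:10:45"),
    ("feed",     "", "", "2009-09-23 06:30:00"),
    ("gnight",   "", "", "2009-09-23 21:00:03"),
]

# One type per field; the field names are assumptions for illustration.
yfd_dtype = [("event", "U10"), ("value", "U10"), ("unit", "U10"), ("stamp", "U19")]
events = np.array(rows, dtype=yfd_dtype)

# Boolean mask: True wherever the event is one of the two sleep markers.
sleep_mask = (events["event"] == "gnight") | (events["event"] == "gmorning")

# Indexing with the mask keeps only the sleep events.
sleep_events = events[sleep_mask]
print(sleep_events["event"])
```

Indexing by field name returns an ordinary array for that column, which is what makes the comparison and the mask one-liners.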


We can use numpy's where() function to help us index the events we want. Now that I have an array of only gnight and gmorning events, I can offset the two (since they alternate) to see if there are any duplicates that might screw things up.
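
Here's one way to read that offset trick, sketched with made-up labels: compare the array with a copy of itself shifted by one position, and any place where neighbors match is a duplicate. Unlike the 'errors' output shown further down, this version returns positions rather than full records.

```python
import numpy as np

# Hypothetical sleep-only event labels in time order; a clean log alternates.
labels = np.array(["gnight", "gmorning", "gnight", "gnight", "gmorning", "gmorning"])

# Offset the array against itself: element i is compared with element i + 1.
same_as_next = labels[:-1] == labels[1:]

# np.where returns a tuple of index arrays, one per dimension.
dupe_idx = np.where(same_as_next)[0]

# Point at the second entry of each repeated pair.
errors = dupe_idx + 1
print(errors)
```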


The first time I called 'errors', numpy returned something like the following (basically telling me when/where there are dupes):

array([('gmorning', '', '', '2009-10-24 23:45:36'),('gmorning', '', '', '2009-09-30 18:15:04'), ('gnight', '', '', '2009-09-23 21:00:03'), ('gmorning', '', '', '2009-09-23 19:15:03')])

I won't step through all the code here since it's available above, but you get the idea. One thing to watch out for: datetimes. I spent a lot of time trying to figure out the best way to handle the timestamps included with the yfd event data. There are ways to convert strings to ordinal numbers, then to datetime objects and back again, but really I wanted to manipulate the datetime objects directly to take advantage of numpy's array slicing and arithmetic. Luckily, numpy arrays can hold arbitrary python objects (technically, the array's 'dtype' is object). This allows you to subtract one timestamp array from another to get the elapsed time without any conversions (though you'll have to convert at some point if you want to generate a human-readable string). Here's an example of the array you'll get at the end (columns: sleep duration, start time, end time):
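
A small sketch of the subtraction, assuming the gnight and gmorning timestamps have already been split into two equal-length lists. The timestamps are invented, and the 24-hour cutoff is just my own rule of thumb for catching entries like the two-day nap.

```python
import numpy as np
from datetime import datetime, timedelta

fmt = "%Y-%m-%d %H:%M:%S"

# Hypothetical, already-paired timestamps; the last pair is a data entry error.
night_strs = ["2009-09-22 21:05:10", "2009-09-23 21:00:03", "2009-09-24 20:45:00"]
morning_strs = ["2009-09-23 06:10:45", "2009-09-24 05:30:03", "2009-09-26 20:45:00"]

# Object arrays hold the datetime instances themselves.
starts = np.array([datetime.strptime(s, fmt) for s in night_strs], dtype=object)
ends = np.array([datetime.strptime(s, fmt) for s in morning_strs], dtype=object)

# Element-wise subtraction yields an object array of timedelta values.
durations = ends - starts

# Convert to plain floats only when numbers are needed.
hours = np.array([d.total_seconds() / 3600 for d in durations])

# Mask out anything longer than a day (assumed threshold).
plausible = hours < 24
print(hours[plausible])
```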


Another unexpected pain in the butt is TIMEZONES. Although yfd's UI shows the correct local time on the web page, the tab-delimited file uses UTC (GMT) timestamps. This actually makes sense if you think about it. If you travel a lot, you'll never be sure when something happened since your timezone isn't held constant. Keeping datetime in UTC solves this problem, though you have to convert to local time yourself if necessary. Handling timezones with python's datetime library, however, sort of sucks. I recommend checking out pytz. It makes timezone management a little bit easier.
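
For illustration, here's the UTC-to-local conversion for one of the timestamps above. Note this sketch swaps in zoneinfo, which ships with the standard library from Python 3.9 onward, rather than pytz; the 'America/New_York' zone is my assumption about what "local" means here.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# Timestamp as it appears in the yfd export (UTC, no zone attached).
naive = datetime.strptime("2009-10-24 23:45:36", "%Y-%m-%d %H:%M:%S")

# Attach UTC explicitly, then convert to the assumed local zone.
utc_stamp = naive.replace(tzinfo=timezone.utc)
local_stamp = utc_stamp.astimezone(ZoneInfo("America/New_York"))

# Offset from UTC in hours for that particular date (DST-aware).
offset_hours = local_stamp.utcoffset().total_seconds() / 3600

print(local_stamp.strftime("%Y-%m-%d %H:%M:%S"), offset_hours)
```

The key habit is the same with either library: keep everything in UTC while cleaning and subtracting, and convert to local time only for display.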

Plans for the future include visualizing this data with either python or R (ggplot2 anyone?). Too bad I don't know R...
Filed under  //  life   python   tech  

Comments (1)

Oct 13 / 8:50am

Stock Ticker Orbital Comparison = COOL

Care of Flowing Data, Stock Ticker Orbital Comparison (STOC) is one of the coolest representations of the market I've seen. Although I can't see anyone really trading on top of this visualization metaphor, it does make one think of how correlations and other parameters might be represented via animation.

STOC was built using Processing, a Java-based visualization IDE developed at MIT. I understand there are Scala and Javascript versions in development as well. The closest python equivalents I can think of are NodeBox and Mayavi. In any case, STOC has swerve. Respect.

Filed under  //  finance   tech   video  

Comments (0)

Oct 12 / 11:38am

Import AntiGravity

Just saw this...

Filed under  //  life   python  

Comments (0)

Oct 10 / 11:42am

Baby T-Pain

I wish it sounded like that...

Filed under  //  life  

Comments (0)

Oct 7 / 1:24pm

Freeset Helps Free the Indentured in India

Some friends of mine are hosting a talk by Kerry Hilton, the founder of Freeset. From the website:

Freeset exists specifically to provide freedom for women from the sex trade, women who were forced into prostitution by trafficking or poverty. These women didn't choose their profession — it was chosen for them.

Now, they're being offered a real choice. When they choose to work at Freeset, they can start new lives, regain dignity in their communities, and begin a journey towards healing and wholeness.

All profits from Freeset in Kolkata benefit the women (salary, health insurance and retirement plan) and are used to grow the business. This means more women can be employed and experience freedom.

The great thing is, when you buy a Freeset product, you directly participate in a woman's journey to freedom.


Freeset trains these women to make custom bags and tee shirts. I'm not sure how differentiated the bags are from other bags, but the story is pretty unique.

The talk starts at 2:30pm this Sunday in Tarrytown, NY at the Reformed Church of the Tarrytowns (42N Broadway, Tarrytown, NY). Stop by if you want to learn more.

Filed under  //  life  

Comments (0)

Oct 1 / 4:55am

Palantir Finance looks promising

Garry (one of Posterous' founders) highlights the latest offering from Palantir - Palantir Finance. It looks like it has pretty powerful charting tools. I've signed up for an account and will report back once I've fiddled with it. I'm excited to try out the data exploration capabilities of this new tool (and, of course, to find out whether there's an API).

Filed under  //  finance  

Comments (0)

Sep 30 / 5:11am

Malawi boy teaches himself to build windmills

This story deserves repeating (care of Gizmodo).

UPDATE: He just made it onto the Daily Show.

Filed under  //  life   video  

Comments (0)

Sep 25 / 7:17pm

Use your.flowingdata.com...for the children

Personal data capture is a meme that's gaining momentum. Products such as Nike+ and, more recently, Fitbit, target those who would like to monitor daily exercise and other activities. Websites that allow users to manually track how they use their time have also started to pop up. For those of us who like to procrastinate, these monitoring tools can help by providing regular feedback. Watching a little line move in the right direction can be pretty motivating.

Of course, I don't use any of these services. For myself.

Nevertheless, as a new father, I've found that your.flowingdata.com is an easy and useful way to track the activities of my newborn son! The service uses tweets to capture pretty much any kind of data you'd care to record. There are electronic products (e.g., Itsbeen, basically a stopwatch on steroids) that help new parents keep track of when the baby last slept, ate, poo'ed, etc. They do not, however, capture that data for analysis. My wife and I would like to review the historical data to see if we can tease out some insights about our son (e.g., how much sleep does he need before he gets cranky?). We tried using an iPhone app called Blogger that helps parents keep track of these things, but it wasn't immediate enough. We ended up writing down events on the nursery mirror with a dry erase pen, but I really wanted to track things via a single button press. By the time I've finished dodging multiple salvos of pee and poo, multiple diaper changes due to said peeing and pooing, spit-up, puking, and sundry other lovely activities (a testament to how much I love you, boy), I can't remember anything that's happened in the last five minutes, let alone the last hour or two. So far, your.flowingdata.com has been the answer.

your.flowingdata.com ('yfd') is a service based on Twitter. Users send direct messages to 'yfd' and can visit the site for simple visualizations. Users can also download tab-delimited files with all the data. But wait, there's more! One kind soul also created a simple yfd iPhone application that allows users to send an update (e.g. 'd yfd gnight') via a single button press. Each button can be customized as well. I have no use for Twitter, but yfd got me to open an account. We're still figuring out what we want to record, but the service's flexibility and ease of use make it much more likely we'll actually use it.

yfd isn't perfect. There's no built-in way to, for example, calculate the time that has elapsed between two actions (e.g. going to sleep and waking up). One has to download the data and calculate durations manually (or create a script to do it). There are other visualizations available, though. As I mentioned, I find it's much more important to make it easy to capture data for something like this. If it's a pain to capture the data, there won't be anything to analyze on the back-end anyway.
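
A script for the duration calculation doesn't need much. Here's a rough sketch using only the standard library; the column layout (event, two unused fields, UTC timestamp) and the sample rows are assumptions rather than a real yfd download.

```python
import csv
import io
from datetime import datetime

# Hypothetical tab-delimited export: event, two unused fields, timestamp.
raw = (
    "gnight\t\t\t2009-09-22 21:05:10\n"
    "diaper\t\t\t2009-09-23 02:30:00\n"
    "gmorning\t\t\t2009-09-23 06:10:45\n"
)

fmt = "%Y-%m-%d %H:%M:%S"
durations = []
asleep_since = None

# io.StringIO stands in for an opened file here.
for event, _, _, stamp in csv.reader(io.StringIO(raw), delimiter="\t"):
    when = datetime.strptime(stamp, fmt)
    if event == "gnight":
        asleep_since = when
    elif event == "gmorning" and asleep_since is not None:
        # Pair each wake-up with the most recent bedtime.
        durations.append(when - asleep_since)
        asleep_since = None

print(durations)
```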

So, if you have absolutely no interest in personal fitness, time tracking, etc., you may want to check out your.flowingdata.com...for the children.

UPDATE: yfd has been updated to allow the calculation of durations between defined actions. I'd love to be able to aggregate these durations over a given time period (i.e. daily, weekly, monthly, etc.) in the form of a bar chart or something. yfd does visualize the data, but in a slightly different way. Best if you just check it out through the "Explore" link on the yfd site.

Filed under  //  life   tech  

Comments (2)

Sep 15 / 12:15am

Parsing DTCC Part 1: PITA

In a previous post, I complained about the DTCC's CDS data website and the one week lifespan of the data published there. For those of you who don't know, the DTCC clears and settles a massive number of transactions every day for multiple asset classes. It's one of those financial institutions that doesn't get much press but underpins the entire capital market.

Anyway, the recent crisis motivated the DTCC to publish weekly CDS (single name, index, and tranche) exposure data. A good idea, until one realizes the data goes up in smoke when the next week's data arrives. Although DTCC recently added links to data for "a week ago", "a month ago", and "a year ago," it's still pretty inconvenient. So, if you want the data, you have to parse it yourself. I originally wanted to write a smart parser that would dynamically react to whatever format it encountered...I came to my senses and adopted a simpler approach.

The approach thus far:

  • Download the raw html pages/files via "curl." Urllib2 is the preferred method to pull web pages, but I didn't have the patience to figure out how to handle redirects. Curl is a utility included with OS X that, for whatever reason, ignores redirects automatically. As such, I created a short python script to download the html for all the tables of interest weekly.
  • Use BeautifulSoup to parse the html. Other libraries, such as html5lib and lxml, seem to be gaining ground on BeautifulSoup, particularly as its author wants to get out of the parsing game altogether. Nevertheless, I couldn't be bothered to figure out the unicode issues I experienced with html5lib or lxml's logic. BeautifulSoup is straightforward and "gives you unicode, dammit!" (quoting the author).
  • Use numpy for easier data manipulation. Since my html, css, DOM, etc. knowledge is basic, I thought it might be better to use numpy to manipulate the table data rather than rely solely on the parser. This meant vectorizing the html data into a 1D array, cleaning it up, and generally preparing it for future reshaping. Numpy, how did I ever live without you?
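
The last two steps can be sketched roughly as follows. To keep the example dependency-light, the standard library's html.parser stands in for BeautifulSoup, and the table is invented (it is not DTCC data); the point is the flatten, clean, reshape sequence in numpy.

```python
import numpy as np
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


# Hypothetical table with the usual mess: thousands separators, stray spaces.
html = """
<table>
  <tr><th>Category</th><th>Gross Notional</th><th>Contracts</th></tr>
  <tr><td>Dealers </td><td>1,000</td><td> 25 </td></tr>
  <tr><td>Non-dealers</td><td>2,500</td><td>40</td></tr>
</table>
"""

parser = CellCollector()
parser.feed(html)

# Vectorize: one flat 1D array of cell strings.
flat = np.array(parser.cells)

# Clean up every cell at once instead of looping.
flat = np.char.strip(flat)
flat = np.char.replace(flat, ",", "")

# Reshape once the column count for this table format is known (3 here).
table = flat.reshape(-1, 3)
values = table[1:, 1:].astype(float)
print(values)
```

An extra or missing cell makes the reshape fail loudly, which is a handy early warning that a table doesn't match the expected format.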


This would've been much easier if all the tables were exactly the same format. Unfortunately, that's never the case. An extra cell here or there, or weird characters, can throw things off. This isn't an issue if you are parsing individual pieces of data or a single table. But what if you need to parse 10, 20, or 100 tables? It can get ugly fast. The DTCC data is broken into 23 pages, some of which have multiple tables. Luckily, most of my pain was self-inflicted (hey, I'm a parsing virgin). I only had to account for a few different table formats in the end.

One downside to my approach is that I do not dynamically produce headers for the data I'm pulling. I plan to manually set the headers for each table (the ultimate destination for the data right now is a set of csv files). If there's a better way, please let me know.
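
For what it's worth, setting headers by hand might look something like this. The header names and rows are hypothetical, and an in-memory buffer stands in for the real csv file.

```python
import csv
import io

# Hypothetical header names and rows; real column titles differ by table.
headers = ["category", "gross_notional", "contracts"]
rows = [["Dealers", 1000.0, 25], ["Non-dealers", 2500.0, 40]]

# io.StringIO stands in for a file opened with newline="".
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(headers)  # manual header row first
writer.writerows(rows)    # then the parsed data

csv_text = buffer.getvalue()
print(csv_text)
```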

You can find the code here via pastebin (feedback is welcome).
You can find the DTCC tables here (if you want to view the html source).

Part 2 will cover the process of reformatting the data with numpy and perhaps feature some charts. I'm very curious to see what the numbers show!

Here are a few screenshots of a terminal session using the code so far:

       
Click here to download:
Parsing_DTCC_Part_1_PITA_tag_p.zip (464 KB)

Filed under  //  finance   python  

Comments (2)