Levenshtein distance function for Pig and Hadoop - take 2: Now in Scala

As a follow-up to Levenshtein distance function for Pig and Hadoop, I wanted to implement the same function in Scala (I wanted to write cleaner code than I could in Java).

I have a really small DefaultEvalFunction that extends org.apache.pig.EvalFunc, but instead of the exec function, where implementors sometimes return null, it has an execute def that returns Option[T].  Another simple nicety is a RichTuple that lets me fetch a parameter and cast it in one shot.

To use it:

Levenshtein distance function for Pig and Hadoop

Need to do a whole slew of fuzzy string comparisons?  Have a Hadoop cluster at your disposal?  Use the above gist to give you a Levenshtein distance function that you can use within Pig.
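For anyone who just wants to see the distance computation itself, here is a minimal sketch in Python. It is not the gist's code (that one is a Pig UDF), and the function name `levenshtein` is only my choice for illustration. It uses the usual two-row dynamic programming approach, where insertions, deletions, and substitutions each cost 1.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b; insert, delete, and substitute each cost 1."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Memory use is proportional to the shorter string, which matters when you run this over a large number of string pairs.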

 

Measuring @twitterapi; per-tweet "delivery time" is approx. 16 ms

Assuming a model of

delivery time = constant time + (number of tweets returned * time per tweet)

and based on data gathered from polling statuses/home_timeline for "similar" users at 15-second intervals, both with and without since_id, the computed median per-tweet "delivery time" of the Twitter API (excluding the constant processing done on every request) is 16.425 ms.  The middle two quartiles span 15.400-17.638 ms.  I'm defining "per-tweet delivery time" as the time to do any per-tweet processing plus the time to send the tweet down the wire to the client.
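To make the model concrete, here is a rough Python sketch of one way a per-tweet time can be pulled out of paired requests. Take it as an assumption about the method, not a description of the actual analysis: if two requests differ only in how many tweets they return, the constant term cancels and the slope between them is the time per tweet. The function name and the sample numbers are made up for illustration; they are not the measured data.

```python
from statistics import median


def per_tweet_times(pairs):
    """Estimate time per tweet from pairs of (tweet_count, elapsed_ms) requests.

    With delivery = constant + count * per_tweet, subtracting two requests
    removes the constant: per_tweet = (t2 - t1) / (n2 - n1).
    """
    estimates = []
    for (n1, t1), (n2, t2) in pairs:
        if n1 != n2:  # equal counts say nothing about the slope
            estimates.append((t2 - t1) / (n2 - n1))
    return estimates


# Hypothetical measurements: (tweets returned, elapsed ms) with and without since_id.
sample_pairs = [
    ((2, 132.0), (20, 420.0)),
    ((5, 180.0), (20, 435.0)),
    ((1, 115.0), (20, 400.0)),
]

print(median(per_tweet_times(sample_pairs)))  # 16.0 for this made-up sample
```

Taking the median of the per-pair slopes keeps a few slow outlier requests from dragging the estimate around.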

One year on @twitterapi

One year ago today I posted this tweet.

To add to my list of things to build in my copious spare time

Racer = RC car + wireless video and actuation control + lots of cardboard + video game cabinet.  Final product is an awesome recreation of a lo-fi video game.

Marin Century 2010

Rode the Marin Century this weekend - 6,800 to 7,000 feet of hills in damp, cold weather.  Loved it, except I don't like climbing in massive groups.  Next year, the Mount Tam version is on my calendar.

iOS4 jailbreaking

I jailbroke my iPhone 4 running iOS 4.0.1 using jailbreakme last night.  And then I finally installed the apps to make my phone useful again -- from the photo:

  • Intelliscreen - gives me a lock screen that is actually useful!  It displays my calendar and the current weather right when I turn it on, and gives me some useful icons in the top bar;
  • MyWi - turns on tethering on your iPhone. I don't like tethering over WiFi to my phone (as some apps like PDANet do), but MyWi lets you tether the "correct" way, over the USB cable;
  • Five Icon Dock - to get one more app down there; and
  • Notified - logs all my notifications to give me historical access to those pop ups instead of the ephemeral way they are displayed today.

New cycling goal: Berkeley Hills Death Ride

I tried to ride the Berkeley Hills Death Ride last weekend - in a word: painful.  I only got through the first four of six hills (although hill six requires some fire trailing - probably not the best on skinny tires).  It's now my "once a month" ride until I finish it.

Getting out and about for @twitterapi

I've been speaking at a few places recently about the @twitterapi, and I figure it's time to update some presentations.  Last week @themattharris and I spoke at The Hacker Dojo, covering some of the latest features that we've put out as well as answering a bunch of questions around the basic-auth shutdown.  It was a blast!  Great energy down there.  @themattharris put all his slides up so you can read them too:

OAuth 1.0a test strings for the taking - get them while they're hot

Say, hypothetically, you were re-factoring a lot of code that verifies OAuth 1.0a signatures.  You would probably want a comprehensive list of strings to test against, no?  @episod and @danadanger created the OAuth UTF-8 character map, and I adapted that into a YML file of just the test strings.  Hope you find it useful.  I'm bashing some code against it right now.  You get some stuff that looks like the following:

--- 
tests: 
  inputs: 
  - 23456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abc
  - !binary |
    ZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXp7fH1+f8KAwoHCgsKDwoTChcKGwofC
    iMKJworCi8KMwo3CjsKPwpDCkcKSwpPClMKV

  - !binary |
    wpbCl8KYwpnCmsKbwpzCncKewp/CoMKhwqLCo8KkwqXCpsKnwqjCqcKqwqvC
    rMKtwq7Cr8KwwrHCssKzwrTCtcK2wrfCuMK5wrrCu8K8wr3CvsK/w4DDgcOC
    w4PDhMOFw4bDhw==
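As a rough idea of how strings like these can be bashed against signature code, here is a small Python sketch of the percent-encoding step that OAuth 1.0a applies before signing: everything except letters, digits, and `-`, `.`, `_`, `~` is encoded as uppercase `%XX` over the UTF-8 bytes. The helper name `oauth_percent_encode` is my own, and this only covers encoding, not building the full signature base string. The base64 value is the last line of the third entry above.

```python
import base64
from urllib.parse import quote


def oauth_percent_encode(text: str) -> str:
    """Percent-encode text, leaving only unreserved characters untouched."""
    # quote() always keeps letters, digits, "_", "." and "-"; "~" is passed
    # explicitly so the result is the same on older Python 3 versions.
    return quote(text.encode("utf-8"), safe="~")


# A slice of the plain first test string from the YAML above.
print(oauth_percent_encode(":;<=>?@"))  # %3A%3B%3C%3D%3E%3F%40

# A "!binary" entry is base64; decode it to text before encoding.
decoded = base64.b64decode("w4PDhMOFw4bDhw==").decode("utf-8")
print(oauth_percent_encode(decoded))  # %C3%83%C3%84%C3%85%C3%86%C3%87
```

Running every input in the file through the old and the new encoder and comparing the results is a cheap way to catch regressions during a refactor.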

 
