Ruby is beautiful (but I’m moving to Python)

Tuesday, 23 November 2010

The Ruby language is beautiful. And I think it deserves to break free from the Web. I think the future of Ruby is firmly stuck in Web development, though, so I’m going to invest in a new language for data analysis, at least for now. This is a look at the fantastic language I came to from Java and a look at a possible candidate. (Update: I’ve since written a followup.)

Java to Ruby

Six years ago, I added Ruby to my technical arsenal. I learned C++ and Java in high school, and I planned to use them for data analysis in college–mainly for research kung fu. But when I discovered Ruby, I knew something was different. Ruby let me be productive and get things done fast. It was fantastically useful, for everything from renaming files to plotting finances to doing math homework to preparing lab reports. Now, I’m no computer scientist, and I did not need C’s superspeed, so Ruby–a little slower, but dynamic–was perfect for me.

What struck me about Ruby? What features did I see that made me move from static, fast languages like Java to a dynamic language like Ruby?

First, the language is strikingly expressive.

A standard “Hello, World” program looks like this in Java:

class ThisIsAClassIDontReallyWantToNameButJavaMakesMe {
  public static void main(String[] args) {
    System.out.println("Hello World");
  }
}

Ruby’s creator, Yukihiro Matsumoto, doesn’t believe in unnecessary fluff or hierarchy. So you don’t have to declare a class or main method just to run a script. The above code is much simpler in Ruby:

puts 'Hello World'

It’s pretty obvious: Ruby encourages simplicity.

Ruby embraces convention over needless configuration. That means short method names that do what most people want–print and puts will print, gets will grab keyboard input. It also means including common libraries–for files and regular expressions–in the language itself.
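To make that a bit more concrete, here is a small sketch of what "in the language itself" buys you. The sample line is made up for illustration; the point is only that there is nothing to import before using regular expressions or the everyday string methods.

```ruby
# Hypothetical sample line; any text would do.
line = 'Tuesday, 23 November 2010'

# Regular expressions are part of the syntax, no library to load.
if line =~ /(\d+) (\w+) (\d{4})/
  puts "Day #{$1}, month #{$2}, year #{$3}"
end

# Common string chores are short method calls.
words = line.split
puts words.length
puts line.upcase
```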

Nowhere is this more helpful than when we’re reading files into our script. In Java, IO is tedious. Sometimes I think its designers didn’t think we’d ever do that sort of thing. Don’t believe me? In Java, the simplest way to read a flat file is as follows. (Note: I have not coded Java in about six years. Don’t take this as a recommendation or expert advice.)

import java.io.File;
import java.io.FileInputStream;

// ... declare class, etc., then ...

public byte[] justReadMeAFilePlease() {
  try {
    File file = new File("my-file.txt");
    FileInputStream fis = new FileInputStream(file);
    byte[] b = new byte[(int) file.length()];
    fis.read(b);
    fis.close();
    return b;
  } catch (Exception e) {
    e.printStackTrace();
    return null;
  }
}

The same task in Ruby looks something like this:

contents = File.read 'my-file.txt'

I was physically shocked when I first learned how to read files in Ruby.

Ruby’s expressiveness is nice and, as I found out, lets me accomplish what I want wicked fast.

First-class functions

Higher-order functions are another vital part of Ruby. In true Lisp style, Ruby lets us pass around un-called functions in our code (local variables in tow) and call them exactly when we want.
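Here is a minimal sketch of what "local variables in tow" means. The name make_counter is something I made up for this example, not anything built into Ruby.

```ruby
# Hypothetical helper: returns an un-called function (a lambda)
# that remembers the local variable `count` from where it was defined.
def make_counter
  count = 0
  lambda { count += 1 }
end

counter = make_counter
counter.call  # => 1
counter.call  # => 2

# A second counter carries its own, independent `count`.
other = make_counter
other.call    # => 1
```

Nothing runs until call is invoked, and each lambda keeps hold of the variables that were in scope when it was created.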

Why would I want that in a language?

Certainly when I was coding in Java, I never knew I needed closures or higher-order functions. I happily chipped away at problems using for loops, iterators and tightly coupled functions. Because after all, that was programming. Right?

Ruby showed me a new way. Instead of implementing endless loops to work through data, Ruby showed me I could outsource those loops to its runtime. That is, all Ruby needs me to tell it is a list and a function. In turn, it will run that function on every item in my list. It can slice and dice a list with that function, or it could give me a new list derived from each element in the old one. The point is that instead of demanding trite details and edge cases (think C-style for loops), Ruby only asks for broad instructions.
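Roughly, that looks like the sketch below. The list and the double function are invented for illustration; each, map, select and inject are the standard methods Ruby's arrays provide.

```ruby
prices = [3, 8, 12, 20]

# Hypothetical function to hand over; nothing special about the name.
double = lambda { |n| n * 2 }

# Run a function on every item.
prices.each { |n| puts n }

# Build a new list derived from each element of the old one.
doubled = prices.map(&double)
# => [6, 16, 24, 40]

# Slice and dice: keep only the items the function approves of.
big = prices.select { |n| n > 10 }
# => [12, 20]

# Boil the whole list down to a single value.
total = prices.inject(0) { |sum, n| sum + n }
# => 43
```

Notice there is no index variable and no bounds check anywhere.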

Ruby is so well designed that for loops are almost never necessary; in fact, most Rubyists grimace if you do use one. (Yes, they are more performant. No, we don’t use Ruby when that difference would matter.)

There have been a few attempts to mimic higher-order functions in Java. However, that’s probably as difficult as implementing an object system in C. And even valiant attempts at function passing in Java are not widely adopted. We’ll probably need a Java++ before we ever get these functions into Java. (Note: There is a project underway that aims to give closures to Java. We’ll see if that passes through the bureaucracy that is Oracle.)

A case study: list filtering

In case you don’t believe that higher-order functions are powerful and necessary, let’s make this more concrete. When exactly can we benefit from function passing? What does this look like?

Let’s use filtering as an example. It’s often the case that we have a collection of items that we want to screen before processing. Maybe we’re looking at a dataset and don’t want to include nil values in our analysis. Or maybe we want to drop words from a list if they’re shorter than 3 letters. I don’t have concrete data, but I would say list manipulation can take up a good bit of a normal programmer’s development cycle. (Or, for a Lisper, 100% of his time.) The remaining time is, of course, debugging the edge cases in said list manipulation.

Filtering is important. So let’s filter.

Naturally when you want to filter something, you need an actual standard, a criterion that acts like a sieve for your data, one that allows some numbers to pass into a new list and bounces others. (In fact, a more advanced type of filter is a winnower - a function that returns two sets of data. The first meets our criteria. The second does not.) What can that criterion look like?

If we think about it hard enough, a criterion has to be a function. There is no way that a criterion can be a value (like 2). Even if you simply want to filter a list for all values equal to 2, you cannot simply pass 2 as a criterion. You have to have some way of comparing 2 to each element in your list. And that requires a function. Something like (in Ruby) lambda {|x| x == 2 }. You need a function, because a value can’t take arguments, and to filter a list, each item needs to be passed to something as an argument.

Now, if you have a simple criterion function and you know you’ll use it over and over again, the Java solution is to embed the criterion function right inside the filter function. That is, you have a custom-made filter that can only apply one criterion. In Java, you can do something like this to filter all even numbers into a new array:

public int[] filterEven(int[] array) {
  int[] aggregator = new int[array.length];
  int aggregatorSize = 0;

  for(int i = 0; i < array.length; i++) {
    // The next line is the only unique part of
    // this function!
    if(array[i] % 2 == 0) {
      // Someone please tell me there's an easier way
      // to append to an array than this
      aggregator[aggregatorSize] = array[i];
      aggregatorSize += 1;
    }
  }

  // Trim the unused slots off the end
  return java.util.Arrays.copyOf(aggregator, aggregatorSize);
}

int[] array = { 1, 2, 3, 4, 5, 6 };
filterEven(array);
// => { 2, 4, 6 }

But there are several problems here. First and foremost, the only part of this function that has to do with “evenness” is the criterion!

if(array[i] % 2 == 0)

That’s one line out of eight! And in fact, only half of that line is unique. The if(array[i] ...) part will be the same in any filter regardless of the criteria. In other words, about 90% of this function should be extracted into another function. The obvious problem? To separate our filter from our criterion, we have to have a way to pass the criterion function into the filter. And, as I’ve said, that’s just not possible in Java.

But Ruby has higher-order functions. So let’s do this thing:

array = [1, 2, 3, 4, 5, 6]
criterion = lambda {|n| n % 2 == 0 }

array.select &criterion
# => [2, 4, 6]

See that? I created a criterion function independent of my filter, and simply passed it to my filter. That function–select–will apply my criterion to each element in array and return a new array where the criterion is true.

But we can do better. In Ruby, everything is an object and responds to messages. And it turns out that we can ask a number if it’s even or odd. In other words, numbers have our criterion function built in, and we don’t have to define it at all. So:

array.select &:even?
# => [2, 4, 6]

This time I didn’t have to define a single function! Both my filter and criterion are built into the language.
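If the &:even? shorthand looks like magic, here is a rough sketch of what is going on, plus the winnower I mentioned earlier, which Ruby calls partition. The variable names are mine; Symbol#to_proc, select and partition are standard Ruby.

```ruby
array = [1, 2, 3, 4, 5, 6]

# &:even? asks the symbol for a function (Symbol#to_proc).
# That function sends the :even? message to whatever it receives.
is_even = :even?.to_proc
is_even.call(4)  # => true
is_even.call(5)  # => false

# So these all return [2, 4, 6]:
array.select(&:even?)
array.select(&is_even)
array.select { |n| n.even? }

# The winnower: partition returns the elements that meet the
# criterion, then the ones that don't.
evens, odds = array.partition(&:even?)
evens  # => [2, 4, 6]
odds   # => [1, 3, 5]
```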

Now, what happens if I ever want to change my criterion or add more rules? Well, in Java, I have to write an entire new function–for example, filterOdd(), filterWordsLongerThan(), filterNamesThatAreInThisOtherList(), etc., etc., ad infinitum. Each of those functions will be about 90% repetition and 10% novel. At best, we could make one function that takes multiple parameters and tries to decipher all of them to create an internal filter of its own. This is extremely inelegant and leads to bloated code.

Ruby’s closures, though, let me keep my filters and criteria separate until the minute I need them. All I need to do is define new criteria and pass them to my existing filter function. Doing this lets us do some complex stuff in very few lines of code. Let’s play around. (Note: This is a look at Ruby’s power and elegance, not necessarily good coding practice.)

words = %w(red orange yellow green blue indigo violet)

# One way to build up a filter. Build your comparison
# function into a criterion function. Then curry a
# specific number into that criterion function.

longer_than = lambda {|len, str| str.length > len }
longer_than_5 = longer_than.curry[5]

words.select &longer_than_5
# => ['orange', 'yellow', 'indigo', 'violet'] 

# More general way to build up a filter. Build 
# a general comparison function. Curry in a
# comparator function (for example, the
# greater than function). Then curry in a number.
# This is a far more flexible way to do things.

compare_with = lambda {|f,len,s| s.length.send(f, len) }

shorter_than = compare_with.curry[:<]
longer_than = compare_with.curry[:>]

shorter_than_5 = shorter_than.curry[5]
longer_than_5 = longer_than.curry[5]

words.select &shorter_than_5
# => ['red', 'blue']

words.select &longer_than_5
# => ['orange', 'yellow', 'indigo', 'violet'] 

Cool, huh? The takeaway: let your language deal with repetitive details like loops. As a programmer, focus on making code actually do things.

Higher-order functions have so transformed my code that I barely recognize what I was writing without them. The Java mess I detailed above is the Certified Java Way™ of working through any and all collections, last time I checked, and it can’t easily be any other way: you simply can’t pass Java functions around as objects.

Needless to say, I’ve moved full time from Java to Ruby. Actually I did that when I first discovered the language.

Yes, Java ran faster than Ruby. Yes, more people were ‘doing’ Java. No, Ruby would not get you a job–this was before Rails 2. But I didn’t care about any of that. I wanted an easy, elegant way to make my computer work for me, to write scripts and to build libraries. I wanted to prototype and get things done. I was tired of spending too much time on boilerplate code and not enough time solving real problems. I think the industry still hasn’t grasped the elegance of the above code.

Ruby does not revel in structures or minutiae. It is flexible. And powerful. It really almost is a Lisp. And over the last six years, it’s been invaluable to me. I’ve learned to build websites using Rack and Sinatra and Rails–in fact, the site you’re looking at is powered by Ruby that I wrote–and I feel like an expert at the language. I’m in the process of building a couple of open source libraries. And generally, I’m satisfied with the way things are in the Ruby community. Exciting things are happening.

What I’m missing (and it’s not the semi-colon)

Okay, I’ve spent a good few screenfuls of text praising Ruby for being beautiful and useful. I do love the language and think it makes programming painless. But the Ruby community is not giving me one very important thing, something that’s vital for me at work: solid tools for science. I’ve already used Ruby at work to dramatically speed up some of our vaccine screens–for processing and summarizing data–but I’ve always had to pipe my numbers into R for statistics and graphing. Because unfortunately, despite all the hubbub around Ruby, no one is crunching numbers with it! I’m sick of R, though, and I want a complete solution for my numbers needs. So I’m going to do that thing that no Ruby programmer wants to do: I’m going to learn Python.

Now, learning Python is something that is usually frowned upon in the Ruby world, not just because it’s the ‘rival language’, but because it’s so similar. The traditional argument is: “Don’t learn Python! It’s a dynamically typed, object-oriented, imperative interpreted scripting language, just like Ruby! And it’s ugly.” And zomg, all of that is true. But I’ve found (over four years, and especially this year) that really Ruby is focused on one thing, and that’s Rails/web development.

I love webdev, and I’ve done my fair share of it. But I also happen to think coding at the command line is pretty damn cool (see: the recent penchant for programmers to switch to Vim) and I want solid community support in that arena.

In theory, I could port SciPy and NumPy and PyBrain to Ruby and spend a lot of effort maintaining that code. It’s been tried before (see the failed SciRuby project) and I don’t want to go that route. Instead, I’m going to choose the smarter option and will simply add Python to my résumé. SciPy is unrivaled in the dynamic-language world, and I plan to embrace it for projects at work. What’s more, Python supports higher-order functions just like Ruby, so I’m not missing out on all the functional goodness I described earlier.

However, this big change is not without its problems.

Python package management

This past week, I started my foray into the Python world. This involved installing Python 2.7 and 3.1, bookmarking Dive into Python, and figuring out package management.

Er… by “figuring out”, I mean “being completely baffled by”.

At the moment, Python package management seems to be fragmented and complicated. I am used to typing something like gem install symbolic.

RubyGems is the standard way to install third-party libraries, and it’s quite the streamlined process. In Python, though, there seem to be competing managers (easy_install and pip) and separate ways to package libraries for uploading. I’m also hearing names like setuptools, distutils, distribute and virtualenv, and I have to say that the whole ecosystem isn’t too clear to me yet. And the documentation tends to assume I know what all the above mean already!

After asking around, I’m going to use pip for package management–apparently Pip Is The Future™. In fact, it looks like it’s meant to mimic RubyGems semantics. So installing SciPy should be as easy as pip install numpy scipy.

Awesome. Easy as Ruby.

But wait. What’s this? I see a lot of text moving down my screen. God. My computer is starting to heat up. Now I’m seeing errors all over the place. “You need a FORTRAN compiler. Found gfortran. Installation failed.”

Wha?! I mean, I expected C dependencies, sure. I would hate to do math without them. But FORTRAN? Is the extra speed really worth the complication?

I’m hoping to get this all sorted out soon and start doing some heavy stats with the new language. I’m excited to be joining a group of people who are focused on data and experiments, and not only HTTP requests, NoSQL, jQuery and event processing. That’s not a jab at the Rubyists I know. I’m just excited to put a language (and not just a web framework) to good use.

What to like about Ruby

Now, I want to be clear: I like Ruby better than Python. Its semantics make more sense to me than Python’s, and the language itself is more beautiful. I love that I can define + for any object in Ruby and that doing so gives me += for free. (Edit: Apparently, Python gives me that, too.) I love that blocks unify Ruby’s closures, anonymous functions, and iteration. That Ruby has so little syntax that it can masquerade as C or Perl or Scheme (more on that some other time). That when I code in Ruby, it does everything the way I would expect it to–without reading documentation.
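For the curious, here is a quick sketch of the + and += point. The Money class is invented purely for this example.

```ruby
# Hypothetical Money class, made up for illustration.
class Money
  attr_reader :cents

  def initialize(cents)
    @cents = cents
  end

  # Define + once...
  def +(other)
    Money.new(cents + other.cents)
  end
end

wallet = Money.new(500)

# ...and += comes along for free, because Ruby treats
# `wallet += x` as `wallet = wallet + x`.
wallet += Money.new(250)
wallet.cents  # => 750
```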

But Python and Ruby aren’t so different, and I want a strong data community to work with. I don’t want to duplicate effort to bring solid libraries to Ruby, and Python seems like the way to go. I might even use this new language to dabble in machine learning using PyBrain and the Natural Language Toolkit, which both, well, rock my socks off. The prospects in Python seem endless.

Statistics kung-fu is in demand and will continue to be so, and I’m happy to become competent with it. Maybe someday, that will be viable on the Ruby platform–and I look forward to that day. For now, Python is going into my toolbelt. I welcome the challenge that implies.