Ruby is beautiful (but I’m learning Python)

The Ruby language is beautiful. But I think the future of Ruby is firmly stuck in Web development, which is a good reason to take a look at other languages for data analysis. This is a look at the fantastic language I came to from Java and a look at Python for data analysis.

Java, Ruby & expressiveness

Six years ago, I added Ruby to my technical arsenal. I learned C++ and Java in high school, and I planned to use them for data analysis in college—mainly for research kung fu. But when I discovered Ruby, I knew something was different. Ruby let me be productive and get things done fast. It was ridiculously useful, for everything from renaming files to plotting finances to doing math homework to preparing lab reports. I didn't need C's superspeed or Java's safety, so Ruby—a little slower, but dynamic—was perfect for me.

What struck me about Ruby? What features did I see that made me move from static, fast languages like Java to a dynamic language like Ruby?

First, the language is strikingly expressive.

A standard "Hello, World" program looks like this in Java:

class ThisIsAClassIDontReallyWantToNameButJavaMakesMe {
  public static void main(String[] args) {
    System.out.println("Hello World");
  }
}

Of course, this is an extreme example: Most systems in Java won't see this high ratio of boilerplate to significant code. But the Java world does generally accept boilerplate as okay (saying it should be handled by an IDE), and that has pretty big effects on the ecosystem.

Yukihiro Matsumoto, on the other hand, doesn't believe in much hierarchy and wants to avoid surprise. So he lets us do the same thing in Ruby much more easily:

puts 'Hello World'

Another obvious example of Ruby's expressiveness is its API for IO.

In Java, IO can be a chore. Sometimes I think its designers didn't think we'd ever do things like read files. But of course it's not that. It's just that Java is honed for building robust, large-scale systems with fine-grained control over bits and bytes. So the standard library isn't optimized for scripting or naive implementations. In Java, users have to specify things like buffer interactions even if they don't need them. Reading a file goes something like this (Note: I haven't coded in Java for six years, so there may now be better ways):

import java.io.File;
import java.io.FileInputStream;

class FileReader {
  public static void main(String args[]) {
    try {
      File file = new File("./text.md");
      FileInputStream fis = new FileInputStream(file);
      byte[] contents = new byte[(int) file.length()];
      fis.read(contents);
      // do something with contents
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

The same task in Ruby looks something like this:

contents = File.read 'text.md'

Java certainly gives us more control here, and that's really appropriate sometimes. But for engineers who don't need to optimize and do read files a lot, especially while scripting, having an easy-to-remember method to read files is a game changer.

Blocks make function passing easy

So Ruby removes boilerplate and gives us lots of power in just a few keystrokes. That's excellent, but if that were all Ruby offered, Java's advanced IDEs could be just as good a solution. But Ruby also lets us decouple our code into reusable chunks much more easily. It does this by letting us define uncalled functions (as blocks) and manipulate them in our code.

When we pass a block to a method, we get the chance to specify how our outer code (the block definition and all of our local state) interacts with the inner code (the method definition). We pass our blocks around, uncalled by default, until exactly when we want to evaluate them. This means we can easily combine logic and state from multiple places to effectively decouple our code and achieve maximal code reuse. Taking a page out of the functional programming book, blocks effectively introduce higher-order programming to Ruby in a way that's both useful and easy to understand.
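Here's a minimal sketch of what "uncalled by default" means in practice. The Deferred class is a name I made up for illustration; it isn't part of Ruby's standard library. It captures a block when it's created and only evaluates it when asked, and the block still has access to the local state (calls) from the place it was defined.

```ruby
# Hypothetical class for illustration: stores a block now, evaluates it later.
class Deferred
  def initialize(&block)
    @block = block # captured as a Proc; nothing has been called yet
  end

  def run(value)
    @block.call(value)
  end
end

calls = [] # local state that belongs to the outer code

job = Deferred.new { |x|
  calls << x # the block closes over the outer local
  x * 2
}

ran_early = !calls.empty? # => false (the block has not run yet)
result = job.run(21)      # => 42
calls                     # => [21]
```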

Why, specifically, would I want to pass around functions in a language?

Certainly when I was coding in Java, I never knew I needed to. I happily chipped away at similar problems repeatedly using for loops, iterators and tightly coupled functions. Those were the language primitives I was given.

But Ruby enlightened me. (If I had been around during Lisp's heyday, I would've learned that earlier.) Instead of implementing endless loops to work through data, Ruby showed me I could outsource that work to a standard set of methods for lots of tasks. For example, I learned that to transform every item in a list, all I need to do is call map on that list, and pass in a block that does the transformation. In turn, it's the array, and not my application code, that has already defined how to run that function on every item it contains.
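As a small illustration (the word list is just sample data I picked), this is what calling map with a block looks like:

```ruby
words = ['ruby', 'python', 'java']

# The block describes the transformation for a single item;
# Array#map takes care of the iteration and builds the new array.
lengths = words.map { |word| word.length }
# => [4, 6, 4]

# When the block only calls one method, &:name is a common shorthand.
shouted = words.map(&:upcase)
# => ["RUBY", "PYTHON", "JAVA"]
```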

I could also slice and dice a list with a block, or reduce it down to a single value. The point is that instead of having to repeat implementation details and edge cases in multiple spots (think C-style for loops), I can code at a level of abstraction that's appropriate for my application.
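A quick sketch of the slicing, dicing and reducing I mean, using sample numbers and methods from Ruby's standard collections:

```ruby
numbers = [1, 2, 3, 4, 5, 6]

# Keep the items for which the block returns true.
big = numbers.select { |n| n > 3 }
# => [4, 5, 6]

# Drop the items for which the block returns true.
small = numbers.reject { |n| n > 3 }
# => [1, 2, 3]

# Boil the whole list down to one value, starting from 0.
total = numbers.reduce(0) { |sum, n| sum + n }
# => 21
```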

In fact, Ruby is so well designed that for loops aren't very common, and most Rubyists scratch their heads if you do use one.

There have been a few attempts to harness higher-order functions in Java. However, so far they've seemed clunky, and they haven't taken the Java world by storm. (Note: There is a project underway that aims to add anonymous functions to Java. I look forward to seeing how that goes.)

A case study: list filtering

If you're already on board with higher-order programming, you already know the power that comes from Ruby's blocks, and you can skip this section. But if you don't believe me, let's try to make this more concrete. When exactly can we benefit from function passing? What does this look like?

It's often the case that we have a collection of items that we want to filter before processing. Maybe we're looking at a dataset and we want our dataset without empty values. Or maybe we want to get all of the words in a list that have more than 3 characters. Or all users who have logged into our site within the last hour.

Naturally when you want to filter something, you need an actual standard, a criterion that acts as a guard for your data, one that allows some items to pass into a new list and bounces others. (In fact, a more advanced type of filter is a winnower: a function that returns two sets of data. The first meets our criterion. The second does not.) That sounds nice in theory, but what does it mean to be a criterion? How do we code that?
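For what it's worth, Ruby's standard library already has something close to a winnower in Enumerable#partition. A minimal sketch, using made-up sample values with gaps in them:

```ruby
values = [3, nil, 7, nil, 12]

# partition returns two arrays: the items that meet the criterion,
# followed by the items that don't.
present, missing = values.partition { |v| !v.nil? }

present # => [3, 7, 12]
missing # => [nil, nil]
```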

A common way in languages without higher-order functions is to iterate over a collection and manually collect items that meet the hardcoded expression in an if statement:

import java.util.Arrays;

class ArrayFilterer {
  public static int[] filterEven(int[] array) {
    int[] filteredArray = new int[array.length];
    int filteredArraySize = 0;

    for (int i : array) {
      // The next line is the only unique part of
      // this function!
      if (i % 2 == 0) {
        // Someone please tell me there's an easier way
        // to append to an array than this
        filteredArray[filteredArraySize] = i;
        filteredArraySize += 1;
      }
    }

    return Arrays.copyOfRange(filteredArray, 0, filteredArraySize);
  }

  public static void main(String[] args) {
    int[] array = { 1, 2, 3, 4, 5, 6 };
    filterEven(array); // => { 2, 4, 6 }
  }
}

This is a fine way to filter if we need to do it once ever. But what if we want to filter the same list using another criterion? In Java, we build another loop and filter again!

It may be okay to program at such a repetitive, low level for performance-sensitive code (like you might in C). But what if we could decouple the criterion and the grunt work for this filter? If we could do that, not only could we have more reusable chunks of code, but some of those key chunks might even be provided by the standard library!

To get a handle on this goal, notice that the only part of our Java function that has to do with "evenness" is the expression inside the if statement.

if (i % 2 == 0)

That's one line out of eight! And in fact, only half of that line is unique. Now, we can't extract that body in Java because that line needs access to each element of the array in turn (i). We also can't just pass the % 2 == 0 bit because Java doesn't allow currying (leaving the operator unevaluated until it has enough arguments, in this case).

If only we could pass around an uncalled set of code, ready to be executed on parameters any time we chose... Oh wait, that's what a function is!

While Java may not (yet) support passing functions around as objects, Ruby does, so let's do this thing:

def filter(arr, &criteria)
  filtered_arr = []

  arr.each do |elem|
    if criteria.call(elem)
      filtered_arr << elem
    end
  end

  filtered_arr
end

array = [1, 2, 3, 4, 5, 6]
criterion = lambda {|n| n % 2 == 0 }

filter(array, &criterion)
# => [2, 4, 6]

See that? I defined functions for general filtering and for my criterion completely independently and simply passed criterion in to filter at runtime. This decoupling gave us the benefits we wanted with almost no syntax overhead.
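To show the reuse, here is a sketch that applies the same filter method to two of the other criteria I mentioned earlier: words with more than 3 characters, and a dataset without empty values. The filter definition is repeated (slightly condensed) so the snippet runs on its own, and the sample data is invented.

```ruby
# Same filter method as above, repeated so this snippet is self-contained.
def filter(arr, &criteria)
  filtered_arr = []
  arr.each { |elem| filtered_arr << elem if criteria.call(elem) }
  filtered_arr
end

words = ['a', 'list', 'of', 'sample', 'words']
long_words = filter(words) { |w| w.length > 3 }
# => ["list", "sample", "words"]

readings = [4, nil, 8, nil]
non_empty = filter(readings) { |v| !v.nil? }
# => [4, 8]
```

The filter method never changes. Only the block does.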

And in fact, because code can be separated along these lines, the collections in our standard library have already defined the most common functions for us, so lots of the time, the only thing we need to provide is a block (and not the reusable implementation).

We can rewrite the above using only methods provided by the standard library:

array.select(&:even?)
# => [2, 4, 6]

This time I didn't have to define a single function! I re-used two completely decoupled methods, Array#select and Integer#even?—methods that didn't even know about each other—to accomplish this task. 
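As an aside, the "just pass the % 2 == 0 bit" idea from earlier is also possible in Ruby, using Proc#curry. This is only a sketch; divisible_by is a name I invented for the example.

```ruby
# Hypothetical two-argument lambda: is n divisible by divisor?
divisible_by = ->(divisor, n) { n % divisor == 0 }.curry

# Supplying only the first argument gives back a function
# that is still waiting for n.
even = divisible_by[2]

four_is_even = even[4] # => true
five_is_even = even[5] # => false

evens = [1, 2, 3, 4, 5, 6].select(&even)
# => [2, 4, 6]
```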

The takeaway: if you let your language deal with repetitive details like looping and even-ness, you get to focus on the unique bits of code that differentiate your application from everyone else's. And having so many reusable methods available to us makes it not only useful, but also easy to do the right thing by default.

From Java to Ruby

Higher-order functions have transformed my code so much that I barely recognize what I was writing without them. In fact, in 2005, I stopped coding in Java and learned Ruby. Because I wasn't working in an environment where stability or fine-grained data structure manipulation was necessary, I shifted in favor of ease of use instead of perfect control. Yes, Java ran faster than Ruby. Yes, more people were 'doing' Java. No, Ruby would not get you a job (this was before Rails 2). But I didn't care about any of that. I wanted an easy, elegant way to make my computer work for me, to write scripts, create libraries, analyze data, and build Web applications. I wanted to move quickly, get things done. I was tired of spending too much time on boilerplate code and not enough time solving real problems.

(Update: As of 2016, my needs have shifted in the other direction, and I'm much more in favor of statically typed systems like Haskell. However, I think it was important for the pendulum to swing in the Ruby direction during the early stages of my career. I also think that Haskell's type system is far better than Java's, which is pretty strongly tied to mutable objects.)

I think the industry still hasn't grasped the elegance of the style of code I'm looking at here. I get an expressive, flexible syntax, almost like Lisp. But in a friendly language that tries, above all else, not to surprise me. I get the ability to interoperate with Unix with just a backtick, but the language also runs on Windows and other platforms. And over the last six years, this power and expressiveness has been invaluable to me. I've learned to build websites using Rack and Sinatra and Rails—in fact, the site you're looking at is powered by Ruby that I wrote—and I feel like an expert at the language.

I'm in the process of building out a couple of open source libraries. And generally, I'm satisfied with the way things are in the Ruby community. Exciting things are happening.

What I'm missing (and it's not the semi-colon)

Okay, I've spent lots of time praising Ruby for being beautiful, expressive and pragmatic. I do love the language and think it makes programming painless in lots of ways. But the Ruby community is not giving me one very important thing, something that's vital for me at work: solid tools for science and statistics. I've already leveraged Ruby at work to dramatically speed up some of our high-throughput experiments (for processing and summarizing data), but I've always had to pipe my numbers into R for statistics and graphing. Because unfortunately, despite all the hubbub around Ruby, no one seems to be crunching numbers with it. I'm not completely comfortable with R, though, and I want a one-stop solution for my numbers needs. So I'm going to do that thing that no Ruby programmer wants to do: I'm going to learn Python.

Now, learning Python is something that is usually frowned upon in the Ruby world, probably just because it's the 'rival language' and is pretty similar. The traditional argument is: "Don't learn Python! It's a dynamically typed, multi-paradigm, interpreted scripting language, just like Ruby! And it's ugly." And all of that is true. But I've found (over four years, and especially this year) that really Ruby is focused on one thing, and that's web development.

I love Web dev, and I've done my fair share of it. But I also have a day job that depends a lot on statistics.

In theory, I could port functions from SciPy and NumPy to Ruby. It's been tried before without success (see the failed SciRuby project) and I'm pretty confident I don't want to go that route. It takes a community, and not just one person, to foster something like that. And I have other things to focus on now.

Instead, I'm going to leverage the huge data ecosystem that's grown around Python and add the language to my résumé. SciPy is unrivaled in the modern programming world, and I plan to embrace it for projects at work. What's more, Python supports higher-order functions like Ruby, so I'm not missing out on all the functional goodness I described earlier.

However, this big change is not without its problems.

Python package management

This past week, I started my foray into the Python world. This involved installing Python 2.7 and 3.1, bookmarking Dive into Python, and figuring out package management.

Er, I guess, by "figuring out", I mean "being completely baffled by".

At the moment, Python package management seems to be fragmented and complicated. I am used to typing something like gem install symbolic when I want to install something on my machine. It's standard, simple, and rarely causes problems.

In Python, though, there seem to be competing managers (easy_install and pip) and separate ways to package libraries for uploading. I'm also hearing names like setuptools, distutils, distribute and virtualenv, and I have to say that the whole ecosystem isn't too clear to me yet. And the documentation tends to assume I know what all the above mean already!

After asking around, I take it I should use pip for package management; apparently pip is the future. In fact, it looks like it's meant to mimic RubyGems. So installing SciPy should be a simple pip install numpy scipy.

Awesome. Easy as Ruby.

But wait. What's this? I see a lot of text moving down my screen. God. My computer is starting to heat up. Now I'm seeing errors all over the place. "You need a FORTRAN compiler. Found gfortran. Installation failed."

Wha?! I mean, I expected C dependencies, sure. I would hate to do math without them. But FORTRAN? Is FORTRAN something we're still installing in 2010?

I'm hoping to get this all sorted out soon and start doing some heavy stats with the new language. I'm excited to be joining a group of people who are focused on data and experiments, and not only HTTP requests, MVC, APIs, jQuery and event processing. That's not a jab at the Rubyists I know. I'm just excited to put a full language (and not just a web framework) to good use.

What to like about Ruby

Now, I want to be clear: I like Ruby better than Python. Its "developer UX" makes more sense to me than Python's, and the language itself seems more expressive. I love that objects know how to `map` or `filter` themselves, which leads to elegant chaining. I love that blocks unify Ruby's closures, anonymous functions, and iteration. That Ruby has such versatile syntax that it can masquerade as C or Perl or Scheme (more on that some other time). That when I code in Ruby, it does everything the way I would expect it to, without reading documentation.
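A tiny sketch of the chaining I have in mind, with invented sample data. Each call returns a new array that knows how to do the next step, and the first block closes over a local variable (min_length).

```ruby
words = ['ruby', 'is', 'beautiful', 'but', 'python', 'has', 'scipy']
min_length = 4 # local variable captured by the block below

result = words
  .select { |w| w.length >= min_length } # keep the longer words
  .map(&:capitalize)                     # transform each one
  .sort                                  # order alphabetically
# => ["Beautiful", "Python", "Ruby", "Scipy"]
```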

But Python and Ruby aren't so different, and I want a strong data community to work with. I don't want to duplicate effort to bring solid libraries to Ruby, and Python seems like the way to go. I might even use this new language to dabble in machine learning using PyBrain and the Natural Language Toolkit, which both, well, rock my socks off. The potential for number crunching in Python seems endless.

Anyway, statistics knowledge is in demand and probably will be in the future, so I'm happy to become competent with these tools. Maybe someday, that will be viable on the Ruby platform—and I look forward to that day. For now, Python is going into my toolbelt. And I welcome the challenge that implies.