Not many people know the powerful command line options that Ruby understands. They really demonstrate how Ruby drew inspiration from Perl and is a great tool for general-purpose command line scripting.
I’ve prepared a short refactoring story to demonstrate how might use some of the
options at our disposal. Ruby can do more than I can show you here with this
example, so be sure to check out its manpage using man ruby
. Note: I have also
given a presentation on this subject at Eindhoven.rb.
Let’s say we are given the task to update some data files that we need to use on a project. The data kind of looks like CSV but contains some other stuff as well. We need to filter out some sales records based on country, which is listed as one of the fields per row. Here’s a sample from the file we are dealing with:
% wc -l
1005 data.csv
% head data.csv
# Copyright 2014 Acme corp. All rights reserved.
#
# Please do not reproduce this file without including this notice.
# ===============================================================
Name,Partner,Email,Title,Price,Country
Nikolas Hamill,Emely Langosh Sr.,nash@moen.info,Awesome Wooden Computer,42261,Puerto Rico
Friedrich Zboncak MD,Ms. Trycia Sporer,nils@treutelrodriguez.name,Sleek Wooden Hat,35701,Suriname
Marcus Nicolas,Margot Hoppe,maeve@hilll.info,Rustic Steel Shoes,40258,Argentina
Toni Ernser I,Guillermo Kihn II,clara.marvin@west.net,Sleek Cotton Pants,68332,Turks and Caicos Islands
Mayra Kerluke DDS,Marvin Lynch,sydni.schuppe@schuster.com,Incredible Steel Gloves,47017,New Zealand
Of course there are many ways to deal with data like this, including using CSV from Ruby’s standard library. Let’s assume we can’t use that option and we need to implement something manually. Here’s how such a script might look:
#!/usr/bin/env ruby -w
# This tranforms input files that look like CSV and strips comments and
# filters out every line not about "Suriname".
# Define some basic variables that control how records and fields
# are defined.
input_record_separator = "\n"
field_separator = ','
output_record_separator = "\n"
output_field_separator = ';'
filename = ARGV[0]
File.open(filename, 'r+') do |f|
# Read the entire contents of the file in question
# in an input array.
input = f.readlines(input_record_separator)
output = ''
# Loop over all the lines in the file with a counter
input.each_with_index do |last_read_line, i|
# Remove the ending newline from the line for easier
# processing.
last_read_line.chomp!(input_record_separator)
# Extract all fields in this record.
fields = last_read_line.split(field_separator)
# Only proceed for non-comment lines about Suriname
if fields[5] == 'Suriname' && !(last_read_line =~ /^# /)
# Write the output lines including the line number
# and combine fields using our custom separator
fields.unshift i
output << fields.join(output_field_separator)
output << output_record_separator
end
end
# Rewind back to the start of the file and replace all its
# contents with the content in `output`.
f.rewind
f.write output
f.flush
f.truncate(f.pos)
end
This is definitely not some of the best code I have every written, but it gets the job done. It takes the first command line argument as a file name, reads the file and then loops over all the lines in it to filter out just what we want. For the lines we want, it appends a new line to a special output string which gets written to the same file once all lines have been processed.
I can use the program as follows:
% chmod +x filter_sales
% ./filter_sales data.csv
Using default globals
In order to optimise this program we can first switch to using some built-in global variables. In order to clarify their names, we require the english library:
#!/usr/bin/env ruby -w
require 'english'
$INPUT_RECORD_SEPARATOR = "\n"
$FIELD_SEPARATOR = ','
$OUTPUT_RECORD_SEPARATOR = "\n"
$OUTPUT_FIELD_SEPARATOR = ';'
filename = ARGV[0]
File.open(filename, 'r+') do |f|
input = f.readlines(input_record_separator)
output = ''
input.each_with_index do |last_read_line, i|
$LAST_READ_LINE = last_read_line
$INPUT_LINE_NUMBER = i
$LAST_READ_LINE.chomp!($INPUT_RECORD_SEPARATOR)
$F = $LAST_READ_LINE.split($FIELD_SEPARATOR)
if $F[5] == 'Suriname' && !($LAST_READ_LINE =~ /^# /)
$F.unshift $INPUT_LINE_NUMBER
output << $F.join($OUTPUT_FIELD_SEPARATOR)
output << $OUTPUT_RECORD_SEPARATOR
end
end
f.rewind
f.write output
f.flush
f.truncate(f.pos)
end
These global variables are used by Ruby itself and are the first step in shrinking our code.
Using default values
As these global variables are used by Ruby internally, they mostly ship with sensible default values. Also they are used as defaults in sensible locations. We can therefore reduce our code like so:
#!/usr/bin/env ruby -w
require 'english'
$FIELD_SEPARATOR = ','
$OUTPUT_RECORD_SEPARATOR = "\n"
$OUTPUT_FIELD_SEPARATOR = ';'
filename = ARGV[0]
File.open(filename, 'r+') do |f|
input = f.readlines
output = ''
input.each_with_index do |last_read_line, i|
$LAST_READ_LINE = last_read_line
$INPUT_LINE_NUMBER = i
$LAST_READ_LINE.chomp!
$F = $LAST_READ_LINE.split
if $F[5] == 'Suriname' && !($LAST_READ_LINE =~ /^# /)
$F.unshift $INPUT_LINE_NUMBER
output << $F.join
output << $OUTPUT_RECORD_SEPARATOR
end
end
f.rewind
f.write output
f.flush
f.truncate(f.pos)
end
We have gotten rid of some arguments and the declaration of
$INPUT_RECORD_SEPARATOR
. We can also use IO#print
, which will join multiple
arguments together using $OUTPUT_FIELD_SEPARATOR
. It will also include a
$OUTPUT_RECORD_SEPARATOR
if it is not nil
.
#!/usr/bin/env ruby -w
require 'english'
$FIELD_SEPARATOR = ','
$OUTPUT_RECORD_SEPARATOR = "\n"
$OUTPUT_FIELD_SEPARATOR = ';'
filename = ARGV[0]
File.open(filename, 'r+') do |f|
input = f.readlines
f.rewind
input.each_with_index do |last_read_line, i|
$LAST_READ_LINE = last_read_line
$INPUT_LINE_NUMBER = i
$LAST_READ_LINE.chomp!
$F = $LAST_READ_LINE.split
if $F[5] == 'Suriname' && !($LAST_READ_LINE =~ /^# /)
$F.unshift $INPUT_LINE_NUMBER
f.print *$F
end
end
f.flush
f.truncate(f.pos)
end
This change helped us get rid of the output
variable. Next, rather than
reading the entire file into a single input
array, we can read it line by line
using a while
loop:
#!/usr/bin/env ruby -w
require 'english'
$FIELD_SEPARATOR = ','
$OUTPUT_RECORD_SEPARATOR = "\n"
$OUTPUT_FIELD_SEPARATOR = ';'
filename = ARGV[0]
File.open(filename, 'r+') do |f|
while f.gets
$LAST_READ_LINE.chomp!
$F = $LAST_READ_LINE.split
if $F[5] == 'Suriname' && !($LAST_READ_LINE =~ /^# /)
$F.unshift $INPUT_LINE_NUMBER
f.print *$F
end
end
end
We can now use IO#gets
to read a line from our file, and automatically set
$LAST_READ_LINE
and $INPUT_LINE_NUMBER
. We have lost our ability to re-write
the entire file though, so we’ll need to bring that back somehow. Luckily, we
can.
Reading and editing files in-place
By using the -n
and -i
flags, we can let Ruby read through our file using
IO#gets
and let IO#print
write straight back into the file. The -i
optionally takes a file extension to create a backup file, but omitting it skips
the backup file altogether. Let’s rewrite our program by letting Ruby use these
two flags.
#!/usr/bin/env ruby -w -n -i
require 'english'
BEGIN {
$FIELD_SEPARATOR = ','
$OUTPUT_RECORD_SEPARATOR = "\n"
$OUTPUT_FIELD_SEPARATOR = ';'
}
$LAST_READ_LINE.chomp!
$F = $LAST_READ_LINE.split
if $F[5] == 'Suriname' && !($LAST_READ_LINE =~ /^# /)
$F.unshift $INPUT_LINE_NUMBER
print *$F
end
The -n
flag wraps the script in a while gets ... end
loop. In order to set
our field and record separator variables, we need a BEGIN { ... }
block that
gets called at the start of the program – wherever it is defined in the source
code. Our calls to IO#print
now default to our single open file and the -i
flag handles writing our output back into the original file.
Also note there is a -p
flag that works mostly the same as -n
, but includes
a print $_
statement at the end of the loop. It will read and then print every
line in the file, allowing you to either skip lines using next
or modify the
current line before printing. But for now, we’ll stick with -n
.
Configuring variables using command-line options
We can use more command line switches to set some values in our program:
#!/usr/bin/env ruby -w -n -i -F, -l
require 'english'
BEGIN {
$OUTPUT_FIELD_SEPARATOR = ';'
}
$F = $LAST_READ_LINE.split
if $F[5] == 'Suriname' && !($LAST_READ_LINE =~ /^# /)
$F.unshift $INPUT_LINE_NUMBER
print *$F
end
Using the -F
flag we can specify the value for $INPUT_FIELD_SEPARATOR
, and
with -l
we can tell Ruby to assign the value of $INPUT_RECORD_SEPARATOR
to
$OUTPUT_FIELD_SEPARATOR
and remove $INPUT_FIELD_SEPARATOR
from the
$LAST_READ_LINE
using String#chomp!
. That means the input records separator
is removed when reading lines (which is what we want) and added when writing
lines (which is also what we want). Removing newlines from the input line helps
prevent double newlines in output lines.
Now, let’s use Ruby’s auto-splitting feature using -a
:
#!/usr/bin/env ruby -w -n -i -F, -l -a
require 'english'
BEGIN {
$OUTPUT_FIELD_SEPARATOR = ';'
}
if $F[5] == 'Suriname' && !($LAST_READ_LINE =~ /^# /)
$F.unshift $INPUT_LINE_NUMBER
print *$F
end
With -a
Ruby will automatically split the current line into $F
on every
iteration. Now we are getting somewhere.
Compare against current line
Ruby provides us with one more special shortcut we can use once the -n
(or
-p
) flag is used: in conditionals, regular expressions implicitly match
against the value of $LAST_READ_LINE
and ranges of numbers against the value
of $INPUT_LINE_NUMBER
. With this knowledge we can simplify our conditional:
#!/usr/bin/env ruby -w -n -i -F, -l -a
require 'english'
BEGIN {
$OUTPUT_FIELD_SEPARATOR = ';'
}
unless $F[5] != 'Suriname' || /^# /
$F.unshift $INPUT_LINE_NUMBER
print *$F
end
Shortening the code
Now we’ve got all the parts in place we can shorten our code a bit by making our
conditional a one-liner and removing the english
library and switch to the
abbreviated Perl-y global variable names:
#!/usr/bin/env ruby -w -n -i -F, -l -a
BEGIN { $, = ';' }
print $., *$F unless $F[5] != 'Suriname' || /^# /
Conclusion
So, yes, we’ve basically built a half-assed implementation of Awk in Ruby. If you know Awk (and you should!) you might as well use that. But chances are you know Ruby better. Once you get comfortable with these command line flags, Ruby becomes a very nice tool in your sysadmin toolbelt. You might write a simple script like this straight on the command line like so:
ruby -wlani -F, -e "BEGIN { $, = ';' }" -e "print $., *$F unless $F[5] != 'Suriname' || /^# /"
…or you might use some fancier tools, such as quickly parsing some YAML:
ruby -r yaml -e 'puts YAML.load(ARGF)["database"]' config/database.yml
Sometimes, the methods Awk or Sed give you are best suited for what you need to do. But sometimes you need something more, or you just don’t care to look up how to perform certain operations you know how to do in Ruby in some other language. You should always use the right tool for the job, and given Ruby’s flexibility I think it may surprise you how often that tool is Ruby.