Word Count

There's a little command line utility on *nix which I use a lot - it's wc or "word count". This is especially useful to me because I live in a world where everything is plain text right up until I have to send it to someone else (and sometimes not even then). Despite its name, word count can count more than just words - it can do characters, words, lines and can tell you the length of the longest line while it's at it.
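To make those modes concrete, here's a rough sketch using a throwaway file (sample.txt is a name I've made up for the example). It assumes GNU wc for the -L option, which is an extension rather than part of POSIX, so it may be missing on other systems.

```shell
# Build a tiny two-line sample file in a temp directory (hypothetical name)
tmpdir=$(mktemp -d)
printf 'one two three\nfour five\n' > "$tmpdir/sample.txt"

# Reading from stdin keeps the file name out of the output
lines=$(wc -l < "$tmpdir/sample.txt")    # count of newline-terminated lines
words=$(wc -w < "$tmpdir/sample.txt")    # count of whitespace-separated words
chars=$(wc -m < "$tmpdir/sample.txt")    # count of characters
bytes=$(wc -c < "$tmpdir/sample.txt")    # count of bytes
longest=$(wc -L < "$tmpdir/sample.txt")  # GNU only: length of the longest line

echo "lines=$lines words=$words chars=$chars bytes=$bytes longest=$longest"

# Tidy up the temp directory
rm -r "$tmpdir"
```

Run with no options at all, wc prints lines, words and bytes in that order.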

Counting Lines

The biggest problem with counting lines is remembering the name of the utility, since it's called "word count" and not "line count". I tend to use this for doing things like piping grep to wc and counting the lines to give me an idea of how many occurrences of something there are. I also use it to count errors in weblogs, or really anything else that could do with summarising. The syntax is something like:

grep -R TODO * | wc -l

Using a count like this is especially good for things like auditing code, where I need to know how prevalent something is - or refactoring, where I'm looking for how many of a particular pattern are outstanding. Counting lines is also very compatible with my habit of making lists in text files.
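The weblog case might look something like the sketch below. The log file and its simplified "method path status" format are invented for illustration, so the pattern would need adjusting for a real access log.

```shell
# Write a made-up, simplified access log (hypothetical file and format)
tmpdir=$(mktemp -d)
printf '%s\n' \
  'GET /index.php 200' \
  'GET /missing 404' \
  'GET /about.php 200' \
  'GET /gone 404' \
  'POST /login.php 500' > "$tmpdir/access.log"

# grep keeps only the lines ending in a 404 status, wc -l counts them
not_found=$(grep ' 404$' "$tmpdir/access.log" | wc -l)
echo "404 responses: $not_found"

# Tidy up the temp directory
rm -r "$tmpdir"
```

The same shape works for a list in a text file: grep for the marker you care about, then count what comes out.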

Counting Words

This is the feature that the utility was originally designed for, and as you can imagine, it's pretty good at that. As with most things, this blog post started life as a text file and when I got to this point I saved it and ran:

wc -w wc_article.txt

It outputs the number of words (272) and the name of the file, which is useful if you're giving it a pattern to match.
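Here's a small sketch of the pattern case, with two made-up draft files. When wc is given more than one file, it prints a line for each and then a final "total" line.

```shell
# Create two small drafts in a temp directory (hypothetical file names)
tmpdir=$(mktemp -d)
printf 'a short draft\n' > "$tmpdir/first.txt"
printf 'another slightly longer draft\n' > "$tmpdir/second.txt"

# The shell expands *.txt, so wc sees both files and labels each count
report=$(cd "$tmpdir" && wc -w *.txt)
echo "$report"

# The last line of the report carries the combined count
total=$(echo "$report" | tail -n 1 | awk '{print $1}')
echo "combined words: $total"

# Tidy up the temp directory
rm -r "$tmpdir"
```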

Word Count

It's a really convenient and versatile little program; I use it often and I hope others will find it useful too.

4 thoughts on “Word Count”

  1. Unix (or gnu utils for that matter) are built on the premise: do one thing, and do it well. wc is a very nice tool for counting, however, in practice there are others that might be even more important when dealing with text, numbers and reporting.

    For instance, I want to count the number of times each variable in a file is called, nicely sorted and all:

    cat test.php | grep -o -E '\$[A-Za-z0-9]+' | sort -r | uniq -c | sort -r -n

    Display test.php and pipe it to grep, which only outputs the strings that match a $ followed by letters or digits; basically, this is a variable reference in PHP. After this, sort all output (this is needed for uniq), then pass the sorted output through uniq, which counts the number of times duplicates are found (it stops counting when it reaches another value). It will display the count in front of the value. After that, the last thing it does is pass all output to sort again, but this time sorting it in descending order by numeric value (so the counts are compared as 1, 2, 10, 11 rather than as the text order 1, 10, 11, 2, with the largest first).

    At the end, it will display a list like:

    23 $this
    12 $a
    4 $key
    4 $val
    3 $i

    Although there are other ways of achieving the same output, it shows you how much you can do with basically 3 "simple" gnu commands.

    My advice anyway, if you're going to work on a cli, learn your commands! (tip: look at find and xargs. It won't prepare your dinner, but it can pretty much do anything else)

  2. @Lorna If you don't want to remember "word count" for doing a "line count", you can do your own alias:
    alias lc='wc -l'
    I use lots of aliases to speed things up :)

    @Joshua nice one line trick!

  3. Joshua: that's a blog post in itself, thanks for contributing so nicely to my blog :)

    minterior: I tend not to use aliases, because I use lots of different machines and they never have the alias I'm looking for - great tip though, thanks!

  4. The grep utility actually has line counting built in, so instead of doing:

    shell> grep -R TODO * | wc -l

    You could do the following instead (note that this prints one count per file rather than a single total):

    shell> grep -cR TODO *

    Although for some reason I still naturally do the former. :)
