Regular Expressions: curl Simplifies Web Retrieval January 2002
by Cameron Laird and Kathryn Soraiz

One of the most frequent requests we receive is to automate operations on existing Web applications. In many cases, the best answer involves curl.

Descended from httpget

"curl is a tool for transferring files with URL syntax, supporting FTP, FTPS, HTTP, HTTPS, GOPHER, TELNET, DICT, FILE and LDAP", as stated on its home page at http://curl.haxx.se/. More precisely, curl is a portable command-line executable for convenient Web retrieval, along with an associated library, libcurl.

It's important to distinguish curl clearly from an entirely different project with the same name. Curl Corporation (http://www.curl.com/) licenses products related to its curl Web content language. This curl language was first developed at MIT and enjoys the involvement of Web founder Tim Berners-Lee. "Regular Expressions" will probably report on curl -- the language -- during the next year.

This column, however, is about a different curl. In 1997, Daniel Stenberg, a consultant based just north of Stockholm, wanted to make currency-exchange calculations available to Internet Relay Chat (IRC) users. All the necessary data are published on the Web; he just needed to automate their retrieval. This task is an interesting one in several regards. As 2002 opens, it's hard not to think of it in terms of formal Web Services (WS), where currency-market reporting is one of the canonical examples used to introduce newcomers to WS concepts. "Regular Expressions" will feature WS several times during the coming year.

Few current WS enthusiasts seem to know about Stenberg and his work. His approach was more direct and lightweight than the committee-based WS standards. He simply adopted an existing command-line open-source tool, httpget, that Brazilian programmer Rafael Sagula had written. After a few minor adjustments, it did just what he needed.

... until he ran into a Gopher-based currency Web site. Gopher is a peer to the Hypertext Transfer Protocol (HTTP); that is, Gopher is a standard that specifies certain kinds of networked information sharing, similar to HTTP but distinct from it. Before long, Stenberg augmented httpget with knowledge of both Gopher and the File Transfer Protocol (FTP), and found himself in charge of what he began to call urlget.

Another name change decorated a follow-up release, and, on 20 March 1998, Stenberg announced the birth of curl -- at version 4. Often written as "curl," the label glosses as "client for URLs." Since its first publication, curl has continued to grow, and its current source base is about an order of magnitude larger than it was three years ago. During that time, Stenberg and the other volunteer curl contributors have completely replaced the original httpget source; no code remains from that time.

One of the most important organizational changes during this growth has been a shift in emphasis from curl to libcurl. Original curl users primarily wanted it for straightforward automations that they launched from simple shell scripts. Lately, developers have been binding libcurl's C-coded functionality to a variety of other languages, including several of the scripting languages "Regular Expressions" regularly surveys: Dylan, Perl, PHP, Python, Rexx, Ruby, Scheme, and Tcl. Let's look at the specific benefits libcurl brings to these languages.

Networking, but Easier

Recent versions of all these languages build in networking abilities. There are abundant examples, both on the Web and in printed books, of how to write correct client-side socket-based programs. What does libcurl contribute?

PHP is a convenient example of the answer. Since version 4.0.2, PHP has featured a built-in curl package. This package has given developers three related advantages over "raw" PHP network programming:

  1. It operates at a higher level.
  2. It robustly solves all sorts of "little" problems that are tedious for an application developer.
  3. It tracks standards developments more closely than most programming teams can afford to.

What we mean by the first of these is simply that libcurl is targeted at efficient, practical information retrieval. Standard language libraries are more likely to operate at the lower level of socket connections. Programs that exploit libcurl reduce their line count. Retrieving the contents of an arbitrary URL into a PHP variable requires only the following:

     <?php
       //
       // Adapted from
       //     http://curl.haxx.se/libcurl/php/examples/?ex=getpageinvar.php
       //
       $ch = curl_init();                            // start a curl session
       $sample_URL = "http://www.example.com/";
       curl_setopt($ch, CURLOPT_URL, $sample_URL);   // set the target URL
       curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the page rather than printing it
       $result = curl_exec($ch);                     // $result now holds the page contents
       curl_close($ch);
     ?>

Contrast this simplicity with all the "fiddly" detail necessary to code for a URL that might specify either HTTP or FTP.
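To make that contrast concrete, here is a hypothetical sketch -- in Python, one of the scripting languages with libcurl bindings mentioned above -- of just the scheme dispatch a hand-rolled retriever must perform before a single byte moves. The function is an illustration only; the branch bodies stand in for the real protocol code, all of which libcurl handles internally behind a single call.

```python
from urllib.parse import urlparse

def fetch(url):
    """Dispatch on the URL scheme -- the first of many chores
    a hand-rolled retriever faces and libcurl hides."""
    scheme = urlparse(url).scheme.lower()
    if scheme in ("http", "https"):
        # Real code would open a socket, send a request line and
        # headers, then parse status lines, redirects, chunking...
        return "HTTP code path"
    if scheme == "ftp":
        # Real code would manage a control connection, log in,
        # negotiate passive mode, then open a data connection...
        return "FTP code path"
    raise ValueError("unsupported scheme: " + scheme)
```

Each branch then fans out into its own thicket of protocol detail; with libcurl, the caller never sees any of it.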

That's not all. libcurl gains even more fans when users begin to deal with proxies, authentication, and all the other complexities HTTP has accumulated over the past decade. In principle, it's not hard to figure out the protocols and write these functions in low-level code. libcurl is for people who have more important things to do with their time.

Caltech LIGO Laboratory Senior Software Engineer Phil Ehrens, for example, prefers curl to "rolling his own." For him, use of "curl is better because it is more mature. More people have done more with it and it is pretty darn solid. [It was] written as an API for content gathering."

curl's Achievements

The accomplishments of curl and libcurl are indeed impressive. It's not just that they solve the details of, say, redirected cookie-requiring HTTP login in a theoretical way. Working, well-maintained applications like GotMail (http://www.hawkins.emu.id.au/gotmail/) demonstrate curl's power in action. GotMail automates Hotmail message retrieval. We've worked in this area before, and it's a sticky problem -- one that GotMail solves well.

Stenberg collects other examples of curl and libcurl use at http://curl.haxx.se/curlprograms.html.

Hitting the Sweet Spot

One more comparison is instructive. libwww (http://www.w3.org/Library/) is a well-known library with many external similarities to libcurl. libwww is the model "general-purpose client side Web API" of the World Wide Web Consortium. Like libcurl, it knows about digest authentication, PUT, POST, and much more.

Unlike libwww, though, curl handles all URLs, not just those based on HTTP and HTTPS. Moreover, curl is available for more than just Unix and Windows, where libwww lives. curl builds and runs correctly under BeOS, QNX, Amiga, and several other operating systems. This is possible, in part, because libcurl is simpler. libwww parses HTML, is implemented with a choice of threading models, connects to external databases, and lists several other features beyond libcurl's scope. As Stenberg puts it, "libcurl is mainly for doing transportation. I've drawn the line there. Things that are inside the files, in the data, are outside what curl reads and understands."

An apparent consequence of libwww's ambition is fragility. It's an enormous library, and suffers from several outstanding known faults. Moreover, programming with libwww often doesn't feel as productive as programming with libcurl; libwww generally requires more lines of client code than a libcurl-based program does. A Debian information page (http://packages.debian.org/unstable/non-us/libwww-ssl-dev.html) summarizes: "It is quite difficult to write programs that use libwww, due to its complexity. If you do not need its advanced features, have a look at libcurl." The libwww maintainers are even considering a rewrite that would build on a base of libcurl.

That's our conclusion, too: you should have a look at libcurl. One implicit continuing theme of "Regular Expressions" involves the engineering of judgment and balance: how does the high-level expressivity of scripting languages change the way we best re-use existing code and algorithms? curl and libcurl seem to have located a sweet spot: not so low-level and tedious as straight socket programming, not so complex as libwww, but at a favorable place somewhere between the two.

Acknowledgments and Administrative News

Our thanks to Andrés Garcia for helping introduce us to curl, and to all the curl and libcurl maintainers.

UnixReview.com has agreed to double the "Regular Expressions" publishing schedule. In 2002, the column will be a monthly. See you in February.

Copyright © 2003 UnixReview.com