Monday, December 14, 2009

CouchDB

CouchDB is a new kind of database. It's not a relational database, and it's not an object database. CouchDB just stores documents. A document is something like a Java map: it has keys and values. Values can be strings, numbers, dates, lists, maps, and also binary attachments. Documents are accessible over plain HTTP, through a REST service, in JSON format. CouchDB is written in Erlang, so it should scale well across many processors/cores. (At least this is what its authors claim ;)) Let's give it a try.
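To make the "document is like a map" idea concrete, here is a small sketch of a document written as a plain JavaScript object and turned into the JSON text that travels over HTTP. All field names and values besides firstname and surname are made up for illustration, and the date is shown as a string on the assumption that it is sent as plain JSON, which has no date type of its own.

```javascript
// A hypothetical document as a plain JavaScript object (illustrative values only).
var exampleDoc = {
  _id: "example-person",              // normally CouchDB generates this
  firstname: "Pawel",
  surname: "Stawicki",
  visits: 3,                          // number
  languages: ["Java", "JavaScript"],  // list
  address: { city: "New York" },      // nested map
  joined: "2009-12-14"                // JSON has no date type, so a string is used here
};

// The JSON text is what would be sent to or received from the database.
var asJson = JSON.stringify(exampleDoc);
var roundTripped = JSON.parse(asJson);

console.log(asJson);
console.log(roundTripped.address.city);
```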

Installation

On Ubuntu, run:

sudo apt-get install couchdb

During installation, a couchdb user is created. Now you can run the database as a background process:

sudo -i -u couchdb couchdb -b

OK, fire up your favorite browser and open the page http://localhost:5984/. You should see something like: {"couchdb":"Welcome","version":"0.10.0"}

Not very useful, is it? Try going to http://localhost:5984/_utils/. This is Futon, the CouchDB admin tool. From here you can create new databases and documents and manage them.

Simple operations in Futon

Create a database named tryme and a document in it. The _id field is generated automatically; add two more fields:

firstname: "Pawel"
surname: "Stawicki"

Click "Save document". You can see that the _rev field was added automatically. Every change to a document creates a new revision, and you can get at older revisions until the database is compacted.

Protection

So there is one document in the database; let's protect it. Open /etc/couchdb/local.ini and edit the [admins] section. Add:

admin = <admin_password>

The password will be hashed automatically the next time CouchDB starts.

If you want to enable access from any machine, change bind_address from 127.0.0.1 to 0.0.0.0.

Operations through the RESTful HTTP interface

CouchDB can be accessed through a RESTful HTTP interface. Curl is a nice tool for this, so install it if you don't have it on your system already:

sudo apt-get install curl

Now issue:

curl http://localhost:5984

You should see the familiar welcome message. Try

curl http://localhost:5984/tryme

and some data about tryme database should appear. We can also list all existing databases:

curl http://localhost:5984/_all_dbs

OK, everything above can also be done in the browser. But you can create documents and databases in CouchDB through the REST interface as well. Try this:

curl -X PUT http://localhost:5984/tryme/1 -d '{ "firstname": "Leszek", "surname": "Gruchała" }'

You just created a new document. The part in curly braces is the document in JSON format. Its id, however, is not a long stream of characters, but just "1": much more predictable and repeatable. UUIDs are generally the better choice, though. CouchDB can generate one for you:

curl http://localhost:5984/_uuids

It generates one id. If you need more, you can pass the "count" parameter:

curl 'http://localhost:5984/_uuids?count=3'
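The PUT request shown earlier with curl can also be sketched in JavaScript. The helper below, buildPutDocument, is a hypothetical name introduced here; it only assembles the URL and request options and does not contact any server, so it runs without CouchDB. Sending is left to whatever HTTP client you use.

```javascript
// Build (but do not send) the pieces of a "create document" request.
// buildPutDocument is an illustrative helper, not part of CouchDB.
function buildPutDocument(baseUrl, dbName, docId, doc) {
  return {
    url: baseUrl + "/" + encodeURIComponent(dbName) + "/" + encodeURIComponent(docId),
    options: {
      method: "PUT",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(doc)   // the document travels as JSON text
    }
  };
}

var request = buildPutDocument("http://localhost:5984", "tryme", "1",
  { firstname: "Leszek", surname: "Gruchała" });

console.log(request.url);          // http://localhost:5984/tryme/1
console.log(request.options.body);

// To actually send it, hand both parts to an HTTP client, for example
// fetch(request.url, request.options) in environments that provide fetch.
```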

So we've seen that CouchDB uses JSON as its document format, and that we can perform operations over HTTP. However, from now on, I'll be using Futon because it is just easier. Futon is the CouchDB administration tool (http://localhost:5984/_utils).

Creating a view with map/reduce

JSON comes from JavaScript, and CouchDB lets you store JavaScript functions, as strings, in documents. Map functions (from map/reduce) are useful for creating something similar to an SQL view.

Let's add another document to our database, with the following fields:

firstname: "Leszek"
surname: "Kowalski"

Now choose "Temporary view" from the selection box on the right. Here you can write map and reduce functions. For a view, map is enough, so let's leave reduce empty for now. Create this function:

function(doc) {
  if (doc.firstname == 'Leszek') {
    emit(doc.firstname, doc.surname);
  }
}

It filters documents to those with firstname "Leszek" only.
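To see what the map function does without a running database, it can be exercised locally. This is only a sketch: the emit function below is a stand-in for the one CouchDB provides to map functions, and the sample documents (including their short ids) are assumptions mirroring the ones created so far.

```javascript
// Stand-in for the emit() that CouchDB makes available to map functions.
var emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// The map function from the text, unchanged apart from having a name.
function mapLeszek(doc) {
  if (doc.firstname == 'Leszek') {
    emit(doc.firstname, doc.surname);
  }
}

// Sample documents; the ids "a" and "b" are hypothetical placeholders.
var people = [
  { _id: "a", firstname: "Pawel", surname: "Stawicki" },
  { _id: "1", firstname: "Leszek", surname: "Gruchała" },
  { _id: "b", firstname: "Leszek", surname: "Kowalski" }
];

// CouchDB calls the map function once per document; we do the same here.
people.forEach(function (doc) { mapLeszek(doc); });

console.log(emitted);
```

Only the two documents with firstname "Leszek" produce rows; the third document emits nothing and therefore does not appear in the view.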

So we can filter documents down to only those that interest us. What about joins? Can we get a document and the data of its related documents in one query? That's also possible. Let's add addresses to those guys in the database. First, we need to distinguish documents describing persons from the ones describing addresses. For that, add a type field to each person, with the value "person". We'll also need their ids, so copy them somewhere.

Create an address as a new document, and add these fields:

type: "address"
person_id: <id_of_person>
city: "New York"

Do the same for the two remaining persons, with a different city for each. OK, now our map function is a little more complicated:

function(doc) {
  if (doc.type == "person") {
    emit([doc._id, 0], doc);
  } else if (doc.type == "address") {
    emit([doc.person_id, 1], doc);
  }
}

This way a person always comes right before her address.

We learn another CouchDB feature here: the result of map is always sorted by key. In this case the key is a 2-element array, in which the first element is the person id, and the second is 0 for a person and 1 for her address. It's not like an SQL join, because we don't get all the data in one document, but it still answers our needs: we have a person and her address.
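The effect of sorting by the two-element key can be reproduced locally. The sketch below applies the same person/address logic to a few documents and sorts the rows with a simplified comparison (first by id, then by the 0/1 marker). CouchDB's real collation rules are richer than this; the comparison here is an assumption that only holds for string ids and small integers. The ids and the second city are hypothetical.

```javascript
// Simplified key comparison: first element (id) as a string, then the 0/1 marker.
function compareKeys(a, b) {
  if (a[0] !== b[0]) return a[0] < b[0] ? -1 : 1;
  return a[1] - b[1];
}

// Same branching as the map function in the text, collecting rows in an array.
function personAddressRows(docs) {
  var rows = [];
  docs.forEach(function (doc) {
    if (doc.type == "person") {
      rows.push({ key: [doc._id, 0], value: doc });
    } else if (doc.type == "address") {
      rows.push({ key: [doc.person_id, 1], value: doc });
    }
  });
  return rows.sort(function (x, y) { return compareKeys(x.key, y.key); });
}

// Documents deliberately listed out of order.
var mixedDocs = [
  { _id: "addr-2", type: "address", person_id: "p2", city: "Boston" },
  { _id: "p1", type: "person", firstname: "Pawel" },
  { _id: "addr-1", type: "address", person_id: "p1", city: "New York" },
  { _id: "p2", type: "person", firstname: "Leszek" }
];

var joined = personAddressRows(mixedDocs);
joined.forEach(function (row) {
  console.log(JSON.stringify(row.key), row.value.type);
});
```

After sorting, each person row is immediately followed by the address row that carries the same id in the first key element.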

If we are interested in a specific person, we can add a parameter to the GET request. But first we need to save our view in a "design document". Click the "Save As..." button, fill in the design document name (e.g. "reader") and the view name (e.g. "person-address"), and voila.

Design documents are special documents in the database for storing views. You can have different design documents, e.g. for readers and for writers or administrators. In a design document you can also set a lot of things that tell CouchDB how to render its content (that is, how to turn it into nice HTML). If you want to learn more about design documents, look into the CouchDB book.

GET parameters for narrowing search

Now that our view is saved, we can use a GET request to fetch its content, and set parameters to limit the result to the person we are interested in. A URL like http://localhost:5984/tryme/_design/reader/_view/person-address?startkey=["1",0]&endkey=["1",1] will return the person with _id=1 and her address documents.

Other useful parameters are:

  • key - for specific key (not range, like in case of startkey, endkey)
  • descending=true - by default view is ordered by ascending key, set this parameter to order it descending
  • group_level - sets how many leading elements of the key array are used for grouping when reducing. Default is 0.
  • group=true - behaves like group_level set to maximum
  • revs_info=true - lists revisions of specified document. Applicable only to document selected by id, not to view
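Since key, startkey and endkey are JSON values, they need to be JSON-encoded and then URL-encoded when the URL is built in code rather than typed into a browser. The helper below, viewUrl, is a hypothetical name used for this sketch; it only builds the string and does not send a request.

```javascript
// Build a view URL with query parameters.
// viewUrl is an illustrative helper, not a CouchDB API.
function viewUrl(baseUrl, db, designDoc, view, params) {
  // These parameters carry JSON values, so they are JSON-encoded first.
  var jsonParams = { key: true, startkey: true, endkey: true };
  var query = Object.keys(params || {}).map(function (name) {
    var value = jsonParams[name] ? JSON.stringify(params[name]) : String(params[name]);
    return encodeURIComponent(name) + "=" + encodeURIComponent(value);
  }).join("&");
  return baseUrl + "/" + db + "/_design/" + designDoc + "/_view/" + view +
    (query ? "?" + query : "");
}

var rangeUrl = viewUrl("http://localhost:5984", "tryme", "reader", "person-address",
  { startkey: ["1", 0], endkey: ["1", 1] });
var groupedUrl = viewUrl("http://localhost:5984", "tryme", "reader", "person-address",
  { group_level: 1 });

console.log(rangeUrl);
console.log(decodeURIComponent(rangeUrl));  // the readable form, as in the text
console.log(groupedUrl);
```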

To understand group_level, we'll need a reduce function. First, let's add some more fields to our persons: "position" and "salary".

To one person add:

position: "manager"
salary: 5000

To the second one:

position: "developer"
salary: 3000

And to the third one:

position: "developer"
salary: 3500

Now create a new view, with map and reduce functions:

//map:
function(doc) {
  // Skip documents without a position (e.g. addresses), which have no salary.
  if (doc.position) {
    emit([doc.position, doc._id], doc);
  }
}

//reduce:
function(keys, values, rereduce) {
  // Each value is a whole document, so sum its salary field.
  var salaries = 0;
  for (var i = 0; i < values.length; i++) {
    salaries += values[i].salary;
  }
  return salaries;
}

Save this view as "salaries". Its URL depends on the design document it was saved in; with a design document named "bla", opening http://localhost:5984/tryme/_design/bla/_view/salaries gives:

    {"key":null,"value":11500}

All values are reduced together, and the key is ignored.

Add a parameter: http://localhost:5984/tryme/_design/bla/_view/salaries?group_level=1 and it returns

{"key":["developer"],"value":6500},
{"key":["manager"],"value":5000}

Values are grouped by the first level of the key (the first element of the key array).

group_level=2 returns

{"key":["developer","61fe9c6c226b978f74b76329191806b3"],"value":3000},
{"key":["developer","eb3873f48bb581df13762324b8ec0313"],"value":3500},
{"key":["manager","1"],"value":5000}

The first two elements of the key array are taken into account. In this case, it is the same as group=true, which takes all elements of the key array.
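The grouping behaviour can be imitated in a few lines of JavaScript. This is a local sketch, not how CouchDB implements it: rows are grouped by the first N elements of their key and each group is passed to the reduce function. The helper name reduceByLevel and the short ids are assumptions, the values carry only the salary field for brevity, and the keys argument is simplified to plain keys.

```javascript
// Rows as the map function would emit them (ids are hypothetical placeholders).
var salaryRows = [
  { key: ["developer", "id-a"], value: { salary: 3000 } },
  { key: ["developer", "id-b"], value: { salary: 3500 } },
  { key: ["manager", "1"], value: { salary: 5000 } }
];

// Reduce function from the text: sums the salary field of each value.
function sumSalaries(keys, values, rereduce) {
  var salaries = 0;
  for (var i = 0; i < values.length; i++) {
    salaries += values[i].salary;
  }
  return salaries;
}

// Group rows by the first `level` key elements, then reduce each group.
// Level 0 puts everything into a single group whose key is null.
function reduceByLevel(rows, reduceFn, level) {
  var groups = {};
  var order = [];
  rows.forEach(function (row) {
    var groupKey = level === 0 ? null : row.key.slice(0, level);
    var tag = JSON.stringify(groupKey);
    if (!groups[tag]) {
      groups[tag] = { key: groupKey, rows: [] };
      order.push(tag);
    }
    groups[tag].rows.push(row);
  });
  return order.map(function (tag) {
    var group = groups[tag];
    var keys = group.rows.map(function (r) { return r.key; });
    var values = group.rows.map(function (r) { return r.value; });
    return { key: group.key, value: reduceFn(keys, values, false) };
  });
}

var level0 = reduceByLevel(salaryRows, sumSalaries, 0);
var level1 = reduceByLevel(salaryRows, sumSalaries, 1);
console.log(JSON.stringify(level0));
console.log(JSON.stringify(level1));
```

With these three rows, level 0 yields the single total of 11500 and level 1 yields 6500 for developers and 5000 for managers, matching the responses shown above.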

What is rereduce for?

As you noticed, the reduce function has a third parameter, rereduce.

The result of the map function is kept as a sorted B-tree. Let's assume we want to sum up some value from each document (tree node), just like the salaries.

First, keys and whole nodes (because the value emitted by our map function is the whole document) are passed to the reduce function, and rereduce is set to false.

Picture the tree: we can reduce one branch into 17, another branch into 4, and a third one into 2. Assume each branch has a common key at the specified group_level (e.g. all people from the first branch are developers, those from the second are managers and those from the third are administrators, if group_level is set to 1).

Then the reduce function is called again, with the following parameters:

function(null, [17, 4, 2], true);

This time keys is null, values holds the already reduced scalar value for each branch, and the rereduce parameter is set to true.

The important thing to remember is that values are not always whole nodes; sometimes they are already reduced values, and the rereduce parameter lets us tell the two situations apart. In fact, our reduce function wouldn't work for a large number of nodes, because it handles only nodes (the rereduce=false case), not reduced values. To work with a lot of data, we should handle scalar values as well:

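function(keys, values, rereduce) {
  if (rereduce) {
    // values are already reduced sums, so just add them up
    return sum(values);
  } else {
    // values are whole documents, so sum their salary fields
    var salaries = 0;
    for (var i = 0; i < values.length; i++) {
      salaries += values[i].salary;
    }
    return salaries;
  }
}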

Now it should count and sum all the salaries.
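The two-pass behaviour can be tried out locally. In this sketch the values are reduced in small chunks with rereduce=false, and the partial sums are then combined with rereduce=true. The chunking helper reduceInChunks is a hypothetical stand-in for what CouchDB does across B-tree nodes, sum is re-implemented here because the built-in helper only exists inside CouchDB's view server, and keys is passed as null in both passes for simplicity.

```javascript
// Stand-in for the sum() helper that CouchDB provides to view functions.
function sum(values) {
  return values.reduce(function (total, v) { return total + v; }, 0);
}

// Reduce function handling both whole documents and already reduced values.
function salaryReduce(keys, values, rereduce) {
  if (rereduce) {
    return sum(values);
  } else {
    var salaries = 0;
    for (var i = 0; i < values.length; i++) {
      salaries += values[i].salary;
    }
    return salaries;
  }
}

// First pass: reduce each chunk of documents (rereduce = false).
// Second pass: combine the partial results (rereduce = true).
function reduceInChunks(values, reduceFn, chunkSize) {
  var partials = [];
  for (var start = 0; start < values.length; start += chunkSize) {
    partials.push(reduceFn(null, values.slice(start, start + chunkSize), false));
  }
  return reduceFn(null, partials, true);
}

var docsWithSalary = [{ salary: 5000 }, { salary: 3000 }, { salary: 3500 }];
var total = reduceInChunks(docsWithSalary, salaryReduce, 2);
console.log(total); // 11500
```

A reduce function that ignored the rereduce flag would try to read a salary field from the partial sums in the second pass and end up with NaN, which is the failure the text describes for larger data sets.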

4 comments:

Koziołek said...

hm... it looks great as a data source for GWT/AJAX applications that use JSON to communicate. But only for storing unimportant data, such as user screen configuration or a cache.

winnetou said...

Excellent post, thorough and precise. Good work!

Anonymous said...

CouchDB is NOT a new kind of database ... a well-known product that has existed since the "middle ages" of computer science, Lotus Notes, has always been a document-oriented database.
