Sunday, December 30, 2012

Declarative Crawling with Java, HtmlCleaner and XPath

Importance of Crawling

The Web is arguably the largest source of information humanity has ever had access to. Yet it obviously cannot be viewed as a database or a structured source of information. Rather, the Web is highly unstructured and difficult for computers to use as a source of information. Finding relevant information on the Web requires knowledge about how the Web is structured. Moreover, the information usually needs to be interpreted by humans.

Yet in many cases we can teach crawlers to find the information we are interested in and to save the relevant parts to a structured database. This blog post shows how HtmlCleaner and XPath can be put to use to teach Java crawlers to extract relevant information from web sites and save it to local data or knowledge bases. Use cases for this technique include the following:

  • Extract the addresses and office hours of all dentists in a given city from the yellow pages.
  • Extract all German cities and zip codes from Wikipedia articles.
  • Extract all English nouns from Wiktionary.
  • Extract information about real estate (size, price, address, number of rooms, etc.) from a site offering real estate for sale or rent.

Declarativity in Crawling

Declarative programming languages are used to formulate problems in a concise and easy-to-understand way. In contrast to procedural or imperative languages, they do not specify exactly how a problem is solved or how a solution is found, but leave those details to the machine executing the code. Examples of declarative languages include Prolog for logic programming, SQL for database access, and XPath for XML querying. When parsing XML or XHTML data we can choose from a plethora of technologies. While SAX is a very popular way of consuming large amounts of XML data, it is not declarative and thus harder to read, understand and maintain than queries written in XPath. On the other hand, when using SAX, the programmer has more control over how the information is extracted and can therefore perform complex code optimizations.
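
To illustrate the difference in verbosity, consider a hypothetical extraction task (the element names book and title and the class name BookTitleHandler below are made up for this sketch): extracting the text of all title children of book elements. In XPath the whole extraction is roughly the single expression //book/title/text(), while a SAX version needs a stateful handler along the following lines:

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Rough SAX counterpart of the XPath query //book/title/text():
// the handler has to keep track of where in the document it currently is.
public class BookTitleHandler extends DefaultHandler {
    private boolean inBook = false;
    private boolean inTitle = false;
    private final StringBuilder title = new StringBuilder();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        if ("book".equals(qName)) {
            inBook = true;
        } else if (inBook && "title".equals(qName)) {
            inTitle = true;
            title.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inTitle) {
            title.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("book".equals(qName)) {
            inBook = false;
        } else if (inTitle && "title".equals(qName)) {
            inTitle = false;
            System.out.println(title);   // one extracted title
        }
    }
}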

This blog post takes the position that, since crawlers must frequently be adapted to the changing structure of web sites, the declarative way of specifying XML data extraction in XPath is in most cases superior to the more verbose, procedural (event-based) approach of SAX. Moreover, the bottleneck in information extraction on the Web is in many cases the Web server supplying the information; efficiency is therefore of secondary importance for small personal crawlers.

Code Sample

To cut a long story short, XPath is the right technology for small crawlers that must be adapted frequently and where performance is secondary. So how can we use XPath to process HTML that in the majority of cases is not standards compliant, i.e. may have missing opening or closing tags, invalid HTML characters or undefined entity references? HtmlCleaner (and also other tools such as JTidy) comes to the rescue. It can convert HTML that produces tens or hundreds of errors when run through the W3C validation service into tidy, well-formed XHTML. And since well-formed XHTML is also well-formed XML, XPath can be used to extract data from it.

The following small example shows how to extract the names of all languages mentioned on the front page of Wikipedia. Most of the lines are boilerplate code and can (and should) be factored out into library functions. This way, the variable part of the crawler, i.e. the part of the code that needs to be adapted if the structure of the website changes, is reduced to the single XPath expression.


// Required imports: org.htmlcleaner.{HtmlCleaner, CleanerProperties, TagNode, SimpleXmlSerializer},
// javax.xml.parsers.*, javax.xml.xpath.*, org.w3c.dom.*, java.io.* and java.net.URL.

// Download the page and clean it up into a well-formed document tree.
HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
TagNode node = cleaner.clean(new URL("http://www.wikipedia.org"));

// Serialize the cleaned tree as XML and re-parse it into a W3C DOM Document.
ByteArrayOutputStream out = new ByteArrayOutputStream();
new SimpleXmlSerializer(props).writeToStream(node, out);
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(
    new ByteArrayInputStream(out.toByteArray()));

// The only site-specific part of the crawler: the XPath expression
// selecting the language names on the Wikipedia front page.
XPathFactory xpf = XPathFactory.newInstance();
XPath xpath = xpf.newXPath();
XPathExpression xpe =
    xpath.compile("//a[@class='link-box']/strong/text()");

// Evaluate the expression and print every matching text node.
NodeList list = (NodeList) xpe.evaluate(doc, XPathConstants.NODESET);
for (int i = 0; i < list.getLength(); i++) {
    Node n = list.item(i);
    System.err.println(n.getNodeValue());
}






Sunday, December 9, 2012

Quality of Service in MongoDB

Introduction

MongoDB is a document-oriented, horizontally scalable database with JavaScript as its primary query language. Documents are stored in a JSON-like binary format (BSON). MongoDB excels at running parallel jobs touching vast amounts of data. Its speed is achieved by technologies such as
  1. map-reduce
  2. memory-mapped files
  3. superior indexing technology and 
  4. absence of schema.
Naturally, these strengths come at the price of some weaknesses, such as
  1. no support for foreign key constraints
  2. no support for database triggers
  3. no built-in support for joins
  4. no guaranteed Quality of Service.
This blog post will quickly explain the fourth drawback (Quality of Service) and how to deal with it in MongoDB.


The Quality of Service Problem in MongoDB

Typically MongoDB is used to run jobs that process, analyze or transform large amounts of data, possibly in parallel on multiple sharded hosts. But MongoDB can also serve as data storage for a user-facing application that requires short response times, so that the user does not have to wait for the system, e.g. if a web app uses MongoDB to store user data.

If the same Mongo database is used for both kinds of tasks described above, jobs that process large amounts of data can get in the way of the small read or write requests that an application issues against the database. To the best of my knowledge there is no way to tell MongoDB to reserve a given share of CPU usage for small jobs and to "renice" long-running, CPU- and memory-intensive jobs, as you would usually do on a UNIX or Linux operating system. This causes applications that require fast database response times to become unusable while those long-running jobs are active.


Dealing with the QoS Problem in MongoDB

A solution to this problem is to use separate MongoDB instances for the two kinds of tasks. But what if there are jobs that need to process the data that an application writes to the database? Then we can still have two separate instances, but the data needs to be copied from the database that receives the data from the application to the database that is used for the analysis, aggregation or cleaning of that data. In many cases the results of these jobs need to be copied back to the application database, so that the application can read and display them. Copying the data between the databases has significantly less impact on the responsiveness of the database that serves the application than running the processing jobs on this database in the first place.

Still, inserting large amounts of results into the application database at once can again cause this database to hang, especially if the database is part of a replica set that needs to be kept up to date.

The following function can thus be used to copy the documents matching query from a collection named coll in a source database source (e.g. "example1.org:27017/source") to a destination database destination (e.g. "example2.org:27017/destination"), copying only batches of pageLength documents at a time and sleeping sleepTime milliseconds between batches. Varying sleepTime lets you control the performance impact on the database, and thus keep the database responsive while data analytics jobs are running.

// source and destination are connection strings, e.g. "example1.org:27017/source".
function copyPaged(query, pageLength, sleepTime, coll, source, destination) {
  // Count the matching documents on the source database.
  db = connect(source);
  var count = db[coll].find(query).count();
  var pageStart = 0;

  while (pageStart < count) {
    print("pageStart: " + pageStart);
    // Read one batch of pageLength documents from the source ...
    db = connect(source);
    var page = db[coll].find(query).
      skip(pageStart).limit(pageLength);
    // ... and insert it into the destination.
    db = connect(destination);
    while (page.hasNext()) {
      db[coll].insert(page.next());
    }
    pageStart += pageLength;
    // Pause between batches to keep the databases responsive.
    sleep(sleepTime);
  }
}

Thursday, October 18, 2012

Functional Programming Misunderstandings

While talking with friends and colleagues I noticed that there are (at least) three widespread misunderstandings associated with the term "functional programming". In this post I will state these misunderstandings and explain, for each of them, why it is not true.
  1. The primary utility of functional programming is the ability to pass functions as parameters to other functions (higher order programming; functions as first-class citizens).
  2. Functional programming cannot be achieved in imperative programming or object oriented programming languages such as Java, Python or C.
  3. Functional programming is only of academic interest and quite useless in practice because of its inability to manipulate state.
The following paragraphs explain what is wrong with the statements above.

Primary Utility of Functional Programming

While passing functions as parameters is very helpful for writing clean and short code using higher-order functions such as fold, filter and map [2], known from functional languages such as Haskell or Scheme, it is not the most important aspect of functional programming.

Instead, the most fundamental and beneficial part of functional programming is the isolation of side effects and the explicit flow of information through function parameters and return values. This isolation of side effects makes code easier to understand, refactor and maintain. Moreover, it allows for easier parallelization, whose importance can hardly be overestimated given the emergence of multicore processors and multi-machine clusters over the past years. Isolating side effects also reduces the use of global variables, which are considered evil when they are used where local variables would do [1].
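
As a minimal, hypothetical Java sketch of this difference (the class and method names are made up): the first method below communicates exclusively through its parameter and return value, while the second leaks its result into shared mutable state, which makes it harder to test, refactor and parallelize.

import java.util.ArrayList;
import java.util.List;

public class SideEffectDemo {

    // Functional style: all information flows in through the parameter and
    // out through the return value; the input list is left untouched.
    static List<Integer> doubledPure(List<Integer> xs) {
        List<Integer> result = new ArrayList<Integer>();
        for (int x : xs) {
            result.add(x * 2);
        }
        return result;
    }

    // Side-effecting style: the result escapes through a global variable,
    // so the method's effect depends on hidden, shared state.
    static List<Integer> results = new ArrayList<Integer>();

    static void doubleIntoGlobal(List<Integer> xs) {
        for (int x : xs) {
            results.add(x * 2);
        }
    }
}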

Functional Programming in Non-Functional Languages

While in most imperative and object-oriented programming languages functions cannot be used as values (i.e. as parameters and return values of other functions), higher-order programming can be mimicked using function objects, i.e. objects with only a single method. The Guava Java library [5] adds support for higher-order programming by providing function objects and implementations of map and filter.
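
As a hypothetical illustration (the class and variable names are made up for this sketch), Guava's Function and Predicate interfaces serve as such function objects, and Lists.transform and Collections2.filter play the roles of map and filter:

import java.util.Arrays;
import java.util.Collection;
import java.util.List;

import com.google.common.base.Function;
import com.google.common.base.Predicate;
import com.google.common.collect.Collections2;
import com.google.common.collect.Lists;

public class GuavaHigherOrder {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("map", "filter", "fold");

        // "map": transform every element using a function object.
        List<Integer> lengths = Lists.transform(words,
            new Function<String, Integer>() {
                public Integer apply(String s) {
                    return s.length();
                }
            });

        // "filter": keep only the elements matching a predicate object.
        Collection<String> longWords = Collections2.filter(words,
            new Predicate<String>() {
                public boolean apply(String s) {
                    return s.length() > 3;
                }
            });

        System.out.println(lengths);    // [3, 6, 4]
        System.out.println(longWords);  // [filter, fold]
    }
}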

Yet function objects are good for more than higher-order programming: side effects can be isolated in the same way that functional programming languages enforce their isolation by not supplying a global state to the program under execution. By reducing the number of functions that modify their parameters or global variables, we can make programs just as easy to refactor and reason about as programs written in functional languages.

Use of Functional Programming in Industry

While originally invented as a research language, Haskell has gained significantly more users in industry over time. The Haskell Wiki has an impressive list of applications written in Haskell [3]. Functional programming in imperative or object-oriented languages (i.e. reducing side effects and using pure functions), as well as higher-order programming in Ruby, JavaScript, Scala, etc. to remove boilerplate code, is also becoming more and more popular as the advantages of this coding style are accepted by mainstream programmers.

References



Tuesday, October 16, 2012

Copying Data between Mongo Databases Using Javascript only

MongoDB is a very popular NoSQL database with built-in replication, horizontal scalability (sharding), JSON as its native data format and JavaScript as its native query language. This blog post deals with copying data between Mongo databases.

There are various methods for copying data between different MongoDB databases. The following possibilities come to my mind:
  • Using mongodump and mongorestore. This method is very fast and can be used to copy individual collections or the entire database from the command line.
  • Using mongoexport and mongoimport. This can be used to export data selected by a custom query or entire collections from the command line.
  • Using a programming language of your choice with a MongoDB driver (see the Java sketch after this list).
  • Using the mongo shell and the use <database> or the db = connect("...") primitive.
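
As a minimal sketch of the driver-based option, here is what copying a collection could look like with the legacy MongoDB Java driver; the hosts are placeholders, and the database and collection names are borrowed from the shell session below.

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.MongoClient;

public class CopyCollection {
    public static void main(String[] args) throws Exception {
        // Connect to the source and the destination database (hypothetical hosts).
        MongoClient sourceClient = new MongoClient("example1.org", 27017);
        MongoClient destClient = new MongoClient("example2.org", 27017);
        DBCollection source = sourceClient.getDB("test").getCollection("bla");
        DBCollection destination = destClient.getDB("test2").getCollection("blo");

        // Stream all documents matching the (empty) query into the destination.
        DBCursor cursor = source.find(new BasicDBObject());
        while (cursor.hasNext()) {
            destination.insert(cursor.next());
        }

        sourceClient.close();
        destClient.close();
    }
}
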
The last option is very convenient in that it can be used directly from JavaScript and gives you fine-grained control, for example for counting the number of documents to copy. Below is an example session:

> mongo localhost/test
PRIMARY> db.bla.find()
{ "_id" : ObjectId("507d19c4ce52c92d953fe8a1"), "1" : "one" }
{ "_id" : ObjectId("507d19cece52c92d953fe8a2"), "2" : "two" }
PRIMARY> var docs = db.bla.find()
PRIMARY> docs.size()
2
PRIMARY> use test2
switched to db test2
PRIMARY> while(docs.hasNext()) { db.blo.insert(docs.next()); }
PRIMARY> db.blo.find()
{ "_id" : ObjectId("507d19c4ce52c92d953fe8a1"), "1" : "one" }
{ "_id" : ObjectId("507d19cece52c92d953fe8a2"), "2" : "two" }
PRIMARY>

The above session shows how to interactively copy data between two databases on the same host. Note that this can also be done for databases on different machines or databases running on different ports of the same machine. In that case you would switch the connection using db = connect("www.example.org:7777/mydb") or similar instead of the use primitive.

Also note that you must not consume the cursor (docs) before the data is inserted into the target collection; otherwise you will lose data.