Random Thoughts

Samstag, 16. März 2013

Data Integration with XSLT and Java XML Web Services

Introduction

Due to the exponential growth of the Web, social media, and online shopping, data is increasing rapidly within internet businesses. A large part of this data is XML data. XSLT -- besides XQuery, SAX, Stax, and similar technologies -- is the technology of choice when integrating XML data. XSLT is often considered to have poor performance when compared to technologies that keep only very little information in memory, such as SAX. Yet the expressiveness of XSLT and its design as an XML transformation language still make it superior to less declarative approaches such as XML programming apis for general purpose programming languages. This blog post shows how some performance problems of XML transformations can be overcome by calling XML web services from XSLT.

My own experience is that the performance of XSLT transformations accessing large XML documents can be increased by a factor of 10 by having a web service supply fragments of those large documents. Saving large documents within a database with the appropriate indexes, or keeping them as a Map within memory, will result in logarithmic access time to those fragments. Not doing so will make XSLT search the entire document, which is linear runtime.

Calling XML Web Services from XSLT

One of the week points of XSLT is that it cannot directly access databases -- although some proprietary extensions supply this functionality. As a result XSL transformations often hold large XML documents within memory, and joins between elements of different XML sources must be computed without access to indexes or hash maps. Yet what is often overseen is that the XSLT document() primitive can access XML data not only from disk or from static XML documents, but can also call Web services, which in turn can access databases and return the retrieved data to the XSL transformation. Here is a simple example:

<xsl:stylesheet version="2.0"

xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/">

<xsl:copy-of

select="document(

'http://myserivce.example.com/get?id=1111'

)" />

</xsl:template>

</xsl:stylesheet>

The above XSLT will call an XML Web service at at the given URI. This web service can do anything that the programming language it is written in, can do. In particular it can access databases and also return XML fragments which are directly included in the XML result.

Java Web Services with Java and JAXB

Java is possibly the most frequently used language to provide XML Web services. This section briefly describes how one can use JAXB to turn a simple Java Servlet into an XML Web Service. Using an XML technology such as JAXB for generating the XML makes sense in that it is a higher level language for generating XML, and thus less error-prone, than Java itself. This is achieved in three small steps:

Create a Java class (entity) that holds the data to be serialized,
insert the appropriate annotations for generating the XML tags for the properties, and
serialize that class using a JAXB context to the servlet response.

Here is an example class annotated with JAXB annotations (steps 1 and 2) that could be used by the Web service:

@XmlRootElement(name="Person")
public class Person {

/* Default constructor required by JAXB */
public Person() { }

public Person(String name, int age) {
this.name = name;
this.age = age;
}

@XmlElement
private String name;

@XmlElement
private Integer age;

/* Getters and Setters omitted for the sake of brevity */
}

When serialized using JAXB, this will yield an XML document of the following form:

<?xml version="1.0" ?>
<Person>
<name>Some Body</name>
<age>4</age>
</Person>

See the JAXB documentation on how to generate attribute values, or serialize lists or embedded entities. Any kind of XML document can be generated with JAXB.

Here is example code on how to serialize a Person entity within a servlet:

public void doGet(
HttpServletRequest req,
HttpServletResponse resp) throws Exception {

JAXBContext contextA =
JAXBContext.newInstance(Person.class);
Marshaller m = contextA.createMarshaller();
m.marshal(new Person("Some Body", 4), resp.getWriter());

}

Mittwoch, 13. Februar 2013

Johnson's Circuit Finding Algorithm in Java

Johnson's circuit finding algorithm is used to efficiently list all elementary circuits within a directed graph. It is described in the paper "Finding all the elementary circuits in a directed graph". The algorithm relies on Tarjan's algorithm for finding all strongly connected components of a graph.

I have written an implementation of this algorithm in Java under the Apache License. The implementation can be found on github. The implementation relies on the Jung2 graph API. I hope this may be useful to somebody.

Sonntag, 30. Dezember 2012

Declarative Crawling with Java, HtmlCleaner and XPath

Importance of Crawling

The Web is arguably the largest source of information humanity had ever access to. Yet it obviously cannot be viewed as a database or a structured source of information. Rather the Web is highly unstructured, and difficult to use as a source of information for computers. Finding relevant information on the Web requires knowledge about how the Web is structured. Moreover, the information usually needs to be interpreted by humans.

Yet in many cases we can teach crawlers to find the information we are interested in, and to save the relevant information to some structured database. This blog post is about how HTMLCleaner and XPath can be put to use to teach Java crawlers to extract relevant information from web sites and save it to local data or knowledge bases. Use cases for this technique are the following:

Extract all addresses and office hours for dentists for a given city from yellow pages.
Extract all German cities and zip codes from Wikipedia articles.
Extract all English nouns from Wiktionary.
Extract information about real estate (size, price, address, number of rooms, etc) from a site offering real estate for save or rent.

Declarativity in Crawling

Declarative programming languages are used to formulate problems in an easy to understand and concise way. In contrast to procedural or imperative languages they do not determine the exact way of how a problem is solved, or how a solution is found, but leave the details to the machine which is executing the code. Examples for declarative languages include Prolog for logic programming, SQL for database access, and XPath for XML querying. When parsing XML or XHTML data we have the choice between a plethora of languages. While SAX is a very popular way for consuming large amounts of XML data, it is not declarative and thus harder to read, understand and maintain than XML queries written in XPath. On the other hand, when using SAX, the programmer has more control about how the information is extracted and thus can do complex code optimizations.

This blog post assumes that, since crawlers must be frequently adapted to the changing structure of web sites, the declarative way of specifying XML data extraction programs in XPath is in most cases superior to the more verbose, procedural -- or event based -- approach of SAX. Moreover the bottleneck in information extraction on the Web is in many cases the Web server supplying the information -- therefore efficiency for small personal crawlers is of secondary importance.

Code Sample

Cutting a long story short, XPath is the right technology for small crawlers that must be adapted frequently and where performance is secondary. So how can we use XPath for processing HTML that in the majority of cases is not standards compliant -- i.e. may have missing starting or closing tags, invalid html characters or undefined entity references? HtmlCleaner (and also other technologies such as JTidy) comes to the rescue. It can be used to convert HTML that may cause tens or hundreds of errors when run through the W3C validation service into tidy, well-formed XHTML. And since XHTML is also well-formed XML, XPath can be used to extract data from it.

The following is a small example showing how to extract the name of all languages mentioned on the front page of Wikipedia. Most of the following lines are boilerplate code and can/should be factored out into library functions. This way, the variable part of the crawler -- i.e. the part of the code that needs to be adapted in the case that the structure of the website changes -- is reduced to the single XPath expression (highlighted green).

HtmlCleaner cleaner = new HtmlCleaner();

CleanerProperties props = cleaner.getProperties();
TagNode node = cleaner.clean(new URL("http://www.wikipedia.org"));

ByteArrayOutputStream out = new ByteArrayOutputStream();
new SimpleXmlSerializer(props).writeToStream(node, out);
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(
new ByteArrayInputStream(out.toByteArray()));

XPathFactory xpf = XPathFactory.newInstance();
XPath xpath = xpf.newXPath();
XPathExpression xpe =
xpath.compile("//a[@class='link-box']/strong/text()");

NodeList list = (NodeList) xpe.evaluate(doc, XPathConstants.NODESET);
for (int i = 0; i < list.getLength(); i ++) {
Node n = list.item(i);
System.err.println(n.getNodeValue());
}

Sonntag, 9. Dezember 2012

Quality of Service in MongoDB

Introduction

MongoDB is a document oriented, horizontally scalable database with javascript as its primary query language. Documents are stored in JSON format. MongoDB excels at running parallel jobs touching vast amounts of data. Its speed is achieved by technologies such as

map-reduce
memory-mapped files
superior indexing technology and
absence of schema.

Obviously this strength comes at the price of some weaknesses such as

no support for foreign key constraints
no support for database triggers
no built-in support for joins
no guaranteed Quality of Service.

This blog post will quickly explain the fourth drawback (Quality of Service) and how to deal with it in MongoDB.

The Quality of Service Problem in MongoDB

Typically MongoDB is used to run jobs that process, analyze or transform large amounts of data, possibly in parallel on multiple sharded hosts. But MongoDB can also be used to serve as a data storage for a user-application that requires short response times such that the user does not need to wait for the system -- e.g if a webapp uses MongoDB to store user data.

If the same mongo database is used for both kinds of tasks described above, then jobs that process large amounts of data can get in the way of small read or write request that an application issues against the database. To the best of my knowledge there is no way to tell mongo to reserve a given share of CPU usage for small jobs and to "renice" long running, CPU and memory intensive jobs, like you would usually do on a UNIX or Linux operating system. This will cause applications that require fast database response times to become unusable when those long running jobs are active.

Dealing with the QoS Problem in MongoDB

A solution to this problem is to have different mongodb instances for both kinds of tasks. But what if there are jobs that need to process the data that is written to the database from an application? Then we can still have two separate instances of mongo, but the data needs to be copied from the database that receives the data from the application to the database that is used for the analysis, aggregation or cleaning of that data. In many cases the results of these jobs needs to be copied back to the application database, such that the application can read and display the results. Copying the data between the databases has significantly less impact on the responsiveness of the database that serves the application, than running the processing jobs on this database in the first place.

Still, inserting large amounts of results into the application database at once can again cause this database to hang, especially if the database is part of a replica set that needs to be kept up to date.

The following function can thus be used to copy data matching the query query from a collection named coll from a database named source (e.g. "example1.org:27017/source") to a database named destination (e.g. "example2.org:27017/destination"), thereby only copying batches of size pageLength at a time, and sleeping sleepTime milliseconds between batches. Varying sleepTime allows to control the performance impact on the database, and thus to keep the database responsive while data analytics jobs are running.

function copyPaged(query, pageLength, sleepTime, coll) {

var count = db[coll].find(query).count();

var pageStart = 0;

while ((pageStart + pageLength) < count) {

log("pageStart: " + pageStart);

db = connect(source);

page = db[coll].find(query).

skip(pageStart).limit(pageLength);

db = connect(destination);

while (page.hasNext()) {

db[coll].insert(page.next());

}

pageStart+=pageLength;

sleep(sleepTime);

}

Donnerstag, 18. Oktober 2012

Functional Programming Misunderstandings

While talking with friends and colleagues I noticed that there are (at least) three widespread misunderstandings associated with the term "functional programming". In this post I will state those misunderstandings and explain for each one of them, why it is not true.

The primary utility of functional programming is the ability to pass functions as parameters to other functions (higher order programming; functions as first-class citizens).
Functional programming cannot be achieved in imperative programming or object oriented programming languages such as Java, Python or C.
Functional programming is just of academic interest and quite useless in practice for its inability to manipulate state.

The following paragraphs explain what is wrong with the statements above.

Primary Utility of Functional Programming

While passing functions as parameters is very beneficial to write clean and short code by using higher order functions such as fold, filter and map [2] known from functional languages such as Haskell or Scheme, it is not the most important aspect of functional programming.

Instead the most fundamental and beneficial part of functional programming is the isolation of side effects and the explicit flow of information over function parameters and function return values. This isolation of side effects makes code easier to understand, refactor and maintain. Moreover it allows for easier parallelization, which cannot be overestimated using the emergence of multicore processors and multi-machine clusters over the past years. This isolation of side effects also reduces the use of global variables, which are considered evil when they are used in cases where local variables can be used [1].

Functional Programming in Non-Functional Languages

While most imperative and object oriented programming languages, functions cannot be used as values (parameters and return values of other functions), higher order programming can be mimicked using function objects, i.e. objects with only a single method. The Guava Java library [5] adds support for higher order programming by adding function objects, and custom implementations for map and filter.

Yet not only higher order programming can be achieved by using function objects but side effects can be isolated just in the same way as functional programming languages enforce isolation of side effects by not supplying a global state for a program under execution. By reducing the number of functions that modify their parameters or modify global variables, we can make programs just as easy to refactor and reason about as programs written in functional languages are.

Use of Functional Programming in Industry

While originally invented as a research language, Haskell has gained significantly more users in industry over time. The Haskell Wiki has an impressive list of applications that were written in Haskell [3]. Also functional programming (i.e. reduction of side effects, use of pure functions) in imperative or object oriented languages, and using higher order programming in Ruby, Javascript, Scala, etc to remove boiler-plate code is getting more and more popular since the advantages of this coding style have been accepted by mainstream programmers.

References

[1] Global variabels are Evil

[2] Higher Order Functions

[3] Haskell in Industry

[3] Map Reduce
[4] Side Effects
[5] Guava library

Dienstag, 16. Oktober 2012

Copying Data between Mongo Databases Using Javascript only

MongoDB is a very popular noSql database with built in replication, horizontal scalability (sharding), Json as native data format and Javascript as the native query language. This blog post deals with copying data between mongo databases.

There are various methods for copying data between different MongoDB databases. The following possibilities come to my mind:

Using mongodump and mongorestore. This method is very fast and can be used to copy individual collections or the entire database from the command line.
Using mongoexport and mongoimport. This can be used to export data selected by a custom query or entire collections from the command line.
Using some programming language of your choice with a mongo driver.
Using the mongo shell and the use <database> or the db = connect("...")primitive.

The last solution is very convenient in that it can be used directly from Javascript, and gives you control such as counting the number of documents to copy. Below is an example use case:

> mongo localhost/test

PRIMARY> db.bla.find()

{ "_id" : ObjectId("507d19c4ce52c92d953fe8a1"), "1" : "one" }

{ "_id" : ObjectId("507d19cece52c92d953fe8a2"), "2" : "two" }

PRIMARY> var docs = db.bla.find()

PRIMARY> docs.size()

PRIMARY> use test2

switched to db test2

PRIMARY> while(docs.hasNext()) { db.blo.insert(docs.next()); }

PRIMARY> db.blo.find()

{ "_id" : ObjectId("507d19c4ce52c92d953fe8a1"), "1" : "one" }

{ "_id" : ObjectId("507d19cece52c92d953fe8a2"), "2" : "two" }

PRIMARY>

The above session shows how to interactively copy data between two databases on the same host. Note that this can also be done for databases on different machines or databases running on different ports of the same machine. In this case you would need to switch the connection using db = connect("www.example.org:7777/mydb") or similar instead of the use primitive.

Also note that you cannot use the iterator ( docs.find()) before the data is inserted into the target collection, otherwise you will loose data.

Montag, 23. Mai 2011

Extracting PDF Bookmark Information with itext

iText is an open source Java and C# library for manipulation of PDFs, such as building PDF documents from scratch or insertion of new content into an existing PDF. Yet it can also be used to simply extract information from PDF documents. While the API seems to be well-structured and easy to understand, good documentation on iText is rare. This short blog post shows how to extract bookmark information from PDF documents using iText, which turns out to be quite simple once you know which classes to use.

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.SimpleBookmark;
import java.io.IOException;
import java.util.List;
import java.util.HashMap;
import java.io.FileWriter;

public class Test {

  public static void main(String [] args) throws IOException {
    PdfReader pr = new PdfReader("myDocument.pdf");
    List<HashMap<string,object> > bookmarks = SimpleBookmark.getBookmark(pr);
    FileWriter fw = new FileWriter("myBookmarks.xml");
    SimpleBookmark.exportToXML(bookmarks, fw, "utf8", false);
  }

}