Importance of Crawling
The Web is arguably the largest source of information humanity has ever had access to. Yet it cannot be treated as a database or any other structured source of information. Rather, the Web is highly unstructured and difficult for computers to use as a source of information. Finding relevant information on the Web requires knowledge of how the Web is structured. Moreover, the information usually needs to be interpreted by humans.
Yet in many cases we can teach crawlers to find the information we are interested in and to save it to a structured database. This blog post shows how HtmlCleaner and XPath can be used to teach Java crawlers to extract relevant information from web sites and save it to local data or knowledge bases. Use cases for this technique include the following:
- Extract all addresses and office hours for dentists for a given city from yellow pages.
- Extract all German cities and zip codes from Wikipedia articles.
- Extract all English nouns from Wiktionary.
- Extract information about real estate (size, price, address, number of rooms, etc.) from a site offering real estate for sale or rent.
Declarativity in Crawling
Declarative programming languages are used to formulate problems in a concise and easy-to-understand way. In contrast to procedural or imperative languages, they do not determine exactly how a problem is solved or how a solution is found, but leave those details to the machine executing the code. Examples of declarative languages include Prolog for logic programming, SQL for database access, and XPath for XML querying. When parsing XML or XHTML data we can choose from a plethora of technologies. While SAX is a very popular way of consuming large amounts of XML data, it is not declarative, and SAX code is thus harder to read, understand and maintain than XML queries written in XPath. On the other hand, SAX gives the programmer more control over how the information is extracted and thus leaves room for complex code optimizations.
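To make the contrast concrete, here is a minimal sketch that pulls the same link texts out of a tiny XHTML string twice: once with a single XPath expression and once with a hand-written SAX handler. The sample document, the class name XPathVsSax and the class attribute value "lang" are all made up for this illustration, and the two variants are only claimed to agree on this sample. It uses nothing beyond the JDK.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class XPathVsSax {
    // Hypothetical sample document, invented for this comparison.
    static final String XHTML =
        "<html><body>"
        + "<a class=\"lang\"><strong>English</strong></a>"
        + "<a class=\"other\"><strong>Donate</strong></a>"
        + "<a class=\"lang\"><strong>Deutsch</strong></a>"
        + "</body></html>";

    // Declarative: a single XPath expression states WHAT we want.
    static List<String> withXPath(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate("//a[@class='lang']/strong/text()", doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getNodeValue());
        }
        return result;
    }

    // Event-based: we track by hand WHERE we are in the document.
    static List<String> withSax(String xml) throws Exception {
        List<String> result = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            boolean inLangLink = false;
            boolean inStrong = false;
            final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals("a")) {
                    inLangLink = "lang".equals(atts.getValue("class"));
                } else if (qName.equals("strong") && inLangLink) {
                    inStrong = true;
                    text.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text may arrive in several chunks, so accumulate it.
                if (inStrong) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("strong") && inStrong) {
                    result.add(text.toString());
                    inStrong = false;
                } else if (qName.equals("a")) {
                    inLangLink = false;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(withXPath(XHTML));
        System.out.println(withSax(XHTML));
    }
}
```

If the page layout changes, the XPath variant needs a new expression string, whereas the SAX variant needs its state handling reworked.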
This blog post assumes that, since crawlers must frequently be adapted to the changing structure of web sites, the declarative way of specifying XML data extraction in XPath is in most cases superior to the more verbose, procedural -- or event-based -- approach of SAX. Moreover, the bottleneck in information extraction on the Web is in many cases the Web server supplying the information -- therefore efficiency is of secondary importance for small personal crawlers.
Code Sample
Cutting a long story short, XPath is the right technology for small crawlers that must be adapted frequently and where performance is secondary. So how can we use XPath for processing HTML that in the majority of cases is not standards compliant -- i.e. may have missing opening or closing tags, invalid HTML characters or undefined entity references? HtmlCleaner (like other tools such as JTidy) comes to the rescue. It can convert HTML that may cause tens or hundreds of errors when run through the W3C validation service into tidy, well-formed XHTML. And since XHTML is also well-formed XML, XPath can be used to extract data from it.
The following is a small example showing how to extract the names of all languages mentioned on the front page of Wikipedia. Most of the following lines are boilerplate code and can/should be factored out into library functions. This way, the variable part of the crawler -- i.e. the part of the code that needs to be adapted if the structure of the website changes -- is reduced to the single XPath expression passed to xpath.compile.
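// Imports needed by the snippet; the statements belong in a method that throws Exception.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.net.URL;
import javax.xml.parsers.*;
import javax.xml.xpath.*;
import org.htmlcleaner.*;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Fetch the page and let HtmlCleaner repair the markup.
HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
TagNode node = cleaner.clean(new URL("http://www.wikipedia.org"));
// Serialize the cleaned tree as XML and parse it into a W3C DOM.
ByteArrayOutputStream out = new ByteArrayOutputStream();
new SimpleXmlSerializer(props).writeToStream(node, out);
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(
    new ByteArrayInputStream(out.toByteArray()));
// The only site-specific part: the XPath expression.
XPathFactory xpf = XPathFactory.newInstance();
XPath xpath = xpf.newXPath();
XPathExpression xpe =
    xpath.compile("//a[@class='link-box']/strong/text()");
NodeList list = (NodeList) xpe.evaluate(doc, XPathConstants.NODESET);
for (int i = 0; i < list.getLength(); i++) {
    Node n = list.item(i);
    System.err.println(n.getNodeValue());
}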
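The factoring into library functions suggested above could look roughly like the following sketch. The class XPathExtractor and its extract method are hypothetical names, not part of HtmlCleaner or the JDK. To keep the example runnable with the JDK alone, it assumes the cleaning step has already happened and takes well-formed XHTML as a string; the sample markup is made up and only loosely modeled on the structure the XPath expression above expects.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathExtractor {
    // Hypothetical sample, standing in for the output of the cleaning step.
    static final String SAMPLE =
        "<html><body>"
        + "<a class=\"link-box\" href=\"//en.wikipedia.org/\"><strong>English</strong></a>"
        + "<a class=\"link-box\" href=\"//de.wikipedia.org/\"><strong>Deutsch</strong></a>"
        + "</body></html>";

    // Hypothetical helper: evaluates an XPath expression against well-formed
    // XHTML and returns the text content of every matching node.
    static List<String> extract(String xhtml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(expression, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // getTextContent works for text, attribute and element nodes alike.
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Only the expression strings are specific to the site being crawled.
        System.out.println(extract(SAMPLE, "//a[@class='link-box']/strong/text()"));
        System.out.println(extract(SAMPLE, "//a[@class='link-box']/@href"));
    }
}
```

With a helper of this shape, adapting the crawler to a changed page layout means editing the expression strings and nothing else.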