Tuesday, 13 January 2015

Open Source Web Crawlers Written in Java

I was recently quite pleased to learn that the Internet Archive’s new crawler is written in Java. Coincidentally, having already put together a list of open source full-text search engine projects, I compiled a list of crawlers written in Java to complement it. Here’s the list:
  1. Heritrix – Heritrix is the Internet Archive’s open-source, extensible, web-scale, archival-quality web crawler project. Heritrix is designed to respect the robots.txt exclusion directives and META robots tags.
  2. WebSPHINX – WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that browses and processes Web pages automatically. WebSPHINX consists of two parts: the Crawler Workbench and the WebSPHINX class library.
  3. Nutch – Nutch provides a transparent alternative to commercial web search engines. As of June, 2003, we have successfully built a 100 million page demo system. It uses Lucene for indexing but provides its own crawler implementation.
  4. WebLech – WebLech is a fully featured web site download/mirror tool in Java, which supports many features required to download websites and emulate standard web-browser behaviour as much as possible. WebLech is multithreaded and will feature a GUI console.
  5. Arale – While many bots around are focused on page indexing, Arale is primarily designed for personal use. It fits the needs of advanced web surfers and web developers.
  6. J-Spider – Based on the book “Programming Spiders, Bots and Aggregators in Java”. This book begins by showing how to create simple bots that will retrieve information from a single website. Then a spider is developed that can move from site to site as it crawls across the Web. Next we build aggregators that can take data from many sites and present a consolidated view.
  7. HyperSpider – HyperSpider (Java app) collects the link structure of a website. Data import/export from/to database and CSV-files. Export to Graphviz DOT, Resource Description Framework (RDF/DC), XML Topic Maps (XTM), Prolog, HTML. Visualization as hierarchy and map.
  8. Arachnid – Arachnid is a Java-based web spider framework. It includes a simple HTML parser object that parses an input stream containing HTML content. Simple Web spiders can be created by sub-classing Arachnid and adding a few lines of code called after each page of a Web site is parsed.
  9. Spindle – Spindle is a web indexing/search tool built on top of the Lucene toolkit. It includes an HTTP spider that is used to build the index, and a search class that is used to search the index. In addition, support is provided for the Bitmechanic listlib JSP TagLib, so that a search can be added to a JSP-based site without writing any Java classes.
  10. Spider – Spider is a complete standalone Java application designed to easily integrate varied data sources. It is an XML-driven framework for data retrieval from network-accessible sources with scheduled pulling. It is highly extensible, provides hooks for custom post-processing and configuration, and is implemented as an Avalon/Keel framework datafeed service.
  11. LARM – LARM is a 100% Java search solution for end-users of the Jakarta Lucene search engine framework. It contains methods for indexing files, database tables, and a crawler for indexing web sites. Well, it will be. At the moment we only have some specifications. It’s up to you to turn this into a working program. Its predecessor was an experimental crawler called larm-webcrawler available from the Jakarta project.
  12. Metis – Metis is a tool to collect information from the content of web sites. This was written for the Ideahamster Group for finding the competitive intelligence weight of a web server and assists in satisfying the CI Scouting portion of the Open Source Security Testing Methodology Manual (OSSTMM).
  13. SimpleSpider – The simple spider is a real application that provides the search capability for DevelopMentor’s web site. It is also an example application for classroom use in learning about open source programming with Java.
  14. Grunk – Grunk (for GRammar UNderstanding Kernel) is a library for parsing and extracting structured metadata from semi-structured text formats. It is based on a very flexible parsing engine capable of detecting a wide variety of patterns in text formats and extracting information from them. Formats are described in a simple and powerful XML configuration from which Grunk builds a parser at runtime, so adapting Grunk to a new format does not require a coding or compilation step. Not really a crawler, but something that may prove extremely useful in crawling.
  15. CAPEK – CAPEK is an Open Source robot entirely written in Java. It gathers web pages for EGOTHOR in a sophisticated way. The pages are ordered by their pagerank, stability of the connection between Capek and the respective web-site, and many other factors.
  16. Aperture – Aperture crawls information systems such as file systems, websites, mail boxes and mail servers. It can extract full-text and metadata from many common file formats. Aperture has a flexible architecture that can be extended with custom file formats, data sources, etc., with support for deployment on OSGi platforms.
  17. Smart and Simple Web Crawler – A framework that crawls a web site, with integrated Lucene support. It supports two crawling modes, Max Iterations and Max Depth, and provides a filter interface to limit the links to be crawled. Filters can be combined with AND, OR and NOT.
  18. Web Harvest – Web-Harvest collects Web pages and extracts useful data from them. It leverages technologies for text/XML manipulation such as XSLT, XQuery and regular expressions. Web-Harvest mainly focuses on HTML/XML-based web sites. However, it can be extended by custom Java libraries to augment its extraction capabilities.
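
Several entries above mention link filters that can be combined with AND, OR and NOT (item 17 in particular). To give a feel for the idea, here is a minimal sketch built on the standard java.util.function.Predicate type. It is not the API of any crawler in this list: the class name LinkFilters, the helper methods and the example.com URLs are all made up for illustration.

```java
import java.util.List;
import java.util.function.Predicate;

// Illustrative only: LinkFilters and its helpers are hypothetical names,
// not part of any crawler listed above.
public class LinkFilters {

    /** Accepts URLs that begin with the given prefix. */
    public static Predicate<String> startsWith(String prefix) {
        return url -> url.startsWith(prefix);
    }

    /** Accepts URLs that end with the given suffix. */
    public static Predicate<String> endsWith(String suffix) {
        return url -> url.endsWith(suffix);
    }

    /**
     * Builds the combined filter:
     * stay on the site AND NOT (a PDF OR under /private/).
     */
    public static Predicate<String> exampleFilter() {
        Predicate<String> onSite = startsWith("http://example.com/");
        Predicate<String> isPdf = endsWith(".pdf");
        Predicate<String> isPrivate = startsWith("http://example.com/private/");
        return onSite.and(isPdf.or(isPrivate).negate());
    }

    public static void main(String[] args) {
        Predicate<String> filter = exampleFilter();
        List<String> candidates = List.of(
                "http://example.com/index.html",
                "http://example.com/doc.pdf",
                "http://example.com/private/notes.html",
                "http://other.org/");
        // Print which candidate links a crawler would be allowed to follow.
        for (String url : candidates) {
            System.out.println(url + " -> " + filter.test(url));
        }
    }
}
```

A crawler would apply such a filter to every link it extracts before queueing it. Composing small predicates keeps each rule easy to test on its own.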