An engine that checks the syntax of HTML files and converts them into XML format, served along with an XML crawler.
The syntax checker includes 2 modules:
- `XmlSyntaxChecker` refines attributes inside opening tags and makes the document well-formed by adding missing opening/closing tags.
- `EntitySyntaxChecker` finds all entities that are not defined for XML and replaces all ampersand characters `&` with the ampersand notation `&amp;`.
The Syntax Checker engine works as a language resolver.
Instead of using flags to mark the language's syntax, the resolver uses a state-machine mechanism: whenever it hits a particular character, the engine changes into the corresponding state.
Therefore, there is only one `state` variable to manage (instead of a bunch of flags), and the game is much easier.
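To illustrate the single-state-variable idea, here is a minimal sketch that collects opening tag names from an HTML string. It is not the engine's actual code: the class `TagNameScanner`, the method `openingTagNames`, and the four states are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

class TagNameScanner {
    // Hypothetical states; the real checker's states are not shown in this README.
    enum State { CONTENT, TAG_NAME, IN_TAG, QUOTED_VALUE }

    static List<String> openingTagNames(String html) {
        List<String> names = new ArrayList<>();
        StringBuilder name = new StringBuilder();
        State state = State.CONTENT; // the only variable that tracks where we are
        char quote = 0;              // which quote character opened the current value

        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            switch (state) {
                case CONTENT:
                    if (c == '<') {
                        name.setLength(0);
                        state = State.TAG_NAME;
                    }
                    break;
                case TAG_NAME:
                    if (Character.isLetterOrDigit(c)) {
                        name.append(c);
                    } else {
                        // An empty name means a closing tag, comment, or declaration.
                        if (name.length() > 0) {
                            names.add(name.toString());
                        }
                        state = (c == '>') ? State.CONTENT : State.IN_TAG;
                    }
                    break;
                case IN_TAG:
                    if (c == '"' || c == '\'') {
                        quote = c;
                        state = State.QUOTED_VALUE;
                    } else if (c == '>') {
                        state = State.CONTENT;
                    }
                    break;
                case QUOTED_VALUE:
                    // A '>' inside a quoted attribute value does not end the tag.
                    if (c == quote) {
                        state = State.IN_TAG;
                    }
                    break;
            }
        }
        return names;
    }
}
```

Because every character is interpreted according to the current state, a `>` inside a quoted attribute value is ignored without any extra flag.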
When it starts crawling, the `Crawler` sequentially performs these steps:
- Read the rules provided in an XML file () or `Rules` objects.
- Fetch HTML content as text (based on crawling rules) and convert it into XML format (using `XmlSyntaxChecker` and `EntitySyntaxChecker`).
- Parse the document into a DOM tree using a DOM parser and start extracting data based on extraction rules.
The extracted data is presented as a list of `Map<String, String>`.
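The last step can be sketched with the JDK's built-in DOM parser. This is only an illustration under assumptions: the class `DomExtractionSketch` and the rule "every child element of an item becomes one map entry" are hypothetical, since the format of the real extraction rules is not described here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class DomExtractionSketch {
    // Parses well-formed XML and turns each <itemTag> element into one map,
    // keyed by child element name (a hypothetical extraction rule).
    static List<Map<String, String>> extract(String xml, String itemTag) {
        Document doc;
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }

        List<Map<String, String>> result = new ArrayList<>();
        NodeList items = doc.getElementsByTagName(itemTag);
        for (int i = 0; i < items.getLength(); i++) {
            Map<String, String> object = new LinkedHashMap<>();
            NodeList children = items.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                // Skip whitespace text nodes; keep only element children.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    object.put(child.getNodeName(), child.getTextContent().trim());
                }
            }
            result.add(object);
        }
        return result;
    }
}
```

The DOM parser rejects input that is not well-formed, which is why the HTML has to pass through the syntax checkers first.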
When initializing `Crawler`, we need to pass a class that implements `CrawlerResultProcessor` to process the result list after the crawl finishes.
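// BookProcessor is a user-defined class that implements CrawlerResultProcessor
BookProcessor processor = new BookProcessor();
Crawler<BookProcessor> crawler = new Crawler<>();
crawler.setResultProcessor(processor);
...
// start crawling; results are delivered to the processor
crawler.crawl();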
The `CrawlerResultProcessor` has 3 levels of processing crawled data:
- Object: called each time a discrete object has been successfully extracted; provided in `processResultObject(Map<String, String> object)`
- Fragment: called each time all objects in one HTML page have been extracted (usually 20-50 objects); provided in `processResultFragmentList(List<Map<String, String>> list)`
- List: called only after the whole process finishes, with a list of all objects as the final result; provided in `processResultList(List<Map<String, String>> list)`
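A processor covering the three levels might look like the sketch below. The interface is redeclared here only as a stand-in so the example compiles on its own; the `void` return types and the fields of `BookProcessor` are assumptions, not taken from the library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Stand-in for the library interface; method names come from the list above,
// return types are assumed to be void.
interface CrawlerResultProcessor {
    void processResultObject(Map<String, String> object);
    void processResultFragmentList(List<Map<String, String>> list);
    void processResultList(List<Map<String, String>> list);
}

class BookProcessor implements CrawlerResultProcessor {
    int objectCount = 0;   // objects seen one at a time
    int pageCount = 0;     // pages (fragments) completed
    List<Map<String, String>> finalResult = new ArrayList<>();

    @Override
    public void processResultObject(Map<String, String> object) {
        // Object level: react to each extracted item, e.g. log or validate it.
        objectCount++;
    }

    @Override
    public void processResultFragmentList(List<Map<String, String>> list) {
        // Fragment level: a natural point to save one page's worth of items.
        pageCount++;
    }

    @Override
    public void processResultList(List<Map<String, String>> list) {
        // List level: keep a copy of the complete result once crawling is done.
        finalResult = new ArrayList<>(list);
    }
}
```

Handling data at the fragment level keeps partial results if a long crawl is interrupted, while the list level is convenient when everything must be processed together.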
`Crawler` uses a static variable `STOP` to control the workflow. To stop the process immediately, simply set it to `false`:
Crawler.STOP = false;
Once the crawler has stopped, you must set it back to `true` before starting any other crawling process.
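The flag's semantics are easy to get backwards, since `true` means "keep running". The sketch below imitates the behavior with a stand-in class; `StopFlagSketch`, its `crawl` method, and the `stopAfter` parameter are hypothetical and only model how a loop could honor such a flag, not how `Crawler` is implemented.

```java
import java.util.List;

class StopFlagSketch {
    // Stand-in for Crawler.STOP: true means "keep running", false means "stop now".
    static volatile boolean STOP = true;

    // Visits pages until the list ends or the flag is cleared.
    // stopAfter simulates an external request to stop once that page is done.
    static int crawl(List<String> pages, String stopAfter) {
        int processed = 0;
        for (String page : pages) {
            if (!STOP) {
                break; // flag was cleared, leave immediately
            }
            processed++;
            if (page.equals(stopAfter)) {
                STOP = false;
            }
        }
        return processed;
    }
}
```

If the flag is left at `false`, the next run exits before visiting any page, which is why it has to be reset to `true` first.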