XML Syntax Checker

An engine that checks the syntax of HTML files and converts them into well-formed XML, bundled with an XML crawler.

I. Syntax Checker

The syntax checker consists of 2 modules:

XmlSyntaxChecker normalizes the attributes inside opening tags and makes the document well-formed by adding missing opening and closing tags.
EntitySyntaxChecker finds every entity that is not defined in XML and replaces its ampersand character & with the ampersand entity &amp;

Philosophy

The Syntax Checker engine works as a language resolver. Instead of using flags to track its position in the language's syntax, the resolver uses a state machine: whenever it reads a particular character, the engine transitions to the corresponding state. As a result, there is only one state variable to manage (instead of a bunch of flags), which makes the logic much easier to follow.
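The single-state-variable idea can be illustrated with a minimal sketch. This is not the project's EntitySyntaxChecker: the class name `EntityEscaperSketch`, its two states, and the rule for what counts as a valid character inside an entity name are all assumptions made for illustration. It keeps the five entities predefined by XML and numeric character references, and escapes every other ampersand.

```java
import java.util.Set;

// Hypothetical sketch, not the project's EntitySyntaxChecker.
final class EntityEscaperSketch {

    // One state variable replaces a collection of boolean flags.
    private enum State { TEXT, IN_ENTITY }

    // The five entities predefined by XML 1.0.
    private static final Set<String> XML_ENTITIES =
            Set.of("amp", "lt", "gt", "quot", "apos");

    static String escape(String input) {
        StringBuilder out = new StringBuilder();
        StringBuilder name = new StringBuilder(); // characters read after '&'
        State state = State.TEXT;

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (state) {
                case TEXT:
                    if (c == '&') {
                        state = State.IN_ENTITY; // a candidate entity starts
                        name.setLength(0);
                    } else {
                        out.append(c);
                    }
                    break;
                case IN_ENTITY:
                    if (c == ';') {
                        // Entity closed: keep it if XML defines it, else escape the '&'.
                        String n = name.toString();
                        out.append(isXmlEntity(n) ? "&" : "&amp;").append(n).append(';');
                        state = State.TEXT;
                    } else if (Character.isLetterOrDigit(c) || c == '#') {
                        name.append(c);
                    } else if (c == '&') {
                        // The previous '&' was bare; this one starts a new candidate.
                        out.append("&amp;").append(name);
                        name.setLength(0);
                    } else {
                        // Any other character ends the candidate: the '&' was bare.
                        out.append("&amp;").append(name).append(c);
                        state = State.TEXT;
                    }
                    break;
            }
        }
        if (state == State.IN_ENTITY) {
            out.append("&amp;").append(name); // input ended inside a candidate
        }
        return out.toString();
    }

    private static boolean isXmlEntity(String name) {
        return XML_ENTITIES.contains(name)
                || name.matches("#[0-9]+|#x[0-9a-fA-F]+");
    }
}
```

With this sketch, `EntityEscaperSketch.escape("Tom & Jerry")` yields `Tom &amp; Jerry`, and `&nbsp;` becomes `&amp;nbsp;` because XML does not define it.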

XML State Diagram

(Diagram: see XMLSyntax.svg in the repository.)

Entity State Diagram

(Diagram: see EntitySyntax.svg in the repository.)

II. Crawler

Mechanism

When crawling starts, the Crawler performs these steps in order:

Read the rules provided in an XML file () or in Rules objects.
Fetch the HTML content as text (based on the crawling rules) and convert it into XML (using XmlSyntaxChecker and EntitySyntaxChecker).
Parse the document into a DOM tree using a DOM parser and extract data based on the extraction rules.
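The last step can be sketched with the JDK's built-in DOM and XPath APIs. This is only an illustration under assumptions: the class `ExtractionSketch`, its `extract` method, and the use of XPath expressions as extraction rules are hypothetical and do not describe the project's actual rule format. The input is assumed to be well-formed XML already, as it would be after the syntax checkers have run.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical sketch; the project's own rule format is not shown here.
final class ExtractionSketch {

    // itemPath selects one node per object; fieldPaths maps a field name
    // to an XPath expression evaluated relative to that node.
    static List<Map<String, String>> extract(
            String xml, String itemPath, Map<String, String> fieldPaths) {
        try {
            // Parse the well-formed XML text into a DOM tree.
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList items =
                    (NodeList) xpath.evaluate(itemPath, doc, XPathConstants.NODESET);

            List<Map<String, String>> result = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Node item = items.item(i);
                Map<String, String> object = new LinkedHashMap<>();
                for (Map.Entry<String, String> field : fieldPaths.entrySet()) {
                    // The string value of the first matching node, trimmed.
                    object.put(field.getKey(),
                            xpath.evaluate(field.getValue(), item).trim());
                }
                result.add(object);
            }
            return result;
        } catch (ParserConfigurationException | SAXException
                | IOException | XPathExpressionException e) {
            throw new IllegalStateException("Extraction failed", e);
        }
    }
}
```

For example, calling `extract` with the item path `//div[@class='book']` and the field paths `title -> h2` and `price -> span` would return one map per matching `div`.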

Processing crawled data

The extracted data is presented as a list of Map<String, String>. When initializing the Crawler, pass it an instance of a class that implements CrawlerResultProcessor; this processor handles the crawled results.

CrawlerResultProcessor processor = new BookProcessor();
Crawler<BookProcessor> crawler = new Crawler<>();
crawler.setResultProcessor(processor);
...
crawler.crawl();

The CrawlerResultProcessor processes crawled data at 3 levels:

Object: each time a single object has been extracted successfully; handled in processResultObject(Map<String, String> object)
Fragment: each time all objects on one HTML page have been extracted (usually 20-50 objects); handled in processResultFragmentList(List<Map<String, String>> list)
List: only after the whole process finishes, when the list of all objects is returned as the final result; handled in processResultList(List<Map<String, String>> list)
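The order in which the three callbacks fire can be sketched as follows. The interface `ResultProcessorSketch` is a stand-in that only borrows the three method names listed above; its `void` return types, the `CountingProcessor` implementation, and the `ProcessorDemo.run` driver are assumptions for illustration and not the project's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Stand-in for CrawlerResultProcessor; void return types are assumed.
interface ResultProcessorSketch {
    void processResultObject(Map<String, String> object);
    void processResultFragmentList(List<Map<String, String>> list);
    void processResultList(List<Map<String, String>> list);
}

// Hypothetical processor that only counts what it receives.
final class CountingProcessor implements ResultProcessorSketch {
    int objects;    // calls at the Object level
    int fragments;  // calls at the Fragment level (one per page)
    int total;      // size of the final list

    @Override
    public void processResultObject(Map<String, String> object) {
        objects++;
    }

    @Override
    public void processResultFragmentList(List<Map<String, String>> list) {
        fragments++;
    }

    @Override
    public void processResultList(List<Map<String, String>> list) {
        total = list.size();
    }
}

// Hypothetical driver that simulates how a crawler might invoke the callbacks.
final class ProcessorDemo {
    static void run(List<List<Map<String, String>>> pages, ResultProcessorSketch p) {
        List<Map<String, String>> all = new ArrayList<>();
        for (List<Map<String, String>> page : pages) {
            for (Map<String, String> object : page) {
                p.processResultObject(object);   // after each extracted object
            }
            p.processResultFragmentList(page);   // after each finished page
            all.addAll(page);
        }
        p.processResultList(all);                // once, when everything is done
    }
}
```

A processor that stores results incrementally would do its work at the Object or Fragment level, while one that needs the complete data set (for sorting or de-duplication, say) would wait for the List level.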

Stop the crawler

The Crawler uses a static variable STOP to control the workflow. To stop the process immediately, set it to false:

Crawler.STOP = false;

Once the crawler has stopped, you must set it back to true before starting another crawl.
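Why the flag must be reset can be seen in a minimal sketch of a loop guarded by such a flag. The class `StoppableLoopSketch` is a hypothetical stand-in, not the project's Crawler; the `volatile` modifier and the `afterPage` callback are assumptions added so that the flag can be flipped from another thread and so that the example is deterministic.

```java
import java.util.function.IntConsumer;

// Hypothetical stand-in for a crawler loop controlled by a static flag.
final class StoppableLoopSketch {

    // Follows the convention described above: true = keep running, false = stop.
    // volatile is an assumption, so a change made by another thread is visible.
    static volatile boolean STOP = true;

    // "Crawls" up to the given number of pages and returns how many were done.
    // afterPage is called after each page with the zero-based page index.
    static int crawl(int pages, IntConsumer afterPage) {
        int done = 0;
        for (int page = 0; page < pages; page++) {
            if (!STOP) {
                break; // the flag was cleared: stop before the next page
            }
            done++;
            afterPage.accept(page);
        }
        return done;
    }
}
```

In this sketch, a crawl started while the flag is still false processes no pages at all, which is why the flag has to be set back to true first.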
