hQuery.php

An extremely fast and efficient web scraper that parses megabytes of HTML in the blink of an eye.

API Documentation

Features

Very fast parsing and lookup
Parses broken HTML
jQuery-like style of DOM traversal
Low memory usage
Can handle big HTML documents (I have tested up to 20 MB, but the limit is the amount of RAM you have)
Doesn't require cURL to be installed
Automatically handles redirects (301, 302, 303)
Caches response for multiple processing tasks
PHP 5+
No dependencies

Install

Just add include_once 'hquery.php'; to your project and start using hQuery.

Alternatively, install it with Composer: composer require duzun/hquery

or with npm: npm install hquery.php, then require_once 'node_modules/hquery.php/hquery.php';

Usage

Basic setup:

include_once '/path/to/libs/hquery.php';

// Set the cache path - must be a writable folder
hQuery::$cache_path = "/path/to/cache";
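
Since the cache path must be a writable folder, it can help to check that before assigning it. The snippet below is a minimal sketch of such a check; `ensure_cache_dir()` and the temp-folder location are hypothetical names chosen for illustration and are not part of hQuery.

```php
<?php
// Hypothetical helper (not part of hQuery): make sure the cache folder
// exists and is writable before handing it to hQuery::$cache_path.
function ensure_cache_dir($path)
{
    // Create the folder (and any missing parents) if it is not there yet
    if (!is_dir($path) && !mkdir($path, 0775, true) && !is_dir($path)) {
        throw new RuntimeException("Cannot create cache folder: $path");
    }
    if (!is_writable($path)) {
        throw new RuntimeException("Cache folder is not writable: $path");
    }
    return $path;
}

// Example location (an assumption); any writable folder works.
$cache_dir = ensure_cache_dir(sys_get_temp_dir() . '/hquery-cache');

// Only touch hQuery if the library has already been included.
if (class_exists('hQuery', false)) {
    hQuery::$cache_path = $cache_dir;
}
```

Failing early with an exception makes a misconfigured cache folder obvious instead of leaving it to surface later during a request.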

Open a remote HTML document

hQuery::fromUrl( string `$url`, array `$headers` = NULL, array|string `$body` = NULL, array `$options` = NULL )

$doc = hQuery::fromUrl('http://example.com/someDoc.html', ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']);

var_dump($doc->headers); // See response headers
var_dump(hQuery::$last_http_result); // See response details of last request

For building advanced requests (POST, parameters, etc.), see hQuery::http_wr().

Open a local HTML document

hQuery::fromFile( string `$filename`, boolean `$use_include_path` = false, resource `$context` = NULL )

$doc = hQuery::fromFile('/path/to/filesystem/doc.html');

Load HTML from a string

hQuery::fromHTML( string `$html`, string `$url` = NULL )

$doc = hQuery::fromHTML('<html><head><title>Sample HTML Doc</title><body>Contents...</body></html>');

// Set base_url in case the document is loaded from a local source.
// Note: The base_url is used to retrieve absolute URLs from relative ones
$doc->base_url = 'http://desired-host.net/path';
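
To illustrate what "absolute URLs from relative ones" means, here is a simplified sketch of resolving a relative reference against a base URL. It is not hQuery's internal code: `resolve_url()` is a hypothetical helper, and it only covers absolute, protocol-relative, root-relative and path-relative references.

```php
<?php
// Illustrative sketch only: resolve_url() is a hypothetical helper,
// not hQuery's internal implementation.
function resolve_url($base, $rel)
{
    // Already absolute (has a scheme such as http: or mailto:)
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $rel)) {
        return $rel;
    }

    $b      = parse_url($base);
    $scheme = isset($b['scheme']) ? $b['scheme'] : 'http';
    $origin = $scheme . '://' . (isset($b['host']) ? $b['host'] : '')
            . (isset($b['port']) ? ':' . $b['port'] : '');

    // Protocol-relative: //cdn.example.com/x.js
    if (strpos($rel, '//') === 0) {
        return $scheme . ':' . $rel;
    }

    $base_path = isset($b['path']) ? $b['path'] : '/';

    // Split off ?query / #fragment so only the path part is normalized
    $pos  = strcspn($rel, '?#');
    $tail = (string) substr($rel, $pos);
    $path = (string) substr($rel, 0, $pos);

    if ($path === '') {
        // Query-only reference such as "?page=2" keeps the base path
        $path = $base_path;
    } elseif ($path[0] !== '/') {
        // Path-relative: replace everything after the last "/" of the base path
        $path = substr($base_path, 0, strrpos($base_path, '/') + 1) . $path;
    }

    // Collapse "." and ".." segments
    $segments = explode('/', $path);
    $last     = end($segments);
    $out      = array();
    foreach ($segments as $seg) {
        if ($seg === '..') {
            array_pop($out);
        } elseif ($seg !== '.') {
            $out[] = $seg;
        }
    }
    // A trailing "." or ".." refers to a folder, so keep the trailing slash
    if ($last === '.' || $last === '..') {
        $out[] = '';
    }

    return $origin . '/' . ltrim(implode('/', $out), '/') . $tail;
}

echo resolve_url('http://desired-host.net/path', 'img/logo.png'), "\n";
echo resolve_url('http://desired-host.net/a/b/c.html', '../img/x.png?v=1'), "\n";
```

Note that with a base of http://desired-host.net/path, the last segment ("path") is treated as a file name, so relative references are resolved against the folder that contains it.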

Processing the results

hQuery::find( string `$sel`, array|string `$attr` = NULL, hQuery_Node `$ctx` = NULL )

// Find all banners (anchors that directly contain an image)
$banners = $doc->find('a > img:parent');

// Extract links, titles and images
$links  = array();
$images = array();
$titles = array();
foreach($banners as $pos => $a) {
    $links[$pos] = $a->attr('href');
    $titles[$pos] = trim($a->text()); // strip all HTML tags and leave just text
    $images[$pos] = $a->find('img')->attr('src');
}

// Read charset of the original document (internally it is converted to UTF-8)
$charset = $doc->charset;

// Get the size of the document ( strlen($html) )
$size = $doc->size;
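
The loop above fills three parallel arrays keyed by position. If one record per banner is more convenient, they can be merged afterwards. The following is a small sketch in plain PHP; `zip_banners()` is a hypothetical helper and the sample values are made up so that the snippet runs without hQuery.

```php
<?php
// Hypothetical helper (not part of hQuery): merge the parallel
// $links / $titles / $images arrays into one record per banner.
function zip_banners(array $links, array $titles, array $images)
{
    $records = array();
    foreach ($links as $pos => $href) {
        $records[] = array(
            'href'  => $href,
            'title' => isset($titles[$pos]) ? $titles[$pos] : '',
            'src'   => isset($images[$pos]) ? $images[$pos] : null,
        );
    }
    return $records;
}

// Made-up sample data standing in for the arrays built in the loop
$links  = array(0 => '/promo', 1 => 'http://example.com/sale');
$titles = array(0 => 'Promo', 1 => '');
$images = array(0 => '/img/promo.png', 1 => '/img/sale.png');

print_r(zip_banners($links, $titles, $images));
```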

Live Demo

On DUzun.Me

TODO

Unit test everything
Document everything
Cookie support
Add more selectors
Improve selectors to be able to select by attributes

