Building Hext on *nix

On a Debian-based distribution

Install the following packages:

  • g++ ≥4.9 or clang ≥3.5
  • cmake ≥3.02
  • libboost-dev ≥1.55
  • libboost-regex-dev ≥1.55
  • libboost-program-options-dev ≥1.55
  • libgumbo-dev ≥0.10.1, which is available in Debian Sid, Debian Stretch and Ubuntu ≥15.10
or

On other *nix operating systems

You will need:

  • g++ ≥4.9 or clang ≥3.5
  • CMake ≥3.02
  • Boost ≥1.55, specifically Boost.Regex and Boost.Program_options
  • libtool and autoconf for building Gumbo
  • Gumbo version 0.10.1, which you have to download, build and install from source

Download and extract the latest Hext release and navigate to the top-level build directory. Then call cmake and make to build the project.
If all went well you'll find the htmlext binary in the current directory.

wget https://github.com/thomastrapp/hext/archive/v0.5.2.tar.gz
tar xf *.tar.gz
cd hext*/build
cmake -DBUILD_SHARED_LIBS=On .. && make -j 2
./htmlext --help

If you wish to install Hext on your system, run make install as root. This will install the htmlext binary, libhext's header files, the libhext library and package configuration files for CMake.
Run ldconfig so your linker can find the newly installed libhext.

# run as root:
make install
# tell your linker that there's a new library:
ldconfig

Building Hext on Windows

Both htmlext and libhext can be built on Windows.
You will need:

The tricky part is getting CMake to find all the necessary headers and libraries, which depends on how and where you installed Boost and Gumbo. Once that is done, CMake can generate a Visual Studio solution that Visual Studio can use to compile the Hext project.
Note: So far no effort has been made to build the Hext language bindings on Windows. But it may very well be feasible :)

Using htmlext

htmlext is a command line utility that accepts Hext snippets, matches them against HTML files and outputs JSON.

Every rule tree match gets its own JSON object containing the captured name-value pairs.

htmlext detects whether its output is written to a terminal or to a pipe. In the former case, every JSON object is pretty-printed; in the latter, the output is compacted, with one object printed per line. You can force either behaviour with --pretty or --compact.
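
The terminal-versus-pipe behaviour described above can be illustrated with a short Python sketch. This is not htmlext's actual implementation, and its exact JSON layout may differ; format_results and print_results are hypothetical names used only for illustration.

```python
import json
import sys


def format_results(results, pretty):
    """Return one string per result: indented if pretty, single-line otherwise."""
    if pretty:
        return [json.dumps(obj, indent=2, sort_keys=True) for obj in results]
    return [json.dumps(obj, sort_keys=True) for obj in results]


def print_results(results, pretty=None, stream=None):
    """Write results to stream, choosing the layout like --pretty/--compact."""
    if stream is None:
        stream = sys.stdout
    # When no mode is forced, pretty-print only if the stream is a terminal.
    if pretty is None:
        pretty = stream.isatty()
    for text in format_results(results, pretty):
        stream.write(text + "\n")
```

Passing pretty=True or pretty=False plays the role of forcing --pretty or --compact; leaving it as None falls back to the isatty() check.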

Another notable option is --filter <key>, which will print nothing but the value of every capture whose name equals <key>, one per line.
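
If you already have compact output (one JSON object per line) saved somewhere, the effect of --filter can be approximated after the fact. The following is a minimal Python sketch under that assumption; filter_values is a hypothetical helper, not part of Hext.

```python
import json


def filter_values(compact_output, key):
    """Collect the value of every capture named `key` from compact output."""
    values = []
    for line in compact_output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between objects
        obj = json.loads(line)
        if key in obj:
            values.append(obj[key])
    return values
```

Objects that do not contain the key are skipped, so the result holds only the values that --filter would have printed, in the same order.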

htmlext - Extract structured content from HTML.

Usage:
  htmlext [options] <hext-file> <html-file...>
      Apply extraction rules from <hext-file> to each
      <html-file> and print the captured content as JSON.

Options:
  -x [ --hext ] <file>    Add Hext from file
  -i [ --html ] <file>    Add HTML from file
  -s [ --str ] <string>   Add Hext from string
  -c [ --compact ]        Print one JSON object per line
  -p [ --pretty ]         Pretty-print JSON
  -a [ --array ]          Wrap results in a JSON array
  -f [ --filter ] <key>   Print values whose name matches <key>
  -l [ --lint ]           Do Hext syntax check
  -h [ --help ]           Print this help message
  -V [ --version ]        Print info and version

htmlext: Examples

# extract all hrefs from page.html
htmlext -s "<a href:h />" -i page.html
# watch every post on /r/videos with vlc
htmlext -f x -s "<a class='title' href:x />" \
  -i <(curl -A "" "https://www.reddit.com/r/videos") \
  | xargs vlc
# use jq's magic to display the most upvoted
# reddit thread
htmlext -a -s "
  <div class='midcol'>
    <div class='unvoted'
         @text=~/[0-9]+/ @text:score />
  </div>
  <div class='entry'>
    <p class='title'>
      <a href:link @text:title />
    </p>
  </div>" \
  -i <(curl -A "" https://www.reddit.com/r/programming) \
  | jq 'sort_by(.score | tonumber) | last'
# apply href.hext to all html files
htmlext href.hext *.html
# download every image with wget
htmlext -s "<img src:x />" \
  -f x \
  -i <(curl "https://yoursite/") \
  | xargs wget
# extract external links and check
# if they are dead
# (note the space before each trailing backslash: without it,
# the next option would be glued onto the preceding argument)
htmlext -s "<* href^='http' href:x />" \
  -s "<* src^='http' src:x />" \
  -f x \
  -i <(curl https://yoursite) \
  | sort \
  | uniq \
  | while read -r link ; do
      # print link if curl fails
      curl -sf "$link" > /dev/null \
        || echo "$link"
    done

Check out jq, an indispensable tool when dealing with JSON in the shell.
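
If jq is not available, the same post-processing can be done in a few lines of Python. The sketch below mirrors the `sort_by(.score | tonumber) | last` filter from the example above. It assumes the input is the JSON array produced with --array and that every object has a "score" capture; most_upvoted is a hypothetical helper.

```python
import json


def most_upvoted(json_array_text):
    """Return the object with the highest numeric score, or None if empty."""
    threads = json.loads(json_array_text)
    if not threads:
        return None
    # Captured values are strings, so convert before comparing;
    # otherwise "9" would sort after "120".
    return sorted(threads, key=lambda thread: int(thread["score"]))[-1]
```

As with jq's `last`, ties resolve to the last of the equally scored objects, because sorted() is stable.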

Using libhext in a CMake project

After installing Hext you can use CMake's find_package command to add libhext to your project:

# Enable C++14
SET(CMAKE_CXX_STANDARD 14)
# Load HextConfig.cmake
FIND_PACKAGE(Hext REQUIRED)
# Add libhext's include directory
INCLUDE_DIRECTORIES(${HEXT_INCLUDE_DIR})
# Link libhext
TARGET_LINK_LIBRARIES(your-target ${HEXT_LIBRARY})

Go to the main page of libhext's documentation for an introduction to libhext.

Building Hext for Node.js

First build and install Hext. Then, install nodejs, nodejs-dev (tested with version ≥v0.10.25), a package called nodejs-legacy (which is just a symlink from /usr/bin/node to /usr/bin/nodejs) and npm (any version should suffice).

The Node.js extension is included in the Hext project. To build the extension you will need two packages which are available through the Node.js package manager:

  • nan: Native Abstractions for Node.js
  • CMake.js: A Node.js/io.js native addon build tool
# relative to the project root
cd libhext/bindings/nodejs
# install nan (only required for building)
npm install nan --save
# install cmake-js globally (requires root)
npm install cmake-js -g
# build the project
cmake-js build

To use the extension in another project you can let npm do all the work.

# optional: install the extension for local
# projects (requires root)
cd libhext/bindings/nodejs
npm link
# now npm can pull hext into your projects
cd /path/to/your/project
npm link hext
# alternative: use the path from the repository
cd /path/to/your/project
npm install \
/path/to/libhext/bindings/nodejs --save
# test loading hext
nodejs -e "require('hext')" && echo "It works!"

Node.js: Example Usage

This example JavaScript program demonstrates everything you can currently do with Hext's Node.js extension. Also check out htmlext.js, which is a stripped-down JavaScript port of the htmlext command line utility.

var hext = require('hext');
// hext.Html's constructor expects a single argument
// containing a UTF-8 encoded string of HTML.
var html = new hext.Html(
'<a href="one.html"> <img src="one.jpg" /> </a>' +
'<a href="two.html"> <img src="two.jpg" /> </a>' +
'<a href="three.html"><img src="three.jpg" /></a>');
// hext.Rule's constructor expects a single argument
// containing a Hext snippet.
// Throws an Error on invalid syntax, with
// Error.message containing the error description.
var rule = new hext.Rule('<a href:link>' +
' <img src:image />' +
'</a>');
// hext.Rule.extract expects an argument of type
// hext.Html. Returns an Array containing Objects
// which contain key-value pairs of type String.
var result = rule.extract(html);
// print each key-value pair
for(var i in result)
{
  for(var key in result[i])
    console.log(key, "->", result[i][key]);
  console.log();
}

Building Hext for Python

First build and install Hext. Then, install python, python-dev (both version ≥2.7) and swig (version ≥3.0).

The Python extension is included in the Hext project. Navigate to its build directory. Call cmake and make to build the extension.

# relative to the project root
cd libhext/bindings/python/build
cmake .. && make

The Hext Python extension consists of two files: hext.py and a shared library called _hext.so. Unfortunately, there's no automated way of installing the extension yet, so you'll have to take care of this part yourself :)
Running python -m site shows all the locations Python searches for extensions. Use the directory given by python -m site --user-site to install the extension for the current user only.

# list all locations where python looks for
# extensions
python -m site
# create the directory where python expects
# user installed extensions (if it doesn't
# exist already)
mkdir -p "$(python -m site --user-site)"
# copy both files to this directory
cp hext.py _hext.so \
  "$(python -m site --user-site)"
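
The same manual installation can also be scripted in Python. This is a minimal sketch, not an official installer: install_extension is a hypothetical helper, and it assumes both files sit in the build directory after running make.

```python
import os
import shutil
import site


def install_extension(build_dir, target_dir=None,
                      files=("hext.py", "_hext.so")):
    """Copy the extension files into target_dir and return the new paths."""
    if target_dir is None:
        # Same directory that `python -m site --user-site` prints.
        target_dir = site.getusersitepackages()
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    installed = []
    for name in files:
        destination = os.path.join(target_dir, name)
        shutil.copy(os.path.join(build_dir, name), destination)
        installed.append(destination)
    return installed
```

Calling install_extension(".") from the build directory would install for the current user only; pass target_dir explicitly to use one of the other locations listed by python -m site.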

Python: Example Usage

This Python script demonstrates everything you can currently do with Hext's Python extension. Also check out htmlext.py, which is a stripped-down Python port of the htmlext command line utility.

import hext
# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html("""
<a href="one.html"> <img src="one.jpg" /> </a>
<a href="two.html"> <img src="two.jpg" /> </a>
<a href="three.html"><img src="three.jpg" /></a>""")
# hext.Rule's constructor expects a single argument
# containing a Hext snippet.
# Throws an exception of type ValueError on invalid syntax.
rule = hext.Rule("<a href:link> <img src:image /> </a>")
# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
results = rule.extract(html)
# print each key-value pair
# (print with a single parenthesized argument works in
# both Python 2.7 and Python 3)
for group in results:
    for key in group:
        print('{}: {}'.format(key, group[key]))
    print('')

Scrapy is one of the most widely used tools for crawling the web. This Scrapy spider is taken from Scrapy's dirbot example and modified to use Hext instead of XPath for extracting content.

import hext
from scrapy.spiders import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]
    # the rule is compiled once and shared by all responses
    rule = hext.Rule("""<div class="title-and-desc">
                          <a href:url @text:name />
                          <div @text:description />
                        </div>""")

    def parse(self, response):
        return self.rule.extract(hext.Html(response.body))
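
Because hext.Html expects UTF-8 encoded HTML, a page served in another charset may need to be transcoded before it is handed to Hext. The following is a minimal sketch that assumes the source encoding is already known (for example from the HTTP headers); to_utf8 is a hypothetical helper and not part of Hext or Scrapy.

```python
def to_utf8(body, source_encoding):
    """Transcode raw HTML bytes from source_encoding to UTF-8 bytes."""
    # Undecodable bytes are replaced rather than raising, so a single
    # broken character does not abort the whole extraction.
    text = body.decode(source_encoding, errors="replace")
    return text.encode("utf-8")
```

Whether the binding wants the resulting bytes or the decoded text may depend on your Python version, so treat the return type as an assumption to verify against your build.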

Building Hext for Ruby

First build and install Hext. Then, install ruby, ruby-dev (both version ≥2.1) and swig (version ≥3.0).

The Ruby extension is included in the Hext project. Navigate to its build directory. Call cmake and make to build the extension.

# relative to the project root
cd libhext/bindings/ruby/build
cmake .. && make

The Hext Ruby extension consists of a single shared library called hext.so. Unfortunately, there's no automated way of installing the extension yet, so you'll have to take care of this part yourself :)
Running ruby -e 'puts $LOAD_PATH' will show you all the locations Ruby will check in search of an extension. Copying the shared library to any one of these locations should suffice.

# list all locations where ruby looks for
# extensions
ruby -e 'puts $LOAD_PATH'
# Example: install the extension system wide
# (requires root)
cp hext.so /usr/local/lib/site_ruby
# Alternative: Run ruby with the -I parameter,
# where <path> is the path containing the extension.
ruby -I<path> your-script.rb

Ruby: Example Usage

This Ruby script demonstrates everything you can currently do with Hext's Ruby extension. Also check out htmlext.rb, which is a stripped-down Ruby port of the htmlext command line utility.

require 'hext'
# Hext::Html's initializer expects a single argument
# containing a UTF-8 encoded string of HTML.
html = Hext::Html.new(<<-'HTML_INPUT')
<a href="one.html"> <img src="one.jpg" /> </a>
<a href="two.html"> <img src="two.jpg" /> </a>
<a href="three.html"><img src="three.jpg" /></a>
HTML_INPUT
# Hext::Rule's initializer expects a single argument
# containing a Hext snippet.
# Raises an exception of type ArgumentError on invalid syntax.
rule = Hext::Rule.new("<a href:link> <img src:image /> </a>")
# Hext::Rule.extract expects an argument of type Hext::Html.
# Returns an Array of Hashes which contain key-value pairs
# of type String.
result = rule.extract(html)
# print each key-value pair
result.each do |map|
  map.each do |key, value|
    puts "#{key}: #{value}"
  end
  puts
end

Building Hext for PHP

First build and install Hext. Then, install php5, php5-dev (both version ≥5.6) and swig (version ≥3.0).

The PHP extension is included in the Hext project. Navigate to its build directory. Call cmake and make to build the extension.

# relative to the project root
cd libhext/bindings/php/build
cmake .. && make

The Hext PHP extension consists of two files: hext.php and a shared library called hext.so. In typical PHP setups the dl() function, which loads dynamic libraries at runtime, is disabled. Therefore, in most cases the extension must be loaded at PHP's startup. There are two ways to do this:

  • php.ini: Add the line "extension=/path/to/hext.so" to your php.ini file, ideally below the section called "Dynamic Extensions".
  • PHP CLI's -d parameter: Run PHP with the argument "-d extension=/path/to/hext.so".

And lastly, PHP needs to be able to find hext.php. This file may be anywhere in your include_path.

PHP: Example Usage

This PHP script demonstrates everything you can currently do with Hext's PHP extension. Also check out htmlext.php, which is a stripped-down PHP port of the htmlext command line utility.

<?php
require 'hext.php';
# HextHtml's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
$html = new HextHtml(
'<a href="one.html"> <img src="one.jpg" /> </a>
<a href="two.html"> <img src="two.jpg" /> </a>
<a href="three.html"><img src="three.jpg" /></a>');
# HextRule's constructor expects a single argument
# containing a Hext snippet.
# Throws an exception on invalid syntax.
$rule = new HextRule('<a href:link>'.
' <img src:image />'.
'</a>');
# HextRule->extract expects an argument of type HextHtml.
# Returns an array containing arrays which contain
# key-value pairs of type string.
$result = $rule->extract($html);
# print each key-value pair
foreach($result as $map)
{
  foreach($map as $key => $value)
    echo "$key: $value\n";
  echo "\n";
}