Get Hext

The easiest way to use Hext is with Python, Node, a browser or on the command line. If you think Hext should be available for another language, package manager, operating system or architecture, please raise an issue on GitHub.

Install Method | Languages | Platforms | Arch
pip install hext (Usage, Source) | Python v3.6-3.10 (PyPI, htmlext) | macOS, Linux | x86-64
npm install hext (Usage, Source) | Node v14-18 (npmjs) | macOS, Linux | x86-64
hext.js (Usage, Source) | JavaScript (WebAssembly) | Browser, Node | All
Build from source (Instructions, GitHub) | C++, Python, Node, Ruby, PHP | All | All

Hext on the command line

htmlext is a command line utility that accepts Hext templates, matches them against HTML files and outputs JSON.

Every rule tree match gets its own JSON object containing the captured name-value pairs.

htmlext detects whether its output is written to a terminal or to a pipe. In the former case, every JSON object is pretty-printed; in the latter, the output is compacted, with one object printed per line. You can force either behavior with --pretty or --compact.

Another notable option is --filter <key>, which will print nothing but the value of every capture whose name equals <key>, one per line.
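
As a rough illustration of the compact format, the following Python sketch reads one JSON object per line and mimics what --filter does. The sample lines and the helper names parse_compact and filter_values are made up for this example; real output depends on your template and input.

```python
import json

# Hypothetical compact output: one JSON object per rule tree match.
compact_output = """\
{"link": "one.html", "image": "one.jpg"}
{"link": "two.html", "image": "two.jpg"}
"""

def parse_compact(text):
    """Parse compact output: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def filter_values(matches, key):
    """Collect the value of every capture named `key`, similar to --filter <key>."""
    return [match[key] for match in matches if key in match]

matches = parse_compact(compact_output)
for value in filter_values(matches, "link"):
    print(value)
```

Reading line by line like this only works for compact output; pretty-printed objects span several lines.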

htmlext - Extract structured content from HTML.
Usage:
htmlext [options] <hext-file> <html-file...>
Apply extraction rules from <hext-file> to each
<html-file> and print the captured content as JSON.
Options:
-x [ --hext ] <file> Add Hext from file
-i [ --html ] <file> Add HTML from file
-s [ --str ] <string> Add Hext from string
-c [ --compact ] Print one JSON object per line
-p [ --pretty ] Pretty-print JSON
-a [ --array ] Wrap results in a JSON array
-f [ --filter ] <key> Print values whose name matches <key>
-l [ --lint ] Do Hext syntax check
-m [ --max-searches ] <amount> Abort after this many searches. The
default is 0, which never aborts.
-h [ --help ] Print this help message
-V [ --version ] Print info and version
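
If you would rather call htmlext from a script than from the shell, a small wrapper might look like the sketch below. It only uses options listed in the help text above; the function names build_command and run_htmlext are hypothetical, and the wrapper assumes htmlext is installed and on your PATH.

```python
import json
import shutil
import subprocess

def build_command(hext_snippet, html_file, max_searches=0):
    """Build an htmlext argument list from options shown in the help text."""
    cmd = ["htmlext", "--compact", "--str", hext_snippet, "--html", html_file]
    if max_searches > 0:
        cmd += ["--max-searches", str(max_searches)]
    return cmd

def run_htmlext(hext_snippet, html_file, max_searches=0):
    """Run htmlext and return a list of dicts, one per rule tree match."""
    if shutil.which("htmlext") is None:
        raise FileNotFoundError("htmlext not found on PATH")
    proc = subprocess.run(
        build_command(hext_snippet, html_file, max_searches),
        capture_output=True, text=True, check=True)
    # --compact prints one JSON object per line.
    return [json.loads(line) for line in proc.stdout.splitlines() if line.strip()]

# Show the command that would be executed for a simple template.
print(build_command("<a href:h />", "page.html"))
```

Passing the arguments as a list avoids shell quoting issues with Hext snippets that contain quotes or angle brackets.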

Some examples using htmlext

# extract all hrefs from page.html
htmlext -s "<a href:h />" -i page.html

# watch every post on /r/videos with vlc
htmlext -f x -s "<a class='title' href:x />" \
  -i <(curl -A "" "https://www.reddit.com/r/videos/") \
  | xargs vlc

# use jq's magic to display the most upvoted
# reddit thread
htmlext -a -s "
  <div class='midcol'>
    <div class='unvoted' title:score />
  </div>
  <div class='entry'>
    <div>
      <p class='title'>
        <a href:link @text:title />
      </p>
    </div>
  </div>" \
  -i <(curl -A "" https://www.reddit.com/r/programming/) \
  | jq 'sort_by(.score | tonumber) | last'

# apply href.hext to all html files
htmlext href.hext *.html

# download every image with wget
htmlext -s "<img src:x />" \
  -f x \
  -i <(curl "https://yoursite/") \
  | xargs wget

# extract external links and check
# if they are dead
htmlext -s "<* href^='http' href:x />" \
  -s "<* src^='http' src:x />" \
  -f x \
  -i <(curl https://yoursite) \
  | sort \
  | uniq \
  | while read -r link ; do
      # print link if curl fails
      curl -sf "$link" > /dev/null \
        || echo "$link"
    done

Check out jq, an indispensable tool when dealing with JSON in the shell.
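
If jq is not at hand, the same kind of post-processing can be done in a few lines of Python. The sketch below mirrors the jq filter sort_by(.score | tonumber) | last from the example above; the sample array is invented and only stands in for what htmlext --array might print.

```python
import json

# Hypothetical output of `htmlext --array ...`: a JSON array of matches.
array_output = """[
  {"score": "12", "link": "/a", "title": "First"},
  {"score": "340", "link": "/b", "title": "Second"},
  {"score": "57", "link": "/c", "title": "Third"}
]"""

threads = json.loads(array_output)

# Captured values are strings, so convert them before comparing
# (this is what jq's `tonumber` does in the shell example).
by_score = sorted(threads, key=lambda thread: int(thread["score"]))
top = by_score[-1]

print(top["title"], top["link"])
```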

Hext for Python

Run pip install hext, which installs Hext as a Python module along with the htmlext command line utility. See the example below for usage instructions. Also check out htmlext.py, a stripped-down Python port of the htmlext command line utility.

import hext

# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html("""
<a href="one.html"> <img src="one.jpg" /> </a>
<a href="two.html"> <img src="two.jpg" /> </a>
<a href="three.html"><img src="three.jpg" /></a>""")

# hext.Rule's constructor expects a single argument
# containing a Hext snippet.
# Throws an exception of type ValueError on invalid syntax.
rule = hext.Rule("<a href:link> <img src:image /> </a>")

# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
results = rule.extract(html)

# hext.Rule.extract has a second, optional parameter
# of type unsigned int, called max_searches.
# The search for matching elements is aborted by
# throwing an exception after this limit is reached.
# The default is 0, which never aborts. If running
# untrusted Hext templates, it is recommended to set
# max_searches to some high value, like 10000, to
# protect against resource exhaustion.
# results = rule.extract(html, 10000)

# print each key-value pair
for group in results:
    for key in group:
        print('{}: {}'.format(key, group[key]))
    print()
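
When templates come from untrusted sources, it can help to wrap both failure modes in one place. The following is a minimal sketch, not part of the Hext API: the helper extract_untrusted is hypothetical, and because the exact exception type raised when max_searches is exceeded is not specified here, it catches Exception broadly. The hext module is passed in as an argument so the helper can be exercised without Hext installed.

```python
def extract_untrusted(hext_module, template, html_text, max_searches=10000):
    """Return (results, error); exactly one of the two is None."""
    try:
        # Per the example above, invalid syntax raises ValueError.
        rule = hext_module.Rule(template)
    except ValueError as err:
        return None, "invalid template: {}".format(err)
    try:
        results = rule.extract(hext_module.Html(html_text), max_searches)
    except Exception as err:
        # Assumption: the concrete exception type for an exceeded
        # search limit is not documented here, so catch broadly.
        return None, "extraction aborted: {}".format(err)
    return results, None
```

With Hext installed, a call might look like results, error = extract_untrusted(hext, "<a href:link />", page_source) after import hext.
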
Scrapy is one of the most popular tools for crawling the web. For an example of how to use Hext with Scrapy, check out github.com/html-extract/hext-scrapy-quotesbot.

Hext for Node

Run npm install hext, which installs Hext as a Node module. See the example below for usage instructions. Also check out htmlext.js, a stripped-down JavaScript port of the htmlext command line utility.

var hext = require('hext');

// hext.Html's constructor expects a single argument
// containing a UTF-8 encoded string of HTML.
var html = new hext.Html(
  '<a href="one.html"> <img src="one.jpg" /> </a>' +
  '<a href="two.html"> <img src="two.jpg" /> </a>' +
  '<a href="three.html"><img src="three.jpg" /></a>');

// hext.Rule's constructor expects a single argument
// containing a Hext snippet.
// Throws an Error on invalid syntax, with
// Error.message containing the error description.
var rule = new hext.Rule('<a href:link>' +
                         '  <img src:image />' +
                         '</a>');

// hext.Rule.extract expects an argument of type
// hext.Html. Returns an Array containing Objects
// which contain key-value pairs of type String.
var result = rule.extract(html);

// hext.Rule.extract has a second, optional parameter
// of type unsigned int, called max_searches.
// The search for matching elements is aborted by
// throwing an exception after this limit is reached.
// The default is 0, which never aborts. If running
// untrusted Hext templates, it is recommended to set
// max_searches to some high value, like 10000, to
// protect against resource exhaustion.
// var result = rule.extract(html, 10000);

// print each key-value pair
for(var i in result)
{
  for(var key in result[i])
    console.log(key, "->", result[i][key]);
  console.log();
}

Hext for JavaScript

Hext for JavaScript is available on a CDN. See the examples below for usage instructions.

<script src="https://cdn.jsdelivr.net/gh/html-extract/hext.js@v1.0.3/dist/hext.js"></script>
<script>
(function() {
  // loadHext() returns a promise
  loadHext().then(hext => {
    // hext.Html's constructor expects a single argument
    // containing a UTF-8 encoded string of HTML.
    const html = new hext.Html(
      '<a href="one.html"> <img src="one.jpg" /> </a>' +
      '<a href="two.html"> <img src="two.jpg" /> </a>' +
      '<a href="three.html"><img src="three.jpg" /></a>');

    // hext.Rule's constructor expects a single argument
    // containing a Hext snippet.
    // Throws an Error on invalid syntax, with
    // Error.message containing the error description.
    const rule = new hext.Rule('<a href:link>' +
                               '  <img src:image />' +
                               '</a>');

    // hext.Rule.extract expects an argument of type
    // hext.Html. Returns an Array containing Objects
    // which contain key-value pairs of type String.
    const result = rule.extract(html);

    // hext.Rule.extract has a second, optional parameter
    // of type unsigned int, called max_searches.
    // The search for matching elements is aborted by
    // throwing an exception after this limit is reached.
    // The default is 0, which never aborts. If running
    // untrusted Hext templates, it is recommended to set
    // max_searches to some high value, like 10000, to
    // protect against resource exhaustion.
    // const result = rule.extract(html, 10000);

    // print each key-value pair
    for(var i in result)
    {
      for(var key in result[i])
        console.log(key, "->", result[i][key]);
      console.log();
    }
  });
})();
</script>
Hext for JavaScript is also available as hext.mjs, which is a JavaScript module.

<script type="module">
import hext from "https://cdn.jsdelivr.net/gh/html-extract/hext.js@v1.0.3/dist/hext.mjs";
const html = new hext.Html("<ul><li>Hello</li><li>World</li></ul>");
const rule = new hext.Rule("<li @text:my_text />");
const result = rule.extract(html).map(x => x.my_text).join(", ");
console.log(result); // "Hello, World"
</script>

Using hext.js also works in Node:

const loadHext = require('./hext.js');
loadHext().then(hext => {
  const html = new hext.Html("<ul><li>Hello</li><li>World</li></ul>");
  const rule = new hext.Rule("<li @text:my_text />");
  const result = rule.extract(html).map(x => x.my_text).join(", ");
  console.log(result); // "Hello, World"
});

To test whether a specific browser supports hext.js, navigate to hext.thomastrapp.com/hext.js-test.

Hext for Ruby

Build and install Hext for Ruby. See the example below for usage instructions. Also check out htmlext.rb, a stripped-down Ruby port of the htmlext command line utility.

require 'hext'

# Hext::Html's initializer expects a single argument
# containing a UTF-8 encoded string of HTML.
html = Hext::Html.new(<<-'HTML_INPUT')
<a href="one.html"> <img src="one.jpg" /> </a>
<a href="two.html"> <img src="two.jpg" /> </a>
<a href="three.html"><img src="three.jpg" /></a>
HTML_INPUT

# Hext::Rule's initializer expects a single argument
# containing a Hext snippet.
# Raises an exception of type ArgumentError on invalid syntax.
rule = Hext::Rule.new("<a href:link> <img src:image /> </a>")

# Hext::Rule.extract expects an argument of type Hext::Html.
# Returns an Array of Hashes which contain key-value pairs
# of type String.
result = rule.extract(html)

# Hext::Rule.extract has a second, optional parameter
# of type unsigned int, called max_searches.
# The search for matching elements is aborted by
# raising an exception after this limit is reached.
# The default is 0, which never aborts. If running
# untrusted Hext templates, it is recommended to set
# max_searches to some high value, like 10000, to
# protect against resource exhaustion.
# result = rule.extract(html, 10000)

# print each key-value pair
result.each do |map|
  map.each do |key, value|
    puts "#{key}: #{value}"
  end
  puts
end

Hext for PHP

Build and install Hext for PHP. See the example below for usage instructions. Also check out htmlext.php, a stripped-down PHP port of the htmlext command line utility.

<?php
require 'hext.php';

# HextHtml's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
$html = new HextHtml(
'<a href="one.html"> <img src="one.jpg" /> </a>
<a href="two.html"> <img src="two.jpg" /> </a>
<a href="three.html"><img src="three.jpg" /></a>');

# HextRule's constructor expects a single argument
# containing a Hext snippet.
# Throws an exception on invalid syntax.
$rule = new HextRule('<a href:link>'.
                     '  <img src:image />'.
                     '</a>');

# HextRule->extract expects an argument of type HextHtml.
# Returns an array containing arrays which contain
# key-value pairs of type string.
$result = $rule->extract($html);

# HextRule->extract has a second, optional parameter
# of type unsigned int, called max_searches.
# The search for matching elements is aborted by
# throwing an exception after this limit is reached.
# The default is 0, which never aborts. If running
# untrusted Hext templates, it is recommended to set
# max_searches to some high value, like 10000, to
# protect against resource exhaustion.
# $result = $rule->extract($html, 10000);

# print each key-value pair
foreach($result as $map)
{
  foreach($map as $key => $value)
    echo "$key: $value\n";
  echo "\n";
}

Building Hext from Source

On a Debian-based distribution

Install the following packages:

  • g++ ≥7.3 or clang ≥6.0
  • cmake ≥3.8
  • libboost-dev ≥1.55
  • libboost-regex-dev ≥1.55
  • libboost-program-options-dev ≥1.55
  • libgumbo-dev ≥0.10.1
  • rapidjson-dev ≥1.1.0

On other systems (*nix, macOS, Windows)

You will need:

  • g++ ≥7.3 or clang ≥6.0
  • CMake ≥3.8
  • Boost ≥1.55, specifically Boost.Regex and Boost.Program_options
  • libtool and autoconf for building Gumbo
  • Build and install Gumbo (v0.10.1)
  • RapidJSON, which is a header only library (v1.1.0)

Download and extract the latest Hext release and navigate to the top-level build directory. Then run cmake and make to build the project. If all went well, you'll find the htmlext binary in the current directory.

wget https://github.com/html-extract/hext/archive/v1.0.3.tar.gz
tar xf *.tar.gz
cd hext*/build
cmake -DBUILD_SHARED_LIBS=On .. && make -j 2
./htmlext --help

If you wish to install Hext on your system, run make install as root. This will install the htmlext binary, libhext's header files, the libhext library and package configuration files for CMake.
Run ldconfig so your linker can find the newly installed libhext.

# run as root:
make install
# tell your linker that there's a new library:
ldconfig

Using libhext in a CMake Project

After building and installing Hext, you can use CMake's find_package command to add libhext to your project, and you should be good to go.

# Load HextConfig.cmake
find_package(Hext)
# Link libhext
target_link_libraries(your-target hext::hext)

Implementation details of libhext are covered in the libhext C++ library overview and libhext's code documentation.