Get Hext
The easiest way to use Hext is with Python, Node, a browser or on the command line. If you think Hext should be available for another language, package manager, operating system or architecture, please raise an issue on GitHub.
Install Method | Languages | Platforms | Arch |
---|---|---|---|
pip install hext (Usage, Source) | Python v3.9-3.13 (PyPI, htmlext) | macOS, Linux | x86-64, arm64 |
npm install hext (Usage, Source) | Node v18-23 (npmjs) | macOS, Linux | x86-64, arm64 |
npm install hext.js (Usage, Source) | JavaScript (WebAssembly) | Browser, Node | All |
Build from source (Instructions, GitHub) | C++, Python, Node, Ruby, PHP | All | All |
Hext on the command line
htmlext is a command line utility that accepts Hext templates, matches them against HTML files and outputs JSON.
Every rule tree match gets its own JSON object containing the captured name-value pairs.
htmlext detects whether its output is written to a terminal or to a pipe. In the former case, every JSON object is pretty-printed, in the latter the output is compacted, printing one object on each line. You can force either behavior by using --pretty or --compact.
Another notable option is --filter <key>, which will print nothing but the value of every capture whose name equals <key>, one per line.
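Because compact output puts one JSON object on each line, it can be consumed line by line in a script. The following is a minimal Python sketch of that idea. The sample output is invented for illustration and is not real htmlext output, and parse_compact and filter_values are hypothetical helper names. filter_values only approximates what --filter prints.

```python
import json


def parse_compact(output):
    """Parse compact htmlext-style output: one JSON object per non-empty line."""
    return [json.loads(line) for line in output.splitlines() if line.strip()]


def filter_values(matches, key):
    """Collect the value of every capture named `key` (similar in spirit to --filter)."""
    return [match[key] for match in matches if key in match]


# Hypothetical sample output, made up for this example.
sample = (
    '{"link": "one.html", "image": "one.jpg"}\n'
    '{"link": "two.html", "image": "two.jpg"}\n'
)

matches = parse_compact(sample)
for value in filter_values(matches, "link"):
    print(value)
```

Each parsed match is an ordinary dictionary, so the captured name-value pairs can be passed on to any other Python code.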
htmlext - Extract structured content from HTML.

Usage:
  htmlext [options] <hext-file> <html-file...>
      Apply extraction rules from <hext-file> to each
      <html-file> and print the captured content as JSON.

Options:
  -x [ --hext ] <file>             Add Hext from file
  -i [ --html ] <file>             Add HTML from file
  -s [ --str ] <string>            Add Hext from string
  -c [ --compact ]                 Print one JSON object per line
  -p [ --pretty ]                  Pretty-print JSON
  -a [ --array ]                   Wrap results in a JSON array
  -f [ --filter ] <key>            Print values whose name matches <key>
  -l [ --lint ]                    Do Hext syntax check
  -m [ --max-searches ] <amount>   Abort after this many searches. The
                                   default is 0, which never aborts.
  -h [ --help ]                    Print this help message
  -V [ --version ]                 Print info and version
Some examples using htmlext
# extract all hrefs from page.html
htmlext -s "<a href:h />" -i page.html
# use jq to display the most upvoted reddit thread
htmlext -a -s "
  <div class='midcol'>
    <div class='unvoted' title:score />
  </div>
  <div class='entry'>
    <div>
      <p class='title'>
        <a href:link @text:title />
      </p>
    </div>
  </div>" \
  -i <(curl -A "F" https://old.reddit.com/r/programming/) \
  | jq 'sort_by(.score | tonumber) | last'
# apply href.hext to all html files
htmlext href.hext *.html
# download every image with wget
htmlext -s "<img src:x />" \
        -f x \
        -i <(curl "https://yoursite/") \
  | xargs wget
# extract external links and check
# if they are dead
htmlext -s "<* href^='http' href:x />" \
        -s "<* src^='http' src:x />" \
        -f x \
        -i <(curl https://yoursite) \
  | sort \
  | uniq \
  | while read -r link ; do
      # print link if curl fails
      curl -sf "$link" > /dev/null \
        || echo "$link"
    done
Check out jq, an indispensable tool when dealing with JSON in the shell.
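htmlext can also be driven from a script instead of a shell pipeline. The sketch below is one possible way to do that from Python, assuming htmlext is installed and on the PATH. It uses only the --compact, --str and --html options listed in the help text above. htmlext_argv and run_htmlext are hypothetical helper names, not part of Hext.

```python
import json
import subprocess


def htmlext_argv(template, html_file):
    """Build an htmlext command line from a Hext string and one HTML file."""
    return ["htmlext", "--compact", "--str", template, "--html", html_file]


def run_htmlext(template, html_file):
    """Run htmlext and return one dict per match.

    Assumes the htmlext binary is on the PATH. With --compact every match
    is printed as one JSON object per line, so each line is parsed separately.
    """
    proc = subprocess.run(
        htmlext_argv(template, html_file),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if htmlext exits with an error
    )
    return [json.loads(line) for line in proc.stdout.splitlines() if line.strip()]


# Show the command that would be executed for the first shell example.
print(htmlext_argv("<a href:h />", "page.html"))
```

Calling run_htmlext("<a href:h />", "page.html") would correspond to the first shell example above, returning the captured hrefs as a list of dictionaries.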
Hext for Python
import hext
# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html("""
<a href="one.html"> <img src="one.jpg" /> </a>
<a href="two.html"> <img src="two.jpg" /> </a>
<a href="three.html"><img src="three.jpg" /></a>""")
# hext.Rule's constructor expects a single argument
# containing a Hext snippet.
# Throws an exception of type ValueError on invalid syntax.
rule = hext.Rule("<a href:link> <img src:image /> </a>")
# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
results = rule.extract(html)
# hext.Rule.extract has a second, optional parameter
# of type unsigned int, called max_searches.
# The search for matching elements is aborted by
# throwing an exception after this limit is reached.
# The default is 0, which never aborts. If running
# untrusted Hext templates, it is recommended to set
# max_searches to some high value, like 10000, to
# protect against resource exhaustion.
# results = rule.extract(html, 10000)
# print each key-value pair
for group in results:
    for key in group:
        print('{}: {}'.format(key, group[key]))
    print()
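The comments above mention two failure modes: a ValueError on invalid syntax and an exception once max_searches is exceeded. The following sketch shows one way to wrap both when handling untrusted templates. extract_untrusted is a hypothetical helper, and hext_module is expected to be the imported hext package (it is passed in as a parameter so the helper can be tried without the bindings installed). The exception type raised for the search limit is not stated above, so the sketch catches Exception broadly, which is an assumption.

```python
def extract_untrusted(hext_module, template, html_text, max_searches=10000):
    """Apply an untrusted Hext template and return (results, error_message).

    hext_module: the imported `hext` package (passed in explicitly).
    error_message is None when extraction succeeded.
    """
    try:
        # Invalid Hext syntax raises ValueError.
        rule = hext_module.Rule(template)
    except ValueError as err:
        return [], "invalid template: {}".format(err)

    html = hext_module.Html(html_text)
    try:
        # The second argument caps the number of searches.
        return rule.extract(html, max_searches), None
    except Exception as err:  # assumption: exact exception type is not documented here
        return [], "extraction aborted: {}".format(err)
```

With the real bindings this would be called as extract_untrusted(hext, template, page_source) after import hext.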
Hext for Node
var hext = require('hext');
// hext.Html's constructor expects a single argument
// containing a UTF-8 encoded string of HTML.
var html = new hext.Html(
'<a href="one.html"> <img src="one.jpg" /> </a>' +
'<a href="two.html"> <img src="two.jpg" /> </a>' +
'<a href="three.html"><img src="three.jpg" /></a>');
// hext.Rule's constructor expects a single argument
// containing a Hext snippet.
// Throws an Error on invalid syntax, with
// Error.message containing the error description.
var rule = new hext.Rule('<a href:link>' +
' <img src:image />' +
'</a>');
// hext.Rule.extract expects an argument of type
// hext.Html. Returns an Array containing Objects
// which contain key-value pairs of type String.
var result = rule.extract(html);
// hext.Rule.extract has a second, optional parameter
// of type unsigned int, called max_searches.
// The search for matching elements is aborted by
// throwing an exception after this limit is reached.
// The default is 0, which never aborts. If running
// untrusted Hext templates, it is recommended to set
// max_searches to some high value, like 10000, to
// protect against resource exhaustion.
// var result = rule.extract(html, 10000);
// print each key-value pair
for(var i in result)
{
  for(var key in result[i])
    console.log(key, "->", result[i][key]);
  console.log();
}
Hext for JavaScript
<script src="https://cdn.jsdelivr.net/gh/html-extract/hext.js@v1.0.12/dist/hext.js"></script>
<script>
(function() {
  // loadHext() returns a promise
  loadHext().then(hext => {
    // hext.Html's constructor expects a single argument
    // containing a UTF-8 encoded string of HTML.
    const html = new hext.Html(
      '<a href="one.html">  <img src="one.jpg" />  </a>' +
      '<a href="two.html">  <img src="two.jpg" />  </a>' +
      '<a href="three.html"><img src="three.jpg" /></a>');
    // hext.Rule's constructor expects a single argument
    // containing a Hext snippet.
    // Throws an Error on invalid syntax, with
    // Error.message containing the error description.
    const rule = new hext.Rule('<a href:link>' +
                               '  <img src:image />' +
                               '</a>');
    // hext.Rule.extract expects an argument of type
    // hext.Html. Returns an Array containing Objects
    // which contain key-value pairs of type String.
    const result = rule.extract(html);
    // hext.Rule.extract has a second, optional parameter
    // of type unsigned int, called max_searches.
    // The search for matching elements is aborted by
    // throwing an exception after this limit is reached.
    // The default is 0, which never aborts. If running
    // untrusted Hext templates, it is recommended to set
    // max_searches to some high value, like 10000, to
    // protect against resource exhaustion.
    // const result = rule.extract(html, 10000);
    // print each key-value pair
    for(var i in result)
    {
      for(var key in result[i])
        console.log(key, "->", result[i][key]);
      console.log();
    }
  });
})();
</script>
<script type="module">
import hext from "https://cdn.jsdelivr.net/gh/html-extract/hext.js@v1.0.12/dist/hext.mjs";
const html = new hext.Html("<ul><li>Hello</li><li>World</li></ul>");
const rule = new hext.Rule("<li @text:my_text />");
const result = rule.extract(html).map(x => x.my_text).join(", ");
console.log(result); // "Hello, World"
</script>
const loadHext = require('./hext.js');
loadHext().then(hext => {
const html = new hext.Html("<ul><li>Hello</li><li>World</li></ul>");
const rule = new hext.Rule("<li @text:my_text />");
const result = rule.extract(html).map(x => x.my_text).join(", ");
console.log(result); // "Hello, World"
});
Hext for Ruby
require 'hext'
# Hext::Html's initializer expects a single argument
# containing a UTF-8 encoded string of HTML.
html = Hext::Html.new(<<-'HTML_INPUT')
<a href="one.html"> <img src="one.jpg" /> </a>
<a href="two.html"> <img src="two.jpg" /> </a>
<a href="three.html"><img src="three.jpg" /></a>
HTML_INPUT
# Hext::Rule's initializer expects a single argument
# containing a Hext snippet.
# Raises an exception of type ArgumentError on invalid syntax.
rule = Hext::Rule.new("<a href:link> <img src:image /> </a>")
# Hext::Rule.extract expects an argument of type Hext::Html.
# Returns an Array of Hashes which contain key-value pairs
# of type String.
result = rule.extract(html)
# Hext::Rule.extract has a second, optional parameter
# of type unsigned int, called max_searches.
# The search for matching elements is aborted by
# raising an exception after this limit is reached.
# The default is 0, which never aborts. If running
# untrusted Hext templates, it is recommended to set
# max_searches to some high value, like 10000, to
# protect against resource exhaustion.
# result = rule.extract(html, 10000)
# print each key-value pair
result.each do |map|
  map.each do |key, value|
    puts "#{key}: #{value}"
  end
  puts
end
Hext for PHP
<?php
require 'hext.php';
# HextHtml's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
$html = new HextHtml(
'<a href="one.html"> <img src="one.jpg" /> </a>
<a href="two.html"> <img src="two.jpg" /> </a>
<a href="three.html"><img src="three.jpg" /></a>');
# HextRule's constructor expects a single argument
# containing a Hext snippet.
# Throws an exception on invalid syntax.
$rule = new HextRule('<a href:link>'.
' <img src:image />'.
'</a>');
# HextRule->extract expects an argument of type HextHtml.
# Returns an array containing arrays which contain
# key-value pairs of type string.
$result = $rule->extract($html);
# HextRule->extract has a second, optional parameter
# of type unsigned int, called max_searches.
# The search for matching elements is aborted by
# throwing an exception after this limit is reached.
# The default is 0, which never aborts. If running
# untrusted Hext templates, it is recommended to set
# max_searches to some high value, like 10000, to
# protect against resource exhaustion.
# $result = $rule->extract($html, 10000);
# print each key-value pair
foreach($result as $map)
{
  foreach($map as $key => $value)
    echo "$key: $value\n";
  echo "\n";
}
Building Hext from Source
On a Debian-based distribution
Install the following packages:
- g++ ≥7.3 or clang ≥6.0
- cmake ≥3.8
- libboost-dev ≥1.55
- libboost-regex-dev ≥1.55
- libboost-program-options-dev ≥1.55
- libgumbo-dev ≥0.10.1
- rapidjson-dev ≥1.1.0
Download and extract the latest Hext release and navigate to the
top-level build directory.
Then call cmake and make to build the project.
If all went well you'll find the htmlext binary in the current directory.
wget https://github.com/html-extract/hext/archive/v1.0.12.tar.gz
tar xf *.tar.gz
cd hext*/build
cmake -DBUILD_SHARED_LIBS=On .. && make -j 2
./htmlext --help
If you wish to install Hext on your system, run make install as root.
This will install the htmlext binary, libhext's header files, the
libhext library and package configuration files for CMake.
Run ldconfig so your linker can find the newly installed libhext.
# run as root:
make install
# tell your linker that there's a new library:
ldconfig
Using libhext in a CMake Project
After building and installing Hext, you can add libhext to your project with CMake's find_package command and link against the hext::hext target.
# Load HextConfig.cmake
find_package(Hext)
# Link libhext
target_link_libraries(your-target hext::hext)