# Hext template:
<a href:link @text:title />
<!-- HTML input: -->
<a href="one.html"> Page 1</a>
<a href="two.html"> Page 2</a>
<a href="three.html">Page 3</a>
Output:
{"link": "one.html",   "title": "Page 1"}
{"link": "two.html",   "title": "Page 2"}
{"link": "three.html", "title": "Page 3"}

Hext — Extract Data from HTML

Hext is a domain-specific language for extracting structured data from HTML documents. The documentation explains how to write Hext templates, and the editor below lets you try Hext from the comfort of your browser.

./htmlext

htmlext is a command line utility that applies Hext templates to HTML documents and outputs JSON. For example, to extract all links:
$ htmlext -s "<a href:x />" -i <(curl "example.com")

libhext

libhext is a C++ library that contains a Hext parser but also allows for some customization. Find out more in libhext's documentation. There are language bindings for Python, Node, JavaScript, Ruby and PHP.
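
As a rough sketch of what using one of these bindings might look like, the snippet below applies the link template from the top of this page through the Python binding. It assumes the binding is installed as the `hext` package (for example with `pip install hext`) and that it exposes `hext.Rule`, `hext.Html` and `Rule.extract`. The helper `extract_links` and the two constants are names made up for this illustration.

```python
# Sketch only: assumes the Python binding is installed as the `hext` package.
HEXT_TEMPLATE = "<a href:link @text:title />"

HTML_INPUT = """
<a href="one.html"> Page 1</a>
<a href="two.html"> Page 2</a>
<a href="three.html">Page 3</a>
"""


def extract_links(html_text, template=HEXT_TEMPLATE):
    """Apply a Hext template to an HTML string and return the captured rows."""
    # Imported lazily so this module still loads where the binding is missing.
    import hext

    rule = hext.Rule(template)       # parse the Hext template
    document = hext.Html(html_text)  # parse the HTML input
    return rule.extract(document)    # one dictionary per matched element
```

Calling `extract_links(HTML_INPUT)` should then yield one dictionary per matched link, with the keys `link` and `title`, mirroring the JSON output shown at the top of this page.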

Free Software

Hext is released under the terms of the Apache License and is therefore suitable for inclusion in both open and closed source software. The project is publicly available on GitHub. Contributions are welcome!

Try Hext in your Browser!

Hext

# Extract links and their text
<a href:link @text:title />

HTML

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Example</title>
  </head>
  <body>
    <a href="one.html"> Page 1</a>
    <a href="two.html"> Page 2</a>
    <a href="three.html">Page 3</a>
  </body>
</html>