Hext

# Hext template:
<a href:link @text:title />

<!-- HTML input: -->
<a href="one.html">  Page 1</a>
<a href="two.html">  Page 2</a>
<a href="three.html">Page 3</a>

Output:
{"link": "one.html",   "title": "Page 1"}
{"link": "two.html",   "title": "Page 2"}
{"link": "three.html", "title": "Page 3"}
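
To make the template's behaviour concrete, here is a rough Python sketch that mimics it using only the standard library's html.parser. It is not how Hext is implemented. The names AnchorExtractor and extract_links are made up for illustration, and the sketch assumes, as the output above suggests, that @text yields the element's text with surrounding whitespace trimmed.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Hypothetical helper: collects href and trimmed text of each <a href=...>."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None   # href of the anchor currently being read, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href are skipped, like <a href:link ...> would.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.results.append({"link": self._href, "title": title})
            self._href = None


def extract_links(html):
    """Return one dict per matching anchor, in document order."""
    parser = AnchorExtractor()
    parser.feed(html)
    parser.close()
    return parser.results


HTML_INPUT = """
<a href="one.html">  Page 1</a>
<a href="two.html">  Page 2</a>
<a href="three.html">Page 3</a>
"""

results = extract_links(HTML_INPUT)
for row in results:
    print(row)
```

The point of the comparison: the one-line Hext template replaces all of the state tracking above, since each capture (href:link, @text:title) simply names the key under which a value is stored.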

Hext is a domain-specific language for extracting structured data from HTML documents. The documentation explains how to write Hext templates, and an in-browser editor lets you try Hext from the comfort of your browser.

htmlext is a command-line utility that applies Hext templates to HTML documents and outputs JSON. For example, to extract all links:

$ htmlext -s "<a href:x />" -i <(curl "example.com")
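
The same invocation can be driven from a script. The following is a minimal sketch, assuming htmlext is installed and on your PATH. It uses only the -s and -i options shown above, and the helper names build_htmlext_command and run_htmlext are hypothetical. The output is returned as raw text, because the exact JSON layout htmlext prints is not assumed here.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_htmlext_command(template, html_path):
    """Build the argument list for: htmlext -s TEMPLATE -i FILE."""
    return ["htmlext", "-s", template, "-i", str(html_path)]


def run_htmlext(template, html):
    """Write html to a temporary file, run htmlext on it, return its stdout."""
    with tempfile.TemporaryDirectory() as tmp:
        html_path = Path(tmp) / "input.html"
        html_path.write_text(html, encoding="utf-8")
        completed = subprocess.run(
            build_htmlext_command(template, html_path),
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError if htmlext reports a failure
        )
    return completed.stdout


if __name__ == "__main__":
    if shutil.which("htmlext") is None:
        print("htmlext not found on PATH; nothing to run")
    else:
        print(run_htmlext("<a href:x />", '<a href="one.html">Page 1</a>'))
```

Passing the arguments as a list avoids shell quoting problems with the angle brackets and quotes that Hext templates contain.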

libhext is a C++ library that contains the Hext parser and also allows for some customization. Find out more in libhext's documentation. Language bindings are available for Python, Node, JavaScript, Ruby and PHP.
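
As an illustration of the bindings, here is a short sketch using the Python one. It assumes the hext package is installed (for example via pip) and that it exposes hext.Rule, hext.Html and Rule.extract as in the project's published Python examples. Check the binding's own documentation before relying on the details. The import guard is only there so the snippet does not crash where the package is missing.

```python
try:
    import hext  # assumption: installed with "pip install hext"
except ImportError:
    hext = None

TEMPLATE = "<a href:link @text:title />"

HTML_INPUT = """
<a href="one.html">  Page 1</a>
<a href="two.html">  Page 2</a>
<a href="three.html">Page 3</a>
"""


def extract(template, html):
    """Apply a Hext template to an HTML string; returns a list of dicts."""
    rule = hext.Rule(template)      # parses the Hext template
    document = hext.Html(html)      # parses the HTML input
    return rule.extract(document)   # one dict per match


if hext is not None:
    for row in extract(TEMPLATE, HTML_INPUT):
        print(row)
else:
    print("hext is not installed; skipping the example")
```

With the package installed, this should print one dictionary per anchor, each with the keys link and title.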

Hext is released under the terms of the Apache License and is therefore suitable for inclusion in both open-source and closed-source software. The project is publicly available on GitHub. Contributions are welcome!
