Hext — Documentation

Suppose you want to extract all hyperlinks from a web page. A hyperlink consists of an anchor tag <a>, an attribute called href, and text that visitors can click. The Hext template below extracts exactly that. Let's break it down piece by piece.

# Extract links and their text
<a href:link @text:title />


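In this template, <a matches every anchor element, href:link captures the value of the href attribute under the name link, and @text:title captures the element's text under the name title.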

If this rule matches an HTML element it will produce key-value pairs, where the key is the name of the capture and the value contains the extracted content:

<!-- Html input:            -->
<a href="one.html">  Page 1</a>
<a href="two.html">  Page 2</a>
<a href="three.html">Page 3</a>

Hext's output:
{"link": "one.html",   "title": "Page 1"}
{"link": "two.html",   "title": "Page 2"}
{"link": "three.html", "title": "Page 3"}
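
For comparison, here is a rough Python sketch of what this one-line template does, using only the standard library's html.parser. It is not how Hext works internally: the class name LinkCollector is made up for illustration, and the whitespace handling only approximates @text.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href and text of every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._current = None  # the link currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # like the rule <a href:link ... />, require an href attribute
        if tag == "a" and "href" in attrs:
            self._current = {"link": attrs["href"], "title": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # approximate @text: trim and collapse whitespace
            self._current["title"] = " ".join(self._current["title"].split())
            self.results.append(self._current)
            self._current = None


html = """
<a href="one.html">  Page 1</a>
<a href="two.html">  Page 2</a>
<a href="three.html">Page 3</a>
"""

collector = LinkCollector()
collector.feed(html)
collector.close()
for item in collector.results:
    print(item)
```

The Hext template expresses the same extraction in a single line, which is the point of the comparison.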

Hext's strength is that a template looks like the HTML you want to extract data from. Hext templates can be thought of as the counterpart to web templates: web developers typically use templates to embed data in HTML, which is why most content on the web has some sort of structure, and Hext takes advantage of that structure.

The template below collects all submissions from a Hacker News listing. For example, if applied to news.ycombinator.com/newest, you'll get a list of the most recent submissions, with each item containing the rank, title, href, score, user and comment_count.

<tr class="athing">
  { <span class="rank" @text:rank /> }
  { <span class="titleline">
      <a href:href @text:title />
    </span> }
</tr>
<?tr> {
  <span class="subline">
    <span class="score" @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </span>
} </tr>

There are multiple ways to use Hext:

The htmlext command line utility applies Hext to HTML documents and returns JSON containing the captured data.
The libhext library contains a Hext parser but also lets you build a Hext rule tree manually.
The Hext language bindings: Python, Node, JavaScript, Ruby, PHP

Also, you can test Hext in the "Try Hext in your Browser" section: Just paste any HTML from the web into the editor and have a go!

The Hext template below matches <a> elements containing at least two child elements: a <span> followed by an <img>.
Notice how only the first <span> is captured as id in the example below. This is because rules take turns matching elements. The second <span> is tested by the image rule, which doesn't match, so the element is skipped. The image rule then continues with <img>, which matches. Then the first rule (id) takes over again. It does match the third <span>, but the result is discarded because the second rule (image) finds no match: there are no more elements left.

<a href:link>
  <span @text:id />
  <img src:image />
</a>


<!-- Html input: -->
<a href="/coffee">
  <span>#1</span>
  <span>Coffee</span>
  <img src="coffee.jpg" />
  <span>turns nights into code</span>
</a>
<a href="/beer">
  <span>#2</span>
  <span>Beer</span>
  <img src="beer.jpg" />
  <span>improves dance skill by 70%</span>
</a>

Hext's output:
{
  "id": "#1",
  "image": "coffee.jpg",
  "link": "/coffee"
}

{
  "id": "#2",
  "image": "beer.jpg",
  "link": "/beer"
}
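
The turn-taking described above can be modelled with a short Python sketch. This is a deliberately simplified model, not Hext's actual algorithm: rules are reduced to plain tag names, every rule is mandatory, and the function name match_groups is made up for illustration.

```python
def match_groups(rule_tags, sibling_tags):
    """Return one list of element indices per complete match of all rules.

    Simplified model: a rule is just a tag name and all rules are mandatory.
    """
    groups = []
    if not rule_tags:
        return groups
    pos = 0
    while True:
        group = []
        for rule in rule_tags:
            # skip elements until the current rule matches
            while pos < len(sibling_tags) and sibling_tags[pos] != rule:
                pos += 1
            if pos == len(sibling_tags):
                # no elements left: the incomplete group is discarded
                return groups
            group.append(pos)
            pos += 1
        groups.append(group)


# children of one <a> element from the example input
children = ["span", "span", "img", "span"]
print(match_groups(["span", "img"], children))
```

In this model the first <span> and the <img> form the only complete group. The trailing <span> starts a second group that is thrown away because no <img> follows it.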

A Hext template is a set of rules that is matched against an HTML document. Each HTML element is tested by the first rule of a Hext template. A rule begins with either an HTML tag or an asterisk which matches every HTML element.

# match every element with any tag
<* />
# match every <div> element
<div />

Rules may have children. If a rule matches an HTML element, the rule's children are matched against the element's children, and must each produce at least one match.

# match <a> elements that have at
# least one child element <img>
<a><img /></a>

Rules may have siblings. If a rule matches an HTML element, all the rule's siblings are matched against the element's siblings. While adjacency is not required, matching elements must still appear in the same order as their rule counterparts.

# match <h1> elements that are followed
# by a <p> element
<h1/><p/>

Rules may have nested rules. If a rule matches an HTML element, all the rule's nested rules are matched against the element's inner HTML. A nested rule can find a match at any depth.

# match <div> elements that have a
# descendant <a> at any depth
<div> { <a/> } </div>

Rules may be optional. A Hext template only matches if every rule finds a match, unless a rule is marked optional with a question mark. Optional rules are simply skipped if no match is found. Mandatory rules always take precedence over optional rules.

# match <h1>, optionally followed by <time>,
# followed by a paragraph
<h1/><?time/><p/>

Rules may be greedy. A rule marked with a plus sign does not stop at the first match; it keeps searching for matches until a mandatory rule takes precedence or until there are no more elements left.
A greedy rule can also be made optional, in which case it matches zero or more elements.

# match <h1>, followed by at least one <p>
<h1/><+p />
# match <h1>, followed by zero or more <p>
<h1/><?+p />

Rules may contain match patterns. Match patterns further refine which HTML elements to match. There are three kinds of match patterns: Attribute matches, Built-in Function matches and Element Traits.
Attributes may be compared against a string or a regular expression. There are six match operators that determine the type of comparison: = contains word, *= contains, ^= begins with, $= ends with, == identical and =~/regex/. Regular expressions may also be embedded in quotes, e.g. =~"\d+", which is particularly useful when matching URLs.
Match patterns may be negated by an exclamation mark.

# match <a> having attribute "name"
<a name />
# match <a> having attribute "href"
# beginning with "https:"
<a href^="https:" />
# match <a> having attribute href containing
# the string "post-" followed by a number
<a href=~/post-\d+/ />
# match <a> not having attribute "href" beginning
# with "https:"
<a href^="https:"! />

Built-in functions turn an HTML element into a string. As with attributes, match operators may be used to test for specific contents. There are three built-in functions: @text, @inner-html and @strip-tags. @text is the most powerful, as it turns an HTML element into clean and readable text.

# match <h1> whose content starts with "News: "
<h1 @text^="News: " />
# match <h1> whose content matches a case
# insensitive regex
<h1 @text=~/section \d+.\d+/i />

Element traits describe the position of an HTML element relative to its parent, or general properties such as its number of children or attributes. The intent is to replicate CSS pseudo-classes.
Traits may be chained; an element must then match every trait in the chain.
:not(traits) negates element traits.

# match <li> elements that are the first child of their parent
<li:first-child />
# match <li> elements that are the first and the last child of
# their parent (i.e. the only child)
<li:first-child:last-child />
# match every third <li>
<li:nth-child(3n) />
# match <li> that are not the first child
<li:not(:first-child) />

Rules may contain captures. After a successful and complete match of a Hext template the data is extracted. There are two possible sources for data: HTML attributes and the result of built-in functions. Each capture must be named.

# match <a> having attribute "href" and store
# the attribute's value as "link"
<a href:link />
# match <article> and store its content as "content"
<article @text:content />

Attribute captures may be suffixed by a question mark, indicating that the attribute is optional, i.e. a matching element may have this attribute, but doesn't need to.

# match <a> having attribute "href" and optionally
# having attribute "name", store their values in
# "link" and "anchor"
<a href:link name:anchor? />

Captures may contain string pipes. String pipes transform the content of an HTML attribute or the result of a built-in function.
String pipes may be chained.

# match <a> having attribute "href", prepend
# "https://example.com/" to its value and store
# it as "url"
<a href:prepend("https://example.com/"):url />

A particularly useful string pipe is :filter. It transforms a string according to a regular expression. This regular expression may contain a capture group which isolates a result from the rest of the match.

# extract the first number, store as first_num
<h1 @text:filter(/[0-9]+/):first_num />
# extract digits only, store as phone_nr
<div @text:filter(/Phone: ([0-9]+)/):phone_nr />

Hext Reference

Element Traits

:empty
:child-count(<amount>)

Select elements that have a certain number of children.
Text nodes are not considered to be children, i.e. an element that only contains text is considered to be empty.

# match: <div></div>, <div>text</div>, ..
<div:empty />
# match: <div><a></a><b></b></div>,
#        <div>Children: <a></a><b></b></div>, ..
<div:child-count(2) />

:attribute-count(<amount>)

Select elements that have a certain number of attributes.

# match: <div></div>, <div><a></a></div>
<div:attribute-count(0) />
# match: <div class="" style="" ></div>
#        <b data-a="" data-b=""></b>
<*:attribute-count(2) />

:not(<:trait>)

Select elements that don't match certain traits.

# match all elements that are not empty
<*:not(:empty) />
# match all elements that are not empty and
# do not have an attribute count of two
<*:not(:empty:attribute-count(2)) />

:type-matches(<regex>)

Select elements with tag names that match the given regex.

# match paragraphs or headings
<*:type-matches(/^(p)$|^(h[1-6])$/) />
# match <CustomTag> or <custom-tag>
<*:type-matches(/^custom[-]?tag$/i) />

:first-child
:first-of-type
:last-child
:last-of-type
:only-child
:nth-child(<nth-pattern>)
:nth-of-type(<nth-pattern>)
:nth-last-child(<nth-pattern>)
:nth-last-of-type(<nth-pattern>)
:only-of-type

Select elements by position within their parent.
<nth-pattern> may be given in the form of an+b or as the shorthands even and odd.
See MDN for a detailed description.

# <ul>
#   <li>These pseudo</li>  <!-- match -->
#   <li>classes     </li>
#   <li>work just   </li>  <!-- match -->
#   <li>like they   </li>
#   <li>do in CSS   </li>  <!-- match -->
# </ul>
<li:nth-child(2n+1) />
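
To make the an+b form concrete, the following Python sketch tests whether a 1-based child position is selected by a pattern. It follows the CSS definition of an+b (n counts up from zero). The function name matches_nth is made up for illustration and is not part of Hext.

```python
def matches_nth(position, a, b):
    """True if the 1-based position equals a*n + b for some integer n >= 0."""
    if a == 0:
        return position == b
    n, remainder = divmod(position - b, a)
    return remainder == 0 and n >= 0


# 2n+1 (same as "odd") selects the 1st, 3rd and 5th of five children
print([p for p in range(1, 6) if matches_nth(p, 2, 1)])
# 3n selects every third child
print([p for p in range(1, 10) if matches_nth(p, 3, 0)])
```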

Match Operators

="<string>"

Matches subjects that contain all of the given words in any order. Word boundaries are the beginning and end of the subject or spaces.

# match: class="item",
#        class="item menu", ..
# does not match: class="first-item",
#                 class="menuitem", ..
<* class="item" />
# match: class="article sub head",
#        class="head article", ..
# does not match: class="particle head",
#                 class="article-head", ..
<* class="article head" />
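
The word-matching rule can be paraphrased in a few lines of Python. This is only a sketch of the behavior described above, assuming spaces are the only word separators. The function name contains_words is hypothetical.

```python
def contains_words(subject, words):
    """True if every word in `words` appears as a whole word in `subject`."""
    subject_words = subject.split(" ")
    return all(word in subject_words for word in words.split(" ") if word)


print(contains_words("item menu", "item"))            # whole word present
print(contains_words("first-item", "item"))           # only part of a word
print(contains_words("head article", "article head")) # order does not matter
```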

*="<string>"

Matches subjects that contain the given string.

# match: href="http://youtube.com/",
#        href="youtube", ..
<* href*="youtube" />

^="<string>"

Matches subjects that begin with the given string.

# match: <p>Hello, this is HAL</p>,
#        <p>Hello</p>, ..
# does not match: <p>hello</p>,
#                 <p>Oh, Hello</p>, ..
<* @text^="Hello" />

$="<string>"

Matches subjects that end with the given string.

# match: href="igel.jpg", href="franz.jpg", ..
<* href$=".jpg" />

=="<string>"

Matches subjects that are equal to the given string.

# match: class="left aligned list"
# does not match anything else
<* class=="left aligned list" />

=~/<regex>/[opt]
=~"<regex>"[opt]

Matches subjects that match the given regular expression.
There are two options for regular expressions:
i: case insensitive and c: collate (locale aware character groups)

# match: class="menuItem-23",
#        class="menuItem-42-23", ..
<* class=~/item-\d+/i />

String Pipes

:trim
:trim("<characters>")

Trims characters from the beginning and the end of a string. Trims spaces by default. If given an argument, trims all given characters. Does not handle Unicode.

# trim all left and right spaces
<* title:trim:name />
# trim all left and right spaces and dashes
<* title:trim(" -"):name />

:collapsews

Trims whitespace from beginning and end and collapses multiple whitespace to a single space.

# Turns this:
#   <a title="  Lots   of spaces in this   title">
# Into this:
#   "Lots of spaces in this title"
<* title:collapsews:name />

:tolower
:toupper

Changes all characters to lower or upper case. Does not handle Unicode.

<a title:toupper:link_title />
<a title:tolower:link_title />

:prepend("<string>")
:append("<string>")

Prepends or appends a given string.

# turn relative URLs into absolute URLs
<a href:prepend("https://example.com/"):url />
# append foo
<a title:append(" ..and foo!"):title />

:filter(/<regex>/[opt])

Filters a string according to a given regex. A regex containing a capture group will produce only the matched content of that capture group, otherwise the whole regex match is returned. All capture groups after the first one will be ignored.
There are two options for regular expressions:
i: case insensitive and c: collate (locale aware character groups)

# save the trailing number contained in attr. href
<a href:filter(/\d+$/):user_id />
# save the number after "post-"
<a href:filter(/post-(\d+)/):post_id />
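
The capture-group behavior of :filter can be imitated with Python's re module. This is a sketch, not Hext's implementation: the function name filter_pipe is made up, Python's regex dialect may differ from Hext's in details, and returning an empty string when nothing matches is an assumption.

```python
import re


def filter_pipe(pattern, subject, flags=0):
    """Return the first capture group if the regex has one, else the whole match."""
    match = re.search(pattern, subject, flags)
    if match is None:
        return ""  # assumption: no match yields an empty string
    if match.re.groups >= 1:
        # capture groups after the first one are ignored
        return match.group(1) or ""
    return match.group(0)


print(filter_pipe(r"\d+$", "/user/42"))             # no group: whole match
print(filter_pipe(r"post-(\d+)", "/post-17/edit"))  # group: only its content
```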

:replace(/<regex>/[opt], "<string>")

Replaces every portion matched by the given regex with a string. Backreferences such as $1 can be used to address capture groups.
There are two options for regular expressions:
i: case insensitive and c: collate (locale aware character groups)

# replace all instances of foo with bar
<p @text:replace(/foo/, "bar"):bar_text />
# remove all numbers
<p @text:replace(/\d+/, ""):nonum />
# use capture groups and backreferences
<h1 @text:replace(/(\d+): (.*)/, "$2 ($1)"):head />
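
For readers more familiar with Python, the three examples above correspond roughly to re.sub calls. The sketch below translates $1-style backreferences into Python's \g<1> form. The helper name replace_pipe is hypothetical, and literal backslashes in the replacement string are not handled.

```python
import re


def replace_pipe(pattern, replacement, subject):
    """Replace all regex matches, accepting $1-style backreferences."""
    # turn "$2 ($1)" into "\g<2> (\g<1>)", which re.sub understands
    py_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return re.sub(pattern, py_replacement, subject)


print(replace_pipe(r"foo", "bar", "foo and foo"))
print(replace_pipe(r"\d+", "", "a1b22c"))
print(replace_pipe(r"(\d+): (.*)", "$2 ($1)", "1: Introduction"))
```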

Built-in Functions

@text

Returns the inner text. Trims leading and trailing whitespace and collapses runs of whitespace to a single space. The content of some elements is surrounded by spaces (basically all non-inline elements, like <div> or <h1>).
The intent is to mimic jQuery's text() or the DOM properties innerText and textContent.
Does not strip metadata content.

# Turns this:
#   <article>
#     <h1>   Which   Highway?</h1>Highway 61!
#   </article>
# Into this:
#   "Which Highway? Highway 61!"
<article @text:content />
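
The behavior described for @text can be approximated with Python's standard html.parser. This is a sketch under stated assumptions, not Hext's implementation: BLOCK_TAGS is an assumed and incomplete list of non-inline elements, and the names TextExtractor and inner_text are made up for illustration.

```python
from html.parser import HTMLParser

# assumed, incomplete list of non-inline elements
BLOCK_TAGS = {"article", "div", "h1", "h2", "h3", "li", "p", "section"}


class TextExtractor(HTMLParser):
    """Gathers text and surrounds non-inline elements with spaces."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def inner_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # trim both ends and collapse runs of whitespace to a single space
    return " ".join("".join(parser.parts).split())


print(inner_text("<article>\n  <h1>   Which   Highway?</h1>Highway 61!\n</article>"))
```

Without the spaces added around <h1>, the heading and the following text would run together as "Highway?Highway", which is why non-inline elements get special treatment.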

@inner-html

Serializes the inner HTML to a string.

# Turns this:
#   <div><h1>Which Highway?</h1> Highway 61! </div>
# Into this:
#   "<h1>Which Highway?</h1> Highway 61! "
# And the filter turns it into this:
#   "Highway 61"
<div @inner-html:filter(/Highway \d+/):number />

@strip-tags

Returns the inner HTML with all tags removed.

# Turns this:
#   <div><h1>Which Highway?</h1>  Highway 61!</div>
# Into this:
#   "Which Highway?  Highway 61! "
<div @strip-tags:content />
