Starting with an Example

Suppose you want to extract all hyperlinks from a web page. A hyperlink consists of an anchor tag <a>, an attribute called href, and text that visitors can click. The Hext snippet below extracts exactly that. Let's break it down.
# Extract links and their text
<a href:link @text:title />
If this rule matches an HTML element, it produces key-value pairs in which the key is the name of the capture and the value is the extracted content:
<!-- Html input: -->
<a href="one.html"> Page 1</a>
<a href="two.html"> Page 2</a>
<a href="three.html">Page 3</a>
Hext's output:
{"link": "one.html",   "title": "Page 1"}
{"link": "two.html",   "title": "Page 2"}
{"link": "three.html", "title": "Page 3"}
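To make the mapping from rule to output concrete, here is a rough Python equivalent of what the rule above does, using only the standard library's html.parser. It is a sketch of the behavior, not of how Hext works internally; the names LinkCollector and extract_links are made up for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href and text of every <a> element (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        # Like the rule, only consider <a> elements that carry an href.
        if tag == "a" and href is not None:
            self._href = href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Mimic @text: trim and collapse whitespace.
            title = " ".join("".join(self._text).split())
            self.results.append({"link": self._href, "title": title})
            self._href = None


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.results


html = """
<a href="one.html"> Page 1</a>
<a href="two.html"> Page 2</a>
<a href="three.html">Page 3</a>
"""
for row in extract_links(html):
    print(row)
```

The Hext rule expresses the same thing in a single line, which is the point of the comparison.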

The Big Picture

Hext's strength is that a snippet looks like the HTML you want to extract data from. Hext snippets can be thought of as the counterpart to web templates: web developers typically use templates to embed data in HTML, so most content on the web has some sort of structure, which works to Hext's advantage.

The snippet below collects all YouTube videos from a YouTuber's channel page. For example, if applied to https://www.youtube.com/user/CppCon/videos, you'll get a list of the most recent talks at CppCon, with each item containing the duration, title, link, view_count and date_published.

There are currently three ways to use Hext:

Also, you can test Hext in the "Try Hext in your Browser" section: Just paste any HTML from the web into the editor and have a go!

<div class="yt-lockup-thumbnail">
<span>
<span class="video-time" @text:duration />
</span>
</div>
<div class="yt-lockup-content">
<h3>
<a @text:title
href:prepend("https://youtube.com"):link />
</h3>
<div class="yt-lockup-meta">
<ul>
<li @text:filter(/[^ ]+/):view_count />
<li @text:date_published />
</ul>
</div>
</div>

Another Example

The Hext snippet below matches <a> elements containing at least two elements: a <span> followed by an <img>.
Notice how only the first <span> is matched for id in the example below. This is because rules take turns matching elements. The second <span> is tested by the rule for image, which doesn't match, so the element is skipped. That rule then continues with <img>, which matches. Then the first rule (id) takes over again. It does match the third <span>, but the result is discarded because the second rule (image) finds no match: there are no more elements left.
<a href:link>
<span @text:id />
<img src:image />
</a>
<!-- Html input: -->
<a href="/coffee">
<span>#1</span>
<span>Coffee</span>
<img src="coffee.jpg" />
<span>turns nights into code</span>
</a>
<a href="/beer">
<span>#2</span>
<span>Beer</span>
<img src="beer.jpg" />
<span>improves dance skill by 70%</span>
</a>
Hext's output:
{
  "id": "#1",
  "image": "coffee.jpg",
  "link": "/coffee"
}

{
  "id": "#2",
  "image": "beer.jpg",
  "link": "/beer"
}
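The turn-taking described above can be modelled in a few lines of Python. This is a deliberately simplified sketch, assuming flat siblings that are matched by tag name only; it ignores nesting, optional rules and match patterns, and match_groups is a hypothetical helper, not part of Hext.

```python
def match_groups(rule_tags, children):
    """Match sibling rules against child elements in order.

    rule_tags: tag names of the sibling rules, e.g. ["span", "img"].
    children:  (tag, value) pairs in document order.
    Returns one list of matched values per complete pass through the rules.
    """
    groups = []
    current = []
    rule_index = 0
    for tag, value in children:
        if tag == rule_tags[rule_index]:
            current.append(value)
            rule_index += 1
            if rule_index == len(rule_tags):
                # Every rule found its match: keep the group, start over.
                groups.append(current)
                current = []
                rule_index = 0
        # Otherwise the element is skipped; the same rule tries the next one.
    # A group left incomplete when the elements run out is discarded.
    return groups


coffee = [
    ("span", "#1"),
    ("span", "Coffee"),
    ("img", "coffee.jpg"),
    ("span", "turns nights into code"),
]
print(match_groups(["span", "img"], coffee))  # [['#1', 'coffee.jpg']]
```

The trailing <span> starts a second group, but since no <img> follows, that group never completes and is dropped.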

How Hext Matches Elements

A Hext snippet is a set of rules that is matched against an HTML document. Each HTML element is tested by the first rule of a Hext snippet. A rule begins with either a valid HTML tag or an asterisk which matches every HTML element.

# match every element with any tag
<* />
# match every <div> element
<div />

Rules may have children. If a rule matches an HTML element, the rule's children are matched against the element's children, and must each produce at least one match.

# match <a> elements that have at
# least one child element <img>
<a><img /></a>

Rules may have siblings. If a rule matches an HTML element, all the rule's siblings are matched against the element's siblings. While adjacency is not required, matching elements must still appear in the same order as their rule counterparts.

# match <h1> elements that are followed
# by a <p> element
<h1/><p/>

Rules may be optional. A Hext snippet only matches if each rule finds its match, unless a rule is marked optional with a question mark. Optional rules are simply skipped if no match is found. Mandatory rules always take precedence over optional rules.

# match <h1>, optionally followed by <time>,
# followed by a paragraph
<h1/><?time/><p/>

Rules may contain match patterns. Match patterns further refine which HTML elements to match. There are three kinds of match patterns: attribute matches, built-in function matches and element traits.
Attributes may be compared against a string or a regular expression. There are six match operators that determine the type of comparison: = contains word, *= contains, ^= begins with, $= ends with, == identical and =~/regex/. Regular expressions may also be embedded in quotes, e.g. =~"\d+", which is particularly useful when matching URLs.
Match patterns may be negated by an exclamation mark.

# match <a> having attribute "name"
<a name />
# match <a> having attribute "href"
# beginning with "https:"
<a href^="https:" />
# match <a> having attribute href containing
# the string "post-" followed by a number
<a href=~/post-\d+/ />
# match <a> not having attribute "href" beginning
# with "https:"
<a href^="https:"! />

Built-in functions turn an HTML element into a string. As with attributes, match operators may be used to test for specific contents. There are three built-in functions: @text, @inner-html and @strip-tags. @text is the most powerful, as it turns an HTML element into clean and readable text.

# match <h1> whose content starts with "News: "
<h1 @text^="News: " />
# match <h1> whose content matches a case
# insensitive regex
<h1 @text=~/section \d+\.\d+/i />

Element traits describe the position of an HTML element relative to its parent, or general properties such as its number of children or attributes. The intent is to replicate CSS pseudo-classes.
Chained traits require elements to match each trait.
:not(traits) negates element traits.

# match <li> that are the first child of their parent
<li:first-child />
# match <li> that are the first and the last child of
# their parent (i.e. the only child)
<li:first-child:last-child />
# match every third <li>
<li:nth-child(3n) />
# match <li> that are not the first child
<li:not(:first-child) />

How Hext Captures Data

Rules may contain captures. After a successful and complete match of a Hext snippet the data is extracted. There are two possible sources for data: HTML attributes and the result of built-in functions. Each capture must be named.

# match <a> having attribute "href" and store
# the attribute's value as "link"
<a href:link />
# match <article> and store its content as "content"
<article @text:content />

Attribute captures may be suffixed by a question mark, indicating that the attribute is optional, i.e. a matching element may have this attribute, but doesn't need to.

# match <a> having attribute "href" and optionally
# having attribute "name", store their values in
# "link" and "anchor"
<a href:link name:anchor? />

Captures may contain string pipes. String pipes transform the content of an HTML attribute or the result of a built-in function.
String pipes may be chained.

# match <a> having attribute "href", prepend
# "https://example.com/" to its value and store
# it as "url"
<a href:prepend("https://example.com/"):url />
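Chaining can be pictured as function composition: each pipe receives the output of the one before it, left to right. The Python below is a minimal sketch under that assumption; trim, prepend and apply_pipes are illustrative stand-ins, not Hext's API.

```python
from functools import reduce


def prepend(prefix):
    """Build a pipe that puts `prefix` in front of its input."""
    return lambda value: prefix + value


def trim(value):
    """Pipe that removes spaces from both ends."""
    return value.strip(" ")


def apply_pipes(value, pipes):
    """Feed value through each pipe from left to right."""
    return reduce(lambda acc, pipe: pipe(acc), pipes, value)


# Roughly what href:trim:prepend("https://example.com/"):url would do
url = apply_pipes(" about.html ", [trim, prepend("https://example.com/")])
print(url)  # https://example.com/about.html
```

Order matters: trimming after prepending would leave the inner space in place.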

A particularly useful string pipe is :filter. It transforms a string according to a regular expression. This regular expression may contain a capture group which isolates a result from the rest of the match.

# extract the first number, store as first_num
<h1 @text:filter(/[0-9]+/):first_num />
# extract digits only, store as phone_nr
<div @text:filter(/Phone: ([0-9]+)/):phone_nr />

Limitations

Hext aims to make simple extractions easy; if you have bigger problems you probably need bigger tools ¯\_(ツ)_/¯


Element Traits

Element traits describe the position of an HTML element relative to its parent, or general properties such as its number of children or attributes. The intent is to replicate CSS pseudo-classes.
Chained traits require elements to match each trait.

:empty
:child-count(<amount>)

Select elements that have a certain number of children.
Text nodes are not considered to be children, i.e. an element that only contains text is considered to be empty.

# match: <div></div>, <div>text</div>, ..
<div:empty />
# match: <div><a></a><b></b></div>,
# <div>Children: <a></a><b></b></div>, ..
<div:child-count(2) />
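The rule that text nodes do not count as children mirrors how Python's xml.etree.ElementTree counts sub-elements, which makes for a convenient way to check one's intuition. This is only an analogy on well-formed markup; child_count and is_empty are hypothetical helper names.

```python
import xml.etree.ElementTree as ET


def child_count(markup):
    """Number of child elements; text nodes are not counted."""
    return len(ET.fromstring(markup))


def is_empty(markup):
    """True if the element has no child elements (text is ignored)."""
    return child_count(markup) == 0


print(is_empty("<div></div>"))                              # True
print(is_empty("<div>text</div>"))                          # True
print(child_count("<div>Children: <a></a><b></b></div>"))   # 2
```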

:attribute-count(<amount>)

Select elements that have a certain number of attributes.

# match: <div></div>, <div><a></a></div>
<div:attribute-count(0) />
# match: <div class="" style="" ></div>
# <b data-a="" data-b=""></b>
<*:attribute-count(2) />

:not(<:trait>)

Select elements that do not match certain traits.

# match all elements that are not empty
<*:not(:empty) />
# match all elements that are not empty and
# do not have an attribute count of two
<*:not(:empty:attribute-count(2)) />

:first-child
:first-of-type
:last-child
:last-of-type
:only-child
:nth-child(<nth-pattern>)
:nth-of-type(<nth-pattern>)
:nth-last-child(<nth-pattern>)
:nth-last-of-type(<nth-pattern>)
:only-of-type

Select elements by position within their parent.
<nth-pattern> may be given in the form of an+b or as the shorthands even and odd.
See MDN for a detailed description.

# <ul>
# <li>These pseudo</li> <!-- match -->
# <li>classes </li>
# <li>work just </li> <!-- match -->
# <li>like they </li>
# <li>do in CSS </li> <!-- match -->
# </ul>
<li:nth-child(2n+1) />
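For readers unfamiliar with the an+b notation, the following Python sketch shows which 1-based positions a pattern selects, following the CSS definition (a position matches if it equals a*n + b for some integer n >= 0). The functions matches_nth and select are illustrative names only.

```python
def matches_nth(position, a, b):
    """True if the 1-based position equals a*n + b for some integer n >= 0."""
    if a == 0:
        return position == b
    n, remainder = divmod(position - b, a)
    return remainder == 0 and n >= 0


def select(count, a, b):
    """Positions from 1..count selected by the pattern an+b."""
    return [p for p in range(1, count + 1) if matches_nth(p, a, b)]


print(select(5, 2, 1))  # [1, 3, 5]  (2n+1, same as odd)
print(select(6, 2, 0))  # [2, 4, 6]  (2n, same as even)
print(select(7, 3, 0))  # [3, 6]     (3n)
```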

Match Operators

Match operators compare an attribute's value or the output of a built-in function against a string or a regular expression.

="<string>"

Matches subjects that contain all of the given words in any order. Word boundaries are the beginning and end of the subject or spaces.

# match: class="item",
# class="item menu", ..
# does not match: class="first-item",
# class="menuitem", ..
<* class="item" />
# match: class="article sub head",
# class="head article", ..
# does not match: class="particle head",
# class="article-head", ..
<* class="article head" />
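The word semantics of this operator can be restated as a small Python function. It is a sketch of the rule as described above (space-separated words, any order), with contains_words being a made-up name.

```python
def words_of(text):
    """Split on spaces, dropping empty pieces caused by repeated spaces."""
    return [word for word in text.split(" ") if word]


def contains_words(subject, wanted):
    """True if every word in `wanted` occurs as a whole word in `subject`."""
    subject_words = set(words_of(subject))
    return all(word in subject_words for word in words_of(wanted))


print(contains_words("item menu", "item"))                 # True
print(contains_words("first-item", "item"))                # False
print(contains_words("head article", "article head"))      # True
print(contains_words("particle head", "article head"))     # False
```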

*="<string>"

Matches subjects that contain the given string.

# match: href="http://youtube.com/",
# href="youtube", ..
<* href*="youtube" />

^="<string>"

Matches subjects that begin with the given string.

# match: <p>Hello, this is HAL</p>,
# <p>Hello</p>, ..
# does not match: <p>hello</p>,
# <p>Oh, Hello</p>, ..
<* @text^="Hello" />

$="<string>"

Matches subjects that end with the given string.

# match: href="igel.jpg", href="franz.jpg", ..
<* href$=".jpg" />

=="<string>"

Matches subjects that are equal to the given string.

# match: class="left aligned list"
# does not match anything else
<* class=="left aligned list" />

=~/<regex>/[opt]
=~"<regex>"[opt]

Matches subjects that match the given regular expression.
There are two options for regular expressions:
i: case insensitive and c: collate (locale aware character groups)

# match: class="menuItem-23",
# class="menuItem-42-23", ..
<* class=~/item-\d+/i />
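As the example suggests, the regex only needs to match somewhere inside the subject. A rough Python analogue uses re.search; note that Python's regex dialect may differ from the one Hext uses, and the c option is not modelled here. regex_matches is a hypothetical helper.

```python
import re


def regex_matches(subject, pattern, options=""):
    """Search `subject` for `pattern`; 'i' makes the search case insensitive."""
    flags = re.IGNORECASE if "i" in options else 0
    return re.search(pattern, subject, flags) is not None


print(regex_matches("menuItem-23", r"item-\d+", "i"))  # True
print(regex_matches("menuItem-23", r"item-\d+"))       # False
```

Without the i option the capital "I" in "menuItem" prevents the match.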

String Pipes

String pipes transform strings before they are captured. String pipes can be chained.

:trim
:trim("<characters>")

Trims characters from the beginning and the end of a string. Trims spaces by default. If given an argument, trims all given characters. Does not handle Unicode.

# trim all left and right spaces
<* title:trim:name />
# trim all left and right spaces and dashes
<* title:trim(" -"):name />

:collapsews

Trims whitespace from beginning and end and collapses multiple whitespace to a single space.

# Turns this:
# <a title="  Lots   of  spaces  in this   title  ">
# Into this:
# "Lots of spaces in this title"
<* title:collapsews:name />
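The same transformation is a common one-liner in Python, shown here as a sketch of the described behavior rather than of Hext's implementation (collapse_ws is an illustrative name):

```python
def collapse_ws(value):
    """Trim both ends and squeeze every run of whitespace into one space."""
    return " ".join(value.split())


print(repr(collapse_ws("  Lots   of  spaces in this title  ")))
# 'Lots of spaces in this title'
```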

:tolower
:toupper

Changes all characters to lower or upper case. Does not handle Unicode.

<a title:toupper:link_title />
<a title:tolower:link_title />

:prepend("<string>")
:append("<string>")

Prepends or appends a given string.

# turn relative URLs into absolute URLs
<a href:prepend("https://example.com/"):url />
# append foo
<a title:append(" ..and foo!"):title />

:filter(/<regex>/[opt])

Filters a string according to a given regex. A regex containing a capture group will produce only the matched content of that capture group, otherwise the whole regex match is returned. All capture groups after the first one will be ignored.
There are two options for regular expressions:
i: case insensitive and c: collate (locale aware character groups)

# save the trailing number contained in attr. href
<a href:filter(/\d+$/):user_id />
# save the number after "post-"
<a href:filter(/post-(\d+)/):post_id />
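The capture-group rule can be sketched in Python with re.search. This models only what is described above; the result for a subject that does not match at all is not stated, so returning an empty string is an assumption, and filter_pipe is a hypothetical name.

```python
import re


def filter_pipe(value, pattern, options=""):
    """Return the first capture group if the regex has one, else the whole match."""
    flags = re.IGNORECASE if "i" in options else 0
    match = re.search(pattern, value, flags)
    if match is None:
        return ""  # assumption: behavior on no match is not specified above
    if match.re.groups >= 1:
        # Only the first group counts; later groups are ignored.
        return match.group(1) or ""
    return match.group(0)


print(filter_pipe("/user/42", r"\d+$"))                   # 42
print(filter_pipe("/post-17/comments", r"post-(\d+)"))    # 17
```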

:replace(/<regex>/[opt], "<string>")

Replaces every portion matched by the given regex with a string. Backreferences such as $1 can be used to address capture groups.
There are two options for regular expressions:
i: case insensitive and c: collate (locale aware character groups)

# replace all instances of foo with bar
<p @text:replace(/foo/, "bar"):bar_text />
# remove all numbers
<p @text:replace(/\d+/, ""):nonum />
# use capture groups and backreferences
<h1 @text:replace(/(\d+): (.*)/, "$2 ($1)"):head />
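For comparison, the three examples above look like this with Python's re.sub. This is only an analogy: Python writes backreferences as \1 and \2 where the Hext examples use $1 and $2, and the regex dialects may differ in other details. replace_pipe is an illustrative name.

```python
import re


def replace_pipe(value, pattern, replacement):
    """Replace every match of `pattern` in `value` with `replacement`."""
    return re.sub(pattern, replacement, value)


print(replace_pipe("foo and foo", r"foo", "bar"))                 # bar and bar
print(repr(replace_pipe("Room 101", r"\d+", "")))                 # 'Room '
print(replace_pipe("61: Highway", r"(\d+): (.*)", r"\2 (\1)"))    # Highway (61)
```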

Built-in Functions

Built-in functions transform an element into a string.

@text

Returns the inner text. Trims left and right whitespace and collapses multiple whitespace to a single space. The content of some elements will be embedded in spaces (basically all non-inline elements, like <div> or <h1>).
The intent is to mimic jQuery's text() or the DOM properties innerText and textContent.
Does not strip metadata content.

# Turns this:
# <article>
# <h1> Which Highway?</h1>Highway 61!
# </article>
# Into this:
# "Which Highway? Highway 61!"
<article @text:content />
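The three steps described above (pad non-inline elements with spaces, concatenate the text, collapse whitespace) can be sketched with Python's html.parser. The set of block tags below is a small, incomplete assumption chosen for the demo, and TextCollector and inner_text are made-up names; Hext's actual rules may differ.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of non-inline elements for the demo.
BLOCK_TAGS = {"article", "div", "h1", "h2", "h3", "li", "p", "ul"}


class TextCollector(HTMLParser):
    """Gathers text, surrounding the content of block elements with spaces."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def _pad(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_starttag(self, tag, attrs):
        self._pad(tag)

    def handle_endtag(self, tag):
        self._pad(tag)

    def handle_data(self, data):
        self.parts.append(data)


def inner_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    # Trim both ends and collapse runs of whitespace.
    return " ".join("".join(parser.parts).split())


html = "<article>\n<h1> Which Highway?</h1>Highway 61!\n</article>"
print(inner_text(html))  # Which Highway? Highway 61!
```

The space between "Highway?" and "Highway" comes from the padding around <h1>, since the markup itself has none there.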

@inner-html

Serializes the inner HTML to a string.

# Turns this:
# <div><h1>Which Highway?</h1> Highway 61! </div>
# Into this:
# "<h1>Which Highway?</h1> Highway 61! "
# And the filter turns it into this:
# "Highway 61"
<div @inner-html:filter(/Highway \d+/):number />

@strip-tags

Returns the inner HTML with all tags removed.

# Turns this:
# <div><h1>Which Highway?</h1> Highway 61! </div>
# Into this:
# "Which Highway? Highway 61! "
<div @strip-tags:content />