libhext C++ library overview

Libhext is the C++ library at the core of Hext, a domain-specific language for extracting structured data from HTML documents. The language itself is explained in its documentation, and you can try Hext in your browser here.
libhext's code documentation is available here.

There are two ways to use libhext:

1. Using hext::ParseHext

If Hext offers all the features you require, this is the way to go. The following example constructs a simple rule that extracts links containing images from HTML documents.

#include <hext/ParseHext.h>
#include <hext/Html.h>
#include <iostream>

int main()
{
  // Build a rule that extracts links containing images
  auto rule = hext::ParseHext("<a href:link @text:title>"
                              " <img src:image />"
                              "</a>");

  // Some example HTML input
  auto html = hext::Html(
    "<a href='/coffee'>"
    " <span>#1</span>"
    " <span>Coffee</span>"
    " <img src='coffee.jpg' />"
    " <span>turns nights into code</span>"
    "</a>"
    "<a href='/beer'>"
    " <span>#2</span>"
    " <span>Beer</span>"
    " <img src='beer.jpg' />"
    " <span>improves dance skill by 70%</span>"
    "</a>");

  // Do the actual extraction. Rule::extract returns
  // a std::vector<std::multimap<std::string, std::string>>
  // where each multimap contains a complete rule match.
  auto results = rule.extract(html);

  // Print all key-value pairs from each rule match
  for(const auto& result : results)
  {
    for(const auto& pair : result)
      std::cout << pair.first << ": " << pair.second << "\n";
    std::cout << "\n";
  }

  // Output:
  // image: coffee.jpg
  // link: /coffee
  // title: #1 Coffee turns nights into code
  //
  // image: beer.jpg
  // link: /beer
  // title: #2 Beer improves dance skill by 70%
  return 0;
}
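
Since every match is a std::multimap, a name may appear more than once within a single match, and looking a name up yields a range rather than one value. The following is a minimal sketch of such a lookup that uses only the standard library. The alias MatchMap and the helper values_for are hypothetical names chosen for illustration; they are not part of libhext.

```cpp
#include <map>
#include <string>
#include <vector>

// Stand-in for one element of the vector returned by Rule::extract:
// a multimap of capture names to captured values.
using MatchMap = std::multimap<std::string, std::string>;

// Hypothetical helper (not part of libhext): collect every value that was
// captured under `name` within a single match.
std::vector<std::string> values_for(const MatchMap& match,
                                    const std::string& name)
{
  std::vector<std::string> values;
  // equal_range returns the half-open range of entries whose key is `name`.
  auto range = match.equal_range(name);
  for(auto it = range.first; it != range.second; ++it)
    values.push_back(it->second);
  return values;
}
```

Calling values_for(match, "image") on a match that holds two entries named "image" returns both values in insertion order; a name that was never captured yields an empty vector.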

2. Overview of libhext

Before diving into an example, let's introduce some of the components exposed by libhext's public API.
If you are in a rush, skip ahead to 4. Overview of libhext's extendable types at the bottom of this page.

  • hext::Rule: As seen above, Rules are at the center of it all. Rules contain all the information that is required to match and capture data. Rules are trees: a Rule may have children and siblings, which are themselves Rules. Each Rule holds vectors of hext::Match and hext::Capture objects, which are responsible for matching and capturing individual HTML elements.
    Rule::extract expects an object of type hext::Html and returns a hext::Result, which is a type alias for std::vector<std::multimap<std::string, std::string>>. Each multimap contains a complete Rule tree match, where each string pair contains the name and value returned by a Capture.
  • hext::Html: hext::Html's constructor expects a const char * containing UTF-8 encoded HTML. hext::Html does not copy the buffer, therefore the buffer must outlive the object.
  • hext::Match: This is the common base class for all matching mechanisms in Hext, with the exception of HTML tags, which are matched by the rule itself. You can create your own by inheriting Match and overriding Match::matches. Rules accept Matches via Rule::append_match.
    The following subclasses of Match are available out of the box: AttributeCountMatch, AttributeMatch, ChildCountMatch, FunctionMatch, FunctionValueMatch, NegateMatch, NthChildMatch and OnlyChildMatch.
  • hext::Capture: This is the common base class for all capture mechanisms in Hext. As with hext::Match you can create your own by inheriting Capture. Rules accept Captures via Rule::append_capture.
    Captures may extract one name-value pair. A capture's name does not have to be unique. There are two subclasses of hext::Capture available out of the box: AttributeCapture, which extracts content from an HTML element's attribute, and FunctionCapture, which captures the result of a function that accepts an HTML element as its argument.
  • hext::StringPipe: Both AttributeCapture and FunctionCapture may be given a StringPipe. StringPipes transform a string before it is captured. For example, there is hext::CollapseWsPipe which trims and collapses whitespace in a string, or hext::RegexPipe which filters a string according to a regular expression. StringPipes are linked lists and may therefore be chained.
    The following subclasses of StringPipe are available out of the box: AppendPipe, CasePipe, CollapseWsPipe, PrependPipe, RegexPipe, RegexReplacePipe and TrimPipe.
  • hext::ValueTest: ValueTests are an easy way to match HTML elements by the contents of an attribute or by the result of a built-in function. You can create your own by inheriting ValueTest and overriding ValueTest::test. ValueTests are passed to the constructor of FunctionValueMatch or AttributeMatch.
    The following subclasses of ValueTest are available out of the box: BeginsWithTest, ContainsTest, ContainsWordsTest, EndsWithTest, EqualsTest, NegateTest and RegexTest.
  • hext::AttributeMatch: An AttributeMatch is a Match that decides whether an HTML element matches by looking at one specific HTML attribute. The actual comparison is delegated to a hext::ValueTest. AttributeMatch cannot be inherited from; use composition instead.
  • hext::FunctionValueMatch: A FunctionValueMatch calls a function and passes the result to a ValueTest. The function is a hext::CaptureFunction, which is a typedef for std::function<std::string (const GumboNode *)>, that is, a function that accepts a GumboNode and returns a std::string. The GumboNode type comes from Gumbo, an HTML5 parsing library written in C99, which libhext exposes. Gumbo is an incredible piece of work and very well documented.
    Libhext comes with three built-in functions: TextBuiltin, InnerHtmlBuiltin and StripTagsBuiltin.
  • hext::Cloneable: Cloneable is a CRTP (curiously recurring template pattern) class template that provides a clone function which calls the copy constructor of its subclass.
  • hext::HtmlTag: An enum containing all valid HTML tags, plus HtmlTag::ANY, which matches any HTML tag (translates to <* /> in Hext).
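
To illustrate the idea of StringPipes as a linked list, here is a minimal sketch that uses only the standard library. It is not libhext's implementation: the names PipeSketch, TrimSketch and PrependSketch, as well as the member functions transform, pipe and append, are assumptions made for this example. Each pipe transforms a string and then passes the result on to the next pipe in the chain, if there is one.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a pipe base class; not libhext's actual interface.
class PipeSketch
{
public:
  virtual ~PipeSketch() = default;

  // Transform a string; each subclass implements exactly one step.
  virtual std::string transform(std::string str) const = 0;

  // Run this pipe, then hand the result to the next pipe in the list, if any.
  std::string pipe(std::string str) const
  {
    str = this->transform(std::move(str));
    return this->next_ ? this->next_->pipe(std::move(str)) : str;
  }

  // Append a pipe to the end of the linked list.
  void append(std::unique_ptr<PipeSketch> p)
  {
    if(this->next_)
      this->next_->append(std::move(p));
    else
      this->next_ = std::move(p);
  }

private:
  std::unique_ptr<PipeSketch> next_;
};

// Removes leading and trailing whitespace.
class TrimSketch : public PipeSketch
{
public:
  std::string transform(std::string str) const override
  {
    const char * ws = " \t\r\n";
    auto begin = str.find_first_not_of(ws);
    if(begin == std::string::npos)
      return "";
    auto end = str.find_last_not_of(ws);
    return str.substr(begin, end - begin + 1);
  }
};

// Puts a fixed prefix in front of the string.
class PrependSketch : public PipeSketch
{
public:
  explicit PrependSketch(std::string prefix)
  : prefix_(std::move(prefix))
  {}

  std::string transform(std::string str) const override
  {
    return this->prefix_ + str;
  }

private:
  std::string prefix_;
};
```

With a TrimSketch at the head of the chain and a PrependSketch("https://example.com") appended to it, piping "  /coffee \n" through the chain yields "https://example.com/coffee": the string is trimmed first and prefixed second.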

3. Building rules manually — An example

Hext offers ways to match HTML attributes, element traits and the result of built-in functions against a regex or a string literal. But let's say, for the sake of this example, that you have a huge database of filenames and only want to match <img> elements whose src attribute is one of these filenames.

#include <hext/Hext.h> // include all of hext
#include <iostream>
#include <memory>
#include <string>
#include <unordered_set>

// A ValueTest that succeeds if the tested value is a known filename
class DatabaseTest : public hext::Cloneable<DatabaseTest, hext::ValueTest>
{
public:
  DatabaseTest()
  : database_({"water.jpg", "beer.jpg", "wine.jpg", "milk.jpg"})
  {}

  bool test(const char * value) const override
  {
    return this->database_.count(std::string(value)) > 0;
  }

private:
  std::unordered_set<std::string> database_;
};

int main()
{
  // Construct a rule that matches <img> elements
  hext::Rule rule(hext::HtmlTag::IMG);
  rule.append_match<hext::AttributeMatch>(
    // that have an attribute called "src"
    "src",
    // whose value returns true for DatabaseTest::test.
    std::make_unique<DatabaseTest>());

  // If this rule matches, store the attribute "alt" as "title"
  rule.append_capture<hext::AttributeCapture>("alt", "title");

  // Example input. Imagine this is a huge document with many
  // elements in between.
  auto html = hext::Html(
    "<img src='beer.jpg' alt='Beer' />"
    "<img src='tea.jpg' alt='Tea' />"
    "<img src='pilsener.jpg' alt='Pilsener' />"
    "<img src='weizen.jpg' alt='Weizen' />"
    "<img src='wine.jpg' alt='Wine' />"
    "<img src='juice.jpg' alt='Juice' />"
    "<img src='milk.jpg' alt='Milk' />");

  auto results = rule.extract(html);

  // Print all key-value pairs from each rule match
  for(const auto& result : results)
    for(const auto& pair : result)
      std::cout << pair.first << ": " << pair.second << "\n";

  // Output:
  // title: Beer
  // title: Wine
  // title: Milk
  return 0;
}

4. Overview of libhext's extendable types

The following four abstract base classes can be extended to build your own matching and extraction mechanisms.

Type              Description                                               Gets received by
hext::Match       Decides whether a single HTML element matches.            Rule::append_match
hext::Capture     Extracts one key-value pair from a single HTML element.   Rule::append_capture
hext::ValueTest   Checks whether a string has a certain content.            Constructors of AttributeMatch and FunctionValueMatch
hext::StringPipe  A linked list that transforms strings.                    AttributeCapture and FunctionCapture

Note: When inheriting from these types, you need to override the pure virtual function clone. The easiest way to do this is to inherit from hext::Cloneable, a CRTP class template that provides a clone function which calls the copy constructor of its subclass. For example, to extend hext::Match, inherit from hext::Cloneable<YourType, hext::Match> instead of inheriting from hext::Match directly.
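
To make the CRTP mechanism more concrete, here is a minimal sketch that uses only the standard library. It mirrors the idea described in the note but is not libhext's implementation: BaseSketch, CloneableSketch and MyMatchSketch are hypothetical names, and the signature of clone shown here is an assumption.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical abstract base, standing in for a type such as hext::Match.
class BaseSketch
{
public:
  virtual ~BaseSketch() = default;

  // Pure virtual clone, as required of the extendable types.
  virtual std::unique_ptr<BaseSketch> clone() const = 0;

  virtual std::string name() const = 0;
};

// CRTP helper: implements clone() once by calling Derived's copy constructor.
template<typename Derived, typename Base>
class CloneableSketch : public Base
{
public:
  std::unique_ptr<Base> clone() const override
  {
    return std::make_unique<Derived>(static_cast<const Derived&>(*this));
  }
};

// A concrete type names itself and its base; clone() comes for free.
class MyMatchSketch : public CloneableSketch<MyMatchSketch, BaseSketch>
{
public:
  explicit MyMatchSketch(std::string label)
  : label_(std::move(label))
  {}

  std::string name() const override
  {
    return this->label_;
  }

private:
  std::string label_;
};
```

Calling clone() on a MyMatchSketch through a BaseSketch pointer returns a distinct object of the derived type with the same state, without MyMatchSketch having to write its own clone function.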