libhext C++ library overview
Libhext is the C++ library that is the building block of Hext.
Hext is a domain-specific language for extracting structured data from HTML
documents. The language itself is explained in its
documentation.
Also, you can try hext from the comfort of your browser here.
libhext's code documentation is available here.
There are two ways to use libhext:
- One is by using hext::ParseHext, which parses a string containing Hext and returns an object of type hext::Rule. Rules match and capture HTML elements.
- The other way is by constructing a hext::Rule manually. This allows building your own matching criteria (by extending hext::Match) and extraction mechanisms (by extending hext::Capture). Take a look at 2. Overview of libhext and 3. Building rules manually.
1. Using hext::ParseHext
If Hext offers all the features you require, this is the way to go. The following example constructs a simple rule that extracts links containing images from HTML documents.
```cpp
#include <hext/ParseHext.h>
#include <hext/Html.h>
#include <iostream>

int main()
{
  // Build a rule that extracts links containing images
  auto rule = hext::ParseHext("<a href:link @text:title>"
                              " <img src:image />"
                              "</a>");

  // Some example HTML input
  auto html = hext::Html(
    "<a href='/coffee'>"
    " <span>#1</span>"
    " <span>Coffee</span>"
    " <img src='coffee.jpg' />"
    " <span>turns nights into code</span>"
    "</a>"
    "<a href='/beer'>"
    " <span>#2</span>"
    " <span>Beer</span>"
    " <img src='beer.jpg' />"
    " <span>improves dance skill by 70%</span>"
    "</a>");

  // Do the actual extraction. Rule::extract returns
  // a std::vector<std::multimap<std::string, std::string>>
  // where each multimap contains a complete rule match.
  auto results = rule.extract(html);

  // Print all key-value pairs from each rule match
  for(const auto& result : results)
  {
    for(const auto& pair : result)
      std::cout << pair.first << ": " << pair.second << "\n";
    std::cout << "\n";
  }

  // Output:
  // image: coffee.jpg
  // link: /coffee
  // title: #1 Coffee turns nights into code
  //
  // image: beer.jpg
  // link: /beer
  // title: #2 Beer improves dance skill by 70%

  return 0;
}
```
2. Overview of libhext
Before diving into an example, let's introduce some components that
are exposed by libhext's public API.
If you are in a rush, take a look at
4. Overview of libhext's extendable types at the bottom of this page.
- hext::Rule: As seen above, Rules are at the center of it all. Rules contain all the information that is required to match and capture data. Rules are trees: A Rule may have children and siblings, which themselves are Rules. Each Rule contains vectors of hext::Match and hext::Capture objects, which are responsible for matching and capturing individual HTML elements.

  Rule::extract expects an object of type hext::Html and returns a hext::Result, which is a type alias for std::vector<std::multimap<std::string, std::string>>. Each multimap contains a complete Rule tree match, where each string pair contains the name and value returned by a Capture.

- hext::Html: hext::Html's constructor expects a const char * containing UTF-8 encoded HTML. hext::Html does not copy the buffer, therefore the buffer must outlive the object.

- hext::Match: This is the common base class for all matching mechanisms in Hext, with the exception of HTML tags, which are matched by the rule itself. You can create your own by inheriting Match and overriding Match::matches. Rules accept Matches via Rule::append_match.

  The following subclasses of Match are available out of the box: AttributeCountMatch, AttributeMatch, ChildCountMatch, FunctionMatch, FunctionValueMatch, NegateMatch, NthChildMatch and OnlyChildMatch.

- hext::Capture: This is the common base class for all capture mechanisms in Hext. As with hext::Match, you can create your own by inheriting Capture. Rules accept Captures via Rule::append_capture.

  Captures may extract one name-value pair. A capture's name does not have to be unique. There are two subclasses of hext::Capture available out of the box: AttributeCapture, which extracts content from an HTML element's attribute, and FunctionCapture, which captures the result of a function that accepts HTML elements as its argument.

- hext::StringPipe: Both AttributeCapture and FunctionCapture may be given a StringPipe. StringPipes transform a string before it is captured. For example, there is hext::CollapseWsPipe, which trims and collapses whitespace in a string, and hext::RegexPipe, which filters a string according to a regular expression. StringPipes are linked lists and may therefore be chained.

  The following subclasses of StringPipe are available out of the box: AppendPipe, CasePipe, CollapseWsPipe, PrependPipe, RegexPipe, RegexReplacePipe and TrimPipe.

- hext::ValueTest: ValueTests are an easy way to match HTML elements by the contents of an attribute or by the result of a built-in function. You can create your own by inheriting ValueTest and overriding ValueTest::test. ValueTests are passed to the constructor of FunctionValueMatch or AttributeMatch.

  The following subclasses of ValueTest are available out of the box: BeginsWithTest, ContainsTest, ContainsWordsTest, EndsWithTest, EqualsTest, NegateTest and RegexTest.

- hext::AttributeMatch: An AttributeMatch is a Match that decides whether an HTML element matches by looking at one specific HTML attribute. The actual comparison is delegated to a hext::ValueTest. Cannot be inherited (use composition).

- hext::FunctionValueMatch: A FunctionValueMatch calls a function and passes the result to a ValueTest. The type of function is a hext::CaptureFunction, which is a typedef for std::function<std::string (const GumboNode *)>, that is, a function that accepts a GumboNode and returns a std::string.

  Libhext exposes Gumbo, an HTML5 parsing library written in C99. This is where GumboNode comes from. Gumbo is an incredible piece of work and very well documented.

  Libhext comes with three built-in functions: TextBuiltin, InnerHtmlBuiltin and StripTagsBuiltin.

- hext::Cloneable: Cloneable is a CRTP class template that provides a clone function which calls the copy constructor of its subclass.

- hext::HtmlTag: An enum containing all valid HTML tags, plus HtmlTag::ANY, which matches any HTML tag (translates to <* /> in Hext).
3. Building rules manually — An example
Hext offers ways to match HTML attributes, element traits and the result of built-in functions against a regex or a string literal. But let's say, for the sake of this example, that you have a huge database of filenames, and you only want to match <img> elements whose src attribute contains one of these filenames.
```cpp
#include <hext/Hext.h> // include all of hext
#include <iostream>
#include <memory>
#include <string>
#include <unordered_set>

class DatabaseTest : public hext::Cloneable<DatabaseTest, hext::ValueTest>
{
public:
  DatabaseTest()
  : database_({"water.jpg", "beer.jpg", "wine.jpg", "milk.jpg"})
  {}

  // Returns true if value is one of the filenames in the database
  bool test(const char * value) const override
  {
    return this->database_.count(std::string(value)) > 0;
  }

private:
  std::unordered_set<std::string> database_;
};

int main()
{
  // Construct a rule that matches <img> elements
  hext::Rule rule(hext::HtmlTag::IMG);
  rule.append_match<hext::AttributeMatch>(
    // that have an attribute called "src"
    "src",
    // whose value returns true for DatabaseTest::test.
    std::make_unique<DatabaseTest>());

  // If this rule matches, store the attribute "alt" as "title"
  rule.append_capture<hext::AttributeCapture>("alt", "title");

  // Example input. Imagine this is a huge document with many
  // elements in between.
  auto html = hext::Html(
    "<img src='beer.jpg' alt='Beer' />"
    "<img src='tea.jpg' alt='Tea' />"
    "<img src='pilsener.jpg' alt='Pilsener' />"
    "<img src='weizen.jpg' alt='Weizen' />"
    "<img src='wine.jpg' alt='Wine' />"
    "<img src='juice.jpg' alt='Juice' />"
    "<img src='milk.jpg' alt='Milk' />");

  auto results = rule.extract(html);

  // Print all key-value pairs from each rule match
  for(const auto& result : results)
    for(const auto& pair : result)
      std::cout << pair.first << ": " << pair.second << "\n";

  // Output:
  // title: Beer
  // title: Wine
  // title: Milk

  return 0;
}
```
4. Overview of libhext's extendable types
The following four abstract base classes can be extended to build your own matching and extraction mechanisms.
Type | Gets received by |
---|---|
hext::Match: A Match decides whether a single HTML element matches. | Rule::append_match |
hext::Capture: A Capture extracts one key-value pair from a single HTML element. | Rule::append_capture |
hext::ValueTest: A ValueTest checks whether a string has certain content. | The constructors of AttributeMatch and FunctionValueMatch |
hext::StringPipe: A StringPipe is a linked list that transforms strings. | AttributeCapture and FunctionCapture |
Note: When inheriting these types you'll need to override the pure virtual function "clone". The easiest way to do this is by inheriting hext::Cloneable. Cloneable is a CRTP that provides a clone function which calls the copy constructor of its subclass. For example, to extend hext::Match you can inherit hext::Cloneable<YourType, hext::Match> instead of inheriting hext::Match directly.