libhext: C++ Library Documentation  0.8.2-e5d504d
hext::Rule Class Reference

Extracts values from HTML.

Public Member Functions

 Rule (HtmlTag tag=HtmlTag::ANY, bool optional=false, bool greedy=false) noexcept
 Constructs a Rule with a known HTML tag.

 Rule (std::string tag, bool optional=false, bool greedy=false) noexcept
 Constructs a Rule with the HTML tag given as a string.

 ~Rule () noexcept=default

 Rule (Rule &&) noexcept=default

 Rule (const Rule &other)

Rule & operator= (Rule &&) noexcept=default

Rule & operator= (const Rule &other)

const Rule * child () const noexcept
 Returns the first child or nullptr if childless.

const Rule * next () const noexcept
 Returns the next rule or nullptr if no following rule.

Rule * child () noexcept
 Returns the first child or nullptr if childless.

Rule * next () noexcept
 Returns the next rule or nullptr if no following rule.

Rule & append_child (Rule new_child)
 Appends a child.

Rule & append_next (Rule sibling)
 Appends a following Rule.

Rule & append_match (std::unique_ptr< Match > match)
 Appends a Match.

template<typename MatchType , typename... Args>
Rule & append_match (Args &&... arg)
 Emplaces a Match.

Rule & append_capture (std::unique_ptr< Capture > cap)
 Appends a Capture.

template<typename CaptureType , typename... Args>
Rule & append_capture (Args &&... arg)
 Emplaces a Capture.

HtmlTag get_tag () const noexcept
 Returns the HtmlTag this rule matches.

Rule & set_tag (HtmlTag tag) noexcept
 Sets the HtmlTag this rule matches.
 
bool is_optional () const noexcept
 Returns true if this rule is optional, i.e. if a match does not have to be found.

Rule & set_optional (bool optional) noexcept
 Sets whether this rule is optional, i.e. whether a match has to be found.
 
bool is_greedy () const noexcept
 Returns true if this rule is to be matched repeatedly.

Rule & set_greedy (bool greedy) noexcept
 Sets whether this rule is to be matched repeatedly.

std::optional< std::string > get_tagname () const
 Get custom HTML tag name.

Rule & set_tagname (const std::string &tagname)
 Set custom HTML tag name.

hext::Result extract (const Html &html) const
 Recursively extracts values from an hext::Html.

hext::Result extract (const GumboNode *node) const
 Recursively extracts values from a GumboNode.

bool matches (const GumboNode *node) const
 Returns true if this Rule matches node.

std::vector< ResultPair > capture (const GumboNode *node) const
 Returns the result of applying every Capture to node.
 

Detailed Description

Extracts values from HTML.

A Rule defines how to match and capture HTML nodes. It can be applied to a GumboNode tree, where it recursively tries to find matches.

Example:
// create a rule that matches anchor elements, ..
Rule anchor(HtmlTag::A);
// .. which must have an attribute called "href"
anchor.append_match<AttributeMatch>("href")
      // capture attribute href and save it as "link"
      .append_capture<AttributeCapture>("href", "link");
{
  // create a rule that matches image elements
  Rule img(HtmlTag::IMG);
  // capture attribute src and save it as "img"
  img.append_capture<AttributeCapture>("src", "img");
  // append the image-rule to the anchor-rule
  anchor.append_child(std::move(img));
}
// anchor is now equivalent to the following hext:
// <a href:link><img src:img/></a>
Html html(
  "<div><a href='/bob'>  <img src='bob.jpg'/>  </a></div>"
  "<div><a href='/alice'><img src='alice.jpg'/></a></div>"
  "<div><a href='/carol'><img src='carol.jpg'/></a></div>");
hext::Result result = anchor.extract(html);
// result will be equivalent to this:
// vector{
//   map{
//     {"link", "/bob"},
//     {"img", "bob.jpg"}
//   },
//   map{
//     {"link", "/alice"},
//     {"img", "alice.jpg"}
//   },
//   map{
//     {"link", "/carol"},
//     {"img", "carol.jpg"}
//   },
// }

Definition at line 88 of file Rule.h.
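
To consume a result shaped like the one in the example above, iterate over the vector and then over the name-value pairs of each map. The following is a minimal sketch with stand-in aliases: it assumes that hext::Result behaves like a std::vector of std::multimap<std::string, std::string>, and the helper name format_result is hypothetical and not part of libhext.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Stand-ins for hext::ResultMap and hext::Result
// (assumption: a vector of string-to-string multimaps).
using ResultMap = std::multimap<std::string, std::string>;
using Result = std::vector<ResultMap>;

// Hypothetical helper: renders one line per captured name-value pair,
// prefixed with the index of the map the pair belongs to.
std::string format_result(const Result& result)
{
  std::ostringstream out;
  for (std::size_t i = 0; i < result.size(); ++i)
    for (const auto& pair : result[i])
      out << i << ": " << pair.first << " = " << pair.second << '\n';
  return out.str();
}
```

Because the stand-in uses an ordered multimap, pairs within one map are emitted sorted by name rather than in capture order.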

Constructor & Destructor Documentation

◆ Rule() [1/4]

hext::Rule::Rule ( HtmlTag  tag = HtmlTag::ANY,
                   bool  optional = false,
                   bool  greedy = false
                 )
explicit noexcept

Constructs a Rule with a known HTML tag.

Parameters
tag       The HtmlTag that this rule matches. Default: Match any tag.
optional  A subtree matches only if all mandatory rules were matched. Optional rules on the other hand are ignored if not found. Default: Rule is mandatory.
greedy    Whether this rule should be repeated once a match is found. Default: Rule is matched once.
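
The interplay of optional and greedy can be illustrated with a toy model. The sketch below is not libhext's matching algorithm: it assumes a flat list of sibling tag names, and the names ToyRule and toy_matches are hypothetical. It only mirrors the parameter descriptions above: a mandatory rule must match at least once, an optional rule may match zero times, and a greedy rule keeps consuming matching siblings.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a rule: a tag name plus the two flags.
struct ToyRule {
  std::string tag;
  bool optional;
  bool greedy;
};

// Toy model (an assumption, not hext's implementation): rules are matched
// in order against consecutive sibling tags. Returns true if every
// mandatory rule found at least one match.
bool toy_matches(const std::vector<ToyRule>& rules,
                 const std::vector<std::string>& siblings)
{
  std::size_t pos = 0;
  for (const ToyRule& rule : rules) {
    std::size_t matched = 0;
    // A greedy rule repeats as long as the next sibling matches;
    // a non-greedy rule consumes at most one sibling.
    while (pos < siblings.size() && siblings[pos] == rule.tag
           && (rule.greedy || matched == 0)) {
      ++pos;
      ++matched;
    }
    // Only an unmatched mandatory rule makes the whole match fail.
    if (matched == 0 && !rule.optional)
      return false;
  }
  return true;
}
```

With the rules {"li", mandatory, greedy}, {"p", optional}, {"a", mandatory}, the sibling lists li li a and li p a are accepted in this model, while p a is rejected because the mandatory li rule finds nothing.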

◆ Rule() [2/4]

hext::Rule::Rule ( std::string  tag,
                   bool  optional = false,
                   bool  greedy = false
                 )
explicit noexcept

Constructs a Rule with the HTML tag given as a string.

Parameters
tag       The HTML tagname that this rule matches. Custom/unknown HTML tags are allowed. If the tagname is a standard-HTML tag, it is converted to an HtmlTag.
optional  A subtree matches only if all mandatory rules were matched. Optional rules on the other hand are ignored if not found. Default: Rule is mandatory.
greedy    Whether this rule should be repeated once a match is found. Default: Rule is matched once.
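
The conversion described for tag can be pictured as a lookup with a fallback. This is a minimal sketch with a hypothetical ToyTag enum and resolve_tag function, not libhext's code. It assumes, in line with the note on set_tagname(), that unrecognized names map to an UNKNOWN tag and keep their custom name for string comparison.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical, tiny stand-in for hext::HtmlTag.
enum class ToyTag { A, DIV, IMG, UNKNOWN };

// Outcome of resolving a tag name: the enum value and, for non-standard
// tags only, the custom name (left empty for standard tags in this sketch).
struct ResolvedTag {
  ToyTag tag;
  std::string custom_name;
};

// Maps a standard tag name to its enum value; anything else becomes
// ToyTag::UNKNOWN and retains the given name.
ResolvedTag resolve_tag(const std::string& name)
{
  static const std::unordered_map<std::string, ToyTag> known{
    {"a", ToyTag::A}, {"div", ToyTag::DIV}, {"img", ToyTag::IMG}};
  const auto it = known.find(name);
  if (it != known.end())
    return {it->second, ""};
  return {ToyTag::UNKNOWN, name};
}
```

In this sketch, resolve_tag("img") yields ToyTag::IMG with no custom name, whereas resolve_tag("my-widget") yields ToyTag::UNKNOWN and keeps "my-widget".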

◆ ~Rule()

hext::Rule::~Rule ( )
default noexcept

◆ Rule() [3/4]

hext::Rule::Rule ( Rule &&  )
default noexcept

◆ Rule() [4/4]

hext::Rule::Rule ( const Rule &  other )

Member Function Documentation

◆ append_capture() [1/2]

Rule& hext::Rule::append_capture ( std::unique_ptr< Capture >  cap )

Appends a Capture.

Parameters
cap  The Capture to append.
Returns
A reference for this Rule to enable method chaining.

◆ append_capture() [2/2]

template<typename CaptureType , typename... Args>
Rule& hext::Rule::append_capture ( Args &&...  arg)
inline

Emplaces a Capture.

Forwards arguments to std::make_unique.

Returns
A reference for this Rule to enable method chaining.

Definition at line 181 of file Rule.h.
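
The emplacing overload can be thought of as a thin wrapper around the std::unique_ptr overload. The following sketch shows that pattern with hypothetical stand-in types (ToyCapture, ToyAttributeCapture, SketchRule). It illustrates forwarding to std::make_unique and returning a reference for chaining; it is not a copy of Rule.h.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for hext::Capture and a concrete capture type.
struct ToyCapture {
  virtual ~ToyCapture() = default;
  virtual std::string name() const = 0;
};

struct ToyAttributeCapture : ToyCapture {
  ToyAttributeCapture(std::string attr, std::string result_name)
  : attr_(std::move(attr)), name_(std::move(result_name)) {}
  std::string name() const override { return name_; }
  std::string attr_;
  std::string name_;
};

class SketchRule {
public:
  // Takes ownership of an already constructed capture.
  SketchRule& append_capture(std::unique_ptr<ToyCapture> cap)
  {
    captures_.push_back(std::move(cap));
    return *this;  // returning *this is what enables method chaining
  }

  // Constructs a CaptureType from the forwarded arguments and hands it
  // to the overload above.
  template<typename CaptureType, typename... Args>
  SketchRule& append_capture(Args&&... arg)
  {
    return append_capture(
      std::make_unique<CaptureType>(std::forward<Args>(arg)...));
  }

  std::size_t capture_count() const { return captures_.size(); }
  const ToyCapture& capture_at(std::size_t i) const { return *captures_.at(i); }

private:
  std::vector<std::unique_ptr<ToyCapture>> captures_;
};
```

Both overloads can then be mixed in one chain, for example rule.append_capture<ToyAttributeCapture>("href", "link").append_capture(std::make_unique<ToyAttributeCapture>("src", "img")).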

◆ append_child()

Rule& hext::Rule::append_child ( Rule  new_child)

Appends a child.

Parameters
new_child  The Rule to append.
Returns
A reference for this Rule to enable method chaining.

◆ append_match() [1/2]

Rule& hext::Rule::append_match ( std::unique_ptr< Match >  match )

Appends a Match.

Parameters
match  The Match to append.
Returns
A reference for this Rule to enable method chaining.

◆ append_match() [2/2]

template<typename MatchType , typename... Args>
Rule& hext::Rule::append_match ( Args &&...  arg)
inline

Emplaces a Match.

Forwards arguments to std::make_unique.

Returns
A reference for this Rule to enable method chaining.

Definition at line 164 of file Rule.h.

◆ append_next()

Rule& hext::Rule::append_next ( Rule  sibling)

Appends a following Rule.

Parameters
sibling  The Rule to append.
Returns
A reference for this Rule to enable method chaining.

◆ capture()

std::vector<ResultPair> hext::Rule::capture ( const GumboNode *  node) const

Returns the result of applying every Capture to node.

Parameters
node  A GumboNode that is to be captured.

◆ child() [1/2]

const Rule* hext::Rule::child ( ) const
noexcept

Returns the first child or nullptr if childless.

◆ child() [2/2]

Rule* hext::Rule::child ( )
noexcept

Returns the first child or nullptr if childless.

◆ extract() [1/2]

hext::Result hext::Rule::extract ( const Html &  html ) const

Recursively extracts values from an hext::Html.

Returns
A vector containing maps filled with the captured name value pairs.

◆ extract() [2/2]

hext::Result hext::Rule::extract ( const GumboNode *  node) const

Recursively extracts values from a GumboNode.

Returns
A vector containing maps filled with the captured name value pairs.

◆ get_tag()

HtmlTag hext::Rule::get_tag ( ) const
noexcept

Returns the HtmlTag this rule matches.

◆ get_tagname()

std::optional<std::string> hext::Rule::get_tagname ( ) const

Get custom HTML tag name.

Returns
Empty optional if no custom HTML tag name.

◆ is_greedy()

bool hext::Rule::is_greedy ( ) const
noexcept

Returns true if this rule is to be matched repeatedly.

◆ is_optional()

bool hext::Rule::is_optional ( ) const
noexcept

Returns true if this rule is optional, i.e. if a match does not have to be found.

◆ matches()

bool hext::Rule::matches ( const GumboNode *  node) const

Returns true if this Rule matches node.

Parameters
node  A GumboNode that is to be matched.

◆ next() [1/2]

const Rule* hext::Rule::next ( ) const
noexcept

Returns the next rule or nullptr if no following rule.

◆ next() [2/2]

Rule* hext::Rule::next ( )
noexcept

Returns the next rule or nullptr if no following rule.
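
Taken together, child() and next() expose the rules as a first-child/next-sibling structure. The sketch below walks such a structure with a hypothetical stand-in type ToyNode that offers the same two accessors; the function count_rules is likewise hypothetical, and the append behaviour (always appending at the end of the chain) is an assumption made for the sketch.

```cpp
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical stand-in exposing accessors named like those of hext::Rule.
class ToyNode {
public:
  const ToyNode* child() const noexcept { return first_child_.get(); }
  const ToyNode* next() const noexcept { return next_.get(); }

  // Appends at the end of the child chain (assumed behaviour).
  ToyNode& append_child(ToyNode node)
  {
    if (first_child_)
      first_child_->append_next(std::move(node));
    else
      first_child_ = std::make_unique<ToyNode>(std::move(node));
    return *this;
  }

  // Appends at the end of the sibling chain (assumed behaviour).
  ToyNode& append_next(ToyNode node)
  {
    ToyNode* last = this;
    while (last->next_)
      last = last->next_.get();
    last->next_ = std::make_unique<ToyNode>(std::move(node));
    return *this;
  }

private:
  std::unique_ptr<ToyNode> first_child_;
  std::unique_ptr<ToyNode> next_;
};

// Counts a node, all of its following siblings and all their descendants.
// A nullptr argument, as returned for a missing child or sibling, counts 0.
std::size_t count_rules(const ToyNode* node)
{
  std::size_t count = 0;
  for (; node != nullptr; node = node->next())
    count += 1 + count_rules(node->child());
  return count;
}
```

The nullptr return value doubles as the loop and recursion terminator, so no separate "has child" or "has next" query is needed.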

◆ operator=() [1/2]

Rule& hext::Rule::operator= ( Rule &&  )
default noexcept

◆ operator=() [2/2]

Rule& hext::Rule::operator= ( const Rule &  other )

◆ set_greedy()

Rule& hext::Rule::set_greedy ( bool  greedy)
noexcept

Sets whether this rule is to be matched repeatedly.

Returns
A reference for this Rule to enable method chaining.

◆ set_optional()

Rule& hext::Rule::set_optional ( bool  optional)
noexcept

Sets whether this rule is optional, i.e. whether a match has to be found.

Returns
A reference for this Rule to enable method chaining.

◆ set_tag()

Rule& hext::Rule::set_tag ( HtmlTag  tag)
noexcept

Sets the HtmlTag this rule matches.

Returns
A reference for this Rule to enable method chaining.

◆ set_tagname()

Rule& hext::Rule::set_tagname ( const std::string &  tagname)

Set custom HTML tag name.

Note
The HTML tag name is only matched if this Rule's HtmlTag equals HtmlTag::UNKNOWN.
Returns
A reference for this Rule to enable method chaining.

The documentation for this class was generated from the following file: