libhext: C++ Library Documentation  1.0.8-3ad0ae4
hext::Rule Class Reference

Extracts values from HTML.

Public Member Functions

 Rule (HtmlTag tag=HtmlTag::ANY, bool optional=false, bool greedy=false) noexcept
 Constructs a Rule with a known HTML tag.
 
 Rule (std::string tag, bool optional=false, bool greedy=false) noexcept
 Constructs a Rule with the HTML tag given as a string.
 
 ~Rule () noexcept=default
 
 Rule (Rule &&) noexcept=default
 
 Rule (const Rule &other)
 
Rule & operator= (Rule &&) noexcept=default
 
Rule & operator= (const Rule &other)
 
const Rule * child () const noexcept
 Returns the child or nullptr if childless.
 
const Rule * next () const noexcept
 Returns the next rule or nullptr if no following rule.
 
const std::vector< Rule > & nested () const noexcept
 Returns the nested rules.
 
Rule * child () noexcept
 Returns the child or nullptr if childless.
 
Rule * next () noexcept
 Returns the next rule or nullptr if no following rule.
 
std::vector< Rule > & nested () noexcept
 Returns the nested rules.
 
Rule & append_child (Rule new_child)
 Appends a child.
 
Rule & append_next (Rule sibling)
 Appends a following Rule.
 
Rule & append_nested (Rule nested)
 Appends a nested Rule.
 
Rule & append_match (std::unique_ptr< Match > match)
 Appends a Match.
 
template<typename MatchType , typename... Args>
Rule & append_match (Args &&... arg)
 Emplaces a Match.
 
Rule & append_capture (std::unique_ptr< Capture > cap)
 Appends a Capture.
 
template<typename CaptureType , typename... Args>
Rule & append_capture (Args &&... arg)
 Emplaces a Capture.
 
HtmlTag get_tag () const noexcept
 Returns the HtmlTag this rule matches.
 
Rule & set_tag (HtmlTag tag) noexcept
 Sets the HtmlTag this rule matches.
 
bool is_optional () const noexcept
 Returns true if this rule is optional, i.e. if a match does not have to be found.
 
Rule & set_optional (bool optional) noexcept
 Sets whether this rule is optional, i.e. whether it is ignored when no match is found.
 
bool is_greedy () const noexcept
 Returns true if this rule is to be matched repeatedly.
 
Rule & set_greedy (bool greedy) noexcept
 Sets whether this rule is to be matched repeatedly.
 
std::optional< std::string > get_tagname () const
 Get custom HTML tag name.
 
Rule & set_tagname (const std::string &tagname)
 Set custom HTML tag name.
 
hext::Result extract (const Html &html, std::uint64_t max_searches=0) const
 Recursively extracts values from an hext::Html.
 
hext::Result extract (const GumboNode *node, std::uint64_t max_searches=0) const
 Recursively extracts values from a GumboNode.
 
bool matches (const GumboNode *node) const
 Returns true if this Rule matches node.
 
std::vector< ResultPair > capture (const GumboNode *node) const
 Returns the result of applying every Capture to node.
 

Detailed Description

Extracts values from HTML.

A Rule defines how to match and capture HTML nodes. It can be applied to a GumboNode tree, where it recursively tries to find matches.

Example:
// create a rule that matches anchor elements, ..
Rule anchor(HtmlTag::A);
// .. which must have an attribute called "href"
anchor.append_match<AttributeMatch>("href")
// capture attribute href and save it as "link"
.append_capture<AttributeCapture>("href", "link");
{
// create a rule that matches image elements
Rule img(HtmlTag::IMG);
// capture attribute src and save it as "img"
img.append_capture<AttributeCapture>("src", "img");
// append the image-rule to the anchor-rule
anchor.append_child(std::move(img));
}
// anchor is now equivalent to the following hext:
// <a href:link><img src:img/></a>
Html html(
"<div><a href='/bob'> <img src='bob.jpg'/> </a></div>"
"<div><a href='/alice'><img src='alice.jpg'/></a></div>"
"<div><a href='/carol'><img src='carol.jpg'/></a></div>");
hext::Result result = anchor.extract(html);
// result will be equivalent to this:
// vector{
// map{
// {"link", "/bob"}
// {"img", "bob.jpg"}
// },
// map{
// {"link", "/alice"}
// {"img", "alice.jpg"}
// },
// map{
// {"link", "/carol"}
// {"img", "carol.jpg"}
// },
// }

Definition at line 89 of file Rule.h.
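
The result shown in the example above is a vector of maps, so consuming it comes down to iterating over the groups and looking up names. The following is a minimal sketch of that step. It does not include any libhext header: `ResultMap` and `Result` are stand-in aliases (modelling each group as a `std::multimap` is an assumption), and `values_for` and `sample_result` are hypothetical helpers introduced only for illustration.

```cpp
#include <map>
#include <string>
#include <vector>

// Stand-in aliases (assumption): the real types are declared in libhext's
// Result.h. Here every result group is modelled as a plain multimap.
using ResultMap = std::multimap<std::string, std::string>;
using Result = std::vector<ResultMap>;

// Hypothetical helper: collects every value stored under `name`,
// walking the groups in order.
std::vector<std::string> values_for(const Result& result,
                                    const std::string& name)
{
  std::vector<std::string> values;
  for (const ResultMap& group : result) {
    auto range = group.equal_range(name);
    for (auto it = range.first; it != range.second; ++it)
      values.push_back(it->second);
  }
  return values;
}

// Hypothetical helper: builds by hand the result listed in the example.
Result sample_result()
{
  return Result{
    ResultMap{{"link", "/bob"}, {"img", "bob.jpg"}},
    ResultMap{{"link", "/alice"}, {"img", "alice.jpg"}},
    ResultMap{{"link", "/carol"}, {"img", "carol.jpg"}},
  };
}
```

With these helpers, `values_for(sample_result(), "link")` yields the three link values in document order, and a name that was never captured yields an empty vector.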

Constructor & Destructor Documentation

◆ Rule() [1/4]

hext::Rule::Rule ( HtmlTag  tag = HtmlTag::ANY,
bool  optional = false,
bool  greedy = false 
)
explicit noexcept

Constructs a Rule with a known HTML tag.

Parameters
tag: The HtmlTag that this rule matches. Default: Match any tag.
optional: A subtree matches only if all mandatory rules were matched. Optional rules, on the other hand, are ignored if not found. Default: Rule is mandatory.
greedy: Whether this rule should be repeated once a match is found. Default: Rule is matched once.
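
To make the effect of the `optional` and `greedy` flags concrete, here is a deliberately simplified model. It matches a list of rules against a flat sequence of sibling tag names, with no gaps and no backtracking. `ToySeqRule` and `matches_sequence` are hypothetical names, and the algorithm is an assumption chosen for illustration, not libhext's matching algorithm.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a rule: a tag name plus the two flags.
struct ToySeqRule {
  std::string tag;
  bool optional = false;
  bool greedy = false;
};

// Simplified model: rules must match consecutive tags, front to back.
// Returns true if every mandatory rule found a match.
bool matches_sequence(const std::vector<ToySeqRule>& rules,
                      const std::vector<std::string>& tags)
{
  std::size_t pos = 0;
  for (const ToySeqRule& rule : rules) {
    bool matched = pos < tags.size() && tags[pos] == rule.tag;
    if (!matched) {
      if (rule.optional)
        continue;      // optional rules are ignored if not found
      return false;    // a mandatory rule was not matched
    }
    ++pos;
    // a greedy rule is repeated for as long as it keeps matching
    if (rule.greedy)
      while (pos < tags.size() && tags[pos] == rule.tag)
        ++pos;
  }
  return true;
}
```

In this model the rules `a`, optional `b`, greedy `c` accept the sequence `a c c`: `b` is skipped and `c` consumes both trailing elements. The same rules reject `a b` because the mandatory `c` is never found.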

◆ Rule() [2/4]

hext::Rule::Rule ( std::string  tag,
bool  optional = false,
bool  greedy = false 
)
explicit noexcept

Constructs a Rule with the HTML tag given as a string.

Parameters
tag: The HTML tag name that this rule matches. Custom/unknown HTML tags are allowed. If the tag name is a standard HTML tag, it is converted to an HtmlTag.
optional: A subtree matches only if all mandatory rules were matched. Optional rules, on the other hand, are ignored if not found. Default: Rule is mandatory.
greedy: Whether this rule should be repeated once a match is found. Default: Rule is matched once.

◆ ~Rule()

hext::Rule::~Rule ( )
default noexcept

◆ Rule() [3/4]

hext::Rule::Rule ( Rule &&  )
default noexcept

◆ Rule() [4/4]

hext::Rule::Rule ( const Rule &  other)

Member Function Documentation

◆ append_capture() [1/2]

template<typename CaptureType , typename... Args>
Rule& hext::Rule::append_capture ( Args &&...  arg)
inline

Emplaces a Capture.

Forwards arguments to std::make_unique.

Returns
A reference for this Rule to enable method chaining.

Definition at line 194 of file Rule.h.
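
The combination described above (forwarding the arguments to `std::make_unique` and returning a reference to the rule) can be sketched with stand-in types. `ToyCapture`, `ToyAttributeCapture` and `ToyRule` are hypothetical and only mimic the shape of the two `append_capture` overloads. They are not the library's classes.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a capture interface.
struct ToyCapture {
  virtual ~ToyCapture() = default;
  virtual std::string describe() const = 0;
};

// Hypothetical capture that remembers an attribute and a result name.
struct ToyAttributeCapture : ToyCapture {
  ToyAttributeCapture(std::string attr, std::string name)
  : attr_(std::move(attr)), name_(std::move(name)) {}
  std::string describe() const override { return attr_ + ":" + name_; }
  std::string attr_;
  std::string name_;
};

class ToyRule {
public:
  // Takes ownership of an already constructed capture.
  ToyRule& append_capture(std::unique_ptr<ToyCapture> cap)
  {
    assert(cap != nullptr);
    captures_.push_back(std::move(cap));
    return *this;  // returning *this is what enables method chaining
  }

  // Constructs a CaptureType from the forwarded arguments and appends it.
  template<typename CaptureType, typename... Args>
  ToyRule& append_capture(Args&&... arg)
  {
    return append_capture(
      std::make_unique<CaptureType>(std::forward<Args>(arg)...));
  }

  // Lists the stored captures in insertion order.
  std::vector<std::string> descriptions() const
  {
    std::vector<std::string> out;
    for (const auto& cap : captures_)
      out.push_back(cap->describe());
    return out;
  }

private:
  std::vector<std::unique_ptr<ToyCapture>> captures_;
};
```

Because each call returns the rule itself, `rule.append_capture<ToyAttributeCapture>("href", "link").append_capture<ToyAttributeCapture>("src", "img")` appends both captures in one expression.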

◆ append_capture() [2/2]

Rule& hext::Rule::append_capture ( std::unique_ptr< Capture >  cap)

Appends a Capture.

Parameters
cap: The Capture to append.
Returns
A reference for this Rule to enable method chaining.

◆ append_child()

Rule& hext::Rule::append_child ( Rule  new_child)

Appends a child.

Parameters
new_child: The Rule to append.
Returns
A reference for this Rule to enable method chaining.

◆ append_match() [1/2]

template<typename MatchType , typename... Args>
Rule& hext::Rule::append_match ( Args &&...  arg)
inline

Emplaces a Match.

Forwards arguments to std::make_unique.

Returns
A reference for this Rule to enable method chaining.

Definition at line 177 of file Rule.h.

◆ append_match() [2/2]

Rule& hext::Rule::append_match ( std::unique_ptr< Match >  match)

Appends a Match.

Parameters
match: The Match to append.
Returns
A reference for this Rule to enable method chaining.

◆ append_nested()

Rule& hext::Rule::append_nested ( Rule  nested)

Appends a nested Rule.

Parameters
nested: The Rule to append.
Returns
A reference for this Rule to enable method chaining.

◆ append_next()

Rule& hext::Rule::append_next ( Rule  sibling)

Appends a following Rule.

Parameters
sibling: The Rule to append.
Returns
A reference for this Rule to enable method chaining.

◆ capture()

std::vector<ResultPair> hext::Rule::capture ( const GumboNode *  node) const

Returns the result of applying every Capture to node.

Parameters
node: A GumboNode that is to be captured.

◆ child() [1/2]

const Rule* hext::Rule::child ( ) const
noexcept

Returns the child or nullptr if childless.

◆ child() [2/2]

Rule* hext::Rule::child ( )
noexcept

Returns the child or nullptr if childless.

◆ extract() [1/2]

hext::Result hext::Rule::extract ( const GumboNode *  node,
std::uint64_t  max_searches = 0 
) const

Recursively extracts values from a GumboNode.

Parameters
max_searches: Abort extraction by throwing a MaxSearchError after performing this many searches in the given GumboNode.
Returns
A vector containing maps filled with the captured name value pairs.
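
The idea of aborting a recursive traversal once a search budget is used up can be sketched as follows. `ToyNode`, `ToyMaxSearchError`, `count_tag` and `count_tag_impl` are hypothetical stand-ins. Two points are assumptions rather than documented behavior: that a limit of 0 disables the check, and that each visited node counts as one search.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed HTML node.
struct ToyNode {
  std::string tag;
  std::vector<ToyNode> children;
};

// Hypothetical stand-in for the error thrown when the budget is used up.
struct ToyMaxSearchError : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Recursive worker: `searches` counts the nodes visited so far.
std::uint64_t count_tag_impl(const ToyNode& node, const std::string& tag,
                             std::uint64_t max_searches,
                             std::uint64_t& searches)
{
  // Assumption: a limit of 0 means "no limit".
  if (max_searches != 0 && searches >= max_searches)
    throw ToyMaxSearchError("search limit reached");
  ++searches;

  std::uint64_t found = (node.tag == tag) ? 1 : 0;
  for (const ToyNode& child : node.children)
    found += count_tag_impl(child, tag, max_searches, searches);
  return found;
}

// Counts the nodes named `tag`, visiting at most `max_searches` nodes.
std::uint64_t count_tag(const ToyNode& root, const std::string& tag,
                        std::uint64_t max_searches = 0)
{
  std::uint64_t searches = 0;
  return count_tag_impl(root, tag, max_searches, searches);
}
```

A caller that processes untrusted input would typically wrap the call in a `try` block and treat the error as "document too expensive to search".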

◆ extract() [2/2]

hext::Result hext::Rule::extract ( const Html &  html,
std::uint64_t  max_searches = 0 
) const

Recursively extracts values from an hext::Html.

Parameters
max_searches: Abort extraction by throwing a MaxSearchError after performing this many searches in the given Html.
Returns
A vector containing maps filled with the captured name-value pairs.

◆ get_tag()

HtmlTag hext::Rule::get_tag ( ) const
noexcept

Returns the HtmlTag this rule matches.

◆ get_tagname()

std::optional<std::string> hext::Rule::get_tagname ( ) const

Get custom HTML tag name.

Returns
Empty optional if no custom HTML tag name.

◆ is_greedy()

bool hext::Rule::is_greedy ( ) const
noexcept

Returns true if this rule is to be matched repeatedly.

◆ is_optional()

bool hext::Rule::is_optional ( ) const
noexcept

Returns true if this rule is optional, i.e. if a match does not have to be found.

◆ matches()

bool hext::Rule::matches ( const GumboNode *  node) const

Returns true if this Rule matches node.

Parameters
node: A GumboNode that is to be matched.

◆ nested() [1/2]

const std::vector<Rule>& hext::Rule::nested ( ) const
noexcept

Returns the nested rules.

◆ nested() [2/2]

std::vector<Rule>& hext::Rule::nested ( )
noexcept

Returns the nested rules.

◆ next() [1/2]

const Rule* hext::Rule::next ( ) const
noexcept

Returns the next rule or nullptr if no following rule.

◆ next() [2/2]

Rule* hext::Rule::next ( )
noexcept

Returns the next rule or nullptr if no following rule.

◆ operator=() [1/2]

Rule& hext::Rule::operator= ( const Rule &  other)

◆ operator=() [2/2]

Rule& hext::Rule::operator= ( Rule &&  )
default noexcept

◆ set_greedy()

Rule& hext::Rule::set_greedy ( bool  greedy)
noexcept

Sets whether this rule is to be matched repeatedly.

Returns
A reference for this Rule to enable method chaining.

◆ set_optional()

Rule& hext::Rule::set_optional ( bool  optional)
noexcept

Sets whether this rule is optional, i.e. whether it is ignored when no match is found.

Returns
A reference for this Rule to enable method chaining.

◆ set_tag()

Rule& hext::Rule::set_tag ( HtmlTag  tag)
noexcept

Sets the HtmlTag this rule matches.

Returns
A reference for this Rule to enable method chaining.

◆ set_tagname()

Rule& hext::Rule::set_tagname ( const std::string &  tagname)

Set custom HTML tag name.

Note
The HTML tag name is only matched if this Rule's HtmlTag equals HtmlTag::UNKNOWN.
Returns
A reference for this Rule to enable method chaining.
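
The note above says that the custom tag name only takes part in matching when the tag is UNKNOWN. A minimal model of that precedence is sketched here. `ToyTag`, `ToyTagRule` and `tag_matches` are hypothetical, and the handling of ANY and of an UNKNOWN rule without a name are assumptions for illustration, not confirmed library behavior.

```cpp
#include <optional>
#include <string>

// Hypothetical, heavily reduced tag enumeration.
enum class ToyTag { ANY, A, IMG, UNKNOWN };

// Hypothetical rule carrying a tag and an optional custom tag name.
struct ToyTagRule {
  ToyTag tag = ToyTag::ANY;
  std::optional<std::string> tagname;
};

// Decides whether a node with the given tag and name satisfies the rule.
bool tag_matches(const ToyTagRule& rule, ToyTag node_tag,
                 const std::string& node_name)
{
  if (rule.tag == ToyTag::ANY)
    return true;             // assumption: ANY accepts every node
  if (rule.tag != node_tag)
    return false;
  // The custom name is consulted only when the rule's tag is UNKNOWN.
  if (rule.tag == ToyTag::UNKNOWN && rule.tagname)
    return *rule.tagname == node_name;
  return true;
}
```

In this model a rule with tag `A` ignores whatever custom name it carries, while a rule with tag `UNKNOWN` and the name `my-widget` accepts only unknown elements with exactly that name.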

The documentation for this class was generated from the following file: