Aniket Salunkhe Learning: Regular Expression Parsing

Regular Expression Parsing

The GNU C Library provides pattern matching facilities for two kinds of patterns: regular expressions and file-name wildcards.

Wildcard Matching: Matching a wildcard pattern against a single string.
Globbing: Finding the files that match a wildcard pattern.
Regular Expressions: Matching regular expressions against strings.
Word Expansion: Expanding shell variables, nested commands, arithmetic, and wildcards. This is what the shell does with shell commands.

A few details of regular expression parsing are given below.

Regular Expression parsing

Regular Expression parsing in C

The GNU C Library supports two interfaces for matching regular expressions:

the standard POSIX.2 interface, and
the older interface that the GNU C Library has had for many years.

Both interfaces are declared in the header file regex.h. If you define _POSIX_C_SOURCE, then only the POSIX.2 functions, structures, and constants are declared.

Most regular expression work is done through a small set of functions.

On POSIX systems, these functions (regcomp(), regexec(), regerror(), and regfree()) are part of the standard C library.

Perl-compatible regular expressions (PCRE) are available via libpcre (see man pcreapi for details of its functions).

GLib provides a convenient wrapper around libpcre that adds a good deal of extra functionality.

The POSIX standard specifies a set of C functions to parse the regular expressions whose grammar it defines. Those functions are used by many POSIX-standard tools such as sed, awk, and grep, and you can also call them directly to deal with small text-parsing details in your own code.

NOTE: If you only want to break a string into the tokens before and after a single-character delimiter, strtok will do the job.
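
As a minimal sketch of the strtok approach mentioned in the note, the helper below splits a writable buffer on a delimiter. The name split_fields and its parameters are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split `line` in place on any character in `delim`
 * and store up to `max` token pointers in `out`.
 * Returns the number of tokens stored.
 * strtok writes '\0' bytes into `line`, so it must be a writable buffer. */
int split_fields(char *line, const char *delim, char *out[], int max)
{
    int n = 0;

    assert(line != NULL && delim != NULL && out != NULL);
    for (char *tok = strtok(line, delim); tok != NULL && n < max;
         tok = strtok(NULL, delim))
        out[n++] = tok;
    return n;
}
```

Calling split_fields on a buffer holding "key=value" with the delimiter "=" yields the two tokens key and value. Note that strtok skips empty fields between consecutive delimiters and keeps hidden state between calls, so strtok_r is the safer choice in threaded code.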

NOTE: For a regular expression tutorial, see man 7 regex and man perlre.

According to older documentation (http://man7.org/linux/man-pages/man3/pcreposix.3.html), the POSIX-style functions in the pcreposix library are just wrappers that ultimately call the native PCRE API.

The POSIX and PCRE interfaces share a common four-step procedure, with a few optional extras:

1. Include regex.h (POSIX) or pcre.h (PCRE); the pcreposix library uses pcreposix.h.
2. regcomp() or pcre_compile(): compile the regex. See "POSIX Regexp Compilation" in the GNU C Library manual.
   [Optional, PCRE only] pcre_study(): optimize the compiled regex.
3. regexec() or pcre_exec(): run a string through the compiled regex to look for a match and report the result. See "Matching POSIX Regexps".
   [Optional] regmatch_t[] or pcre_get_substring(): if subexpressions in the regex are marked for extraction, copy them out of the subject string using the offsets returned by regexec() or pcre_exec(). See "Subexpression Complications".
   [Optional] pcre_free_substring(): free the extracted substring.
4. regfree() or pcre_free(): free the internal memory used by the compiled regex. See "Regexp Cleanup".
   [Optional, PCRE only] pcre_free_study(): free the extra data returned by pcre_study() (which may be NULL).
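
The POSIX half of the procedure above can be sketched as follows. This is only an illustration under stated assumptions: the helper name first_group, its return codes, and the sample pattern in the usage note are hypothetical, while regcomp(), regexec(), regerror(), regfree(), and regmatch_t come from regex.h.

```c
#include <assert.h>
#include <regex.h>      /* step 1: include the header */
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: match `text` against the POSIX extended regex
 * `pattern` and copy the first parenthesized subexpression into `out`.
 * Returns 0 on a match, 1 on no match, -1 if the pattern does not compile. */
int first_group(const char *pattern, const char *text, char *out, size_t outsz)
{
    regex_t re;
    regmatch_t m[2];    /* m[0] = whole match, m[1] = first group */

    assert(outsz > 0);

    /* step 2: compile the regex */
    int rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0) {
        char msg[128];
        regerror(rc, &re, msg, sizeof msg);
        fprintf(stderr, "regcomp failed: %s\n", msg);
        return -1;
    }

    /* step 3: run the string through the compiled regex */
    rc = regexec(&re, text, 2, m, 0);
    if (rc == 0) {
        /* optional step: copy the group using the returned offsets;
         * rm_so is -1 when the group did not take part in the match */
        size_t len = 0;
        if (m[1].rm_so != -1)
            len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len >= outsz)
            len = outsz - 1;
        memcpy(out, text + m[1].rm_so * (len > 0), len);
        out[len] = '\0';
    }

    /* step 4: free the compiled regex */
    regfree(&re);
    return rc == 0 ? 0 : 1;
}
```

For example, first_group("id=([0-9]+)", "user id=4217 ok", buf, sizeof buf) returns 0 and leaves "4217" in buf. The offsets in regmatch_t are relative to the start of the string passed to regexec(), which is why the copy starts at text + m[1].rm_so.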

Regular Expressions

A regular expression defines a search pattern for strings. The abbreviation for regular expression is regex. The search pattern can be:

a simple character,
a fixed string, or
a complex expression containing special characters that describe the pattern.

The pattern defined by the regex may match one or several times or not at all for a given string.

Analyzing or modifying text with a regex is described as applying the regular expression to the text (or string).

The pattern defined by the regex is applied to the text from left to right. Once a source character has been used in a match, it cannot be reused. For example, the regex aba matches ababababa only twice (aba_aba__).
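
The left-to-right, no-reuse behaviour can be checked with a small loop around regexec(). The sketch below is one possible way to count non-overlapping matches; count_matches is a hypothetical helper, not a library function.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: count non-overlapping matches of the POSIX extended
 * regex `pattern` in `text`, scanning left to right.
 * Returns -1 if the pattern does not compile. */
int count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    int count = 0;
    int eflags = 0;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    const char *p = text;
    while (*p != '\0' && regexec(&re, p, 1, &m, eflags) == 0) {
        count++;
        if (m.rm_eo > m.rm_so)
            p += m.rm_eo;            /* resume after the consumed characters */
        else if (p[m.rm_so] == '\0')
            break;                   /* empty match at the end of the text */
        else
            p += m.rm_so + 1;        /* empty match: step one character on */
        eflags = REG_NOTBOL;         /* later calls no longer start a line */
    }
    regfree(&re);
    return count;
}
```

With the example from the text, count_matches("aba", "ababababa") returns 2: the search resumes after each match, so the characters already consumed are not considered again.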

Following are different regex elements.

Basic symbols

Regular Expression		Description
.		Matches any single character, except newline
X		a character X
XZ		a string XZ. Finds X directly followed by Z
X|Z	|	Finds X or Z
[abc]	[ ]	Set definition, can match the letter a or b or c.
[abc][vz]		Set definition, can match a or b or c followed by either v or z.
[^abc]	[^	This pattern matches any character except a or b or c. When a caret appears as the first character inside square brackets, it negates the pattern.
[a-d1-7]	[ - ]	Ranges: matches one letter between a and d or one digit from 1 to 7, but not the two-character string d1.
\		Escape character: removes the special meaning of the character that follows it
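
The basic symbols can be tried out from C with a small wrapper around the POSIX functions. This is a sketch only: ere_matches is a hypothetical helper that compiles the pattern as a POSIX extended regular expression, in which | is the alternation operator.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if the POSIX extended regex `pattern` matches
 * anywhere in `text`; false on no match or if the pattern does not compile. */
bool ere_matches(const char *pattern, const char *text)
{
    regex_t re;

    assert(pattern != NULL && text != NULL);
    /* REG_NOSUB: only a yes/no answer is needed, no match offsets */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;
    bool found = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}
```

For instance, ere_matches("[abc][vz]", "xbz") is true because b is followed by z, while ere_matches("^[a-d1-7]$", "d1") is false because a bracket expression matches exactly one character.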

Assertions

Regular Expression		Description
^regex	^	Start of string: regex must match at the beginning of the line
\A		Start of string, ignores the m flag
regex$	$	End of string: regex must match at the end of the line
\Z		End of string, ignores the m flag
$		Checks whether a line end follows
\b		Matches a word boundary where a word character is [a-zA-Z0-9_]
\B		Non-word boundary
[\b]		Backspace character
\G		Start of match

Regular Expression	Description	Example
(?=...)	Positive lookahead
(?!...)	Negative lookahead: excludes a match when the given pattern follows	a(?!b) matches "a" only if "a" is not followed by "b".
(?<=...)	Positive lookbehind
(?<!...)	Negative lookbehind
(?()|)	Conditional
\Q...\E	Removes the special meaning of the enclosed characters

Meta characters / Character classes

Meta characters have a pre-defined meaning and make certain common patterns easier to use.

Regular Expression	Description	Similar to
\d	A digit	[0-9]
\D	A non-digit	[^0-9]
\s	A whitespace character	[ \t\n\x0b\r\f]
\S	A non-whitespace character	[^ \t\n\x0b\r\f]
\w	A word character	[a-zA-Z_0-9]
\W	A non-word character	[^\w]
\S+	Several non-whitespace characters

These meta characters have the same first letter as their representation, e.g., digit, space, word, and boundary. Uppercase symbols define the opposite.

Quantifier

A quantifier defines how often an element can occur. The symbols ?, *, + and {} specify how many times the preceding element may repeat.

Regular Expression	Description	Similar to	Examples
*	Occurs 0 or more times	{0,}	X* finds zero or more letters X; .* finds any character sequence
+	Occurs 1 or more times	{1,}	X+ finds one or more letters X
?	Occurs 0 or 1 times	{0,1}	X? finds zero or exactly one letter X
{X}	Occurs exactly X times; { } gives the repetition count of the preceding literal		\d{3} searches for three digits, .{10} for any character sequence of length 10.
{X,Y}	Occurs between X and Y times		\d{1,4} means \d must occur at least once and at most four times.
{X,}	Occurs X or more times
*?	A ? after a quantifier makes it a reluctant quantifier: it tries to find the smallest match, so the regular expression stops at the first possible match.
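
To see how the quantifiers behave through the POSIX interface, the sketch below reports the length of the first match. match_length is a hypothetical helper. Two assumptions are worth stating: the examples use [0-9] because \d is a Perl-style shorthand that POSIX does not define, and reluctant quantifiers such as *? are likewise outside POSIX, which picks the longest match starting at the leftmost position.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: length of the first (leftmost) match of the POSIX
 * extended regex `pattern` in `text`, or -1 if there is no match or the
 * pattern does not compile. */
long match_length(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    long len = -1;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, &m, 0) == 0)
        len = (long)(m.rm_eo - m.rm_so);
    regfree(&re);
    return len;
}
```

On the text "ab12345", the pattern [0-9]{3} gives a length of 3, [0-9]{1,4} gives 4, and [0-9]+ gives 5. The pattern X* matches the empty string at the start of "abc", so its length there is 0 rather than -1.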

Special Character

\n	Newline
\r	Carriage return
\t	Tab
\0	Null character
\YYY	Octal character YYY
\xYY	Hexadecimal character YY
\x{YY}	Hexadecimal character YY
\cY	Control character Y

POSIX Classes

[:alnum:]	Letters and digits
[:alpha:]	Letters
[:ascii:]	ASCII codes 0 - 127 (a Perl/PCRE extension, not defined by POSIX)
[:blank:]	Space or tab only
[:cntrl:]	Control characters
[:digit:]	Decimal digits
[:graph:]	Visible characters, except space
[:lower:]	Lowercase letters
[:print:]	Visible characters, including space
[:punct:]	Visible punctuation characters
[:space:]	Whitespace
[:upper:]	Uppercase letters
[:word:]	Word characters (a Perl/PCRE extension, not defined by POSIX)
[:xdigit:]	Hexadecimal digits
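
A POSIX class name is only valid inside a bracket expression, so a digit is written [[:digit:]] rather than [:digit:]. The sketch below uses the classes as portable stand-ins for \w-style shorthands; is_identifier and its pattern are hypothetical examples, not part of any library.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if `s` looks like a C-style identifier,
 * i.e. a letter or underscore followed by letters, digits, or underscores. */
bool is_identifier(const char *s)
{
    regex_t re;

    assert(s != NULL);
    /* [[:alnum:]_] plays the role that \w plays in Perl-style flavors */
    if (regcomp(&re, "^[[:alpha:]_][[:alnum:]_]*$",
                REG_EXTENDED | REG_NOSUB) != 0)
        return false;
    bool ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

Here is_identifier("foo_bar1") is true, while is_identifier("1abc") and is_identifier("has space") are false. Which characters count as letters depends on the current locale; a program that never calls setlocale() runs in the "C" locale.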

Grouping and back reference

Regular Expression		Description
()	Round brackets	Groups part of a regular expression. Grouping makes it possible to apply a repetition operator to the complete group.
$	$x	Back reference to group x. A group is captured, and the back reference stores the part of the string that matched the group, so that it can be reused, for example in a replacement. $1 refers to the first group, $2 to the second, and so on.
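
The $x notation is what some flavors use in replacement strings. Inside a POSIX pattern, a back reference is written \1 to \9, and POSIX defines it for basic regular expressions, where groups are written \( and \). The sketch below assumes that syntax; has_repeated_run is a hypothetical helper.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if some run of letters is followed by a single
 * space and then the same run of letters again. */
bool has_repeated_run(const char *text)
{
    regex_t re;

    assert(text != NULL);
    /* Basic regex (no REG_EXTENDED): \( \) is a group and \1 matches
     * whatever text group 1 matched. Backslashes are doubled for C. */
    if (regcomp(&re, "\\([[:alpha:]][[:alpha:]]*\\) \\1", 0) != 0)
        return false;
    bool found = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}
```

The helper returns true for "hello hello world" and false for "hello world". Because the pattern has no word boundaries, it also returns true for "this is fine", where "is is" occurs inside "this is".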

Examples

Whitespace between a word character and . or ,: (\w)(\s+)([\.,])
The text between the two title elements: (?i)(<title.*?>)(.+?)(</title>)

Groups

(...)	Capturing group
(?P<Y>...)	Capturing group named Y
(?:...)	Non-capturing group
(?>...)	Atomic group
(?|...)	Duplicate group numbers
\Y	Match the Y'th captured group
(?P=Y)	Match the named group Y
(?R)	Recurse into entire pattern
(?Y)	Recurse into numbered group Y
(?&Y)	Recurse into named group Y
\g{Y}	Match the named or numbered group Y
\g<Y>	Recurse into named or numbered group Y
(?#...)	Comment

Mode modifiers / Flags

Mode modifiers are used to specify modes inside the regular expression. They can be added to the start of the regex. To specify multiple modes, simply put them together, e.g. (?ism).

Mode	Description
(?i)	makes the regex case insensitive
(?s)	for "single line mode" makes the dot match all characters, including line breaks (. matches newline as well)
(?m)	for "multi-line mode" makes the ^ and $ match at start and end of each line in the subject string.
(?x)	allows whitespace and comments in the pattern
(?J)	allows duplicate group names
(?U)	makes quantifiers ungreedy
(?iLmsux)	Sets flags within the regex
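
Inline modifiers such as (?i) are Perl-style syntax; the POSIX interface does not define them and takes flags as an argument to regcomp() instead. The sketch below shows the closest POSIX counterparts, REG_ICASE and REG_NEWLINE. The helper matches_with_flags is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: compile `pattern` as a POSIX extended regex with the
 * extra compile flags `cflags` and report whether it matches `text`. */
bool matches_with_flags(const char *pattern, const char *text, int cflags)
{
    regex_t re;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | cflags) != 0)
        return false;
    bool found = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}
```

REG_ICASE corresponds to (?i): "hello" matches "HELLO" only with that flag. REG_NEWLINE is similar to (?m): with it, ^second matches the second line of "first\nsecond". Unlike the Perl default, a POSIX dot matches a newline unless REG_NEWLINE is set, so "a.b" matches "a\nb" without the flag and fails with it.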

Backslashes - escape character in Programming

The backslash \ is also an escape character in string literals in most programming languages, where it has a predefined meaning. A backslash meant for the regex therefore has to be escaped in the source code.

In the source code	Meaning in the regex
\\	a single backslash
\\w	\w
\\\\	\\, which matches a literal backslash
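
In C the same doubling applies, because the compiler processes the string literal before regcomp() ever sees it. The sketch below makes the two layers visible; the constant names and the helper pattern_found are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* C source literal                        what regcomp() receives */
const char LITERAL_DOT[]       = "a\\.b";  /* a\.b : a, a literal dot, b */
const char LITERAL_BACKSLASH[] = "\\\\";   /* \\   : one literal backslash */

/* Hypothetical helper: true if the POSIX extended regex `pattern` matches
 * anywhere in `text`. */
bool pattern_found(const char *pattern, const char *text)
{
    regex_t re;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;
    bool found = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}
```

LITERAL_BACKSLASH is written with four backslashes in the source but holds only two characters, which the regex reads as one escaped backslash. LITERAL_DOT matches "a.b" but not "axb", since the escaped dot has lost its "any character" meaning.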
