Regexp

by moodyharsh

regular expressions

regular expressions are for matching text against a known grammar

a string is a sequence of quantified unicode characters or strings.

an element in a string can be
1.a character x{1,1} == x
2.a choice of characters (x|y)
3.a repeated character x{1,2}
4.a sequence of characters m{1,1}a{1,1}i{1,1}l{1,1} == mail
5.a choice of sequences mail|rama
6.repeated strings (ma!){1,2}
7.escape character \ , or \Q ... \E to quote a whole run
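
a minimal java sketch of the elements listed above. the class name ElementDemo and the helper whole are made up for illustration; Pattern.matches is the standard java call and it requires the whole input to match.

```java
import java.util.regex.Pattern;

// hypothetical demo class, not part of any library
class ElementDemo {
    // true only when the regexp matches the entire input
    static boolean whole(String regexp, String input) {
        return Pattern.matches(regexp, input);
    }

    public static void main(String[] args) {
        System.out.println(whole("x{1,1}", "x"));          // single character
        System.out.println(whole("(x|y)", "y"));           // choice of characters
        System.out.println(whole("x{1,2}", "xx"));         // repeated character
        System.out.println(whole("mail|rama", "rama"));    // choice of sequences
        System.out.println(whole("(ma!){1,2}", "ma!ma!")); // repeated string
        System.out.println(whole("\\Qa.b\\E", "a.b"));     // quoted run, the dot is literal
        System.out.println(whole("\\Qa.b\\E", "axb"));     // false, because the dot is literal
    }
}
```

every line prints true except the last one.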

() is used for both grouping and match extraction
the start and end indices of each match can also be extracted
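
as a sketch of group extraction and indices in java: GroupDemo, describe and the sample input are assumptions made for this example, while Matcher.group, start and end are the standard calls.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// hypothetical demo class
class GroupDemo {
    // describes the first user@host match, or returns null when there is none
    static String describe(String input) {
        Matcher m = Pattern.compile("(\\w+)@(\\w+)").matcher(input);
        if (!m.find()) {
            return null;
        }
        // group(1) and group(2) are the two () groups, counted by opening bracket;
        // start() is the index of the first matched char, end() is one past the last
        return m.group(1) + " at " + m.group(2) + " [" + m.start() + "," + m.end() + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe("to: rama@mail now")); // rama at mail [4,13)
    }
}
```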

in a string the positions are given by
1.bos ^ (also start of each line in multiline mode)
2.eos $ (also end of each line in multiline mode)
3.word boundary \b
4.non word boundary \B
5.negative look ahead (?!E)
6.positive look ahead (?=E)
7.end of previous match \G
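
the positions above match no characters themselves, they only constrain where a match may sit. a small java sketch that counts matches, with AnchorDemo and count being names invented for the example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// hypothetical demo class
class AnchorDemo {
    // number of non-overlapping matches of regexp in input
    static int count(String regexp, String input) {
        Matcher m = Pattern.compile(regexp).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count("\\bcat\\b", "cat concat cat"));  // 2, whole words only
        System.out.println(count("\\Bcat", "cat concat cat"));     // 1, the cat inside concat
        System.out.println(count("foo(?=bar)", "foobar foobaz"));  // 1, foo followed by bar
        System.out.println(count("foo(?!bar)", "foobar foobaz"));  // 1, foo not followed by bar
        System.out.println(count("^ab", "abab"));                  // 1, only at the beginning
        System.out.println(count("\\Gab", "ababxab"));             // 2, the chain breaks at x
    }
}
```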

special shorthands
for quantifiers
1.zero or one time ? == {0,1}
2.one to inf + == {1,}
3.0 to inf * == {0,}
for characters
1.feeds \f \r \n \t \v
2.spaces \s, non spaces \S
3.any digit \d == [0-9], non digit \D == [^0-9]
4.hex, unicode, octal codes \xhh \uhhhh \0ooo
5.any character .(dot), except line ends by default
6.word \w, non word \W
7.bell \a
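
a short java sketch checking some of the shorthands against their long forms. ShorthandDemo and whole are hypothetical names. the hex and octal escapes are written in the java.util.regex spelling, which may differ in other engines.

```java
import java.util.regex.Pattern;

// hypothetical demo class
class ShorthandDemo {
    // true only when the regexp matches the entire input
    static boolean whole(String regexp, String input) {
        return Pattern.matches(regexp, input);
    }

    public static void main(String[] args) {
        System.out.println(whole("colou?r", "color"));          // true, ? allows zero u
        System.out.println(whole("colou{0,1}r", "colour"));     // true, same thing spelled out
        System.out.println(whole("\\d+", "2024"));              // true
        System.out.println(whole("[0-9]{1,}", "2024"));         // true, long form of the line above
        System.out.println(whole("\\w+\\s\\w+", "hello world")); // true
        System.out.println(whole("\\x41\\0102", "AB"));         // true, hex 41 is A, octal 102 is B
        System.out.println(whole("a.c", "a\nc"));               // false, dot skips line ends by default
    }
}
```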

globbing and regexp
c == c
? == .
* == .*
[ ] == [ ]
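
the glob table can be turned into a small translator. this is only a sketch under simplifying assumptions: GlobDemo and globToRegexp are invented names, a [ ] class is copied through unchanged (so glob negation like [!x] is not handled), and * is allowed to match anything, including path separators.

```java
import java.util.regex.Pattern;

// hypothetical demo class
class GlobDemo {
    // translates a glob into a regexp string: ? to ., * to .*, [ ] kept, the rest literal
    static String globToRegexp(String glob) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < glob.length()) {
            char c = glob.charAt(i);
            int close = (c == '[') ? glob.indexOf(']', i + 1) : -1;
            if (c == '?') {
                sb.append('.');
                i++;
            } else if (c == '*') {
                sb.append(".*");
                i++;
            } else if (close > 0) {
                sb.append(glob, i, close + 1); // copy the whole class, brackets included
                i = close + 1;
            } else {
                if ("\\.^$|+(){}[]".indexOf(c) >= 0) {
                    sb.append('\\'); // escape regexp metacharacters
                }
                sb.append(c);
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(globToRegexp("*.txt"));                               // .*\.txt
        System.out.println(Pattern.matches(globToRegexp("*.txt"), "notes.txt")); // true
        System.out.println(Pattern.matches(globToRegexp("file?.[ch]"), "file1.c")); // true
    }
}
```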

character class [ ]
instead of long (x|y|z)
[xyz]
[x-z]
[^xyz]
[a-z&&[^x-z]] (intersection, same as [a-w])
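
a java sketch of the class forms. ClassDemo and whole are hypothetical names. the && intersection is java.util.regex syntax and is not available in every engine. note that inside [ ] a comma is just another member of the class.

```java
import java.util.regex.Pattern;

// hypothetical demo class
class ClassDemo {
    // true only when the regexp matches the entire input
    static boolean whole(String regexp, String input) {
        return Pattern.matches(regexp, input);
    }

    public static void main(String[] args) {
        System.out.println(whole("[xyz]", "y"));         // true, one of x y z
        System.out.println(whole("[x-z]", "y"));         // true, range
        System.out.println(whole("[^xyz]", "a"));        // true, anything but x y z
        System.out.println(whole("[^xyz]", "x"));        // false
        System.out.println(whole("[a-z&&[^x-z]]", "w")); // true, a to z minus x to z
        System.out.println(whole("[a-z&&[^x-z]]", "x")); // false
        System.out.println(whole("[x,y,z]", ","));       // true, the comma is a literal member
    }
}
```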

substitution
s/regexp/replacement/flags
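
java has no s/// operator, so the sketch below approximates it with String.replaceAll (like the g flag) and Matcher.replaceFirst (no g flag). SubstDemo, swap and first are names invented for the example. $1 and $2 in the replacement refer to the () groups.

```java
import java.util.regex.Pattern;

// hypothetical demo class
class SubstDemo {
    // roughly s/(\w+)@(\w+)/$2:$1/g , swaps the two groups in every match
    static String swap(String input) {
        return input.replaceAll("(\\w+)@(\\w+)", "$2:$1");
    }

    // roughly s/mail/post/i , first match only, ignoring case
    static String first(String input) {
        return Pattern.compile("mail", Pattern.CASE_INSENSITIVE)
                .matcher(input)
                .replaceFirst("post");
    }

    public static void main(String[] args) {
        System.out.println(swap("rama@mail sita@post")); // mail:rama post:sita
        System.out.println(first("Mail mail"));          // post mail
    }
}
```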

while programming in C++ an extra \ is needed, because \ is also the escape character of string literals
although regexp is best as a line parser, multiline parsing can be done by setting m mode
regexps are best used with utf-8, although ucs-2, utf-16 and utf-32 can also be used.
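
the notes on the extra \ and on m mode can be sketched in java, whose string literals need the same doubling as C++. MultilineDemo, LINE and countLines are hypothetical names. Pattern.MULTILINE is the java spelling of m mode.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// hypothetical demo class
class MultilineDemo {
    // the regexp ^\w+ \d$ with every \ doubled for the string literal
    static final String LINE = "^\\w+ \\d$";

    // counts matches of LINE in text, with or without m mode
    static int countLines(String text, boolean multiline) {
        Pattern p = multiline
                ? Pattern.compile(LINE, Pattern.MULTILINE)
                : Pattern.compile(LINE);
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "one 1\ntwo 2\nthree 3";
        System.out.println(countLines(text, false)); // 0, ^ and $ only see the whole input
        System.out.println(countLines(text, true));  // 3, ^ and $ now work per line
    }
}
```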

from java regexp
greedy matches the biggest first
reluctant matches the smallest first (additional ?)
possessive takes the biggest and doesn't back off (additional +)
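
the three modes can be compared on one input. a minimal java sketch, where QuantifierModeDemo and first are names made up for the example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// hypothetical demo class
class QuantifierModeDemo {
    // text of the first match, or null when nothing matches
    static String first(String regexp, String input) {
        Matcher m = Pattern.compile(regexp).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<a><b>";
        System.out.println(first("<.+>", input));  // <a><b> , greedy takes as much as it can
        System.out.println(first("<.+?>", input)); // <a> , reluctant stops at the first >
        System.out.println(first("<.++>", input)); // null , possessive eats the last > and never gives it back
        System.out.println(first("a++b", "aaab")); // aaab , possessive is fine when no back off is needed
    }
}
```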