[ Collection: Introduction to CQP ]

3e. Regular expressions (basics)

This section introduces regular expressions, sequences of symbols and characters that describe classes of strings, and shows how they can be used inside the value of an attribute-value pair in CQP. It presupposes that you have read Sections 1, 2, 3a, 3b, and 3c.

So far, the values we provided for the different attributes in our queries corresponded exactly to one string: [word="love"] will match tokens where the wordform is exactly love, [pos="NN1"] will match tokens where the pos tag is exactly NN1, etc. This is often sufficient, but sometimes we want to look for more than one string in a single query. For example, we may want to search for the strings love a little and live a little. We could do this using two separate queries:

[word="love"%c] [word="a"%c] [word="little"%c]
[word="live"%c] [word="a"%c] [word="little"%c]

But note that the queries only differ by a single character! Would it not be great if there were a way of searching for a string consisting of an l followed by an o or an i followed by the sequence ve?

This is where regular expressions come in: among other things, they allow us to specify a character class instead of a specific character.

Character classes

The largest character class is represented by the period: . – this stands for “any character”. Using this character class, we can combine the two queries shown above:

[word="l.ve"%c] [word="a"%c] [word="little"%c]

Try the query using the BNC, and you will see that it finds all instances of the two strings love a little and live a little. It would theoretically also find strings with other characters in the position of the period, such as leve, lxve, l3ve or l;ve, but these do not occur in the BNC. This character class is useful if you don't know what to expect in a particular position of a string. For example, you may be interested in which words differ from love in just the first character. The query [word=".ove"%c] will provide the answer (run it over the BNC).
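If you want to see the effect of the period without access to the BNC, the following minimal sketch imitates the query on a single token. It rests on several assumptions: Python's re module is used as a stand-in and is not the engine CQP uses, the word list is invented for illustration, re.fullmatch stands in for CQP matching the expression against the whole token, and re.IGNORECASE plays the role of %c.

```python
import re

# Hypothetical word list for illustration; not drawn from the BNC.
words = ["love", "Live", "lave", "l3ve", "leave", "glove", "move"]

# "." stands for exactly one arbitrary character.
pattern = re.compile(r"l.ve", re.IGNORECASE)

# fullmatch requires the whole word to match, not just a part of it.
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)
```

Note that leave is not selected: the period covers exactly one position, so only four-character words can match.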

In many cases, you will want to be more specific than this. In these cases, you can define your own class containing exactly the characters you want it to contain, by listing them between square brackets. For example, if you want to find all cases of love and dove, but not move, cove, hove, etc., you can define a class containing just the characters l and d: [ld]. Try running the query [word="[ld]ove"%c]. Using a self-defined character class [oi], you can make the query shown above more precise:

[word="l[oi]ve"%c] [word="a"%c] [word="little"%c]

You can also define character classes negatively, by specifying what characters they should not contain. You do this by adding a caret (^) at the beginning of your list of characters: [^ld]ove will find all words beginning with any character except l or d, followed by ove.
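The difference between a listed class and its negated counterpart can be sketched as follows. This is an illustration under stated assumptions, not CQP itself: it uses Python's re module, an invented word list, and a hypothetical helper named select that keeps only the words matched in full.

```python
import re

# Hypothetical word list for illustration; not drawn from the BNC.
words = ["love", "dove", "move", "cove", "hove", "glove"]

def select(regex, tokens):
    """Return the tokens that regex matches in full, ignoring case (like %c)."""
    compiled = re.compile(regex, re.IGNORECASE)
    return [t for t in tokens if compiled.fullmatch(t)]

# [ld] accepts only l or d in the first position.
listed = select(r"[ld]ove", words)

# [^ld] accepts any single character except l or d in the first position.
negated = select(r"[^ld]ove", words)

print(listed)
print(negated)
```

Neither pattern selects glove, because both classes describe a single position and glove has two characters before ove.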

Many versions of regular expressions, including the one used in CQP, provide a range of predefined character classes that are more specific than the period (.). Some useful examples of such classes are:

  • [[:alpha:]] – all alphabetic characters (roughly, all letters)
  • [[:digit:]] – all numeric characters (roughly, all digits)
  • [[:alnum:]] – all alphanumeric characters (roughly, all letters and numbers)
  • [[:upper:]] – all upper-case alphabetic characters
  • [[:lower:]] – all lower-case alphabetic characters
  • [[:punct:]] – all punctuation marks
  • [[:blank:]] – spaces and tabstops
  • [[:space:]] – all whitespace characters (spaces, tabstops, line breaks, etc.)

You are unlikely to need the last two, since CQP does not allow whitespace within tokens, so none of the tokens in any of our corpora contain spaces, tabstops, etc.
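The predefined classes can be approximated outside CQP, with some caveats. Python's re module does not understand the [[:name:]] notation, so the sketch below substitutes hand-written, ASCII-only classes; this is an assumption made for illustration, and the classes in CQP may also cover letters outside the ASCII range. The names POSIX_APPROX and all_in_class are hypothetical.

```python
import re
import string

# ASCII-only approximations of the named classes (an assumption of this
# sketch; they are not the definitions CQP itself uses).
POSIX_APPROX = {
    "alpha": "[A-Za-z]",
    "digit": "[0-9]",
    "alnum": "[A-Za-z0-9]",
    "upper": "[A-Z]",
    "lower": "[a-z]",
    "punct": "[" + re.escape(string.punctuation) + "]",
}

def all_in_class(token, name):
    """Return True if every character of token belongs to the named class."""
    return re.fullmatch(POSIX_APPROX[name] + "+", token) is not None

print(all_in_class("NN1", "alnum"))   # letters and digits only
print(all_in_class("NN1", "alpha"))   # contains a digit
print(all_in_class("...", "punct"))   # punctuation only
```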

Quantification

Using character classes in your queries will often make things easier, but they are still constrained in one respect: they apply only to a single position in a string, whereas we may want to specify a sequence of characters. For example, we may want to find all cases of the interjection oh (as in Oh, my love). The o in this interjection is often repeated a number of times to indicate the length or intensity of the interjection: in the BNC, we find oh, ooh, oooh, ooooh, oooooh, ooooooh, ooooooooh, and ooooooooooooooooh. But how do we find them?

Regular expressions provide ways of specifying how often a character (or character class) should occur. There are three general quantifiers:

  • ? – the preceding character (class) may occur zero times or once
  • * – the preceding character (class) may occur zero or more times, with no upper limit
  • + – the preceding character (class) may occur one or more times, with no upper limit

For example, the query [word="o+h"%c] will return all of the variants listed above (it corresponds to “one or more occurrences of o followed by an h”). Note that the interjection is often spelled without a final h – the query [word="o+h?"%c] would also return these cases (it corresponds to “one or more occurrences of o followed by zero or one occurrence(s) of h”) – try it. In fact, the final h may also be repeated to indicate intensity, so an even better query would be [word="o+h*"%c] (which corresponds to “one or more occurrences of o followed by zero or more occurrence(s) of h”) – again, try it.
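The three quantifiers can be compared side by side in a small sketch. As before, this relies on assumptions: Python's re module stands in for the engine used by CQP, the word list is invented, and select is a hypothetical helper that keeps only the words matched in full.

```python
import re

# Hypothetical word list for illustration; not drawn from the BNC.
words = ["oh", "ooh", "oooh", "o", "oo", "ohh", "oohhh", "h", "Oh", "ho"]

def select(regex, tokens):
    """Return the tokens that regex matches in full, ignoring case (like %c)."""
    compiled = re.compile(regex, re.IGNORECASE)
    return [t for t in tokens if compiled.fullmatch(t)]

plus = select(r"o+h", words)       # one or more o, then exactly one h
optional = select(r"o+h?", words)  # one or more o, then at most one h
star = select(r"o+h*", words)      # one or more o, then any number of h

print(plus)
print(optional)
print(star)
```

Each pattern is more permissive than the one before it, so every word selected by o+h is also selected by o+h? and by o+h*. None of them selects h or ho, since all three require at least one o at the start.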

These quantifiers can also be applied to character classes. For example, the following query would find all words that begin with an l and end with the sequence ve, with at least one character in between (for example, love, live, leave and legislative):

[word="l.+ve"%c]

If we want to be more specific and just find words starting with l followed by one or more vowels followed by ve, we can combine the + with a self-defined character class:

[word="l[aeiou]+ve"%c]

This will give us love, live, leave and a few others.
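To see how narrowing the class changes the result, the two patterns can be run over the same list. This is a sketch under assumptions: Python's re module is a stand-in for CQP's regex matching, the words are invented, and select is a hypothetical helper.

```python
import re

# Hypothetical word list for illustration; not drawn from the BNC.
words = ["love", "live", "leave", "legislative", "lve", "glove", "lave"]

def select(regex, tokens):
    """Return the tokens that regex matches in full, ignoring case (like %c)."""
    compiled = re.compile(regex, re.IGNORECASE)
    return [t for t in tokens if compiled.fullmatch(t)]

# ".+" allows one or more characters of any kind between l and ve.
any_chars = select(r"l.+ve", words)

# "[aeiou]+" allows one or more vowels only.
vowels_only = select(r"l[aeiou]+ve", words)

print(any_chars)
print(vowels_only)
```

The second pattern drops legislative because the material between l and ve contains consonants. Neither pattern selects lve, since + demands at least one character.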

Note that regular expressions are case-sensitive. The above queries will find upper- and lowercase versions of the specified strings because they use the %c flag, but without this flag, we would have to define, for every position, a character class that contains both the upper- and the lowercase letter:

[word="[Ll][AEIOUaeiou]+[Vv][Ee]"]
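The equivalence between the flag and the spelled-out classes can be checked with a short sketch. It assumes that Python's re.IGNORECASE behaves like %c for plain ASCII letters, which is an approximation made for illustration; the word list is invented.

```python
import re

# Hypothetical word list for illustration; not drawn from the BNC.
words = ["love", "Love", "LOVE", "LeAvE", "live"]

# Case-insensitive matching, standing in for the %c flag.
folded = [w for w in words
          if re.fullmatch(r"l[aeiou]+ve", w, re.IGNORECASE)]

# The same pattern without the flag is case-sensitive.
strict = [w for w in words if re.fullmatch(r"l[aeiou]+ve", w)]

# Without the flag, both cases must be listed for every position.
spelled_out = [w for w in words
               if re.fullmatch(r"[Ll][AEIOUaeiou]+[Vv][Ee]", w)]

print(folded)
print(strict)
print(spelled_out)
```

The first and third lists are identical, which is why the flag is usually the more readable choice.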

Instead of using the general-purpose quantifiers ?, * and +, we can also specify exact numbers or ranges, using the notation with curly brackets that you have already seen in the preceding section.

  • {n} – exactly n occurrences (e.g. [word="o{3}h"%c] will return oooh)
  • {min,} – at least min occurrences (e.g. [word="o{3,}h"%c] will return oooh and all cases with more than three o's)
  • {,max} – at most max occurrences (e.g. [word="o{,3}h"%c] will return h, oh, ooh and oooh)
  • {min,max} – between min and max occurrences (e.g. [word="o{3,6}h"%c] will return oooh, ooooh, oooooh, and ooooooh)
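The four curly-bracket forms listed above can be tried out on a generated list of strings. This is a sketch under assumptions: Python's re module is used as a stand-in and happens to accept all four forms, including {,max}; it does not show how any other engine treats them. The helper select is hypothetical.

```python
import re

# Hypothetical test items: "h" preceded by zero to seven o's.
words = ["o" * n + "h" for n in range(8)]

def select(regex, tokens):
    """Return the tokens that regex matches in full, ignoring case (like %c)."""
    compiled = re.compile(regex, re.IGNORECASE)
    return [t for t in tokens if compiled.fullmatch(t)]

exactly_three = select(r"o{3}h", words)    # {n}
at_least_three = select(r"o{3,}h", words)  # {min,}
at_most_three = select(r"o{,3}h", words)   # {,max}
three_to_six = select(r"o{3,6}h", words)   # {min,max}

for name, result in [("{3}", exactly_three), ("{3,}", at_least_three),
                     ("{,3}", at_most_three), ("{3,6}", three_to_six)]:
    print(name, result)
```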

Grouping and alternatives

In some cases, even a combination of character classes and quantifiers is not enough. For example, if we want to find all cases of the strings love me and leave me, we could use the following query:

[word="l[oea]{1,2}ve"%c] [word="me"%c]

But what if we also wanted to find the string fool me? We could try to construct this query using character classes and quantifiers, like this:

[word="[lf][oea]{1,2}[vl]e?"%c] [word="me"%c]

The query corresponds to “an l or f, followed by one or two occurrences of any of the characters o, e and a, followed by a v or an l, followed by zero or one occurrence(s) of the character e”. This will return the three strings you want, but also the string feel me; also, it is quite difficult to read.

In such cases, it is easier to define a group of strings instead of classes of characters. This is done by enclosing the group of strings in parentheses and using the pipe symbol | to separate the strings from each other (it means “or”):

[word="(love|leave|fool)"%c] [word="me"%c]

We can also use grouping inside a string. The following will find love me and leave me:

[word="l(o|ea)ve"%c] [word="me"%c]
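The trade-off between the class-based pattern and the grouped alternatives can be sketched for the first token of the query. The assumptions are the same as in any such illustration: Python's re module stands in for CQP's regex matching, the word list is invented, and select is a hypothetical helper.

```python
import re

# Hypothetical word list for illustration; not drawn from the BNC.
words = ["love", "leave", "fool", "feel", "foal", "live"]

def select(regex, tokens):
    """Return the tokens that regex matches in full, ignoring case (like %c)."""
    compiled = re.compile(regex, re.IGNORECASE)
    return [t for t in tokens if compiled.fullmatch(t)]

# Classes and quantifiers: compact, but accepts unwanted words too.
loose = select(r"[lf][oea]{1,2}[vl]e?", words)

# A group of whole-word alternatives: exactly the three words wanted.
grouped = select(r"(love|leave|fool)", words)

# A group inside a string: only the middle part varies.
inner = select(r"l(o|ea)ve", words)

print(loose)
print(grouped)
print(inner)
```

The class-based pattern also picks up feel and foal from this list, whereas the grouped patterns select only what is spelled out in them.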

Summary and outlook

This section has shown you how to create complex regular expressions. As you read other sections, always think about how you might apply them. Section 5a will show you how to use regular expressions outside of individual tokens to create very complex queries.

cqp/regular-expressions-basics.txt · Last modified: 2024/01/21 07:15 by aamoakuh