[ Collection: Introduction to CQP ]

3c. Complex Queries

This section introduces complex queries, i.e., queries for sequences of tokens rather than just a single token. It presupposes that you have read Section 1 and Section 2.

Sequences of attribute-value pairs

So far, we have only looked at queries involving a single token: we looked for love, for nouns, etc. Corpus linguistic investigations often start from individual words, so such queries are very typical. However, many research questions involve multi-word expressions, i.e., sequences of tokens. For such cases, CQP allows us to construct complex queries that consist of a sequence of attribute-value pairs, each enclosed in its own pair of square brackets: [attribute="value"] [attribute="value"].

For example, we might want to search for true love only. The query for this would look as follows (note the %c to make the query case insensitive):

[word="true"%c] [word="love"%c]

Enter this query at the prompt and hit RETURN. You will see that now your concordance consists only of cases of the sequence true love.

Or, we might feel nostalgic and want to search for lost love. The query would look like this:

[word="lost"%c] [word="love"%c]

Of course, we can also construct sequences that use different attributes at different positions. For example, we might be interested in all the different types of love in the corpus, i.e., all sequences of an adjective and the word love. The query would look like this (AJ0 is the tag the BNC uses for uninflected adjectives):

[pos="AJ0"] [hw="love"]

Run this query and look at the result. Your concordance will now contain all such sequences, for example, imaginative love, sexual love, Greek love, modern love and free love. It would be nice if there was a simple way of creating a list of all adjectives preceding the word love, and in fact, there is such a way, which will be described in Section 4a.
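
Note that AJ0 does not cover every adjective form: in the BNC tagset, comparative and superlative adjectives (e.g., greater, greatest) carry the tags AJC and AJS. As a sketch, assuming your corpus uses these tags and you want to include such forms, the notation with parentheses and the pipe symbol from Section 3a can widen the first position. The line starting with # is a comment for the reader and need not be typed:

```
# positive (AJ0), comparative (AJC) or superlative (AJS) adjective, followed by a form of love
[pos="AJ(0|C|S)"] [hw="love"]
```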

Excluding elements from sequences

Recall from Section 2 that we can also search for tokens that do not have a particular property (for example, we searched for all words other than love using the query [hw!="love"]). It was pointed out there that this possibility is particularly useful in complex queries. Note that line 16 of the last concordance we created contains the match long love, where love is actually the first element of the compound love affair. If we are interested in adjectives modifying the word love, such cases would confound the results, so we can specify that the word love should not be followed by a noun:

[pos="AJ0"] [hw="love"] [pos!="NN1"]

(Strictly speaking, this only excludes singular nouns. You can use the notation with parentheses and the pipe symbol described in Section 3a to exclude both singular and plural nouns.)
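
Written out, a query that excludes both could look like the following sketch. It assumes the BNC tags NN1 (singular common noun) and NN2 (plural common noun); the line starting with # is only an annotation and need not be typed:

```
# adjective + love, not followed by a singular (NN1) or plural (NN2) common noun
[pos="AJ0"] [hw="love"] [pos!="NN(1|2)"]
```

Other noun tags, such as the one for proper nouns, are not excluded by this pattern.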

Sequences with gaps

Of course, linguistic structures are often more complex than a simple sequence of tokens: they may contain optional or variable positions. For example, a noun phrase consists of an optional determiner, optionally followed by one or more adjectives, followed by a noun (or, in the case of compounds, a sequence of nouns). CQP allows us to construct queries that take this optionality into account.

Let us use a simpler example than that of a noun phrase. Let us say that we are interested in falling in love – the fixed expression, not the event denoted by it. We could construct the following query:

[hw="fall"] [word="in"%c] [word="love"%c]

However, this will only find cases where the three words occur in an uninterrupted sequence (try it). It is possible that an adverb occurs between fall and in, as in the line I fell so hard in love with you from the song Just one look by the great Doris Troy. We can adjust our query to take this into account by simply inserting an empty pair of square brackets in the appropriate position:

[hw="fall"] [] [word="in"%c] [word="love"%c]

This empty pair of brackets means “any token”. If you run this query, you will find sequences like fall completely in love, fall passionately in love, fall hopelessly in love, and so on.
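
Because the empty brackets match any token at all, not only adverbs, the results may include material you are not interested in. If only adverbs should be allowed in the gap, the position can be filled with an ordinary attribute-value pair instead. The following is a sketch that assumes AV0 is the tag for general adverbs, as in the BNC tagset (the line starting with # is a comment and need not be typed):

```
# fall + exactly one adverb + in love
[hw="fall"] [pos="AV0"] [word="in"%c] [word="love"%c]
```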

However, this does not solve our problem yet, because now the query only finds cases where something occurs between fall and in; it does not take into account that this element is optional. To do this, we can attach a pair of curly braces to the token, containing two numbers separated by a comma, with the first number specifying the minimum number of times that the element must occur and the second specifying the maximum: {min,max}. For example, if we want to specify that zero or one token may occur between fall and in, the query would look like this:

[hw="fall"] []{0,1} [word="in"%c] [word="love"%c]

Try the query. You will see that it now includes sequences with and without an adverb (you will have to scroll down a little before you see the first case with an adverb).

In the song line cited above, there are actually two elements between fall and in (so and hard). We could take this into account by adjusting the relevant part of the query to []{0,2}, but what if we want to find only those cases where there are exactly two elements between fall and in? In this case, we simply put a single number between the curly braces, which then means “exactly this number of times”:

[hw="fall"] []{2} [word="in"%c] [word="love"%c]

You will find that there are a few such cases in the BNC.
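
For comparison, here is the variant mentioned above that allows anything from zero to two tokens in the gap, followed by a sketch that restricts the gap to adverbs. The second query assumes that AV0 is the BNC tag for general adverbs; lines starting with # are comments and need not be typed:

```
# fall + zero, one or two arbitrary tokens + in love
[hw="fall"] []{0,2} [word="in"%c] [word="love"%c]

# fall + zero, one or two adverbs (assumed tag: AV0) + in love
[hw="fall"] [pos="AV0"]{0,2} [word="in"%c] [word="love"%c]
```

The first query is the more inclusive one and will also return matches in which the intervening tokens are not adverbs, so its concordance needs to be checked more carefully.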

Summary and outlook

This section has shown you how to create concordances using complex queries. Building on this, you can look at the following sections in any order:

[ Introduction to CQP: Section 1 | Section 2 | Section 3 | Section 4 | Section 5 | Section 6 ]

cqp/complex-queries.txt · Last modified: 2024/01/21 07:11 by aamoakuh