2 Multiple perspectives

As the term n-gram implies, there are other word chunkings we could consider besides unigrams, and in fact there are alternatives to how we determine the unigrams. Above I used spaCy to lemmatize words and treated these lemmas as the unigrams. In this section, we will look at unigrams defined as noun phrases, whare are one or more words that fulfill the function of a noun in a sentence. This is possible, because spaCy can parse with an awareness of the parts of speech, We’ll look first at noun phrase unigrams from the titles of submissions, then from the body text of submissions. Following that we’ll briefly look at bigrams generated by all of the above techniques.

2.1 An example: “car”

Why do unigrams, bigrams and noun phrases provide different perspectives? Consider car, which can stand alone as a unigram or be part of these longer n-grams (the trigrams each include a car-related bigram). Suppose each of the car n-grams below can be found in a distinct sentence of text. Then the count of each type are as follows:

type	ngram	n
unigram	car	17
bigram	car door, car electronics, car destroyed, car software, car computer, car insurance, car lcd, car navigational, car repair, car train	10
trigram	car computer system, car insurance policy, car lcd display, car navigational system, car repair shop, car train collision	6

Unigrams are less precise than bigrams and trigrams, however unigrams are helpful in signaling a general topic and occur at much greater frequency than bigrams and trigrams in the same corpus.

I find the n-grams in two ways:

Naively: simply count the occurrence of each n-gram by treating words in their base form (“lemma”) as a unigram or bigram; and
With awareness of grammatical structure to identify noun phrases.

The results of the two approaches differ considerably, primarily because with noun phrases, a subset of a larger noun phrase is not counted as a shorter n-gram.

type	ngram	n
unigram	car	17
noun phrase unigram	car	1
bigram	car door, car electronics, car destroyed, car software, car computer, car insurance, car lcd, car navigational, car repair, car train	10
noun phrase bigram	car door, car electronics, car destroyed, car software	4
trigram	car computer system, car insurance policy, car lcd display, car navigational system, car repair shop, car train collision	6
noun phrase trigram	car computer system, car insurance policy, car lcd display, car navigational system, car repair shop, car train collision	6

2.2 Different perspectives: summary by type

These distinctions are evident in the twenty most common unigrams, bigrams and noun phrase unigrams. Note first the big difference in frequencies as seen in the x axis. The character of the lists is different too:

Unigrams are for the most part general risk topics
Noun phrase unigrams from titles are mostly organizations, countries and adjectives related to countries. noun phrase unigrams from body text are a wider mix of people, concepts, products, and organizations.
Simple bigrams are more particular risk topics–or shards of them. For example, “credit card” and “buffer overflow” are sources of many risks, while “driving car” is likely part of a trigram, for example, “self driving car” and “autonomous driving car;” “air traffic” is probably part of “air traffic control.”

Figure 2.1: Most common n-grams–all years

2.3 Unigrams from titles

The Visualizations and discussion in The lens uses lemmatized words as unigrams. Below we’ll look at unigrams derrived from noun phrases, which provide a different perspective.

Counting the frequency of noun phrase unigrams in five-year periods and calculating a percentage of all mentions in each period makes visible some trends.

Common topics

Aerospace has been a common source of topics, with references to manufacturers Airbus and Boeing as well as specific planes (A320, A380), rockets (Ariane), and topics related to space exploration (NASA, Mariner, mars).
Whereas privacy, encryption and data are consistently in the top-five in the ranking of simple unigrams we looked at earlier, here they are missing. Instead the list includes many more place-related terms: British, Dutch, German, Japanese, Russia and Russian, China and Chinese, Canadian, American, California, San Francisco, etc. We also see more organizations: congress, IRS, Stanford, Intel, Amazon, Skype, NSA, and the same technology companies: Microsoft, Facebook, Google, and Apple.

Compare this to Figure 1.3

Figure 2.2: Noun phrase unigram frequency: five-year periods

2.4 Unigrams from body text

There is a lot more body text than title text: 22 times more. However, noun phrase unigrams do not include words that are not part of noun phrases or are part of multi-word noun phrases, so the set of words we have to work with is only about one tenth as large: (1,355 unique, noun phrase unigrams) compared to (15,971 unique, simple unigrams. In the body text, the most common noun phrase is one word in length, as seen in Figure 2.3:

Figure 2.3: Number of words in body text noun phrases

Below are the most frequent noun phrase n-grams for each value of \(n\) in the submission body text. The 8-word n-gram is part of a frequent contributor’s email signature (“uunet!…”).

Number of words	Most popular n-gram
1	clipper
2	edward snowden
3	michel e. kabay
4	natl computer security assn
5	sei conference on risk management
6	sei customer relations software engineering institute
8	u of toronto zoology uunet!attcan!utzoo!henry
17	software reliability software safety computer security formal methods tools process models real-time systems networks embedded systems

The 17-word ngram is most of a table in a call for papers for COMPASS ’94 in Volume 15 Issue 34, which after text processing, reads as follows:

Papers, panel session proposals, tutorial proposals, and tools fair proposals are
solicited in the following areas:

Software Reliability     Software Safety     Computer Security
Formal Methods           Tools               Process Models
Real-Time Systems        Networks            Embedded Systems
V&V Practices            Certification       Standards

Looking at relative frequency of the noun phrase unigrams from body text within each time period in Figure 2.4, we can see the track of technological evolution: uucp, arpanet, and clipper in the early years, followed by java, javascript and pgp in the middle years, then siri, bitcoin, iphone and android in the later years. USB shows up in the last three five-year periods.

Compare this figure to Figure 1.3 and 2.2

Figure 2.4: Noun phrase unigram frequency: five-year periods

2.5 Bigrams

Bigrams are two words that occur adjacent to each other in the text. Figure 2.5 show the most common simple bigrams in each 5-year period. These are derived from the submission titles. Note the very low number of occurrences.

“Grandma wind” is third bigram in the 2010-2014 column. It illustrates some of the challenges and limitations in doing this kind of textual analysis. It’s part of the title of the thread “Don’t throw away Grandma’s wind-up desk clock” started by Danny Burstein in Volume 26.

Grandma’s was lemmatized to “grandma”
wind-up was separated into two words
up was filtered out, because it’s a short, common word
So we have grandma wind as our bigram

Figure 2.5: Simple bigrams

Figure 2.6 presents the same visualization using noun phrase bigrams from titles. Note the similarly low frequencies.

Here again we can see limitations: AT&T and fighter planes F15, F-16, and F22 should not be considered bigrams.

Figure 2.6: Noun phrase bigrams from titles

The most common bigrams sourced from bodytext noun phrases are shown in 2.7. Again place-related bigrams dominate the top of the list. This could be due in part to these bigrams being part of email signatures that I imperfectly filtered out. Further down the list we do see n-grams that we otherwise might miss, including college board, Fannie Mae, B-2, Kapersky Lab, Ashley Madison, and Homeland Security.

Figure 2.7: noun phrase bigrams from body text

There is no one “correct” way to collect n-grams from a corpus. Using multiple techniques provides a richer view into the topics that have been addressed in the RISKS Forum.

By Daniel Moul (heydanielmoul via gmail)

CC-BY This work is licensed under a Creative Commons Attribution 4.0 International License