3 Limitations and other notes | Through the lens of the RISKS Forum

3.1 Limitations

Rankings are sensitive to data cleaning steps and misclassifications. Synonyms and changing terminology, which often overlap for years, reduce the visibility of some topics.

There is a long tail of possible data corrections, including misspellings in the original, differences in US and British spellings, and sometimes overly blunt regular expressions used to clean the source text. I manually corrected for some where I noticed a word variant showing up in one of the plots; this has the effect of boosting the signal of n-grams that were already rising to the top. I wasn’t thorough or completely consistent. Since we are focused on the n-grams that are most common, this long tail need not be resolved along its length for us to have an interesting data set.
Preferred terms change, and their use overlaps in time. For example, “cellular phone” and “cellular telephone” gave way to “cell phone” in the US and “mobile phone” most other places. I transformed all of these terms to “mobile phone”, so we don’t lose this important topic. Another example: “e-voting”, “online voting”, and “internet voting”; in this case I did not attempt to consolidate them into one term, assuming “vote” or “voting” would rise on its own merits. In contrast, I did transform “fake news” into “fake_news” so we don’t lose this important concept in the unigram plots.
While I did review and correct spaCy’s classification of the most common nounphrases, I could do more to correct misclassifications, for example, organizations or media misclassified as persons and vice versa. I used spacyr, which is is a wrapper for the spaCy library. I used the medium-sized English language model (en_core_web_md); see https://spacy.io/models/en#en_core_web_md.
In addition to excluding predefined stop words, I removed the following words, because they are not interesting, or (as in the case of “risk”, “computer”, “system” and "software), they are too common:

"Uninteresting" words
Excluded from unigrams and bigrams

can, chapter, chapter eleven, chapter sixteen, chapter twelve
computer, computer risk, et al, get, go, include
kb, man, may, million, new, re
review, risk, risks, say, she, software
system, their, them, they, use, via
what, who, you

Since submissions’ body text often includes a signature section at the end, I removed the following names of companies, places, and submitters. I also removed media sources that were commonly mentioned. I am not suggesting the people with the names included below are uninteresting people! Quite the contrary, as frequent contributors to the RISKS Forum, it’s necessary to filter out their names so that the topics they write about (rather than their names) rise to the top. Likely I could do more to filter out words in signatures. It’s possible that I misclassified some people below as authors when in fact they were the subject of posts.

"Uninteresting" words
Excluded from unigrams and bigrams

computers publishing, georgia institute of technology, ims health holdings, isect ltd. zealand
john wiley sons, lexis-nexis, lgs inc., macmillan computer publishing, mcgraw-hill ryerson
mgmt consultant, sri international, tech 1, tttech, bath ba1 1px
canton st., flight international, manvers stree, north america, west german
alan wexelblat, amos shapir, carpenter, catalin cimpanu, charlie osborne
dave horsfall, dave kennedy, david, dewayne hendricks, dewayne hendricks
don norman, gabe goldberg, gabe goldberg, gene, henry
herb lin, jim h., jim horning, john, klaus brunnstein
lauren, lucian constantin, mark, melissa, monte
monty solomon, monty solomon, nick brown, paul, paul robinson
peter, peter ladkin, peter mellor, pgn, pgn - ed
pgn-ed, rob, robert dorsett, ross anderson, sean gallagher
slade, solomon, ted samson, weinstein, wirchenko
ars technica, bbc news, buzzfeed news, cbc news, daythink
dejanews, edupage, executive news service, fox news, gawker media
idg news service, ieee spectrum, infoword, infoworld, infoworld home
internationale medieninformatik, la times, los angeles times, mercury news, mit technology review
nbc news, newsscan, newsscan daily, ny times, pc world
redmond magazine, risks-forum digest, san francisco chronicle, svenska dagbladet, verge
wall street journal, washington post, wsj, zdnet

To improve clarity and accuracy in representation, I made other small transformations and corrections not detailed here.

3.2 Acknowledgements

Thousands of people have contributed content to the RISKS Forum over the last 35+ years–no one more than the moderator Peter G. Neumann.

Special thanks to Lindsay Marshall and Newcastle University for hosting the RISKS Digest archive, which was the source for this analysis. The Internet Archive first indexed the archive in December 1996.

This analysis would have been a lot harder to accomplish without the high quality, free software in the R ecosystem: R itself, RStudio, the packages listed below and their many prerequisite packages. Most of these packages are available through one of the CRAN mirrors; CRAN is an initiative of the R Project.

File system, paths and files

library(here)
library(fs)
library(openxlsx)

Scraping web pages

library(rvest)
library(xml2)

Data cleaning, manipulation and more

library(tidyverse)
library(lubridate)

Text cleaning and manipulation

library(janitor)
library(stringi)
library(pluralize)
library(glue)

Textual analysis

library(spacyr)
library(tidytext)
library(tidylo)

Visualization

library(hrbrthemes)
library(scales)
library(ggrepel)
library(ggtext)
library(ggridges)
library(patchwork)
library(gt)

Other

library(attempt)

3.3 Document history

Date	Changes
2020-11-09	Published at dmoul.github.io with data through part of August 2020
2020-11-21	Updated narrative and acknowledgments; updated data through part of November 2020

By Daniel Moul (heydanielmoul via gmail)

CC-BY This work is licensed under a Creative Commons Attribution 4.0 International License