Web-Mining Project - I: Graph Analysis of a Story Book (Part 1)
by Yifei Zhou
1. Introduction
In this assignment, I combined graph-based analysis techniques with NLP (Natural Language Processing) approaches to perform a textual analysis of a book. One aim of this assignment is to create a correlation graph focused on the most frequent words occurring in the book. Another aim is to reveal the main theme of the book by comparing graphs built from different parts of it, such as the first half and the second half.
2. Dataset Preprocessing
The first step is dataset preprocessing. The original text of a book is rarely regular or clean, because it consists of a large number of sentences mixed with recognized and unrecognized symbols. Preprocessing is therefore necessary to obtain a clear view of the book.
The preprocessing step consists of two stages: primary preprocessing and advanced preprocessing. Primary preprocessing deals with ordinary text (strings) and includes sentence and word tokenization, stop-word removal and punctuation removal. Its basic aim is to clean the original text, so it operates only at the lower (string) level. The second stage, advanced preprocessing, operates at the semantic level. Based on the output of the previous stage, I extracted the meaning of words and chapters and the internal relationships between them. The methods used in this stage include Part-Of-Speech tagging, named entity extraction and N-grams.
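The primary stage described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the input sentence and the tiny stop-word list `toy_stop_words` are hypothetical, and the function name `primary_preprocess` is introduced here for the example rather than taken from the project code.

```r
# Toy input; in the project this would be the text of one chapter.
raw_text <- "The cat sat on the mat. The dog, however, slept!"

# Hypothetical, deliberately tiny stop-word list for illustration only.
toy_stop_words <- c("the", "on", "however")

primary_preprocess <- function(text, stop_words) {
  text <- tolower(text)                          # normalise case
  text <- gsub("[[:punct:]]+", " ", text)        # replace punctuation with spaces
  tokens <- strsplit(text, "[[:space:]]+")[[1]]  # word tokenization on whitespace
  tokens <- tokens[nzchar(tokens)]               # drop empty strings
  tokens[!tokens %in% stop_words]                # remove stop words
}

tokens <- primary_preprocess(raw_text, toy_stop_words)
print(tokens)
```

A real run would use a full stop-word list (for example the one shipped with `tidytext`) instead of the three words assumed here.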
2.1 Required Library Loading
The code segment below loads the libraries required for this task. First, I had to configure the `JAVA_HOME` variable because the textual analysis relies on the `rJava` package. The `stringr` package handles string processing. The two core packages for NLP are `openNLP` and `NLP` [1A & 1B]. `tidytext` and `widyr` both support the correlation analysis. `dplyr` performs operations on data frames, and `pluralize` singularizes words. Finally, `igraph` is the fundamental package for graph plotting and `RColorBrewer` provides the color palettes. [Note: I replaced the tidyverse package because it conflicts with the NLP package.]
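# JAVA_HOME must be configured before rJava is loaded
library(stringr)      # string processing
library(igraph)       # graph construction and plotting
library(rJava)        # Java bridge required by openNLP
library(openNLP)      # sentence, word, POS and entity annotators
library(NLP)          # annotation infrastructure
library(dplyr)        # data frame operations
library(pluralize)    # singularize words
library(Matrix)
library(RColorBrewer) # color palettes
library(ggplot2)
library(tidytext)     # tidy text mining
library(widyr)        # pairwise correlation
library(igraphdata)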
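Because `rJava` depends on `JAVA_HOME`, a quick check before the `library()` calls can save some debugging. The snippet below is a minimal sketch using only base R; the path in the commented `Sys.setenv` line is a placeholder, not the path used in this project.

```r
# Check whether JAVA_HOME is already set in the current R session.
java_home <- Sys.getenv("JAVA_HOME")

if (!nzchar(java_home)) {
  # Placeholder path: replace it with the Java installation on your machine.
  # Sys.setenv(JAVA_HOME = "/path/to/your/java")
  message("JAVA_HOME is not set; rJava may fail to load.")
} else {
  message("JAVA_HOME is set to: ", java_home)
}
```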
2.2 Text Preprocessing
The second step is text preprocessing. As described at the start of Section 2, this step has two levels: primary and advanced. For the primary level, I applied all the operations listed there. For the advanced level, I mainly applied noun detection and named entity detection, because nouns help reveal the details of the story, and named entities such as persons help identify the relationships between characters.
2.2.1 Part-Of-Speech Detection
The code segment below shows how to extract all nouns using Part-Of-Speech tags.
Pre_Processing<-function(raw_text,annotator_list)
{
  #Strip line breaks and any leading or trailing quotation marks
  chapter_text<-raw_text %>% gsub("[\r\n]+","",.) %>% gsub('^"',"",.) %>% gsub('"$',"",.) %>% gsub('^\'',"",.) %>% gsub('\'$',"",.)
  chapter_text<-as.String(chapter_text) #Convert the raw text to an NLP String object
  text_annotations<-NLP::annotate(chapter_text,annotator_list) #Annotate the text
  text_doc<-AnnotatedPlainTextDocument(chapter_text,text_annotations) #Bundle the text and its annotations into one document
  person_entities<-as.character(entites(text_doc,kind="person")) #Extract the named entities (person)
  location_entities<-as.character(entites(text_doc,kind="location")) #Extract the named entities (location)
  organization_entities<-as.character(entites(text_doc,kind="organization")) #Extract the named entities (organization)
  date_entities<-as.character(entites(text_doc,kind="date")) #Extract the named entities (date)
  all_entities<-as.character(entites(text_doc)) #Extract all types of entity
  accum_entities<<-c(accum_entities,all_entities) #Append to the global accumulator (it must exist before the first call)
  all_nouns<-Find_Noun(text_doc,m_type=c("NN","NNP","NNS")) #Fetch the words tagged NN, NNP or NNS
  all_nouns<-Flat_Noun(all_nouns,all_entities) #Normalized noun list (distinguished from named entities)
  all_nouns<-singularize(all_nouns) #Convert plural nouns to their singular form
  all_nouns<-all_nouns[str_detect(all_nouns,"^[0-9a-zA-Z\\-]+$")] #Keep only words made of letters, digits and hyphens
  list(person=person_entities,location=location_entities,organization=organization_entities,date=date_entities,nouns=all_nouns)
}
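The `annotator_list` argument is not defined in the code above. One plausible way to build it with the `openNLP` annotators is sketched below. It assumes that `openNLP` and the English models package (`openNLPmodels.en`) are installed and that Java is configured; the helper name `build_annotator_list` and the variable `chapter_raw_text` are hypothetical.

```r
# Assumption: openNLP and the English models (openNLPmodels.en) are installed.
build_annotator_list <- function() {
  list(
    openNLP::Maxent_Sent_Token_Annotator(),   # sentence boundaries
    openNLP::Maxent_Word_Token_Annotator(),   # word tokens
    openNLP::Maxent_POS_Tag_Annotator(),      # Part-Of-Speech tags
    openNLP::Maxent_Entity_Annotator(kind = "person"),
    openNLP::Maxent_Entity_Annotator(kind = "location"),
    openNLP::Maxent_Entity_Annotator(kind = "organization"),
    openNLP::Maxent_Entity_Annotator(kind = "date")
  )
}

# Hypothetical usage (not run here because it needs Java and the models):
# accum_entities <- character(0)   # global accumulator used by Pre_Processing
# annotators <- build_annotator_list()
# result <- Pre_Processing(chapter_raw_text, annotators)
# head(result$nouns)
```

The sentence and word annotators come first because the POS and entity annotators rely on their output.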
2.2.2 Named Entity Detection
The code segment below illustrates the details of named entity extraction.
entites<-function(doc,kind)
{
  m_s<-doc$content #Get the content (text) of the document
  a<-annotation(doc) #Get the annotations of the document
  if(hasArg(kind)){ #Check whether a specific kind of entity was requested
    k<-sapply(a$features,`[[`,"kind") #kind specifies the type of each named entity
    m_s[a[k==kind]] #Return the entities of the requested kind
  }
  else{
    m_s[a[a$type=="entity"]] #If not specified, return all entities found
  }
}
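The filtering idea inside `entites` can be seen on a toy example that does not need Java. The list `features` below is a hand-written stand-in for the `features` of an annotation object, and `get_kind` is a hypothetical helper that returns `NA` when an annotation has no `kind` feature; it is a sketch of the idea, not the code used in the project.

```r
# Hand-written stand-in for the feature lists of four annotations.
features <- list(
  list(),                    # e.g. a sentence annotation: no features
  list(POS = "NNP"),         # a word annotation: only a POS tag
  list(kind = "person"),     # an entity annotation
  list(kind = "location")    # another entity annotation
)

# Return the kind of one annotation, or NA if it has none.
get_kind <- function(f) if (is.null(f$kind)) NA_character_ else f$kind

kinds <- vapply(features, get_kind, character(1))

# Positions of the annotations whose kind is "person".
person_idx <- which(!is.na(kinds) & kinds == "person")
print(person_idx)
```

Using `vapply` with an explicit `NA` keeps `kinds` a plain character vector even when some annotations carry no `kind` feature.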
2.2.3 Noun-Words Detection
This section illustrates how all ordinary nouns are extracted.
Find_Noun<-function(doc,m_type) #Get the words with the specified POS tags
{
  m_s<-doc$content #Get the content (text) of the document
  a<-annotation(doc) #Get the annotations; they include sentence and word types
  if(hasArg(m_type)) #Check whether specific POS tags were requested
  {
    k<-sapply(a$features,`[[`,"POS") #Fetch the POS tag of every annotation
    m_s[a[k %in% m_type & a$type=="word"]] #Keep only word annotations whose tag is in m_type
  }
  else{
    m_s[a[a$type=="word"]] #If no tags are specified, return all words
  }
}
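The core of `Find_Noun` is a membership test on POS tags. The toy example below shows that step in isolation; the tokens and their tags are assigned by hand for illustration and are not output from the annotators.

```r
# Hand-tagged toy sentence (tags follow the Penn Treebank convention).
tokens   <- c("Alice", "saw", "two", "rabbits", "in", "the", "garden")
pos_tags <- c("NNP",   "VBD", "CD",  "NNS",     "IN", "DT",  "NN")

# The same noun tags that Pre_Processing passes as m_type.
noun_tags <- c("NN", "NNP", "NNS")

# Keep only the tokens whose tag is one of the noun tags.
nouns <- tokens[pos_tags %in% noun_tags]
print(nouns)
```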
The Flat_Noun function distinguishes ordinary words from named entity words.
Flat_Noun<-function(noun_list,all_entities){ #noun_list contains all words to be processed; all_entities contains all detected entities
  flat_noun<-sapply(noun_list,function(x){
    tryCatch({ #Each word is used as a regular expression, so a word with invalid symbols (some remain after cleaning) raises an error
      k<-length(which(str_detect(all_entities,x))) #Count the named entities that contain this noun
      if(k==0) #k equal to 0 means that this word is an ordinary word
        x
    },error=function(m){
      #cat("There were something wrong",x)
      NULL #Skip words that are not valid regular expressions
    })
  })
  flat_noun<-unname(unlist(flat_noun)) #Unlist and return a plain vector of words
  flat_noun
}
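As an alternative to catching regular expression errors, the same check can be written with literal (fixed) matching in base R, so that words containing symbols such as `(` are compared as plain text. This is a sketch of a different technique, not the method used above; the function name `flat_noun_fixed` and the sample data are hypothetical.

```r
# Variant of the idea behind Flat_Noun using literal substring matching.
flat_noun_fixed <- function(noun_list, all_entities) {
  # TRUE for every noun that occurs inside at least one named entity.
  in_entity <- vapply(
    noun_list,
    function(x) any(grepl(x, all_entities, fixed = TRUE)),
    logical(1)
  )
  unname(noun_list[!in_entity])   # keep only the ordinary nouns
}

# Toy data: "c(" would be an invalid regular expression.
toy_nouns    <- c("garden", "Alice", "rabbit", "c(")
toy_entities <- c("Alice", "White Rabbit")

ordinary <- flat_noun_fixed(toy_nouns, toy_entities)
print(ordinary)
```

Note that the matching is case-sensitive, so "rabbit" is kept even though "White Rabbit" is an entity. Unlike the version above, words with invalid symbols are kept rather than dropped, which is why the later alphanumeric filter in `Pre_Processing` would still be needed.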