3. Correlation Graph Creation

Up to now, I have finished the preprocessing (initialization) stage. After this stage, I have a regular, tidy representation of the book rather than raw text. From here on, I use the novel Greenmantle by John Buchan [2] as the running example for creating a correlation graph.

3.1 Whole-Book Graph Creation

The code below shows how to create a correlation graph for a specified set of chapters. The function `create_graph_by_chapter` takes four parameters: `chapters`, the chapters to include (for example, the first 5 chapters); `m_book`, the book to use; `fre_num`, the minimum word frequency; and `cor`, the minimum correlation coefficient. In addition, to distinguish different types of words, I assign each word type its own color. For example, a vertex is yellow if the word is a person name, and orange if it is a location name, such as a country. Organizations, dates and ordinary words are purple, red and blue respectively.


create_graph_by_chapter<-function(chapters,m_book,fre_num,cor)
{
  m_chapters<-m_book %>% filter(chapter %in% chapters)
  word_list<-as.data.frame(m_chapters[,c("word","type")])
  name_list<-as.data.frame(word_list[!base::duplicated(word_list$word),c("word","type")])

  #Manually set the type for named entities that could not be detected automatically
  name_list[name_list$word %in% c("Blenkiron","Stumm","Hannay","Comelii","Enver"),"type"]<-"PER"
  name_list[name_list$word %in% c("Constantinople","Corneli","Turk"),"type"]<-"LOC"

  #Color palette for vertex color
  name_list$color<-ifelse(name_list$type=="PER","yellow",
                   ifelse(name_list$type=="LOC","orange",
                   ifelse(name_list$type=="ORG","purple",
                   ifelse(name_list$type=="DAT","red","blue"))))

  #Color palette for vertex label color (used in community visualization)
  name_list$edge_color<-ifelse(name_list$type=="PER","blue",
                        ifelse(name_list$type=="LOC","orange",
                        ifelse(name_list$type=="ORG","purple",
                        ifelse(name_list$type=="DAT","red","black"))))

  name_list$ver_size<-ifelse(name_list$type=="PER",9,ifelse(name_list$type=="LOC",8,6))
  rownames(name_list)<-name_list$word

  #Keep only words with the specified minimum frequency and correlation
  word_cor<-m_chapters %>% group_by(word) %>% filter(n()>=fre_num) %>%
    pairwise_cor(item=word,feature=chapter) %>%
    filter(!is.na(correlation),correlation>=cor,correlation<1)

  correlation_graph<-graph_from_data_frame(word_cor,directed=F)
  V(correlation_graph)$col<-name_list[V(correlation_graph)$name,"color"]
  V(correlation_graph)$edge_col<-name_list[V(correlation_graph)$name,"edge_color"]
  V(correlation_graph)$ver_size<-name_list[V(correlation_graph)$name,"ver_size"]
  correlation_graph
}
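As a usage sketch, a graph for the first five chapters could be created as follows. This assumes the `book` tibble assembled at the end of this section is available, and the frequency and correlation thresholds shown here are illustrative values, not fixed choices:

```r
#Usage sketch (illustrative thresholds): keep words appearing at least
#5 times and word pairs with a correlation of at least 0.5
first_five_graph<-create_graph_by_chapter(1:5,book,5,0.5)
```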

3.2 Word Distribution

One remaining problem is how to decide the optimal word frequency threshold and correlation coefficient. To answer this question, I plotted the word distribution to help decide which parameter values to adopt.

show_words_distribution<-function(m_chapters)
{
  word_dis<-m_chapters %>% group_by(word) %>% summarise(count=n())
  ggplot(data=word_dis)+geom_density(aes(x=count))+xlim(0,50) #Density of per-word counts, truncated at 50
}
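For instance, the distribution for the whole book can be inspected like this (a sketch assuming the `book` tibble assembled at the end of this section):

```r
#Usage sketch: plot the word-count density for all chapters
show_words_distribution(book)
```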

3.3 Graph Simplification

On the other hand, I found that the created correlation graph was directed and contained many multiple edges between pairs of nodes. This can distort the analysis of the graph structure: too many edges produce a dense graph that is visually hard to read. I therefore converted the graph to an undirected one and ignored the edge weights, treating two nodes as related as long as a single edge exists between them. This yields a clearer, sparser graph that is easier to observe and analyse.

In addition, I also removed small disconnected components with fewer than four vertices, since such components carry little meaning.

simplify_graph<-function(g)
{
  simplified_g<-simplify(g,remove.multiple = T,remove.loops = T) #Remove self-loops and multiple edges between two nodes
  m<-components(simplified_g)$membership
  m_tb<-table(m) #Generate the membership table
  m_tb<-m_tb[m_tb>=4] #Drop disconnected components with fewer than four vertices
  member_num<-as.integer(names(m_tb))
  main_components<-induced_subgraph(simplified_g,m %in% member_num)
  main_components
}
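The behaviour of `simplify_graph` can be checked on a toy graph. The sketch below uses made-up data rather than the book: a component of four vertices with a duplicate edge and a self-loop, plus a two-vertex component that should be dropped. It assumes igraph is loaded and `simplify_graph` is defined as above:

```r
library(igraph)

#Toy graph (assumed data, not from the book): A-B appears twice,
#C-C is a self-loop, and X-Y forms a separate 2-vertex component
g<-make_graph(c("A","B","A","B","B","C","C","D","D","A","C","C","X","Y"),
              directed=FALSE)
main<-simplify_graph(g)
#The duplicate A-B edge and the C-C loop are removed, and the X-Y
#component (fewer than four vertices) is dropped, leaving A, B, C, D
```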

3.4 Graph Generation for the Whole Book

The code segment below shows an example of correlation graph creation. The book consists of 21 chapters; I first used all the chapters to build the book tibble, and then created a correlation graph from the whole book.

book_titles<-'Greenmantle'	#Set the book title
chapters<-1:21
chapter_stem<-'Greenmantle'	#File name stem for each chapter
ext<-'.txt'
folder<-'./Fiction/'	#Directory containing the chapter files

word_ann<-Maxent_Word_Token_Annotator() #initialize the word tokenization annotator
sent_ann<-Maxent_Sent_Token_Annotator() #initialize the sentence tokenization annotator
pos_ann<-Maxent_POS_Tag_Annotator() #initialize the POS tags annotator

person_ann<-Maxent_Entity_Annotator(kind="person")	#initialize the named entity annotator for each type
location_ann<-Maxent_Entity_Annotator(kind="location")
organization_ann<-Maxent_Entity_Annotator(kind="organization")
date_ann<-Maxent_Entity_Annotator(kind = "date")
annotator_list<-list(sent_ann,word_ann,pos_ann,person_ann,location_ann,organization_ann,date_ann)
#Combine the annotators into a list. The pipeline order is: sentence tokenization -> word tokenization -> POS tagging -> named entity extraction

book<-tibble()	#Initialize the book
accum_entities<-vector(mode = "character")
for (m in chapters)
{
  chapter<-paste0(folder,chapter_stem,m,ext)
  raw_text<-readChar(chapter, file.info(chapter)$size)
  words_list<-Pre_Processing(raw_text,annotator_list)

  person_words<-words_list$person	#Get all found named entities
  location_words<-words_list$location
  organization_words<-words_list$organization
  date_words<-words_list$date
  all_nouns<-words_list$nouns	#Get all ordinary noun words

  #Create a dataframe for each entity type;
  #a named entity is only put into its dataframe if at least one was found
  person_df<-data.frame()
  location_df<-data.frame()
  organization_df<-data.frame()
  date_df<-data.frame()

  if(length(person_words)>0){
    person_df<-data.frame(title=book_titles,chapter=m,word=person_words,type="PER")
  }

  if(length(location_words)>0){
    location_df<-data.frame(title=book_titles,chapter=m,word=location_words,type="LOC")
  }

  if(length(organization_words)>0){
    organization_df<-data.frame(title=book_titles,chapter=m,word=organization_words,type="ORG")
  }

  if(length(date_words)>0){
    date_df<-data.frame(title=book_titles,chapter=m,word=date_words,type="DAT")
  }

  nouns_df<-data.frame(title=book_titles,chapter=m,word=all_nouns,type="Other")
  final_df<-rbind(person_df,location_df,organization_df,date_df,nouns_df)
  final_df<-as_tibble(final_df)
  final_df<-final_df %>% filter(!word %in% stop_words$word)  #Remove the stop words
  book<-rbind(book,final_df)  #Combine chapter by chapter
}
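With the book tibble assembled, the whole-book graph can then be built, simplified and plotted. The following is a usage sketch; the frequency threshold of 5 and correlation of 0.4 are illustrative assumptions, chosen after inspecting the word distribution, not fixed values:

```r
#Usage sketch: build, simplify and plot the whole-book correlation graph
#(fre_num = 5 and cor = 0.4 are illustrative assumptions)
whole_graph<-create_graph_by_chapter(1:21,book,5,0.4)
whole_graph<-simplify_graph(whole_graph)
plot(whole_graph,
     vertex.color=V(whole_graph)$col,
     vertex.size=V(whole_graph)$ver_size,
     vertex.label.cex=0.7)
```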