Web-Mining Project - I: Graph Analysis of a Story Book (Part 3)
by Yifei Zhou
A view of the tidy form of the book
The table below shows the tidy version of the book. Compared with the original raw text, it is more structured and organized. The book is represented as a data frame with four columns: the title of the book, the chapter number, the word, and the word type (a named-entity tag or an ordinary noun).
head(book,5)
## # A tibble: 5 x 4
## title chapter word type
## <fct> <int> <fct> <fct>
## 1 Greenmantle 1 Sandy PER
## 2 Greenmantle 1 Richard Hannay PER
## 3 Greenmantle 1 Walter PER
## 4 Greenmantle 1 Walter PER
## 5 Greenmantle 1 Walter PER
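# Count how often each word occurs in the whole book
word_freq<-book %>% group_by(word) %>% summarise(total=n())
# Density of the per-word totals, restricted to frequencies between 0 and 30
ggplot(data=word_freq)+
geom_density(aes(x=total))+
xlim(0,30)+theme_classic()+
labs(title = "The word frequency distribution in the whole book")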
book_summary<-book %>%
filter(type %in% c('PER','LOC','ORG','DAT')) %>%
group_by(chapter,type) %>%
summarise(total=n())
ggplot(data=book_summary)+
geom_col(aes(x=chapter,y=total,fill=type))+
labs(title="The Named Entity Distribution in the whole book",x="Chapter")+
geom_vline(xintercept = 7,linetype="dotted")+
geom_vline(xintercept = 14,linetype="dotted")+
theme_classic()
The first graph above shows the word frequency distribution in the whole book. Most words are concentrated around a frequency of 1; such rare words are of little use for this analysis. As the frequency increases, the word density declines sharply.
Besides, I also plotted the named entity distribution in the whole book, shown in the bar graph above. I plotted it because it helps me see whether a given chapter mainly talks about persons, locations or dates.
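The counting behind both plots can be tried on a small example. The following is only a minimal sketch on made-up data: `toy_book` is a hypothetical stand-in for the real `book` table, and the `NOUN` label for ordinary nouns is an assumption, since the actual label is not shown in the table above.

```r
library(dplyr)

# Hypothetical toy data with the same columns as `book` (not the real text)
toy_book <- data.frame(
  chapter = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L),
  word    = c("Sandy", "Walter", "Walter", "Walter", "train",
              "Sandy", "river", "Stumm"),
  type    = c("PER", "PER", "PER", "PER", "NOUN", "PER", "NOUN", "PER"),
  stringsAsFactors = FALSE
)

# One row per word with its total count (same idea as group_by + summarise)
word_freq <- toy_book %>% count(word, name = "total")

# Share of distinct words that occur exactly once
share_once <- mean(word_freq$total == 1)

# Named entities per chapter and type, as used for the bar graph
entity_summary <- toy_book %>%
  filter(type %in% c("PER", "LOC", "ORG", "DAT")) %>%
  count(chapter, type, name = "total")

print(word_freq)
print(share_once)
print(entity_summary)
```

In this toy table three of the five distinct words occur only once, which mirrors the peak near 1 in the density plot.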
set.seed(234)
full_graph<-create_graph_by_chapter(c(1:21),book,25,0.49)
simplified_full_graph<-simplify_graph(full_graph)
lyk=layout_with_fr(simplified_full_graph)
par(mar=c(0.1,0.1,0.8,0))
plot(simplified_full_graph,layout=lyk,
vertex.size=V(simplified_full_graph)$ver_size,
vertex.color=V(simplified_full_graph)$col,
vertex.label.dist=1,vertex.label.color="black",vertex.label.cex=0.85,
main="The Relationship Graph of Full Book (Greenmantle)",
vertex.frame.color="grey",sub="Figure3.1.1")
As shown in the figure above, different kinds of entities are drawn in different colors. The yellow nodes highlight the core characters of the book, for example Walter, Sandy, Peter, Stumm, Hussin and Blenkiron. Besides, the orange nodes such as Europe and Constantinople give me a hint of places and locations.
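To illustrate how node colors and sizes like these can be attached to a graph, here is a minimal sketch with igraph. It is not the code behind the figure: the nodes, edges and frequencies are made up, the size scaling rule is an assumption, and only the attribute names `col` and `ver_size` and the yellow/orange color choice are taken from the plot call and the description above.

```r
library(igraph)

# Hypothetical nodes with entity type and word frequency (toy values)
nodes <- data.frame(
  name = c("Sandy", "Walter", "Europe", "Constantinople"),
  type = c("PER", "PER", "LOC", "LOC"),
  freq = c(40, 60, 30, 35),
  stringsAsFactors = FALSE
)
# Hypothetical relationships between the nodes
edges <- data.frame(
  from = c("Sandy", "Walter", "Sandy"),
  to   = c("Walter", "Europe", "Constantinople"),
  stringsAsFactors = FALSE
)

g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)

# Color lookup: yellow for persons, orange for locations
type_colours <- c(PER = "yellow", LOC = "orange")
V(g)$col <- unname(type_colours[as.character(V(g)$type)])

# Assumed scaling rule: more frequent words get larger nodes
V(g)$ver_size <- 5 + 10 * V(g)$freq / max(V(g)$freq)

plot(g, vertex.color = V(g)$col, vertex.size = V(g)$ver_size,
     vertex.label.color = "black", vertex.frame.color = "grey")
```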
In this case, I set the minimum word frequency to 25 and the correlation coefficient threshold to 0.49. I chose these values because too high a coefficient produces unconnected components, or very small components made up of only two or three nodes. By contrast, a very low coefficient produces a dense graph, because a lower threshold lets too many edges through. The minimum frequency, in turn, decides how many nodes are shown in the graph.
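The trade-off between the two parameters can be checked on a small example. The sketch below is not `create_graph_by_chapter` itself, whose body is not shown in this part. It is a hypothetical stand-in that assumes the correlation is computed on the presence or absence of words across text sections; `presence` and `build_word_graph` are illustrative names, and the data are made up.

```r
library(igraph)

# Hypothetical presence/absence matrix: rows = text sections, columns = words
# (1 = the word occurs in that section). Toy data only.
presence <- cbind(
  Sandy  = c(1, 1, 0, 0, 1, 0),
  Walter = c(1, 1, 0, 0, 1, 1),
  Stumm  = c(0, 0, 1, 1, 0, 1),
  Peter  = c(0, 1, 1, 1, 0, 0),
  Europe = c(1, 0, 1, 0, 1, 0)
)

# Keep words seen in at least `min_freq` sections, then connect word pairs
# whose correlation is above `min_cor`.
build_word_graph <- function(presence, min_freq, min_cor) {
  keep <- colSums(presence) >= min_freq
  cors <- cor(presence[, keep, drop = FALSE])
  adj <- (cors > min_cor) * 1   # 1 = edge, 0 = no edge
  diag(adj) <- 0                # no self-loops
  graph_from_adjacency_matrix(adj, mode = "undirected")
}

# Compare a low, a medium and a high correlation threshold
for (threshold in c(0.2, 0.6, 0.9)) {
  g <- build_word_graph(presence, min_freq = 2, min_cor = threshold)
  cat("threshold", threshold,
      ": edges =", ecount(g),
      ", components =", components(g)$no, "\n")
}
```

On this toy matrix the number of edges falls from 3 to 1 to 0 as the threshold rises, while the number of components grows from 2 to 4 to 5. This is the same fragmentation effect described above, and raising `min_freq` would remove nodes instead of edges.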
References:
- R and OpenNLP for Natural Language Processing NLP - Part 1 [https://www.youtube.com/watch?v=RggCAXBe6BA&t=322s]
- R and OpenNLP for Natural Language Processing NLP - Part 2 [https://www.youtube.com/watch?v=0lpQludiI-0&t=394s]
- Greenmantle Book, John Buchan [https://www.gutenberg.org/ebooks/559]