date: 2024-10-27
title: BDA-Lab-2
status: DONE
author:
- AllenYGY
tags:
- Lab
- Report
- PageRank
publish: True
BDA-Lab-2
Goal: Automatically construct a social network of main characters from the book Harry Potter and the Sorcerer's Stone.
This project seeks to build an automated process for extracting and visualizing the relationships between characters from a novel. By identifying these relationships, I aim to construct a knowledge graph that can reveal social dynamics within the story and highlight the influence of key characters. Additionally, by applying association mining and ranking algorithms, I can identify central figures and their interactions, uncovering character importance based on the narrative context.
Character networks are a powerful tool for understanding story structure and character interactions within a narrative. By constructing a network from a text, I can better understand the influence and relationships of main characters, as well as visualize social dynamics within a story. In this project, I use machine learning and natural language processing (NLP) methods to achieve this automatically, focusing on entity extraction, relationship detection, and visualization.
In the previous lab, I used a pretrained BERT model to generate a summary of Harry Potter and the Sorcerer’s Stone, focusing on condensing the storyline to capture essential events and character dynamics. However, I found that understanding relationships within the text requires more than a summary; the text is rich in complex and often implicit relationships between characters. Directly identifying these relationships through rule-based methods or simple keyword matching is challenging due to the intricate, ambiguous, and context-dependent language used in the narrative.
To address these complexities, I chose to use the BERT model again in this lab, this time to identify character relationships. BERT’s ability to grasp nuanced meanings allows me to extract a clearer representation of character interactions. Pretrained language models offer deep contextual understanding, which sidesteps some of the limitations of traditional NLP techniques.
The aim of this lab is to move beyond summarization and focus on identifying relationships between characters and building a structured knowledge graph. This graph provides a visual and analytical tool for representing the social network within the story, making it easier to analyze the relational patterns and central characters of Harry Potter and the Sorcerer’s Stone.
The social network of characters is represented in two main components:
My approach is organized into three main steps: Entity Extraction, Relationship Extraction, and Knowledge Graph Construction.
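from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def extract_entities(text):
    """Return the PERSON entities found in text, in order of appearance."""
    # NLTK's tokenizer, tagger, and NE chunker data must be downloaded beforehand.
    words = word_tokenize(text)
    pos_tags = pos_tag(words)
    # binary=False keeps typed labels (PERSON, GPE, ...) instead of a generic NE tag
    chunked = ne_chunk(pos_tags, binary=False)
    entities = []
    for subtree in chunked:
        if isinstance(subtree, Tree) and subtree.label() == "PERSON":
            # join multi-token names such as "Harry Potter"
            entity = " ".join(leaf[0] for leaf in subtree.leaves())
            entities.append(entity)
    return entities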
I use two approaches to extract relationships between entities: spaCy dependency parsing and a pretrained BERT model.
By parsing sentences with spaCy, I identify syntactic dependencies (e.g., subject-verb-object structures) to detect relationships between entities. This makes it possible to infer direct relationships from grammatical structure.
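from nltk import pos_tag, word_tokenize

def get_entity_relationships(text):
    """Link consecutive PERSON mentions by the first verb found between them."""
    # Note: this version relies on NLTK part-of-speech tags, not a dependency parse.
    words = word_tokenize(text)
    pos_tags = pos_tag(words)
    entities = extract_entities(text)
    relationships = []
    search_from = 0  # scan forward so repeated names map to successive mentions
    for i in range(len(entities) - 1):
        entity1, entity2 = entities[i], entities[i + 1]
        try:
            start = words.index(entity1.split()[0], search_from) + len(entity1.split())
            end = words.index(entity2.split()[0], start)
        except ValueError:
            continue  # mention could not be located; skip this pair
        search_from = start
        # the first verb (tag starting with "VB") between the mentions is the relation
        relation = next((w for w, tag in pos_tags[start:end] if tag.startswith("VB")), None)
        if relation:
            relationships.append((entity1, relation, entity2))
    # remove duplicates while keeping the order of first appearance
    return list(dict.fromkeys(relationships))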
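The function above relies on NLTK part-of-speech tags. As a complement, here is a minimal sketch of what a dependency-based subject-verb-object extraction could look like. The names `Tok`, `svo_triples`, and `parse_with_spacy` are hypothetical helpers introduced for illustration, and the model name `en_core_web_sm` is an assumption; the report does not state which spaCy model was used. The core logic works on plain records so it can be tried without spaCy installed.

```python
from typing import List, NamedTuple, Tuple


class Tok(NamedTuple):
    """Minimal view of a parsed token: text, dependency label, index of its head."""
    text: str
    dep: str
    head: int


def svo_triples(tokens: List[Tok]) -> List[Tuple[str, str, str]]:
    """Pair each verb's nominal subject (nsubj) with its direct object (dobj)."""
    subjects, objects = {}, {}
    for tok in tokens:
        if tok.dep == "nsubj":
            subjects.setdefault(tok.head, []).append(tok.text)
        elif tok.dep == "dobj":
            objects.setdefault(tok.head, []).append(tok.text)
    triples = []
    for verb_idx, subs in subjects.items():
        for subj in subs:
            for obj in objects.get(verb_idx, []):
                triples.append((subj, tokens[verb_idx].text, obj))
    return triples


def parse_with_spacy(text: str, model: str = "en_core_web_sm") -> List[Tok]:
    """Convert a spaCy parse into Tok records (needs spaCy and the model installed)."""
    import spacy  # imported lazily so svo_triples stays usable without spaCy

    nlp = spacy.load(model)
    return [Tok(t.text, t.dep_, t.head.i) for t in nlp(text)]


# Hand-built parse of "Harry befriended Ron", so the example runs without spaCy.
demo = [Tok("Harry", "nsubj", 1), Tok("befriended", "ROOT", 1), Tok("Ron", "dobj", 1)]
print(svo_triples(demo))  # [('Harry', 'befriended', 'Ron')]
```

With spaCy available, `svo_triples(parse_with_spacy(sentence))` would apply the same pairing to a real parse. Triples whose subject or object is not a character name would still need to be filtered against the extracted entities.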
Using a BERT-based NER model helps capture nuanced relationships that may not be directly linked through syntax alone, allowing for richer contextual understanding of connections between characters.
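The report does not say which BERT checkpoint was used or how entity output was turned into relationships, so the following is only a sketch under stated assumptions. It assumes the Hugging Face `transformers` token-classification pipeline with the publicly available `dslim/bert-base-NER` checkpoint, and it derives untyped co-occurrence pairs from the people mentioned in one sentence. `person_pairs` and `bert_person_pairs` are hypothetical helper names.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def person_pairs(entities: Iterable[Dict]) -> List[Tuple[str, str]]:
    """Return unordered pairs of distinct PER entities found in one sentence."""
    names = sorted({e["word"] for e in entities if e.get("entity_group") == "PER"})
    return list(combinations(names, 2))


def bert_person_pairs(sentence: str, model: str = "dslim/bert-base-NER"):
    """Run a BERT NER pipeline on one sentence (downloads the model on first use)."""
    from transformers import pipeline  # lazy import: heavy optional dependency

    # Assumption: in real use the pipeline would be built once and reused.
    ner = pipeline("ner", model=model, aggregation_strategy="simple")
    return person_pairs(ner(sentence))


# Hand-written records in the shape the aggregated pipeline returns.
sample = [
    {"entity_group": "PER", "word": "Harry", "score": 0.99},
    {"entity_group": "PER", "word": "Hagrid", "score": 0.98},
    {"entity_group": "LOC", "word": "London", "score": 0.97},
]
print(person_pairs(sample))  # [('Hagrid', 'Harry')]
```

Co-occurrence only says that two characters appear together; assigning a relationship type would need an additional step, such as the verb between the mentions or a classifier trained for relation extraction.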
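from py2neo import Node, Relationship

def create_relationships(df, graph):
    """Write one Character node per name and one relationship per DataFrame row."""
    for _, row in df.iterrows():
        character_a = Node("Character", label="Character", text=row["Character"])
        character_b = Node("Character", label="Character", text=row["Target Character"])
        # merge on the "text" property so each character is stored only once
        graph.merge(character_a, "Character", "text")
        graph.merge(character_b, "Character", "text")
        relationship = Relationship(character_a, row["Relationship Type"], character_b, notes=row["Notes"])
        # create() adds a new relationship on every call, so rerunning duplicates edges
        graph.create(relationship)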
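`create_relationships` reads four columns from each row: `Character`, `Target Character`, `Relationship Type`, and `Notes`. How the extracted triples were converted into that table is not shown, so the sketch below is one possible bridge. The helpers `to_relationship_type` and `triples_to_rows` are hypothetical, and the upper-case, underscore-separated naming follows the usual Neo4j convention for relationship types rather than anything stated in this report.

```python
import re
from typing import Dict, Iterable, List, Tuple


def to_relationship_type(verb: str) -> str:
    """Normalize a verb phrase into an upper-case, underscore-separated type name."""
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", verb.strip()).strip("_")
    return cleaned.upper() or "RELATED_TO"  # fallback for empty input


def triples_to_rows(triples: Iterable[Tuple[str, str, str]]) -> List[Dict[str, str]]:
    """Map (source, verb, target) triples onto the columns the loader expects."""
    rows = []
    for source, verb, target in triples:
        rows.append({
            "Character": source,
            "Target Character": target,
            "Relationship Type": to_relationship_type(verb),
            "Notes": f"extracted from verb '{verb}'",
        })
    return rows


rows = triples_to_rows([("Harry", "looked at", "Ron")])
print(rows[0]["Relationship Type"])  # LOOKED_AT
```

The resulting list of dictionaries can be passed to `pandas.DataFrame(rows)` to obtain the `df` argument used above.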
With the Neo4j database, I visualize and explore the relationships in the character network. This allows us to:
Objective: Identify and rank the main characters based on their significance and centrality in the character network.
PageRank is a powerful tool that I use in this context to rank the characters according to their significance in the story. Here’s a breakdown of how PageRank works and how it helps us analyze the network of characters:
Using PageRank to Find Main Characters:
Analysis of Results:
By leveraging PageRank, I achieve a deeper, data-driven understanding of the story’s social network, which complements traditional analysis methods and highlights characters based on their relational significance within the entire network structure. This method is particularly valuable for complex narratives where influence and centrality may not be immediately obvious.
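To make the ranking step concrete, here is a minimal power-iteration sketch of PageRank over a directed edge list. It is not the implementation used for the results in this report; the damping factor of 0.85 is the commonly used default, and the toy edges are made up for illustration rather than extracted from the book.

```python
from typing import Dict, Iterable, Tuple


def pagerank(edges: Iterable[Tuple[str, str]], damping: float = 0.85,
             tol: float = 1e-10, max_iter: int = 200) -> Dict[str, float]:
    """Compute PageRank scores for a directed graph given as (source, target) pairs."""
    edges = list(edges)
    nodes = sorted({name for edge in edges for name in edge})
    if not nodes:
        return {}
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start from a uniform distribution
    for _ in range(max_iter):
        # rank held by nodes without outgoing edges is spread evenly over all nodes
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:  # stop once the scores have stabilized
            break
    return rank


# Toy edges for illustration only: several characters point at Harry.
toy_edges = [("Ron", "Harry"), ("Hermione", "Harry"), ("Hagrid", "Harry"),
             ("Draco", "Harry"), ("Harry", "Ron")]
ranks = pagerank(toy_edges)
print(max(ranks, key=ranks.get))  # Harry
```

A character scores highly when many others link to it, and a link from a high-ranking character counts for more than one from a peripheral character, which is why Ron ends up second in this toy graph despite having a single incoming edge.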
Entity and Relationship Extraction:
Comparison of Methods:
Graph Construction:
In this work, I explored three primary approaches for extracting and analyzing character relationships:
Direct Text Extraction:
Dependency Parsing and Language Models:
Graph-Based Analysis:
The dataset consists of the full text of Harry Potter and the Sorcerer's Stone. Key considerations in preparing the data included:
In this lab, I explored an automated approach to constructing a social network of characters from Harry Potter and the Sorcerer's Stone by combining advanced entity and relationship extraction techniques with graph-based analysis.
This process demonstrated the value of using pretrained language models, such as BERT, to capture nuanced relationships within text that traditional rule-based methods may overlook. While spaCy’s dependency parsing provided an efficient initial method for extracting explicit syntactic relationships, the addition of BERT allowed me to recognize more complex associations, enriching the extracted data.
Neo4j served as an ideal platform for storing, visualizing, and analyzing this character network. By constructing a knowledge graph, I could not only visualize the relationships but also apply graph algorithms like PageRank to identify the most influential characters. This quantitative insight complemented the qualitative extraction, revealing central characters who play key roles in the social structure of the story.
Insight: Pretrained models capture deeper relationships, though at a computational cost. This approach highlights the trade-off between computational efficiency and the quality of extracted relationships, making it suitable for stories with complex character dynamics.
Limitations: While effective, this approach has limitations in handling ambiguous relationships that may require additional context or external knowledge not present in the text.
Future Directions: Future work could explore fine-tuning the BERT model for literary text, which may improve extraction accuracy for context-specific relationships. Expanding this approach to other literary texts or genres could also validate its versatility and refine the methods further.
This lab reinforced the power of combining natural language processing and graph analysis for uncovering insights within narrative structures. Such an approach has promising applications not only in literary analysis but also in areas like social network analysis, sentiment analysis, and content recommendation systems.