---
date: 2024-11-27
title: BDA-Lab-3
status: DONE
author:
  - AllenYGY
tags:
  - Lab
  - Report
  - CTR
publish: True
---

# BDA-Lab-3
The goal of this project is to predict the commercial popularity of characters and plotlines from Harry Potter and the Sorcerer's Stone, with a specific focus on Click-Through Rate (CTR) estimation. By analyzing the relationships between key characters and key events in the plot, we aim to predict which characters and story arcs will resonate most with audiences in a commercial context. This prediction will help tailor marketing and content promotion strategies, particularly for digital media, by identifying elements that are more likely to engage users.
The project will involve constructing a social network of characters, detecting important relationships and interactions, and applying predictive models to rank these elements according to their potential commercial appeal. Ultimately, the goal is to estimate CTR by identifying the most engaging characters and events that are likely to attract attention and drive user interaction.
In media commercialization, especially in industries like books, movies, and television, understanding which characters and plotlines resonate with audiences is crucial for content optimization and targeted marketing. By predicting which aspects of a story will garner the most attention, this project aims to maximize CTR, which directly correlates with audience engagement and, ultimately, the commercial success of content.
This work builds on the foundations laid in Lab-1 and Lab-2:
In Lab-1, I focused on summarizing the content of The Da Vinci Code, using methods like word frequency analysis, sentiment analysis, and content summarization. These analyses helped extract insights into the book's structure and its thematic elements.
In Lab-2, the focus shifted to summarizing Harry Potter and the Sorcerer's Stone, where I condensed the story to highlight the most crucial events and character relationships. This provided the foundation for identifying key interactions that might predict CTR in the commercial context.
In Lab-3, we take these earlier steps further by analyzing character relationships and predicting the potential popularity of characters and plot events, ultimately customizing the CTR prediction model.
In Lab-1, I aimed to summarize the content of The Da Vinci Code from three perspectives:

- Word frequency analysis
- Sentiment analysis
- Content summarization
These methods provided a deeper understanding of the text’s structure and emotional tone, which contributed to the character relationship extraction and popularity prediction in Lab-3. The sentiment analysis, in particular, offered insight into which parts of the story might elicit emotional responses from audiences—a key factor in predicting CTR.
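To make the sentiment step concrete, here is a minimal sketch of chapter-level sentiment scoring using NLTK's VADER analyzer. The chapter texts are placeholders, and VADER itself is an assumption; Lab-1 may have used a different sentiment tool.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

# Placeholder chapter texts; in Lab-1 these would be the actual book chapters.
chapters = {
    "ch1": "Langdon stared at the body in horror...",
    "ch2": "Sophie smiled as the final code fell into place...",
}

for name, text in chapters.items():
    # 'compound' ranges from -1 (most negative) to +1 (most positive).
    score = sia.polarity_scores(text)["compound"]
    print(f"{name}: compound sentiment {score:+.2f}")
```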
The book was divided into several parts, and a pretrained model summarized each part; the part-level summaries were then summarized a second time to produce the final condensed version. A sketch of this two-pass procedure is given below.
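Here is a sketch of that two-pass procedure using the Hugging Face summarization pipeline; the checkpoint is an assumption, standing in for whichever pretrained model Lab-1 actually used.

```python
from transformers import pipeline

# Assumed checkpoint; swap in the summarizer actually used in Lab-1.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def two_pass_summary(parts, max_length=130, min_length=30):
    # Pass 1: summarize each part of the book independently.
    part_summaries = [
        summarizer(p, max_length=max_length, min_length=min_length)[0]["summary_text"]
        for p in parts
    ]
    # Pass 2: summarize the concatenated part-level summaries.
    merged = " ".join(part_summaries)
    return summarizer(merged, max_length=max_length, min_length=min_length)[0]["summary_text"]
```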
Pre-trained model summary of The Da Vinci Code:
The Da Vinci Code, written by Dan Brown, follows Harvard symbologist Robert Langdon as he becomes entangled in a mysterious murder at the Louvre Museum. Alongside cryptologist Sophie Neveu, Langdon unravels a complex web of codes, symbols, and historical secrets tied to a hidden truth about the Holy Grail and the controversial legacy of Mary Magdalene. The story presents a fictional European secret society, the Priory of Sion, supposedly founded in 1099 and involving famous figures such as Leonardo da Vinci. Another Catholic organization, Opus Dei, emerges in the narrative as a shadowy force, sparking further intrigue.
The plot unfolds through a series of cryptic messages and objects, including a keystone believed to lead to the Grail’s hidden location. Langdon and Sophie encounter numerous obstacles, including Silas, a devout albino monk, and Leigh Teabing, a Grail historian with his own hidden motives. As they decode symbols like the pentacle and the number PHI, they chase leads through significant historical sites like Westminster Abbey and Rosslyn Chapel. The story weaves themes of faith, power, and historical mystery, ultimately suggesting a hidden lineage of Jesus and Mary Magdalene that could challenge traditional Christian beliefs.
Lab-2 focused on summarizing Harry Potter and the Sorcerer's Stone, identifying key events, characters, and their interactions. The summarization process aimed to highlight important narrative elements and character dynamics. The goal was to condense the book while retaining the essence of the storyline.
In Lab-2, I used BERT, a pretrained language model, to generate summaries and understand the structure of the narrative. However, while the summarization helped identify major events, it didn’t provide in-depth analysis of character relationships. Recognizing the need to go beyond mere summarization, I turned to more specialized methods in Lab-3 to focus specifically on predicting popularity based on character interactions and plot development.
Lab-3 builds on the work done in Lab-1 and Lab-2 by using the following approaches:
- **Character Relationship Extraction**: mapping interactions between characters throughout the book into a social network.
- **Popularity Prediction**: applying machine learning models to estimate which characters and plot events are likely to attract the most attention.
- **CTR Optimization**: ranking characters and events by their predicted Click-Through Rate to guide content promotion.
- **Visualization**: plotting the character network and the predicted CTR scores.
Building on Lab-1 and Lab-2, the following preprocessing steps were performed, reusing utilities from the earlier labs (such as the `map` function from Lab-1) to prepare the text for easier analysis. The relationships between characters were then mapped based on their interactions throughout the book. Using a graph-based representation, I built a social network where:

- nodes represent characters, and
- edges represent interactions between two characters.
This network provides a visualization of the social dynamics within the book, helping identify the most central characters and their roles in the plot.
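As an illustration, here is a minimal sketch of how such a network can be assembled with `networkx`; the alias map and interaction counts below are hypothetical placeholders for the actual extraction output.

```python
import networkx as nx

# Hypothetical alias map and pairwise interaction counts (illustrative only).
alias_map = {"Harry": "Harry Potter", "Ron": "Ron Weasley", "Hermione": "Hermione Granger"}
interactions = [
    ("Harry", "Ron", 42),
    ("Harry", "Hermione", 35),
    ("Ron", "Hermione", 28),
    ("Harry", "Draco Malfoy", 9),
]

G = nx.Graph()
for a, b, count in interactions:
    # Normalize name variants before adding the weighted edge.
    a, b = alias_map.get(a, a), alias_map.get(b, b)
    G.add_edge(a, b, weight=count)

# Degree centrality highlights the most connected characters.
for name, score in sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```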
Based on the results from Lab-1 and Lab-2, I identified the following features to predict CTR:

- Character mention frequency (from the word frequency analysis in Lab-1)
- Sentiment scores associated with each character and event (from the sentiment analysis in Lab-1)
- Network centrality of each character in the social graph built above
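As a sketch, these features can be collected into a per-character table; the column names and values below are hypothetical placeholders for the actual Lab-1/Lab-2 outputs.

```python
import pandas as pd

# Hypothetical per-character feature table (illustrative values only).
features = pd.DataFrame({
    "character": ["Harry Potter", "Hermione Granger", "Ron Weasley"],
    "mention_count": [1213, 485, 410],        # word-frequency analysis (Lab-1)
    "avg_sentiment": [0.21, 0.34, 0.18],      # sentiment analysis (Lab-1)
    "degree_centrality": [0.92, 0.67, 0.63],  # social network above
})
print(features)
```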
## Experimenting with Different Models for CTR Prediction
As part of this project, I explored various machine learning models for predicting the Click-Through Rate (CTR), each offering distinct strengths. The models tested include:

- Random Forest
- XGBoost
- Gradient Boosting
Each of these models has been trained and evaluated using the same dataset to compare their performance. Below is an overview of the steps taken, the rationale behind choosing each model, and the results obtained.
Random Forest is an ensemble learning method that aggregates predictions from multiple decision trees to improve accuracy and reduce overfitting. It is particularly effective when dealing with large datasets and high-dimensional feature spaces, as it helps to capture complex relationships between the features and the target variable.
XGBoost is a gradient boosting algorithm known for its efficiency and performance in structured/tabular data. It builds an ensemble of trees sequentially, where each new tree attempts to correct the errors made by the previous ones. XGBoost is widely used for tasks where predictive performance is critical.
Gradient Boosting is another ensemble technique that builds trees sequentially, fitting each new tree to the gradient of the loss function. Like XGBoost, it focuses on correcting the errors made by earlier models, but it can be more sensitive to hyperparameter settings and to noise in the data.
Based on the comparative analysis of the models, Random Forest was chosen as the final model for CTR prediction. The rationale behind this decision includes:
Robust Performance: Random Forest demonstrated the best overall performance, with the highest accuracy and AUC score. This indicates that it was able to capture the underlying patterns in the data most effectively.
Reduced Overfitting: As an ensemble method, Random Forest helps mitigate overfitting, which can be a common issue with individual decision trees. This makes it more reliable in producing accurate predictions on unseen data.
Interpretability and Flexibility: While Random Forest may not be as interpretable as simpler models, it provides an effective balance between model complexity and prediction accuracy. It also handles high-dimensional datasets well, which is critical in CTR prediction tasks where the feature space can be large and complex.
```python
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
def ctr_prediction_RandomForest(data):
    """
    Function to predict Click-Through Rate (CTR) using a Random Forest model.

    Parameters:
    - data: pandas DataFrame, where the last column is 'CTR' (target variable).

    Returns:
    - model: trained model.
    - accuracy: model's accuracy score.
    - auc: ROC-AUC score.
    - y_pred: predicted CTR labels for the test set.
    """
    # Drop rows with missing values.
    if data.isnull().sum().any():
        data = data.dropna()

    # One-hot encode the categorical columns.
    categorical_cols = ['user_location', 'ad_type']  # Example categorical columns
    data = pd.get_dummies(data, columns=categorical_cols, drop_first=True)

    X = data.drop(columns=['CTR'])  # Features
    y = data['CTR']                 # Target variable: CTR (0 = No Click, 1 = Click)

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Initialize and train the Random Forest classifier.
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)

    # Make predictions and evaluate performance.
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    accuracy = accuracy_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_proba)

    # Plot the ROC curve (optional).
    fpr, tpr, thresholds = roc_curve(y_test, y_proba)
    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, color='blue', label='ROC Curve')
    plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend(loc='lower right')
    plt.show()

    return model, accuracy, auc, y_pred
def ctr_prediction_xgboost(data):
    """
    Function to predict Click-Through Rate (CTR) using an XGBoost classifier.

    Parameters:
    - data: pandas DataFrame, where the last column is 'CTR' (target variable).

    Returns:
    - model: trained XGBoost model.
    - accuracy: model's accuracy score.
    - auc: ROC-AUC score.
    - y_pred: predicted CTR labels for the test set.
    """
    # Preprocessing steps (same as above).
    if data.isnull().sum().any():
        data = data.dropna()
    categorical_cols = ['user_location', 'ad_type']
    data = pd.get_dummies(data, columns=categorical_cols, drop_first=True)

    X = data.drop(columns=['CTR'])  # Features
    y = data['CTR']                 # Target variable: CTR (0 = No Click, 1 = Click)

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Initialize and train the XGBoost classifier. The eval_metric is set
    # explicitly; use_label_encoder is deprecated in recent xgboost releases.
    model = xgb.XGBClassifier(eval_metric='logloss', random_state=42)
    model.fit(X_train, y_train)

    # Make predictions.
    y_pred = model.predict(X_test)

    # Evaluate performance. ROC-AUC is computed from predicted probabilities
    # rather than hard labels, matching the Random Forest evaluation.
    accuracy = accuracy_score(y_test, y_pred)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

    return model, accuracy, auc, y_pred
def ctr_prediction_gradient_boosting(data):
    """
    Function to predict Click-Through Rate (CTR) using a Gradient Boosting classifier.

    Parameters:
    - data: pandas DataFrame, where the last column is 'CTR' (target variable).

    Returns:
    - model: trained Gradient Boosting model.
    - accuracy: model's accuracy score.
    - auc: ROC-AUC score.
    - y_pred: predicted CTR labels for the test set.
    """
    # Preprocessing steps (same as above).
    if data.isnull().sum().any():
        data = data.dropna()
    categorical_cols = ['user_location', 'ad_type']
    data = pd.get_dummies(data, columns=categorical_cols, drop_first=True)

    X = data.drop(columns=['CTR'])  # Features
    y = data['CTR']                 # Target variable: CTR (0 = No Click, 1 = Click)

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Initialize and train the Gradient Boosting classifier.
    model = GradientBoostingClassifier(random_state=42)
    model.fit(X_train, y_train)

    # Make predictions.
    y_pred = model.predict(X_test)

    # Evaluate performance, again using probabilities for ROC-AUC.
    accuracy = accuracy_score(y_test, y_pred)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

    return model, accuracy, auc, y_pred
```
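For a quick comparison, the three functions can be run on the same dataset. The synthetic DataFrame below is a hypothetical stand-in that merely matches the schema the functions expect (`user_location`, `ad_type`, a numeric feature, and a binary `CTR` target); the real experiments used the actual feature table.

```python
import numpy as np

# Hypothetical toy data in the expected schema (illustrative only).
rng = np.random.default_rng(42)
n = 500
toy = pd.DataFrame({
    "user_location": rng.choice(["UK", "US", "CN"], size=n),
    "ad_type": rng.choice(["banner", "video"], size=n),
    "char_centrality": rng.random(n),    # illustrative numeric feature
    "CTR": rng.integers(0, 2, size=n),   # binary click label
})

for name, fn in [("Random Forest", ctr_prediction_RandomForest),
                 ("XGBoost", ctr_prediction_xgboost),
                 ("Gradient Boosting", ctr_prediction_gradient_boosting)]:
    _, acc, auc, _ = fn(toy.copy())
    print(f"{name}: accuracy={acc:.3f}, AUC={auc:.3f}")
```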
The following metrics were used to evaluate the performance of the CTR model:

- **Accuracy**: the proportion of test samples whose click/no-click label was predicted correctly.
- **ROC-AUC**: the area under the ROC curve, measuring how well the model ranks clicked samples above non-clicked ones across all classification thresholds.
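For reference, with $TP$, $TN$, $FP$, and $FN$ denoting true positives, true negatives, false positives, and false negatives on the test set:

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{AUC} = \int_0^1 \text{TPR} \; d(\text{FPR})
$$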
By building on the analyses from Lab-1 and Lab-2, we were able to construct a robust CTR estimation model for Harry Potter and the Sorcerer's Stone. This model helps predict which characters and plot points will engage audiences the most in a commercialized context. The insights gained can be applied to marketing campaigns, content distribution, and audience engagement strategies for book adaptations and media properties.
The character network was successfully visualized, revealing the key figures and their centrality in the story. Characters like Harry Potter, Hermione Granger, and Ron Weasley were found to have high centrality, while others like Draco Malfoy and Voldemort played more peripheral, yet pivotal, roles.
The CTR prediction model showed promising results, with key plot events like Harry's victory over Quirrell and the reveal of Voldemort’s return generating the highest predicted CTRs. Characters like Dumbledore and Hermione were also found to have high predicted popularity, with potential for high audience engagement.
The visualization of predicted CTR scores confirmed that certain characters and events would likely attract significant attention, offering insights into how the book’s commercial appeal could be optimized.
Lab-3 successfully extended the work from Lab-1 and Lab-2 to predict the commercial popularity of characters and events in Harry Potter and the Sorcerer's Stone. By analyzing character relationships and applying machine learning techniques for CTR prediction, this project provides valuable insights into how to optimize audience engagement based on narrative dynamics.
The CTR prediction model not only helps identify key characters but also provides a framework for analyzing the commercial potential of plot events, which could be applied to similar works in the future.