
  • Bilibili data analysis platform
  • Sentiment analysis based on deep learning

Design Overview


Project Overview

  1. Data Acquisition

    • Crawling Movie Comments
      • Douban: Collect comments from the Douban movie site.
      • Maoyan: Collect comments from the Maoyan movie site.
    • Data Cleaning
      • Perform data cleaning to ensure the quality and consistency of the collected comments.
  2. Model Training

    • Machine Learning Model
      • Utilize SnowNLP for sentiment analysis and other natural language processing tasks.
    • Deep Learning Model
      • Employ PaddleNLP to build and train deep learning models for advanced text analysis.
  3. Model Application

    • Crawling Data from Bilibili
      • Implement real-time data crawling from Bilibili: users enter a video URL and immediately receive the video's information and comments.
    • Backend Development
      • Develop the backend using the FastAPI Python framework to handle data processing and model integration.
    • Frontend Development
      • Implement a user-friendly interface to visualize and interact with the analysis results.

Data Acquisition

Data Acquisition: Selecting the Right Dataset

Sources of Our Data

  • Douban: A prominent movie site where users freely express their opinions on movies. Each comment usually comes with the user's rating, which serves as a sentiment label and makes the data valuable for our analysis.
  • Maoyan: Another key player in the movie industry with a rich database of user comments, similarly labeled, allowing for comparative studies and robust model training.

Why These Datasets?

  • Labeled Data: We chose to focus on comments from Douban and Maoyan because they offer labeled datasets. Labels are essential because both training and evaluating our sentiment analysis models depend on them.

Comment on Douban and Maoyan

Data Preprocessing: Splitting the Dataset

Data Preparation

  1. Data Categorization
    • Negative dict: ratings less than or equal to 3
    • Positive dict: ratings greater than 3
  2. Dataset Splitting
    • Random Split
  3. Training and Test Sets
    • Training set : Test set = 8 : 2
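
The preparation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual preprocessing code: the helper names `label_comment` and `split_dataset`, the fixed random seed, and the sample ratings are all assumptions made for the example.

```python
import random

def label_comment(rating, threshold=3):
    """Return 0 (negative) for ratings <= threshold, 1 (positive) otherwise."""
    return 0 if rating <= threshold else 1

def split_dataset(samples, train_ratio=0.8, seed=42):
    """Shuffle a copy of the samples and split it into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical (comment, rating) pairs; real data would come from the crawlers.
raw = [("comment %d" % i, rating) for i, rating in enumerate([1, 2, 3, 4, 5] * 4)]
labeled = [(text, label_comment(rating)) for text, rating in raw]
train_set, test_set = split_dataset(labeled)
print(len(train_set), len(test_set))
```

Seeding the shuffle keeps the 8:2 split stable between runs, which makes accuracy figures from different models comparable.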

Model Training


  • SnowNLP is a library focused on natural language processing tasks for Chinese text, such as sentiment analysis and text processing.
  • Its sentiment classifier is trained on the labeled training data using the Naive Bayes algorithm.

Naive Bayes Algorithm

  • Assumes that features are independent of each other.
  • Estimates probabilities based on the features and labels in the training dataset.
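
To make the two points above concrete, here is a small Naive Bayes classifier written from scratch. It is a sketch under stated assumptions rather than SnowNLP's implementation: it treats each character as a feature instead of using a real word segmenter, it applies add-one (Laplace) smoothing, and the class name `NaiveBayes`, the `tokenize` helper, and the four training comments are hypothetical.

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.token_counts = {}       # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()

    def fit(self, docs, labels):
        for tokens, label in zip(docs, labels):
            self.doc_counts[label] += 1
            self.token_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)

    def log_scores(self, tokens):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.token_counts.items():
            total_tokens = sum(counts.values())
            # Log prior of the class.
            score = math.log(self.doc_counts[label] / total_docs)
            # Independence assumption: per-token log likelihoods are summed.
            for tok in tokens:
                score += math.log((counts[tok] + 1) / (total_tokens + len(self.vocab)))
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)

def tokenize(text):
    """Character-level split; a simplification of real Chinese word segmentation."""
    return [ch for ch in text if not ch.isspace()]

# Hypothetical toy training comments.
train_texts = ["好看 精彩", "很棒 好看", "难看 无聊", "很差 无聊"]
train_labels = ["pos", "pos", "neg", "neg"]

model = NaiveBayes()
model.fit([tokenize(t) for t in train_texts], train_labels)
print(model.predict(tokenize("精彩 好看")))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, which matters for long comments.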

Split characters

Split the Chinese text into terms and calculate the probability that each term appears in the positive or negative set.

Segmented Chinese Comment

Training Result

Method       Test Dataset Accuracy
SnowNLP      78.58%
KNN          78.32%

SnowNLP Model Evaluation

  1. Movie Comments
  • Crawl the comments on the movie 《Wandering Earth 2》 from Bilibili.
  2. Estimating the Comment Score
  • Get a sentiment score for each comment using the trained SnowNLP model.
  • The average comment sentiment score is only 0.54. (The calculation method is in the appendix.)
  3. Comment Score Distribution
  • The distribution is polarized; the model does not work well.
    SnowNLP Comment Score Distribution
  4. Comment Score Examples
Comment    Like     Sentiment    Sentiment x Like
           13621    4.14E-11     5.64E-07
           7597     0.000216     1.639879
           6672     8.11E-10     5.41E-06

It turns out that the SnowNLP model does not work well in practice.


Pre-trained Model ERNIE (Similar to BERT-wwm)

  • As deep learning has developed, the number of model parameters has grown rapidly, and training them requires larger datasets to avoid overfitting.
  • Studies have shown that pre-trained models (PTMs) trained on large-scale unlabeled corpora acquire general-purpose language representations and perform well when fine-tuned on downstream tasks.
  • In addition, using a pre-trained model avoids training a model from scratch.


  • BERT requires minimal architecture changes for a wide range of natural language processing applications.

Training Process

  1. Take a batch of data from the dataloader.
  2. Feed the batch to the model for the forward pass.
  3. Pass the forward-pass output to the loss function to compute the loss, and to the evaluation method to compute the evaluation metric.
  4. Backpropagate the loss and update the model parameters. Repeat the steps above.

After each training epoch, the program evaluates the current model.
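
The loop described above has the same shape in any framework. The sketch below shows it in plain Python with a logistic-regression model on toy data. It is not the PaddleNLP/ERNIE fine-tuning code: the model, the helper names (`forward`, `iter_batches`, `evaluate`, `train`), the hyperparameters, and the data are assumptions chosen so the example runs without a deep learning library.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, bias, features):
    """Forward pass: predicted probability of the positive class."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def iter_batches(data, batch_size, rng):
    """Stand-in for a dataloader: yield shuffled mini-batches."""
    order = list(data)
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]

def evaluate(weights, bias, data):
    """Accuracy with a 0.5 decision threshold."""
    hits = sum((forward(weights, bias, xs) >= 0.5) == (y == 1) for xs, y in data)
    return hits / len(data)

def train(train_data, dev_data, epochs=50, batch_size=4, lr=0.5, seed=0):
    rng = random.Random(seed)
    weights = [0.0] * len(train_data[0][0])
    bias = 0.0
    history = []  # one (mean training loss, dev accuracy) pair per epoch
    for _ in range(epochs):
        epoch_loss = 0.0
        for batch in iter_batches(train_data, batch_size, rng):  # 1. take a batch
            grad_w = [0.0] * len(weights)
            grad_b = 0.0
            for xs, y in batch:
                p = forward(weights, bias, xs)                    # 2. forward pass
                p_safe = min(max(p, 1e-12), 1.0 - 1e-12)
                # 3. binary cross-entropy loss for this sample
                epoch_loss -= y * math.log(p_safe) + (1 - y) * math.log(1.0 - p_safe)
                # 4. gradient of the loss with respect to weights and bias
                for j, x in enumerate(xs):
                    grad_w[j] += (p - y) * x
                grad_b += p - y
            # Parameter update with the batch-averaged gradient.
            weights = [w - lr * g / len(batch) for w, g in zip(weights, grad_w)]
            bias -= lr * grad_b / len(batch)
        # Evaluate on held-out data once per epoch.
        history.append((epoch_loss / len(train_data), evaluate(weights, bias, dev_data)))
    return weights, bias, history

# Hypothetical toy data: label 1 when the first feature dominates.
train_data = [([1.0, 0.0], 1), ([0.9, 0.2], 1), ([0.8, 0.1], 1), ([0.7, 0.3], 1),
              ([0.0, 1.0], 0), ([0.2, 0.9], 0), ([0.1, 0.8], 0), ([0.3, 0.7], 0)]
dev_data = [([0.9, 0.1], 1), ([0.1, 0.9], 0)]

weights, bias, history = train(train_data, dev_data)
print(history[-1])
```

Recording the loss and the held-out accuracy after every epoch is what makes it possible to spot overfitting and pick the best checkpoint.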

Training Result

Method       Test Dataset Accuracy
SnowNLP      78.58%
PaddleNLP    85.31%

PaddleNLP Model Evaluation

  1. Movie Comments
  • Crawl the comments on the movie 《Wandering Earth 2》 from Bilibili.
  2. Estimating the Comment Score
  • Get a sentiment score for each comment using the trained PaddleNLP model.
  • The average comment sentiment score is 0.89. (The calculation method is in the appendix.)
  3. Sentiment Score Distribution

PaddleNLP Comment Score Distribution

Model Application

Crawling Bilibili Video Info

  1. Set the User-Agent and cookie information

Use the bvid to access the video's comment information, including:

  • Comment
  • Number of comments and likes


  2. Crawl the video's basic information

Use the bvid to access the video's basic information, including:

  • Title
  • Author
  • Reply count, favorite count
  • Coin count, share count


  3. Crawl the video's comment content and danmu content

Use the bvid to request the danmu (bullet-screen) XML file, whose location is obtained from the basic information. Each entry contains:

  • Time
  • Timestamp
  • Danmu text
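
As an illustration of the last step, the sketch below pulls a bvid out of a video URL and parses a danmu XML document into time, timestamp, and text. It makes several assumptions that are not stated in the slides: that a bvid is the string `BV` followed by ten alphanumeric characters, that each danmu is a `<d>` element whose `p` attribute is a comma-separated list with the playback time first and the Unix send time fifth, and that the URL and XML sample are made-up placeholders. No network request is issued.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Assumed bvid shape: "BV" followed by ten alphanumeric characters.
BVID_PATTERN = re.compile(r"BV[0-9A-Za-z]{10}")

def extract_bvid(url):
    """Return the first bvid-looking token in a video URL."""
    match = BVID_PATTERN.search(url)
    if match is None:
        raise ValueError("no bvid found in URL: %r" % url)
    return match.group(0)

def parse_danmu(xml_text):
    """Parse danmu XML into a list of dicts with time, timestamp, and text."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.iter("d"):
        fields = node.get("p", "").split(",")
        if len(fields) < 5:
            continue  # skip entries that do not match the assumed layout
        rows.append({
            "time": float(fields[0]),  # seconds into the video
            "timestamp": datetime.fromtimestamp(int(fields[4]), tz=timezone.utc),
            "text": node.text or "",
        })
    return rows

# Hypothetical inputs for illustration only.
SAMPLE_URL = "https://www.bilibili.com/video/BV1AB411c7XY"
SAMPLE_XML = """<i>
  <d p="12.5,1,25,16777215,1675000000,0,abcd1234,1">太好看了</d>
  <d p="98.0,1,25,16777215,1675000100,0,efgh5678,2">小破球出发</d>
</i>"""

bvid = extract_bvid(SAMPLE_URL)
danmu_rows = parse_danmu(SAMPLE_XML)
print(bvid, len(danmu_rows))
```

Keeping parsing separate from fetching means the parser can be tested on saved XML files without hitting the site.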



Why FastAPI?

“FastAPI is a modern, fast (high-performance) web framework for building APIs with Python 3.6+.”

Key Advantages:

  • High performance
  • Rapid development
  • Automatic interactive API documentation
  • Type hints for fewer bugs


Introduction to the three main API categories: video, videos, newvideo

  • Usage of the three APIs:
    • video: Fetch specific video information.
    • videos: Gather statistics on all scraped videos.
    • newvideo: Add new video data to the system.



Sentiment Score Calculation Method

  1. Compute the values for the Sentiment x Like column:

    $$p_i = s_i \times l_i$$

Here, $s_i$ and $l_i$ represent the values of the Sentiment and Like columns for the $i$-th row, respectively, and $p_i$ is the result of their product, stored in a new column Sentiment x Like.

  2. Calculate the ratio of the sum of the Sentiment x Like column to the sum of the Like column, which represents the mean sentiment value:

    $$\bar{s} = \frac{\sum_i p_i}{\sum_i l_i}$$

    Here, $\sum_i p_i$ is the sum of all values in the Sentiment x Like column, and $\sum_i l_i$ is the sum of all values in the Like column. This ratio represents the weighted average sentiment value, where the weight for each comment is its number of likes.
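
The like-weighted average described above can be computed as follows. This is a small sketch: the function name `weighted_mean_sentiment` and the choice to raise an error when the total like count is zero are assumptions, and the three input rows reuse the rounded Like and Sentiment values from the comment score example table.

```python
def weighted_mean_sentiment(rows):
    """Like-weighted mean sentiment for an iterable of (likes, sentiment) pairs."""
    rows = list(rows)
    total_likes = sum(likes for likes, _ in rows)
    if total_likes == 0:
        raise ValueError("total like count is zero; the weighted mean is undefined")
    # Sum of the "Sentiment x Like" column divided by the sum of the "Like" column.
    weighted_sum = sum(likes * sentiment for likes, sentiment in rows)
    return weighted_sum / total_likes

# (likes, sentiment) pairs taken from the example table.
example_rows = [(13621, 4.14e-11), (7597, 0.000216), (6672, 8.11e-10)]
print(weighted_mean_sentiment(example_rows))
```

Weighting by likes lets comments that many viewers endorsed count for more than comments nobody reacted to; a comment with zero likes contributes nothing to the average.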