date: 2024-05-19
title: BilibiliVista
status: DONE
author:
- AllenYGY
tags:
- Project
- Python
- DPW
created: 2024-05-19T19:57
updated: 2024-06-11T01:14
publish: True
cd backend
python main.py
cd frontend/code
npm i
npm run dev
First, we implemented an automated movie review scraper to quickly collect a large volume of movie reviews and ratings. On this dataset (over 50,000 entries), we trained two favorability rate (sentiment) analysis models: one based on a favorability rate dictionary and Naive Bayes (machine learning), and another fine-tuned from a transformer pre-trained with whole word masking. We also developed a scraper that dynamically obtains Bilibili video information (comments, bullet chats, etc.). Finally, we integrated these features with FastAPI to separate the front end from the back end: the back end retrieves Bilibili information on demand, runs both trained models on the comments of a specified video, and serves the results to the front end for dynamic display.
Automated Movie Review Scraper:
Favorability Rate Analysis Models:
Dynamic Bilibili Video Information Scraper:
FastAPI-based Front-End and Back-End Integration:
Data Acquisition
- Douban: Collect comments from the Douban movie site.
- Maoyan: Collect comments from the Maoyan movie site.

Model Training

Model Application
- Bilibili: allows users to input a video URL and retrieve video information and comments instantly.

To quickly acquire a large dataset, we wrote an automated crawler for movie reviews. It automatically scrapes movie reviews and rating information from Douban and Maoyan.
We have also open-sourced this crawler on GitHub.
A favorability rate dictionary combined with Naive Bayes, based on SnowNLP.[1]

Split the Chinese text into terms and calculate the probability that each term appears in the set.
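The term-probability idea can be illustrated with a small hand-written Naive Bayes classifier. This is only a sketch under stated assumptions, not SnowNLP's actual implementation: the `tokenize` helper treats every character as a term (a real system would use proper word segmentation), the class name `NaiveBayesSentiment` and the four training samples are made up for illustration, and Laplace smoothing is one common choice among several.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplifying assumption: every non-space character is one term.
    return [ch for ch in text if not ch.isspace()]


class NaiveBayesSentiment:
    """Toy two-class Naive Bayes over term counts (illustrative only)."""

    def __init__(self):
        self.term_counts = {"pos": Counter(), "neg": Counter()}
        self.doc_counts = Counter()

    def train(self, samples):
        # samples: iterable of (text, label) pairs, label is "pos" or "neg".
        for text, label in samples:
            self.doc_counts[label] += 1
            self.term_counts[label].update(tokenize(text))

    def prob_positive(self, text):
        vocab = set(self.term_counts["pos"]) | set(self.term_counts["neg"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in ("pos", "neg"):
            total_terms = sum(self.term_counts[label].values())
            # Start from the log prior of the class.
            score = math.log(self.doc_counts[label] / total_docs)
            for term in tokenize(text):
                # Laplace smoothing keeps unseen terms from zeroing the product.
                count = self.term_counts[label][term] + 1
                score += math.log(count / (total_terms + len(vocab)))
            log_scores[label] = score
        # Convert the two log scores into a normalized probability.
        top = max(log_scores.values())
        exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
        return exp_scores["pos"] / (exp_scores["pos"] + exp_scores["neg"])


# Hypothetical toy data: two positive and two negative reviews.
clf = NaiveBayesSentiment()
clf.train([
    ("好看 精彩", "pos"),
    ("很好 推荐", "pos"),
    ("难看 无聊", "neg"),
    ("很差 失望", "neg"),
])
print(clf.prob_positive("好看"))
print(clf.prob_positive("无聊"))
```

A score above 0.5 leans positive and a score below 0.5 leans negative. Working in log space avoids numeric underflow on long comments.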
Pretrained Models (PTM) based on large-scale unlabeled corpora can acquire generic language representations and perform well when fine-tuned to downstream tasks.

ERNIE (like BERT-wwm)

dataloader

forward calculation
After each training epoch, the program evaluates how well the current model performs.
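The train-then-evaluate rhythm can be sketched with a tiny stand-in model. This is not the project's ERNIE fine-tuning code: a pure-Python logistic regression replaces the transformer, and the two-feature samples (hypothetical counts of positive and negative words), the learning rate, and the function names are all assumptions chosen to keep the example self-contained.

```python
import math
import random


def predict(weights, bias, x):
    # Forward calculation: a linear score passed through a sigmoid.
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def accuracy(weights, bias, dataset):
    correct = sum((predict(weights, bias, x) >= 0.5) == bool(y) for x, y in dataset)
    return correct / len(dataset)


def train(train_set, dev_set, epochs=5, lr=0.5, seed=0):
    rng = random.Random(seed)
    weights, bias = [0.0] * len(train_set[0][0]), 0.0
    history = []
    for epoch in range(1, epochs + 1):
        batch = train_set[:]
        rng.shuffle(batch)  # stands in for a shuffling dataloader
        for x, y in batch:
            error = predict(weights, bias, x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
        # Evaluate on held-out data once the epoch is finished.
        dev_acc = accuracy(weights, bias, dev_set)
        history.append(dev_acc)
        print(f"epoch {epoch}: dev accuracy = {dev_acc:.2%}")
    return weights, bias, history


# Hypothetical features: [positive word count, negative word count].
train_set = [([2, 0], 1), ([1, 0], 1), ([3, 1], 1),
             ([0, 2], 0), ([0, 1], 0), ([1, 3], 0)]
dev_set = [([2, 1], 1), ([1, 2], 0)]
weights, bias, history = train(train_set, dev_set)
```

Keeping the per-epoch scores in `history` makes it easy to pick the best checkpoint or stop early when the held-out accuracy stops improving.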
Method | Test Dataset Accuracy |
---|---|
SnowNLP | 78.58% |
PaddleNLP | 85.31% |
To test how well our models generalize, we crawled the comments on the movie The Wandering Earth 2 from Bilibili and analyzed them with both models.
The deep learning approach gives more convincing results, while the machine learning approach runs much faster.
We built a crawler that dynamically obtains Bilibili video information (comments, bullet chats, etc.); the backend uses it.
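Because users supply a video URL while the crawler looks videos up by `bvid`, a first step is usually to pull the ID out of the URL. The sketch below assumes a `bvid` is the literal prefix `BV` followed by ten alphanumeric characters; the pattern and the helper name `extract_bvid` are illustrative and not taken from the project.

```python
import re

# Assumption: a bvid looks like "BV" plus 10 alphanumeric characters.
BVID_PATTERN = re.compile(r"BV[0-9A-Za-z]{10}")


def extract_bvid(url_or_id):
    """Return the first bvid found in a URL or raw string, or None."""
    match = BVID_PATTERN.search(url_or_id)
    return match.group(0) if match else None


print(extract_bvid("https://www.bilibili.com/video/BV1GJ411x7h7/?spm_id_from=333"))
print(extract_bvid("https://example.com/no-video-here"))
```

Returning `None` instead of raising lets the caller decide how to report an invalid URL to the user.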
Through bvid, access the comment information of the video, including:

Through bvid, access the basic information of the video, including:

danmu content

Through bvid, request the bullet-screen XML file obtained through the basic information:
- Danmu text
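Once the XML file has been downloaded, extracting the danmu text is a small parsing job. The sketch below works on an inline sample rather than a live request. It assumes each bullet chat is a `<d>` element whose `p` attribute is a comma-separated list starting with the appearance time in seconds; that layout, the sample values, and the `parse_danmu` helper are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the assumed layout: <d p="time,mode,...">text</d>
SAMPLE_XML = """<i>
  <d p="12.5,1,25,16777215,1700000000,0,abcd1234,1">First bullet chat</d>
  <d p="3.0,1,25,16777215,1700000001,0,efgh5678,2">Second bullet chat</d>
</i>"""


def parse_danmu(xml_text):
    """Return danmu entries sorted by the time they appear in the video."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.iter("d"):
        fields = node.get("p", "").split(",")
        if not fields[0]:
            continue  # skip entries without a usable time field
        rows.append({"time": float(fields[0]), "text": node.text or ""})
    rows.sort(key=lambda row: row["time"])
    return rows


for row in parse_danmu(SAMPLE_XML):
    print(row["time"], row["text"])
```

Sorting by time puts the danmu in playback order, which is convenient when the text is later fed to the analysis models or plotted against the video timeline.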
The backend exposes the features described above through FastAPI (a backend framework), which separates the front end from the back end. It supports dynamic retrieval of Bilibili-related information and favorability rate analysis of comments on specified Bilibili videos using the two trained models, with the data presented dynamically on the front end.
Why FastAPI?
“FastAPI is a modern, fast (high-performance) web framework for building APIs with Python 3.6+.”
Introduction to the three main API categories: video, videos, newvideo
For the front end, we use Vue3 + TypeScript + Arco Design + vue-i18n + Vite and connect to the Python backend to visualize the data.
Backend
Frontend
[1] SnowNLP is a library focused on natural language processing tasks for Chinese text, such as favorability rate analysis and text processing.