date: 2024-04-14
title: Data-Programming-Workshop
status: UNFINISHED
author:
- AllenYGY
tags:
- DPW
- Project
- Document
created: 2024-04-14T16:53
updated: 2024-06-11T01:14
publish: True
Data-Programming-Workshop
Project Overview
Data Acquisition
- Douban: Collect comments from the Douban movie site.
- Maoyan: Collect comments from the Maoyan movie site.
Model Training
Model Application
Bilibili
Bilibili, allowing users to input a video URL and retrieve video information and comments instantly.
Sources of Our Data
Why These Datasets?
SnowNLP is a library focused on natural language processing tasks for Chinese text, such as sentiment analysis and text processing.
Naive Bayes algorithm.
Split the Chinese text into terms and calculate the probability that each term appears in the set.
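The idea above can be sketched as a tiny Naive Bayes classifier. This is only an illustration under stated assumptions, not SnowNLP's actual implementation: the text is split into single characters (real libraries use proper word segmentation), the four training phrases are invented toy data, and the function names `train` and `positive_probability` are hypothetical.

```python
import math
from collections import Counter


def train(docs):
    """Count terms per class. docs is a list of (tokens, label) pairs."""
    term_counts = {}
    class_counts = Counter()
    for tokens, label in docs:
        class_counts[label] += 1
        term_counts.setdefault(label, Counter()).update(tokens)
    return term_counts, class_counts


def positive_probability(tokens, term_counts, class_counts):
    """Return P(pos | tokens) using Naive Bayes with add-one smoothing."""
    vocab = set()
    for counts in term_counts.values():
        vocab.update(counts)
    total_docs = sum(class_counts.values())

    log_scores = {}
    for label, counts in term_counts.items():
        total_terms = sum(counts.values())
        # Start from the class prior, then add one log-likelihood per term.
        score = math.log(class_counts[label] / total_docs)
        for term in tokens:
            score += math.log((counts[term] + 1) / (total_terms + len(vocab)))
        log_scores[label] = score

    # Normalise in a numerically stable way.
    top = max(log_scores.values())
    exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
    return exp_scores["pos"] / sum(exp_scores.values())


# Toy training data (assumption): two positive and two negative phrases,
# tokenised naively as single characters.
TOY_DOCS = [
    (list("好看"), "pos"),
    (list("精彩"), "pos"),
    (list("难看"), "neg"),
    (list("无聊"), "neg"),
]
TERM_COUNTS, CLASS_COUNTS = train(TOY_DOCS)

if __name__ == "__main__":
    print(positive_probability(list("好看"), TERM_COUNTS, CLASS_COUNTS))
    print(positive_probability(list("无聊"), TERM_COUNTS, CLASS_COUNTS))
```

A score above 0.5 is read as positive and below 0.5 as negative, which matches how a probability-style sentiment score is usually interpreted.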
Method | Test Dataset Accuracy |
---|---|
SnowNLP | 78.58% |
KNN | 78.32% |
Bilibili
Comment | Like | Sentiment | Sentiment x Like |
---|---|---|---|
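CCTV-6's award citation for The Wandering Earth 2: This is a comprehensive leap forward for China's film industry, using hard power to lift Chinese science fiction film to an unprecedented level. An original script of more than 70,000 characters, a crew of more than 20,000, a total set area of more than 900,000 square meters, and more than 1,400 days of production deliver a visual feast of 2 hours 53 minutes. Such a large and refined production drew more than 1.18 billion views for the film's namesake topic online and earned 4.023 billion at the box office. M Big Data shows the film's communication index reached 9.8, the highest of any film this year. The spirited Chinese ethos in the film, like a shining cluster of stars, lights the way forward for Chinese science fiction cinema. | 13621 | 4.14E-11 | 5.64E-07 |
Actually, director Guo Fan did not switch careers midway; he was set on making science fiction films from childhood. At 15 he watched Cameron's Terminator 2 and resolved to make sci-fi films. He wanted to apply to a film academy in the gaokao, but no directing program was recruiting in Shandong province, and his mother urged him to study law and become the head of a political and legal affairs committee. After getting into law, Guo Fan thought that if he did not chase his dream he would regret it in old age, lying in a sickbed or a rocking chair, so he decided that whatever major he chose, he would keep heading toward that goal. He made short films at university, and his legal training suits the industrialized side of filmmaking. He was also good at drawing as a child and won awards, so he has an art foundation. So stop saying that career changers, non-professionals, and mid-career switchers can succeed too, as if "an outsider did it, so I can as well"; Guo Fan had this dream from the start, already had the relevant knowledge, and at 29 was admitted to the Beijing Film Academy's management department as a graduate student. | 7597 | 0.000216 | 1.639879 |
Year one of Chinese science fiction must be 1999. Guo Fan took the gaokao that year, and that year's Science Fiction World anticipated the gaokao essay topic "If Memory Could Be Transplanted"; Director Guo had read it, was inspired, and scored high. In the same year Liu Cixin first published in Science Fiction World, with "At the End of the Microcosmos" (《微观尽头》) and "Whale Song" (《鲸歌》), and the following year he published "The Wandering Earth". Also that year, high school student Xie Nan published the short piece "奇想" in the January issue of Science Fiction World, and its page number, 43, matches Wu Jing's birthday, April 3. | 6672 | 8.11E-10 | 5.41E-06 |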
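It turns out that the model does not work well in practice: all three comments above receive sentiment scores close to zero, although the first one is plainly positive.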
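For reference, the "Sentiment x Like" column is simply the product of the two preceding columns. The sketch below recomputes it from the rounded values shown in the table, so the second row differs slightly from the table's figure, which was presumably computed from the unrounded score. The function name `like_weighted_scores` is hypothetical.

```python
# (likes, sentiment) pairs copied from the table above.
ROWS = [
    (13621, 4.14e-11),
    (7597, 0.000216),
    (6672, 8.11e-10),
]


def like_weighted_scores(rows):
    """Multiply each comment's sentiment score by its like count."""
    return [likes * sentiment for likes, sentiment in rows]


if __name__ == "__main__":
    for (likes, sentiment), product in zip(ROWS, like_weighted_scores(ROWS)):
        print(f"{likes:>6} x {sentiment:.3g} = {product:.6g}")
```

Weighting by likes gives popular comments more influence, which also means a single misjudged popular comment can dominate the result.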
ERNIE (like BERT-wwm)
Pretrained Models (PTM) based on large-scale unlabeled corpora can acquire generic language representations and perform well when fine-tuned to downstream tasks.
dataloader
forward calculation
After each training epoch, the program evaluates the performance of the current model.
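The three steps above (dataloader, forward calculation, evaluation after every epoch) can be sketched with a deliberately small stand-in. This is not the PaddleNLP/ERNIE fine-tuning code: a two-feature logistic regression replaces the pretrained model, the data is invented, and all function names are hypothetical. Only the shape of the loop is the point.

```python
import math
import random


def dataloader(samples, batch_size, rng):
    """Yield shuffled mini-batches of (features, label) pairs."""
    order = list(range(len(samples)))
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [samples[i] for i in order[start:start + batch_size]]


def forward(weights, bias, features):
    """Forward calculation: probability that the sample is positive."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def evaluate(weights, bias, samples):
    """Accuracy of the current model on a held-out set."""
    correct = sum(
        (forward(weights, bias, x) >= 0.5) == bool(y) for x, y in samples
    )
    return correct / len(samples)


def train(train_set, dev_set, epochs=20, lr=0.5, batch_size=2, seed=0):
    rng = random.Random(seed)
    weights = [0.0] * len(train_set[0][0])
    bias = 0.0
    history = []
    for _ in range(epochs):
        for batch in dataloader(train_set, batch_size, rng):
            grad_w = [0.0] * len(weights)
            grad_b = 0.0
            for x, y in batch:
                err = forward(weights, bias, x) - y  # gradient of log loss
                for j, xj in enumerate(x):
                    grad_w[j] += err * xj
                grad_b += err
            weights = [w - lr * g / len(batch) for w, g in zip(weights, grad_w)]
            bias -= lr * grad_b / len(batch)
        # Evaluate once per epoch, as described above.
        history.append(evaluate(weights, bias, dev_set))
    return weights, bias, history


# Invented toy data: label 1 when the first feature dominates.
TRAIN_SET = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0), ([0.1, 0.9], 0)]
DEV_SET = [([0.8, 0.2], 1), ([0.2, 0.8], 0)]

if __name__ == "__main__":
    _, _, history = train(TRAIN_SET, DEV_SET)
    print(history)
```

Recording the per-epoch accuracy makes it easy to keep the best checkpoint or to stop early once the dev score stops improving.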
Method | Test Dataset Accuracy |
---|---|
SnowNLP | 78.58% |
PaddleNLP | 85.31% |
Bilibili
Through bvid, access the comment information of the video, including:
Through bvid, access the basic information of the video, including:
danmu content
Through bvid, request the danmu XML file located via the basic information:
Danmu text
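Extracting the danmu text from such an XML file might look like the sketch below. It makes no network request: `SAMPLE_XML` is an invented stand-in, and the layout it uses (one `<d>` element per danmu, with a comma-separated `p` attribute whose first field is the appearance time in seconds) is an assumption about the file format rather than something stated in this document. The function name `parse_danmu` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample that mimics the assumed danmu XML layout.
SAMPLE_XML = """<i>
  <d p="12.5,1,25,16777215,1700000000,0,abcd1234,1">first danmu</d>
  <d p="40.0,1,25,16777215,1700000100,0,efgh5678,2">second danmu</d>
</i>"""


def parse_danmu(xml_text):
    """Return a list of (appearance_time_seconds, text) tuples."""
    root = ET.fromstring(xml_text)
    result = []
    for node in root.iter("d"):
        fields = node.get("p", "").split(",")
        # Assumption: the first field of `p` is the time offset in seconds.
        time_offset = float(fields[0]) if fields[0] else 0.0
        result.append((time_offset, node.text or ""))
    return result


if __name__ == "__main__":
    for time_offset, text in parse_danmu(SAMPLE_XML):
        print(f"{time_offset:7.1f}s  {text}")
```

The extracted text can then be passed to the sentiment model in the same way as ordinary comments.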
Why FastAPI?
“FastAPI is a modern, fast (high-performance) web framework for building APIs with Python 3.6+.”
Key Advantages:
Introduction to the three main API categories: video, videos, newvideo
Here,