[Pseudo-crawler] A Bilibili comment-scraping script (not original work)
info: This script targets the Bilibili API as of 2024. If the endpoint or response format changes later, do not rely on this post.
This post presents a script written by another author that scrapes a Bilibili comment section by driving the browser directly. It was lightly adjusted with ChatGPT-4o so that the output is written as CSV.
Below is a modified version of StarrySkyVictor's DrissionPage-based Bilibili comment scraper:
from DrissionPage import ChromiumPage  # pip install DrissionPage
import time
import csv
import os

URL = input("请输入B站视频或动态的链接:")   # link to the video or post
num = int(input("请输入要爬取的页面次数:"))  # number of comment pages to fetch

page = ChromiumPage()
page.set.load_mode.none()

# Listen for requests to the comment API.
page.listen.start('https://api.bilibili.com/x/v2/reply/wbi/main?')

# Open the Bilibili page.
page.get(URL)
time.sleep(3)

# Scroll to the bottom repeatedly to trigger comment pagination.
for _ in range(num + 1):
    page.scroll.to_bottom()
    time.sleep(2)

# Collect all captured responses.
responses = []
try:
    for _ in range(num):
        packet = page.listen.wait()
        page.stop_loading()
        # The response body is already parsed into a dict.
        responses.append(packet.response.body)
        time.sleep(1)
except Exception as e:
    print(f"解析出现错误: {e}")

# Export to CSV.
total_comments = 0
output_path = 'comments.csv'
with open(output_path, 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['评论内容', '用户名', '性别', 'IP地址'])
    for response in responses:
        try:
            if 'data' in response and response['data'] and 'replies' in response['data']:
                replies = response['data']['replies']
                total_comments += len(replies)
                for reply in replies:
                    comment = reply['content']['message']
                    uname = reply['member']['uname']
                    sex = reply['member']['sex']
                    ip = reply['reply_control'].get('location', '未知')
                    print(f"评论内容: {comment}\n用户名: {uname}\n性别: {sex}\nIP地址: {ip}\n")
                    writer.writerow([comment, uname, sex, ip])
        except KeyError as e:
            print(f"处理响应时出现错误: {e}")

page.close()
print(f"总评论数量: {total_comments}")
print(f"评论数据已保存到: {os.path.abspath(output_path)}")
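For reference, the field extraction above assumes the response shape sketched below. This is a minimal, self-contained example against a hand-made sample payload; the field values are invented for illustration and are not real API output:

```python
import csv
import io

# Hypothetical sample mimicking the shape the script expects from
# /x/v2/reply/wbi/main; all values here are made up.
sample = {
    "data": {
        "replies": [
            {
                "content": {"message": "Nice video!"},
                "member": {"uname": "user_a", "sex": "保密"},
                "reply_control": {"location": "IP属地:上海"},
            },
            {
                "content": {"message": "First!"},
                "member": {"uname": "user_b", "sex": "男"},
                "reply_control": {},  # 'location' may be absent
            },
        ]
    }
}

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["comment", "uname", "sex", "ip"])
for r in sample["data"]["replies"]:
    writer.writerow([
        r["content"]["message"],
        r["member"]["uname"],
        r["member"]["sex"],
        # Same fallback the script uses when no location is present.
        r["reply_control"].get("location", "unknown"),
    ])
print(buf.getvalue())
```

Missing keys other than `location` (for example a reply with no `member`) would still raise `KeyError`, which is why the script wraps the extraction in a try/except.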
Usage:
- Install Python.
- Install DrissionPage:
  pip install DrissionPage
- If Google Chrome is installed, no further setup is needed.
- If Edge is your default browser instead, edit the script at
  C:\Users\[Username]\AppData\Local\Programs\Python\[Python-version]\Lib\site-packages\DrissionPage\_configs\chromium_options.py
  and change the return value around line 89 to:
  return r'C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe'
  (using a raw string avoids backslash-escape problems in the path)
- Run the script once first with any Bilibili link and page count. The browser it opens will not be logged in to Bilibili, so scan the QR code to log in.
- Compute the page count as (total comments divided by 20) + 1, rounded down; then paste the URL of the video or post you want to inspect and run the script.
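The page-count rule in the last step can be written as a small helper. Note that the figure of 20 top-level comments per page comes from this post, not from any documented API guarantee:

```python
def pages_needed(total_comments: int, per_page: int = 20) -> int:
    """Scroll rounds needed to cover all top-level comments,
    following the rule: floor(total / per_page) + 1."""
    return total_comments // per_page + 1

print(pages_needed(95))  # -> 5
```

For example, 95 comments at 20 per page gives 95 // 20 + 1 = 5 pages.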
Reference: