Google web video scraping tutorial (page parsing with BeautifulSoup)

Published by 优采云: 2021-10-10 09:35

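With the winter break coming up, I wanted to scrape a few MOOC courses so I could watch them at home.

The course I scraped is Peking University's Introduction to Discrete Mathematics.

There are actually ready-made programs on GitHub, but I don't know how to submit HTTP requests myself, so I went straight for Selenium, the simple brute-force option.

I use BeautifulSoup to parse the pages.

The idea is simple: on the courseware page, walk through every chapter, every lesson, and every unit, and collect the link of each video. A set of nested loops is all it takes.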
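The nested-loop idea can be sketched without a browser. The example below is only an illustration: `collect_video_units` and the `sample_course` outline are made up, and the real script reads these names from the live page instead of a dictionary. The `^视频` prefix test is the same one the scraper applies to unit titles.

```python
import re

# Unit titles that start with "视频" ("video") are the ones worth keeping.
VIDEO_TITLE = re.compile(r"^视频")


def collect_video_units(course):
    """Walk chapter -> lesson -> unit and return one tuple per video unit."""
    found = []
    for chapter_name, lessons in course.items():
        for lesson_name, units in lessons.items():
            for unit_title in units:
                if VIDEO_TITLE.match(unit_title):
                    found.append((chapter_name, lesson_name, unit_title))
    return found


# Hypothetical course outline, invented for this example only.
sample_course = {
    "Chapter 1": {
        "Lesson 1.1": ["视频：Sets", "文档：Slides"],
        "Lesson 1.2": ["视频：Relations"],
    },
    "Chapter 2": {
        "Lesson 2.1": ["测验：Quiz", "视频：Graphs"],
    },
}

for row in collect_video_units(sample_course):
    print(row)
```

In the Selenium version each level of the loop also has to click the matching entry and re-parse the page, because the lesson and unit lists only appear after the click.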

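Some difficulties I ran into:

The two dropdown boxes in the courseware section are hidden. Before simulating a click in the browser, the element's display value has to be changed with JavaScript.

An element's id attribute is different after every click, so I locate elements by the title attribute or by other attributes rather than by id.

I also could not scrape the page in headless mode. That is probably a problem with my own environment; I don't know whether anyone else has run into it.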
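The first two workarounds can be wrapped in small helpers. This is a sketch under assumptions: the helper names `xpath_literal`, `title_xpath`, and `show_dropdown_script` are hypothetical, and the `div.down` selector is taken from the script in this post. One thing the helpers add is quoting: pasting a title directly between single quotes in an XPath breaks as soon as the title itself contains a single quote.

```python
def xpath_literal(text):
    """Return text as an XPath 1.0 string literal, whatever quotes it contains."""
    if "'" not in text:
        return "'" + text + "'"
    if '"' not in text:
        return '"' + text + '"'
    # Both quote kinds are present. XPath 1.0 has no escape character,
    # so the pieces are glued back together with concat().
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join("'" + p + "'" for p in parts) + ")"


def title_xpath(tag, title):
    """XPath that locates <tag> by its title attribute instead of its id."""
    return "//" + tag + "[@title = " + xpath_literal(title) + "]"


def show_dropdown_script(index):
    """JavaScript that makes the index-th hidden div.down dropdown visible."""
    return 'document.querySelectorAll("div.down")[%d].style.display="block";' % index


print(title_xpath("div", "第一章"))
print(title_xpath("li", "视频：It's a graph"))
print(show_dropdown_script(0))
```

The returned strings are meant to be passed to the browser object, for example to `execute_script` for the JavaScript and to an XPath lookup for the locator.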

Code:

# -*- coding:utf-8 -*-
import json
import re
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

chrome_options = Options()
chrome_options.add_argument("--disable-gpu")
# Selenium 4 style: the chromedriver path is passed through a Service object
browser = webdriver.Chrome(service=Service(executable_path='G:\\chromedriver.exe'), options=chrome_options)
browser.get('https://www.icourse163.org/learn/NJTU-1002530017#/learn/content?type=detail&id=1004513821')  # target page
time.sleep(3)

video = {}  # maps "chapter lesson unit" to the video source URL

# Read the chapter names from the breadcrumb dropdown
soup = BeautifulSoup(browser.page_source, 'html.parser')
c_l = soup.find("div", attrs={"class": "j-breadcb f-fl"})
chapter_all = c_l.find("div", attrs={"class": "f-fl j-chapter"})
chapter = chapter_all.find_all("div", attrs={"class": "f-thide list"})

for chap in chapter:
    # The chapter dropdown is hidden, so make it visible before clicking
    js = 'document.querySelectorAll("div.down")[0].style.display="block";'
    browser.execute_script(js)
    chapter_name = chap.text
    # Locate by title, because the id attribute changes after every click
    a = browser.find_element(By.XPATH, "//div[@title = '" + chapter_name + "']")
    a.click()
    time.sleep(3)

    # Re-parse the page to get the lessons of the selected chapter
    soup1 = BeautifulSoup(browser.page_source, 'html.parser')
    c_l1 = soup1.find("div", attrs={"class": "j-breadcb f-fl"})
    lesson_all = c_l1.find("div", attrs={"class": "f-fl j-lesson"})
    lesson = lesson_all.find_all("div", attrs={"class": "f-thide list"})

    for les in lesson:
        # The lesson dropdown is the second hidden box
        js1 = 'document.querySelectorAll("div.down")[1].style.display="block";'
        browser.execute_script(js1)
        lesson_name = les.text
        b = browser.find_element(By.XPATH, "//div[@title = '" + lesson_name + "']")
        b.click()
        time.sleep(3)

        soup2 = BeautifulSoup(browser.page_source, 'html.parser')
        units = soup2.find_all("li", attrs={"title": re.compile(r"^视频")})  # video units only

        for unit in units:
            video_name = unit.get("title")
            video_link = browser.find_element(By.XPATH, "//li[@title = '" + video_name + "']")
            video_link.click()
            time.sleep(3)

            soup3 = BeautifulSoup(browser.page_source, 'html.parser')
            video_src = soup3.find("source")
            if video_src is None:
                continue  # no <source> tag for this unit, skip it
            video[chapter_name + " " + lesson_name + video_name] = video_src.get("src")

browser.quit()
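The script imports `json` but never uses it, and the collected `video` dictionary is lost when the program exits. A plausible next step, and this is an assumption rather than something stated in the post, is to write the mapping to a JSON file. The function names, file name, and sample entry below are made up for illustration.

```python
import json
import os
import tempfile


def save_video_links(video, path):
    """Write the title -> URL mapping to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese titles readable in the file
        json.dump(video, f, ensure_ascii=False, indent=2)


def load_video_links(path):
    """Read the mapping back, e.g. to feed a separate download step."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Invented entry standing in for a real scrape result.
sample = {"第一章 1.1 视频：集合": "https://example.com/video/1.mp4"}
path = os.path.join(tempfile.gettempdir(), "video_links.json")
save_video_links(sample, path)
print(load_video_links(path))
```

Keeping the links in a file also separates the slow browser pass from the download itself, so a failed download does not require scraping again.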

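The scraped result looks like this:

My writing isn't great, and I haven't been doing this for long. If you are interested, take your time reading through the source of the original page.

Selenium is simple and blunt, but it scrapes slowly compared with other approaches.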
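A good part of the run time comes from the fixed `time.sleep(3)` after every click. Selenium ships `WebDriverWait` for waiting on a condition instead of a fixed delay. The sketch below shows the same polling idea in plain Python so it runs without a browser; `wait_until` and `page_ready` are hypothetical names, and this is not the author's code.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.2):
    """Call condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Stand-in for a page that finishes loading on the third check.
calls = {"n": 0}


def page_ready():
    calls["n"] += 1
    return "loaded" if calls["n"] >= 3 else None


print(wait_until(page_ready, timeout=2.0, poll=0.01))
```

In the scraper, the condition could be a function that re-parses `browser.page_source` and returns the `<source>` tag once it exists, so each step continues as soon as the page is ready instead of always waiting three seconds.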

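Going forward I still need to learn how to submit POST requests. It would be great if someone experienced with crawlers could help me get started!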
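For reference, this is roughly what building a POST request looks like with only the standard library. It is a generic sketch: the URL and form fields are placeholders, not the course site's real interface, which the post does not describe. The request is built but not sent.

```python
import urllib.parse
import urllib.request


def build_post_request(url, form_fields, user_agent="Mozilla/5.0"):
    """Build (but do not send) a form-encoded POST request."""
    body = urllib.parse.urlencode(form_fields).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "User-Agent": user_agent,
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


# Hypothetical endpoint and field names, for illustration only.
req = build_post_request("https://example.com/api/lessons", {"courseId": "123", "page": 1})
print(req.get_method(), req.full_url)
print(req.data)
# Sending it would be: urllib.request.urlopen(req, timeout=10).read()
```

The real endpoint and parameters would have to be read from the browser's network panel while the course page loads.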

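I have only been learning web scraping for a short while and don't know much about computers. This is the first thing I have written, so corrections and criticism are welcome!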
