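For the full regular expression syntax, follow the hyperlink.
This program scrapes a test page built by 莫煩 (Mofan). The goal is to extract the links of the images (img tags) on the page, along with the hyperlinks stored in the href attributes of its a tags.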
# %%
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
# The page contains Chinese text, so decode the downloaded bytes as UTF-8
#html = urlopen("https://mofanpy.com/static/scraping/basic-structure.html").read().decode('utf-8')
html = urlopen("https://mofanpy.com/static/scraping/table.html").read().decode('utf-8')
print(html)
# %%
soup = BeautifulSoup(html, features='lxml')
#print(soup)
# Find every <img> tag whose src attribute matches the pattern (contains ".jpg")
img_links = soup.find_all("img", {"src": re.compile(r'.*?\.jpg')})
for link in img_links:
    print(link['src'])
# %%
# Find every <a> tag whose href attribute contains "/tutorials"
course_links = soup.find_all('a', {'href': re.compile(r'/tutorials.*')})
for link in course_links:
    print(link['href'])
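To see why the two patterns select the links they do, it can help to try them outside BeautifulSoup. The Beautiful Soup documentation says that a compiled regular expression passed as a filter is applied with its search() method, so a match anywhere in the attribute value counts. The sketch below is a minimal illustration under that assumption and uses only the re module. The attribute values are taken from the test page, except `/static/img/logo.png`, which is a hypothetical value added to show a non-match.

```python
import re

# The same two patterns as in the script, written as raw strings.
img_pattern = re.compile(r'.*?\.jpg')
course_pattern = re.compile(r'/tutorials.*')

# src values from the test page, plus one hypothetical .png path
# (not on the page) to show a value the pattern rejects.
srcs = [
    "/static/img/course_cover/tf.jpg",
    "/static/img/course_cover/rl.jpg",
    "/static/img/logo.png",  # hypothetical
]

# href values from the test page, including the home link "/".
hrefs = [
    "/",
    "/tutorials/data-manipulation/scraping/",
    "/tutorials/machine-learning/tensorflow/",
]

# search() succeeds if the pattern matches anywhere in the string.
matched_srcs = [s for s in srcs if img_pattern.search(s)]
matched_hrefs = [h for h in hrefs if course_pattern.search(h)]

print(matched_srcs)
print(matched_hrefs)
```

Because matching is done with search(), the leading `.*?` and the trailing `.*` do not change which values are selected here. A stricter pattern such as `r'\.jpg$'` would additionally require the value to end in `.jpg`.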
Output (the page HTML, followed by the image links and then the course links):
<!DOCTYPE html>
<html lang="cn">
<head>
<meta charset="UTF-8">
<title>爬虫练习 表格 table | 莫烦 Python</title>
<style>
img {
width: 250px;
}
table{
width:50%;
}
td{
margin:10px;
padding:15px;
}
</style>
</head>
<body>
<h1>表格 爬虫练习</h1>
<p>这是一个在 <a href="/" >莫烦 Python</a> 的 <a href="/tutorials/data-manipulation/scraping/" >爬虫教程</a>
里无敌简单的网页, 所有的 code 让你一目了然, 清晰无比.</p>
<br>
<table id="course-list">
<tr>
<th>
分类
</th><th>
名字
</th><th>
时长
</th><th>
预览
</th>
</tr>
<tr id="course1" class="ml">
<td>
机器学习
</td><td>
<a href="/tutorials/machine-learning/tensorflow/">
Tensorflow 神经网络</a>
</td><td>
2:00
</td><td>
<img src="/static/img/course_cover/tf.jpg">
</td>
</tr>
<tr id="course2" class="ml">
<td>
机器学习
</td><td>
<a href="/tutorials/machine-learning/reinforcement-learning/">
强化学习</a>
</td><td>
5:00
</td><td>
<img src="/static/img/course_cover/rl.jpg">
</td>
</tr>
<tr id="course3" class="data">
<td>
数据处理
</td><td>
<a href="/tutorials/data-manipulation/scraping/">
爬虫</a>
</td><td>
3:00
</td><td>
<img src="/static/img/course_cover/scraping.jpg">
</td>
</tr>
</table>
</body>
</html>
/static/img/course_cover/tf.jpg
/static/img/course_cover/rl.jpg
/static/img/course_cover/scraping.jpg
/tutorials/data-manipulation/scraping/
/tutorials/machine-learning/tensorflow/
/tutorials/machine-learning/reinforcement-learning/
/tutorials/data-manipulation/scraping/
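The links printed above are site-relative paths, and the scraping tutorial link appears twice because the page links to it in two places. As a possible follow-up step, the sketch below removes the duplicate and resolves each path against the page URL with urljoin from the standard library. The variable names are illustrative only, and the list is copied from the output above rather than scraped again.

```python
from urllib.parse import urljoin

# URL of the page that was scraped (same as in the script).
base_url = "https://mofanpy.com/static/scraping/table.html"

# Relative links exactly as printed by the script.
relative_links = [
    "/static/img/course_cover/tf.jpg",
    "/static/img/course_cover/rl.jpg",
    "/static/img/course_cover/scraping.jpg",
    "/tutorials/data-manipulation/scraping/",
    "/tutorials/machine-learning/tensorflow/",
    "/tutorials/machine-learning/reinforcement-learning/",
    "/tutorials/data-manipulation/scraping/",
]

# dict.fromkeys drops duplicates while keeping first-seen order.
unique_links = list(dict.fromkeys(relative_links))

# A path starting with "/" replaces the whole path of the base URL.
absolute_links = [urljoin(base_url, link) for link in unique_links]

for url in absolute_links:
    print(url)
```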
Source: 莫煩 (Mofan), "BeautifulSoup 解析網頁: 正則表達" (BeautifulSoup web page parsing: regular expressions)