Python Web Scraping: Regular Expressions

by 龍冥

For a complete regular expression reference, follow the hyperlink.

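This program scrapes a test page built by 莫煩 (Mofan). The goal is to extract the image (img) links on the page, along with the hyperlinks stored in the href attribute of its <a> tags.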

# %%
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

# The page contains Chinese text, so decode the response bytes as UTF-8
# html = urlopen("https://mofanpy.com/static/scraping/basic-structure.html").read().decode('utf-8')
html = urlopen("https://mofanpy.com/static/scraping/table.html").read().decode('utf-8')
print(html)

# %%
soup = BeautifulSoup(html, features='lxml')
# print(soup)
# Keep <img> tags whose src contains ".jpg"; the raw string keeps \. as a regex escape
img_links = soup.find_all("img", {"src": re.compile(r'.*?\.jpg')})
for link in img_links:
    print(link['src'])

# %%
# Keep <a> tags whose href contains "/tutorials"
course_links = soup.find_all('a', {'href': re.compile(r'/tutorials.*')})
for link in course_links:
    print(link['href'])
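
The two patterns in the script are fairly loose. To my knowledge, BeautifulSoup tests a compiled pattern against an attribute value with search(), so a match anywhere in the string is enough. The sketch below reproduces that behavior with the re module alone and compares it with anchored variants. The sample src and href values and the two "strict" patterns are my own assumptions for illustration, not part of the original tutorial.

```python
import re

# The two patterns used in the scraper, written as raw strings.
jpg_pattern = re.compile(r'.*?\.jpg')
tutorial_pattern = re.compile(r'/tutorials.*')

# Hypothetical attribute values, chosen to expose edge cases.
srcs = [
    "/static/img/course_cover/tf.jpg",
    "/static/img/logo.png",
    "/static/img/photo.jpg.bak",
]
hrefs = [
    "/",
    "/tutorials/machine-learning/tensorflow/",
    "/about/tutorials/",
]

# search() succeeds if the pattern matches anywhere in the string.
loose_jpg = [s for s in srcs if jpg_pattern.search(s)]
loose_tutorials = [h for h in hrefs if tutorial_pattern.search(h)]

# Anchored variants (assumed here): src must end in .jpg,
# href must start with /tutorials/.
strict_jpg_pattern = re.compile(r'\.jpg$')
strict_tutorial_pattern = re.compile(r'^/tutorials/')
strict_jpg = [s for s in srcs if strict_jpg_pattern.search(s)]
strict_tutorials = [h for h in hrefs if strict_tutorial_pattern.search(h)]

print("loose jpg:       ", loose_jpg)
print("strict jpg:      ", strict_jpg)
print("loose tutorials: ", loose_tutorials)
print("strict tutorials:", strict_tutorials)
```

On the test page both versions select the same elements, so the anchored patterns only matter on pages with messier attribute values.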

Output of the script

<!DOCTYPE html>
<html lang="cn">
<head>
        <meta charset="UTF-8">
        <title>爬虫练习 表格 table | 莫烦 Python</title>

        <style>
        img {
                width: 250px;
        }
        table{
                width:50%;
        }
        td{
                margin:10px;
                padding:15px;
        }
        </style>
</head>
<body>

<h1>表格 爬虫练习</h1>

<p>这是一个在 <a href="/" >莫烦 Python</a> 的 <a href="/tutorials/data-manipulation/scraping/" >爬虫教程</a>
        里无敌简单的网页, 所有的 code 让你一目了然, 清晰无比.</p>

<br>
<table id="course-list">
        <tr>
                <th>
                        分类
                </th><th>
                        名字
                </th><th>
                        时长
                </th><th>
                        预览
                </th>
        </tr>

        <tr id="course1" class="ml">
                <td>
                        机器学习
                </td><td>
                        <a href="/tutorials/machine-learning/tensorflow/">
                                Tensorflow 神经网络</a>
                </td><td>
                        2:00
                </td><td>
                        <img src="/static/img/course_cover/tf.jpg">
                </td>
        </tr>

        <tr id="course2" class="ml">
                <td>
                        机器学习
                </td><td>
                        <a href="/tutorials/machine-learning/reinforcement-learning/">
                                强化学习</a>
                </td><td>
                        5:00
                </td><td>
                        <img src="/static/img/course_cover/rl.jpg">
                </td>
        </tr>

        <tr id="course3" class="data">
                <td>
                        数据处理
                </td><td>
                        <a href="/tutorials/data-manipulation/scraping/">
                                爬虫</a>
                </td><td>
                        3:00
                </td><td>
                        <img src="/static/img/course_cover/scraping.jpg">
                </td>
        </tr>

</table>

</body>
</html>
/static/img/course_cover/tf.jpg
/static/img/course_cover/rl.jpg
/static/img/course_cover/scraping.jpg
/tutorials/data-manipulation/scraping/
/tutorials/machine-learning/tensorflow/
/tutorials/machine-learning/reinforcement-learning/
/tutorials/data-manipulation/scraping/
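
The extracted paths are relative to the site root, and /tutorials/data-manipulation/scraping/ appears twice because the page links to it both in the intro paragraph and in the table. If the links are to be fetched later, a possible follow-up step is to resolve them against the page URL and drop repeats. This is a minimal sketch under that assumption; the helper name to_absolute_unique is hypothetical, and the href list is copied from the output above.

```python
from urllib.parse import urljoin

# URL of the page that was scraped; relative paths are resolved against it.
BASE_URL = "https://mofanpy.com/static/scraping/table.html"


def to_absolute_unique(base_url, paths):
    """Resolve each path against base_url and drop repeats, keeping first-seen order."""
    seen = set()
    result = []
    for path in paths:
        absolute = urljoin(base_url, path)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# The href values printed by the script, in the same order.
hrefs = [
    "/tutorials/data-manipulation/scraping/",
    "/tutorials/machine-learning/tensorflow/",
    "/tutorials/machine-learning/reinforcement-learning/",
    "/tutorials/data-manipulation/scraping/",
]

course_urls = to_absolute_unique(BASE_URL, hrefs)
for url in course_urls:
    print(url)
```

A set alone would also remove the duplicate, but pairing it with a list keeps the links in the order they appear on the page.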

Source: 莫煩, "BeautifulSoup 解析網頁: 正則表達" (Parsing Web Pages with BeautifulSoup: Regular Expressions)

Copyright © 2024 龍冥