Python Reverse-Lookup Compendium | Practical Web Scraping Techniques from Beginner to Professional Use (Web Scraping Edition)

Web scraping is a very useful skill for data collection and automation. This article explains web scraping with Python in detail, from the basics to more advanced techniques.

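Fetching a web page with the requests module
Parsing HTML (BeautifulSoup)
Extracting links from a page
Extracting data from specific tags
Getting the page title
Fetching data from a JSON API
Working with cookies
Setting header information
Browser automation with selenium
Fetching data from dynamic pages
Summary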

Fetching a web page with the requests module

The requests module makes it easy to fetch the HTML of a web page.

import requests

url = "https://example.com"
# Set a timeout so the call cannot hang indefinitely
response = requests.get(url, timeout=10)
print(response.text)  # Print the page's HTML

Tip: You can check the HTTP status code with response.status_code.
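One way to act on the status code is sketched below. The helper name fetch_html and the 10-second default timeout are illustrative assumptions, not part of the original example; raise_for_status() is the requests method that turns a 4xx or 5xx status into a requests.HTTPError.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the page HTML, raising requests.HTTPError on a 4xx/5xx status."""
    # fetch_html is a hypothetical helper name used only for this sketch
    response = requests.get(url, timeout=timeout)
    print(f"Status code: {response.status_code}")
    # raise_for_status() does nothing on success and raises on error statuses
    response.raise_for_status()
    return response.text
```

Calling fetch_html("https://example.com") would print the status code and return the HTML, or raise an exception instead of silently returning an error page.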

Parsing HTML (BeautifulSoup)

To parse the HTML you have fetched, use BeautifulSoup from the bs4 package.

from bs4 import BeautifulSoup

html_content = "<html><head><title>Example</title></head><body></body></html>"
soup = BeautifulSoup(html_content, 'html.parser')
print(soup.title.text)  # Prints the page title: Example

Tip: If you install the lxml package and pass 'lxml' as the parser name instead of 'html.parser', parsing is generally faster.
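A minimal sketch of switching parsers, assuming lxml may or may not be installed on the machine. BeautifulSoup raises FeatureNotFound when the requested parser is unavailable, so the code falls back to the built-in parser in that case; the variable parser_used exists only for this illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html_content = "<html><head><title>Example</title></head><body></body></html>"

try:
    # Requires the third-party lxml package (pip install lxml)
    soup = BeautifulSoup(html_content, "lxml")
    parser_used = "lxml"
except FeatureNotFound:
    # lxml is not installed; fall back to the standard-library parser
    soup = BeautifulSoup(html_content, "html.parser")
    parser_used = "html.parser"

print(parser_used, soup.title.string)
```

The rest of the scraping code stays the same whichever parser ends up being used.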

Extracting links from a page

This section shows how to collect every link on a page.

from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <a href="https://example.com/page1">Page 1</a>
    <a href="https://example.com/page2">Page 2</a>
  </body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)  # ['https://example.com/page1', 'https://example.com/page2']

Going further: urljoin from urllib.parse converts relative links into absolute URLs.
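A minimal sketch of the urljoin approach, using only the standard library. The base URL and the href values are made-up examples standing in for what find_all('a', href=True) would return from a real page.

```python
from urllib.parse import urljoin

# Assumption: the URL of the page the links were scraped from
base_url = "https://example.com/docs/index.html"

# Hypothetical href values, as they might appear in <a> tags
hrefs = [
    "page1.html",                   # relative to the current directory
    "/about",                       # relative to the site root
    "../img/logo.png",              # relative to the parent directory
    "https://other.example.org/x",  # already absolute, left unchanged
]

absolute_links = [urljoin(base_url, href) for href in hrefs]
for link in absolute_links:
    print(link)
```

Resolving every href against the URL of the page it came from gives links that can be passed straight to requests.get.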

Extracting data from specific tags

This section explains how to extract the data contained in a specific tag.

from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <div class="content">Hello, World!</div>
  </body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
content = soup.find('div', class_='content').text
print(content)  # Hello, World!

Tip: find returns only the first match, while find_all returns every match.
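The difference can be seen in a small sketch. The HTML and the class name item are made up for illustration. Note that find returns None when nothing matches, so calling .text on the result without checking would raise an AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with several elements sharing one class
html = """
<ul>
  <li class="item">Apple</li>
  <li class="item">Banana</li>
  <li class="item">Cherry</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find: a single Tag for the first match
first_text = soup.find("li", class_="item").get_text(strip=True)

# find_all: a list-like result containing every match
all_texts = [li.get_text(strip=True) for li in soup.find_all("li", class_="item")]

# find returns None when there is no match, so check before using the result
missing = soup.find("div", class_="content")

print(first_text)  # Apple
print(all_texts)   # ['Apple', 'Banana', 'Cherry']
print(missing)     # None
```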

Getting the page title

This section shows how to get the title of a page.

from bs4 import BeautifulSoup

html = """
<html>
  <head>
    <title>My Page</title>
  </head>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.string)  # My Page

Fetching data from a JSON API

Fetch data from an API that returns JSON and parse it.

import requests

url = "https://jsonplaceholder.typicode.com/posts/1"
response = requests.get(url, timeout=10)
data = response.json()  # Parse the JSON body into a dict
print(data['title'])  # Print the title field of the post

Going further: You can send data with requests.post.
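A minimal sketch of sending data with requests.post. It assumes the /posts endpoint of the same JSONPlaceholder test API accepts POST requests; the helper name create_post and the payload shape are hypothetical and used only for illustration.

```python
import requests

# Assumption: JSONPlaceholder's /posts endpoint, used here only for illustration
API_URL = "https://jsonplaceholder.typicode.com/posts"


def create_post(payload: dict, timeout: float = 10.0) -> dict:
    """POST the payload as a JSON body and return the decoded JSON reply."""
    # json= serializes the dict and sets the Content-Type: application/json header
    response = requests.post(API_URL, json=payload, timeout=timeout)
    response.raise_for_status()  # Raise on 4xx/5xx instead of parsing an error body
    return response.json()
```

For an HTML form rather than a JSON API, passing data= instead of json= sends the values form-encoded.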

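Working with cookies

A requests.Session stores cookies and sends them with every request it makes, which keeps a session alive across requests.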

import requests

url = "https://example.com"
session = requests.Session()
# Cookies stored on the session are sent with every request it makes
session.cookies.set('sessionid', '123456')
response = session.get(url, timeout=10)
print(response.text)

Setting header information

Send a request with custom headers.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get("https://example.com", headers=headers, timeout=10)
print(response.text)

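Tip: Some sites reject requests that carry the default python-requests User-Agent, so setting a browser-like User-Agent can keep those requests from being blocked.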

Browser automation with selenium

Use selenium to get data from dynamic web pages.

from selenium import webdriver
from selenium.webdriver.common.by import By

# Initialize the WebDriver (starts a Chrome browser)
browser = webdriver.Chrome()
browser.get("https://example.com")

# Find the first <h1> element and print its text
element = browser.find_element(By.TAG_NAME, "h1")
print(element.text)

browser.quit()

Going further: selenium can also automate button clicks and form submissions.

Fetching data from dynamic pages

Get data that is generated by JavaScript.

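from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

browser = webdriver.Chrome()
try:
    browser.get("https://example.com/dynamic")
    # Wait up to 10 seconds for JavaScript to add the element to the page
    element = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.ID, "dynamic-data"))
    )
    print(element.text)
finally:
    browser.quit()  # Always close the browser, even on errors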

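Summary

Web scraping with Python covers a wide range of approaches, from basic HTTP requests to fetching data from dynamic pages. Mastering these techniques makes efficient data collection and automation possible. Start by trying the basic modules, then step up to the more advanced tools!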