Introduction
Web scraping extracts data from websites. Always respect robots.txt and terms of service.
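One way to act on the robots.txt advice is to check a URL against the site's rules before requesting it. The sketch below uses the standard-library `urllib.robotparser`. The robots.txt content, the `my-scraper` user agent, and the URL paths are all made up for illustration, and the rules are inlined so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "my-scraper" is an assumed user-agent name, not a real crawler
allowed = parser.can_fetch("my-scraper", "https://example.com/articles/1")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data")

print(allowed, blocked)
```

Against a live site, `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` would load the real rules instead of the inlined text. robots.txt does not cover terms of service, which still need to be read separately.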
Basic Scraping
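import requests
from bs4 import BeautifulSoup
# Fetch the page; the timeout keeps a stalled server from hanging the script
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
soup = BeautifulSoup(response.text, "html.parser")
# Find elements
h1 = soup.find("h1")  # find() returns None when nothing matches
title = h1.get_text(strip=True) if h1 else None
links = soup.find_all("a")
paragraphs = soup.find_all("p", class_="content")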
CSS Selectors
# By CSS selector
elements = soup.select("div.container > p")
first_item = soup.select_one(".item")
# With attributes
images = soup.select('img[alt*="profile"]')
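To see what each of these selectors matches, it can help to run them against a small inline document. The markup below is hypothetical and exists only for this example. It is a minimal sketch, not a template for any particular site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this example
HTML = """
<div class="container">
  <p>First paragraph</p>
  <p>Second paragraph</p>
  <section><p>Nested paragraph</p></section>
</div>
<ul>
  <li class="item">Alpha</li>
  <li class="item">Beta</li>
</ul>
<img src="a.png" alt="user profile photo">
<img src="b.png" alt="site logo">
"""

soup = BeautifulSoup(HTML, "html.parser")

# ">" matches direct children only, so the <p> inside <section> is skipped
direct = [p.get_text() for p in soup.select("div.container > p")]

# select_one() returns the first match, or None when nothing matches
first_item = soup.select_one(".item")
first_text = first_item.get_text() if first_item else None

# [alt*="profile"] matches any alt attribute containing the substring "profile"
profile_srcs = [img["src"] for img in soup.select('img[alt*="profile"]')]

print(direct, first_text, profile_srcs)
```

Checking the `select_one()` result for `None` before reading from it avoids an `AttributeError` on pages where the element is missing.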
Handling Dynamic Content
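# For JavaScript-heavy sites, use Selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Chrome()
try:
    driver.get("https://example.com")
    # Wait up to 10 seconds for the content to appear in the DOM
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".dynamic-content"))
    )
    content = element.text
finally:
    driver.quit()  # close the browser even if the wait times out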
Practice Problems
- Extract article titles from a news site
- Parse table data into a pandas DataFrame
- Follow pagination to scrape multiple pages
- Download images from a gallery
- Handle a login form before scraping the pages behind it