Crawling¶
Crawling (web scraping) extracts desired content from websites programmatically.
Examples: periodically collecting real-time trending searches, popular product information, and so on.
In [ ]:
# Import libraries
import requests
from bs4 import BeautifulSoup
# Fetch the webpage
res = requests.get('http://v.media.daum.net/v/20170615203441266')
# Parse the webpage
soup = BeautifulSoup(res.content, 'html.parser')
# Extract the required data
mydata = soup.find('title')
print(mydata.get_text())
Required Libraries¶
- requests: a library for fetching web pages over HTTP
- bs4 (BeautifulSoup): a library for parsing fetched HTML so that data can be extracted from it
In [7]:
# Import libraries
import requests
from bs4 import BeautifulSoup
In [ ]:
# Fetch the webpage
# res = requests.get('http://v.media.daum.net/v/20170615203441266')
res = requests.get('http://wns0428.synology.me:7503/')
res.content
HTML Structure¶
A library like BeautifulSoup converts the fetched HTML (a plain string) into a navigable parse tree of Python objects.
- The parsed HTML is stored in the variable soup.
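To see what that parse tree looks like without any network access, here is a minimal sketch that parses an inline HTML string (the markup itself is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML string standing in for res.content
html = "<html><head><title>Sample Page</title></head><body><h3>Hello</h3></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# Tags in the tree can be reached as attributes of soup
print(soup.title)         # <title>Sample Page</title>
print(soup.title.name)    # tag name: 'title'
print(soup.title.string)  # text inside the tag: 'Sample Page'
```

The same navigation works on the soup built from res.content above; only the markup differs.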
In [ ]:
# Parse the webpage
soup = BeautifulSoup(res.content, 'html.parser')
soup
Extracting Required Data¶
This is the key step!
- Use the soup.find() method to locate the desired tag.
- Call .get_text() on the result to retrieve its text content.
※ A basic understanding of HTML is necessary.
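The two steps above can be sketched on an inline HTML string (the headline text here is invented, so the example runs without fetching the article):

```python
from bs4 import BeautifulSoup

# Hypothetical article markup
html = "<html><body><h3 class='tit_view'>Breaking news headline</h3></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# find('h3') returns the first matching <h3> tag, or None if there is none
data = soup.find('h3')
print(data.get_text())  # Breaking news headline
```

Note that find() returns the tag object itself; get_text() strips the tags and leaves only the text.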
In [ ]:
data = soup.find('h3')
data
In [ ]:
data.get_text()
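When a page contains several matching tags, find() alone is not enough. A sketch of narrowing the search, using made-up HTML and attribute values for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical page with multiple candidate tags
html = """
<div>
  <h3 id="main" class="headline">Top story</h3>
  <h3 class="headline">Second story</h3>
  <p class="headline">Not a heading</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# find_all() returns every match; class_ has a trailing underscore
# because 'class' is a reserved word in Python
titles = [t.get_text() for t in soup.find_all('h3', class_='headline')]
print(titles)  # ['Top story', 'Second story']

# ids are unique within a page, so find() suffices
print(soup.find(id='main').get_text())  # Top story
```

Combining a tag name with class or id filters is usually how a specific article title or product name is pinpointed on a real page.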