Crawling¶
Crawling (web scraping) extracts desired content from websites programmatically.
Examples: periodically collecting real-time trending searches, popular product information, and so on.
In [ ]:
# Import libraries
import requests
from bs4 import BeautifulSoup
# Fetch the webpage
res = requests.get('http://v.media.daum.net/v/20170615203441266')
# Parse the webpage
soup = BeautifulSoup(res.content, 'html.parser')
# Extract the required data
mydata = soup.find('title')
print(mydata.get_text())
Required Libraries¶
- requests: a library for fetching web pages over HTTP
- bs4 (BeautifulSoup): a library for parsing fetched HTML so that data can be extracted from it
In [7]:
# Import libraries
import requests
from bs4 import BeautifulSoup
In [ ]:
# Fetch the webpage
# res = requests.get('http://v.media.daum.net/v/20170615203441266')
res = requests.get('http://wns0428.synology.me:7503/')
res.content
HTML Structure¶
A library like BeautifulSoup converts the fetched HTML (a plain string) into a navigable parse tree of Python objects.
- The parsed HTML is stored in the variable soup.
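To see what that parse tree looks like without any network access, here is a minimal sketch that parses an inline HTML string (the markup itself is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML string standing in for res.content
html = "<html><head><title>Sample Page</title></head><body><h3>Hello</h3></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# Tags in the tree can be reached as attributes of soup
print(soup.title)         # <title>Sample Page</title>
print(soup.title.name)    # tag name: 'title'
print(soup.title.string)  # text inside the tag: 'Sample Page'
```

The same navigation works on the soup built from res.content above; only the markup differs.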
In [ ]:
# Parse the webpage
soup = BeautifulSoup(res.content, 'html.parser')
soup
Extracting Required Data¶
This is the key step!
- Use the soup.find() method to locate the desired tag.
- Call .get_text() on the result to retrieve its text content.
※ A basic understanding of HTML is necessary.
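The two steps above can be sketched on an inline HTML string (the headline text here is invented, so the example runs without fetching the article):

```python
from bs4 import BeautifulSoup

# Hypothetical article markup
html = "<html><body><h3 class='tit_view'>Breaking news headline</h3></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# find('h3') returns the first matching <h3> tag, or None if there is none
data = soup.find('h3')
print(data.get_text())  # Breaking news headline
```

Note that find() returns the tag object itself; get_text() strips the tags and leaves only the text.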
In [ ]:
data = soup.find('h3')
data
In [ ]:
data.get_text()
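When a page contains several matching tags, find() alone is not enough. A sketch of narrowing the search, using made-up HTML and attribute values for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical page with multiple candidate tags
html = """
<div>
  <h3 id="main" class="headline">Top story</h3>
  <h3 class="headline">Second story</h3>
  <p class="headline">Not a heading</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# find_all() returns every match; class_ has a trailing underscore
# because 'class' is a reserved word in Python
titles = [t.get_text() for t in soup.find_all('h3', class_='headline')]
print(titles)  # ['Top story', 'Second story']

# ids are unique within a page, so find() suffices
print(soup.find(id='main').get_text())  # Top story
```

Combining a tag name with class or id filters is usually how a specific article title or product name is pinpointed on a real page.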