Various Crawling Techniques: CSS Selector
In [2]:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://kim-william.github.io/Personal_Python_Projects/crawling/testhtml/crawlingtest.html')
soup = BeautifulSoup(res.content,'html.parser')
items = soup.select('ul')
for item in items:
    print(item.get_text())
(beginner) - Introduction to the Class
(beginner) - Preparing the Necessary Tools for Blog Development
(beginner) - Setting Up GitHub Pages to Create Your First Blog Page
(beginner) - Creating a Simple Web Page
(beginner) - Applying a Stylish Theme
(beginner) - Understanding Markdown Basics and Creating Your Own Blog Page
(beginner) - Mastering Various Markdown Techniques to Customize Your Blog Page
(Beginner) - Introduction to the Automation Programs [2]
(Beginner) - Demonstrating the Installation of Necessary Programs [5]
(Beginner) - Creating Data in Excel Files [9]
(Beginner) - Beautifying Excel Files! [8]
(Beginner) - Running Python Programs Automatically at Regular Intervals [7]
(Beginner) - Writing Messages on Slack Using Python [40]
(Beginner) - Checking Website Changes Periodically and Sending Alerts via Messenger [12]
(Beginner) - Using the Naver API to Post on Blogs [42]
(Intermediate) - Automatically Promoting Product Information Retrieved from the Coupang Partners API on Naver Blogs/Twitter [412]
In [3]:
res = requests.get('https://kim-william.github.io/Personal_Python_Projects/crawling/testhtml/crawlingtest.html')
soup = BeautifulSoup(res.content,'html.parser')
items = soup.select('ul li')
for item in items:
    print(item.get_text())
(beginner) - Introduction to the Class
(beginner) - Preparing the Necessary Tools for Blog Development
(beginner) - Setting Up GitHub Pages to Create Your First Blog Page
(beginner) - Creating a Simple Web Page
(beginner) - Applying a Stylish Theme
(beginner) - Understanding Markdown Basics and Creating Your Own Blog Page
(beginner) - Mastering Various Markdown Techniques to Customize Your Blog Page
(Beginner) - Introduction to the Automation Programs [2]
(Beginner) - Demonstrating the Installation of Necessary Programs [5]
(Beginner) - Creating Data in Excel Files [9]
(Beginner) - Beautifying Excel Files! [8]
(Beginner) - Running Python Programs Automatically at Regular Intervals [7]
(Beginner) - Writing Messages on Slack Using Python [40]
(Beginner) - Checking Website Changes Periodically and Sending Alerts via Messenger [12]
(Beginner) - Using the Naver API to Post on Blogs [42]
(Intermediate) - Automatically Promoting Product Information Retrieved from the Coupang Partners API on Naver Blogs/Twitter [412]
In [4]:
res = requests.get('https://kim-william.github.io/Personal_Python_Projects/crawling/testhtml/crawlingtest.html')
soup = BeautifulSoup(res.content,'html.parser')
items = soup.select('ul a')
for item in items:
    print(item.get_text())
(beginner) - Introduction to the Class
(beginner) - Preparing the Necessary Tools for Blog Development
(beginner) - Setting Up GitHub Pages to Create Your First Blog Page
(beginner) - Creating a Simple Web Page
(beginner) - Applying a Stylish Theme
(beginner) - Understanding Markdown Basics and Creating Your Own Blog Page
(beginner) - Mastering Various Markdown Techniques to Customize Your Blog Page
(Beginner) - Introduction to the Automation Programs [2]
(Beginner) - Demonstrating the Installation of Necessary Programs [5]
(Beginner) - Creating Data in Excel Files [9]
(Beginner) - Beautifying Excel Files! [8]
(Beginner) - Running Python Programs Automatically at Regular Intervals [7]
(Beginner) - Writing Messages on Slack Using Python [40]
(Beginner) - Checking Website Changes Periodically and Sending Alerts via Messenger [12]
(Beginner) - Using the Naver API to Post on Blogs [42]
(Intermediate) - Automatically Promoting Product Information Retrieved from the Coupang Partners API on Naver Blogs/Twitter [412]
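Once the <a> tags are selected, their attributes can be read as well as their text. The following is a minimal offline sketch of that idea; the HTML snippet, the id "sample_list", and the example.com URLs are invented for illustration and are not taken from the test page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this sketch; it is not the real test page.
html = """
<ul id="sample_list">
  <li class="course"><a href="https://example.com/a">Course A</a></li>
  <li class="course"><a href="https://example.com/b">Course B</a></li>
  <li class="course"><a>Course C (no link)</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

links = []
for a in soup.select('ul a'):
    # .get() returns None instead of raising KeyError when the attribute is missing
    links.append((a.get_text(strip=True), a.get('href')))

for title, href in links:
    print(title, href)
```

Using a.get('href') rather than a['href'] keeps the loop from failing on anchors that have no href attribute.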
In [27]:
res = requests.get('https://kim-william.github.io/Personal_Python_Projects/crawling/testhtml/crawlingtest.html')
soup = BeautifulSoup(res.content,'html.parser')
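# '>' is the child combinator: 'ul > a' matches only <a> tags that are direct children of a <ul>.
# This cell prints nothing, which suggests the links on this page are nested deeper (for example inside <li>).
items = soup.select('ul > a')
for item in items:
    print(item.get_text())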
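The difference between the descendant combinator (a space) and the child combinator (>) can be checked offline. This is a small sketch on a made-up HTML snippet, assuming the links sit inside <li> tags as they appear to on the test page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this sketch.
html = """
<ul>
  <li><a href="#one">First</a></li>
  <li><a href="#two">Second</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# A space matches <a> at any depth below <ul>
descendants = [a.get_text() for a in soup.select('ul a')]
# '>' matches only <a> tags whose parent is the <ul> itself; there are none here
direct = [a.get_text() for a in soup.select('ul > a')]
# Spelling out every level makes the child combinator match again
via_li = [a.get_text() for a in soup.select('ul > li > a')]

print(descendants)
print(direct)
print(via_li)
```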
In [6]:
res = requests.get('https://kim-william.github.io/Personal_Python_Projects/crawling/testhtml/crawlingtest.html')
soup = BeautifulSoup(res.content,'html.parser')
items = soup.select('.course')
for item in items:
    print(item.get_text())
(beginner) - Introduction to the Class
(beginner) - Preparing the Necessary Tools for Blog Development
(beginner) - Setting Up GitHub Pages to Create Your First Blog Page
(beginner) - Creating a Simple Web Page
(beginner) - Applying a Stylish Theme
(beginner) - Understanding Markdown Basics and Creating Your Own Blog Page
(beginner) - Mastering Various Markdown Techniques to Customize Your Blog Page
(Beginner) - Introduction to the Automation Programs [2]
(Beginner) - Demonstrating the Installation of Necessary Programs [5]
(Beginner) - Creating Data in Excel Files [9]
(Beginner) - Beautifying Excel Files! [8]
(Beginner) - Running Python Programs Automatically at Regular Intervals [7]
(Beginner) - Writing Messages on Slack Using Python [40]
(Beginner) - Checking Website Changes Periodically and Sending Alerts via Messenger [12]
(Beginner) - Using the Naver API to Post on Blogs [42]
(Intermediate) - Automatically Promoting Product Information Retrieved from the Coupang Partners API on Naver Blogs/Twitter [412]
In [7]:
res = requests.get('https://kim-william.github.io/Personal_Python_Projects/crawling/testhtml/crawlingtest.html')
soup = BeautifulSoup(res.content,'html.parser')
items = soup.select('#start')
for item in items:
    print(item.get_text())
(beginner) - Introduction to the Class
(beginner) - Preparing the Necessary Tools for Blog Development
(beginner) - Setting Up GitHub Pages to Create Your First Blog Page
(beginner) - Creating a Simple Web Page
(beginner) - Applying a Stylish Theme
(beginner) - Understanding Markdown Basics and Creating Your Own Blog Page
(beginner) - Mastering Various Markdown Techniques to Customize Your Blog Page
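The two cells above use a class selector (.course) and an ID selector (#start). A rough offline sketch of how they differ, on an invented snippet whose ids and text are assumptions rather than the real page content:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this sketch.
html = """
<ul id="start">
  <li class="course">Intro</li>
  <li class="course">Setup</li>
</ul>
<ul id="next">
  <li class="course">Advanced</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# '.course' matches every element carrying that class, wherever it is
by_class = soup.select('.course')
# '#start' matches the element with that id; select() still returns a list
by_id = soup.select('#start')
# The two can be combined to search for a class only inside one element
scoped = [li.get_text() for li in soup.select('#start .course')]

print(len(by_class), len(by_id), scoped)
```

Because select() always returns a list, even an ID selector has to be looped over or indexed before get_text() can be called on the element.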
In [8]:
res = requests.get('https://kim-william.github.io/Personal_Python_Projects/crawling/testhtml/crawlingtest.html')
soup = BeautifulSoup(res.content,'html.parser')
items = soup.select('li.course.paid')
for item in items:
    print(item.get_text())
(Intermediate) - Automatically Promoting Product Information Retrieved from the Coupang Partners API on Naver Blogs/Twitter [412]
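Writing class names back to back with no space, as in li.course.paid, requires a single element to carry all of them. The sketch below illustrates this on an invented snippet and also shows two nearby variants that are easy to confuse with it.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this sketch.
html = """
<ul>
  <li class="course">Free course</li>
  <li class="course paid">Paid course</li>
  <li class="paid">Paid item that is not a course</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# Both classes must be present on the same <li>; their order does not matter
both = [li.get_text() for li in soup.select('li.course.paid')]
swapped = [li.get_text() for li in soup.select('li.paid.course')]
# A comma means "either selector"; results come back in document order
either = [li.get_text() for li in soup.select('li.course, li.paid')]
# A space turns 'paid' into a tag name searched below li.course, so nothing matches
spaced = soup.select('li.course paid')

print(both, swapped, either, spaced)
```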
In [9]:
res = requests.get('https://kim-william.github.io/Personal_Python_Projects/crawling/testhtml/crawlingtest.html')
soup = BeautifulSoup(res.content,'html.parser')
# The selector ul#hobby_course_list li.course is used to select all <li> tags with the class "course" within the <ul> tag that has the ID "hobby_course_list".
# This retrieves multiple elements.
items = soup.select('ul#hobby_course_list li.course')
for item in items:
    print(item.get_text())
(beginner) - Introduction to the Class
(beginner) - Preparing the Necessary Tools for Blog Development
(beginner) - Setting Up GitHub Pages to Create Your First Blog Page
(beginner) - Creating a Simple Web Page
(beginner) - Applying a Stylish Theme
(beginner) - Understanding Markdown Basics and Creating Your Own Blog Page
(beginner) - Mastering Various Markdown Techniques to Customize Your Blog Page
In [12]:
res = requests.get('https://kim-william.github.io/Personal_Python_Projects/crawling/testhtml/crawlingtest.html')
soup = BeautifulSoup(res.content,'html.parser')
# The selector ul#dev_course_list > li.course.paid matches <li> tags that have both classes "course" and "paid" and are direct children of the <ul> tag with the ID "dev_course_list".
# The select_one() method returns only the first matching element (or None if nothing matches).
item = soup.select_one('ul#dev_course_list > li.course.paid')
print(item.get_text())
(Intermediate) - Automatically Promoting Product Information Retrieved from the Coupang Partners API on Naver Blogs/Twitter [412]
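select_one() returns None when nothing matches, so calling get_text() on the result directly can raise AttributeError. One possible way to guard against that is sketched below; the helper text_or_default, the class "sold_out", and the HTML snippet are hypothetical names introduced only for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this sketch.
html = """
<ul id="dev_course_list">
  <li class="course">Free course</li>
  <li class="course paid">Paid course</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')


def text_or_default(element, default=''):
    """Return the stripped text of element, or default when element is None."""
    return element.get_text(strip=True) if element is not None else default


item = soup.select_one('ul#dev_course_list > li.course.paid')
# No <li> carries the class "sold_out", so this lookup yields None
missing = soup.select_one('ul#dev_course_list > li.course.sold_out')

print(text_or_default(item))
print(text_or_default(missing, 'n/a'))
```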
Elements returned by find() or select() are themselves searchable: you can call find(), find_all(), select(), or select_one() on them to search only within that element.
In [26]:
res = requests.get('https://kim-william.github.io/Personal_Python_Projects/crawling/testhtml/crawlingtest.html')
soup = BeautifulSoup(res.content,'html.parser')
# Select all <tr> tags (table rows) from the parsed HTML
items = soup.select('tr')
for tr in items:
    # Find all <td> tags (table cells) within the current row
    columns = tr.find_all('td')
    # Join the text of every cell with commas; join() adds no leading or trailing separator,
    # so nothing has to be stripped afterwards
    row_str = ','.join(column.get_text() for column in columns)
    print(row_str)
Schedule,Curriculum Title,Difficulty Level
5.1 ~ 6.15,Creating Your Own Stylish Blog Site,Beginner
6.16 ~ 7.31,First Steps in Python and Data Science (Mastering IT Fundamentals),Intermediate
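Joining cells with a plain comma breaks down when a cell itself contains a comma. As a possible alternative, the rows can be handed to Python's standard csv module, which quotes such cells. This is a sketch on an invented table; the cell values are assumptions and not the real page content.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this sketch.
html = """
<table>
  <tr><td>Schedule</td><td>Title</td></tr>
  <tr><td>5.1 ~ 6.15</td><td>Blog, with a comma</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# Nested search: select() on each <tr> looks for <td> tags inside that row only
rows = [[td.get_text(strip=True) for td in tr.select('td')]
        for tr in soup.select('tr')]

# Write to an in-memory buffer; a real file opened with newline='' works the same way
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```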