Various Crawling Techniques: CSS Selector

In [2]:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://kim-william.github.io/Personal_Python_Projects/crawling/testhtml/crawlingtest.html')
soup = BeautifulSoup(res.content,'html.parser')
items = soup.select('ul')
for item in items:
    print (item.get_text())

(beginner) - Introduction to the Class
(beginner) - Preparing the Necessary Tools for Blog Development
(beginner) - Setting Up GitHub Pages to Create Your First Blog Page
(beginner) - Creating a Simple Web Page
(beginner) - Applying a Stylish Theme
(beginner) - Understanding Markdown Basics and Creating Your Own Blog Page
(beginner) - Mastering Various Markdown Techniques to Customize Your Blog Page


(Beginner) - Introduction to the Automation Programs [2]
(Beginner) - Demonstrating the Installation of Necessary Programs [5]
(Beginner) - Creating Data in Excel Files [9]
(Beginner) - Beautifying Excel Files! [8]
(Beginner) - Running Python Programs Automatically at Regular Intervals [7]
(Beginner) - Writing Messages on Slack Using Python [40]
(Beginner) - Checking Website Changes Periodically and Sending Alerts via Messenger [12]
(Beginner) - Using the Naver API to Post on Blogs [42]
(Intermediate) - Automatically Promoting Product Information Retrieved from the Coupang Partners API on Naver Blogs/Twitter [412]

In [3]:

res = requests.get('https://kim-william.github.io/Personal_Python_Projects/crawling/testhtml/crawlingtest.html')
soup = BeautifulSoup(res.content,'html.parser')
items = soup.select('ul li')
for item in items:
    print (item.get_text())

(beginner) - Introduction to the Class
(beginner) - Preparing the Necessary Tools for Blog Development
(beginner) - Setting Up GitHub Pages to Create Your First Blog Page
(beginner) - Creating a Simple Web Page
(beginner) - Applying a Stylish Theme
(beginner) - Understanding Markdown Basics and Creating Your Own Blog Page
(beginner) - Mastering Various Markdown Techniques to Customize Your Blog Page
(Beginner) - Introduction to the Automation Programs [2]
(Beginner) - Demonstrating the Installation of Necessary Programs [5]
(Beginner) - Creating Data in Excel Files [9]
(Beginner) - Beautifying Excel Files! [8]
(Beginner) - Running Python Programs Automatically at Regular Intervals [7]
(Beginner) - Writing Messages on Slack Using Python [40]
(Beginner) - Checking Website Changes Periodically and Sending Alerts via Messenger [12]
(Beginner) - Using the Naver API to Post on Blogs [42]
(Intermediate) - Automatically Promoting Product Information Retrieved from the Coupang Partners API on Naver Blogs/Twitter [412]

In [4]:

res = requests.get('https://kim-william.github.io/Personal_Python_Projects/crawling/testhtml/crawlingtest.html')
soup = BeautifulSoup(res.content,'html.parser')
items = soup.select('ul a')
for item in items:
    print (item.get_text())

(beginner) - Introduction to the Class
(beginner) - Preparing the Necessary Tools for Blog Development
(beginner) - Setting Up GitHub Pages to Create Your First Blog Page
(beginner) - Creating a Simple Web Page
(beginner) - Applying a Stylish Theme
(beginner) - Understanding Markdown Basics and Creating Your Own Blog Page
(beginner) - Mastering Various Markdown Techniques to Customize Your Blog Page
(Beginner) - Introduction to the Automation Programs [2]
(Beginner) - Demonstrating the Installation of Necessary Programs [5]
(Beginner) - Creating Data in Excel Files [9]
(Beginner) - Beautifying Excel Files! [8]
(Beginner) - Running Python Programs Automatically at Regular Intervals [7]
(Beginner) - Writing Messages on Slack Using Python [40]
(Beginner) - Checking Website Changes Periodically and Sending Alerts via Messenger [12]
(Beginner) - Using the Naver API to Post on Blogs [42]
(Intermediate) - Automatically Promoting Product Information Retrieved from the Coupang Partners API on Naver Blogs/Twitter [412]

In [27]:

res = requests.get('https://kim-william.github.io/Personal_Python_Projects/crawling/testhtml/crawlingtest.html')
soup = BeautifulSoup(res.content,'html.parser')
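# 'ul > a' matches only <a> tags that are direct children of a <ul>.
# This cell printed nothing, so the page apparently has no such tags.
items = soup.select('ul > a')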
for item in items:
    print (item.get_text())
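
The cell above prints nothing, while 'ul a' printed every course title. The likely reason is that each link sits inside an <li>, so no <a> is a direct child of a <ul>. The sketch below reproduces that behaviour on a small inline fragment. The fragment and the id sample_list are made up for illustration and are not the markup of the test page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment (not the real test page): each <a> is wrapped in an <li>.
html = """
<ul id="sample_list">
  <li class="course"><a href="#">First course</a></li>
  <li class="course"><a href="#">Second course</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

descendants = soup.select('ul a')        # <a> nested at any depth under a <ul>
direct_children = soup.select('ul > a')  # <a> whose parent is the <ul> itself
via_li = soup.select('ul > li > a')      # every level of the path spelled out

print(len(descendants), len(direct_children), len(via_li))
```

With this fragment, the descendant selector and the fully spelled-out child path both find two links, and 'ul > a' finds none.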

In [6]:

res = requests.get('https://kim-william.github.io/Personal_Python_Projects/crawling/testhtml/crawlingtest.html')
soup = BeautifulSoup(res.content,'html.parser')
items = soup.select('.course')
for item in items:
    print (item.get_text())

(beginner) - Introduction to the Class
(beginner) - Preparing the Necessary Tools for Blog Development
(beginner) - Setting Up GitHub Pages to Create Your First Blog Page
(beginner) - Creating a Simple Web Page
(beginner) - Applying a Stylish Theme
(beginner) - Understanding Markdown Basics and Creating Your Own Blog Page
(beginner) - Mastering Various Markdown Techniques to Customize Your Blog Page
(Beginner) - Introduction to the Automation Programs [2]
(Beginner) - Demonstrating the Installation of Necessary Programs [5]
(Beginner) - Creating Data in Excel Files [9]
(Beginner) - Beautifying Excel Files! [8]
(Beginner) - Running Python Programs Automatically at Regular Intervals [7]
(Beginner) - Writing Messages on Slack Using Python [40]
(Beginner) - Checking Website Changes Periodically and Sending Alerts via Messenger [12]
(Beginner) - Using the Naver API to Post on Blogs [42]
(Intermediate) - Automatically Promoting Product Information Retrieved from the Coupang Partners API on Naver Blogs/Twitter [412]

In [7]:

res = requests.get('https://kim-william.github.io/Personal_Python_Projects/crawling/testhtml/crawlingtest.html')
soup = BeautifulSoup(res.content,'html.parser')
items = soup.select('#start')
for item in items:
    print (item.get_text())

(beginner) - Introduction to the Class
(beginner) - Preparing the Necessary Tools for Blog Development
(beginner) - Setting Up GitHub Pages to Create Your First Blog Page
(beginner) - Creating a Simple Web Page
(beginner) - Applying a Stylish Theme
(beginner) - Understanding Markdown Basics and Creating Your Own Blog Page
(beginner) - Mastering Various Markdown Techniques to Customize Your Blog Page

In [8]:

res = requests.get('https://kim-william.github.io/Personal_Python_Projects/crawling/testhtml/crawlingtest.html')
soup = BeautifulSoup(res.content,'html.parser')
items = soup.select('li.course.paid')
for item in items:
    print (item.get_text())

(Intermediate) - Automatically Promoting Product Information Retrieved from the Coupang Partners API on Naver Blogs/Twitter [412]
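
A selector such as 'li.course.paid' chains two class names, so it matches only <li> tags that carry both classes. A minimal sketch on a made-up fragment (the list items below are assumptions, not the test page's content):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one free item and two paid items.
html = """
<ul id="sample_list">
  <li class="course">Free course</li>
  <li class="course paid">Paid course</li>
  <li class="paid course">Another paid course</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# Both classes are required; their order in the class attribute does not matter.
paid = [li.get_text() for li in soup.select('li.course.paid')]

# :not() excludes elements, here the ones that also have the "paid" class.
free = [li.get_text() for li in soup.select('li.course:not(.paid)')]

print(paid)
print(free)
```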

In [9]:

res = requests.get('https://kim-william.github.io/Personal_Python_Projects/crawling/testhtml/crawlingtest.html')
soup = BeautifulSoup(res.content,'html.parser')
# The selector ul#hobby_course_list li.course is used to select all <li> tags with the class "course" within the <ul> tag that has the ID "hobby_course_list". 
# This retrieves multiple elements.
items = soup.select('ul#hobby_course_list li.course')
for item in items:
    print (item.get_text())

(beginner) - Introduction to the Class
(beginner) - Preparing the Necessary Tools for Blog Development
(beginner) - Setting Up GitHub Pages to Create Your First Blog Page
(beginner) - Creating a Simple Web Page
(beginner) - Applying a Stylish Theme
(beginner) - Understanding Markdown Basics and Creating Your Own Blog Page
(beginner) - Mastering Various Markdown Techniques to Customize Your Blog Page

In [12]:

res = requests.get('https://kim-william.github.io/Personal_Python_Projects/crawling/testhtml/crawlingtest.html')
soup = BeautifulSoup(res.content,'html.parser')
# The selector ul#dev_course_list > li.course.paid matches <li> tags that have both the "course" and "paid" classes
# and are direct children of the <ul> tag with the ID "dev_course_list".
# The select_one() method returns only the first matching element.
item = soup.select_one('ul#dev_course_list > li.course.paid')
print (item.get_text())

(Intermediate) - Automatically Promoting Product Information Retrieved from the Coupang Partners API on Naver Blogs/Twitter [412]
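
Unlike select(), which returns an empty list when nothing matches, select_one() returns None, so calling get_text() on the result directly raises AttributeError when the selector misses. One way to guard against that is sketched below. The helper text_or_default, the fragment, and the class sold_out are hypothetical names introduced for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the test page.
html = """
<ul id="sample_list">
  <li class="course">Free course</li>
  <li class="course paid">Paid course</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')


def text_or_default(root, selector, default=''):
    """Return the text of the first match for selector, or default if none."""
    element = root.select_one(selector)
    return element.get_text() if element is not None else default


found = text_or_default(soup, 'ul#sample_list > li.course.paid')
missing = text_or_default(soup, 'ul#sample_list > li.course.sold_out',
                          default='(no match)')
print(found)
print(missing)
```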

The Tag objects returned by find() or select() support the same search methods as the soup itself, so find(), find_all(), select(), and select_one() can be called on them to search only within that element.

In [26]:

res = requests.get('https://kim-william.github.io/Personal_Python_Projects/crawling/testhtml/crawlingtest.html')
soup = BeautifulSoup(res.content,'html.parser')
# Select all <tr> tags (table rows) from the parsed HTML
items = soup.select('tr')
for tr in items:
    # Find all <td> tags (table cells) within the current row
    columns = tr.find_all('td')
    # Join the text of each cell with commas; join() adds no leading
    # separator, so nothing has to be stripped afterwards
    print(','.join(column.get_text() for column in columns))

Schedule,Curriculum Title,Difficulty Level
5.1 ~ 6.15,Creating Your Own Stylish Blog Site,Beginner
6.16 ~ 7.31,First Steps in Python and Data Science (Mastering IT Fundamentals),Intermediate
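
Joining cells with a plain comma works for the table above, but it produces ambiguous rows if a cell's text itself contains a comma. A possible variation, sketched here on a made-up table rather than the test page, passes the rows to Python's csv module, which quotes such cells.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical table: the second cell of the data row contains commas.
html = """
<table>
  <tr><th>Schedule</th><th>Title</th></tr>
  <tr><td>5.1 ~ 6.15</td><td>Blogs, Sites, and Themes</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# For each row, search only inside that <tr> for header and data cells.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    for tr in soup.select('tr')
]

# Write to an in-memory buffer; csv.writer quotes cells that contain commas.
buffer = io.StringIO()
csv.writer(buffer, lineterminator='\n').writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To save the result to disk instead, the same writer can be given a file opened with newline=''.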
