Web Scraping - Days 95~96 - Learn Python Online in 100 Days

Sep 16, 2023 #python-learning

This post was translated from Chinese into English by AI.

AI-generated summary

Day 95's exercise was to collect 5 news articles from a news source, have OpenAI generate keywords from them, and pass those keywords to Spotipy to retrieve songs. The author skipped it because of issues with OpenAI. On Day 96, the author learned to fetch and parse HTML with Python: the requests library downloads a page's HTML, and BeautifulSoup parses it. The day's exercise was to retrieve the story titles from Hacker News and print those containing the keywords "python" or "replit". No titles matched either keyword, so "SQL" was added as a third. The code in main.py shows the process.

Record

Day 95's exercise is to retrieve 5 news articles from a news source, submit them to OpenAI to generate keywords, and then send the keywords to Spotipy to get songs back. Because of issues with OpenAI, I skipped this exercise.
Day 96 covers how to retrieve and parse HTML content, which finally opens up one of Python's most powerful uses: web scraping!
Use response = requests.get(url) and html = response.text to retrieve the HTML content of a webpage.
Use soup = BeautifulSoup(html, 'html.parser') to parse the HTML. Before that, import the class: from bs4 import BeautifulSoup.
Use soup.find_all("span", {"class": "titleline"}) to retrieve specific content. span is the tag name, and the dictionary maps the class attribute to the class name to match.
Today's exercise is to retrieve the titles from Hacker News and print those that contain "python" or "replit". No titles contained either keyword, so a third keyword, "SQL", was added.
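
The parsing steps above can be tried without touching the network. The sketch below is only an illustration: SAMPLE_HTML is made-up markup, loosely modeled on a list of story titles, and is not the real Hacker News page. It shows the dictionary form of the attribute filter next to the equivalent class_ keyword form.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption, not the real Hacker News HTML)
SAMPLE_HTML = """
<table>
  <tr><td><span class="titleline"><a href="https://example.com/a">Learning Python in 100 days</a></span></td></tr>
  <tr><td><span class="titleline"><a href="https://example.com/b">A tour of SQL window functions</a></span></td></tr>
  <tr><td><span class="subline">42 points</span></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# attrs as a dictionary: attribute name -> value it must have
by_dict = soup.find_all("span", {"class": "titleline"})

# Equivalent keyword form; class_ is used because "class" is a reserved word
by_keyword = soup.find_all("span", class_="titleline")

# Extract the visible text of each matching span
titles = [span.get_text(strip=True) for span in by_dict]
print(titles)
```

Both calls return the same two spans here; the span with class "subline" is ignored.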

CODE

main.py

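import requests
from bs4 import BeautifulSoup

url = "https://news.ycombinator.com"

# Download the front page and keep the raw HTML as text
response = requests.get(url)
html = response.text

# Parse the HTML and collect every <span class="titleline">
soup = BeautifulSoup(html, 'html.parser')
titles = soup.find_all("span", {"class": "titleline"})
print(len(titles))

# Print the titles that mention any keyword (the match is case-sensitive)
for title in titles:
    if "python" in title.text or "replit" in title.text or "SQL" in title.text:
        print(title.text)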
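
One thing worth noting about the check above: the in operator is case-sensitive, so "python" does not match a title that spells it "Python". That may be part of why no titles matched at first, although the post does not say so. A minimal sketch of a case-insensitive variant follows; matching_titles and sample_titles are hypothetical names introduced for illustration, and the titles are made up.

```python
def matching_titles(titles, keywords):
    """Return the titles containing at least one keyword, ignoring case."""
    lowered = [kw.lower() for kw in keywords]
    return [t for t in titles if any(kw in t.lower() for kw in lowered)]


# Made-up titles for illustration only
sample_titles = [
    "Show HN: A Python static site generator",
    "Why SQLite is everywhere",
    "Notes on building a keyboard",
]

hits = matching_titles(sample_titles, ["python", "replit", "SQL"])
for title in hits:
    print(title)
```

With the scraped data, the same function could be called on the text of each span, for example matching_titles([t.text for t in title], ["python", "replit", "SQL"]). Substring matching is still loose: "sql" also matches "SQLite" or "PostgreSQL".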
