Fetch the latest news as it is posted on around 100 news websites, using Python. The job is to develop a script or service that automatically scrapes news articles from a list of websites, extracts the relevant information (such as the headline, author, date, image, and body text), and stores the data in a structured format. The script should run on a regular basis so that it keeps the database updated with the latest news.
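As a rough illustration of the extraction step, the sketch below parses an RSS 2.0 feed into records with the fields named above. It is a minimal example using only the Python standard library. The sample feed, the `news.example.com` URLs, the helper names `text_of` and `parse_feed`, and the namespace choices (`dc:creator`, `media:content`, `content:encoded`) are assumptions for illustration. Real feeds vary and would need per-site checks.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Namespace prefixes commonly seen in RSS 2.0 feeds (assumed; verify per feed).
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "content": "http://purl.org/rss/1.0/modules/content/",
    "media": "http://search.yahoo.com/mrss/",
}

# Hypothetical sample feed, embedded so the sketch runs offline.
SAMPLE_FEED = """
<rss version="2.0"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Example News</title>
    <item>
      <title>Sample headline</title>
      <link>https://news.example.com/story-1</link>
      <dc:creator>Staff Reporter</dc:creator>
      <pubDate>Thu, 16 Feb 2023 07:44:00 +0000</pubDate>
      <media:content url="https://news.example.com/img/story-1.jpg" medium="image"/>
      <content:encoded><![CDATA[<p>Body text of the story.</p>]]></content:encoded>
    </item>
  </channel>
</rss>
"""


def text_of(item, path):
    """Return the stripped text of the first matching child, or None."""
    node = item.find(path, NS)
    return node.text.strip() if node is not None and node.text else None


def parse_feed(xml_text):
    """Turn every <item> in an RSS 2.0 document into a plain dict."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iterfind("./channel/item"):
        media = item.find("media:content", NS)
        published = text_of(item, "pubDate")
        records.append({
            "headline": text_of(item, "title"),
            "url": text_of(item, "link"),
            "author": text_of(item, "dc:creator"),
            # RSS dates use the RFC 822 format; normalise them to ISO 8601.
            "published": parsedate_to_datetime(published).isoformat() if published else None,
            "image": media.get("url") if media is not None else None,
            # Prefer the full body when present, else fall back to the summary.
            "body_html": text_of(item, "content:encoded") or text_of(item, "description"),
        })
    return records


records = parse_feed(SAMPLE_FEED)
print(records[0])
```

In a real run the XML would come from each site's feed URL rather than an embedded string, and fields missing from a feed stay `None` so a later step can fill them in.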
The websites are listed here: https://blog.feedspot.com/pakistan_news_rss_feeds/
Please look at the websites listed above before committing.
Deliverables:
* Working code that runs automatically every hour.
* Pulls the latest news posted on 100 news websites.
* Cleans and processes the data: author, headline, image, body text, entity recognition, language, type (news, blog, opinion, etc.).
* Stores them in a database as well as S3.
* Working, well-documented solution delivered in the client's AWS environment.
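The storage deliverable could be sketched roughly as follows. This is an illustration under stated assumptions, not a prescribed design: `sqlite3` stands in for whichever database is chosen (MongoDB, PostgreSQL, or DynamoDB), the `raw/<date>/<id>.json` key layout is hypothetical, and the S3 upload itself is only indicated in a comment so that the example runs offline. The hourly trigger (for example cron, or a scheduled rule on AWS) is left out.

```python
import hashlib
import json
import sqlite3


def article_id(url):
    """Stable ID derived from the article URL, used to skip duplicates."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def s3_key(record):
    """Hypothetical S3 key layout: raw/<YYYY-MM-DD>/<id>.json."""
    day = record["published"][:10]
    return f"raw/{day}/{article_id(record['url'])}.json"


def init_db(conn):
    """Create the articles table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               id TEXT PRIMARY KEY,
               url TEXT NOT NULL,
               headline TEXT,
               published TEXT,
               s3_key TEXT,
               payload TEXT
           )"""
    )


def store(conn, record):
    """Insert the record unless its URL is already stored; return True if new."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?, ?, ?)",
        (
            article_id(record["url"]),
            record["url"],
            record["headline"],
            record["published"],
            s3_key(record),
            json.dumps(record),
        ),
    )
    conn.commit()
    # A real pipeline would also upload the JSON payload to S3 at this point,
    # for example with boto3's put_object (assumed, not executed here).
    return cur.rowcount == 1


# In-memory database and a made-up record, so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
init_db(conn)
sample = {
    "headline": "Sample headline",
    "url": "https://news.example.com/story-1",
    "published": "2023-02-16T07:44:00+00:00",
}
first = store(conn, sample)   # new article, gets inserted
second = store(conn, sample)  # same URL on the next hourly run, gets skipped
count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(first, second, count)
```

Keying each row on a hash of the URL means an hourly run can re-read the same feeds without creating duplicate rows or duplicate S3 objects.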
Please describe which approach you will take (RSS vs. scraping) and why. A portfolio and code samples help.
The ideal candidate should have a strong background in RSS pulls, web scraping, and data extraction, as well as experience with APIs and integrating data sources, and should be able to advise on a robust, long-lasting solution.
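One possible way to frame the RSS vs. scraping question is "RSS first, scrape only when the feed item is thin". The sketch below illustrates that idea with the standard library only. The 200-character threshold, the `ArticleParser`, `needs_scrape`, and `enrich` names, and the sample page are all assumptions for illustration. A production scraper would likely need per-site selectors and a more robust HTML library.

```python
from html.parser import HTMLParser

MIN_BODY_CHARS = 200  # assumed cut-off below which a feed body counts as a teaser


class ArticleParser(HTMLParser):
    """Collects <p> text and the og:image meta tag from an article page."""

    def __init__(self):
        super().__init__()
        self.image = None
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:image":
            self.image = attrs.get("content")
        elif tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = "".join(self._buf).strip()
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        # Text inside nested inline tags (e.g. <b>) is still collected.
        if self._in_p:
            self._buf.append(data)


def needs_scrape(feed_body):
    """True when the feed gave no body or only a short teaser."""
    return feed_body is None or len(feed_body) < MIN_BODY_CHARS


def enrich(record, page_html):
    """Fill body and image from the article page when the feed item is thin."""
    if not needs_scrape(record.get("body")):
        return record
    parser = ArticleParser()
    parser.feed(page_html)
    merged = dict(record)
    merged["body"] = "\n\n".join(parser.paragraphs)
    merged["image"] = record.get("image") or parser.image
    merged["source"] = "scrape"
    return merged


# Hypothetical article page, embedded so the sketch runs without network access.
SAMPLE_PAGE = """<html><head>
<meta property="og:image" content="https://news.example.com/img/story-1.jpg">
</head><body>
<p>First paragraph of the story.</p>
<p>Second paragraph with <b>bold</b> text.</p>
</body></html>"""

teaser = {"headline": "Sample headline", "body": "Short teaser...", "image": None}
full = enrich(teaser, SAMPLE_PAGE)
print(full["source"], full["image"])
print(full["body"])
```

Fetching the page is deliberately omitted. With this split, sites whose feeds carry full articles never need to be scraped, which tends to reduce breakage when page layouts change.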
* Strong experience with web scraping and data extraction techniques
* Proficiency in Python
* Familiarity with APIs and integrating data from multiple sources
* Ability to work independently and deliver high-quality results within tight deadlines
Budget: $250
Posted On: February 16, 2023 07:44 UTC
Category: Full Stack Development
Skills: MongoDB, PostgreSQL, Amazon DynamoDB, Python, RSS, Data Scraping
Country: United States
Project ID: 3310994
Project category: MongoDB, PostgreSQL, Amazon DynamoDB, Python, RSS, Data Scraping