Specs (1.0.0)
Jobs scraper
General project goal
We need a web scraper, implemented in Python 3.11, capable of scraping job
posts from LinkedIn and saving all the relevant information to a MySQL database.
Even though this activity only concerns LinkedIn, keep in mind that this scraper’s
functionality will later be expanded to cover other relevant websites.
Features
The “must have” features are detailed below:
1. The scraper must be able to scrape search results in a specific country for a given
search query (e.g. "PHP developer")
2. It should have the ability to iterate through multiple countries to perform global
scraping for a given search query
3. It should have the ability to iterate through multiple search queries to perform a
complete scrape of all of them
4. It should take the country (a single country, a comma-separated list of
countries, or "ALL") and the search query (or multiple queries) as input
parameters; a minimal parsing sketch is given after this list
5. For each job post, the scraper must collect at least the following information
(a possible data model is sketched after this list):
- Country
- City
- Region
- Work location (hybrid, remote, office)
- Work type (full-time, part-time, contract-based, internship, etc.)
- Company name
- Company size
- Job post content (as-is)
- Job post link
- Company page link
6. The scraper must operate in headless mode, allowing it to be executed on a server
without a graphical interface
7. The scraper should handle the uniqueness of scraped job posts using two
methods (both are sketched after this list):
- Fast approach: calculate a hash of the entire content and use it to check
whether it matches any other job post.
This is only a proposal; better solutions, if any, can certainly be proposed
and evaluated
- In-depth approach: utilise TF-IDF (Term Frequency-Inverse Document
Frequency) to further ensure uniqueness even when job posts do not match
100%.
With this, two job posts will be considered duplicates if their similarity
“score” is above a certain configurable threshold
8. The scraper should handle and bypass any control such as captchas, 429 errors
or any other relevant checks put in place by LinkedIn; a simple backoff sketch
for rate limiting follows this list
9. It must be designed with extensibility in mind, allowing for future
implementations of additional scrapers for other websites (e.g. Indeed,
Monster); a possible base-class layout is sketched after this list
10. The scraper should store the scraped job posts in a MySQL database
11. It should generate a tracking log to monitor its progress on every run, log any errors,
and provide traceability throughout the execution
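
A few illustrative sketches follow; none of them are prescriptive. First, a minimal
sketch of how the input parameters from feature 4 might be parsed. The flag names and
the supported-country list are assumptions, not part of this spec:

import argparse

# NOTE: flag names and the supported-country list are placeholders,
# not mandated by this spec.
SUPPORTED_COUNTRIES = ["United States", "Germany", "Italy"]

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Jobs scraper")
    parser.add_argument("--countries", default="ALL",
                        help='A single country, a comma-separated list, or "ALL"')
    parser.add_argument("--queries", required=True,
                        help='One or more comma-separated search queries, e.g. "PHP developer"')
    return parser.parse_args()

def expand_countries(raw: str) -> list[str]:
    # "ALL" expands to every supported country; otherwise split on commas.
    if raw.strip().upper() == "ALL":
        return list(SUPPORTED_COUNTRIES)
    return [c.strip() for c in raw.split(",") if c.strip()]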
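
Next, one possible shape for the data points listed in feature 5, with a matching
MySQL table for feature 10. Field, class, and table names are assumptions; only the
captured data points come from the spec:

from dataclasses import dataclass

@dataclass
class JobPost:
    country: str
    city: str
    region: str
    work_location: str   # hybrid / remote / office
    work_type: str       # full-time / part-time / contract-based / internship, ...
    company_name: str
    company_size: str
    content: str         # job post content, stored as-is
    job_url: str
    company_url: str

# DDL kept as a Python constant so a MySQL connector can execute it at start-up.
# The UNIQUE content_hash column supports the fast duplicate check of feature 7.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS job_posts (
    id            BIGINT AUTO_INCREMENT PRIMARY KEY,
    country       VARCHAR(100) NOT NULL,
    city          VARCHAR(100),
    region        VARCHAR(100),
    work_location VARCHAR(20),
    work_type     VARCHAR(30),
    company_name  VARCHAR(255),
    company_size  VARCHAR(50),
    content       MEDIUMTEXT NOT NULL,
    job_url       VARCHAR(500) NOT NULL,
    company_url   VARCHAR(500),
    content_hash  CHAR(64) NOT NULL UNIQUE
)
"""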
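
A sketch of the two-tier uniqueness check from feature 7, assuming SHA-256 for the
fast pass and scikit-learn’s TF-IDF vectoriser with cosine similarity for the
in-depth pass; the 0.9 default is a stand-in for the configurable threshold:

import hashlib

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def content_hash(text: str) -> str:
    # Fast approach: a SHA-256 digest of the raw content catches exact duplicates.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def is_near_duplicate(new_post: str, known_posts: list[str],
                      threshold: float = 0.9) -> bool:
    # In-depth approach: TF-IDF plus cosine similarity catches near-identical
    # posts that an exact hash would miss; `threshold` is the configurable cutoff.
    if not known_posts:
        return False
    matrix = TfidfVectorizer().fit_transform(known_posts + [new_post])
    scores = cosine_similarity(matrix[-1], matrix[:-1])
    return float(scores.max()) >= threshold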
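
Captcha handling depends on the provider and is beyond a short sketch, but the 429
part of feature 8 is commonly handled with exponential backoff. A sketch using the
requests library; the retry limit and jitter are illustrative choices:

import random
import time

import requests

def get_with_backoff(url: str, max_retries: int = 5) -> requests.Response:
    # Retry on HTTP 429 with exponential backoff plus jitter, honouring a
    # numeric Retry-After header when the server sends one.
    for attempt in range(max_retries):
        response = requests.get(url, timeout=30)
        if response.status_code != 429:
            return response
        wait = float(response.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait + random.uniform(0, 1))
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts: {url}")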
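
Finally, one way to address features 6, 9 and 11 together: an abstract base class
that a future Indeed or Monster scraper can subclass, configured headless (Selenium
with Chrome is an assumption; any headless driver would do) and wired to a per-run
log file (the file name is illustrative). It reuses the JobPost dataclass sketched
above:

import logging
from abc import ABC, abstractmethod
from collections.abc import Iterator

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Per-run tracking log for progress and errors (feature 11); name is illustrative.
logging.basicConfig(
    filename="scraper.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

class JobScraper(ABC):
    """Base class each site-specific scraper (LinkedIn, Indeed, Monster, ...) subclasses."""

    def __init__(self) -> None:
        self.log = logging.getLogger(type(self).__name__)
        options = Options()
        options.add_argument("--headless=new")  # no GUI, so it can run on a server (feature 6)
        self.driver = webdriver.Chrome(options=options)

    @abstractmethod
    def scrape(self, country: str, query: str) -> Iterator["JobPost"]:
        """Yield one JobPost (see the data-model sketch above) per result."""

class LinkedInScraper(JobScraper):
    def scrape(self, country: str, query: str) -> Iterator["JobPost"]:
        self.log.info("Scraping %r in %s", query, country)
        # Site-specific navigation and extraction go here.
        yield from ()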
Additional notes:
1. We’ll provide you with a hosted DB, if useful
2. We’ll provide you with a GitHub repo to which we expect you’ll commit as you
progress through the implementation, not only at the end of it