Specs (1.0.0)
Jobs scraper
General project goal
We need a web scraper, implemented in Python 3.11, capable of scraping job
posts from LinkedIn and saving all the relevant information to a MySQL database.
Even though this activity only concerns LinkedIn, keep in mind that this scraper’s
functionality will later be expanded to cover other relevant websites.
Features
The “must have” features are detailed below:
1. The scraper must be able to scrape search results in a specific country for a given
search query (e.g. "PHP developer")
2. It should have the ability to iterate through multiple countries to perform global
scraping for a given search query
3. It should have the ability to iterate through multiple search queries to perform a
complete scrape of all of them
4. It should take the country (a single country, a comma-separated list of
countries, or "ALL") and the search query (or multiple queries) as input
parameters; a minimal parsing sketch is given after this list
5. For each job post, the scraper must collect at least the following information
(a possible data model is sketched after this list):
- Country
- City
- Region
- Work location (hybrid, remote, office)
- Work type (full-time, part-time, contract-based, internship, etc.)
- Company name
- Company size
- Job post content (as-is)
- Job post link
- Company page link
6. The scraper must operate in headless mode, allowing it to be executed on a server
without a graphical interface
7. The scraper should handle the uniqueness of scraped job posts using two
methods (both are sketched after this list):
- Fast approach: calculate a hash of the entire content and use it to check
whether it matches any other job post.
This is only a proposal; better solutions, if any, can certainly be proposed
and evaluated
- In-depth approach: utilise TF-IDF (Term Frequency-Inverse Document
Frequency) to further ensure uniqueness even when job posts do not match
100%.
With this, two job posts will be considered duplicates if their similarity
“score” is above a certain configurable threshold
8. The scraper should handle and bypass any control such as captchas, 429 errors
or any other relevant checks put in place by LinkedIn; a simple backoff sketch
for rate limiting follows this list
9. It must be designed with extensibility in mind, allowing for future
implementations of additional scrapers for other websites (e.g. Indeed,
Monster); a possible base-class layout is sketched after this list
10. The scraper should store the scraped job posts in a MySQL database
11. It should generate a tracking log to monitor its progress on every run, log any errors,
and provide traceability throughout the execution
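
A few illustrative sketches follow; none of them are prescriptive. First, a minimal
sketch of how the input parameters from feature 4 might be parsed. The flag names and
the supported-country list are assumptions, not part of this spec:

import argparse

# NOTE: flag names and the supported-country list are placeholders,
# not mandated by this spec.
SUPPORTED_COUNTRIES = ["United States", "Germany", "Italy"]

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Jobs scraper")
    parser.add_argument("--countries", default="ALL",
                        help='A single country, a comma-separated list, or "ALL"')
    parser.add_argument("--queries", required=True,
                        help='One or more comma-separated search queries, e.g. "PHP developer"')
    return parser.parse_args()

def expand_countries(raw: str) -> list[str]:
    # "ALL" expands to every supported country; otherwise split on commas.
    if raw.strip().upper() == "ALL":
        return list(SUPPORTED_COUNTRIES)
    return [c.strip() for c in raw.split(",") if c.strip()]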
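
Next, one possible shape for the data points listed in feature 5, with a matching
MySQL table for feature 10. Field, class, and table names are assumptions; only the
captured data points come from the spec:

from dataclasses import dataclass

@dataclass
class JobPost:
    country: str
    city: str
    region: str
    work_location: str   # hybrid / remote / office
    work_type: str       # full-time / part-time / contract-based / internship, ...
    company_name: str
    company_size: str
    content: str         # job post content, stored as-is
    job_url: str
    company_url: str

# DDL kept as a Python constant so a MySQL connector can execute it at start-up.
# The UNIQUE content_hash column supports the fast duplicate check of feature 7.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS job_posts (
    id            BIGINT AUTO_INCREMENT PRIMARY KEY,
    country       VARCHAR(100) NOT NULL,
    city          VARCHAR(100),
    region        VARCHAR(100),
    work_location VARCHAR(20),
    work_type     VARCHAR(30),
    company_name  VARCHAR(255),
    company_size  VARCHAR(50),
    content       MEDIUMTEXT NOT NULL,
    job_url       VARCHAR(500) NOT NULL,
    company_url   VARCHAR(500),
    content_hash  CHAR(64) NOT NULL UNIQUE
)
"""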
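
A sketch of the two-tier uniqueness check from feature 7, assuming SHA-256 for the
fast pass and scikit-learn’s TF-IDF vectoriser with cosine similarity for the
in-depth pass; the 0.9 default is a stand-in for the configurable threshold:

import hashlib

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def content_hash(text: str) -> str:
    # Fast approach: a SHA-256 digest of the raw content catches exact duplicates.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def is_near_duplicate(new_post: str, known_posts: list[str],
                      threshold: float = 0.9) -> bool:
    # In-depth approach: TF-IDF plus cosine similarity catches near-identical
    # posts that an exact hash would miss; `threshold` is the configurable cutoff.
    if not known_posts:
        return False
    matrix = TfidfVectorizer().fit_transform(known_posts + [new_post])
    scores = cosine_similarity(matrix[-1], matrix[:-1])
    return float(scores.max()) >= threshold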
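
Captcha handling depends on the provider and is beyond a short sketch, but the 429
part of feature 8 is commonly handled with exponential backoff. A sketch using the
requests library; the retry limit and jitter are illustrative choices:

import random
import time

import requests

def get_with_backoff(url: str, max_retries: int = 5) -> requests.Response:
    # Retry on HTTP 429 with exponential backoff plus jitter, honouring a
    # numeric Retry-After header when the server sends one.
    for attempt in range(max_retries):
        response = requests.get(url, timeout=30)
        if response.status_code != 429:
            return response
        wait = float(response.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait + random.uniform(0, 1))
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts: {url}")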
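
Finally, one way to address features 6, 9 and 11 together: an abstract base class
that a future Indeed or Monster scraper can subclass, configured headless (Selenium
with Chrome is an assumption; any headless driver would do) and wired to a per-run
log file (the file name is illustrative). It reuses the JobPost dataclass sketched
above:

import logging
from abc import ABC, abstractmethod
from collections.abc import Iterator

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Per-run tracking log for progress and errors (feature 11); name is illustrative.
logging.basicConfig(
    filename="scraper.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

class JobScraper(ABC):
    """Base class each site-specific scraper (LinkedIn, Indeed, Monster, ...) subclasses."""

    def __init__(self) -> None:
        self.log = logging.getLogger(type(self).__name__)
        options = Options()
        options.add_argument("--headless=new")  # no GUI, so it can run on a server (feature 6)
        self.driver = webdriver.Chrome(options=options)

    @abstractmethod
    def scrape(self, country: str, query: str) -> Iterator["JobPost"]:
        """Yield one JobPost (see the data-model sketch above) per result."""

class LinkedInScraper(JobScraper):
    def scrape(self, country: str, query: str) -> Iterator["JobPost"]:
        self.log.info("Scraping %r in %s", query, country)
        # Site-specific navigation and extraction go here.
        yield from ()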
Additional notes:
1. We’ll provide you with a hosted DB, if useful
2. We’ll provide you with a GitHub repo to which we expect you’ll commit as you
progress through the implementation, not only at the end of it