Text Pre-Processing
Filter Noise
Out of the 1.3 million jobs we acquired from Indeed, 250K contained the word "data" in either the job title or the job description. We filtered for "data"-related jobs based on the assumption that any data science related job will contain the word "data". However, cursory analysis of the filtered jobs (and multiple iterations of training and validating the model) showed too many irrelevant jobs related to office administration, delivery services, care providers, etc., which did not fit our expectations.
To filter the noise, we applied inclusion and exclusion patterns to the job title and job description. The patterns were based on research by the Data Science Central team, which analyzed ~10K LinkedIn professionals with titles related to data science and categorized 400 job titles for data scientists. We formed inclusion patterns from the cleaned job titles in that research, and formed exclusion patterns by analyzing job titles that clearly did not fall into data science. After filtering the noise from the master job posts database, we had 85K job posts related to the data science field, i.e. ~6.5% of the job posts collected.
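The inclusion/exclusion filtering can be sketched as a pair of compiled regular expressions. The pattern lists below are hypothetical placeholders; the actual patterns were derived from the Data Science Central job-title research and refined over our training/validation iterations.

```python
import re

# Hypothetical, abbreviated pattern lists for illustration only.
INCLUDE = re.compile(r"data scien|machine learning|data analy|statistic", re.I)
EXCLUDE = re.compile(r"data entry|delivery driver|office admin|care provider", re.I)

def is_relevant(title, description):
    """Keep a post only if title or description matches an inclusion
    pattern and the title matches no exclusion pattern."""
    text = f"{title} {description}"
    return bool(INCLUDE.search(text)) and not EXCLUDE.search(title)
```

A post titled "Data Scientist" passes, while "Data Entry Clerk" is rejected even though it contains the word "data" — which is exactly the noise the keyword-only filter let through.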
Text Normalization
After analyzing sample job posts, we observed that companies post job descriptions in different formats and with varying lengths. Most job descriptions include company information (40%), job responsibilities (58%), qualifications (50%), and skills (79%). The length of job posts followed a typical Zipf distribution. We concatenated the job title with the job description for the entire analysis.
We applied the following text normalization techniques before building features:
- Removed non-alphanumeric characters, including punctuation (except +, ., -)
- Converted text to lowercase
- Removed URLs and email addresses
- Removed stop words (using the SMART stop word list, NLTK stop words, and our own stop words based on the analysis)
- Applied stemming (Snowball stemmer) to reduce words to their root form
We built the text pre-processing module with the ability to enable or disable any specific normalization technique. Based on multiple iterations and analysis, our current model does not apply stemming, as some keywords were lost during the process.
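A minimal sketch of such a toggleable pipeline is shown below. The stop-word set is a tiny illustrative stand-in (the real pipeline combined the SMART list, NLTK's list, and custom words), and the regexes are assumptions about how the steps above might be implemented.

```python
import re

# Tiny illustrative stop-word set; the real module combined SMART,
# NLTK, and custom stop words discovered during analysis.
STOP_WORDS = {"a", "an", "and", "at", "for", "in", "is", "of",
              "on", "or", "the", "to", "with"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+")
# Keep +, ., - so tokens like "c++", ".net", "scikit-learn" survive.
NON_ALNUM_RE = re.compile(r"[^A-Za-z0-9+.\- ]")

def normalize(text, lowercase=True, strip_urls=True,
              strip_punct=True, drop_stop_words=True, stem=False):
    """Apply the normalization steps; each can be toggled on or off,
    mirroring the enable/disable design of the pre-processing module."""
    if lowercase:
        text = text.lower()
    if strip_urls:
        text = URL_RE.sub(" ", EMAIL_RE.sub(" ", text))
    if strip_punct:
        text = NON_ALNUM_RE.sub(" ", text)
    tokens = text.split()
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        # Disabled in the current model; with NLTK installed one could
        # use nltk.stem.SnowballStemmer("english") here.
        from nltk.stem import SnowballStemmer
        stemmer = SnowballStemmer("english")
        tokens = [stemmer.stem(t) for t in tokens]
    return " ".join(tokens)
```

Keeping each step behind a keyword argument made it cheap to re-run the pipeline with stemming off once we saw it destroying keywords.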
After applying the normalization techniques, the size of the relevant job posts data came down from 400M to 250M.
Evaluation of Model
There is no standard way to evaluate the retrieval effectiveness of a job search engine, or of any search engine for that matter. The results can only be subjectively compared to what an individual expects, or to what similar systems return.
Evaluation
There are two particular problems in evaluating Job Fiction’s model:
- Unlike commercial job search engines (LinkedIn, Indeed, Dice, etc.), Job Fiction's database is not live, and the job posts returned may already have expired.
- The retrieval logic of the commercial job search engines is not known, and each offers its own advanced search options to users. To say that the jobs returned from two different job search engines are not the same is only to acknowledge that each search algorithm is different.
One search engine that is similar to Job Fiction is JobScan.co. Though the main goal of JobScan.co is to evaluate the match between a resume and a job description, it offers a list of jobs similar to the job description submitted. Submitting the same job description to Job Fiction and to JobScan.co results in different jobs; this is expected, since Job Fiction queries its own dataset. Subjectively comparing the job titles reviewed, it is possible to suggest that Job Fiction returns better matches, but since the goal of Job Fiction is to ignore job titles as an indicator of an appropriate job match, this too is not an appropriate evaluation.
Feedback
From the minimal usability testing conducted:
- There is general excitement for the results returned. The job results are interesting to the user, and some have titles the user would typically dismiss and never consider.
- Users were at first confused by the usability and functionality until they realized how simple it was. The design was later complimented for being simple and to-the-point; users said they were not used to such a straightforward design.
- The job classification in the results is a little less useful, especially if the three job descriptions entered are already very similar.
- Even though the jobs returned matched the user’s interests, many still included the job tokens the user explicitly placed in the Exclude list, suggesting the weights used in the model’s “Must Have,” “Nice to Have,” and “Exclude” lists were not effective.