The success of any AI model hinges on the quality of the data fed into it. ML systems run on large quantities of data, but they cannot be expected to perform well with just any data: they need high-quality AI training data. If the output of an AI model is to be authentic and accurate, the data used to train it must meet high standards.
The data that AI and ML models are trained on should be of prime quality for the business to draw meaningful and relevant insights from it. Yet procuring huge volumes of heterogeneous data remains a challenge for companies.
To counter this challenge, companies should rely on providers like Shaip, which build strict data quality management measures into their processes. At Shaip, we also continuously update our systems to meet evolving challenges.
Introduction to Shaip’s Data Quality Management
At Shaip, we understand the significance of reliable training data and the part it plays in developing ML models and in the outcomes of AI-based solutions. In addition to screening our workers for skills, we focus equally on building their knowledge base and supporting their personal development.
We follow strict guidelines and standard operating procedures implemented at all levels of the process so that our training data meets the quality benchmark.
Quality Management
Our quality management workflow has been instrumental in delivering machine learning and AI models. Built around a feedback-in-loop process, it is a scientifically tested method that has helped us successfully deliver several projects for our clients. Our quality audit process proceeds as follows:
- Contract review
- Audit checklist creation
- Document sourcing
- Sourcing 2-Layer Audit
- Annotation Text Moderation
- Annotation 2-Layer Audit
- Delivery of Work
- Client Feedback
Crowdsource Worker Selection and Onboarding
Our rigorous worker selection and onboarding process sets us apart from the competition. We follow a precise selection process to bring on board only the most skilled annotators, based on the quality checklist. We consider:
- Previous experience as a text moderator, to ensure their skills and experience match our requirements.
- Performance in previous projects, to ensure their productivity, quality and output were on par with the project needs.
- Extensive domain knowledge, which is a requisite for assigning a worker to a specific vertical.
Our selection process doesn’t end here. We subject the workers to a sample annotation test to verify their qualifications and performance. Workers are selected based on their performance in the trial, a disagreement analysis, and a Q&A session.
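The source does not describe how the disagreement analysis is carried out. As one possible illustration, the sketch below compares a candidate's trial labels against reference labels and reports where they differ. The function name `disagreement_report`, the label values, and the idea of a reference label set are all assumptions made for this example, not Shaip's actual procedure.

```python
from collections import Counter


def disagreement_report(candidate_labels, reference_labels):
    """Return (agreement_rate, disagreements_per_reference_class).

    Hypothetical helper: compares labels item by item.
    """
    if len(candidate_labels) != len(reference_labels):
        raise ValueError("label lists must have the same length")
    if not reference_labels:
        raise ValueError("at least one labeled item is required")

    # Count disagreements by the reference class, to show where the
    # candidate tends to deviate.
    disagreements = Counter(
        ref
        for cand, ref in zip(candidate_labels, reference_labels)
        if cand != ref
    )
    matches = len(reference_labels) - sum(disagreements.values())
    return matches / len(reference_labels), dict(disagreements)


# Illustrative trial data (made up for this sketch).
reference = ["toxic", "safe", "safe", "toxic", "spam"]
candidate = ["toxic", "safe", "toxic", "toxic", "safe"]
rate, by_class = disagreement_report(candidate, reference)
```

A per-class breakdown like `by_class` can feed directly into the Q&A session, since it points at the classes a candidate finds ambiguous.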
Once selected, workers undergo a thorough training session covering the project SOW, guidelines, sampling methods, tutorials, and more, depending on the project's needs.
Data Collection Checklist
Double-layered quality checks are put in place to ensure only high-quality training data is passed through to the next team.
Level 1: Quality Assurance Check
Shaip’s QA team performs the Level 1 quality check for data collection. The team checks all the documents and quickly validates them against the necessary parameters.
Level 2: Critical Quality Analysis Check
The CQA team, consisting of credentialed, experienced and qualified resources, then evaluates 20% of the retrospective samples.
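A retrospective sample of this kind can be drawn with simple random sampling. The sketch below is a minimal illustration, assuming documents are identified by string IDs. The function name `draw_retrospective_sample`, the ID format, and the use of a fixed seed for reproducibility are assumptions for this example, and the source does not state which sampling method Shaip uses.

```python
import random


def draw_retrospective_sample(document_ids, fraction=0.20, seed=None):
    """Pick a random subset of documents for a second-level audit."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ids = list(document_ids)
    # Round to the nearest whole document, but audit at least one
    # document whenever the batch is non-empty.
    sample_size = min(len(ids), max(1, round(len(ids) * fraction)))
    rng = random.Random(seed)  # a seed makes the audit sample reproducible
    return rng.sample(ids, sample_size)


# Illustrative batch of 50 documents (hypothetical IDs).
docs = [f"doc-{i:03d}" for i in range(50)]
sample = draw_retrospective_sample(docs, fraction=0.20, seed=7)
```

Recording the seed alongside the audit results lets the same sample be regenerated later if a finding is questioned.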
Some of the data sourcing quality checklist items include:
- Is the URL source authentic, and does it allow web scraping?
- Is there diversity in the shortlisted URLs so that bias can be avoided?
- Is the content validated for relevance?
- Does the content include moderation categories?
- Are priority domains covered?
- Are documents sourced with the required document type distribution in mind?
- Does each moderation class contain the minimum volume slab?
- Is the Feedback-in-loop process followed?
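Two of the checklist items above, the minimum volume per moderation class and the diversity of source URLs, lend themselves to an automated check. The following is a minimal sketch, assuming each sourced document is a dict with hypothetical `url` and `moderation_class` keys. The function names, the class names, and the minimum of 2 are illustrative assumptions, not values from Shaip's checklist.

```python
from collections import Counter
from urllib.parse import urlparse


def classes_below_minimum(documents, expected_classes, min_per_class):
    """Return {class: count} for every class short of the minimum volume."""
    counts = Counter(doc["moderation_class"] for doc in documents)
    # Counter returns 0 for classes that have no documents at all.
    return {
        cls: counts[cls]
        for cls in expected_classes
        if counts[cls] < min_per_class
    }


def domain_share(documents):
    """Return the share of documents contributed by each source domain."""
    domains = Counter(urlparse(doc["url"]).netloc for doc in documents)
    total = sum(domains.values())
    return {domain: n / total for domain, n in domains.items()}


# Illustrative data (made up for this sketch).
documents = [
    {"url": "https://news.example.com/a", "moderation_class": "hate"},
    {"url": "https://news.example.com/b", "moderation_class": "hate"},
    {"url": "https://forum.example.org/c", "moderation_class": "spam"},
    {"url": "https://blog.example.net/d", "moderation_class": "hate"},
]
shortfalls = classes_below_minimum(documents, ["hate", "spam", "violence"], 2)
shares = domain_share(documents)
```

A single domain holding a large share in `shares` would be a signal to widen the URL shortlist before annotation begins.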
Data Annotation Checklist
As with data collection, we apply a two-layer quality checklist to data annotation.
Level 1: Quality Assurance Check
This process ensures that 100% of documents are correctly validated against the quality parameters set by the team and the client.
Level 2: Critical Quality Analysis Check
This process ensures that 15 to 20% of the retrospective samples are also validated and quality assured. This step is undertaken by our qualified and experienced CQA team, whose members are Black Belt holders with a minimum of 10 years of experience in quality management.
The CQA team checks:
- Consistency in text moderation across users
- Whether the correct phrases and moderation classes are used for each document
- The metadata
We also provide daily feedback to annotators based on Pareto Analysis to ensure their performance is on par with the client’s requirements.
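Pareto analysis ranks error categories by frequency and isolates the few that account for most of the errors. The sketch below illustrates the idea, assuming each audited error has been tagged with a category string. The function name `pareto_vital_few`, the category names, and the conventional 80% cutoff are assumptions for this example and are not taken from Shaip's process.

```python
from collections import Counter


def pareto_vital_few(error_categories, cutoff=0.80):
    """Return the most frequent categories that together reach the cutoff."""
    counts = Counter(error_categories)
    total = sum(counts.values())
    vital_few = []
    cumulative = 0
    # most_common() yields categories from most to least frequent.
    for category, n in counts.most_common():
        vital_few.append(category)
        cumulative += n
        if cumulative / total >= cutoff:
            break
    return vital_few


# Illustrative error log for one day (made up for this sketch).
errors = (
    ["wrong_class"] * 12
    + ["missed_phrase"] * 5
    + ["metadata"] * 2
    + ["language"] * 1
)
focus_areas = pareto_vital_few(errors)
```

In this example two categories cover 17 of 20 errors, so daily feedback would concentrate on those two first.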
We add another layer of performance analysis that focuses on the least-performing annotators using Bottom Quartile Management. Before final delivery, we also ensure that sample hygiene checks are completed.
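Identifying the bottom quartile amounts to finding the annotators whose score falls at or below the first quartile of the group. The sketch below shows one way to do that with Python's standard library, assuming a per-annotator accuracy score is already available. The function name `bottom_quartile`, the annotator names, and the choice of accuracy as the metric are assumptions for this example.

```python
import statistics


def bottom_quartile(accuracy_by_annotator):
    """Return annotators scoring at or below the first quartile."""
    scores = list(accuracy_by_annotator.values())
    # quantiles(..., n=4) returns the three quartile cut points;
    # index 0 is the first quartile (25th percentile).
    first_quartile = statistics.quantiles(scores, n=4, method="inclusive")[0]
    return sorted(
        name
        for name, score in accuracy_by_annotator.items()
        if score <= first_quartile
    )


# Illustrative accuracy scores (made up for this sketch).
accuracy = {
    "ann_a": 0.97, "ann_b": 0.95, "ann_c": 0.93, "ann_d": 0.92,
    "ann_e": 0.90, "ann_f": 0.88, "ann_g": 0.85, "ann_h": 0.80,
}
needs_support = bottom_quartile(accuracy)
```

`statistics.quantiles` needs at least two scores on older Python versions, so very small teams would require a different rule.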
Parameter Threshold
Depending on the project guidelines and client requirements, we apply a parameter threshold of 90 to 95%. Our team is equipped and experienced to use any of the following methods to maintain high quality management standards.
- F1 Score or F Measure: judges a classifier's performance as the harmonic mean of precision and recall, F1 = 2 * (Precision * Recall) / (Precision + Recall)
- DPO or Defects per Opportunity: the number of defects divided by the total number of opportunities for a defect.
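Both metrics above reduce to a few lines of arithmetic. The sketch below computes them from raw counts and compares the F1 score against a 90% threshold. The function names, the example counts, and the assumption that total opportunities equal units multiplied by opportunities per unit are illustrative and not taken from a specific Shaip project.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1 = 2 * (precision * recall) / (precision + recall)."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0  # precision or recall is undefined; report zero
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


def defects_per_opportunity(defects, units, opportunities_per_unit):
    """DPO = defects / total opportunities for a defect."""
    total_opportunities = units * opportunities_per_unit
    if total_opportunities <= 0:
        raise ValueError("there must be at least one opportunity")
    return defects / total_opportunities


# Illustrative audit counts (made up for this sketch).
f1 = f1_score(true_positives=90, false_positives=10, false_negatives=5)
dpo = defects_per_opportunity(defects=30, units=200, opportunities_per_unit=5)
meets_threshold = f1 >= 0.90  # lower bound of the 90 to 95% range
```

Here precision is 0.90 and recall is about 0.947, giving an F1 of roughly 0.923, while 30 defects across 1,000 opportunities give a DPO of 0.03.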
Sample Audit Checklist
Shaip’s sample audit checklist is fully customizable and can be tailored to meet the demands of the project and client. It can be modified based on the feedback received from the client and finalized after a thorough discussion. The checklist includes:
- Language Check
- URL and Domain Check
- Diversity Check
- Volume per Language and moderation class
- Targeted keywords
- Document type and relevance
- Toxic phrase check
- Metadata check
- Consistency check
- Annotation class check
- Any other mandatory checks as per the client’s preference
We take stringent measures to maintain data quality standards because we understand that all AI-based models are data-driven and that high-quality training data is a requisite for every AI and machine learning model. We understand how critical quality training data is to the performance and success of your AI models.