What’s the Scoop on Effective Data Annotation?



The world of AI can be full of a lot of unknowns. That’s why we’re committed to bringing our #PeopleofAI to the forefront. We want to illuminate what they do, how they stay on top of ever-evolving trends and research, and why this matters for smart, human business. This week, our data annotator, Emily, demystifies what goes into the data annotation process — an integral piece in creating effective machine learning. Dive into Emily’s unique world here. 

Data annotation is the foundation upon which machine learning models are built. So, I’ve learned the importance of the “measure twice, cut once” proverb. I’ve been with Stradigi AI as a data annotator for two years now, and the opportunity to continue to learn every day is what keeps things exciting. Working alongside experts in the field of machine learning means I’m surrounded by really intelligent people who inspire me to work hard and constantly improve.

Laying the foundation: data in Artificial Intelligence

Is data really the “new oil”? We hear that comparison frequently, and in our office we often refer to machine learning as a “pipeline”. Data is a fundamental part of artificial intelligence (AI), and without it there would be no base upon which to build. However, data is not a new resource: it has been around for a long time, but the ways in which we manipulate it have evolved drastically. We also now have tools that help us find new data sources. Google recently released a dataset search that makes data readily available to researchers, and Kaggle offers a great dataset search of its own.

Anyone with a foundational knowledge of Python can now create their own AI, thanks to this surge in available data. Whether you’re a beginner or more advanced, there are project-based tutorials online that can guide you in creating your own AI. I’ve dabbled, trying my hand at cat and dog classification tools and small-scale image recognition. And it’s fun to see something come together, when all the pieces of the puzzle fit and your code runs properly. But it’s good to keep in mind that with great power comes great responsibility, which is why we have people involved in many aspects of the pipeline.

When creating responsible AI, putting a human in the loop helps keep the AI accountable for its actions, which in turn makes for a good, reputable AI. This is important as we look towards creating more #AIForGood, one example being our ASL coffee demo that was showcased at SXSW and C2 Montreal this year.

What is data annotation?

I get asked this question a lot, and it can be difficult to give a concise elevator pitch. Data annotation encompasses many different aspects of the AI process. Fundamentally, though, data annotation is one of the first processes in the development of any machine learning project. Data annotators respond to the need for data, sometimes setting out on an initial dataset search based on a specific need.

After wrangling the data, data scientists clean the datasets so that the models can digest them easily. Once that is complete, annotators, research scientists, and data scientists collaborate to determine which annotation method(s) to use moving forward. There are many forms of annotation (which I outline below), so it’s a process of deciding the right type for the collected data and the scope of the project. Most importantly, annotations need to meet a consistent standard of quality so that models trained and tested on them produce good results. This is key to achieving high precision in your model.
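One simple way to keep an eye on annotation quality is to have two annotators label the same samples and measure how often they agree. The sketch below is only an illustration under that assumption: the function name `percent_agreement` and the sample labels are hypothetical, not a description of the tooling we use.

```python
def percent_agreement(labels_a, labels_b):
    """Return the fraction of samples on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both annotators must label the same number of samples")
    if not labels_a:
        raise ValueError("At least one labeled sample is required")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical labels from two annotators for the same four images.
annotator_1 = ["cat", "dog", "dog", "cat"]
annotator_2 = ["cat", "dog", "cat", "cat"]

agreement = percent_agreement(annotator_1, annotator_2)
print(f"Agreement: {agreement:.0%}")
```

A low agreement score usually suggests that the labeling guidelines need to be clarified before more data is annotated.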

What are some of the data annotation forms out there?

Annotation can take several forms, and annotators often work under tight time constraints. Depending on the situation and project, data annotators can choose from a menu of annotation forms. Here are four examples of methods that we typically use when creating AI.

A/B evaluation 

This form requires making a decision between two choices. For example, if we are working on a project to differentiate between images of cats and dogs, we first need to label each image as “cat” or “dog”. The labels are then used to train the model, which is a fancy way of saying they help the machine learn what an image of a dog or a cat looks like, based on the input from the annotators.
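To make this concrete, here is a minimal sketch of what two-choice labels might look like once they are collected. The file names, the `validate` helper, and the `label_counts` helper are all made up for illustration.

```python
from collections import Counter

# Hypothetical annotation records: (image file name, label chosen by the annotator).
annotations = [
    ("img_001.jpg", "cat"),
    ("img_002.jpg", "dog"),
    ("img_003.jpg", "dog"),
    ("img_004.jpg", "cat"),
    ("img_005.jpg", "dog"),
]

ALLOWED_LABELS = {"cat", "dog"}


def validate(records):
    """Raise an error if any record uses a label outside the two allowed choices."""
    for name, label in records:
        if label not in ALLOWED_LABELS:
            raise ValueError(f"{name}: unexpected label {label!r}")
    return records


def label_counts(records):
    """Count how many images carry each label, which helps spot class imbalance."""
    return Counter(label for _, label in records)


counts = label_counts(validate(annotations))
print(counts)
```

Checking the counts before training is a cheap way to notice that one class is badly underrepresented.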

Object detection

This method is useful when each element within an image needs its own label. For example, in autonomous driving, the AI requires image input in which an annotator has labeled every tree, pedestrian, car, building, and traffic sign. This is done so that when the system in a car runs the AI, the vehicle will know not to back up into a building or hit a pedestrian. The vehicle can avoid these elements because the AI knows what they look like, thanks to detailed annotation.
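Object labels like these are commonly stored as bounding boxes: a label plus the corners of a rectangle around the object. The sketch below assumes that representation (the `Box` class and the coordinates are hypothetical) and computes intersection over union (IoU), a standard measure of how closely two boxes overlap, which can be used to compare two annotators’ boxes for the same object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A labeled, axis-aligned bounding box in pixel coordinates."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        width = max(0.0, self.x_max - self.x_min)
        height = max(0.0, self.y_max - self.y_min)
        return width * height


def iou(a: Box, b: Box) -> float:
    """Intersection over union: 1.0 for identical boxes, 0.0 for no overlap."""
    inter_w = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    inter_h = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    intersection = inter_w * inter_h
    union = a.area() + b.area() - intersection
    return intersection / union


# Two hypothetical boxes drawn around the same pedestrian by different annotators.
first = Box("pedestrian", 0, 0, 10, 10)
second = Box("pedestrian", 5, 0, 15, 10)
print(f"IoU: {iou(first, second):.2f}")
```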

Image segmentation

This method is commonly used to annotate data for AI in the medical industry. Annotators may be called upon to look through radiology images or CT scans to find anomalies, tumors, and other pathologies, using software that sets these instances apart with different colours or shapes (these are called “masks”). This enables the research scientist to train models to look for these types of anomalies with higher rates of accuracy than a radiologist might achieve on their own.
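A mask can be thought of as a grid the same size as the image, where each cell holds the class of the corresponding pixel. The toy example below assumes that representation; the class numbers and the 4×4 grid are invented for illustration, and real masks are far larger.

```python
from collections import Counter

# Hypothetical class ids used in the mask.
BACKGROUND, ANOMALY = 0, 1

# A tiny 4x4 mask: 1 marks pixels the annotator flagged as an anomaly.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]


def class_pixel_counts(mask):
    """Count how many pixels belong to each class id."""
    return Counter(value for row in mask for value in row)


def coverage(mask, class_id):
    """Return the fraction of the image covered by the given class."""
    total = sum(len(row) for row in mask)
    return class_pixel_counts(mask)[class_id] / total


print(f"Anomaly coverage: {coverage(mask, ANOMALY):.0%}")
```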

Textual data

Sentiment analysis is one form of annotation using textual data. It requires an annotator to look through thousands of written documents (such as tweets) and label each one as positive or negative, so that the AI can then use that information to detect certain feelings or behaviours. For instance, this method might be used to detect hate speech or toxic comments in online conversations. Annotators may also be required to become subject matter experts in fields such as finance or government, where they label the key phrases of sentences in a specific document. These key phrases are then clustered into corresponding groups and used to find topics of similar interest.
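As a rough illustration, sentiment labels can be stored as simple text and label pairs and then grouped for review. The example texts and the `group_by_label` helper below are hypothetical.

```python
from collections import defaultdict

# Hypothetical labeled texts: (text, sentiment chosen by the annotator).
labeled_texts = [
    ("Loved the new update!", "positive"),
    ("This app keeps crashing.", "negative"),
    ("Great support team.", "positive"),
]


def group_by_label(records):
    """Collect the texts under each sentiment label."""
    groups = defaultdict(list)
    for text, label in records:
        groups[label].append(text)
    return dict(groups)


groups = group_by_label(labeled_texts)
for label, texts in groups.items():
    print(label, len(texts))
```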

Overall, the majority of my time as an annotator is spent labeling and categorizing the data used to train and test the machine learning models. As you can imagine, annotating a few thousand samples of data can take time, but I have come to enjoy working alongside researchers and data scientists as they help me build the skills to become more experienced in my field. There’s a fine line between great annotations and a speedy process, and I like to think that a good annotator can walk that line really well. Finally, I think that good annotation is about thorough collaboration and communication with all members of the AI team, to ensure that the final result is in its best form and reflects a well-researched process.

For more information on Stradigi AI and our ML solution, Kepler, visit our platform page here.
