What it Takes to Build Your Own Computer Vision Solution

Nomad Go
Jan 26, 2021

Insights from Nomad Go CEO and Co-founder David Greschler.

When the World Wide Web hit the market in the mid-90s, doing much of anything with it — such as building a website — required programming skills in Java, HTML or a handful of other languages. And anything beyond that, such as adding e-commerce capabilities or running ads, required a custom-built solution. People with those skills were in short supply, and coding tools that could help were even scarcer. If you didn’t have the requisite skills, you had to pay consultants who did.

Around the same time, Microsoft bought FrontPage, an HTML editor that offered fairly basic tools and templates for businesses to build their own websites. Nothing fancy, but it gave businesses a foot in the door to build an online presence without breaking the bank. These days there’s such an abundance of tools that you can build a fairly functional site while streaming the latest episode of your favorite Netflix show (not that I would advise it).

Fast forward 25 years and computer vision (CV) stands to be the next wave of innovation. The latest projections from Forrester, Gartner Research and other analyst firms indicate that CV is at the top of the list of investments for many companies.

But today’s CV ecosystem looks much the same as the web did roughly 25 years ago. Building a solution requires the expertise of developers, data scientists and other technical specialists across many disciplines, which can take several weeks, or longer. The diagram below gives you a general idea of the various elements involved in building a computer vision solution from the ground up, and the decisions that are involved.

Defining your use case is the first step to building any solution. Clearly identifying your business problem will go a long way toward ensuring that your solution generates the raw data you need, and will be a guidepost as you build your solution.

The computer vision framework is the foundation of your solution. It’s primarily made up of a library of developer tools, and there are a lot of frameworks out there. Each one is designed with a specific set of performance characteristics, and some for specific use cases, so picking the right one can be a serious undertaking. A few of the frameworks worth mentioning are PyTorch (developed by Facebook), TensorFlow (Google), OpenVINO (Intel), and Core ML (Apple), among others.
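To give a rough sense of what working with one of these frameworks looks like, here is a minimal sketch that loads a pretrained object detector with PyTorch and torchvision and runs it on a single image. The image file name and confidence threshold are placeholders, not recommendations:

```python
# A rough illustration, not a recommendation: load a pretrained object detector
# with PyTorch/torchvision and run it on one image. "shelf.jpg" and the 0.8
# threshold are placeholder values.
import torch
from torchvision.io import read_image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import convert_image_dtype

model = fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()  # inference mode

image = convert_image_dtype(read_image("shelf.jpg"), torch.float)  # CHW tensor in [0, 1]
with torch.no_grad():
    prediction = model([image])[0]  # torchvision detectors take a list of images

for label, score in zip(prediction["labels"], prediction["scores"]):
    if score > 0.8:  # keep only confident detections
        print(f"class {label.item()} detected with confidence {score:.2f}")
```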

The machine learning model sits on top of the technical framework. It processes images and identifies the objects within them. Until recently, most companies focused on developing computer vision models that run in the cloud, powering services such as facial and image recognition features on social media platforms or photo management applications. But an increasing number of companies are building models that can run on the network’s edge, whether on smart cameras and smart devices (like phones and tablets), or in a network appliance such as Microsoft Azure Stack Edge or Amazon Panorama.
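To make the cloud-versus-edge distinction a little more concrete, here is a hedged sketch of one common path to edge deployment: exporting a trained PyTorch model to the ONNX format, which many edge runtimes and appliances can consume. The model choice, input size and file name below are placeholders:

```python
# Sketch: export a trained PyTorch model to ONNX so an edge runtime can load it.
# The model choice, input size and file names are placeholders, not prescriptions.
import torch
from torchvision.models import mobilenet_v2

model = mobilenet_v2(pretrained=True)  # a small model suited to edge devices
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)  # example input shape the model expects
torch.onnx.export(
    model,
    dummy_input,
    "model_edge.onnx",          # artifact you would ship to the device or appliance
    input_names=["image"],
    output_names=["scores"],
    opset_version=13,
)
```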

Providers of machine learning models:

* Amazon Rekognition

* Microsoft Cognitive Services

* Clarifai (Cloud)

* Deepomatic (Cloud)

* Apple (Edge)

A handful of consultancies can also customize a model for a cloud or edge-based scenario.
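As one illustration of how these hosted models are typically consumed, the sketch below sends a single image to Amazon Rekognition’s label-detection API with boto3. The file name, label limit and confidence threshold are placeholder values:

```python
# Sketch: send one image to a hosted model (Amazon Rekognition) and read back labels.
# File name, MaxLabels and MinConfidence are placeholder values for illustration.
import boto3

client = boto3.client("rekognition")

with open("store_shelf.jpg", "rb") as f:
    image_bytes = f.read()

response = client.detect_labels(
    Image={"Bytes": image_bytes},
    MaxLabels=10,
    MinConfidence=80,
)

for label in response["Labels"]:
    print(f"{label['Name']}: {label['Confidence']:.1f}%")
```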

Training and testing your machine learning model is essential before plugging it into a production environment. This step can’t be emphasized enough: without large volumes of high-quality training data, your machine learning models won’t accurately recognize the image or video data they’re processing, resulting in less than optimal performance. A training data provider helps you teach your machine learning model what to look for. The training data market has exploded in recent years due to the high demand for annotated data across many data types, and a number of companies now specialize in this space. More recently, synthetic training data (i.e. computer-generated training data) has emerged, with companies such as Parallel Domain using computers to accelerate the annotation process.
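To make the training-data step more tangible, here is a small sketch of what annotated examples might look like and how they are commonly split into training and validation sets before training. The labels, file names and 80/20 split are illustrative assumptions:

```python
# Sketch: annotated examples (bounding boxes + labels) and a train/validation split.
# Labels, file names and the 80/20 split are illustrative placeholders.
import random

annotations = [
    # Each record pairs an image with the objects a human (or synthetic pipeline) labeled in it.
    {"image": "frame_0001.jpg", "boxes": [[34, 50, 210, 180]], "labels": ["soda_can"]},
    {"image": "frame_0002.jpg", "boxes": [[12, 40, 96, 150], [110, 42, 200, 160]],
     "labels": ["soda_can", "water_bottle"]},
    {"image": "frame_0003.jpg", "boxes": [], "labels": []},  # negative example: empty shelf
]

random.seed(42)              # reproducible split
random.shuffle(annotations)
split = int(0.8 * len(annotations))
train_set, val_set = annotations[:split], annotations[split:]

print(f"{len(train_set)} training examples, {len(val_set)} validation examples")
```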

Cameras and sensors are the cherry on top of your solution, assuming you’ve built a CV solution for physical environments (in online scenarios, your computer or smartphone plays the role of the device). Once your machine learning model is trained, it’s ready to be hooked up to a sensor or other network device. There are four main options, each with different capabilities:

* Traditional cameras are hooked up to a network video recorder, which in turn connects to an AI service in the cloud. Video footage is periodically uploaded to the machine learning model to be analyzed.

* Purpose-built smart cameras with built-in chips make it possible to process images on the camera itself. Nvidia and Sony have both invested heavily in this area. This is a very promising approach, as it means you can dramatically reduce cloud inference requirements as well as bandwidth needs (since you don’t have to send images over the network for processing).

* Mass-produced smart devices (Apple and Android) have the compute power to serve as dedicated smart cameras that analyze imagery on the device and send data to the cloud before deleting the images (a rough version of this flow is sketched after this list). Like smart cameras, they can run all inference on the edge and post-process the collected data before sending it to the cloud, which dramatically reduces cloud inference requirements and bandwidth needs.

* Dedicated appliances, such as AWS Panorama and Azure Stack Edge, sit at the network’s edge and can be connected to numerous cameras, both traditional and smart. These edge appliances analyze the footage, and then send data to the cloud.
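Here is a hedged sketch of the edge-style flow described above: capture frames from a camera with OpenCV, run inference locally, and send only lightweight counts (never raw images) upstream. The camera index, sampling interval, endpoint URL and placeholder model call are all assumptions:

```python
# Sketch: read frames on the edge device, run inference locally, and ship only
# metadata (counts) to the cloud. Camera index, interval and endpoint are placeholders.
import time
import cv2        # OpenCV for camera capture
import requests   # used only to post small JSON payloads, never raw images

CLOUD_ENDPOINT = "https://example.com/api/counts"   # hypothetical ingestion endpoint

def count_objects(frame) -> int:
    """Placeholder for a local model call (e.g., the ONNX model exported earlier)."""
    return 0  # a real implementation would return the number of detected objects

cap = cv2.VideoCapture(0)   # 0 = default camera on the device
try:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        payload = {"timestamp": time.time(), "count": count_objects(frame)}
        requests.post(CLOUD_ENDPOINT, json=payload, timeout=5)
        time.sleep(10)      # sample every 10 seconds; frames never leave the device
finally:
    cap.release()
```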

Organizing your data is critical to ensuring it is clean (i.e. free of redundancies, formatting errors and incomplete records). This is a complex process that requires someone with data science skills, as well as an understanding of your business needs.
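As a small, hedged illustration of that cleanup step, the sketch below uses pandas to drop duplicate and incomplete detection records and normalize a timestamp column. The column names are assumptions, not a fixed schema:

```python
# Sketch: basic cleanup of detection records with pandas.
# Column names ("timestamp", "device_id", "label", "count") are assumed for illustration.
import pandas as pd

records = pd.DataFrame([
    {"timestamp": "2021-01-26 09:00:00", "device_id": "cam-01", "label": "soda_can", "count": 12},
    {"timestamp": "2021-01-26 09:00:00", "device_id": "cam-01", "label": "soda_can", "count": 12},  # duplicate
    {"timestamp": "2021-01-26 09:10:00", "device_id": "cam-01", "label": None,       "count": 9},   # incomplete
    {"timestamp": "2021-01-26 09:20:00", "device_id": "cam-02", "label": "soda_can", "count": 7},
])

clean = (
    records
    .drop_duplicates()                                             # remove redundant rows
    .dropna(subset=["label"])                                      # drop incomplete records
    .assign(timestamp=lambda df: pd.to_datetime(df["timestamp"]))  # consistent formatting
)

print(clean)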

Finally, you can process the data and extract any insights that are relevant to your business. With this in mind, solution providers have developed industry-specific solutions for the built environment, ranging from retail and commercial real estate to healthcare, manufacturing and agriculture. However, this is often a costly endeavor, not unlike hiring the web consultants of the 1990s.
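As a final hedged sketch, here is one way cleaned detection records might be turned into a business insight, such as the average detected count per device per hour. The metric and column names are illustrative assumptions:

```python
# Sketch: turn cleaned detection records into a simple business metric
# (average detected count per device per hour). Column names are assumed.
import pandas as pd

clean = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-01-26 09:05", "2021-01-26 09:45", "2021-01-26 10:15", "2021-01-26 10:55",
    ]),
    "device_id": ["cam-01", "cam-01", "cam-01", "cam-01"],
    "count": [12, 9, 7, 10],
})

hourly = (
    clean
    .set_index("timestamp")
    .groupby("device_id")["count"]
    .resample("1H")
    .mean()
    .rename("avg_count_per_hour")
)
print(hourly)
```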

Computer vision holds great business potential, but like the early days of the web, you have to make decisions every step of the way: from which technical framework and machine learning model to use, to the pros and cons of going the DIY route with your training data, to the best way of deploying your solution to get the data you need. It can also be expensive, since you need to integrate the different pieces yourself to get actionable data.

Turns out, CV doesn’t have to be a big hit on your budget, and it doesn’t require building a solution from scratch. That’s where Nomad Go comes in. Next time we’ll talk all about it.

Nomad Go’s end-to-end computer vision solution.

Originally published at https://www.nomad-go.com on January 26, 2021.



Nomad Go uses computer vision to make spaces healthier, energy efficient and smarter.