Abstract
The deployment of autonomous systems in retail environments is currently hampered by the difficulty object detection networks have in locating and classifying products. These challenges arise from densely packed product arrangements, limited training data, and frequent product updates. Existing object detection methods such as R-CNN and YOLO exhibit limitations in scalability and accuracy when applied to these complex environments. This work explores a new vision algorithm designed specifically for retail applications that integrates two distinct networks to address the two object detection tasks of localization and classification. First, RetinaNet localizes densely packed products, relying on its ability to detect small, overlapping objects. Then, a Siamese Neural Network (SNN) generates embeddings of the localized product images. SNNs learn to create feature embeddings from minimal training data and can support the addition of new classes without requiring network retraining. This separation of localization and classification allows each network to be fine-tuned for its specific purpose, resulting in an object detection algorithm that can perform adequately in the retail landscape.
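
As an illustration of the two-stage design, the following minimal sketch shows how localized crops might be classified by comparing their embeddings against a gallery of reference embeddings, and how a new product could be added without retraining. It is not the implementation evaluated in this work: `EmbeddingGallery`, `detect_products`, and the `localize` and `embed` callables are hypothetical names standing in for the RetinaNet localizer and the trained SNN branch, and cosine similarity with nearest-reference matching is an assumed choice of comparison.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero-length embedding")
    return dot / (norm_a * norm_b)


class EmbeddingGallery:
    """Hypothetical store of reference embeddings, one or more per product."""

    def __init__(self):
        self._references = []  # list of (label, embedding) pairs

    def add_product(self, label, embedding):
        # Registering a new product only stores its embedding;
        # no network weights are updated.
        self._references.append((label, list(embedding)))

    def classify(self, embedding):
        # Return the label of the most similar reference and its score.
        if not self._references:
            raise ValueError("gallery is empty")
        scored = (
            (label, cosine_similarity(embedding, ref))
            for label, ref in self._references
        )
        return max(scored, key=lambda pair: pair[1])


def detect_products(image, localize, embed, gallery):
    """Two-stage detection: localize crops, then classify each by embedding.

    `localize` (assumed) yields (box, crop) pairs, standing in for RetinaNet.
    `embed` (assumed) maps a crop to a vector, standing in for the SNN branch.
    """
    results = []
    for box, crop in localize(image):
        label, score = gallery.classify(embed(crop))
        results.append((box, label, score))
    return results
```

In this sketch, supporting a new product amounts to one `add_product` call with an embedding of a reference image, which mirrors the property that new classes do not require retraining the embedding network.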