Detecting Online Counterfeit-goods Seller using Connection Discovery
Ming Cheung, James She, Weiwei Sun, and Jiantao Zhou
Published in ACM Trans. Multimedia Comput. Commun. Appl., 2019
1. Introduction
It was hard to become an online seller in the past, as technical knowledge of system design and programming was required. With social media and e-commerce platforms, sellers can now set up an online store conveniently. As it is effortless to become a seller on such platforms, the number of counterfeit products being sold online has increased substantially. Most online sellers do not have an offline store, and it is impossible to inspect their goods before buying them, which makes detecting such products difficult. E-commerce giants like Amazon and Alibaba dedicate significant effort to removing counterfeit products. However, counterfeit goods are still widely accessible because they must be manually flagged and removed. This approach creates challenges for e-commerce platform owners, brand owners, and customs officers. Using manual effort to detect a small number of counterfeit sellers among hundreds of thousands of sellers is inefficient, and it does not scale with the rapid growth of online sellers. A practical detection framework may help reduce the search space by identifying potential counterfeit sellers for further investigation. Traditionally, counterfeit sellers have been detected by searching for suitable tags and browsing sellers whose products contain those tags. However, such sellers may use disguising text such as “toy watch” or “high-quality shoes,” meaning a text-based detection system will fail to find them.
This article conducts an intensive study of counterfeit and non-counterfeit sellers on social media using connection discovery. The sellers are labelled manually by surveying 40 experienced online shoppers, who mark each seller independently. A classifier is then built on a fine-tuned CNN using the connections discovered among sellers. The authors make the following contributions: (1) They collect 473K shared images from Taobao, Instagram and Carousell. (2) They analyze the behaviours of counterfeit sellers using object recognition and connection discovery approaches and show that related sellers share similar images. (3) They propose an optimized framework to detect counterfeit sellers using connections discovered from their shared images. (4) They present a showcase demonstrating that the proposed framework still works even if the counterfeit sellers edit the images using various methods.
2. Related Work
A summary of the literature review of this paper is shown in Figure 1. Many studies related to abnormal user detection have been carried out; here they are divided into four main parts, namely abnormal users, content sharing behaviour, counterfeiting activities, and connection discovery, with only a limited number of works included in each part so that the summary is easy to follow.
3. Method
3.1 Image Encoding using Convolutional Neural Network
The new idea proposed in this section is to optimize the CNN parameters W to be sensitive to visual features that can help detect counterfeit sellers. One possible goal is to minimize the distance between fake sellers' images, minimize the distance between non-fake sellers' images, and maximize the distance between the images of fake and non-fake sellers. As a result, images from related pairs are more likely to be annotated with similar machine-generated labels, while images from unrelated pairs are less likely to be annotated with the same machine-generated labels. Because of this, fake sellers have more similar seller profiles, and training classifiers to identify those sellers becomes more effective. The distance between two images, x1 and x2, can be calculated by:
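A minimal sketch of the objective described above, assuming the distance is the Euclidean distance between the CNN outputs of the two images and assuming a contrastive-style loss. Both choices, the function names, and the `margin` parameter are illustrative assumptions and are not taken from the paper; the toy vectors stand in for real CNN features.

```python
import math


def euclidean_distance(u, v):
    """Euclidean (L2) distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, same_class, margin=1.0):
    """Contrastive-style loss for one image pair (assumed objective).

    same_class=True:  the loss grows with distance, pulling the pair together.
    same_class=False: the loss is non-zero only while the pair is closer
                      than `margin`, pushing the pair apart.
    `margin` is a hypothetical hyperparameter, not a value from the paper.
    """
    d = euclidean_distance(u, v)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Toy 2-D vectors standing in for CNN outputs of three images.
fake_a = [0.0, 0.0]
fake_b = [0.3, 0.4]
genuine = [3.0, 4.0]

print(contrastive_loss(fake_a, fake_b, same_class=True))    # small: pair is already close
print(contrastive_loss(fake_a, genuine, same_class=False))  # zero: pair is already far apart
```

Minimizing this loss over many pairs would move W toward features that keep images of the same seller type close and images of different seller types apart.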
3.2 Discovering the Connections among Sellers
The relationship between two sellers is defined as their similarity S_{i,j}. Each image is encoded into a feature vector using image processing and computer vision techniques, as shown in step 1 of Figure 3. The feature vectors of all user-shared images are grouped [15] into K clusters, and each image is annotated with a unique machine-generated label representing its cluster, as shown in steps 2 and 3 of Figure 3. Seller i is then profiled by a K-dimensional vector, L_i, which describes the distribution of the K unique labels among the images shared by this seller:
where · is the dot product of two vectors and ||·|| is the L2 norm of a vector. The system overview of connection discovery in the proposed framework is shown in Figure 2.
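The dot product and L2 norm mentioned above suggest that the similarity is a cosine similarity, S_{i,j} = (L_i · L_j) / (||L_i|| ||L_j||). The sketch below works under that assumption. It also assumes that clustering has already been done, so each image is represented only by its cluster label; the function names are hypothetical.

```python
import math
from collections import Counter


def seller_profile(image_labels, k):
    """Return the K-dimensional label distribution L_i for one seller.

    image_labels: the cluster label (0 .. k-1) assigned to each shared image.
    """
    total = len(image_labels)
    if total == 0:
        return [0.0] * k
    counts = Counter(image_labels)
    return [counts.get(label, 0) / total for label in range(k)]


def cosine_similarity(a, b):
    """Cosine similarity between two profiles; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy example with K = 3 machine-generated labels.
K = 3
profile_i = seller_profile([0, 0, 1, 1], K)  # images fall in clusters 0 and 1
profile_j = seller_profile([0, 1, 0, 1], K)  # same distribution as seller i
profile_k = seller_profile([2, 2, 2], K)     # images fall only in cluster 2

print(cosine_similarity(profile_i, profile_j))  # close to 1: similar sellers
print(cosine_similarity(profile_i, profile_k))  # 0: no labels in common
```

Because the profile is a normalized histogram, sellers with different numbers of images can still be compared directly.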
This section introduces the hypothesis testing used to detect counterfeit sellers. We consider the following two hypotheses:
- H0: Seller i is a fake seller;
- H1: Seller i is not a fake seller.
A decision is called a true positive when the algorithm chooses H1 and H1 is true. If the algorithm instead chooses H0, it is a false negative. Likewise, when H0 is in fact true, choosing H1 is a false positive. Finally, choosing H0 when H0 is true is a true negative. A binary classifier is proposed for the hypothesis testing. In training, sellers are labelled as fake or non-fake, with expected outputs of 1 and 0, respectively. The binary classifier therefore gives an output ranging from 0 to 1, with 0 representing certainty that the user is a non-fake seller and 1 representing certainty that the user is a fake seller. If the output is greater than the threshold, H0 is accepted; otherwise, it is rejected.
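The decision rule can be sketched as follows. This is a minimal illustration: the function names, the toy outputs, and the threshold of 0.5 are assumptions, and the outcome counters use neutral names rather than the positive/negative terminology.

```python
def decide(output, threshold):
    """Accept H0 (fake seller) when the classifier output exceeds the threshold."""
    return "H0" if output > threshold else "H1"


def tally_outcomes(outputs, is_fake, threshold):
    """Count the four possible outcomes of the decision rule.

    outputs: classifier outputs in [0, 1], one per seller.
    is_fake: ground-truth flags, True for a fake seller.
    """
    counts = {
        "fake_flagged": 0,     # fake seller, H0 accepted
        "fake_missed": 0,      # fake seller, H0 rejected
        "genuine_flagged": 0,  # non-fake seller, H0 accepted
        "genuine_cleared": 0,  # non-fake seller, H0 rejected
    }
    for output, fake in zip(outputs, is_fake):
        flagged = decide(output, threshold) == "H0"
        if fake:
            counts["fake_flagged" if flagged else "fake_missed"] += 1
        else:
            counts["genuine_flagged" if flagged else "genuine_cleared"] += 1
    return counts


# Toy classifier outputs for four sellers (illustrative values only).
outputs = [0.92, 0.35, 0.71, 0.08]
is_fake = [True, True, False, False]
print(tally_outcomes(outputs, is_fake, threshold=0.5))
```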
3.4 Dataset
The data is collected from Taobao, a popular Chinese e-commerce platform. The data can be divided into two categories, namely shoes and cosmetics. Information, including prices and product images, is collected using Octopus. For each seller, 80 products are selected from the product list, and all images of each product are collected. To avoid the distraction of thumbnails and ad images, the minimum size of captured images is set to 400 × 400. In total, 101,090 and 51,870 images were collected from 93 shoe sellers and 100 cosmetics sellers, respectively. There are 38 counterfeit and 55 non-fake sellers among the shoe sellers, and 23 counterfeit and 77 non-fake sellers among the cosmetics sellers. Sellers were labelled manually by surveying 40 experienced online shoppers, who tagged each seller independently, taking into account the seller's pages, images and prices. Finally, a statistical value of the survey results is used as the label for each seller.
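Two of the preparation steps above can be sketched in a few lines. The size filter follows the 400 × 400 minimum stated in the text. The label aggregation is only an assumption: the summary does not say which statistic is used, so a simple majority vote is shown here as one possible choice. All function names are hypothetical.

```python
def keep_image(width, height, min_side=400):
    """Keep an image only if both sides are at least `min_side` pixels."""
    return width >= min_side and height >= min_side


def majority_label(votes):
    """Aggregate independent votes (True = counterfeit) by simple majority.

    Majority voting is an assumed aggregation rule, not the paper's stated one.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    return sum(votes) > len(votes) / 2


# Toy image sizes: a thumbnail, a banner ad, and a full product photo.
sizes = [(120, 120), (800, 200), (800, 800)]
print([keep_image(w, h) for w, h in sizes])

# Toy survey: 27 of 40 shoppers mark the seller as counterfeit.
votes = [True] * 27 + [False] * 13
print(majority_label(votes))
```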
4. Result
Figures 7(a) and 7(b) show the results of counterfeit seller identification against the number of training iterations, for shoe and cosmetics sellers, respectively. The experiment is repeated 100 times, and the mean of the 100 results is taken. In each trial, 90% of users are randomly selected as the training set, and the rest form the testing set. Note that CD and CDFT use K = 1,000 for comparison with Object, which has 1,000 class labels. It is observed that the performance improves with more iterations, and CDFT outperforms the other two approaches, with about 80% accuracy for both shoe and cosmetics sellers. CD is the second-best performer, as Object fails to represent users. More discussion can be found in the next section. It is also interesting to investigate how K affects the performance of the detection.
Figures 3(a) and 3(b) show the results of counterfeit seller identification with different K for shoe and cosmetics sellers, respectively. The experiment is repeated 20 times, and the mean of the 20 results is taken. In each trial, 90% of users are randomly selected as the training set, and the rest form the testing set. The performance at the 10th iteration is used for the evaluation. The result is also compared with the Siamese network (CDFT-Sim) [2, 13, 28], for which the objective is that the output of the CNN is a vector that maximizes the distances between images of unrelated pairs and minimizes the distances between images of related pairs. It is observed that CDFT outperforms the CD and CDFT-Sim approaches for all values of K, and that the performance is similar across different values of K. More discussion on finding an optimized K can be found in a later section.
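The repeated random-split protocol can be sketched as below. The `evaluate` callback, the fixed seed, and the majority-class baseline are illustrative assumptions; in the actual experiments the callback would train and test the CD, CDFT, or Object classifier.

```python
import random
from statistics import mean


def repeated_holdout(n_items, evaluate, trials=100, train_fraction=0.9, seed=0):
    """Average a score over repeated random train/test splits.

    evaluate(train_idx, test_idx) must return a numeric score for one trial.
    """
    rng = random.Random(seed)
    indices = list(range(n_items))
    n_train = round(train_fraction * n_items)
    scores = []
    for _ in range(trials):
        rng.shuffle(indices)  # new random split for every trial
        scores.append(evaluate(indices[:n_train], indices[n_train:]))
    return mean(scores)


# Toy labels mirroring the shoe sellers: 38 counterfeit (1) and 55 non-fake (0).
labels = [1] * 38 + [0] * 55


def majority_baseline(train_idx, test_idx):
    """Hypothetical stand-in for a real classifier: predict the majority class."""
    train_pos = sum(labels[i] for i in train_idx)
    guess = 1 if train_pos * 2 > len(train_idx) else 0
    return sum(labels[i] == guess for i in test_idx) / len(test_idx)


mean_accuracy = repeated_holdout(len(labels), majority_baseline)
print(mean_accuracy)
```

Averaging over many random splits reduces the variance that a single 90/10 split would have on a dataset of only about 100 sellers.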
Figure 4. The precision of detection for different values of K for (a) shoe sellers and (b) cosmetics sellers.
Figures 4(a) and 4(b) show the precision p of counterfeit seller identification with different thresholds a for shoe and cosmetics sellers, respectively. The training is repeated 100 times, and the classifier output Λ of every seller in the testing sets is recorded. Each Λ is then tested against different values of a. It is observed that p increases with a and reaches a maximum of 0.94 and 0.95, respectively, when a equals 0.99. This means that among sellers with Λ greater than 0.99, about 95% are counterfeit sellers. Note that a high p results in a lower recall, so many counterfeit sellers are not detected.
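The effect of the threshold can be illustrated with a short sketch. It assumes that a seller is flagged when its output exceeds the threshold and that the quantity traded off against precision is recall. The function name and the toy outputs are illustrative and are not data from the paper.

```python
def precision_recall(outputs, is_fake, threshold):
    """Precision and recall when sellers with output > threshold are flagged.

    Returns None for a metric whose denominator is zero.
    """
    flagged_truth = [fake for out, fake in zip(outputs, is_fake) if out > threshold]
    correctly_flagged = sum(flagged_truth)
    total_fake = sum(is_fake)
    precision = correctly_flagged / len(flagged_truth) if flagged_truth else None
    recall = correctly_flagged / total_fake if total_fake else None
    return precision, recall


# Toy outputs for six sellers (illustrative values only).
outputs = [0.995, 0.97, 0.80, 0.60, 0.30, 0.10]
is_fake = [True, True, False, True, False, False]

for threshold in (0.5, 0.9, 0.99):
    print(threshold, precision_recall(outputs, is_fake, threshold))
```

In this toy data, raising the threshold from 0.5 to 0.99 raises precision from 0.75 to 1.0 while recall falls from 1.0 to one third, which is the trade-off described above.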
5. Conclusion
This article proposes a CNN-based connection discovery framework to detect online counterfeit sellers, who previously had to be detected manually. Experiments are conducted with real-life datasets of over 450K shared images from the social media and e-commerce platforms Taobao, Instagram, and Carousell. The framework can achieve an F1 score of over 90% for detecting counterfeit sellers. To the best of the authors' knowledge, this is the first work to detect online counterfeit sellers from their shared images.
6. Discussion
The proposed approach shows promising results in detecting fake sellers, but it is limited by its training set. Although manual effort is required to verify whether new types of images appear among fake sellers, the proposed approach can detect most fake sellers and learn new types through continuous training. CDFT-Sim only slightly improves on the performance of CD, while CDFT significantly improves the results. A possible reason is that CDFT-Sim requires more data for training and may therefore give unstable results. In addition, CDFT-Sim is more computationally intensive. CDFT-Sim is therefore better suited to tasks for which CDFT cannot be trained, such as optimizing a CNN for follower/followee recommendations. Besides, the authors label the dataset manually, so the training data is limited and labels must be re-verified when new data comes in; continuous learning could improve the performance of the algorithm. Another open question, in my own opinion, is how to tell copied items apart from the original items when the two products are almost identical.
7. Future Work
There are two ways to improve performance. The first is to combine Object with CDFT. While Object cannot help separate counterfeit from non-fake sellers, it can help group sellers who sell similar products, and CDFT can build on those results. The second direction is to investigate different parameters, such as K, the number of unique machine-generated labels. Although the proposed algorithm evidently works with different values of K, it is not clear how to optimize the selection; an optimized K may provide better performance. Besides, the approach could be tried on images altered by other editing methods, such as copy-move, watermarking, rotation, and flipping. Another perspective for future work is to use other features to detect counterfeit sellers, for example the CNN-based object detection algorithm that joins semantic segmentation for images [2]. Additional features such as image metadata, user reviews, and user profiles may also give better results.
8. References
- Ming Cheung, James She, Weiwei Sun, and Jiantao Zhou. 2019. Detecting Online Counterfeit-goods Seller using Connection Discovery. ACM Trans. Multimedia Comput. Commun. Appl. 15, 2, Article 35 (May 2019), 16 pages.
- Baohua Qiang, Ruidong Chen, Minghao Yang, Yijie Zhai, Yuanchao Pang, and Mingliang Zhou. 2020. Convolutional Neural Networks-Based Object Detection Algorithm by Jointing Semantic Segmentation for Images. MDPI Sensors. Article 20 (September 2020), 14