
Thursday, 07 October 2021

Summary Research Paper "Detecting Online Counterfeit-goods Seller using Connection Discovery"


Detecting Online Counterfeit-goods Seller using Connection Discovery

Ming Cheung, James She, Weiwei Sun, and Jiantao Zhou

Published at ACM Trans. Multimedia Comput. Communication 2019


1. Introduction

In the past, it was hard to become an online seller because technical knowledge of system design and programming was required. With social media and e-commerce platforms, sellers can now set up an online store conveniently. As it is effortless to become a seller on such platforms, the number of counterfeit products being sold online has increased substantially. Most online sellers do not have an offline store, and it is impossible to inspect their goods before buying them, which makes detecting such products difficult. E-commerce giants like Amazon and Alibaba dedicate significant effort to removing counterfeit products. However, counterfeit goods are still widely accessible because they must be manually flagged and removed. This approach creates challenges for e-commerce platform owners, brand owners, and customs officers. Using manual effort to find a small number of counterfeit sellers among hundreds of thousands of sellers is inefficient, and it does not scale with the rapid growth of online sellers. A practical detection framework may help reduce the search space by identifying potential counterfeit sellers for further investigation. Traditionally, counterfeit sellers have been detected by searching for suitable tags and browsing sellers whose products contain those tags. However, such sellers may use disguising text such as “toy watch” or “high-quality shoes,” meaning a text-based detection system will fail to find them.

This article conducts an intensive study of counterfeit and non-counterfeit sellers on social media using connection discovery. The sellers are labelled manually by surveying 40 experienced online shoppers, who mark each seller independently. A classifier based on a fine-tuned CNN is then built using the connections among sellers. The article makes the following contributions: (1) The authors collect 473K shared images from Taobao, Instagram, and Carousell. (2) They analyze the behaviour of counterfeit sellers using object recognition and connection discovery approaches and show that related sellers share similar images. (3) They propose an optimized framework to detect counterfeit sellers using connections discovered from their shared images. (4) They present a showcase demonstrating that the proposed framework can work even if the counterfeit sellers edit the images using various methods.


    2. Related Work

    A summary of the paper's literature review is shown in Figure 1. Many studies related to abnormal user detection have been carried out. I divide them into four main groups (abnormal users, content sharing behaviour, counterfeiting activities, and connection discovery) and limit each group so that it is easy to follow.

Figure 1. Research state of the art


    3. Method

    3.1 Image Encoding using Convolutional Neural Network

        The new idea proposed in this section is to optimize the CNN weights W so that they are sensitive to visual features that help detect counterfeit sellers. One possible goal is to minimize the distance between images from fake sellers, minimize the distance between images from non-fake sellers, and maximize the distance between images from fake and non-fake sellers. As a result, images from related pairs are more likely to be annotated with the same machine-generated labels, while images from unrelated pairs are less likely to be. Fake sellers then have more similar seller profiles, and training a classifier to identify them becomes more effective. The distance between two images, x1 and x2, is calculated from their encoded vectors f(x1, W) and f(x2, W), where f(·, W) is the encoded vector of an image using a CNN with weights W. It is desirable to obtain W such that the distance between related sellers' images is small and the distance between unrelated sellers' images is large.
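To make the idea of a distance between encoded images concrete, here is a minimal sketch. It assumes the distance is the Euclidean distance between the two encoded vectors, which is my assumption and not something this summary states. The function name encoding_distance and the three toy vectors are hypothetical; in practice the vectors would come from the CNN.

```python
import math


def encoding_distance(f1, f2):
    """Euclidean distance between two encoded image vectors.

    The Euclidean metric is an assumption made for illustration.
    """
    if len(f1) != len(f2):
        raise ValueError("encodings must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))


# Hypothetical 3-dimensional encodings f(x, W) for three images.
fake_a = [0.9, 0.1, 0.0]
fake_b = [0.8, 0.2, 0.0]
genuine = [0.1, 0.1, 0.9]

related = encoding_distance(fake_a, fake_b)     # two fake sellers' images
unrelated = encoding_distance(fake_a, genuine)  # fake vs. non-fake image

# A well-tuned W should keep related pairs closer than unrelated pairs.
print(related < unrelated)
```

With weights W tuned as described above, related pairs should end up with a small distance and unrelated pairs with a large one.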

    3.2 Discovering the Connections among Sellers

        The relationship between two sellers i and j is defined as their similarity, S_{i,j}. Each image is encoded into a feature vector using image processing and computer vision techniques (step 1 of Figure 2). The feature vectors of all images shared by the users are grouped [15] into K clusters, and each image is annotated with the unique machine-generated label of its cluster (steps 2 and 3 of Figure 2). Seller i is then profiled by a K-dimensional vector, L_i, which describes the distribution of the K unique labels among the images shared by this seller:

\[ L_i = (l_{i,1}, l_{i,2}, \ldots, l_{i,K}) \]

where l_{i,k}, for k = 1, ..., K, is the frequency of the kth label in the profile L_i, and K is the total number of unique labels in the system (step 4 of Figure 2). If the images are encoded using an object-based encoder, such as ResNet [9], then images in the same cluster tend to show the same object. Given the profiles L_i and L_j of sellers i and j, their similarity S_{i,j} can be evaluated through the images they share by:

\[ S_{i,j} = \frac{L_i \cdot L_j}{\|L_i\| \, \|L_j\|} \]

where · is the dot product of two vectors and ||·|| is the L2 norm of a vector. The system flow of connection discovery in the proposed framework is shown in Figure 2.
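The profiling and similarity steps can be sketched in a few lines. This is only an illustration: it assumes every image has already been assigned a cluster label between 0 and K-1, and the names seller_profile and similarity, as well as the toy label lists, are hypothetical. Raw counts are used as frequencies, which does not change the result because the similarity is normalized by the L2 norms.

```python
import math
from collections import Counter


def seller_profile(image_labels, num_labels):
    """Build the K-dimensional profile: how often each label occurs."""
    counts = Counter(image_labels)
    return [counts.get(k, 0) for k in range(num_labels)]


def similarity(profile_i, profile_j):
    """Dot product of two profiles divided by the product of their L2 norms."""
    dot = sum(a * b for a, b in zip(profile_i, profile_j))
    norm_i = math.sqrt(sum(a * a for a in profile_i))
    norm_j = math.sqrt(sum(b * b for b in profile_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # a seller without images has no measurable similarity
    return dot / (norm_i * norm_j)


K = 4  # toy number of unique labels
L_a = seller_profile([0, 0, 1, 3], K)  # seller a shares four images
L_b = seller_profile([0, 1, 1, 3], K)  # seller b shares similar images
L_c = seller_profile([2, 2, 2], K)     # seller c shares unrelated images

print(similarity(L_a, L_b))  # high: the two sellers share similar labels
print(similarity(L_a, L_c))  # zero: no labels in common
```

Sellers whose images fall into the same clusters get a similarity close to 1, while sellers with no labels in common get 0.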

  

Figure 2. System flow of connection discovery

 
 
    3.3 Hypothesis Testing Detection of Counterfeit Sellers

      This section introduces the hypothesis test used to detect counterfeit sellers. We consider the following two hypotheses:

  1. H0: Seller i is a fake seller;
  2. H1: Seller i is not a fake seller.

        A decision is called a true positive when the algorithm chooses H1 and H1 is true. If the algorithm instead chooses H0 when H1 is true, it is a false negative. Likewise, choosing H1 when H0 is in fact true is a false positive, and choosing H0 when H0 is true is a true negative. A binary classifier is proposed for the hypothesis test. During training, sellers are labelled as fake or non-fake, with target outputs of 1 and 0, respectively. The classifier therefore outputs a value between 0 and 1, where 0 represents certainty that the user is a non-fake seller and 1 represents certainty that the user is a fake seller. If the output is greater than a threshold, H0 is accepted; otherwise, it is rejected.
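The decision rule can be written down directly. The sketch below follows the convention of this summary (H0: fake seller, H1: not a fake seller). The default threshold of 0.5 and the function name decide are assumptions for illustration only.

```python
def decide(score, threshold=0.5):
    """Accept H0 (fake seller) if the classifier output exceeds the threshold.

    The default threshold of 0.5 is a hypothetical value.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("classifier output must lie between 0 and 1")
    return "H0" if score > threshold else "H1"


print(decide(0.9))  # output close to 1: treated as a fake seller
print(decide(0.2))  # output close to 0: treated as a non-fake seller
```

Raising the threshold makes the rule accept H0 less often, so fewer sellers are flagged as fake.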

  

     3.4. Dataset

    The data is collected from Taobao, a popular Chinese e-commerce platform, and covers two categories: shoes and cosmetics. Information, including prices and product images, is collected using Octopus. For each seller, 80 products are selected from the product list, and all images of each product are collected. To avoid the distraction of thumbnails and ad images, the minimum size of captured images is set to 400 × 400 pixels. In total, 101,090 images were collected from 93 shoe sellers and 51,870 images from 100 cosmetics sellers. Among the shoe sellers there are 38 counterfeit and 55 non-fake sellers, while among the cosmetics sellers there are 23 counterfeit and 77 non-fake sellers. Sellers were labelled manually by surveying 40 experienced online shoppers, who tagged each seller independently, taking into account the seller's pages, images, and prices. Finally, a statistical value of the survey results is used as the label for each seller.

    

    4. Result

    Figures 3(a) and 3(b) show the results of counterfeit seller identification against the number of training iterations, for shoe and cosmetics sellers, respectively. The experiment is repeated 100 times, and the mean of the 100 results is taken. In each trial, 90% of users are randomly selected as the training set, and the rest form the testing set. Note that CD and CDFT use K = 1,000 for comparison with Object, which has 1,000 class labels. Performance improves with more iterations, and CDFT outperforms the other two approaches, with about 80% accuracy for both shoe and cosmetics sellers. CD is the second-best performer, as Object fails to represent users. More discussion can be found in the next section. It is therefore also interesting to investigate how K affects the performance of the detection.
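The evaluation protocol (repeat a random 90/10 split many times and average the results) can be sketched as follows. This is a rough illustration, not the authors' code: repeated_holdout, majority_baseline, and the fixed random seed are hypothetical, and the majority-class baseline only stands in for the real classifier. The toy labels reuse the shoe seller counts from the dataset section.

```python
import random
import statistics


def repeated_holdout(sellers, labels, train_and_score,
                     trials=100, train_frac=0.9, seed=0):
    """Average a score over repeated random train/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(sellers)))
    n_train = int(round(train_frac * len(indices)))
    scores = []
    for _ in range(trials):
        rng.shuffle(indices)
        train_idx, test_idx = indices[:n_train], indices[n_train:]
        scores.append(train_and_score(
            [sellers[i] for i in train_idx], [labels[i] for i in train_idx],
            [sellers[i] for i in test_idx], [labels[i] for i in test_idx],
        ))
    return statistics.mean(scores)


def majority_baseline(train_x, train_y, test_x, test_y):
    """Stand-in classifier: always predict the majority training label."""
    majority = 1 if sum(train_y) * 2 > len(train_y) else 0
    return sum(1 for y in test_y if y == majority) / len(test_y)


# Toy data mirroring the shoe sellers: 38 counterfeit (1), 55 non-fake (0).
sellers = list(range(93))
labels = [1] * 38 + [0] * 55
print(repeated_holdout(sellers, labels, majority_baseline))
```

Any classifier can be plugged in through train_and_score as long as it takes the training and testing data and returns a single score for the trial.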


Figure 3. The precision of detection for the number of iterations in training: (a) shoe sellers and (b) cosmetics sellers.

    Figures 4(a) and 4(b) show the results of counterfeit seller identification with different values of K for shoe and cosmetics sellers, respectively. The experiment is repeated 20 times, and the mean of the 20 results is taken. In each trial, 90% of users are randomly selected as the training set, and the rest form the testing set. The performance at the 10th iteration is used for the evaluation. The result is also compared with a siamese network (CDFT-Sim) [2, 13, 28], whose objective is for the CNN to output a vector that maximizes the distances between images of unrelated pairs and minimizes the distances between images of related pairs. CDFT outperforms the CD and CDFT-Sim approaches for all values of K, and its performance is similar across the different values of K. More discussion on finding an optimized K can be found in a later section.

Figure 4. The precision of detection for different values of K for (a) shoe sellers and (b) cosmetics sellers.

    The paper also reports the precision p of counterfeit seller identification for different thresholds a, for shoe and cosmetics sellers. The training is repeated 100 times, and the Λ values of all data in the testing sets are recorded. These values are then tested against different values of a. It is observed that p increases with a and reaches a maximum of 0.94 and 0.95, respectively, when a equals 0.99. This means that among sellers with Λ equal to 0.99, about 95% are counterfeit sellers. Note that a high p comes with a lower recall, so many counterfeit sellers are not detected.
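The trade-off between precision and recall at a threshold a can be illustrated with a small sketch. It assumes that Λ is the classifier output and that a seller is flagged when Λ is at least a. Both are my assumptions, and the function name and toy values below are hypothetical.

```python
def precision_recall_at(scores, is_fake, a):
    """Precision and recall when sellers with a score of at least a are flagged."""
    flagged = [fake for s, fake in zip(scores, is_fake) if s >= a]
    total_fake = sum(is_fake)
    true_pos = sum(flagged)
    precision = true_pos / len(flagged) if flagged else None
    recall = true_pos / total_fake if total_fake else None
    return precision, recall


# Hypothetical outputs for six sellers; 1 marks a counterfeit seller.
scores = [0.99, 0.95, 0.80, 0.60, 0.30, 0.10]
is_fake = [1, 1, 0, 1, 0, 0]

print(precision_recall_at(scores, is_fake, 0.5))  # more flagged, lower precision
print(precision_recall_at(scores, is_fake, 0.9))  # fewer flagged, lower recall
```

In this toy example, raising a from 0.5 to 0.9 lifts the precision from 0.75 to 1.0 but drops the recall from 1.0 to about 0.67, which is the same pattern described above.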

    

    5. Conclusion

    This article proposes a CNN-based connection discovery framework to detect online counterfeit sellers, who previously had to be detected manually. Experiments are conducted with real-life datasets of over 450K shared images from the social media and e-commerce platforms Taobao, Instagram, and Carousell. The framework can achieve an F1 score of over 90% for detecting counterfeit sellers. To the best of the authors' knowledge, this is the first work to detect online counterfeit sellers from their shared images.

    6. Discussion

The proposed approach shows promising results in detecting fake sellers, but it is limited by its training set. Although manual effort is required to verify whether new types of images appear among fake sellers, the proposed approach can detect most fake sellers and learn new types through continuous training. CDFT-Sim only slightly improves on the performance of CD, while CDFT improves the results significantly. A possible reason is that CDFT-Sim requires more data for training and may therefore give unstable results. In addition, CDFT-Sim is more computationally intensive. CDFT-Sim is therefore better suited to tasks for which CDFT cannot be trained, such as optimizing a CNN for follower/followee recommendation. Besides, the authors label the dataset manually, so the training data is limited and must be re-verified whenever new data comes in; using continuous learning could improve the performance of the algorithm. Another question, in my own opinion, is how the approach distinguishes copied items from the original items when both products look almost the same.

    

    7. Future Work

There are two ways to improve performance. The first is to combine Object with CDFT. While Object cannot help separate counterfeit from non-fake sellers, it can help group sellers who sell similar products, and CDFT can build on those results. The second is to investigate different parameters, such as K, the number of unique machine-generated labels. Although it is evident that the proposed algorithm works with different values of K, it is not clear how to optimize the selection, and an optimized K could provide better performance. Besides, the approach could be tested on images altered by other methods, such as copy-move, watermarking, rotating, and flipping. Another direction for future work might be to use other features to detect counterfeit sellers, for example a CNN-based object detection algorithm that incorporates semantic segmentation for images [2]. Using further features, such as image metadata, user reviews, and user profiles, may also give better results.


   8. References

  1. Ming Cheung, James She, Weiwei Sun, and Jiantao Zhou. 2019. Detecting Online Counterfeit-goods Seller using Connection Discovery. ACM Trans. Multimedia Comput. Commun. Appl. 15, 2, Article 35 (May 2019), 16 pages.
  2. Baohua Qiang, Ruidong Chen, Minghao Yang, Yijie Zhai, Yuanchao Pang, and Mingliang Zhou. 2020. Convolutional Neural Networks-Based Object Detection Algorithm by Jointing Semantic Segmentation for Images. MDPI Sensors. Article 20 (September 2020), 14



Tuesday, 21 September 2021

Summary Research Paper

 "Evaluating Saliency Map Explanations for Convolutional Neural Networks: A User Study"

 Ahmed Alqaraawi, Martin Schuessler, Philipp Weiß, Enrico Costanza, Nadia Berthouze

Accepted at IUI 2020

In recent decades, the popularity of machine learning has increased because of its ability to outperform humans at specific tasks. Machine learning has seen use cases ranging from predicting customer behaviour to forming the operating system for self-driving cars. It is therefore imperative that such systems are accountable, so that users, including those who are not machine learning experts, can know when to trust the predictions the system gives.

The problem is that it has become quite common these days to hear people refer to modern machine learning systems as "black boxes". A black box typically refers to a system for which we can observe only the inputs and outputs but not the internal workings. Machine learning may work very well, but understanding how the system arrives at its results is very difficult, even for experts, especially in the case of deep neural networks. Experts know how neural networks learn, but they do not know what a specific neural network has learned.

In this study, the researchers chose CNNs as the research topic because they are a type of deep neural network that is very popular for classifying images. The most popular approach to making a CNN explainable is to use a saliency map, which highlights the pixels that are important to the image classification algorithm. Saliency maps are claimed to facilitate interpretation by both beginners and experts. Several studies have addressed generating saliency maps from CNNs, but there is still little research evaluating them with actual users. To address this research gap, the authors report on an online user study designed to evaluate the performance of saliency maps generated by a state-of-the-art algorithm: layer-wise relevance propagation (LRP) [1].


2.     Related Work

    In this section, I will discuss several studies that are relevant to the authors' research. First, Bach et al. [1] proposed LRP, a general solution for understanding classification decisions through pixel-wise decomposition of non-linear classifiers. The authors chose LRP as the algorithm that generates the saliency maps in their research. Second, a study by Yin et al. [2] investigates whether an ML model's accuracy affects laypeople's willingness to trust the model, through a sequence of large-scale, randomized, pre-registered human subject experiments. Their work looks at how users' understanding of and trust in a model relate to its performance on a hold-out set and to its post-deployment performance. Third, Cai et al. [3] conducted a user study that evaluated two kinds of example-based explanations for a sketch-recognition algorithm: normative and comparative explanations. However, they did not evaluate saliency maps. Ribeiro et al. [4] proposed LIME, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner by learning an interpretable model locally around the prediction. That research also evaluated saliency maps for text-based classifiers, but it lacks a statistical significance test. In addition, it is unclear whether the results would apply to more complex scenarios like multi-class classification with a CNN. To address this research gap, Alqaraawi et al. (the authors of this paper) conducted an online user study designed to evaluate the performance of saliency maps generated by a state-of-the-art algorithm: layer-wise relevance propagation (LRP).


3. Method

         To evaluate whether saliency maps help users understand how a CNN arrives at its results, the authors designed a between-group online study on a CNN model for multi-label image classification. A multi-label setting was chosen because the saliency map can highlight both the parts of the image that correspond to one label and the features that fit alternative labels. The authors used an existing Keras model with the VGG16 architecture, trained on the ImageNet dataset, and then fine-tuned it using the PASCAL VOC 2012 dataset. On the hold-out test set, PASCAL VOC 2007, the model reaches an average precision of 0.74; this is intentional because they want to see whether user understanding can improve regardless of the weaknesses and strengths of the model. To generate the saliency maps, the authors used Bach's LRP algorithm. In a pilot study, they compared two algorithms (LRP and LIME) to see which one the pilot participants found more informative, and based on that they chose LRP. In preprocessing, they determine a threshold value that translates a classification score (between 0 and 1) into an output: detected if the score is above the threshold and missed if the score is below it. They calculated the threshold for each class (e.g. horse, cat) by maximizing the F1 score for that class on the training set.
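The per-class threshold selection can be sketched as a simple search. This is a minimal illustration under my own assumptions: the candidate thresholds are 0.0 and the observed scores themselves, and the names f1_score and best_threshold and the toy scores are hypothetical. The paper may search the threshold differently.

```python
def f1_score(scores, labels, threshold):
    """F1 when an object counts as detected if its score is above threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s > threshold and y == 1)
    fp = sum(1 for s, y in pairs if s > threshold and y == 0)
    fn = sum(1 for s, y in pairs if s <= threshold and y == 1)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(scores, labels):
    """Pick the candidate threshold that maximizes F1 on the given data."""
    candidates = [0.0] + sorted(set(scores))
    return max(candidates, key=lambda t: f1_score(scores, labels, t))


# Hypothetical training scores for the class "horse"; 1 means a horse is present.
horse_scores = [0.9, 0.8, 0.4, 0.35, 0.1]
horse_labels = [1, 1, 1, 0, 0]

t = best_threshold(horse_scores, horse_labels)
print(t, f1_score(horse_scores, horse_labels, t))
```

The same search would be run once per class, so "cat" and "horse" can end up with different thresholds.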

Figure 1: The interface: Examples are presented in the blue box at the top. The task is shown in the green box at the bottom. All participants worked on the same tasks and were shown the same examples. Conditions differed only in terms of the additional information that was presented alongside each example. Here, saliency maps and scores are shown.

In general, the tasks in the user study are as follows. First, participants list 2-3 features they believe the system is sensitive to and 2-3 features the system ignores (see Figure 1). Second, participants predict whether the system will recognize an object of interest ('cat' or 'horse') in the given task image. To reduce fatigue and keep the experiment under 40 minutes, the authors use only 14 task images and limit the number of classes in the study to two: cats and horses. The study evaluates two independent variables: the presence of saliency maps and the presence of classification scores. Each variable has two levels, shown or omitted. They recruited 64 participants (16 per condition) through Prolific, an online crowdsourcing platform. For data quality, they required participants to have normal or corrected-to-normal vision and to be fluent in English. Participants also had to be above 18 years of age and to have a technical background (i.e. a degree in computing or engineering).
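The 2×2 between-group design can be sketched as follows: two factors with two levels each give four conditions, and 64 participants split evenly give 16 per condition. The summary does not say how participants were allocated, so the shuffled balanced assignment below, the fixed seed, and the name assign_conditions are assumptions for illustration.

```python
import itertools
import random
from collections import Counter

# The two independent variables, each either shown (True) or omitted (False).
FACTORS = {
    "saliency_map": (True, False),
    "classification_score": (True, False),
}


def assign_conditions(participant_ids, per_condition=16, seed=0):
    """Assign each participant to exactly one condition, balanced across conditions."""
    conditions = list(itertools.product(*FACTORS.values()))
    if len(participant_ids) != per_condition * len(conditions):
        raise ValueError("participant count does not match the design")
    slots = [c for c in conditions for _ in range(per_condition)]
    random.Random(seed).shuffle(slots)
    return dict(zip(participant_ids, slots))


assignment = assign_conditions(list(range(64)))
print(Counter(assignment.values()))  # four conditions with 16 participants each
```

Because the design is between-group, each participant sees only one combination of saliency map and score.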


4. Result of the Research Paper

They evaluated the effects of saliency maps and classification scores based on the percentage of correct predictions per participant. When the saliency map was shown, participants predicted the CNN's output correctly 60.7% of the time (still relatively low), compared with 55.1% when the map was not displayed. The difference was tested using a two-way independent ANOVA. In contrast, the presence of a classification score had no significant effect, and there was no interaction between the presence of a saliency map and the classification score (see Figure 2). Participants were also asked to rate their confidence in their predictions on a 1-4 Likert scale for each task. Using the Kruskal-Wallis test, it was found that participants tended to feel "slightly confident" in their predictions, with a median value of 3.

Figure  2: Left:  When saliency maps were shown, participants were significantly more accurate in predicting the outcome of the classifier. Right:  Scores did not significantly influence the participant’s prediction performance. Success rates were relatively low across conditions, showing that tasks were very challenging.

To better understand the prediction accuracy, the authors also assessed the features that participants considered the classifier to be sensitive to. Since the participants' answers were free text, the authors coded them inductively according to the features or concepts mentioned. They divided the features into two groups: saliency features and general attributes. The percentage of saliency features mentioned is much higher when the saliency map is displayed than when it is not (83.5% vs 54.6%). Meanwhile, the presence of a classification score has no significant effect.


5. Discussion And Future Work

The saliency map significantly improved participants' prediction accuracy, but the accuracy was still relatively low (only 60.7%). The authors investigated participants' performance further and found that participants were better at predicting correct system outputs (true positives) and had difficulty predicting errors (false positives: 46.9% and false negatives: 36.7%). This may be because participants overestimated the performance of the system: most of them predicted that the system was correct in cases where it actually failed to classify the image. It is essential that users are aware of and understand when the system fails. Since instance-level explanations are claimed to help in detecting errors, this issue needs to be evaluated empirically in the future.

In addition, a CNN looks for patterns in a sub-symbolic fashion that lead to its results, instead of processing data "semantically" as humans do. Further research is needed to develop algorithmic explanations that bridge this gap between humans and the system, for example by steering users away from reasoning only in terms of high-level semantic concepts.

Paying attention to saliency features alone does not give a good understanding of the CNN model because general attributes can also affect the classification results. The saliency map makes participants consider only the highlighted features and miss other general attributes. The authors suggest that the saliency map should be complemented by a more global representation of the image, such as a measure of contrast or overall brightness.


6. Limitation  

There are several limitations in this paper:

1.  The study uses only a small number of classes, to save time and minimize participant fatigue. Future work should carry out long-term evaluations (i.e. lasting several days or weeks) to allow participants to explore large data sets with multiple classes in more depth.

2.   The study uses one specific network architecture (VGG16) and one specific technique to generate saliency maps (LRP). In the pilot study, the authors tried to identify which of the two techniques participants found more informative. It is therefore possible that the results would differ with a different combination of techniques.

3.    The study design does not allow the authors to draw conclusions about user performance for each type of outcome (TP, FN, FP, and TN). The reason is that the task set was not fully balanced across outcomes, and true negatives (TN) were not part of it; future research should address this limitation and study this aspect in more detail.

4.   Participants were required to have a technical background, but their machine learning expertise was not controlled. There is potential to repeat the study with different participant populations, such as machine learning experts or lay users.

5.   The selected methodology doesn't directly answer the research question; instead, it checks how well users can guess the classifier's answer. Why do the authors check how many saliency features participants pick with and without a saliency map?

6.  Regarding the frequency of individual features mentioned by participants, is there a reason to divide the features into two groups and treat them so differently? What is the idea behind that?

7.   In my opinion, it may be better to use a highlighting system directly on the task image (e.g. marking features that are considered sensitive to the classification process with circles or rectangles) than to use text-based feedback.


7. Reference

[1]      S. Bach, A. Binder, G. Montavon, F. Klauschen, K. R. Müller, and W. Samek, "On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation," PLoS One, vol. 10, no. 7, pp. 1–46, 2015, DOI: 10.1371/journal.pone.0130140.

[2]      M. Yin, J. W. Vaughan, and H. Wallach, "Understanding the effect of accuracy on trust in machine learning models," Conf. Hum. Factors Comput. Syst. - Proc., pp. 1–12, 2019, DOI: 10.1145/3290605.3300509.

[3]      C. J. Cai, J. Jongejan, and J. Holbrook, "The effects of example-based explanations in a machine learning interface," Int. Conf. Intell. User Interfaces, Proc. IUI, vol. Part F147615, pp. 258–262, 2019, DOI: 10.1145/3301275.3302289.

[4]      M. T. Ribeiro, S. Singh, and C. Guestrin, " 'Why should i trust you?' Explaining the predictions of any classifier," Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., vol. 13-17-August-2016, pp. 1135–1144, 2016, DOI: 10.1145/2939672.2939778.