An Analytical Review to Explore the Latest Research Trends in Spamming and Anti-Spamming Domains

Rupal Jain

CS PhD student at IGDTUW | UNESCO Research Fellow | Don Lavoie Fellow at GMU | Elinor Ostrom Doctoral Fellow | ex-Google | ex-Adobe | ex-Mozilla

Published Jul 07, 2019

TL;DR

Designed and led experiments analyzing the latest research trends in the spamming and anti-spamming domains. Employed IR-based techniques and opinion mining to study behavior and advancement by scraping 270+ research articles. Proposed a novel metric, InOutRank, to determine a credibility score based on the incoming and outgoing flow from a node (user).

Abstract

Across the world, the internet has in practice become a primary source of information. Most web users turn to web content when searching for news, events, updates, and almost anything else. This massive use of the internet also gives rise to various kinds of malpractice that are meant to artificially mold our views and that, in turn, worsen our web experience by degrading the quality of the results retrieved. Over the last decade, research on information retrieval, review spam detection, and the identification of malicious links, websites, and user accounts in online social media has gained momentum and has drawn attention from both industry and academia. This paper primarily presents a systematic perspective on different kinds of spamming and anti-spamming techniques and algorithms, together with a graphical and statistical survey of the advancement of research work in these domains. The study is based on data collected from the past ten years. The paper portrays the gravity of global internet issues and attempts to motivate readers to advance research in this currently trending domain. It also presents observations, inferences from the graphical analysis, and the underlying principles used to detect the spamming techniques employed by malicious sources over the internet. Finally, it proposes a new metric called InOutRank, which helps determine a credibility score for the object concerned based on its incoming and outgoing flow.

Introduction

For today’s youth, online social platforms such as Facebook, Twitter, and Instagram have become the de facto place to create and exchange information. However, given the enormous volume of data transferred each day, it is difficult to manage and moderate all of it. As a side effect, such platforms have empowered hostile actors, giving them an upper hand in adopting unfair means to generate and promote their own, mostly malicious, content. They exploit loopholes in search engines for purposes such as monetary gain, the spread of fake or unverified information, pornographic or otherwise adult content, and scams. They compromise the flow of web traffic for their own monetary gain, and in doing so degrade the user experience, reputation, and credibility of the system. They spread such content in various ways: malware; review spam, which plants untrue feedback that affects the purchase statistics of a product; duplicate bulk messages; phishing; e-mail spam; clickbait; and excessive, undeserved link creation intended to fool search engines. This leads to the flow of wrong information through channels that have a very large user base, which accelerates its spread. Web spam can be defined as content that is deliberately generated with the clear intention of triggering undeserved relevance or importance for specific pages. Spamdexing is one example. The chief motivation for improving search engines is that, beyond financial objectives, the quality and credibility of content must be maintained.

Spammers try to influence search results in various ways, such as including artificial text in a page or creating a fake network of spammers who promote each other’s pages to fool search engine algorithms. Spammers use search engines with the aim of influencing the customers who actually use a product. Although their ultimate objective is mostly business related, it can also be religious or political. People are usually unaware of this kind of exploitation and tend to blindly believe the results shown by the search engine. Research has shown that people tend to pay attention only to the top-ranked results, i.e., 85% of the search results. Apart from concerns about content quality, web spam is associated with major issues such as worldwide financial losses and even loss of life or property. Spam has also affected commercial markets, for example the review and feedback systems of online shopping websites. Reviews influence a customer’s decision to purchase a product or service, and review spam poisons a valuable source of information that a company could use to improve its products. With the ever-growing size and value of the internet, the amount and influence of online reviews and feedback are steadily increasing, both among technical users and in e-commerce. Email spam such as phishing has also become rampant, causing millions of dollars of damage every year. Phishing is based on the principles of social engineering: it tricks users into clicking on a malicious link and giving their login credentials to the attacker. It is a deceitful technique used to obtain personal information about a user. It can be disastrous at times and lead to identity theft, and in some instances it has also led to huge monetary and property losses. Traditional phishing attacks target email systems. It is estimated that phishing attacks caused a worldwide loss of $520 million in 2011 alone. According to reports, phishing attacks targeted 43% of all online social media (OSM) users in 2010.

Some of the main reasons why detecting phishing on online social media has become difficult are:

  • Short content: networking media like Twitter allow limited space (140 characters in Twitter), and posts mainly contain shorthand and slang, which are difficult to classify.
  • Large amount of data: social networking sites let everyone share opinions and interests, creating a huge amount of data each hour that is difficult to regulate.
  • Shortened URLs: studies have shown that most successful phishing attacks have used fake URLs that are shortened to hide the target URL and the malicious content associated with it. Shortening helps such links escape blacklists.
  • Rapid flow: information on social media flows very fast, making it even harder to manage.
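
The shortened-URL point can be illustrated with a minimal sketch. It is not taken from the paper: the host names, the `BLACKLISTED_HOSTS` set, and the `REDIRECTS` table are all hypothetical, and the in-memory table stands in for the HTTP redirects a real system would have to follow. The sketch only shows why a blacklist check on the URL as posted can miss a link that a check on the expanded URL would catch.

```python
from urllib.parse import urlparse

# Hypothetical blacklist of known-malicious hosts (illustrative values only).
BLACKLISTED_HOSTS = {"malicious.example"}

# Hypothetical stand-in for a URL shortener's redirect table. A real system
# would follow HTTP redirects instead of consulting an in-memory dict.
REDIRECTS = {
    "https://short.example/abc123": "https://malicious.example/login",
    "https://short.example/xyz789": "https://news.example/story",
}


def expand(url, max_hops=5):
    """Follow known redirects for at most max_hops and return the final URL."""
    for _ in range(max_hops):
        target = REDIRECTS.get(url)
        if target is None:
            break
        url = target
    return url


def is_blacklisted(url):
    """Check only the host of the URL as given, without expanding it."""
    return urlparse(url).hostname in BLACKLISTED_HOSTS


def is_blacklisted_after_expansion(url):
    """Expand the URL first, then check the host of the final target."""
    return is_blacklisted(expand(url))


posted = "https://short.example/abc123"
print(is_blacklisted(posted))                  # check on the shortened form
print(is_blacklisted_after_expansion(posted))  # check on the expanded target
```

In this toy setup the shortened link passes the naive check because only the shortener's host is visible, while the expanded form is caught. The `max_hops` limit is a design choice to avoid looping on chained or circular redirects.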

Needless to say, attackers have succeeded in their deceptive attempts not just because of their own cleverness, but also because of carelessness on the user’s end. People readily trust the written word in articles, newspapers, online reading websites, and books, and are easily influenced by whatever today’s journalists portray on television. They are unable, unprepared, and sometimes not even willing to think rationally and critically about what they hear and see. A recent study shows that most young people consider the internet their primary source of information, and most of them do not check even one other source for validation.

Social engineering is another well-known term in the realm of online social interaction, and it poses a major threat to organizational information security. It can be defined as the use of psychological tricks and impersonation or pretexting to trick people into helping attackers, either by granting them illegitimate access or by handing over confidential information. A close look at social engineering attacks shows that it is easier for an attacker to trick someone into giving up login credentials than to spend time and effort hacking in. Various studies have shown that online users place a great deal of trust in the friend requests and messages they receive online, even unsolicited ones. This is where reverse social engineering attacks come into the picture. In such attacks, it is not the attacker who initiates contact with the user; rather, the user is tricked into initiating it. As a result, a good level of trust is built between the two parties, which makes the user vulnerable to such psychological games.

Another, lesser-known category of spam is video spam: spam content circulated in the form of videos. Video sharing facilitates a whole new kind of interaction among users, including discussion forums, debates, chats, video blogging, and educational uses. Some web services also provide video-based features to gather users’ feedback and suggestions through video responses. This in turn makes video sharing systems susceptible to opportunistic and exploitative behavior such as self-promotion, falsely defaming the system’s reputation, propagating pornographic content, and video spamming. These days children, and not only adults, spend time on websites like YouTube, which creates a dire need to regulate the content flowing through these channels. Compared with text spam detection, video spam detection is a bigger challenge. Users cannot easily identify video spam without watching at least a portion of it, which leads to unnecessary consumption of resources, tries the user’s patience, and hurts the reputation of the system.

Conclusion

This paper proposes a new metric called InOutRank, which helps determine the credibility score of an object. It also presents a series of graphical and statistical analyses of existing research in the domain of anti-spamming and spam detection techniques. It identifies the best of the techniques discussed based on certain factors and suggests a few improvements to cover the loopholes of that approach. However, it could present only general statistics without much specificity, because the analysis is based solely on data taken from Google Scholar.
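
To make the idea of a score built from incoming and outgoing flow more concrete, here is a small illustrative sketch. It is not the paper's definition of InOutRank, which is not reproduced on this page. The function name `flow_scores`, the edge format, and the ratio incoming / (incoming + outgoing) are all assumptions chosen only to show how per-node flow can be tallied on a directed graph.

```python
from collections import defaultdict


def flow_scores(edges):
    """Return {node: incoming / (incoming + outgoing)} for a weighted digraph.

    edges is an iterable of (source, target, weight) tuples.
    NOTE: this ratio is an illustrative assumption, not the InOutRank formula.
    """
    incoming = defaultdict(float)
    outgoing = defaultdict(float)
    for source, target, weight in edges:
        outgoing[source] += weight
        incoming[target] += weight

    scores = {}
    for node in set(incoming) | set(outgoing):
        total = incoming[node] + outgoing[node]
        # Guard against nodes whose edges all carry zero weight.
        scores[node] = incoming[node] / total if total else 0.0
    return scores


# Hypothetical interaction graph: one account pushes content to everyone and
# receives nothing back, while the other accounts interact with each other.
example_edges = [
    ("spammer", "alice", 1),
    ("spammer", "bob", 1),
    ("spammer", "carol", 1),
    ("alice", "bob", 1),
    ("bob", "alice", 1),
]

for node, score in sorted(flow_scores(example_edges).items()):
    print(node, round(score, 3))
```

Under this assumed ratio, an account that only sends and never receives gets the lowest possible score. A practical metric would likely need more than raw counts, for example some weighting by the credibility of the nodes on the other end of each edge.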

From the survey presented in this paper, it can be concluded that web spam has become, and continues to be, one of the most severe issues concerning the internet. Although researchers across the globe have proposed various solutions, the problem does not appear to be converging toward a resolution. Instead, attackers succeed most of the time in fooling users in one way or another. This problem will apparently continue to exist at one scale or another with the ever-increasing use of the internet, and this should be the prime motivation for us to contribute to this domain.

Full Paper can be accessed for free here