Large Language Models (LLMs) have exploded in prominence in recent months, particularly after the introduction of AI chatbots like ChatGPT. These AI-powered models learn patterns from existing data and can generate new and unique material, such as text, images, audio, and more. While similar techniques have previously been used for generative AI, researchers have now created a first-of-its-kind LLM for assessing and combating cybersecurity threats. Surprisingly, this model has been trained exclusively on data from the dark web.
What exactly is DarkBERT?
DarkBERT is a transformer-based encoder model that follows the RoBERTa architecture. Instead of being trained on the surface web, researchers trained this LLM on a massive dataset of dark web pages, incorporating data from hacker forums, scamming websites, and other illicit corners of the internet. According to its creators, DarkBERT can revolutionize the fight against cybercrime by finding and analyzing the elusive domains of the internet that remain hidden from search engines. Their paper, titled ‘DarkBERT: A language model for the dark side of the Internet’, is published on arxiv.org but has yet to be peer-reviewed.
While the dark web is normally hidden and inaccessible to the general public, the researchers were able to collect data from its pages by using the Tor network. The data was then subjected to multiple processing steps, including deduplication, category balancing, and pre-processing, to produce a refined dark web corpus. This corpus was then used to train the RoBERTa-based model, producing DarkBERT over a 15-day period.
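The paper does not publish its pipeline code, but the deduplication step described above can be sketched roughly as follows. The normalization rules and hashing approach here are illustrative assumptions, not the authors' actual implementation:

```python
import hashlib

def normalize(text: str) -> str:
    # Illustrative pre-processing: lowercase and collapse whitespace so that
    # trivially different copies of a page hash to the same value.
    return " ".join(text.lower().split())

def deduplicate(pages: list[str]) -> list[str]:
    # Drop exact duplicates by hashing each normalized page.
    seen, unique = set(), []
    for page in pages:
        digest = hashlib.sha256(normalize(page).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique

corpus = ["Buy credentials here", "buy   credentials HERE", "Forum rules"]
print(deduplicate(corpus))  # ['Buy credentials here', 'Forum rules']
```

A real pipeline would also need near-duplicate detection (e.g. shingling or MinHash) and per-category sampling for the balancing step, but the basic idea is the same: reduce the raw crawl to a clean, non-redundant training corpus.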
Applications for cybersecurity
Because it was trained on a collection of dark web pages, DarkBERT has the potential for a wide range of cybersecurity applications. It can aid in monitoring illegal activity and strengthening cybersecurity measures. It can also “combat the extreme lexical and structural diversity of the Dark Web that may be detrimental to building a proper representation of the domain,” according to the research paper.
It can automate the monitoring of dark web forums where illegal information is commonly traded, and it is capable of detecting websites involved in leaking sensitive or confidential data or selling ransomware. Lastly, it uses the BERT-family fill-mask function to detect and filter out phrases linked to criminal activities, which can help identify and tackle emerging cyber threats.
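To make the fill-mask idea concrete, here is a minimal sketch of how masked-token predictions could drive a threat-phrase filter. The model is replaced by a hypothetical stub (`mask_candidates`) returning fabricated values; a real system would query DarkBERT's fill-mask head at that point:

```python
# Hypothetical stand-in for a BERT-family fill-mask model: given a sentence
# containing a <mask> token, return ranked candidate fillers for that slot.
def mask_candidates(sentence: str) -> list[str]:
    # A real system would call the language model here; these values are
    # fabricated purely for illustration.
    return ["ransomware", "credentials", "software"]

# Illustrative lexicon of terms associated with criminal activity.
THREAT_LEXICON = {"ransomware", "exploit", "credentials"}

def flag_threat(sentence: str, top_k: int = 3) -> bool:
    # Flag the sentence if any top-k fill-mask candidate is a known threat term.
    return any(tok in THREAT_LEXICON for tok in mask_candidates(sentence)[:top_k])

print(flag_threat("Selling fresh <mask> dumps on the forum"))  # True with this stub
```

The intuition is that a model trained on dark web text will propose domain-specific fillers (e.g. names of malware or stolen-data products) for masked positions, and those predictions can be matched against a watchlist to surface suspicious pages.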