Web-Hacking Dataset for the Cyber Criminal Profiling


Abstract

As in real-world criminal investigations, cyber criminal profiling is important for attributing cyber attacks. Cyber crimes committed by the same hacker or hacking group share distinctive characteristics such as the attack purpose, attack methods, and target profile. Therefore, a complete analysis of a hacker’s activities can give investigators hard evidence to attribute attacks and unveil criminals. To foster further research, we release the web-hacking case dataset we have collected.


1. Dataset

We automatically built a large hacking-case database containing 212,093 web-hacking cases from the past 15 years, collected from the Zone-H.org site. At Zone-H.org, some information is stored in defined formats in a case-centric database. This information mostly includes the date, domain, IP address, system, and web server for each attack. Other information, found in the mirror pages, is stored as HTML source. Because the HTML code carries each case's encoding, font, and other tags and features, this information is used in the case vector design after the HTML content is parsed and processed.
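As an illustration of the parsing step described above, the following is a minimal sketch of how encoding, font, and tag features might be pulled from a mirror page's HTML source. It is not the extraction code used in our paper; the class name `MirrorPageFeatures`, the function `extract_features`, the chosen features, and the sample page are all assumptions made for this example. It uses only the Python standard library.

```python
from collections import Counter
from html.parser import HTMLParser


class MirrorPageFeatures(HTMLParser):
    """Collects a few illustrative features from a mirror page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.charset = None          # declared page encoding, if any
        self.fonts = set()           # font faces named in <font face="...">
        self.tag_counts = Counter()  # how often each start tag appears

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        attrs = dict(attrs)
        if tag == "meta":
            content = (attrs.get("content") or "").lower()
            if attrs.get("charset"):
                # HTML5 style: <meta charset="utf-8">
                self.charset = attrs["charset"].lower()
            elif "charset=" in content:
                # Older style: <meta http-equiv="Content-Type" content="...; charset=...">
                self.charset = content.split("charset=")[-1].strip(" ;\"'")
        elif tag == "font" and attrs.get("face"):
            self.fonts.add(attrs["face"].lower())


def extract_features(html_source):
    """Return a small feature dictionary for one mirror page."""
    parser = MirrorPageFeatures()
    parser.feed(html_source)
    parser.close()
    return {
        "charset": parser.charset,
        "fonts": sorted(parser.fonts),
        "tags": dict(parser.tag_counts),
    }


# Made-up sample page, for illustration only.
sample = (
    '<html><head><meta charset="UTF-8"></head>'
    '<body><font face="Impact">Hacked</font><br><br></body></html>'
)
features = extract_features(sample)
print(features["charset"], features["fonts"], features["tags"]["br"])
```

Features of this kind could then be combined with the structured fields (date, domain, IP address, system, web server) to form one vector per case.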

With this dataset, researchers can perform clustering and other in-depth analyses to discover relationships between hackers or hacker groups. In our work, we attempted to analyze the relationship between the DarkSeoul group's attacks and another set of attacks that includes the Sony Pictures Entertainment attack case.
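To give a rough idea of what such a relationship analysis could look like, the sketch below groups cases whose feature sets overlap, using Jaccard similarity and a greedy threshold rule. This is a simplified stand-in, not the method from our paper; the function names, the 0.5 threshold, and the three sample cases are assumptions chosen only for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two feature collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_cases(cases, threshold=0.5):
    """Greedy grouping: a case joins the first group containing a similar member."""
    groups = []
    for case_id, features in cases.items():
        for group in groups:
            if any(jaccard(features, cases[other]) >= threshold for other in group):
                group.append(case_id)
                break
        else:
            # No existing group is similar enough, so start a new one.
            groups.append([case_id])
    return groups


# Made-up cases; each feature is a "field:value" string.
cases = {
    "case-1": {"charset:utf-8", "font:impact", "os:linux"},
    "case-2": {"charset:utf-8", "font:impact", "os:win2003"},
    "case-3": {"charset:windows-1256", "font:tahoma", "os:linux"},
}
print(group_cases(cases))
```

Because the grouping is greedy, the result can depend on the order in which cases are visited; a standard clustering algorithm would be a better fit for a full analysis.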


1-1. Dataset Release

For academic purposes, we are happy to release our dataset. If you use the dataset in your experiments, please cite our paper.

          • Dataset Download Link: Download

          • Note: the current dataset contains only abstracted metadata. If you need the full set of fields used in our paper, please crawl http://zone-h.org/archive on your own.


2. Publication

  • 1. Full paper

Han, M. L., Kwak, B. I., & Kim, H. K. (2019). CBR-Based Decision Support Methodology for Cybercrime Investigation: Focused on the Data-Driven Website Defacement Analysis. Security and Communication Networks, 2019.

  • 2. Preliminary version (2-page poster)

Han, M. L., Han, H. C., Kang, A. R., Kwak, B. I., Mohaisen, A., & Kim, H. K. (2016). IEEE Conference on Communications and Network Security, Philadelphia, PA, USA.


3. Contact

Mee Lan Han (blosst at korea.ac.kr) or Huy Kang Kim (cenda at korea.ac.kr)