WitFoo has released its Precinct 6 Cybersecurity Dataset as an open-source resource, expanding the availability of real-world cyber attack data. The dataset includes 100 million structured and labeled security event records, all derived from live attack traffic. Researchers and cybersecurity professionals now have access to one of the most extensive and realistic datasets available today.
WitFoo developed the dataset in collaboration with the University of Canterbury, aiming to strengthen research across cybersecurity, artificial intelligence, and data science. Compared to an earlier WitFoo dataset of 2 million records, the new release represents a 50-fold increase in scale. More importantly, it captures activity from real production environments rather than controlled lab simulations, which improves its practical relevance.
The dataset originates from attack traffic observed over a two-month period in 2024. To ensure privacy, WitFoo carefully sanitized the data, removing sensitive information while preserving critical elements such as timing, structure, and behavioral patterns of cyberattacks. Consequently, users can study realistic attack scenarios without compromising organizational security.
Structured into Four Key Subsets
To enhance usability, WitFoo divided the dataset into four distinct subsets. First, the Signals subset contains 100 million normalized security events collected from sources like syslog, Windows Security Auditing, VPC flow logs, and endpoint telemetry. Each record includes detailed attributes such as timestamps, usernames, hostnames, network metadata, severity levels, and sanitized message content.
The Graph Nodes and Graph Edges subsets map relationships between users, hosts, processes, and network connections. The Incidents subset contains correlated security incidents enriched with binary classification labels, confidence scores, MITRE ATT&CK mappings, and lifecycle metadata. Together, the four subsets make the dataset well suited to machine learning, threat detection, and advanced cybersecurity research.
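As a rough illustration of how the graph and incident subsets might be combined, the Python sketch below builds an adjacency map from edge records and finds the entities one hop away from an incident. Every field name, identifier format, and sample value here is a hypothetical stand-in chosen for illustration; the published schema may differ, so check the dataset documentation before relying on any of them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphEdge:
    # Hypothetical fields; the real Graph Edges schema may use other names.
    source: str
    target: str
    relation: str


@dataclass(frozen=True)
class Incident:
    # Hypothetical fields mirroring the attributes listed in the article.
    incident_id: str
    node_ids: tuple
    is_malicious: bool        # binary classification label
    confidence: float         # confidence score, assumed to lie in [0, 1]
    attack_techniques: tuple  # MITRE ATT&CK technique IDs


def build_adjacency(edges):
    """Return an undirected adjacency map built from edge records."""
    adjacency = defaultdict(set)
    for edge in edges:
        adjacency[edge.source].add(edge.target)
        adjacency[edge.target].add(edge.source)
    return adjacency


def neighbors_of_incident(incident, adjacency):
    """Collect entities one hop away from the nodes named in an incident."""
    seen = set(incident.node_ids)
    related = set()
    for node in incident.node_ids:
        related |= adjacency.get(node, set()) - seen
    return related


# Made-up sample records, not taken from the dataset.
edges = [
    GraphEdge("user:alice", "host:web-01", "logged_into"),
    GraphEdge("host:web-01", "proc:powershell", "spawned"),
    GraphEdge("host:web-01", "ip:203.0.113.7", "connected_to"),
]
incident = Incident("inc-1", ("host:web-01",), True, 0.92, ("T1059",))

adjacency = build_adjacency(edges)
related = neighbors_of_incident(incident, adjacency)
```

A neighborhood lookup like this is one possible starting point for graph-based threat analysis, for example as input features to a classifier trained on the incident labels.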
WitFoo has also published the sanitization codebase as open source, so researchers can examine exactly how sensitive data was handled and removed.
Driving Real-World Cybersecurity Research
Traditionally, many open cybersecurity datasets relied on simulated environments, which often limited their effectiveness in real-world applications. However, Precinct 6 stands apart because it reflects genuine adversary behavior captured during live attacks. As a result, it provides a more accurate foundation for developing and testing detection models.
Researchers can leverage this dataset for intrusion detection, anomaly detection, graph-based threat analysis, automated incident response, benchmarking, and educational purposes. Additionally, WitFoo has made the dataset available under an Apache 2.0 license, enabling free usage across academic, commercial, and government sectors.
“For a decade, WitFoo ran over 4,000 experiments with Fortune 500 companies, universities, and government agencies to develop Empathetic Processing. This dataset is the product of that research, and we believe it belongs in the hands of the academic community,” said Charles Herring, Chairman and Co-Founder, WitFoo.
He further emphasized the uniqueness of the dataset:
“Most publicly available cybersecurity datasets were generated in lab environments with scripted attacks and synthetic traffic. That’s useful for basic benchmarking, but it doesn’t teach you what real adversaries actually look like in a production network. This data comes from live attack traffic observed in 2024. The attackers didn’t know they were being recorded, and they weren’t following a script. We’ve sanitised the data to protect the organisations involved, and we’ve published the sanitisation code itself as open source so researchers can verify exactly how we did it. Cybersecurity’s biggest bottleneck isn’t compute or clever algorithms. It’s the lack of realistic data that researchers can actually train against,” said Charles Herring.
Adding further perspective, Dr Etienne Borde highlighted the dataset’s academic value:
“One of the persistent challenges in cybersecurity research is that most available datasets are either synthetic or derived from controlled laboratory exercises, which limits how well models trained on them generalise to real-world conditions,” said Dr Etienne Borde, Associate Professor in Computer Science and Software Engineering, University of Canterbury.
He also noted the dataset’s impact on future research:
“A dataset of this scale built from live attack traffic is genuinely rare. It opens up research pathways that simply weren’t feasible before, from graph-based threat modelling to evaluating AI-driven detection systems against authentic adversary behaviour. We look forward to incorporating this resource into our research and teaching programmes at Canterbury,” said Borde.




