WitFoo Unveils Massive 100-Million-Record Cyber Attack Dataset

WitFoo has released its Precinct 6 Cybersecurity Dataset as an open-source resource, expanding the availability of real-world cyber attack data. The dataset includes 100 million structured and labeled security event records, all derived from live attack traffic. Researchers and cybersecurity professionals now have access to one of the most extensive and realistic datasets available today.

WitFoo developed the dataset in collaboration with the University of Canterbury to strengthen research across cybersecurity, artificial intelligence, and data science. Compared with its earlier dataset of 2 million records, the new release represents a 50-fold increase in scale. It also captures activity from real production environments rather than controlled lab simulations, which improves its practical relevance.

The dataset originates from attack traffic observed over a two-month period in 2024. To ensure privacy, WitFoo carefully sanitized the data, removing sensitive information while preserving critical elements such as timing, structure, and behavioral patterns of cyberattacks. Consequently, users can study realistic attack scenarios without compromising organizational security.

Structured into Four Key Subsets

To enhance usability, WitFoo divided the dataset into four distinct subsets. First, the Signals subset contains 100 million normalized security events collected from sources like syslog, Windows Security Auditing, VPC flow logs, and endpoint telemetry. Each record includes detailed attributes such as timestamps, usernames, hostnames, network metadata, severity levels, and sanitized message content.
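To make the record layout concrete, here is a minimal Python sketch of what one Signals record might look like and how events could be tallied by source. The field names, severity scale, and sample values are assumptions made for illustration; they are not taken from the published schema, which should be checked before writing real code.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Signal:
    """Hypothetical shape of one Signals record (field names are assumed)."""
    timestamp: datetime
    source: str      # e.g. "syslog" or "vpc_flow" (assumed labels)
    username: str
    hostname: str
    severity: int    # assumed scale: higher means more severe
    message: str     # sanitized message content


# Made-up sample rows, not taken from the dataset.
SAMPLE_SIGNALS = [
    Signal(datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc),
           "syslog", "user-1", "host-a", 3, "session opened"),
    Signal(datetime(2024, 1, 1, 0, 1, tzinfo=timezone.utc),
           "windows_security", "user-2", "host-b", 7, "logon failure"),
    Signal(datetime(2024, 1, 1, 0, 2, tzinfo=timezone.utc),
           "syslog", "user-1", "host-a", 8, "privilege change"),
]


def count_by_source(signals, min_severity=0):
    """Count events per source, keeping only those at or above min_severity."""
    return Counter(s.source for s in signals if s.severity >= min_severity)


all_counts = count_by_source(SAMPLE_SIGNALS)
high_counts = count_by_source(SAMPLE_SIGNALS, min_severity=7)
print(dict(all_counts))
print(dict(high_counts))
```

A frozen dataclass is used here only to keep the sample records immutable; for 100 million rows a columnar tool would be a more realistic choice.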

In addition, the Graph Edges and Graph Nodes subsets map relationships between users, hosts, processes, and network connections. The Incidents subset adds correlated security incidents enriched with binary classification labels, confidence scores, MITRE ATT&CK mappings, and lifecycle metadata. Together, the four subsets make the dataset well suited to machine learning, threat detection, and advanced cybersecurity research.
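As a rough illustration of the graph-based analysis that the Graph Edges and Graph Nodes subsets could support, the Python sketch below builds an adjacency map from edge rows and finds everything connected to one user. The `Edge` fields, the node identifier format, and the relation names are hypothetical and are not drawn from the published schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical Graph Edges row: source node, target node, relation."""
    src: str
    dst: str
    relation: str


def build_adjacency(edges):
    """Map each node id to the set of node ids it is linked to (undirected)."""
    adjacency = defaultdict(set)
    for edge in edges:
        adjacency[edge.src].add(edge.dst)
        adjacency[edge.dst].add(edge.src)
    return adjacency


def reachable(adjacency, start):
    """Return every node reachable from start, using an iterative DFS."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


# Made-up sample rows, not taken from the dataset.
SAMPLE_EDGES = [
    Edge("user:alice", "host:web-01", "logged_on"),
    Edge("host:web-01", "proc:powershell", "spawned"),
    Edge("host:web-01", "ip:203.0.113.7", "connected_to"),
    Edge("user:bob", "host:db-02", "logged_on"),
]

adjacency = build_adjacency(SAMPLE_EDGES)
touched = reachable(adjacency, "user:alice")
print(sorted(touched))
```

Treating edges as undirected is a simplification chosen for this sketch; an analysis that cares about direction, such as which host spawned which process, would keep separate inbound and outbound maps.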

Furthermore, WitFoo has published the sanitization codebase as open source. This move ensures transparency, allowing researchers to examine how sensitive data was handled and removed.

Driving Real-World Cybersecurity Research

Traditionally, many open cybersecurity datasets relied on simulated environments, which often limited their effectiveness in real-world applications. However, Precinct 6 stands apart because it reflects genuine adversary behavior captured during live attacks. As a result, it provides a more accurate foundation for developing and testing detection models.

Researchers can leverage this dataset for intrusion detection, anomaly detection, graph-based threat analysis, automated incident response, benchmarking, and educational purposes. Additionally, WitFoo has made the dataset available under an Apache 2.0 license, enabling free usage across academic, commercial, and government sectors.

“For a decade, WitFoo ran over 4,000 experiments with Fortune 500 companies, universities, and government agencies to develop Empathetic Processing. This dataset is the product of that research, and we believe it belongs in the hands of the academic community,” said Charles Herring, Chairman and Co-Founder, WitFoo.

He further emphasized the uniqueness of the dataset:

“Most publicly available cybersecurity datasets were generated in lab environments with scripted attacks and synthetic traffic. That’s useful for basic benchmarking, but it doesn’t teach you what real adversaries actually look like in a production network. This data comes from live attack traffic observed in 2024. The attackers didn’t know they were being recorded, and they weren’t following a script. We’ve sanitised the data to protect the organisations involved, and we’ve published the sanitisation code itself as open source so researchers can verify exactly how we did it. Cybersecurity’s biggest bottleneck isn’t compute or clever algorithms. It’s the lack of realistic data that researchers can actually train against,” said Charles Herring.

Adding further perspective, Dr Etienne Borde highlighted the dataset’s academic value:

“One of the persistent challenges in cybersecurity research is that most available datasets are either synthetic or derived from controlled laboratory exercises, which limits how well models trained on them generalise to real-world conditions,” said Dr Etienne Borde, Associate Professor in Computer Science and Software Engineering, University of Canterbury.

He also noted the dataset’s impact on future research:

“A dataset of this scale built from live attack traffic is genuinely rare. It opens up research pathways that simply weren’t feasible before, from graph-based threat modelling to evaluating AI-driven detection systems against authentic adversary behaviour. We look forward to incorporating this resource into our research and teaching programmes at Canterbury,” said Borde.


