WARCannon – High Speed/Low Cost CommonCrawl RegExp In Node.js

August 7, 2021, 5:04 AM

WARCannon was built to simplify and cheapify the process of ‘grepping the internet’.

With WARCannon, you can:

Build and test regex patterns against real Common Crawl data
Easily load Common Crawl datasets for parallel processing
Scale compute capabilities to asynchronously crunch through WARCs at frankly unreasonable capacity.
Store and easily retrieve the results

How it Works

WARCannon leverages clever use of AWS technologies to horizontally scale to any capacity, minimize cost through spot fleets and same-region data transfer, draw from S3 at incredible speeds (up to 100Gbps per node), parallelize across hundreds of CPU cores, report status via DynamoDB and CloudFront, and store results via S3.

In all, WARCannon can process multiple regular expression patterns across 400TB in a few hours for around $100.

Installation

WARCannon requires that you have the following installed:

awscli (v2)
terraform (v0.11)
jq
jsonnet
npm (v12 or v14)

ProTip: To keep things clean and distinct from other things you may have in AWS, it’s STRONGLY recommended that you deploy WARCannon in a fresh account. You can create a new account easily from the ‘Organizations’ console in AWS. By ‘STRONGLY recommended’, I mean ‘seriously don’t install this next to other stuff’.

First, clone the repo and copy the example settings.

$ git clone git@github.com:c6fc/warcannon.git
$ cd warcannon
warcannon$ cp settings.json.sample settings.json

Edit settings.json to taste:

backendBucket: The bucket in which to store the Terraform state. If it doesn’t exist, WARCannon will create it for you. Replace ‘< somerandomcharacters >’ with random characters to make it unique, or specify another bucket you own.
awsProfile: The profile name in ~/.aws/credentials that you want to piggyback on for the installation.
nodeInstanceType: An array of instance types to use for parallel processing. ‘c’-types are best value for this purpose, and any size can be used. ["c5n.18xlarge"] is the recommended value for true campaigns.
nodeCapacity: The number of nodes to request during parallel processing. The resulting nodes will be an arbitrary distribution of the nodeInstanceTypes you specify.
nodeParallelism: The number of simultaneous WARCs to process per vCPU. 2 is a good number here. If nodes have insufficient RAM to run at this level of parallelism (as you might encounter with ‘c’-type instances), they’ll run at the highest safe parallelism instead.
nodeMaxDuration: The maximum lifespan of compute nodes in seconds. Nodes will be automatically terminated after this time if the job has still not completed. Default value is 24 hours.
sshPubkey: A public SSH key to facilitate remote access to nodes for troubleshooting.
allowSSHFrom: A CIDR mask to allow SSH from. Typically this will be <yourpublicip>/32
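
As a rough illustration of the settings described above, the following JavaScript sketch builds a settings object and runs a few sanity checks on it. The key names come from the list above; every value, the checkSettings helper, and the rules it applies are hypothetical and are not part of WARCannon.

```javascript
// Hypothetical values; only the key names come from the settings list above.
const settings = {
  backendBucket: "warcannon-terraform-state-a1b2c3",
  awsProfile: "warcannon",
  nodeInstanceType: ["c5n.18xlarge"],
  nodeCapacity: 6,
  nodeParallelism: 2,
  nodeMaxDuration: 86400, // 24 hours, in seconds
  sshPubkey: "ssh-ed25519 AAAA... user@host",
  allowSSHFrom: "203.0.113.10/32", // documentation-range IP used as a placeholder
};

// Illustrative checks only; WARCannon's own validation (if any) may differ.
function checkSettings(s) {
  const problems = [];
  if (!Array.isArray(s.nodeInstanceType) || s.nodeInstanceType.length === 0) {
    problems.push("nodeInstanceType must be a non-empty array");
  }
  if (!Number.isInteger(s.nodeCapacity) || s.nodeCapacity < 1) {
    problems.push("nodeCapacity must be a positive integer");
  }
  if (!Number.isInteger(s.nodeMaxDuration) || s.nodeMaxDuration < 1) {
    problems.push("nodeMaxDuration must be a positive number of seconds");
  }
  if (!/^\d{1,3}(\.\d{1,3}){3}\/\d{1,2}$/.test(s.allowSSHFrom)) {
    problems.push("allowSSHFrom must look like an IPv4 CIDR, e.g. 1.2.3.4/32");
  }
  return problems;
}

console.log(checkSettings(settings)); // an empty array means no problems were found
```

A check like this is cheap to run before deploying, since a typo such as passing nodeInstanceType as a string instead of an array is easier to catch locally than after infrastructure has been created.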

Grepping the Internet

WARCannon is fed by Common Crawl via the AWS Open Data program. Common Crawl is unique in that the data retrieved by their spiders not only captures website text, but also other text-based content like JavaScript, TypeScript, full HTML, CSS, etc. By constructing suitable Regular Expressions capable of identifying unique components, researchers can identify websites by the technologies they use, and do so without ever touching the website themselves. The problem is that this requires parsing hundreds of terabytes of data, which is a tall order no matter what resources you have at your disposal.

Developing Regular Expressions

Grepping the internet isn’t for the faint of heart, but starting with an effective sieve is the first step. WARCannon supports this by enabling local verification of regular expressions against real Common Crawl data. First, open lambda_functions/warcannon/matches.js and modify the regex_patterns object to include the regular expressions you wish to use in name: pattern format. Here’s an example from the default search set:

exports.regex_patterns = {
"access_key_id": /(\'A|"A)(SIA|KIA|IDA|ROA)[JI][A-Z0-9]{14}[AQ][\'"]/g,
};

Strings matching this expression will be saved under the corresponding key in the results; access_key_id in this case. Protip: Use RegExr with the ‘JavaScript’ format to build and test regular expressions against known-good matches.
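
Before running anything against real crawl data, it can help to try a pattern on a couple of hand-written strings. The sketch below does that in plain Node.js with the access_key_id pattern from above. The sample text is made up for illustration, and the variable names are not WARCannon's.

```javascript
// Pattern copied from the default search set shown above.
const accessKeyId = /(\'A|"A)(SIA|KIA|IDA|ROA)[JI][A-Z0-9]{14}[AQ][\'"]/g;

// Made-up sample text: one quoted string that fits the pattern, one that does not.
const sample = [
  'const key = "AKIAJ0123456789ABCDA";', // fits: quote, AKIA, J, 14 characters, A, quote
  'const other = "not-a-key";',
].join("\n");

// String.prototype.match with a /g pattern returns every match, or null if there are none.
const found = sample.match(accessKeyId) || [];
console.log(found);
```

Note that the surrounding quote characters are part of the pattern, so they are part of each match as well.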

You also have the option of only capturing results from specified domains. To do this, simply populate the domains array with the FQDNs that you wish to include. It is recommended that you leave this empty [] since it’s almost never worthwhile (the processing effort saved is very small), but it can be useful in some niche cases.

exports.domains = ["example1.com", "example2.com"];

Once matches.js is populated, run the following command:

warcannon$ ./warcannon testLocal <warc_path>

WARCannon will then download the warc and parse it with your configured matches. There are a few quality-of-life things that WARCannon does by default that you should be aware of:

WARCannon will download the warc to /tmp/warcannon.testLocal on first run, and will re-use the downloaded warc from then on even if you change the warc_path. If you wish to use a different warc, you must delete this file.
WARCs are large, with most coming in at just over 1GB. WARCannon uses the CLI for multi-threaded downloads, but if you have slow internet, you’ll need to exercise patience the first time around.

On top of everything else, WARCannon will attempt to evaluate the total compute cost of your regular expressions when run locally. This way, you can be informed if a given regular expression will significantly impact performance before you execute your campaign.
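
How WARCannon computes this cost isn’t described here, but the general idea of comparing patterns by how long they take on the same text can be sketched in a few lines. The timePattern helper and the filler text below are hypothetical, and wall-clock timing is only a rough stand-in for whatever measure WARCannon actually reports.

```javascript
// Hypothetical helper: time how long a /g pattern takes to scan a block of text.
function timePattern(pattern, text) {
  const start = process.hrtime.bigint();
  const matches = text.match(pattern) || [];
  const elapsedNs = process.hrtime.bigint() - start;
  return { matches: matches.length, elapsedMs: Number(elapsedNs) / 1e6 };
}

// Made-up input: filler text with a single fitting string appended.
const text = "lorem ipsum dolor sit amet ".repeat(50000) + '"AKIAJ0123456789ABCDA"';

const patterns = {
  access_key_id: /(\'A|"A)(SIA|KIA|IDA|ROA)[JI][A-Z0-9]{14}[AQ][\'"]/g,
};

for (const [name, pattern] of Object.entries(patterns)) {
  const { matches, elapsedMs } = timePattern(pattern, text);
  console.log(`${name}: ${matches} match(es) in ${elapsedMs.toFixed(2)} ms`);
}
```

Adding a second pattern to the patterns object and comparing the timings gives a quick feel for which expressions are expensive.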

Performing Custom Processing

Sometimes a simple regex pattern isn’t sufficient on its own, and you need some additional steps to ensure you’re returning the right information. In this case, simply adding a function to the exports.custom_functions object with the same key name allows you to perform any additional processing you see fit.

exports.regex_patterns = {
    "access_key_id": /(\'A|"A)(SIA|KIA|IDA|ROA)[JI][A-Z0-9]{14}[AQ][\'"]/g,
};

exports.custom_functions = {
    "access_key_id": function(match) {
        // Ignore matches with 'EXAMPLE' in the text, since this is common for documentation.
        // (This assumes 'match' is the matched string.)
        if (match.match(/EXAMPLE/) != null) {
            // Returning a boolean 'false' discards the match.
            return false;
        }
    }
};
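
To see the effect of a custom function without running WARCannon, the sketch below wires a pattern and its custom function together in a small stand-alone driver. The scan function is hypothetical and is not WARCannon’s code. It assumes the custom function receives the matched text as a string and that only a strict false discards a match, as described above.

```javascript
// Same shape as matches.js above, written as plain objects so the sketch is self-contained.
const regexPatterns = {
  access_key_id: /(\'A|"A)(SIA|KIA|IDA|ROA)[JI][A-Z0-9]{14}[AQ][\'"]/g,
};

const customFunctions = {
  // Assumes the match arrives as a string; returning false discards it.
  access_key_id: (match) => (match.includes("EXAMPLE") ? false : undefined),
};

// Hypothetical driver: collect matches per key and drop any match
// whose custom function returns exactly false.
function scan(text) {
  const results = {};
  for (const [name, pattern] of Object.entries(regexPatterns)) {
    const custom = customFunctions[name];
    const kept = (text.match(pattern) || []).filter(
      (m) => !custom || custom(m) !== false
    );
    if (kept.length > 0) results[name] = kept;
  }
  return results;
}

// Made-up input: both quoted strings fit the pattern, but the second contains EXAMPLE.
const text = 'a = "AKIAJ0123456789ABCDA"; b = "AKIAIEXAMPLE0123456A";';
console.log(scan(text));
```

Only the first string survives, stored under the access_key_id key.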

Note: WARCannon is meant to crunch through text at stupid speeds. While it’s certainly possible to perform any type of operation you’d like, adding high-latency custom functions such as network calls can significantly increase processing time and costs. Network calls could also result in LOTS of calls against a website, which could get you in trouble. Be smart about how you use these functions.

Performing a One-Off Test in AWS

The costs of AWS can be anxiety-inducing, especially when you’re only looking to do some research. WARCannon is built to allow both one-off executions and full campaigns, so you can be confident in the results you’ll get back. Once you’re happy with the results you get with testLocal, you can deploy your updated matches and run a cloud-backed test easily:

warcannon$ ./warcannon deploy
warcannon$ ./warcannon test <warc_path>

This will synchronously execute a Lambda function with the regular expressions you’ve configured, and immediately return the results. This process takes about 2.5 minutes, so don’t be afraid to wait while it does its magic.

Launching a Real Campaign

Once you’re happy with the results you get in Lambda, you’re ready to grep the internet for real. We’ll first go over some basic housekeeping, then kick it off.

Clearing the Queue

WARCannon uses AWS Simple Queue Service to distribute work to the compute nodes. To ensure that your results aren’t tainted with any prior runs, you can tell WARCannon to empty the queue:

warcannon$ ./warcannon emptyQueue
[+] Cleared [ 15 ] messages from the queue

You can then verify the state of the queue:

warcannon$ ./warcannon status
Deployed: [ YES ]; SQS Status: [ EMPTY ]
Job Status: [ INACTIVE ]
Active job url: https://d201offlnmhkmd.cloudfront.net

Verify the following before proceeding:

The SQS Queue is empty
The job status is ‘INACTIVE’

Populating the Queue (Simple)

In order to create the queue messages that the compute nodes will consume, you must first populate SQS with crawl data. WARCannon has several commands to help with this, starting with the ability to show the available scans. In this case, let’s look at the scans available for the year 2021:

warcannon$ ./warcannon list 2021
CC-MAIN-2021-04
CC-MAIN-2021-10

We have two scans matching the string “2021” to work with. We can now instruct WARCannon to populate the queue based on one of these scans. This time, we need to provide a parameter that uniquely identifies one of the scans. “2021-04” will do the trick. We could choose to populate only a partial scan by also specifying a number of chunks and a chunk size, but we’ll skip that for now.

warcannon$ ./warcannon populate 2021-04
Created 799 chunks of 100 from 79840 available segments
{
"StatusCode": 200,
"ExecutedVersion": "$LATEST"
}
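
The numbers in that output are consistent with plain fixed-size chunking: 79840 segments split into groups of 100 gives 799 chunks, with the last one only partly full. The sketch below reproduces that arithmetic. The chunk helper and the placeholder segment names are hypothetical, and whether WARCannon sends exactly one queue message per chunk is an assumption.

```javascript
// Hypothetical helper: split a list of segment paths into chunks of a fixed size.
function chunk(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Placeholder segment names; real entries would be WARC paths from the crawl.
const segments = Array.from({ length: 79840 }, (_, i) => `segment-${i}`);
const chunks = chunk(segments, 100);

console.log(chunks.length);                    // 799, i.e. Math.ceil(79840 / 100)
console.log(chunks[chunks.length - 1].length); // 40 segments in the final chunk
```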

Populating the Queue via Athena (Advanced)

During deployment, WARCannon automatically provisions a database (warcannon_commoncrawl) and workgroup (warcannon) in Athena that can be used to rapidly query information from Common Crawl. This can be especially useful for populating sparse campaigns based on certain queries. For example, the following query will search for WARCs that contain responses from ‘example.com’:

SELECT
    warc_filename,
    COUNT(url_path) as num
FROM
    warcannon_commoncrawl.ccindex
WHERE
    subset = 'warc' AND
    url_host_registered_domain IN ('example.com') AND
    crawl = 'CC-MAIN-2021-04'
GROUP BY warc_filename
ORDER BY num DESC

You can use the Athena console to fine-tune your results, but you must run the query from the WARCannon command line if you intend to populate a job with it:

./warcannon queryAthena "SELECT warc_filename, COUNT(url_path) as num FROM warcannon_commoncrawl.ccindex WHERE subset = 'warc' AND url_host_registered_domain IN ('example.com') AND crawl = 'CC-MAIN-2021-04' GROUP BY warc_filename ORDER BY num DESC"
[+] Query Exec Id: 0319486e-1846-491c-badf-2e23ae213974 .. SUCCEEDED

WARCannon can then use the results of a query to populate the queue, and does so based on the warc_filename column from the resultset. As such, it’s recommended that you either group by this column or use distinct() to avoid duplicates. WARCannon will throw an error if this field isn’t present. Populate the queue with Athena results by passing the Query Execution ID to the populateAthena command.

./warcannon populateAthena 0319486e-1846-491c-badf-2e23ae213974
Created 26 chunks of 10 from 251 available segments
{
"StatusCode": 200,
"ExecutedVersion": "$LATEST"
}
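
The advice about grouping or using distinct() comes down to each WARC filename appearing only once in the list that feeds the queue. A minimal sketch of that idea follows. The rows, the file paths, and the uniqueWarcFilenames helper are made up for illustration, and WARCannon’s own handling of the result set may differ.

```javascript
// Made-up rows shaped like the Athena result set (warc_filename plus a count).
const rows = [
  { warc_filename: "crawl-data/a.warc.gz", num: 7 },
  { warc_filename: "crawl-data/b.warc.gz", num: 3 },
  { warc_filename: "crawl-data/a.warc.gz", num: 7 }, // duplicate that should be dropped
];

// Hypothetical helper: fail if the column is missing, then keep each filename once.
function uniqueWarcFilenames(resultRows) {
  const seen = new Set();
  for (const row of resultRows) {
    if (typeof row.warc_filename !== "string") {
      throw new Error("result set has no warc_filename column");
    }
    seen.add(row.warc_filename);
  }
  return [...seen];
}

console.log(uniqueWarcFilenames(rows)); // two entries: a.warc.gz and b.warc.gz
```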

Note: While populating a sparse job for a single domain might seem like a good idea, it often isn’t. The responses from a single domain tend to be spread widely across a large subset of WARCs. Running the example query above shows this clearly: of the ~150,000 records in each WARC, the largest single hit for a moderate-sized website can be in the single digits.

Firing the WARCannon

With the queue populated, we’re ready to fire. WARCannon will do a few sanity checks to ensure everything is in order, then show you the configuration of the campaign and give you one last opportunity to abort before you finalize the order.

warcannon$ ./warcannon fire
[!] This will request [ 6 ] spot instances of type [ m5n.24xlarge, m5dn.24xlarge ]
lasting for [ 86400 ] seconds.
To change this, edit your settings in settings.json and run ./warcannon deploy
--> Ready to fire? [Yes]:

Pull the trigger by responding with ‘Yes’.

--> Ready to fire? [Yes]: Yes
{
"SpotFleetRequestId": "sfr-03dd32b8-51f7-4c8e-802b-a702fc3c8c95"
}
[+] Spot fleet request has been sent, and nodes should start coming online within ~5 minutes.
Monitor node status and progress at https://d201offlnmhkmd.cloudfront.net

The response includes a link to your unique status URL, where you can monitor the progress of your campaign and the performance of each node.

Obtaining Results

WARCannon results are stored in S3 in JSON format, broken down by each node responsible for producing the results. Athena results are stored in the same bucket under the /athena/ prefix. You can sync the results of a campaign to the ./warcannon/results/ folder on your local machine using the syncResults command.

./warcannon syncResults
sync: s3://warcannon-results-202...
sync: s3://warcannon-results-202...
sync: s3://warcannon-results-202...
...

You can then empty the results buckets with clearResults:

./warcannon clearResults
delete: s3://warcannon-results-202...
delete: s3://warcannon-results-202...
delete: s3://warcannon-results-202...
...
[+] Deleted [ 21 ] files from S3.

Source : KitPloit – PenTest Tools!
