Octopii – An AI-powered Personal Identifiable Information (PII) Scanner

Octopii is an open-source AI-powered Personal Identifiable Information (PII) scanner that can look for image assets such as Government IDs, passports, photos and signatures in a directory.

Working

Octopii uses Tesseract’s Optical Character Recognition (OCR) and Keras’ Convolutional Neural Networks (CNN) models to detect various forms of personal identifiable information that may be leaked on a publicly facing location. This is done in the following steps:

1. Importing and cleaning image(s)

The image is imported via OpenCV and Python Imaging Library (PIL) and is cleaned, deskewed and rotated for scanning.

2. Performing image classification and Optical Character Recognition (OCR)

A directory is looped over and searched for images. These images are scanned for unique features via the image classifier (done by comparing it to a trained model), along with OCR for finding substrings within the image. This may have one of the following outcomes:

Best case (score >=90): The image is sent into the image classifier algorithm to be scanned for features such as an ISO/IEC 7810 card specification, colors, location of text, photos, holograms etc. If it is successfully classified as a type of PII, OCR is performed on it looking for particular words and strings as a final check. When both of these are confirmed, the result from Octopii is extremely reliable.
Average case (score >=50): The image is partially/incorrectly identified by the image classifier algorithm, but an OCR check finds contradicting substrings and reclassifies it.
Worst case (score >=0): The image is only identified by the image classifier algorithm but an OCR scan returns no results.
Incorrect classification: False positives due to a very small model or OCR list may incorrectly classify PIIs, giving inaccurate results.

As a final verification method, images are scanned for certain strings to verify the accuracy of the model.
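The string check described above can be pictured as a case-insensitive substring search over the OCR output. The following is a minimal sketch, not Octopii's actual code: the function name find_keywords, the sample text and the keyword list are all hypothetical and only illustrate the idea.

```python
from __future__ import annotations


def find_keywords(ocr_text: str, keywords: list[str]) -> list[str]:
    """Return the keywords that occur in the OCR text, ignoring case."""
    # Collapse newlines and repeated spaces so that phrases split across
    # OCR lines can still match.
    haystack = " ".join(ocr_text.lower().split())
    return [kw for kw in keywords if " ".join(kw.lower().split()) in haystack]


# Hypothetical OCR output and keyword list, for illustration only.
sample_text = "INCOME TAX\nDEPARTMENT\nPermanent Account Number"
matches = find_keywords(sample_text, ["Income Tax Department", "Passport"])
print(matches)
```

Normalizing whitespace before matching is a design choice made here because OCR output often breaks a phrase across lines; a real scanner may need fuzzier matching to tolerate misread characters.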

The accuracy of the scan can be determined via the confidence scores in the output. If all the mentioned conditions are met, a score of 100.0 is returned.
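To make the score bands listed earlier concrete, here is a small sketch that maps a confidence score to the case it falls into. The thresholds (90 and 50) come from the list above; the function name describe_score and the assumption that scores lie between 0 and 100 are illustrative and not taken from Octopii's source.

```python
def describe_score(score: float) -> str:
    """Map a confidence score to the outcome band it falls into."""
    # Assumption: scores range from 0.0 to 100.0 inclusive.
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 90:
        return "best case"      # classifier and OCR agree
    if score >= 50:
        return "average case"   # OCR corrected the classifier
    return "worst case"         # classifier only, OCR found nothing


for value in (100.0, 72.5, 10.0):
    print(value, "->", describe_score(value))
```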

To train the model, data can also be fed into the model_generator.py script, and the newly improved h5 file can be used.

Usage

Install all dependencies via pip install -r requirements.txt.
Install the Tesseract helper locally via sudo apt install tesseract-ocr -y (for Ubuntu/Debian).
To run Octopii, type python3 octopii.py <location name>, for example python3 octopii.py pii_list/

python3 octopii.py <location to scan> <additional flags>

Octopii currently supports local scanning and scanning S3 directories and open directory listings via their URLs.

Example

Contributing

Open-source projects like these thrive on community support. Since Octopii relies heavily on machine learning and optical character recognition, contributions are much appreciated. Here’s how to contribute:

1. Fork

Fork the official repository at https://github.com/redhuntlabs/octopii

2. Understand

There are 3 files in the models/ directory.
– The keras_models.h5 file is the Keras h5 model that can be obtained from Google’s Teachable Machine or via Keras in Python.
– The labels.txt file contains the list of labels corresponding to the index that the model returns.
– The ocr_list.json file consists of keywords to search for during an OCR scan, as well as other miscellaneous information such as country of origin, regular expressions etc.
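As an illustration of how labels.txt relates to the model output, the sketch below parses label lines into an index-to-name mapping. It assumes the format that Teachable Machine typically exports, one "index name" pair per line; the function name parse_labels and the sample labels are hypothetical.

```python
from __future__ import annotations


def parse_labels(text: str) -> dict[int, str]:
    """Parse lines such as '0 German Passport' into {0: 'German Passport'}."""
    labels: dict[int, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first space only, since class names may contain spaces.
        index, _, name = line.partition(" ")
        labels[int(index)] = name.strip()
    return labels


# Hypothetical file contents, for illustration only.
sample = "0 German Passport\n1 California Driver License\n"
labels = parse_labels(sample)
print(labels[1])
```

The index returned by the classifier can then be looked up in this mapping to obtain the asset class name.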

Generating models via Teachable Machine

Since our current dataset is quite small, we could benefit from a large Keras model of international PII for this project. If you do not have expertise in Keras, Google provides an extremely easy to use model generator called the Teachable Machine. To use it:

Visit https://teachablemachine.withgoogle.com/train and select ‘Image Project’ → ‘Standard Image Model’.
A few classes are visible. Rename a class to the asset type you’d like to upload, such as “German Passport” or “California Driver License”.
Add images by clicking the ‘Upload’ button and uploading some image assets. Note: images have to be square.

Tip: segregate your image assets into folders with the folder name being the same as the class name. You can then drag and drop a folder into the upload dialog.

Click ‘+ Add a class’ at the bottom of the page to add more classes with data and repeat. You can make the classes more specific, such as “Goa Driver License Old Format”.

Note: Only upload images that match the class name. For example, the German Passport class must contain only German Passport pictures. Uploading the wrong data to a class will confuse the machine learning algorithms.

Verify the classes and images one last time. Once you’re ready, click on the ‘Train Model’ button. You can increase the number of epochs (such as 5000) to improve model accuracy.
To test the model, click the Input dropdown and select ‘File’, then upload a sample image.
Once you’re ready, click the ‘Export Model’ button. In the dialog that pops up, select the ‘Tensorflow’ tab (not Tensorflow.js) and select the ‘Keras’ radio button, then click ‘Download my model’ to export the newly generated model. Extract the downloaded zip file and paste the keras_model.h5 file and labels.txt file into the models/ directory in Octopii.

The images used for the model above are not visible to us since they’re in a proprietary format. You can use both dummy and actual PII. Make sure they are square-ish in image size.

Updating OCR list

Once you generate models using Teachable Machine, you can improve Octopii’s accuracy via OCR. To do this:

Open the existing ocr_list.json file. Create a JSONObject with the key having the same name as the asset class. NOTE: The key name must be exactly the same as the asset class name from Teachable Machine.
For the keywords, use as many unique terms from your asset as possible, such as “Income Tax Department”. Store them in a JSONArray.
(Advanced) you can also add regexes for things like ID numbers and MRZ on passports if they are unique enough. Use https://regex101.com to test your regexes before adding them.
Save/overwrite the existing ocr_list.json file.
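The steps above could look roughly like the following sketch. The field names ("region", "keywords", "regex") and the class name "Indian PAN Card" are assumptions made for illustration; check the existing ocr_list.json for the real schema before copying anything.

```python
import json
import re

# Hypothetical entry: the top-level key must match the Teachable Machine
# class name exactly. The inner field names are assumed, not confirmed.
entry = {
    "Indian PAN Card": {
        "region": "IN",  # ISO 3166-1 alpha-2 country code
        "keywords": ["Income Tax Department", "Permanent Account Number"],
        # Five letters, four digits, one letter.
        "regex": r"[A-Z]{5}[0-9]{4}[A-Z]",
    }
}

# Check that the regex compiles and matches a made-up sample value.
pattern = re.compile(entry["Indian PAN Card"]["regex"])
print(bool(pattern.search("ABCDE1234F")))

# Serialize the entry as it would be merged into ocr_list.json.
serialized = json.dumps(entry, indent=2)
print(serialized)
```

Compiling each regex in Python before saving is a quick complement to testing it on regex101, since it catches syntax that the target engine rejects.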

3. Edit

You can replace each file you modify in the models/ directory after you create or edit them via the above methods.

4. Pull request

Submit a pull request from your forked repo and we’ll pick it up and replace our current model with it if the changes are large enough.

Note: Please take the following steps to ensure quality

Make sure the model returns extremely accurate results by testing it locally first.
Use proper text casing for label names in both the Keras model and ocr_list.json.
Make sure all JSON is valid, with appropriate character escapes and no duplicate keys, regexes or keywords.
For country names, please use the ISO 3166-1 alpha-2 code of the country.
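One way to check the JSON requirements locally is sketched below. Python's json module silently keeps the last value when a key is repeated, so this example uses the object_pairs_hook parameter of json.loads to reject duplicates instead. The helper names reject_duplicate_keys and load_strict are hypothetical and not part of Octopii.

```python
import json


def reject_duplicate_keys(pairs):
    """Build a dict from (key, value) pairs, failing on repeated keys."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen


def load_strict(text):
    """Parse JSON text, raising ValueError on invalid JSON or duplicate keys."""
    # The hook is called for every JSON object, including nested ones.
    return json.loads(text, object_pairs_hook=reject_duplicate_keys)


try:
    load_strict('{"German Passport": {}, "German Passport": {}}')
except ValueError as error:
    print("rejected:", error)
```

Running the edited ocr_list.json contents through a check like this before opening a pull request catches both syntax errors and repeated class names.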

Credits

License

MIT License

Author: Owais Shaikh

Source : KitPloit – PenTest Tools!
