Data Extractor
Extract structured data from documents in any format: PDF, DOCX, HTML, TXT, images, and more. Converts unstructured or semi-structured content into clean JSON, CSV, or other structured formats. Handles invoices, forms, reports, and free-text documents.
Overview
The Data Extractor is a skill that lets AI agents process unstructured and semi-structured information from various file formats. Available within the TerminalSkills/skills repository, it enables agents like Claude, Gemini, and Codex to parse content from PDFs, DOCX files, HTML, and images. It focuses on transforming raw text, invoices, and reports into organized formats such as JSON or CSV for further analysis, so users can automate the conversion of free-text documents into machine-readable data structures. The TerminalSkills collection, which hosts this skill, currently has 72 stars on GitHub and is aimed at developers building agentic workflows and automated document processing pipelines.
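As a rough illustration of the text-to-structured-data conversion described above, the following Python sketch pulls a few fields out of free text and emits them as JSON and CSV. It is not the skill's actual implementation: the sample invoice text, the field names, the regular expressions, and the helper functions are all hypothetical, and real PDFs, DOCX files, or images would first need a text-extraction or OCR step that is not shown here.

```python
import csv
import io
import json
import re

# Hypothetical sample input; real documents would first need text extraction.
SAMPLE_INVOICE = """\
Invoice Number: INV-1001
Date: 2024-03-15
Total: $1,250.00
"""

# Hypothetical field patterns, written for the sample above only.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice Number:\s*(\S+)"),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total": re.compile(r"Total:\s*\$?([\d,]+\.\d{2})"),
}


def extract_fields(text):
    """Return a dict with one entry per pattern; missing fields become None."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[name] = match.group(1) if match else None
    # Normalize the amount to a number so downstream tools can aggregate it.
    if record["total"] is not None:
        record["total"] = float(record["total"].replace(",", ""))
    return record


def to_json(record):
    """Serialize one record as pretty-printed JSON."""
    return json.dumps(record, indent=2)


def to_csv(record):
    """Serialize one record as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(record))
    writer.writeheader()
    writer.writerow(record)
    return buffer.getvalue()


if __name__ == "__main__":
    record = extract_fields(SAMPLE_INVOICE)
    print(to_json(record))
    print(to_csv(record))
```

Returning None for fields that do not match, rather than raising an error, keeps partially filled documents usable and makes gaps easy to spot in the output.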
Use Cases
Install Notes
# Review source first
open https://github.com/TerminalSkills/skills/blob/main/skills/data-extractor/SKILL.md
Copy or clone the skill folder into your agent skills directory after reviewing its instructions and scripts.
Security Notes
This skill processes document content to generate structured outputs. Users should ensure that sensitive information within PDFs, images, or text files is handled according to their own privacy requirements. The skill runs inside the execution environment of the compatible AI agent, so its data handling is subject to the permissions granted to that agent.
Related Skills
Feedback Analysis
TerminalSkills/skills
Collect user feedback from multiple channels, categorize it, extract patterns, and turn it into prioritized product decisions. Build a systematic process from raw input to actionable insight.
Data Validator
TerminalSkills/skills
Perform comprehensive data quality checks on datasets — validate schemas, detect anomalies, find duplicates, and enforce data contracts. Essential for ETL pipelines where bad data silently corrupts downstream analytics and dashboards.
Pandas
TerminalSkills/skills
Pandas is a Python library for loading, cleaning, transforming, and analyzing tabular data. It provides DataFrames for structured data manipulation, supports CSV, Excel, SQL, JSON, and Parquet formats, and offers powerful groupby aggregation, merge/join operations, time series resampling, and method chaining for buildi
Data Analysis
TerminalSkills/skills
Analyze tabular data from CSV, Excel, or other structured formats. Generate summary statistics, discover patterns, answer specific questions, and produce visualizations. Uses Python with pandas for data manipulation and matplotlib/seaborn for charts.