Building a Custom PDF Parser with PyPDF and LangChain

Image by Author | Canva

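PDF files are everywhere. You have probably seen them in many places: college papers, electricity bills, office contracts, product manuals, and more. They are extremely common, but working with them is not easy. Suppose you want to extract useful information from a PDF, such as reading its text, splitting it into sections, or getting a quick summary. That may sound simple, but once you try it, you will find it is not so smooth.

Unlike Word or HTML files, PDFs do not store content in a neat, easy-to-read way. They are designed to look good, not to be read by programs. Text can be all over the place: split into odd blocks, scattered across the page, or mixed in with tables and images. This makes it hard to get clean, structured data out of them.

In this article, we will build something that can handle this mess. We will create a custom PDF parser that can: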

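Extract and clean text from PDFs at the page level, with optional layout preservation for better formatting
Handle image metadata extraction
Remove unwanted headers and footers by detecting lines that repeat across pages, to reduce noise
Retrieve detailed document-level and page-level metadata, such as author, title, creation date, rotation, and page size
Chunk the content into manageable pieces for further NLP or LLM processing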

Let's get started.

Folder Structure

Before starting, it is a good idea to organize your project files for clarity and scalability.

custom_pdf_parser/
│
├── parser.py           
├── langchain_loader.py  
├── pipeline.py          
├── example.py     
├── requirements.txt     # Dependencies list
└── __init__.py         # (Optional) to mark directory as Python package

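You can leave __init__.py empty, since its main purpose is simply to indicate that this directory should be treated as a Python package. I will explain the purpose of each of the remaining files step by step.

Required Tools (requirements.txt)

The required libraries are:

PyPDF: A pure-Python library for reading and writing PDF files. It is used here to extract text from PDF files.
LangChain: A framework for building context-aware applications with language models. It is used here to process and chain document-related tasks and to organize the extracted text.

Install them with: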

pip install pypdf langchain

To manage dependencies neatly, create a requirements.txt file that lists the same two packages (pypdf and langchain, one per line).

Then run:

pip install -r requirements.txt
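
Before moving on, it can help to confirm that both packages actually installed. The following is a minimal sketch that relies only on the standard library; the helper name installed_version is illustrative and is not one of the project files.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report on the two packages this article depends on
for name in ("pypdf", "langchain"):
    found = installed_version(name)
    print(f"{name}: {found if found else 'not installed'}")
```

If either line reports "not installed", rerun the pip command above inside the same Python environment you plan to use for the parser.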

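Step 1: Setting Up the PDF Parser (parser.py)

The core class, CustomPDFParser, uses PyPDF to extract text and metadata from each PDF page. It also includes methods to clean the text, extract image information (optional), and remove the repeated headers or footers that often appear on every page.

It supports preserving the layout format
It extracts metadata such as page number, rotation, and media box size
It can filter out pages with too little content
Text cleaning removes excess whitespace while preserving paragraph breaks

The logic that implements all of this is as follows: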
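
To make the cleaning and filtering behavior concrete, here is a small standalone sketch of the two ideas: stripping whitespace while keeping paragraph breaks, and dropping pages that are too short. The function names and the sample page texts are hypothetical, and the threshold of 10 characters is only an example value.

```python
from typing import List


def clean_text(text: str) -> str:
    """Strip each line and collapse runs of blank lines into a single blank line."""
    if not text:
        return ""
    cleaned: List[str] = []
    for line in text.split("\n"):
        line = line.strip()
        if line:
            cleaned.append(line)
        elif cleaned and cleaned[-1]:
            # Keep at most one empty line as a paragraph break
            cleaned.append("")
    return "\n".join(cleaned).strip()


def keep_substantial_pages(pages: List[str], min_text_length: int = 10) -> List[str]:
    """Clean every page, then keep only pages with enough characters."""
    cleaned_pages = [clean_text(page) for page in pages]
    return [page for page in cleaned_pages if len(page) >= min_text_length]


# Hypothetical raw page texts: one real page and one that holds only a page number
raw_pages = ["  Title  \n\n\n Body text of the page. ", "   \n 3 \n"]
kept = keep_substantial_pages(raw_pages)
print(kept)  # ['Title\n\nBody text of the page.']
```

The second page cleans down to the single character "3", so it falls under the length threshold and is dropped.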

import os
import logging
from pathlib import Path
from typing import List, Dict, Any
import pypdf
from pypdf import PdfReader
# Configure logging to show info and above messages
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class CustomPDFParser:
  def __init__(
      self,
      extract_images: bool = False,
      preserve_layout: bool = True,
      remove_headers_footers: bool = True,
      min_text_length: int = 10,
  ):
      """
      Initialize the parser with options to extract images, preserve layout, remove repeated headers/footers, and minimum text length for pages.
      Args:
          extract_images: Whether to extract image info from pages
          preserve_layout: Whether to keep layout spacing in text extraction
          remove_headers_footers: Whether to detect and remove headers/footers
          min_text_length: Minimum length of text for a page to be considered valid
      """
      self.extract_images = extract_images
      self.preserve_layout = preserve_layout
      self.remove_headers_footers = remove_headers_footers
      self.min_text_length = min_text_length
  def extract_text_from_page(self, page: pypdf.PageObject, page_num: int) -> Dict[str, Any]:
      """
      Extract text and metadata from a single PDF page.
      Args:
          page: PyPDF page object
          page_num: zero-based page number
      Returns:
          dict with keys:
              - 'text': extracted and cleaned text string,
              - 'metadata': page metadata dict,
              - 'word_count': number of words in extracted text
      """
      try:
          # Extract text, optionally preserving the layout for better formatting
          if self.preserve_layout:
              text = page.extract_text(extraction_mode="layout")
          else:
              text = page.extract_text()
          # Clean text: remove extra whitespace and normalize paragraphs
          text = self._clean_text(text)
          # Gather page metadata (page number, rotation angle, mediabox)
          metadata = {
              "page_number": page_num + 1,  # 1-based numbering
              "rotation": getattr(page, "rotation", 0),
              "mediabox": str(getattr(page, "mediabox", None)),
          }
          # Optionally, extract image info from page if requested
          if self.extract_images:
              metadata["images"] = self._extract_image_info(page)
          # Return dictionary with text and metadata for this page
          return {
              "text": text,
              "metadata": metadata,
              "word_count": len(text.split()) if text else 0
          }
      except Exception as e:
          # Log error and return empty data for problematic pages
          logger.error(f"Error extracting page {page_num}: {e}")
          return {
              "text": "",
              "metadata": {"page_number": page_num + 1, "error": str(e)},
              "word_count": 0
          }
  def _clean_text(self, text: str) -> str:
      """
      Clean and normalize extracted text, preserving paragraph breaks.
      Args:
          text: raw text extracted from PDF page
      Returns:
          cleaned text string
      """
      if not text:
          return ""
      lines = text.split('\n')
      cleaned_lines = []
      for line in lines:
          line = line.strip()  # Remove leading/trailing whitespace
          if line:
              # Non-empty line; keep it
              cleaned_lines.append(line)
          elif cleaned_lines and cleaned_lines[-1]:
              # Preserve paragraph break by keeping empty line only if previous line exists
              cleaned_lines.append("")
      cleaned_text = "\n".join(cleaned_lines)
      # Collapse runs of three or more newlines into two, leaving one blank line between paragraphs
      while '\n\n\n' in cleaned_text:
          cleaned_text = cleaned_text.replace('\n\n\n', '\n\n')
      return cleaned_text.strip()
  def _extract_image_info(self, page: pypdf.PageObject) -> List[Dict[str, Any]]:
      """
      Extract basic image metadata from page, if available.
      Args:
          page: PyPDF page object
      Returns:
          List of dictionaries with image info (index, name, width, height)
      """
      images = []
      try:
          # PyPDF pages can have an 'images' attribute listing embedded images
          if hasattr(page, 'images'):
              for i, image in enumerate(page.images):
                  images.append({
                      "image_index": i,
                      "name": getattr(image, 'name', f"image_{i}"),
                      "width": getattr(image, 'width', None),
                      "height": getattr(image, 'height', None)
                  })
      except Exception as e:
          logger.warning(f"Image extraction failed: {e}")
      return images

  def _remove_headers_footers(self, pages_data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
      """
      Remove repeated headers and footers that appear on many pages.
      This is done by identifying lines appearing on over 50% of pages
      at the start or end of page text, then removing those lines.
      Args:
          pages_data: List of dictionaries representing each page's extracted data.
      Returns:
          Updated list of pages with headers/footers removed
      """
      # Only attempt removal if enough pages and option enabled
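      if len(pages_data) < 3 or not self.remove_headers_footers:  # 3 is an assumed minimum page count
          return pages_data
      # Collect the first and last line of each page's text
      first_lines = [page["text"].split('\n')[0] if page["text"] else "" for page in pages_data]
      last_lines = [page["text"].split('\n')[-1] if page["text"] else "" for page in pages_data]
      threshold = len(pages_data) * 0.5  # a line must appear on more than 50% of pages
      # Identify candidate headers and footers that appear frequently
      potential_headers = [line for line in set(first_lines)
                          if first_lines.count(line) > threshold and line.strip()]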
      potential_footers = [line for line in set(last_lines)
                          if last_lines.count(line) > threshold and line.strip()]
      # Remove identified headers and footers from each page's text
      for page_data in pages_data:
          lines = page_data["text"].split('\n')
          # Remove header if it matches a frequent header
          if lines and potential_headers:
              for header in potential_headers:
                  if lines[0].strip() == header.strip():
                      lines = lines[1:]
                      break
          # Remove footer if it matches a frequent footer
          if lines and potential_footers:
              for footer in potential_footers:
                  if lines[-1].strip() == footer.strip():
                      lines = lines[:-1]
                      break

          page_data["text"] = '\n'.join(lines).strip()
      return pages_data
  def _extract_document_metadata(self, pdf_reader: PdfReader, pdf_path: str) -> Dict[str, Any]:
      """
      Extract metadata from the PDF document itself.
      Args:
          pdf_reader: PyPDF PdfReader instance
          pdf_path: path to PDF file
      Returns:
          Dictionary of metadata including file info and PDF document metadata
      """
      metadata = {
          "file_path": pdf_path,
          "file_name": Path(pdf_path).name,
          "file_size": os.path.getsize(pdf_path) if os.path.exists(pdf_path) else None,
      }
      try:
          if pdf_reader.metadata:
              # Extract common PDF metadata keys if available
              metadata.update({
                  "title": pdf_reader.metadata.get('/Title', ''),
                  "author": pdf_reader.metadata.get('/Author', ''),
                  "subject": pdf_reader.metadata.get('/Subject', ''),
                  "creator": pdf_reader.metadata.get('/Creator', ''),
                  "producer": pdf_reader.metadata.get('/Producer', ''),
                  "creation_date": str(pdf_reader.metadata.get('/CreationDate', '')),
                  "modification_date": str(pdf_reader.metadata.get('/ModDate', '')),
              })
      except Exception as e:
          logger.warning(f"Metadata extraction failed: {e}")
      return metadata
  def parse_pdf(self, pdf_path: str) -> Dict[str, Any]:
      """
      Parse the entire PDF file. Opens the file, extracts text and metadata page by page, removes headers/footers if configured, and aggregates results.
      Args:
          pdf_path: Path to the PDF file
      Returns:
          Dictionary with keys:
              - 'full_text': combined text from all pages,
              - 'pages': list of page-wise dicts with text and metadata,
              - 'document_metadata': file and PDF metadata,
              - 'total_pages': total pages in PDF,
              - 'processed_pages': number of pages kept after filtering,
              - 'total_words': total word count of parsed text
      """
      try:
          with open(pdf_path, 'rb') as file:
              pdf_reader = PdfReader(file)
              doc_metadata = self._extract_document_metadata(pdf_reader, pdf_path)
              pages_data = []
              # Iterate over all pages and extract data
              for i, page in enumerate(pdf_reader.pages):
                  page_data = self.extract_text_from_page(page, i)
                  # Only keep pages with sufficient text length
                  if len(page_data["text"]) >= self.min_text_length:
                      pages_data.append(page_data)
              # Remove repeated headers and footers
              pages_data = self._remove_headers_footers(pages_data)
              # Combine all page texts with a double newline as a separator
              full_text = "\n\n".join(page["text"] for page in pages_data if page["text"])
              # Return final structured data
              return {
                  "full_text": full_text,
                  "pages": pages_data,
                  "document_metadata": doc_metadata,
                  "total_pages": len(pdf_reader.pages),
                  "processed_pages": len(pages_data),
                  "total_words": sum(page["word_count"] for page in pages_data)
              }
      except Exception as e:
          logger.error(f"Failed to parse PDF {pdf_path}: {e}")
          raise
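
The header and footer removal is easier to follow in isolation. The sketch below applies the same idea (drop a first or last line that repeats on more than half of the pages) to plain strings. It is a simplified variant rather than the parser's own method: it counts with collections.Counter, and the function name and sample pages are hypothetical.

```python
from collections import Counter
from typing import List


def strip_repeated_edges(pages: List[str], ratio: float = 0.5) -> List[str]:
    """Drop a first/last line that repeats on more than `ratio` of the pages."""
    split_pages = [page.split("\n") for page in pages]
    first_counts = Counter(lines[0].strip() for lines in split_pages)
    last_counts = Counter(lines[-1].strip() for lines in split_pages)
    threshold = len(pages) * ratio

    # Non-empty lines that occur often enough are treated as headers/footers
    headers = {line for line, n in first_counts.items() if line and n > threshold}
    footers = {line for line, n in last_counts.items() if line and n > threshold}

    result = []
    for lines in split_pages:
        if lines and lines[0].strip() in headers:
            lines = lines[1:]
        if lines and lines[-1].strip() in footers:
            lines = lines[:-1]
        result.append("\n".join(lines).strip())
    return result


# Hypothetical three-page document with a shared footer and a common header
sample_pages = [
    "ACME Annual Report\nRevenue grew this year.\nConfidential",
    "ACME Annual Report\nCosts stayed flat.\nConfidential",
    "Appendix\nRaw tables follow.\nConfidential",
]
print(strip_repeated_edges(sample_pages))
# ['Revenue grew this year.', 'Costs stayed flat.', 'Appendix\nRaw tables follow.']
```

Note that this approach only looks at the very first and very last line of each page, so multi-line headers or footers with changing page numbers would need a looser match.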

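Step 2: Integrating with LangChain (langchain_loader.py)

The LangChainPDFLoader class wraps the custom parser and converts the parsed pages into LangChain Document objects, which are the building blocks of LangChain pipelines.

It lets you chunk documents into smaller pieces using LangChain's RecursiveCharacterTextSplitter
You can customize the chunk size and overlap for downstream LLM input
The loader provides a clean integration between raw PDF content and LangChain's Document abstraction

The logic for this is as follows: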
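
To build intuition for what chunk size and overlap mean, here is a deliberately simplified sketch that slices text with a fixed character window. It is not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on separators such as paragraph and line breaks); the function name chunk_with_overlap is hypothetical.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Slice text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
print(chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3))
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, which is the role overlap plays for an LLM: context that straddles a chunk boundary is visible in both neighbors.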

from typing import List, Optional, Dict, Any
from langchain.schema import Document
from langchain.document_loaders.base import BaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from parser import CustomPDFParser  # import the parser defined above
class LangChainPDFLoader(BaseLoader):
   def __init__(
       self,
       file_path: str,
       parser_config: Optional[Dict[str, Any]] = None,
       chunk_size: int = 500,
       chunk_overlap: int = 50,
   ):
       """
       Initialize the loader with the PDF file path, parser configuration, and chunking parameters.
       Args:
           file_path: path to PDF file
           parser_config: dictionary of parser options
           chunk_size: chunk size for splitting long texts
           chunk_overlap: chunk overlap for splitting
       """
       self.file_path = file_path
       self.parser_config = parser_config or {}
       self.chunk_size = chunk_size
       self.chunk_overlap = chunk_overlap
       self.parser = CustomPDFParser(**self.parser_config)
   def load(self) -> List[Document]:
       """
       Load PDF, parse pages, and convert each page to a LangChain Document.
       Returns:
           List of Document objects with page text and combined metadata.
       """
       parsed_data = self.parser.parse_pdf(self.file_path)
       documents = []
       # Convert each page dict to a LangChain Document
       for page_data in parsed_data["pages"]:
           if page_data["text"]:
               # Merge document-level and page-level metadata
               metadata = {**parsed_data["document_metadata"], **page_data["metadata"]}
               doc = Document(page_content=page_data["text"], metadata=metadata)
               documents.append(doc)
       return documents
   def load_and_split(self) -> List[Document]:
       """
       Load the PDF and split large documents into smaller chunks.
       Returns:
           List of Document objects after splitting large texts.
       """
       documents = self.load()
       # Initialize a text splitter with the desired chunk size and overlap
       text_splitter = RecursiveCharacterTextSplitter(
           chunk_size=self.chunk_size,
           chunk_overlap=self.chunk_overlap,
           separators=["\n\n", "\n", " ", ""]  # hierarchical splitting
       )
       # Split documents into smaller chunks
       split_docs = text_splitter.split_documents(documents)
       return split_docs
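
One detail of load() worth spelling out is how the document-level and page-level metadata are merged with dictionary unpacking. In the sketch below the metadata values are made up, and the shared key "source" is hypothetical; it is included only to show which side wins when both dictionaries contain the same key.

```python
from typing import Any, Dict


def merge_metadata(document_meta: Dict[str, Any], page_meta: Dict[str, Any]) -> Dict[str, Any]:
    """Combine both dicts into a new one; page-level values win on key clashes."""
    return {**document_meta, **page_meta}


# Hypothetical metadata for one page of a document
document_meta = {"file_name": "articles.pdf", "title": "Articles", "source": "document"}
page_meta = {"page_number": 2, "rotation": 0, "source": "page"}

merged = merge_metadata(document_meta, page_meta)
print(merged)
# {'file_name': 'articles.pdf', 'title': 'Articles', 'source': 'page', 'page_number': 2, 'rotation': 0}
```

Because the page dictionary is unpacked last, its values override document-level ones, and neither input dictionary is modified.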

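Step 3: Building a Processing Pipeline (pipeline.py)

The PDFProcessingPipeline class provides a high-level interface for:

Processing a single PDF
Choosing the output format (raw dict, LangChain documents, or plain text)
Enabling or disabling chunking, with a configurable chunk size
Error handling and logging

This abstraction makes it easy to integrate the parser into larger applications or workflows. The logic for this is as follows: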

from typing import List, Optional, Dict, Any
from langchain.schema import Document
from parser import CustomPDFParser
from langchain_loader import LangChainPDFLoader
import logging
logger = logging.getLogger(__name__)
class PDFProcessingPipeline:
   def __init__(self, parser_config: Optional[Dict[str, Any]] = None):
       """
       Args:
          parser_config: dictionary of options passed to CustomPDFParser
       """
       self.parser_config = parser_config or {}
   def process_single_pdf(
       self,
       pdf_path: str,
       output_format: str = "langchain",
       chunk_documents: bool = True,
       chunk_size: int = 500,
       chunk_overlap: int = 50,
   ) -> Any:
       """
       Args:
           pdf_path: path to PDF file
           output_format: "raw" (dict), "langchain" (Documents), or "text" (string)
           chunk_documents: whether to split LangChain documents into chunks
           chunk_size: chunk size for splitting
           chunk_overlap: chunk overlap for splitting
       Returns:
           Parsed content in the requested format
       """
       if output_format == "raw":
           # Use raw CustomPDFParser output
           parser = CustomPDFParser(**self.parser_config)
           return parser.parse_pdf(pdf_path)
       elif output_format == "langchain":
           # Use LangChain loader, optionally chunked
           loader = LangChainPDFLoader(pdf_path, self.parser_config, chunk_size, chunk_overlap)
           if chunk_documents:
               return loader.load_and_split()
           else:
               return loader.load()
       elif output_format == "text":
           # Return combined plain text only
           parser = CustomPDFParser(**self.parser_config)
           parsed_data = parser.parse_pdf(pdf_path)
           return parsed_data.get("full_text", "")
       else:
           raise ValueError(f"Unknown output_format: {output_format}")

Step 4: Testing the Parser (example.py)

Let's test the parser as follows:

from pathlib import Path
from pipeline import PDFProcessingPipeline  # pipeline class defined in Step 3
def main():
   print("👋 Welcome to the Custom PDF Parser!")
   print("What would you like to do?")
   print("1. View full parsed raw data")
   print("2. Extract full plain text")
   print("3. Get LangChain documents (no chunking)")
   print("4. Get LangChain documents (with chunking)")
   print("5. Show document metadata")
   print("6. Show per-page metadata")
   print("7. Show cleaned page text (header/footer removed)")
   print("8. Show extracted image metadata")
   choice = input("Enter the number of your choice: ").strip()
   if choice not in {'1', '2', '3', '4', '5', '6', '7', '8'}:
       print("❌ Invalid option.")
       return
   file_path = input("Enter the path to your PDF file: ").strip()
   if not Path(file_path).exists():
       print("❌ File not found.")
       return
   # Initialize pipeline
   pipeline = PDFProcessingPipeline({
       "preserve_layout": False,
       "remove_headers_footers": True,
       "extract_images": True,
       "min_text_length": 20
   })
   # Raw data is needed for most options
   parsed = pipeline.process_single_pdf(file_path, output_format="raw")
   if choice == '1':
       print("\nFull Raw Parsed Output:")
       for k, v in parsed.items():
           print(f"{k}: {str(v)[:300]}...")
   elif choice == '2':
       print("\nFull Cleaned Text (truncated preview):")
       print("Previewing the first 1000 characters:\n"+parsed["full_text"][:1000], "...")
   elif choice == '3':
       docs = pipeline.process_single_pdf(file_path, output_format="langchain", chunk_documents=False)
       print(f"\nLangChain Documents: {len(docs)}")
       print("Previewing the first 500 characters:\n", docs[0].page_content[:500], "...")
   elif choice == '4':
       docs = pipeline.process_single_pdf(file_path, output_format="langchain", chunk_documents=True)
       print(f"\nLangChain Chunks: {len(docs)}")
       print("Sample chunk content (first 500 chars):")
       print(docs[0].page_content[:500], "...")
   elif choice == '5':
       print("\nDocument Metadata:")
       for key, value in parsed["document_metadata"].items():
           print(f"{key}: {value}")
   elif choice == '6':
       print("\nPer-page Metadata:")
       for i, page in enumerate(parsed["pages"]):
           print(f"Page {i+1}: {page['metadata']}")
   elif choice == '7':
       print("\nCleaned Text After Header/Footer Removal.")
       print("Showing the first 3 pages and first 500 characters of the text from each page.")
       for i, page in enumerate(parsed["pages"][:3]):  # First 3 pages
           print(f"\n--- Page {i+1} ---")
           print(page["text"][:500], "...")
   elif choice == '8':
       print("\nExtracted Image Metadata (if available):")
       found = False
       for i, page in enumerate(parsed["pages"]):
           images = page["metadata"].get("images", [])
           if images:
               found = True
               print(f"\n--- Page {i+1} ---")
               for img in images:
                   print(img)
       if not found:
           print("No image metadata found.")
if __name__ == "__main__":
   main()

When you run this, you will be prompted to enter a choice number and the path to a PDF. Enter both. The PDF I am using is publicly accessible, and you can download it using the link.

👋 Welcome to the Custom PDF Parser!
What would you like to do?
1. View full parsed raw data
2. Extract full plain text
3. Get LangChain documents (no chunking)
4. Get LangChain documents (with chunking)
5. Show document metadata
6. Show per-page metadata
7. Show cleaned page text (header/footer removed)
8. Show extracted image metadata
Enter the number of your choice: 4
Enter the path to your PDF file: /content/articles.pdf

Output:
LangChain Chunks: 16
First chunk preview:
San José State University Writing Center
www.sjsu.edu/writingcenter
Written by Ben Aldridge

Articles (a/an/the), Spring 2014.                                                                                   1 of 4
Articles (a/an/the)

There are three articles in the English language: a, an, and the. They are placed before nouns
and show whether a given noun is general or specific.

Examples of Articles

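Conclusion

In this guide, you learned how to build a flexible and powerful PDF processing pipeline using only open-source tools. Because it is modular, you can easily extend it: add a search bar with Streamlit, store the chunks in a vector database such as FAISS for smarter lookups, or connect it to a chatbot. You do not need to rebuild anything; you just connect the next piece. PDFs no longer need to feel like locked boxes. With this approach, you can turn any document into something you can read, search, and understand on your own terms.

Kanwal Mehreen is a machine learning engineer and technical writer with a deep passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, a Mitacs Globalink Research Scholar, and a Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.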
