
News Scraping with Trafilatura

catalystmind 2025. 5. 30. 21:03


TL;DR

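Trafilatura: a Python library that automatically extracts only the main body text from web pages

✅ Key advantages

Strips ads and other unwanted content from a web page and extracts only the body text, which addresses the problem that every site structures its article body differently

📊 Performance evaluation results

In a test of 500 URLs: 81.4% success rate, 0.52 seconds average processing time

⚠️ Key limitations

Body extraction fails on pages that need JavaScript rendering or have Korean encoding errors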

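A turning point in news automation, starting with Trafilatura

News data is core information used for many purposes, including investing, public opinion analysis, and information discovery. Working investors who are short on time in particular need automation to read the news and market trends quickly, but automatically collecting and cleaning articles poses many difficulties.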

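⚠️ Limits of the previous approach: Power Automate + GPT

Previously, I built an automated workflow that collected news articles with Microsoft Power Automate and cleaned the body text with GPT. Extracting text from the article's website worked, but I soon ran into the following problems.

1. Cleaning cost: Along with the body text, unnecessary material such as ads, comments, and related-article links was collected from each site, so many of the tokens spent in GPT post-processing were wasted. Unnecessary text translates directly into cost.

2. Maintenance: Every news site has a different HTML structure, so I had to repeat the work of locating the body text for each one, and even a small change to a site broke the extraction method.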

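Trafilatura: a Python tool that extracts only the main content from the web

Trafilatura is a Python library that pulls out only the main content of a web page. It specializes in extracting body text from news sites and blogs, which makes it the most practical way to address the problems described above.

Key advantages of Trafilatura

1. Automatic removal of unwanted elements: It filters out non-content elements such as ads, navigation bars, comments, and recommended articles, leaving only the body text. Data cleaned this way is well suited to being passed to a language model such as GPT.

2. Handling of diverse site structures: It adapts automatically to the varied HTML structures of news sites, blogs, and similar pages. Body text can be extracted from most websites without defining site-specific rules, which greatly reduces the maintenance burden.

Trafilatura is a key tool for reducing the cost of cleaning news data and simplifying maintenance.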

Basic usage of Trafilatura

You can get started with a few lines of code like the following.

import trafilatura
from trafilatura.metadata import extract_metadata

# Target URL
url = "https://www.yna.co.kr/view/AKR20250511050400001?section=election2025/news&site=topnews01"

# Download the HTML from the URL
downloaded = trafilatura.fetch_url(url)

# Extract the metadata and the body text
metadata = extract_metadata(downloaded)
text = trafilatura.extract(downloaded, output_format='txt', include_comments=False, favor_precision=True)

# Print the results
print(f"📰 Title: {metadata.title}")
print(f"📅 Date: {metadata.date}")
print(f"📝 Body:\n{text}")

📰 Title: 21대 대선에 7명 후보 등록…이재명 1번·김문수 2번·이준석 4번 | 연합뉴스
📅 Date: 2025-05-11
📝 Body:
제21대 대통령 선거에 총 7명의 후보가 등록한 것으로 11일 집계됐다.
이준석 후보, 구주와 후보, 송진호 후보는 군 복무를 마쳤다고 신고했다.
후보자 기호는 1번 더불어민주당 이재명, 2번 국민의힘 김문수, 4번 개혁신당 이준석, 5번 민주노동당 권영국, 6번 자유통일당 구주와, 7번 무소속 황교안, 8번 무소속 송진호 후보로 결정됐다.
전체 내용을 이...
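The snippet above assumes that the download succeeds. Since `trafilatura.fetch_url` returns `None` when a page cannot be fetched, a guarded variant may be safer inside a pipeline. The following is a minimal sketch, not the code used in this post: `fetch_and_extract` and its `fetch` / `extract` parameters are hypothetical names introduced here so the logic can be exercised without network access. When they are omitted, the function falls back to `trafilatura.fetch_url` and `trafilatura.extract`.

```python
from typing import Callable, Optional


def fetch_and_extract(
    url: str,
    fetch: Optional[Callable[[str], Optional[str]]] = None,
    extract: Optional[Callable[[str], Optional[str]]] = None,
) -> dict:
    """Download a page and extract its main text, reporting why it failed."""
    if fetch is None or extract is None:
        # Imported lazily so the function can be tested with stand-ins.
        import trafilatura

        fetch = fetch or trafilatura.fetch_url
        extract = extract or (
            lambda html: trafilatura.extract(
                html, include_comments=False, favor_precision=True
            )
        )

    html = fetch(url)
    if not html:
        # The download returned nothing, so there is nothing to parse.
        return {"url": url, "status": "failed", "text": None,
                "error": "Failed to download page"}

    text = extract(html)
    if not text or not text.strip():
        # The page was downloaded but no body text was found.
        return {"url": url, "status": "failed", "text": None,
                "error": "Empty content extracted"}

    return {"url": url, "status": "success", "text": text, "error": None}
```

Calling `fetch_and_extract(url)` with only a URL uses Trafilatura for both steps; passing stand-in functions makes it possible to check the two failure paths offline.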

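Evaluating Trafilatura's performance

To evaluate Trafilatura's performance, I wrote a script that loads URLs from a file containing 500 original article URLs from a variety of sources and measures processing speed and success rate.

📄 URL Batch Processor (Trafilatura)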


import pandas as pd
import trafilatura
from trafilatura.metadata import extract_metadata
import time
from datetime import datetime
import csv


def process_single_url(url, index, total_urls):
    """Process a single URL."""
    start_time = time.time()
    result = {
        "index": index,
        "url": url,
        "status": "failed",
        "title": None,
        "date": None,
        "content": None,
        "processing_time": 0,
        "error_message": None,
    }

    try:
        print(f"Processing URL {index + 1}/{total_urls}: {url[:60]}...")

        # Download the page
        downloaded = trafilatura.fetch_url(url)

        if downloaded:
            # Extract the metadata
            metadata = extract_metadata(downloaded)

            # Extract the body text
            text = trafilatura.extract(
                downloaded,
                output_format="txt",
                include_comments=False,
                favor_precision=True,
            )

            if text and len(text.strip()) > 0:
                result.update(
                    {
                        "status": "success",
                        "title": metadata.title if metadata else "No title",
                        "date": str(metadata.date) if metadata and metadata.date else "No date",
                        "content": text[:200] + "..." if len(text) > 200 else text,
                    }
                )
            else:
                result["error_message"] = "Empty content extracted"
        else:
            result["error_message"] = "Failed to download page"

    except Exception as e:
        result["error_message"] = str(e)

    # Compute the processing time
    processing_time = time.time() - start_time
    result["processing_time"] = processing_time

    return result


def save_results_to_csv(results, output_file=None):
    """Save the results to a CSV file."""

    # If no file name is given, build one that includes the current time
    if output_file is None:
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        output_file = f"url_processing_results_{timestamp}.csv"

    try:
        with open(output_file, "w", newline="", encoding="utf-8") as csvfile:
            fieldnames = [
                "index",
                "url",
                "status",
                "title",
                "date",
                "content",
                "processing_time",
                "error_message",
            ]
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

            writer.writeheader()
            for result in results:
                writer.writerow(result)

        print(f"✅ Results saved to '{output_file}'.")
        return output_file  # return the file name actually written

    except Exception as e:
        print(f"❌ Failed to save results: {e}")
        return None


def print_summary(results, total_time, skipped_count=0):
    """Print a summary of the processing results."""
    total_urls = len(results)
    successful_count = sum(1 for r in results if r["status"] == "success")
    failed_count = total_urls - successful_count
    processing_times = [r["processing_time"] for r in results]
    # Guard the percentage denominators so an empty input does not raise ZeroDivisionError
    processed_denom = max(total_urls, 1)
    overall_denom = max(total_urls + skipped_count, 1)

    print("\n" + "=" * 7 + " SUMMARY " + "=" * 7)
    print(f"Total URLs processed: {total_urls}")
    print("Workers used: 1")  # single-threaded processing
    print(f"Successfully decoded: {successful_count} ({successful_count/processed_denom*100:.1f}%)")
    print(f"Failed to decode: {failed_count} ({failed_count/processed_denom*100:.1f}%)")
    print(f"Skipped (Google News URLs): {skipped_count} ({skipped_count/overall_denom*100:.1f}%)")

    if processing_times:
        avg_time = sum(processing_times) / len(processing_times)

        print("\n" + "-" * 5 + " TIMING INFORMATION " + "-" * 5)
        print(f"Total processing time: {int(total_time//60)}:{total_time%60:05.2f}")
        print(f"Average processing time per URL: {avg_time:.2f} seconds")
        print(f"Fastest URL processing time: {min(processing_times):.2f} seconds")
        print(f"Slowest URL processing time: {max(processing_times):.2f} seconds")

    print("\nProcess completed successfully. Results saved to CSV file.")


def process_urls_from_csv(csv_file_path, url_column="decoded_url"):
    """Read URLs from a CSV file and process them sequentially."""

    print("=" * 50)
    print("🚀 Starting URL batch processing")
    print("=" * 50)

    # Read the CSV file
    try:
        df = pd.read_csv(csv_file_path)
        print(f"📋 CSV columns: {list(df.columns)}")
        print(f"📋 Total rows: {len(df)}")

        if url_column not in df.columns:
            raise ValueError(f"Column '{url_column}' not found in CSV file")

        print(f"📋 Non-null values in column '{url_column}': {df[url_column].notna().sum()}")

        all_urls = df[url_column].dropna().tolist()
        print(f"📋 First URL sample: {all_urls[0] if all_urls else 'None'}")

        # Keep only URLs that do not contain news.google.com
        urls = [url for url in all_urls if "news.google.com" not in str(url)]

        total_urls = len(urls)
        skipped_urls = len(all_urls) - total_urls

        print(f"📊 All URLs: {len(all_urls)}")
        print(f"📊 URLs to process (decoded URLs): {total_urls}")
        print(f"📊 Skipped URLs (Google News URLs): {skipped_urls}")

        if urls:
            print(f"📋 First decoded URL sample: {urls[0]}")

        print("-" * 30)

    except Exception as e:
        print(f"❌ Failed to read CSV file: {e}")
        # Return a pair so the caller's two-value unpacking does not fail
        return None, None

    total_start_time = time.time()
    results = []
    successful_count = 0
    failed_count = 0

    for i, url in enumerate(urls):
        result = process_single_url(url, i, total_urls)
        results.append(result)

        if result["status"] == "success":
            successful_count += 1
        else:
            failed_count += 1

        if (i + 1) % 10 == 0 or (i + 1) == total_urls:
            print(
                f"Progress: {i + 1}/{total_urls} "
                f"({(i + 1)/total_urls*100:.1f}%) "
                f"success: {successful_count}, failed: {failed_count}"
            )

    total_processing_time = time.time() - total_start_time
    print_summary(results, total_processing_time, skipped_urls)

    saved_file = save_results_to_csv(results)
    return results, saved_file


def main():
    csv_file_path = r"C:\Users\yhsur\Downloads\특징주\sample_data\Combined_sample_data_500_decoded_2025-05-20_224740.csv"
    results, saved_file = process_urls_from_csv(csv_file_path, url_column="decoded_url")

    if results:
        print(f"\n📁 Saved file: {saved_file}")

        print("\n🔍 Sample results:")
        for i, result in enumerate(results[:3]):
            print(f"\n[{i+1}] {result['url'][:50]}...")
            print(f"    Status: {result['status']}")
            print(f"    Title: {result['title']}")
            print(f"    Processing time: {result['processing_time']:.2f}s")
            if result["status"] == "failed":
                print(f"    Error: {result['error_message']}")


if __name__ == "__main__":
    main()

🖨️ Output

Progress: 500/500 (100.0%) success: 407, failed: 93

======= SUMMARY =======
Total URLs processed: 500
Workers used: 1
Successfully decoded: 407 (81.4%)
Failed to decode: 93 (18.6%)
Total errors: 93 (18.6%)
Skipped (Google News URLs): 0 (0.0%)

----- TIMING INFORMATION -----
Total processing time: 4:20.64
Average processing time per URL: 0.52 seconds
Average processing time per decoded URL: 0.52 seconds
Fastest URL processing time: 0.05 seconds
Slowest URL processing time: 5.26 seconds

Process completed successfully. Results saved to CSV file.
✅ Results saved to 'url_processing_results_20250525_131353.csv'.

📁 Saved file: url_processing_results_20250525_131353.csv

📊 Performance evaluation and results

I used Claude to analyze and visualize the evaluation results.

📋 Analysis summary

📊 Summary

Total records: 500
Success rate: 81.4%
Mean processing time: 0.521 s
Median processing time: 0.314 s
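The success rate, mean, and median above can be recomputed directly from the result rows. Below is a minimal sketch, assuming each row carries the `status` and `processing_time` fields written by the batch script. `summarize` is a hypothetical helper, and the sample rows are toy data, not the measurements from this test.

```python
import statistics


def summarize(rows):
    """Compute success rate, mean and median processing time from result rows."""
    total = len(rows)
    if total == 0:
        return {"total": 0, "success_rate": 0.0, "mean_time": 0.0, "median_time": 0.0}
    # Values read back from a CSV are strings, so convert them to float first.
    times = [float(r["processing_time"]) for r in rows]
    successes = sum(1 for r in rows if r["status"] == "success")
    return {
        "total": total,
        "success_rate": successes / total * 100,
        "mean_time": statistics.mean(times),
        "median_time": statistics.median(times),
    }


# Toy data for illustration only; not the measurements from this post.
sample = [
    {"status": "success", "processing_time": "0.2"},
    {"status": "success", "processing_time": "0.3"},
    {"status": "failed", "processing_time": "0.1"},
    {"status": "success", "processing_time": "2.0"},
]
print(summarize(sample))
```

Rows loaded with `csv.DictReader` from the saved results file should work as input. A mean well above the median, as with 0.521 s versus 0.314 s here, usually means that a few slow pages are pulling the average up.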

[Figure: Processing time distribution. X-axis: processing time (seconds), in bins 0.0-0.5, 0.5-1.0, 1.0-1.5, 1.5-2.0, 2.0-2.5, and 2.5+]
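The bin counts behind a chart like this can be rebuilt from the recorded processing times. The sketch below assumes half-second bins that include their lower edge and a final open-ended bin starting at 2.5 seconds; how the original chart treated boundary values is not known, and `bin_label` and `histogram` are hypothetical helper names.

```python
from collections import Counter

BIN_LABELS = ["0.0-0.5", "0.5-1.0", "1.0-1.5", "1.5-2.0", "2.0-2.5", "2.5+"]


def bin_label(seconds, width=0.5, last_edge=2.5):
    """Map a processing time to one of the half-second bins."""
    if seconds >= last_edge:
        return BIN_LABELS[-1]
    # Floor division picks the bin index; the lower edge is inclusive.
    return BIN_LABELS[int(seconds // width)]


def histogram(times):
    """Count how many processing times fall into each bin, in label order."""
    counts = Counter(bin_label(t) for t in times)
    return {label: counts.get(label, 0) for label in BIN_LABELS}


# Toy values for illustration only.
print(histogram([0.05, 0.31, 0.5, 1.2, 2.4, 5.26]))
```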

Total processed: 500
Failures: 93
Failure rate: 18.6%
Main problem domains: 2

Main issue: 96.8% of all failures (90 of 93) occurred on biz.chosun.com and www.msn.com

📊 Overall processing status

Success: 407 (81.4%)
Failed: 93 (18.6%)
Unknown: 0

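🚨 Detailed analysis of the main problem domains

1. biz.chosun.com

Total attempts: 62
Failures: 62
Success rate: 0%
Share of all failures: 66.7%

Main error:

Empty content extracted: content extraction failed on every attempt

Likely cause: JavaScript-based dynamic loading or access restrictions (Cloudflare, bot detection)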

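2. www.msn.com

Total attempts: 28
Failures: 28
Success rate: 0%
Share of all failures: 30.1%

Main errors:

Empty content extracted: content extraction failed on every attempt

Failed to download page: the page download failed for some requests

Likely cause: strong bot detection on Microsoft services or regional access restrictions

Other failures: the remaining 3, presumably caused by transient network failures or page structure changes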
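A per-domain breakdown like the one above can be derived from the result rows by grouping failed URLs by host name. This is a minimal sketch under the assumption that each row has the `url` and `status` fields produced by the batch script. `failures_by_domain` is a hypothetical helper, and the example hosts are placeholders.

```python
from collections import Counter
from urllib.parse import urlparse


def failures_by_domain(rows):
    """Count failed rows per host name and each host's share of all failures."""
    failed = [r for r in rows if r["status"] != "success"]
    counts = Counter(urlparse(r["url"]).netloc for r in failed)
    total_failed = len(failed)
    # most_common() sorts hosts from the most to the fewest failures.
    return [
        (domain, n, n / total_failed * 100)
        for domain, n in counts.most_common()
    ]


# Placeholder rows for illustration only.
rows = [
    {"url": "https://a.example.com/1", "status": "failed"},
    {"url": "https://a.example.com/2", "status": "failed"},
    {"url": "https://b.example.com/1", "status": "failed"},
    {"url": "https://b.example.com/2", "status": "success"},
]
for domain, n, share in failures_by_domain(rows):
    print(f"{domain}: {n} failures ({share:.1f}% of all failures)")
```

With the counts reported in this test, the same arithmetic gives 62 / 93 ≈ 66.7% for biz.chosun.com and 28 / 93 ≈ 30.1% for www.msn.com.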

🔍 Error pattern analysis

Distribution by error type

Empty content extracted: 87 (93.5%). Content extraction failed: JavaScript rendering required or access blocked
Failed to download page: 6 (6.5%). Page download failed: network error or no server response

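🔤 Encoding and character handling issues

Finding: some of the successfully extracted records contain special-character and encoding problems

Main encoding issues:

Garbled Korean text (UTF-8 vs. EUC-KR encoding conflict)
Special character ( ) displayed: an encoding problem in the source site
Unconverted HTML entities (&amp;, &lt;, &gt;, etc.)
Mixed line endings (\r\n and \n)

Recommended fixes: automatic encoding detection with the chardet library, encoding normalization before HTML parsing, and cleanup of special characters in post-processing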
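The normalization steps listed above can be sketched with the standard library alone. Note that this swaps the recommended chardet detection for a simpler technique: trying a fixed list of candidate encodings in order (UTF-8 first, then `cp949`, Python's codec for the Windows superset of EUC-KR). `decode_html` and `clean_text` are hypothetical helpers, and the candidate list is an assumption that fits Korean news pages rather than a general solution.

```python
import html


def decode_html(raw: bytes, encodings=("utf-8", "cp949")) -> str:
    """Decode page bytes, trying each candidate encoding in order."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going and mark undecodable bytes with U+FFFD.
    return raw.decode(encodings[0], errors="replace")


def clean_text(text: str) -> str:
    """Unescape HTML entities and normalize all line endings to LF."""
    text = html.unescape(text)
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Bytes encoded as EUC-KR are not valid UTF-8, so the fallback is used.
raw = "한글 테스트".encode("euc-kr")
print(decode_html(raw))
print(repr(clean_text("A &amp; B\r\nC")))
```

The decoded string could then be passed to `trafilatura.extract` instead of letting the download step guess the encoding, although whether that resolves the garbled pages seen here would need to be verified on the affected sites.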

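Conclusion

Trafilatura is a tool specialized for extracting articles from static HTML. According to the analysis done with Claude, the 500-URL test showed an average processing time of 0.521 seconds and a success rate of 81.4%, which is fast, and most of the sites where extraction failed were JavaScript-based websites or pages with Korean encoding problems. Trafilatura is optimized for static page structures and has limits when handling dynamic content. The next post will cover strategies for dealing with pages that have Korean encoding problems and with JavaScript-based websites.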