LangChain RAG Practice (Crawling Naver News Articles)


miimu 2025. 6. 4. 17:08

A hands-on exercise: crawl Naver news articles, then build RAG over them with LangChain.

Everything runs on Google Colab.

1. Data crawling

Reference: "Crawling Naver news with the Python open API (1)" (wingyu-story.tistory.com)

According to that post, its source code is taken from IT CookBook, 데이터 과학 기반의 파이썬 빅데이터 분석 (이지영), and the Naver API documentation.

1. Mount Google Drive

from google.colab import drive
drive.mount('/content/drive')

2. Libraries

import os
import sys
import urllib.request
import urllib.parse
import datetime
import time
import json
import html

3. Save the Naver API key as a JSON file and load it

# Load the Naver API key
with open("./drive/MyDrive/실습/RAG/api_key.json", 'r') as j :
    json_key = json.load(j)

client_id = json_key['client_id']
client_secret = json_key['client_secret']
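The post does not show the key file itself. A minimal sketch of the shape the loading code above expects, assuming a flat JSON object with the two keys it reads (the placeholder values and the temporary path are hypothetical, used only so the snippet runs anywhere):

```python
import json
import os
import tempfile

# Placeholder credentials; the real values come from the Naver developer console.
placeholder = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
}

# Write the file to a temporary directory instead of Google Drive for this sketch.
key_path = os.path.join(tempfile.mkdtemp(), "api_key.json")
with open(key_path, "w", encoding="utf-8") as f:
    json.dump(placeholder, f, indent=4)

# Load it back the same way the crawler does.
with open(key_path, "r", encoding="utf-8") as f:
    json_key = json.load(f)

client_id = json_key["client_id"]
client_secret = json_key["client_secret"]
print(sorted(json_key))
```

Keeping the key in a separate file like this means the notebook can be shared without exposing the credentials.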

4. News crawling code

def getRequestUrl(url) :
    req = urllib.request.Request(url)
    req.add_header("X-Naver-Client-Id", client_id)
    req.add_header("X-Naver-Client-Secret", client_secret)

    try :
        response = urllib.request.urlopen(req)
        if response.getcode() == 200 :
            print("[%s] Url Request Success" % datetime.datetime.now())
            return response.read().decode('utf-8')
    except Exception as e :
        print(e)
        print("[%s] Error for URL : %s" %(datetime.datetime.now(), url))
        return None

def getNaverSearch(node, srcText, start, display) :
    base = "https://openapi.naver.com/v1/search/"
    node = "%s.json" % node
    parameters = "?query=%s&start=%s&display=%s" %(urllib.parse.quote(srcText), start, display)

    url = base + node + parameters
    responseDecode = getRequestUrl(url)

    if responseDecode is None :
        return None
    else :
        return json.loads(responseDecode)

def clean_text(text):
    # Unicode escapes -> actual characters (disabled)
    # text = text.encode('utf-8').decode('unicode_escape')
    # Decode HTML entities (&amp; etc.)
    text = html.unescape(text)
    # Strip simple HTML tags (<b> etc.)
    import re
    text = re.sub(r'<[^>]+>', '', text)

    return text

def getPostData(post, jsonResult, cnt) :
    title = clean_text(post['title'])
    description = clean_text(post['description'])
    org_link = post['originallink']
    link = post['link']

    pDate = datetime.datetime.strptime(post['pubDate'], '%a, %d %b %Y %H:%M:%S +0900')
    pDate = pDate.strftime('%Y-%m-%d %H:%M:%S')

    jsonResult.append({'cnt' : cnt, 'title': title, 'description' : description,
                       'org_link' : org_link, 'link' : link, 'pDate' : pDate})

    return

def getNews(src) :
    node = 'news' # search target
    srcText = src
    cnt = 0
    jsonResult = []

    jsonResponse = getNaverSearch(node, srcText, 1, 100)
    if jsonResponse is None :
        # The first request failed, so there is nothing to save
        return
    total = jsonResponse['total']

    # Keep requesting the next page until a request fails or returns no items
    while ((jsonResponse is not None) and (jsonResponse['display'] != 0)) :
        for post in jsonResponse['items'] :
            cnt += 1
            getPostData(post, jsonResult, cnt)

        start = jsonResponse['start'] + jsonResponse['display']
        jsonResponse = getNaverSearch(node, srcText, start, 100)

    print("Total search results : %d" %total)

    with open('./drive/MyDrive/실습/RAG/data/%s_naver_%s.json' %(srcText, node), 'w', encoding='utf-8') as output :
        jsonFile = json.dumps(jsonResult, indent=4, sort_keys=True, ensure_ascii=False)

        output.write(jsonFile)

    print("Items collected : %d" %(cnt))
    print("%s_naver_%s.json SAVED" %(srcText, node))
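As a quick check of the per-item cleanup done in clean_text and getPostData, here is a minimal standalone sketch. The sample item is made up, and parse_pub_date is a hypothetical variant that parses the offset with %z instead of the hard-coded +0900, so treat it as an illustration rather than the code used above.

```python
import html
import re
from datetime import datetime


def clean_text(text):
    # Decode HTML entities first, then strip simple tags such as <b>.
    text = html.unescape(text)
    return re.sub(r'<[^>]+>', '', text)


def parse_pub_date(pub_date):
    # Hypothetical variant: %z accepts any UTC offset, not only +0900.
    parsed = datetime.strptime(pub_date, '%a, %d %b %Y %H:%M:%S %z')
    return parsed.strftime('%Y-%m-%d %H:%M:%S')


# Made-up item in the shape the search API returns.
sample = {
    'title': '<b>LCK</b> &quot;finals&quot; preview',
    'pubDate': 'Wed, 04 Jun 2025 17:08:00 +0900',
}

cleaned_title = clean_text(sample['title'])
formatted_date = parse_pub_date(sample['pubDate'])
print(cleaned_title)
print(formatted_date)
```

Unescaping before stripping tags means that escaped tags such as &lt;b&gt; are also removed.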

5. Run the crawler

getNews('lck')
getNews('살인')
getNews('금리')
...
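Each getNews call pages through the results 100 at a time and stops when a request fails or comes back empty. A rough sketch of which start offsets that produces, assuming the limits I remember from the Naver search API documentation (display up to 100, start up to 1000); page_starts is a hypothetical helper and the limits should be checked against the current docs:

```python
def page_starts(total, display=100, max_start=1000):
    """Return the 'start' values needed to page through `total` results.

    Assumption: the API rejects start values above `max_start`,
    so at most max_start results are reachable per query.
    """
    last = min(total, max_start)
    return list(range(1, last + 1, display))


print(page_starts(250))
print(len(page_starts(5000)))
```

Under these assumptions a single keyword yields at most 1000 articles, no matter how large the reported total is.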

2. LangChain

References:

https://day-to-day.tistory.com/76
Basic RAG explained with practice code (feat. llamaIndex, langchain)

https://creboring.net/blog/how-to-use-jq/
[Linux] Processing JSON data in style with jq

https://python.langchain.com/api_reference/
LangChain documentation (API reference)

https://cheatsheet.md/ko/langchain-tutorials/langchain-load-json
How to load JSON files in Langchain: a step-by-step guide

1. Install packages

!pip install -qU "langchain[openai]"
!pip install -qU langchain-openai
!pip install -qU langchain-core
!pip install -qU langgraph
!pip install -qU langchain_community
!pip install -qU jq
!pip install -qU langchain_chroma
!pip install -qU langchain_experimental
!pip install -qU langchain

2. Create an API key at OpenAI and save it to a JSON file, then load the key

import os
import json

# Load the OpenAI API key
with open("./drive/MyDrive/실습/RAG/openai_api_key.json", 'r') as j :
    openai_key = json.load(j)

os.environ["OPENAI_API_KEY"] = openai_key['OPENAI_API_KEY']

3. Store the JSON data paths

data_path = './drive/MyDrive/실습/RAG/data/'
json_paths = [data_path + json_file for json_file in os.listdir(data_path)]

4. Store the JSON data in ChromaDB ... In hindsight, saving each document as "Title: / Date: / Article:" would probably have made the cited document content come out better.

from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import JSONLoader
from langchain_chroma import Chroma

# Embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# ChromaDB path
DB_PATH = "./drive/MyDrive/실습/RAG/db"

# JSON file paths
data_path = './drive/MyDrive/실습/RAG/data/'
json_paths = [data_path + json_file for json_file in os.listdir(data_path)]

# Open (or create) the persistent collection once
db = Chroma(
    collection_name="2025_news",
    embedding_function=embeddings,
    persist_directory=DB_PATH
)

# Load the documents from each file and add them to the DB
for json_path in json_paths :
    loader = JSONLoader(
        file_path=json_path,
        jq_schema=".[] | .pDate + \" / \" + .title + \" / \" + .description",
        text_content=False
    )

    docs = loader.load()

    # The collection already knows its embedding function and directory,
    # so add_documents only needs the documents
    db.add_documents(documents=docs)
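The note in the step heading suggests storing each article with explicit labels. A minimal sketch of what that text could look like, using a hypothetical format_article helper and a made-up record in the shape the crawler saves (this is plain Python, not something the code above actually does):

```python
def format_article(record):
    """Build one labelled text block per article (hypothetical layout)."""
    return (
        f"Title: {record['title']}\n"
        f"Date: {record['pDate']}\n"
        f"Article: {record['description']}"
    )


# Made-up record with the keys written by getPostData.
record = {
    "title": "Sample headline",
    "pDate": "2025-06-04 17:08:00",
    "description": "Short summary of the article.",
}

labelled = format_article(record)
print(labelled)

# Assumption: the same layout could also be expressed directly in jq_schema, e.g.
# ".[] | \"Title: \" + .title + \"\\nDate: \" + .pDate + \"\\nArticle: \" + .description"
```

With labels in the stored text, the model can tell the date from the title when it quotes a document, which is presumably why the cited content would read better.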

5. Set up the retriever (use the same embedding model as when the DB was stored)

# Retriever
persist_db = Chroma(
    persist_directory=DB_PATH,
    embedding_function=embeddings,
    collection_name="2025_news"
)

retriever = persist_db.as_retriever()

6. Configure the LLM and the prompt template

from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain.chat_models import init_chat_model

llm = init_chat_model('gpt-4o-mini', model_provider='openai')

def format_docs(docs) :
    return "\n\n".join(doc.page_content for doc in docs)

template = """
Act as an assistant for a question-answering task. Use the retrieved context to answer the question.
If you do not know the answer, say that you do not know.
Answer concisely in at most three sentences, and cite the documents you used.

Question : {question}

Context : {context}

Answer :
"""

prompt = ChatPromptTemplate.from_template(template)

rag_chain = (
    {"context" : retriever | format_docs, "question" : RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
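The chain above passes the retrieved documents through format_docs and fills them into the template together with the question. A minimal sketch of that data flow with plain Python stand-ins: FakeDoc, the sample texts, and the shortened template are all hypothetical, and no LangChain or OpenAI call is made here.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    # Stand-in for a LangChain Document; only page_content is needed here.
    page_content: str


def format_docs(docs):
    # Same joining logic as in the chain: one blank line between documents.
    return "\n\n".join(doc.page_content for doc in docs)


# Shortened stand-in for the prompt template.
template = "Question : {question}\n\nContext : {context}\n\nAnswer :"

docs = [
    FakeDoc("2025-06-04 10:00:00 / Title A / Summary A"),
    FakeDoc("2025-06-04 11:00:00 / Title B / Summary B"),
]

context = format_docs(docs)
filled = template.format(question="What happened on June 4?", context=context)
print(filled)
```

With the real chain, calling rag_chain.invoke() with the question string runs the same steps, sends the filled prompt to the model, and returns the answer as a string.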

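Example questions and answers

If I had stored the URL together with each document in the DB, I could probably follow the link and check the source document directly.

This was my first time practicing RAG, so I will fix that part next time.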
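One possible way to keep the URL, sketched under assumptions: JSONLoader accepts a metadata_func that receives each record and the default metadata, so a small function could copy the link over. keep_link is a hypothetical name, the record is made up, and the loader call is left as a comment because it was not run for this post.

```python
def keep_link(record, metadata):
    """Candidate metadata_func: copy the article URL and date into metadata."""
    metadata["link"] = record.get("link", "")
    metadata["pDate"] = record.get("pDate", "")
    return metadata


# Made-up record in the shape the crawler saves.
record = {
    "title": "Sample headline",
    "description": "Short summary of the article.",
    "link": "https://example.com/news/1",
    "pDate": "2025-06-04 17:08:00",
}

# Stand-in for the default metadata the loader passes in.
metadata = keep_link(record, {"source": "sample.json", "seq_num": 1})
print(metadata["link"])

# Assumed usage (not run here):
# loader = JSONLoader(
#     file_path=json_path,
#     jq_schema=".[]",
#     content_key="description",
#     metadata_func=keep_link,
# )
```

Each retrieved document would then carry its link in doc.metadata, so an answer could list the URLs of the articles it cites.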
