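[Crawling] Crawling menus and ingredients

Crawling

Now that I've looked into robots.txt, this post covers the actual crawling.

The site I'm using is '만개의 레시피' (10000recipe), a community site where users post their own recipes and ingredients and talk with other people.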

www.10000recipe.com/
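
As a rough sketch of the robots.txt check mentioned above (not code from this project), Python's standard urllib.robotparser can tell you whether a path is allowed. The rules below are made-up sample lines, not the site's real robots.txt, and the user agent name "my-crawler" is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Made-up sample rules for illustration; NOT the real robots.txt of the site.
SAMPLE_ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /admin/",
]


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules from memory instead of downloading them
    return parser.can_fetch(user_agent, url)


# "my-crawler" is a hypothetical user agent name.
print(is_allowed(SAMPLE_ROBOTS_LINES, "my-crawler", "http://www.10000recipe.com/recipe/1"))
```

To check against the live file instead, RobotFileParser also has set_url() and read(), which download and parse the real robots.txt.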

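BeautifulSoup has a huge number of functions, but I focused on studying and applying only the crawling and parsing functions I actually needed.

To get the information I needed (recipe title, ingredients, and URL), I checked the HTML of the '만개의 레시피' site in the browser inspector. There I found the tags for the menu and the ingredients, parsed them out of each crawled page, and saved them to the DB.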

import requests
from bs4 import BeautifulSoup

baseUrl = 'http://www.10000recipe.com/recipe/'

def CrawlingBetweenRanges(mydb, startRecipeId, endRecipeId):
    for i in range(startRecipeId, endRecipeId):
        if i % 10 == 0:
            print("count: " + str(i))
        res = PageCrawler(str(i))
        if res is None:
            continue

        menuId = mydb.insert_menu(res[0][0], baseUrl+str(i))
        for key, value in res[1].items():
            # Only the "[재료]" (ingredients) and "[양념]" (seasoning) sections are stored
            if key not in ("[재료]", "[양념]"):
                continue
            for name in value:
                mydb.insert_ingredient(menuId, name)

def PageCrawler(recipeUrl):
    url = baseUrl + recipeUrl

    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    recipe_title = []  # recipe title
    recipe_source = {}  # recipe ingredients
    # recipe_step = []  # recipe steps

    try:
        res = soup.find('div', 'view2_summary')
        res = res.find('h3')
        recipe_title.append(res.get_text())
        res = soup.find('div', 'view2_summary_info')
        recipe_title.append(res.get_text().replace('\n', ''))
        res = soup.find('div', 'ready_ingre3')
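    except AttributeError:
        # The page does not have the expected recipe layout
        return None

    # Loop that collects ingredients. Some recipes do not follow the usual format, so wrap it in try/except
    try:
        for n in res.find_all('ul'):
            source = []
            title = n.find('b').get_text()  # section heading such as "[재료]"
            for tmp in n.find_all('li'):
                # Drop newlines and normalize non-breaking spaces to regular spaces
                tempSource = tmp.get_text().replace('\n', '').replace('\xa0', ' ')
                # Keep only the text before the first run of four spaces
                source.append(tempSource.split("    ")[0])

            recipe_source[title] = source
    except AttributeError:
        return None

    recipe_all = [recipe_title, recipe_source]  # title, ingredients
    return recipe_all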

Printing the crawled result gives the following.

[Screenshot: printed crawl result]

The parsed result then goes through CrawlingBetweenRanges, which strips out the unneeded words and sorts what's left into the data I want, menu names and ingredients, and then saves it to a MySQL DB using insert_menu and insert_ingredient.
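
The filtering step described above can be sketched roughly like this. It assumes the [recipe_title, recipe_source] shape that PageCrawler returns; the function name, the sample values, and the "[도구]" section are made up for illustration.

```python
# Section headings worth keeping, taken from the keys checked in CrawlingBetweenRanges.
WANTED_SECTIONS = ("[재료]", "[양념]")


def split_menu_and_ingredients(recipe_all):
    """Return (menu_name, ingredient_names) from a [recipe_title, recipe_source] pair."""
    recipe_title, recipe_source = recipe_all
    menu_name = recipe_title[0].strip()  # first entry is the recipe title
    ingredients = []
    for section, names in recipe_source.items():
        if section not in WANTED_SECTIONS:
            continue  # skip sections that are not ingredients or seasoning
        for name in names:
            name = name.strip()
            if name:  # ignore entries that are empty after trimming
                ingredients.append(name)
    return menu_name, ingredients


# Made-up sample data in the same shape as PageCrawler's return value.
sample = [
    ["Kimchi stew", "A short summary"],
    {"[재료]": ["kimchi", "pork "], "[양념]": ["soy sauce"], "[도구]": ["pot"]},
]
menu_name, ingredients = split_menu_and_ingredients(sample)
print(menu_name, ingredients)
```

Pulling this out into its own function makes it easy to test the cleanup without hitting the site or the DB.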

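Saving to the DB (MySQL)

Crawling roughly 100,000 recipes this way, keeping only the data I wanted, and saving it to the DB took about a day. (Part of that is because the crawl stopped every now and then, haha)

Proof.. if this DB got wiped by mistake I think I'd be sad for a week.. lol

The DB schema isn't complicated, so I'll skip it.. (nothing special apart from the menu id being a foreign key in the ingredient table)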
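
Since the schema is skipped here, the following is only a guess at its shape, not the actual one. The menu columns (mname, url) come from insert_menu; the ingredient table name and its columns are assumptions. It uses Python's built-in sqlite3 with an in-memory DB as a stand-in for MySQL so it runs without a server.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory DB with a menu table and an ingredient table that references it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled
    conn.execute(
        "CREATE TABLE menu ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " mname TEXT NOT NULL,"
        " url TEXT NOT NULL)"
    )
    # Assumed layout: each ingredient row points at its menu row through menu_id.
    conn.execute(
        "CREATE TABLE ingredient ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " menu_id INTEGER NOT NULL REFERENCES menu(id),"
        " name TEXT NOT NULL)"
    )
    return conn


conn = build_demo_db()
cur = conn.execute(
    "INSERT INTO menu(mname, url) VALUES (?, ?)",
    ("sample menu", "http://www.10000recipe.com/recipe/1"),  # sample values only
)
menu_id = cur.lastrowid
conn.execute("INSERT INTO ingredient(menu_id, name) VALUES (?, ?)", (menu_id, "sample ingredient"))
rows = conn.execute(
    "SELECT m.mname, i.name FROM menu m JOIN ingredient i ON i.menu_id = m.id"
).fetchall()
print(rows)
```

With the foreign key in place, an ingredient row cannot point at a menu id that does not exist.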

import pymysql

class MysqlController:
    def __init__(self, host, id, pw, db_name):
        try:
            self.conn = pymysql.connect(host=host, user=id, password=pw, database=db_name, charset='utf8')
            self.curs = self.conn.cursor(pymysql.cursors.DictCursor)
        except pymysql.DatabaseError as e:
            # If connect() fails, self.conn was never assigned, so there is nothing to close
            print(e)
            raise

    # Insert a menu row and return its id
    def insert_menu(self, mname, url):
        try:
            sql = 'INSERT INTO menu(mname, url) VALUES (%s, %s)'
            self.curs.execute(sql, (mname, url))
            self.conn.commit()
            return self.curs.lastrowid
        except self.conn.DatabaseError as e:
            print(e)

I wrote a lot of functions besides insert_menu, but in the end I didn't need anything other than insert_ingredient and a select function, so I deleted the rest. (Only insert_menu is posted above)
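
insert_ingredient isn't shown in this post, so here is a minimal sketch of what it might look like, mirroring insert_menu. The table and column names (ingredient, menu_id, name) are assumptions, and it is written as a standalone function that takes the connection and cursor so it can be read without the class.

```python
def insert_ingredient(conn, curs, menu_id, name):
    """Insert one ingredient row linked to a menu row; return the new row id, or None on error."""
    # Assumed table and column names; adjust to the real schema.
    sql = 'INSERT INTO ingredient(menu_id, name) VALUES (%s, %s)'
    try:
        curs.execute(sql, (menu_id, name))  # parameters are passed separately, never formatted into the SQL
        conn.commit()
        return curs.lastrowid
    except conn.DatabaseError as e:
        conn.rollback()  # undo the failed statement so the connection stays usable
        print(e)
        return None
```

Inside MysqlController this would be a method using self.conn and self.curs, called as mydb.insert_ingredient(menuId, name) the way CrawlingBetweenRanges does.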
