Saving Scrapy spider data to txt, JSON, and MySQL

October 3, 2017 MK PYTHON

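Last time we wrote a pipeline that saves the data from the front page of the Mingkai blog (明凯博客) to a database.
Some readers said they do not need the data in MySQL. They only want to save it as txt, CSV, or JSON.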

Previous post: A detailed method for saving data collected by a Python Scrapy spider to a database

This post shows how to write pipelines that save to txt, JSON, and MySQL.

1. Saving to txt
This is plain file handling. The same approach works for CSV, because a CSV file can be handled like a txt file.

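class TxtPipeline(object):
    '''Save items to a txt file'''
    def process_item(self, item, spider):
        # Get the current working directory
        base_dir = os.getcwd()
        filename = base_dir + '/news.txt'
        # Open the file in append mode as UTF-8 and write the fields.
        # Passing the encoding here replaces the reload(sys) /
        # sys.setdefaultencoding('utf-8') hack, which only exists in Python 2.
        with codecs.open(filename, 'a', encoding='utf-8') as f:
            f.write(item['title'] + '\n')
            f.write(item['link'] + '\n')
        return item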
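
Since the same approach is said to work for CSV, here is a minimal sketch of what a CSV variant could look like. It is not from the original post: the class name CsvPipeline and the file name news.csv are hypothetical, and it is written for Python 3. It uses the standard csv module so that titles containing commas are quoted correctly, and it opens the file once in open_spider instead of once per item.

```python
import csv
import os


class CsvPipeline(object):
    '''Hypothetical pipeline: append the title and link of each item to news.csv'''

    def open_spider(self, spider):
        # Called once when the spider starts, so the file is opened a single time
        path = os.path.join(os.getcwd(), 'news.csv')
        # newline='' lets the csv module control the line endings (Python 3)
        self.file = open(path, 'a', newline='', encoding='utf-8')
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        # csv.writer quotes any field that contains a comma or a quote
        self.writer.writerow([item['title'], item['link']])
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()
```

Like the other pipelines, it would only run after being declared in ITEM_PIPELINES.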

2. Saving to JSON
Use the json.dumps method.

class JsonPipeline(object):
    '''Save items to a JSON file'''
    def process_item(self, item, spider):
        base_dir = os.getcwd()
        filename = base_dir + '/news.json'
        # Open the JSON file and append each item as one line produced by json.dumps
        # Note: ensure_ascii=False is required; otherwise non-ASCII text is
        # written as \uXXXX escape sequences instead of readable characters
        with codecs.open(filename, 'a', encoding='utf-8') as f:
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            f.write(line)
        return item
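
To see what the ensure_ascii parameter changes, the following small sketch (Python 3, with a made-up item) serializes the same dictionary both ways. Both strings decode back to the same data. The only difference is whether the file stays human-readable.

```python
import json

# A made-up item used only for illustration
item = {'title': '明凯博客', 'link': 'http://example.com/'}

# Default behaviour: every non-ASCII character becomes a \uXXXX escape
escaped = json.dumps(item, ensure_ascii=True)
# With ensure_ascii=False the characters are kept as they are
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)

# Either form loads back to the original dictionary
assert json.loads(escaped) == json.loads(readable) == item
```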

3. Saving to MySQL
We have already implemented this, but here is the code again.

class MysqlPipeline(object):
    '''Save items to MySQL'''
    def process_item(self, item, spider):
        # Pull the fields out of the item
        title = item['title']
        link = item['link']
        content = item['content']
        # Connect to the local database using values from the project settings
        host = settings['MYSQL_HOSTS']
        user = settings['MYSQL_USER']
        psd = settings['MYSQL_PASSWORD']
        db_name = settings['MYSQL_DB']
        cha = settings['CHARSET']
        db = MySQLdb.connect(host=host, user=user, passwd=psd, db=db_name, charset=cha)
        # Get a cursor with the cursor() method
        cursor = db.cursor()
        # Parameterized INSERT statement; the driver escapes the values
        sql = "INSERT INTO aimks(title,link,content) VALUES (%s,%s,%s)"
        data = [title, link, content]
        try:
            # Execute the SQL statement
            cursor.execute(sql, data)
            # Commit the change
            db.commit()
            print(title + ': import succeeded')
        except MySQLdb.Error:
            # Roll back if the insert fails
            db.rollback()
            print(title + ': import failed')
        finally:
            # Close the connection
            db.close()
        return item
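
The pipeline above reads five keys from the project settings. As a rough sketch, the matching entries in settings.py might look like the following. The key names come from the code above, but every value is a placeholder and not the author's real configuration.

```python
# settings.py: connection values read by MysqlPipeline
# All values below are placeholders; replace them with your own
MYSQL_HOSTS = '127.0.0.1'         # database host
MYSQL_USER = 'root'               # database user
MYSQL_PASSWORD = 'your_password'  # password for that user
MYSQL_DB = 'your_database'        # database that contains the aimks table
CHARSET = 'utf8'                  # connection character set
```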

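4. Libraries you need

Many readers copied the code above without importing the libraries, then came to ask me why it did not work.

Learn to read the error message: see what it reports, then import the corresponding library.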

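import os
import sys
import codecs
import json
import MySQLdb
import MySQLdb.cursors
# MysqlPipeline reads a `settings` object; loading the project settings
# this way is an assumption, since the original import is not shown
from scrapy.utils.project import get_project_settings
settings = get_project_settings()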

5. Declaring the pipeline classes

If you only write the pipeline classes and do not declare them in settings, the pipelines will not run.

ITEM_PIPELINES={
    'mkscrapy.pipelines.MysqlPipeline': 100,
    'mkscrapy.pipelines.JsonPipeline': 200,
    'mkscrapy.pipelines.TxtPipeline': 300,
}
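
The numbers in ITEM_PIPELINES decide the order: Scrapy passes each item through the pipelines from the lowest value to the highest. The short sketch below only sorts the dictionary to make that order visible. It does not call Scrapy itself.

```python
ITEM_PIPELINES = {
    'mkscrapy.pipelines.MysqlPipeline': 100,
    'mkscrapy.pipelines.JsonPipeline': 200,
    'mkscrapy.pipelines.TxtPipeline': 300,
}

# Sort the class paths by their number to see the order items flow through
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

With the values above, an item reaches MysqlPipeline first, then JsonPipeline, then TxtPipeline.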



