[Guide] Running Scrapy on AWS Lambda

I installed Scrapy on a Mac, zipped it up, and used it in AWS Lambda, which produced this error:

 
{
  "errorMessage": "Unable to import module 'lambda_function': cannot import name 'etree' from 'lxml' (/var/task/lxml/__init__.py)",
  "errorType": "Runtime.ImportModuleError"
}

lxml's etree is a C extension (compiled code), so maybe this depends on the OS it was built on?

So I figured I should build on the same OS that Lambda runs on, and decided to package the Scrapy library into a zip on Amazon Linux.
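The error is consistent with how CPython names compiled extension modules: the file name carries a platform tag, and the interpreter only looks for names ending in one of its own suffixes. The sketch below illustrates that lookup. `loadable_extension`, `LINUX_SUFFIXES`, and the two file names are illustrative assumptions (typical names for CPython 3.8 builds on macOS and on x86_64 Linux), not values read from the actual package.

```python
import importlib.machinery
import tempfile
from pathlib import Path


def loadable_extension(package_dir, module_name, suffixes=None):
    """Return the compiled file an interpreter would pick for module_name, or None."""
    if suffixes is None:
        # Suffixes accepted by the interpreter running this script
        suffixes = importlib.machinery.EXTENSION_SUFFIXES
    for suffix in suffixes:
        candidate = Path(package_dir) / (module_name + suffix)
        if candidate.is_file():
            return candidate.name
    return None


# Assumption: typical suffix list for CPython 3.8 on x86_64 Linux
LINUX_SUFFIXES = [".cpython-38-x86_64-linux-gnu.so", ".abi3.so", ".so"]

with tempfile.TemporaryDirectory() as tmp:
    # A macOS build ships a file tagged "darwin"; a Linux interpreter does not look for it
    (Path(tmp) / "etree.cpython-38-darwin.so").touch()
    mac_build = loadable_extension(tmp, "etree", LINUX_SUFFIXES)
    # A Linux build ships a file with the matching tag
    (Path(tmp) / "etree.cpython-38-x86_64-linux-gnu.so").touch()
    linux_build = loadable_extension(tmp, "etree", LINUX_SUFFIXES)

print(mac_build)    # None
print(linux_build)  # etree.cpython-38-x86_64-linux-gnu.so
```

Printing `importlib.machinery.EXTENSION_SUFFIXES` on the Mac and on the build machine shows the difference directly.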

1. Create an Amazon Linux instance on EC2

Create an instance for preparing the Scrapy library. A small one such as a t2.nano or t2.micro running "Amazon Linux 2 AMI (HVM), SSD Volume Type" is enough.

Once the instance is up, connect to it from the local machine over ssh.

 
#local → EC2
$ssh -i hoge.pem ec2-user@111.222.333.444

2. Install Scrapy

 
#EC2
$sudo yum install git
$git clone https://github.com/pyenv/pyenv.git ~/.pyenv
$vi ~/.bash_profile

Edit .bash_profile as follows:

 
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi

# User specific environment and startup programs
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"

Install the required libraries and Python 3.8.0.

 
#EC2
$sudo yum install libffi-devel
$sudo yum install gcc zlib-devel bzip2 bzip2-devel readline readline-devel sqlite sqlite-devel openssl openssl-devel -y
$pyenv install 3.8.0
$pyenv global 3.8.0
$pyenv rehash
$python --version

Lambda extracts layers under /opt, and the Python runtime loads libraries from /opt/python. So create a python folder, install Scrapy into it, and zip the folder.

 
#EC2
$mkdir python
$pip install -t ./python scrapy
$zip -r scrapy.zip python
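Because the layout matters, it can help to check the archive before uploading it. The following is a minimal sketch that lists any entries not under the top-level python/ folder. `entries_outside_python_dir` is a hypothetical helper, and the archive here is built in memory with made-up entries purely for demonstration.

```python
import io
import zipfile


def entries_outside_python_dir(zip_file):
    """Return archive entries that are not under the top-level python/ folder.

    zip_file may be a path (e.g. "scrapy.zip") or a file-like object.
    """
    with zipfile.ZipFile(zip_file) as archive:
        return [name for name in archive.namelist()
                if not name.startswith("python/")]


# Demonstration archive: one correctly placed entry, one misplaced entry
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("python/scrapy/__init__.py", "")
    archive.writestr("scrapy/__init__.py", "")  # missing the python/ prefix
buffer.seek(0)

misplaced = entries_outside_python_dir(buffer)
print(misplaced)  # ['scrapy/__init__.py']
```

For the zip created above, `entries_outside_python_dir("scrapy.zip")` should return an empty list.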

Next, download the zip file to the local machine.

 
#local
$scp -i hoge.pem ec2-user@111.222.333.444:/home/ec2-user/dev/scrapy.zip ./

3. Set up the Scrapy library and run the Lambda function

Register the downloaded zip file as a Lambda layer. A library registered as a layer acts as a shared library that any Lambda function can use once the layer is attached to it.
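To confirm that a function really picks a library up from the layer, one option is to look at where the import system would load it from. This is a small sketch under the assumption that layer contents appear under /opt/python; `module_origin` and `loaded_from_layer` are hypothetical helper names.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


def loaded_from_layer(name, layer_root="/opt/python"):
    """True if the module would be loaded from under layer_root."""
    origin = module_origin(name)
    return origin is not None and origin.startswith(layer_root + "/")


# Outside Lambda this simply shows where a module lives on this machine
print(module_origin("json"))
print(loaded_from_layer("json", layer_root="/nonexistent/layer"))  # False
```

Inside a handler, logging `loaded_from_layer("scrapy")` would show whether Scrapy comes from the layer or from the function's own package.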

I wrote a handler that uses Scrapy (CrawlerProcess) in the Lambda function and ran it. It ran without errors!

 
import logging
import scrapy
from scrapy.crawler import CrawlerProcess

logger = logging.getLogger()
logger.setLevel(logging.INFO)

class MySpider(scrapy.Spider):
    name = 'testspider'
    start_urls = [
        'http://www.xxx.go.jp/'
    ]

    def start_requests(self):
        logger.info('start_requests():')
        # ... (omitted)

    def parse(self, response):
        logger.info('parse():')
        # ... (omitted)
            
def lambda_handler(event, context):
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(MySpider)
    process.start() # the script will block here until the crawling is finished

99. What I have not solved with Lambda + Scrapy

The first run works fine. Running it a second time produces the following error:

 
Response:
{
  "errorType": "ReactorNotRestartable",
  "stackTrace": [
    "  File \"/var/task/lambda_function.py\", line 77, in lambda_handler\n    process.start() # the script will block here until the crawling is finished\n",

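Judging from the comment in the Scrapy sample program, "the script will block here until the crawling is finished", is the crawl being treated as unfinished? It is a mystery. One likely factor is that Lambda can reuse the same Python process for consecutive invocations, and the Twisted reactor that CrawlerProcess starts cannot be started again once it has stopped.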

Others have posted about this on Stack Overflow and elsewhere, but it does not seem to be resolved. I will keep investigating.
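One workaround often suggested for ReactorNotRestartable is to run each crawl in a fresh child process, so every invocation gets a reactor that has never been started. The sketch below assumes a Linux environment where the `fork` start method is available. `run_in_child` and `crawl` are hypothetical helpers, not part of Scrapy, and this has not been verified inside Lambda.

```python
import multiprocessing


def run_in_child(target, *args):
    """Run target(*args) in a forked child process and return its result."""
    ctx = multiprocessing.get_context("fork")
    # One-way pipe: the parent receives, the child sends
    parent_conn, child_conn = ctx.Pipe(duplex=False)

    def _worker():
        try:
            child_conn.send(("ok", target(*args)))
        except Exception as exc:  # report failures to the parent
            child_conn.send(("error", repr(exc)))
        finally:
            child_conn.close()

    proc = ctx.Process(target=_worker)
    proc.start()
    child_conn.close()  # the parent keeps only the receiving end
    # Raises EOFError if the child exits without sending anything
    status, payload = parent_conn.recv()
    proc.join()
    if status == "error":
        raise RuntimeError(payload)
    return payload


def crawl(spider_cls, settings):
    """Run one crawl; meant to be called through run_in_child."""
    # Imported lazily so run_in_child can be tried without Scrapy installed
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings)
    process.crawl(spider_cls)
    process.start()  # blocks until the crawl finishes, inside the child only
    return "finished"


print(run_in_child(pow, 2, 5))  # 32
```

With this in place, `lambda_handler` would call `run_in_child(crawl, MySpider, {...})` instead of calling `process.start()` directly, so the parent process never starts the reactor itself.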
