March 20th, 2013

Setting up Raspberry Pi’s for scraping

Flavour filled raspberry

In a previous post I took a Raspberry Pi from an unconfigured box to a stable, simple server on the network. In this post I will mount a hard drive, move MySQL onto it and keep data on a separate partition. We’ll also enable outbound email, cron jobs and a few other bits, so the topics covered will be a bit scattered. As we dive a little deeper into the OS, it’s important to note that this Raspberry Pi is running Raspbian, a Debian derivative.

Installing Beautiful Soup

Scraping often needs a good HTML/XML parser; for Python, BeautifulSoup is the best resource. If you don’t have root, you can install it for your user account only. After extracting BeautifulSoup (gzip -d, tar -xf), install it to the user account with the --user flag:

python setup.py install --user

Since I have root on my RPis, I can download Beautiful Soup and install it system-wide:

wget http://www.crummy.com/software/BeautifulSoup/bs4/download/beautifulsoup4-4.1.3.tar.gz
gzip -d beautifulsoup4-4.1.3.tar.gz
tar -xf beautifulsoup4-4.1.3.tar
cd beautifulsoup4-4.1.3/
sudo python setup.py install
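
To confirm that the install worked, and to see whether Python picks up the system-wide copy or the --user copy, a small check like the following may help. This is a minimal sketch assuming Python 3; module_location is a hypothetical helper name of mine, not part of Beautiful Soup.

```python
import importlib.util


def module_location(name):
    """Return the file a top-level module would be imported from, or None if it is not installed."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    # bs4 is the package name that beautifulsoup4 installs
    print(module_location("bs4"))
```

A path under your home directory suggests the --user install is the one in use; None means Python cannot find the package at all.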

Python-MySQL connector

I had to install the Python MySQL database connector (MySQLdb) onto my Mac for testing. Setting your PATH variable so that it includes the MySQL bin directory is important, otherwise it just doesn’t work.

PATH="${PATH}:/usr/local/mysql/bin"
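
The MySQLdb build looks for the mysql_config program, which is most likely why the PATH matters here. The sketch below checks whether mysql_config can be found before you attempt the install. It assumes Python 3; find_mysql_config is a hypothetical helper, and the default directory is simply the one from the PATH line above.

```python
import os
import shutil


def find_mysql_config(extra_dir="/usr/local/mysql/bin"):
    """Look for mysql_config on PATH, then in extra_dir; return its path or None."""
    search_path = os.environ.get("PATH", "") + os.pathsep + extra_dir
    return shutil.which("mysql_config", path=search_path)


if __name__ == "__main__":
    print(find_mysql_config() or "mysql_config not found")
```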

Static IP on network

I would like my RPi to have a static IP on my network. I could set the IP to be static by following these directions and editing /etc/network/interfaces, but I chose to simply have my router always allocate the same IP to the MAC address of my Raspberry Pi.

Hub setup

Connecting the RPi to the D-Link DUB-H4 hub was more complex than I would like. When the RPi’s USB port (not its power port) is plugged into the control port on the back of the hub, the hub’s high-powered USB cuts power to the RPi, which then reboots continually. You have to plug the RPi into one of the other three ports instead; this may have repercussions if we draw too much power. It’s too bad because it’s a beautiful hub. Next time I’ll buy one of these RPi hubs.

Hard drive setup

I connected my hard-disk-drive (HDD) and set it up according to these instructions on raspi.tv.

You can mount the external drive to any location you desire. I decided to not be too creative and place it in /mnt:

sudo mkdir /mnt/mysql

I switched to the root user for the next step, which required that I first define a password for root. I also formatted the hard drive as ext4, replacing the FAT32 file system that came on the drive.

# set/change root password, as there is none (unable to login)
sudo passwd root
# switch user to root, enter password
su
# partitioned the drive by following the sfdisk instructions
sfdisk /dev/sda
# I created two partitions
# One for the mysql data of ~ 80 GB /mnt/mysql
# Another ~170 GB, for data located at /home/pi/somepath/data/
#
# Formatting partitions
# If the file system type you need is not present you can install it,
# for example NTFS support (I will use ext4 instead)
apt-get install ntfs-3g
# You can determine which file systems are installed by typing
mkfs.
# and tab autocomplete to see the list
# I formatted my drive as ext4
mkfs.ext4 /dev/sda1
mkfs.ext4 /dev/sda2
#
# Then mounted the HDD partitions
mount /dev/sda1 /mnt/mysql
mount /dev/sda2 /home/pi/somepath/data/
# we can see our mounted drives and file format by typing:
mount
# let's go back to being pi, exit root
exit

We want the partition mounted each time our Pi restarts so we should edit our /etc/fstab file:

sudo vim /etc/fstab
# add the following in your fstab file
# <file system>        <dir>         <type>    <options>             <dump> <pass>
/dev/sda1       /mnt/mysql      ext4    defaults        0       0
/dev/sda2 /home/pi/somepath/data/      ext4    defaults        0       0

You can read more about formatting the fstab file on Debian’s fstab page.
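
A typo in fstab can make the next boot unpleasant, so it can be worth reading the entries back before restarting. The sketch below splits fstab-style text into its six fields. It assumes Python 3; FstabEntry and parse_fstab are hypothetical names, and the sample lines are the two entries from above.

```python
from collections import namedtuple

FstabEntry = namedtuple(
    "FstabEntry", "device mount_point fs_type options dump fsck_pass"
)


def parse_fstab(text):
    """Parse fstab-formatted text into FstabEntry tuples, skipping comments and blanks."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            raise ValueError("too few fields: %r" % line)
        # the dump and pass columns are optional and default to 0
        fields += ["0"] * (6 - len(fields))
        entries.append(FstabEntry(*fields[:6]))
    return entries


if __name__ == "__main__":
    sample = """
    /dev/sda1  /mnt/mysql               ext4  defaults  0  0
    /dev/sda2  /home/pi/somepath/data/  ext4  defaults  0  0
    """
    for entry in parse_fstab(sample):
        print(entry.device, "->", entry.mount_point, entry.fs_type)
```

To check the real file, pass it the contents of /etc/fstab, for example parse_fstab(open("/etc/fstab").read()).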

If you’re curious as to how much space you have, or are using, use:

df -h

To see the same info for a specific folder, use:

du -h somefolder
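
If the scraper itself should notice when a drive is missing or filling up, the same information is available from Python. This is a minimal sketch assuming Python 3; usage_summary is a hypothetical helper. Note that it reports on the whole file system containing the path (like df), not on the size of one folder (like du).

```python
import os
import shutil


def usage_summary(path):
    """Return (is_mount_point, used_fraction) for the file system containing path."""
    usage = shutil.disk_usage(path)
    fraction = usage.used / usage.total if usage.total else 0.0
    return os.path.ismount(path), fraction


if __name__ == "__main__":
    mounted, fraction = usage_summary("/")
    print("mount point: %s, used: %.0f%%" % (mounted, fraction * 100))
```

Calling usage_summary("/mnt/mysql") and getting False for the first value would suggest the drive did not mount and the path is just an empty directory on the SD card.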

Moving MySQL

MySQL is currently storing data on the SD card. Surprisingly it’s quite easy to move. We would like to move mysql to the path of the hard drive we just mounted. We first need to stop MySQL:

sudo /etc/init.d/mysql stop

We will give ownership of our mysql destination directory/drive to mysql, move the data from the old location to the new and change some settings.

# make the destination directory for mysql (if it does not already exist)
sudo mkdir -p /mnt/mysql
# change the owner of the destination mysql directory
sudo chown mysql:mysql /mnt/mysql

Move the original MySQL data to the new path on the external hard drive.

# need to be root
su
mv /var/lib/mysql/* /mnt/mysql

We can find the configuration file containing the path by searching for it:

# still as root
find / -name my.cnf
# result
/etc/mysql/my.cnf

I edited my.cnf, changing the datadir value to /mnt/mysql and leaving the socket value alone:

# from
datadir = /var/lib/mysql
# Some tutorials suggest changing it, but I don't see the need
#socket = /var/run/mysqld/mysql.sock
#
# to
datadir = /mnt/mysql
# I could change the below as well, but again there is no need
# Note that the mysql.sock file only exists when mysql is running
#socket = /mnt/mysql/mysql.sock
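
Before starting MySQL again it can be reassuring to read the value back from the file. The sketch below pulls one option out of the [mysqld] section of my.cnf-style text. It assumes Python 3 and is deliberately simple: read_mysqld_option is a hypothetical helper that ignores !include directives and does not strip comments placed after a value.

```python
def read_mysqld_option(text, option):
    """Return the last value given for option in the [mysqld] section, or None."""
    section = None
    value = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;!":
            continue  # blank line, comment, or !include directive
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip().lower()
            continue
        if section == "mysqld" and "=" in line:
            key, _, val = line.partition("=")
            if key.strip() == option:
                value = val.strip()
    return value


if __name__ == "__main__":
    sample = "[mysqld]\n#datadir = /var/lib/mysql\ndatadir = /mnt/mysql\n"
    print(read_mysqld_option(sample, "datadir"))
```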

Let’s restart MySQL:

sudo /etc/init.d/mysql start
# If something was completed incorrectly you will get an error
# You should receive a message saying:
[info] Checking for tables which need an upgrade, are corrupt or were not closed cleanly..
# This is normal 

Setting up MySQL

To set up MySQL we need to log in as the MySQL root user:

mysql -u root -p
# in my case this is the same password as my server root account
# this need not be the case for your account

Welcome to mysql – let’s create another user with some privileges:

CREATE USER 'pi'@'localhost' IDENTIFIED BY 'somepassword';
GRANT ALL PRIVILEGES ON *.* TO 'pi'@'localhost';
# you can always change the password again:
SET PASSWORD FOR 'pi'@'localhost' = PASSWORD('new-password-here');

We should now have a functioning user with permission to create a database and tables. If you check the /mnt/mysql directory you should see new directories existing for any new databases you have created. Success.

cron jobs

I need my scraping to recur on a regular basis. Let’s see if cron is already running. Here are a few ways of checking:

# see if crontab, used to load the tables to cron is functioning (debian)
# see if cron is running
pgrep cron
# see if the cron service is running the 'proper' way
service cron status

If you encounter difficulties, cron messages are in /var/log/syslog. Check them out:

grep -i cron /var/log/syslog

Inside my cron file I will do two things: state which email address to contact in the case of any output or error from a job, and define the recurrence of the job.

MAILTO='myemailaddress@gmail.com'
# m h  dom mon dow   command
*/10 * * * * /usr/bin/python /home/pi/python_script.py
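
If the schedule syntax is unfamiliar, it can help to expand a field into the values it matches. The sketch below does that for one numeric cron field. It assumes Python 3 and covers only *, lists, ranges and steps; expand_cron_field is a hypothetical helper and does not handle names such as mon or jan.

```python
def expand_cron_field(field, low, high):
    """Expand one numeric cron field (e.g. '*/10', '1-5', '0,30') into a sorted list."""
    values = set()
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step = int(step) if step else 1
        if rng == "*":
            start, end = low, high
        elif "-" in rng:
            start, end = (int(x) for x in rng.split("-"))
        else:
            start = end = int(rng)
        if not (low <= start <= end <= high) or step < 1:
            raise ValueError("bad cron field part: %r" % part)
        values.update(range(start, end + 1, step))
    return sorted(values)


if __name__ == "__main__":
    # the minute field from the crontab line above
    print(expand_cron_field("*/10", 0, 59))
```

For the minute field */10 this gives 0, 10, 20, 30, 40 and 50, so the script runs six times an hour.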

Default text editor

I discovered that the system default text editor is not what I desire. I set vim as the default by selecting it from the prompted list.

sudo update-alternatives --config editor

Logging in without a password

If you would like to log on to your RPi without using a password you can generate a key pair on your local machine and copy the public key to the RPi. First generate a key on your laptop/desktop:

ssh-keygen -t rsa

Simply hit return when prompted for a file name; it will create the key files in the .ssh directory in your home directory. The second prompt asks for a passphrase; this isn’t necessary either, as a password is what our key is meant to replace.

# create a directory on your RPi named .ssh
ssh pi@192.168.1.42 mkdir -p .ssh

# copy your public key you generated into the directory you just created
cat .ssh/id_rsa.pub | ssh pi@192.168.1.42 'cat >> .ssh/authorized_keys'

You should now be able to ssh into your RPi from this computer without using a password. To access your RPi from additional computers without a password, repeat the steps above on each one, so that its public key is appended to the .ssh/authorized_keys file.
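
Repeating the cat >> step appends the same key again each time. If you would rather keep authorized_keys free of duplicates, something like the following could be run on the RPi instead. It is a minimal sketch assuming Python 3; add_authorized_key is a hypothetical helper and the key shown is a placeholder, not a real key.

```python
import os


def add_authorized_key(public_key, ssh_dir):
    """Append public_key to ssh_dir/authorized_keys unless it is already listed.

    Returns True if the key was added, False if it was already present.
    """
    public_key = public_key.strip()
    os.makedirs(ssh_dir, mode=0o700, exist_ok=True)
    path = os.path.join(ssh_dir, "authorized_keys")
    content = ""
    if os.path.exists(path):
        with open(path) as fh:
            content = fh.read()
    if public_key in (line.strip() for line in content.splitlines()):
        return False
    with open(path, "a") as fh:
        if content and not content.endswith("\n"):
            fh.write("\n")  # do not glue the new key onto the last line
        fh.write(public_key + "\n")
    # keep the file private to its owner
    os.chmod(path, 0o600)
    return True


if __name__ == "__main__":
    import tempfile
    demo_dir = os.path.join(tempfile.mkdtemp(), ".ssh")
    print(add_authorized_key("ssh-rsa AAAA...placeholder user@laptop", demo_dir))
```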

Disabling SSH Passwords

If you do go ahead with making your server public it’s a good idea to disable password authentication. Passwords can be guessed through brute force. Edit /etc/ssh/sshd_config parameters to the following values:

ChallengeResponseAuthentication no
PasswordAuthentication no
UsePAM no
PubkeyAuthentication yes
PermitRootLogin no

Then reload the service:

sudo /etc/init.d/ssh reload

If you get the following errors when reloading the service it is because you are not using sudo.

Could not load host key: /etc/ssh/ssh_host_rsa_key
Could not load host key: /etc/ssh/ssh_host_dsa_key
Could not load host key: /etc/ssh/ssh_host_ecdsa_key
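
Since a mistake in sshd_config can lock you out, it may be worth comparing the file against the values listed above before you log out. The sketch below reports any keyword that is missing or set differently. It assumes Python 3; EXPECTED and check_sshd_config are hypothetical names, and Match blocks are not treated specially.

```python
EXPECTED = {
    "challengeresponseauthentication": "no",
    "passwordauthentication": "no",
    "usepam": "no",
    "pubkeyauthentication": "yes",
    "permitrootlogin": "no",
}


def check_sshd_config(text, expected=EXPECTED):
    """Return {keyword: found_value} for each expected keyword that is missing or different."""
    found = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        keyword = parts[0].lower()  # keywords are case-insensitive
        value = parts[1].strip().lower() if len(parts) > 1 else ""
        # sshd uses the first value it obtains for a keyword
        found.setdefault(keyword, value)
    return {k: found.get(k) for k, v in expected.items() if found.get(k) != v}


if __name__ == "__main__":
    sample = "PasswordAuthentication yes\nPubkeyAuthentication yes\n"
    print(check_sshd_config(sample))
```

An empty result means every listed keyword has the expected value. This only compares values; sshd's own test mode (sudo sshd -t) is the better check for syntax errors.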

Enabling email sending

I want to know if my cron job encounters an error. I can enable email sending by installing ssmtp and configuring a few files.

# update 
sudo apt-get update

# install ssmtp
sudo apt-get install ssmtp

# edit /etc/ssmtp/ssmtp.conf
mailhub=smtp.gmail.com:587
AuthUser=YourGMailUserName@gmail.com
AuthPass=YourGMailPassword
UseSTARTTLS=YES
#optional
rewriteDomain=something_other_than_gmail.com

Some further edits:

# edited /etc/ssmtp/revaliases to add:
root:root@DOMAINNAME:smtp.gmail.com:587
pi:pi@DOMAINNAME:smtp.gmail.com:587
# replace DOMAINNAME with what you wish

# allow all users to send emails
# (note: this also lets every local user read the Gmail password stored in this file)
sudo chmod 774 /etc/ssmtp/ssmtp.conf

# give username a pretty name by editing passwd file
sudo vim /etc/passwd

# example
pi:x:1000:1000:Yummy Pi:/home/pi:/bin/bash
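
To test the setup without waiting for cron, a scraper could send its own message. The sketch below builds a message and pipes it to a sendmail-compatible program. It assumes Python 3 and that the ssmtp package provides /usr/sbin/sendmail on your system, which you should verify; build_message and send_via_sendmail are hypothetical helpers and the addresses are placeholders.

```python
import subprocess
from email.message import EmailMessage


def build_message(sender, recipient, subject, body):
    """Build a plain-text email message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_via_sendmail(msg, sendmail="/usr/sbin/sendmail"):
    """Pipe the message to a sendmail-compatible binary; -t takes recipients from the headers."""
    subprocess.run([sendmail, "-t"], input=msg.as_bytes(), check=True)


if __name__ == "__main__":
    message = build_message(
        "pi@example.com", "myemailaddress@example.com",
        "Test from the RPi", "If you can read this, ssmtp is working.",
    )
    print(message)  # call send_via_sendmail(message) on the RPi to actually send it
```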

Connect to RPi from WAN

Currently I can only connect to my RPi from within my home network. My home has a dynamic IP address, so when I am away I cannot know which IP address my router currently has. A dynamic DNS service is the typical way to solve this problem. I don’t plan on setting one up at this point as I don’t need to access my server externally. I will, however, gloss over the steps to get it working.

You will need to install ddclient:

sudo apt-get install ddclient

During the installation you will be prompted for configuration. Your dynamic DNS provider should have some suggested settings.

You will then need to edit your ddclient configuration file (/etc/ddclient.conf) further, again following the directions from your DNS provider. If you are behind a router it will be important to change the method of obtaining the IP address so that it is looked up from an external server:

use=web, web=myip.dnsdynamic.com        # get ip from server.

You should also take a look at the /etc/default/ddclient file for daemon settings.

Restart the service and it should be working (hopefully):

sudo /etc/init.d/ddclient restart

Alternatively you could do something similar without ddclient, by simply retrieving your IP regularly from an API or site and passing it to your provider:

#!/bin/sh
EMAIL=post@reply.com
PASSWORD=changeit
DOMAIN=user.dnsdynamic.com

IP=$(curl --silent http://myip.dnsdynamic.com/)
curl --silent --user "$EMAIL:$PASSWORD" -k "https://www.dnsdynamic.org/api/?hostname=$DOMAIN&myip=$IP"

Called regularly with cron this could be a good solution.
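
One weakness of the shell version is that whatever the lookup returns, even an error page, gets sent on as the IP. A Python variant could validate it first. This is a minimal sketch assuming Python 3; build_update_url is a hypothetical helper, and the URL format is simply copied from the script above rather than checked against the provider's documentation.

```python
import ipaddress
from urllib.parse import urlencode


def build_update_url(hostname, ip_text, base="https://www.dnsdynamic.org/api/"):
    """Validate the looked-up IP and build the update URL; raises ValueError on garbage."""
    ip = ipaddress.ip_address(ip_text.strip())
    return base + "?" + urlencode({"hostname": hostname, "myip": str(ip)})


if __name__ == "__main__":
    # 203.0.113.7 is a documentation address standing in for the looked-up IP
    print(build_update_url("user.dnsdynamic.com", "203.0.113.7\n"))
```

The request itself could then be made with curl as above, or from Python, using the same credentials.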

More to come

I plan on appending further steps for setting up a scraping server as I encounter the need myself.

3 Responses to “Setting up Raspberry Pi’s for scraping”

  1. 2013james Says:

    You might be interested in the new cron sandbox at http://www.dataphyx.com/cronsandbox/. It also has newcomers in mind.

  2. Brennen Says:

    Yo the link for the New Link Hub is not working! Also did I miss why you are using a Hub?

  3. Cyrille Says:

    Fixed the link. Thanks.

    I needed a hub to connect external hard drives. Some HDD draw too much power to use the RPi USB. Perhaps the newer RPi do not need a hub, I’m not sure.
