Requirements
- Ubuntu Server 14.04
- Apache Solr (distro package) 3.6.2+dfsg-2
- Apache Nutch 1.11
Installation
Follow the steps under "Installing Solr using apt-get" in [1] to install Solr. Download and extract the binary package of Nutch. Then, in [2], follow the sections "Verify your Nutch installation" and "Create a URL seed list". The configuration below is for indexing PDFs only.
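As a rough sketch of the seed-list step from [2]: the snippet writes a single URL into urls/seed.txt under the Nutch directory. The seed URL here is a hypothetical placeholder, and NUTCH_HOME is assumed to fall back to the current directory if it is not already set.

```shell
#!/bin/sh
# Assumption: NUTCH_HOME is the extracted Nutch 1.11 directory;
# falls back to the current directory when unset.
NUTCH_HOME="${NUTCH_HOME:-$PWD}"

# Hypothetical seed URL; replace it with the site that hosts your PDFs.
SEED_URL="${SEED_URL:-http://example.com/docs/}"

# The Nutch tutorial keeps seeds in urls/seed.txt, one URL per line.
mkdir -p "$NUTCH_HOME/urls"
printf '%s\n' "$SEED_URL" > "$NUTCH_HOME/urls/seed.txt"

# Show what was written.
cat "$NUTCH_HOME/urls/seed.txt"
```

Add further URLs by appending more lines to the same file.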
After the installation, copy $NUTCH_HOME/conf/schema.xml to /etc/solr/conf/schema.xml, then restart Tomcat:
$ sudo service tomcat6 restart
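The copy and restart can be scripted together. This is only a sketch under a few assumptions: the paths are the ones named above, NUTCH_HOME falls back to the current directory, and the `run` helper and `DRY_RUN` switch are illustrative names, not part of Nutch or Solr. By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Assumption: NUTCH_HOME is the extracted Nutch 1.11 directory.
NUTCH_HOME="${NUTCH_HOME:-$PWD}"
SOLR_CONF="${SOLR_CONF:-/etc/solr/conf}"

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

install_schema() {
  # Replace the packaged Solr schema with the one shipped with Nutch.
  run sudo cp "$NUTCH_HOME/conf/schema.xml" "$SOLR_CONF/schema.xml"
  # Solr runs inside Tomcat 6 here, so restarting Tomcat reloads the schema.
  run sudo service tomcat6 restart
}

install_schema
```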
Download the nutch-site.xml below, then use it to replace the one in $NUTCH_HOME/conf.
nutch-site.xml
The script below recrawls the URLs. Make sure to change the SOLR_URL variable to match your Solr instance.
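For orientation, here is a minimal stand-in for such a recrawl script. It is not the author's original. It assumes the bin/crawl usage described in the Nutch tutorial [2] (`-i` to index, `-D solr.server.url=...` to point at Solr), and a Solr instance served by Tomcat 6 on port 8080. The `run` helper, `DRY_RUN`, `SEED_DIR`, `CRAWL_DIR` and `ROUNDS` are illustrative names. By default it only prints the command; set DRY_RUN=0 to run the crawl.

```shell
#!/bin/sh
# Assumption: NUTCH_HOME is the extracted Nutch 1.11 directory.
NUTCH_HOME="${NUTCH_HOME:-$PWD}"

# Assumption: Solr under Tomcat 6 on port 8080; change this to your own URL.
SOLR_URL="${SOLR_URL:-http://localhost:8080/solr}"

SEED_DIR="${SEED_DIR:-$NUTCH_HOME/urls}"     # directory holding seed.txt
CRAWL_DIR="${CRAWL_DIR:-$NUTCH_HOME/crawl}"  # crawl db and segments live here
ROUNDS="${ROUNDS:-2}"                        # number of crawl rounds

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

recrawl() {
  # Reusing the same crawl directory lets Nutch revisit known URLs
  # and push the results to Solr.
  run "$NUTCH_HOME/bin/crawl" -i -D "solr.server.url=$SOLR_URL" \
    "$SEED_DIR" "$CRAWL_DIR" "$ROUNDS"
}

recrawl
```

A script like this could be scheduled with cron once a manual run has been confirmed to work.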
References
- [1] https://www.digitalocean.com/community/tutorials/how-to-install-solr-on-ubuntu-14-04
- [2] https://wiki.apache.org/nutch/NutchTutorial