Crawler¶
Introduction¶
Note: A video walkthrough of this documentation is available.
Objective¶
- Crawl Data
  - Cases
  - Documents
  - Attorney Profile Pages
- Documents Search API
  - XC Dashboard
  - XpertConnect
Key Dependencies¶
- Apache Tika
  - Converts PDFs into text. This extracted text is what Elasticsearch actually indexes.
- Elasticsearch
  - The indexer uses two main collections:
    - Case Related Documents
    - Attorney Profile Pages
- Postgres
  - Our DB of choice for storing the different pieces of information.
- Django
  - DRF: Django REST Framework
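To make the roles of Tika and Elasticsearch concrete, here is a minimal sketch of the two HTTP requests involved: one sends PDF bytes to a Tika server for text extraction, the other indexes the extracted text. It only builds the requests and sends nothing. The index name `case_documents`, the `content` field, and the helper names are hypothetical and not taken from the project. The URLs assume Tika and Elasticsearch run locally on their default ports (9998 and 9200).

```python
import json
import urllib.request

TIKA_URL = "http://localhost:9998/tika"  # assumed: local Tika server on its default port
ES_URL = "http://localhost:9200"         # assumed: local Elasticsearch on its default port


def build_tika_request(pdf_bytes: bytes, tika_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika to return the PDF's plain text."""
    return urllib.request.Request(
        tika_url,
        data=pdf_bytes,
        method="PUT",
        headers={"Content-Type": "application/pdf", "Accept": "text/plain"},
    )


def build_index_request(index: str, doc_id: str, text: str,
                        es_url: str = ES_URL) -> urllib.request.Request:
    """Build a PUT request that stores the extracted text as one document."""
    # "content" is a hypothetical field name used only for illustration
    body = json.dumps({"content": text}).encode("utf-8")
    return urllib.request.Request(
        f"{es_url}/{index}/_doc/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


tika_req = build_tika_request(b"%PDF-1.4 ...")
es_req = build_index_request("case_documents", "1", "extracted text")
print(tika_req.get_method(), tika_req.full_url)
print(es_req.get_method(), es_req.full_url)
```

Passing either request to `urllib.request.urlopen` would actually send it, which requires the corresponding server to be running.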
Setup Instructions¶
- Update Ubuntu's package lists
sudo apt-get update
- Install Google Chrome & Chrome Driver
mkdir -p ~/Installs/chrome && cd ~/Installs/chrome
sudo apt-get install xdg-utils
sudo apt --fix-broken install
sudo apt-get install unzip
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome-stable_current_amd64.deb
mkdir -p ~/Installs/chrome-driver && cd ~/Installs/chrome-driver
wget https://chromedriver.storage.googleapis.com/91.0.4472.101/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
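ChromeDriver generally needs to match the major version of the installed Chrome, and the commands above pin ChromeDriver 91 while installing the current stable Chrome. As a hedged sketch, the helper below compares the version strings printed by `google-chrome --version` and `./chromedriver --version`. The function names are illustrative, and the sample strings are made up for the example.

```python
import re


def major_version(version_output: str) -> int:
    """Extract the major version from output such as 'Google Chrome 91.0.4472.101'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output: str, driver_output: str) -> bool:
    """Return True when Chrome and ChromeDriver share the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


# Sample strings for illustration only
print(versions_match("Google Chrome 91.0.4472.101", "ChromeDriver 91.0.4472.101"))
print(versions_match("Google Chrome 114.0.5735.90", "ChromeDriver 91.0.4472.101"))
```

If the two differ, downloading the ChromeDriver build that corresponds to the installed Chrome is usually the simplest fix.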
- Ensure Java is installed
sudo apt install openjdk-17-jre-headless
java -version
- Create installs and code directories
mkdir -p ~/Code ~/Installs
- Create a sub-directory for tika
mkdir ~/Installs/tika && cd ~/Installs/tika
- Download runnable jar file
wget https://dlcdn.apache.org/tika/1.28.1/tika-server-1.28.1.jar
Note: If the above URL doesn’t resolve, the mirror may have changed. In that case, go to Tika’s download page and download the next tika-server JAR file in the series
- Run Tika using java -jar tika-server-1.28.1.jar inside a screen
Create a screen using screen -t tika
Within the screen, run java -jar tika-server-1.28.1.jar. Once the Tika server is running, detach from the screen:
Press Ctrl + a, pause a second, and press d
Note: To daemonize this process, follow these instructions
Note: To list the available screens, run screen -ls
Note: To reattach to an existing screen, run screen -r <screen_name>, e.g. screen -r 25101.pts-1.ftech-ThinkPad-E14-Gen-2
- Create a sub-directory for Elasticsearch, then download and start it
mkdir ~/Installs/es && cd ~/Installs/es
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.13.2-linux-x86_64.tar.gz
tar xvfz elasticsearch-7.13.2-linux-x86_64.tar.gz
cd elasticsearch-7.13.2/
./bin/elasticsearch
- Check if ES is up and running using
curl http://localhost:9200
- Install Postgres
sudo apt-get install -y postgresql
- Install Redis
sudo apt install -y redis-server
- Clone the Django Project’s repository
cd ~/Code/ && git clone git@github.com:rockslideanalyticsLLC/rsa_crawling.git
- Install build tools & create virtual environment
sudo apt-get install -y build-essential
sudo apt-get install -y python3-dev
sudo apt-get install -y python3-venv
cd rsa_crawling && python3 -m venv venv
source ./venv/bin/activate
cd rsa_crawling && pip3 install -r requirements.txt
- Ensure the DB is set up correctly
sudo su postgres
psql
CREATE DATABASE rsa_crawling;
CREATE USER rsa_crawling WITH ENCRYPTED PASSWORD 'rsa_crawling';
GRANT ALL PRIVILEGES ON DATABASE rsa_crawling TO rsa_crawling;
Once done, type \q to leave psql, then type exit to go back to the user you were doing the setup as
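The database, user, and password created above need to be mirrored in the project's Django settings. A minimal sketch of what that `DATABASES` entry might look like follows. The host, port, and helper function are assumptions for a local install, and the project's actual settings.py may be organised differently.

```python
def build_databases(name="rsa_crawling", user="rsa_crawling", password="rsa_crawling",
                    host="localhost", port="5432"):
    """Return a Django-style DATABASES dict for a local Postgres instance."""
    return {
        "default": {
            "ENGINE": "django.db.backends.postgresql",  # Django's built-in Postgres backend
            "NAME": name,
            "USER": user,
            "PASSWORD": password,
            "HOST": host,  # assumed: Postgres runs on the same machine
            "PORT": port,  # Postgres default port
        }
    }


# In settings.py this would be assigned as: DATABASES = build_databases()
DATABASES = build_databases()
print(DATABASES["default"]["NAME"])
```

The password here simply repeats the one from the CREATE USER statement. On anything other than a development machine it would be better to choose a stronger one and keep it out of version control.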
- Migrate DB
python manage.py makemigrations && python manage.py makemigrations core && python manage.py migrate
- Create Super User
python manage.py createsuperuser
- Update settings.py to include the IP address (or hostname) you’re running the server on, i.e. add it to the ALLOWED_HOSTS list
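One way to avoid editing settings.py on every machine is to read the ALLOWED_HOSTS list from an environment variable. The sketch below assumes a hypothetical variable named `DJANGO_ALLOWED_HOSTS` holding a comma-separated list. Neither the variable nor the helper is part of the project.

```python
import os


def parse_allowed_hosts(raw: str) -> list:
    """Split a comma-separated host string, dropping blanks and extra spaces."""
    return [host.strip() for host in raw.split(",") if host.strip()]


# DJANGO_ALLOWED_HOSTS is a hypothetical variable name; the fallback allows local access only
ALLOWED_HOSTS = parse_allowed_hosts(
    os.environ.get("DJANGO_ALLOWED_HOSTS", "localhost,127.0.0.1")
)
print(ALLOWED_HOSTS)
```

With this in place, something like `export DJANGO_ALLOWED_HOSTS="10.0.0.5,localhost"` before starting the server would take the place of a manual edit.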
- Run the application and check if things are fine
python manage.py runserver 0.0.0.0:8000
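Once the setup is complete, a quick way to confirm that each service is accepting connections is to probe its port. The sketch below assumes every service runs on the local machine on its default port (Tika 9998, Elasticsearch 9200, Postgres 5432, Redis 6379), with Django on the port 8000 used above. The helper names are illustrative.

```python
import socket

# Default ports, assuming everything runs on this machine
SERVICES = {
    "tika": 9998,
    "elasticsearch": 9200,
    "postgres": 5432,
    "redis": 6379,
    "django": 8000,
}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "127.0.0.1") -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: port_open(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, is_up in check_services().items():
        print(f"{name}: {'up' if is_up else 'DOWN'}")
```

An open port only shows that something is listening there. It does not prove the service is healthy, so the curl check against Elasticsearch is still worth running.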