Crawler¶

Introduction¶

Note: A video walkthrough of this documentation is available.

Objective¶

Crawl Data
1. Cases
2. Documents
3. VCards
4. Attorney Profile Pages
Documents Search API
1. XC Dashboard
2. XpertConnect

Key Dependencies¶

Google Chrome
Google Chrome Driver
Apache Tika
1. Converts PDFs into text; this extracted text is what Elasticsearch indexes
Elasticsearch
1. The indexer uses two main collections:
  Case-Related Documents
  Attorney Profile Pages
Celery
1. Dispatching emails: Alerts & Sharing Documents
2. Redis: Acts as a broker for Celery
Postgres
1. This is our DB of choice for storing different pieces of information
Django
1. DRF: Django Rest Framework
2. Django Elastic Search
3. Django Filters
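
For the Celery and Redis entries above, the wiring usually amounts to a broker URL in the Django settings. The fragment below is a minimal sketch, not the project's actual configuration: it assumes Redis runs locally on its default port 6379, uses database 0, and that the project's Celery app loads its settings from Django with the `CELERY` namespace.

```python
# Hypothetical settings.py fragment; values are assumptions, not taken from the project.
REDIS_HOST = "localhost"   # assumption: Redis runs on the same machine
REDIS_PORT = 6379          # Redis default port
REDIS_DB = 0               # assumption: first Redis database is used for the queue

# Read by Celery only if the Celery app is configured with namespace="CELERY".
CELERY_BROKER_URL = f"redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_DB}"
```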

Setup Instructions¶

Update Ubuntu
1. sudo apt-get update
Install Google Chrome & Chrome Driver
1. mkdir -p ~/Installs/chrome && cd ~/Installs/chrome
2. sudo apt-get install xdg-utils
3. sudo apt --fix-broken install
4. sudo apt-get install unzip
5. wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
6. sudo dpkg -i google-chrome-stable_current_amd64.deb
7. mkdir -p ~/Installs/chrome-driver && cd ~/Installs/chrome-driver
8. wget https://chromedriver.storage.googleapis.com/91.0.4472.101/chromedriver_linux64.zip
9. unzip chromedriver_linux64.zip
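
The steps above install the current stable Chrome but a pinned ChromeDriver (91.0.4472.101), and ChromeDriver generally needs to match Chrome's major version. The sketch below shows one way to compare the two; the helper names are hypothetical, and the input strings are assumed to be the output of `google-chrome --version` and `./chromedriver --version`.

```python
import re


def major_version(version_output: str) -> int:
    """Return the major version found in a string such as 'ChromeDriver 91.0.4472.101 (...)'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output: str, driver_output: str) -> bool:
    """True when Chrome and ChromeDriver report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)
```

If the versions differ, downloading the ChromeDriver build that corresponds to the installed Chrome is usually the simpler fix.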
Ensure Java is installed
1. sudo apt install openjdk-17-jre-headless
2. java -version
Create installs and code directory
1. mkdir -p ~/Code ~/Installs
Create a sub-directory for tika
1. mkdir ~/Installs/tika && cd ~/Installs/tika
Download runnable jar file
1. wget https://dlcdn.apache.org/tika/1.28.1/tika-server-1.28.1.jar
2. Note: If the above URL doesn't resolve, the mirror may have changed. In that case, go to Tika's download page and download the next JAR file in the series
Run Tika using java -jar tika-server-1.28.1.jar inside a screen
1. Create a screen using screen -t tika
2. Within the screen, run java -jar tika-server-1.28.1.jar
3. Once the Tika server is running, detach from the screen: press Ctrl + a, pause a second, and press d
4. Note: To daemonize this process, follow these instructions
5. Note: To list the available screens, run screen -ls
6. Note: To reattach to an existing screen, run screen -r <screen_name>, e.g. screen -r 25101.pts-1.ftech-ThinkPad-E14-Gen-2
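
Once the Tika server is running, PDF-to-text conversion can be tried by hand. The following is a minimal sketch, assuming the server listens on its default port 9998 and using the server's `/tika` endpoint; the function names are hypothetical and this is not necessarily how the project itself calls Tika.

```python
import urllib.request

TIKA_URL = "http://localhost:9998/tika"  # assumption: Tika server on its default port


def build_tika_request(pdf_bytes: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that asks Tika to return the document as plain text."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={"Content-Type": "application/pdf", "Accept": "text/plain"},
    )


def extract_text(pdf_bytes: bytes, url: str = TIKA_URL) -> str:
    """Send the PDF to a running Tika server and return the extracted text."""
    with urllib.request.urlopen(build_tika_request(pdf_bytes, url), timeout=60) as response:
        return response.read().decode("utf-8")
```

For example, `extract_text(open("sample.pdf", "rb").read())` would return the text of a hypothetical `sample.pdf`, provided the server is up.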
Create a sub-directory for Elasticsearch, then download and install it
1. mkdir ~/Installs/es && cd ~/Installs/es
2. wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.13.2-linux-x86_64.tar.gz
3. tar xvfz elasticsearch-7.13.2-linux-x86_64.tar.gz
4. cd elasticsearch-7.13.2/
5. ./bin/elasticsearch
Check if ES is up and running using
1. curl http://localhost:9200
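
The same check can be scripted. The sketch below assumes Elasticsearch listens on localhost:9200 without authentication and reads the version number from the JSON document that the root endpoint returns; the function names are hypothetical.

```python
import json
import urllib.request


def es_version(body: str) -> str:
    """Extract the version number from the JSON returned by the Elasticsearch root endpoint."""
    info = json.loads(body)
    return info["version"]["number"]


def fetch_es_version(url: str = "http://localhost:9200") -> str:
    """Query a running Elasticsearch node and return its version number."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return es_version(response.read().decode("utf-8"))
```

With the download used in this guide, `fetch_es_version()` should report 7.13.2 once the node has finished starting.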
Install Postgres
1. sudo apt-get install -y postgresql
Install Redis
1. sudo apt install -y redis-server
Clone the Django Project’s repository
1. cd ~/Code/ && git clone git@github.com:rockslideanalyticsLLC/rsa_crawling.git
Create virtual environment & build tools
1. sudo apt-get install -y build-essential
2. sudo apt-get install -y python3-dev
3. sudo apt-get install -y python3-venv
4. cd rsa_crawling && python3 -m venv venv
5. source ./venv/bin/activate
6. cd rsa_crawling && pip3 install -r requirements.txt
Ensure DB is setup correctly
1. sudo su postgres
2. psql
3. CREATE DATABASE rsa_crawling;
4. CREATE USER rsa_crawling WITH ENCRYPTED PASSWORD 'rsa_crawling';
5. GRANT ALL PRIVILEGES ON DATABASE rsa_crawling TO rsa_crawling;
6. Once done, type \q to leave psql, then type exit to return to the user you were doing the setup as
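
For reference, Django settings that match the database and user created above might look like the following. This is a sketch only: the host and port are assumptions (a local PostgreSQL on its default port), and the project's own settings.py may already define this differently.

```python
# Hypothetical settings.py fragment matching the database created above.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "rsa_crawling",
        "USER": "rsa_crawling",
        "PASSWORD": "rsa_crawling",
        "HOST": "localhost",  # assumption: PostgreSQL runs on the same machine
        "PORT": "5432",       # PostgreSQL default port
    }
}
```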
Migrate DB
1. python manage.py makemigrations && python manage.py makemigrations core && python manage.py migrate
Create Super User
1. python manage.py createsuperuser
Update settings.py to include the IP address you're running the server on, i.e. add it to the ALLOWED_HOSTS list
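
A minimal sketch of that change, using a placeholder address (203.0.113.10 is from a range reserved for documentation, so replace it with the server's real IP or hostname):

```python
# Hypothetical settings.py fragment; the third entry is a placeholder address.
ALLOWED_HOSTS = [
    "localhost",
    "127.0.0.1",
    "203.0.113.10",  # assumption: replace with the address the server is reached on
]
```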
Run the application and check if things are fine
1. python manage.py runserver 0.0.0.0:8000
