Crawler

Introduction

Note: A video walkthrough of this documentation is available

Objective

  1. Crawl Data
    1. Cases

    2. Documents

    3. VCards

    4. Attorney Profile Pages

  2. Documents Search API
    1. XC Dashboard

    2. XpertConnect

Key Dependencies

  1. Google Chrome

  2. Google Chrome Driver

  3. Apache Tika
    1. Converts PDFs into text; this extracted text is what Elasticsearch actually indexes

  4. Elasticsearch
    1. We have two main collections for the indexer
      1. Case-Related Documents

      2. Attorney Profile Pages

  5. Celery
    1. Dispatching emails: Alerts & Sharing Documents

    2. Redis: Acts as a broker for Celery

  6. Postgres
    1. This is our DB of choice for storing different pieces of information

  7. Django
    1. DRF: Django REST Framework

    2. Django Elastic Search

    3. Django Filters

Setup Instructions

  1. Update Ubuntu
    1. sudo apt-get update

  2. Install Google Chrome & Chrome Driver
    1. mkdir -p ~/Installs/chrome && cd ~/Installs/chrome

    2. sudo apt-get install xdg-utils

    3. sudo apt --fix-broken install

    4. sudo apt-get install unzip

    5. wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb

    6. sudo dpkg -i google-chrome-stable_current_amd64.deb

    7. mkdir -p ~/Installs/chrome-driver && cd ~/Installs/chrome-driver

    8. wget https://chromedriver.storage.googleapis.com/91.0.4472.101/chromedriver_linux64.zip

    9. unzip chromedriver_linux64.zip
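
Note that the unzip above only extracts the chromedriver binary into ~/Installs/chrome-driver; it must be on your PATH (or referenced by absolute path) for the crawler to find it. A minimal standard-library sketch to sanity-check that both binaries resolve (the helper name check_tools is ours, not the project’s):

```python
import shutil

def check_tools(names):
    """Map each tool name to its resolved path on PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}

if __name__ == "__main__":
    # chromedriver was only unzipped above; move it to /usr/local/bin
    # (or add ~/Installs/chrome-driver to PATH) before expecting a hit here
    for tool, path in check_tools(["google-chrome", "chromedriver"]).items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```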

  3. Ensure Java is installed
    1. sudo apt install openjdk-17-jre-headless

    2. java -version

  4. Create installs and code directory
    1. mkdir ~/Code && mkdir ~/Installs

  5. Create a sub-directory for tika
    1. mkdir ~/Installs/tika && cd ~/Installs/tika

  6. Download runnable jar file
    1. wget https://dlcdn.apache.org/tika/1.28.1/tika-server-1.28.1.jar

    2. Note: If the above URL doesn’t resolve, the mirror may have changed. Go to Tika’s download page and download the next JAR file in the series

  7. Run Tika using java -jar tika-server-1.28.1.jar inside a screen
    1. Create a screen using screen -S tika

    2. Within the screen, run java -jar tika-server-1.28.1.jar. Once the Tika server is running, you can detach

    3. Press Ctrl + a, pause a second, then press d

    4. Note: To daemonize this process, follow these instructions

    5. Note: To list the currently available screens, run screen -ls

    6. Note: To reattach to an existing screen, run screen -r <screen_name>, e.g. screen -r 25101.pts-1.ftech-ThinkPad-E14-Gen-2
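
Once the Tika server is up (it listens on port 9998 by default), a document can be converted by PUTting its raw bytes to the /tika endpoint with an Accept: text/plain header. A minimal standard-library sketch; the helper names here are ours, not part of the project:

```python
import urllib.request

TIKA_ENDPOINT = "http://localhost:9998/tika"  # Tika server's default port

def build_tika_request(raw: bytes, url: str = TIKA_ENDPOINT) -> urllib.request.Request:
    """PUT the raw document bytes; 'Accept: text/plain' asks Tika for extracted text."""
    return urllib.request.Request(
        url,
        data=raw,
        method="PUT",
        headers={"Accept": "text/plain"},
    )

def extract_text(path: str) -> str:
    """Send a local file (e.g. a PDF) to Tika and return the extracted text."""
    with open(path, "rb") as fh:
        req = build_tika_request(fh.read())
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

The text returned by extract_text is what gets handed to Elasticsearch for indexing.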

  8. Create a sub-directory for Elasticsearch, then download and install it
    1. mkdir ~/Installs/es && cd ~/Installs/es

    2. wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.13.2-linux-x86_64.tar.gz

    3. tar xvfz elasticsearch-7.13.2-linux-x86_64.tar.gz

    4. cd elasticsearch-7.13.2/

    5. ./bin/elasticsearch

  9. Check if ES is up and running using
    1. curl http://localhost:9200
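
The curl above returns a small JSON banner from the Elasticsearch root URL. A sketch that fetches the banner and pulls out the version number (field names follow the standard ES banner format):

```python
import json
import urllib.request

def parse_banner(banner: str) -> str:
    """Extract the version number from the JSON banner ES serves at its root URL."""
    return json.loads(banner)["version"]["number"]

def check_es(url: str = "http://localhost:9200") -> str:
    """Fetch the banner from a running ES node and return its version string."""
    with urllib.request.urlopen(url) as resp:
        return parse_banner(resp.read().decode("utf-8"))
```

If the node installed above is running, check_es() should report 7.13.2.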

  10. Install Postgres
    1. sudo apt-get install -y postgresql

  11. Install Redis
    1. sudo apt install -y redis-server
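
As noted under Key Dependencies, Redis acts as the broker for Celery. A sketch of how the broker URL is typically wired up (the module name and app label here are assumptions, not the project’s actual layout):

```python
# celery.py (sketch; the actual module layout is the project's own)
from celery import Celery

# Redis on its default local port acts as the message broker
app = Celery("rsa_crawling", broker="redis://localhost:6379/0")
```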

  12. Clone the Django Project’s repository
    1. cd ~/Code/ && git clone git@github.com:rockslideanalyticsLLC/rsa_crawling.git

  13. Create virtual environment & build tools
    1. sudo apt-get install -y build-essential

    2. sudo apt-get install -y python3-dev

    3. sudo apt-get install -y python3-venv

    4. cd rsa_crawling && python3 -m venv venv

    5. source ./venv/bin/activate

    6. pip3 install -r requirements.txt

  14. Ensure the DB is set up correctly
    1. sudo su postgres

    2. psql

    3. CREATE DATABASE rsa_crawling;

    4. CREATE USER rsa_crawling WITH ENCRYPTED PASSWORD 'rsa_crawling';

    5. GRANT ALL PRIVILEGES ON DATABASE rsa_crawling TO rsa_crawling;

    6. Once done, type \q to leave psql, then type exit to return to the user from which you were doing the setup

  15. Migrate DB
    1. python manage.py makemigrations && python manage.py makemigrations core && python manage.py migrate

  16. Create Super User
    1. python manage.py createsuperuser

  17. Update settings.py to include whatever IP address you’re running the server on, i.e. update the ALLOWED_HOSTS list

  18. Run the application and check that everything is working
    1. python manage.py runserver 0.0.0.0:8000
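
The ALLOWED_HOSTS change, along with the database credentials created earlier, lives in settings.py. A sketch of the relevant fragment (the IP placeholder is yours to fill in; the values mirror the DB setup step above):

```python
# settings.py fragment (sketch)
ALLOWED_HOSTS = ["localhost", "127.0.0.1", "YOUR.SERVER.IP.HERE"]

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "rsa_crawling",
        "USER": "rsa_crawling",
        "PASSWORD": "rsa_crawling",
        "HOST": "localhost",
        "PORT": "5432",
    }
}
```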


Get it up and Running

Production mode settings