Crawler
=======

Introduction
~~~~~~~~~~~~

Note: A `Video Walkthrough <https://www.youtube.com/watch?v=vpdneqczH5s&list=PLUObevFMSCxHDHWpGts0VP0otUzpDjUSC&index=12>`_ of this documentation is available.

Objective
~~~~~~~~~

1. Crawl Data
    1. Cases
    2. Documents
    3. `VCards <https://en.wikipedia.org/wiki/VCard>`_
    4. Attorney Profile Pages
2. Documents Search API
    1. XC Dashboard
    2. XpertConnect


Key Dependencies
~~~~~~~~~~~~~~~~
1. `Google Chrome <https://www.google.com/intl/en_in/chrome/>`_
2. `Google Chrome Driver <https://chromedriver.chromium.org/downloads>`_
3. `Apache Tika <https://tika.apache.org/>`_
    1. Converts PDFs into text; this extracted text is what Elastic Search actually indexes
4. `Elastic Search <https://www.elastic.co/>`_
    1. We have two main collections for the indexer
        1. Case Related Document
        2. Attorney Profile Pages
5. `Celery <https://docs.celeryproject.org/en/stable/getting-started/introduction.html>`_
    1. Dispatching emails: Alerts & Sharing Documents
    2. `Redis <https://redis.io/>`_: Acts as a broker for Celery
6. `Postgres <https://www.postgresql.org/>`_
    1. This is our DB of choice for storing different pieces of information
7. `Django <https://www.djangoproject.com/>`_
    1. DRF: Django Rest Framework
    2. `Django Elastic Search <https://pypi.org/project/django-elasticsearch-dsl/>`_
    3. `Django Filters <https://django-filter.readthedocs.io/en/stable/>`_
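
To make the Tika and Elastic Search roles above more concrete, here is a minimal sketch of the two HTTP requests involved: one asks a local Tika server for the plain text of a PDF, the other stores that text as a document in Elastic Search. It is an illustration under assumptions, not the project's actual code (the project lists `django-elasticsearch-dsl` for indexing). The Tika URL assumes the server's default port 9998, and the `content` field name and the helper function names are hypothetical.

.. code-block:: python

    import json
    import urllib.request
    from urllib.parse import quote

    TIKA_URL = "http://localhost:9998/tika"  # assumption: Tika server on its default port
    ES_URL = "http://localhost:9200"         # same address the setup checks with curl


    def build_tika_request(pdf_bytes: bytes) -> urllib.request.Request:
        """Build a PUT request asking Tika to return the plain text of a PDF."""
        return urllib.request.Request(
            TIKA_URL,
            data=pdf_bytes,
            method="PUT",
            headers={"Content-Type": "application/pdf", "Accept": "text/plain"},
        )


    def build_index_request(index: str, doc_id: str, text: str) -> urllib.request.Request:
        """Build a PUT request that stores the extracted text as one ES document."""
        # "content" is a hypothetical field name chosen for this sketch.
        body = json.dumps({"content": text}).encode("utf-8")
        return urllib.request.Request(
            f"{ES_URL}/{quote(index, safe='')}/_doc/{quote(doc_id, safe='')}",
            data=body,
            method="PUT",
            headers={"Content-Type": "application/json"},
        )


    def send(request: urllib.request.Request) -> bytes:
        """Send a prepared request and return the raw response body."""
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read()

With both servers running, the extracted text would come from `send(build_tika_request(pdf_bytes)).decode("utf-8")`, and could then be passed to `send(build_index_request("case_documents", "1", text))`, where `case_documents` is a made-up index name.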


Setup Instructions
~~~~~~~~~~~~~~~~~~
1. Update Ubuntu
    1. `sudo apt-get update`
2. Install Google Chrome & Chrome Driver
    1. `mkdir -p ~/Installs/chrome && cd ~/Installs/chrome`
    2. `sudo apt-get install xdg-utils`
    3. `sudo apt --fix-broken install`
    4. `sudo apt-get install unzip`
    5. `wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb`
    6. `sudo dpkg -i google-chrome-stable_current_amd64.deb`
    7. `mkdir -p ~/Installs/chrome-driver && cd ~/Installs/chrome-driver`
    8. `wget https://chromedriver.storage.googleapis.com/91.0.4472.101/chromedriver_linux64.zip`
    9. `unzip chromedriver_linux64.zip`
3. Ensure Java is installed
    1. `sudo apt install openjdk-17-jre-headless`
    2. `java -version`
4. Create installs and code directory
    1. `mkdir -p ~/Code ~/Installs`
5. Create a sub-directory for Tika
    1. `mkdir -p ~/Installs/tika && cd ~/Installs/tika`
6. Download runnable jar file
    1. `wget https://dlcdn.apache.org/tika/1.28.1/tika-server-1.28.1.jar`
    2. Note: If the above URL doesn't resolve, the mirror may have changed. Go to `Tika's download page <https://tika.apache.org/download.html>`_ and download the next JAR file in the series
7. Run Tika using `java -jar tika-server-1.28.1.jar` inside a `screen <https://www.gnu.org/software/screen/>`_
    1. Create a screen using `screen -t tika`
    2. Within the screen, run `java -jar tika-server-1.28.1.jar`
    3. Once the Tika server is running, detach from the screen: press `Ctrl + a`, pause a second, and press `d`
    4. Note: To daemonize this process, follow `these instructions <https://dzone.com/articles/run-your-java-application-as-a-service-on-ubuntu>`_
    5. Note: To check the current list of available screens, run `screen -ls`
    6. Note: To reattach to an existing screen, run `screen -r <screen_name>`, e.g. `screen -r 25101.pts-1.ftech-ThinkPad-E14-Gen-2`
8. Create a sub-directory for Elastic Search, then download and run it
    1. `mkdir ~/Installs/es && cd ~/Installs/es`
    2. `wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.13.2-linux-x86_64.tar.gz`
    3. `tar xvfz elasticsearch-7.13.2-linux-x86_64.tar.gz`
    4. `cd elasticsearch-7.13.2/`
    5. `./bin/elasticsearch`
9. Check if ES is up and running using
    1. `curl http://localhost:9200`
10. Install Postgres
     1. `sudo apt-get install -y postgresql`
11. Install Redis
     1. `sudo apt install -y redis-server`
12. Clone the Django Project's repository
     1. `cd ~/Code/ && git clone git@github.com:rockslideanalyticsLLC/rsa_crawling.git` 
13. Create virtual environment & build tools
     1. `sudo apt-get install -y build-essential`
     2. `sudo apt-get install -y python3-dev`
     3. `sudo apt-get install -y python3-venv`
     4. `cd rsa_crawling && python3 -m venv venv`
     5. `source ./venv/bin/activate`
     6. `cd rsa_crawling && pip3 install -r requirements.txt`
14. Ensure the DB is set up correctly
     1. `sudo su postgres`
     2. `psql`
     3. `CREATE DATABASE rsa_crawling;`
     4. `CREATE USER rsa_crawling WITH ENCRYPTED PASSWORD 'rsa_crawling';`
     5. `GRANT ALL PRIVILEGES ON DATABASE rsa_crawling TO rsa_crawling;`
     6. Once done, type ``\q`` and then `exit` to return to the user you were running the setup as
15. Migrate DB
     1. `python manage.py makemigrations && python manage.py makemigrations core && python manage.py migrate`
16. Create Super User
     1. `python manage.py createsuperuser`
17. Update `settings.py` to include the IP address you're running the server on, i.e. add it to the `ALLOWED_HOSTS` list
18. Run the application and check that everything works
     1. `python manage.py runserver 0.0.0.0:8000`
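
The database credentials created above and the `ALLOWED_HOSTS` update both end up in `settings.py`. The fragment below is a rough sketch of what those two settings might look like, not a copy of the project's actual file. It assumes Postgres is running locally on its default port 5432, and the `RSA_SERVER_IP` environment variable is a hypothetical name introduced only for this example.

.. code-block:: python

    import os

    # Hypothetical environment variable holding the server's IP address;
    # the real project may hard-code the address or read it differently.
    SERVER_IP = os.environ.get("RSA_SERVER_IP", "")

    # Hosts Django will accept requests for. Local names are kept for development.
    ALLOWED_HOSTS = ["localhost", "127.0.0.1"]
    if SERVER_IP:
        ALLOWED_HOSTS.append(SERVER_IP)

    # Matches the database and user created in the psql steps above.
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": "rsa_crawling",
            "USER": "rsa_crawling",
            "PASSWORD": "rsa_crawling",
            "HOST": "localhost",  # assumption: Postgres on the same machine
            "PORT": "5432",       # assumption: default Postgres port
        }
    }

If the server is reached through an address that is not in `ALLOWED_HOSTS`, Django rejects the request with a 400 response when `DEBUG` is off, which is why this setting is updated before running the application.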

Get it up and Running
~~~~~~~~~~~~~~~~~~~~~

Production mode settings
~~~~~~~~~~~~~~~~~~~~~~~~