Crawler
=======

Introduction
~~~~~~~~~~~~

Note: `Video Walkthrough `_ of this documentation is available.

Objective
~~~~~~~~~

1. Crawl Data

   1. Cases
   2. Documents
   3. `VCards `_
   4. Attorney Profile Pages

2. Documents Search API

   1. XC Dashboard
   2. XpertConnect

Key Dependencies
~~~~~~~~~~~~~~~~

1. `Google Chrome `_
2. `Google Chrome Driver `_
3. `Apache Tika `_

   1. Converts PDFs into text; the extracted text is what Elastic Search indexes

4. `Elastic Search `_

   1. The indexer has two main collections

      1. Case Related Document
      2. Attorney Profile Pages

5. `Celery `_

   1. Dispatching emails: Alerts & Sharing Documents
   2. `Redis `_: Acts as a broker for Celery

6. `Postgres `_

   1. This is our DB of choice for storing different pieces of information

7. `Django `_

   1. DRF: Django Rest Framework
   2. `Django Elastic Search `_
   3. `Django Filters `_

Setup Instructions
~~~~~~~~~~~~~~~~~~

1. Update Ubuntu

   1. `sudo apt-get update`

2. Install Google Chrome & Chrome Driver

   1. `mkdir -p ~/Installs/chrome && cd ~/Installs/chrome`
   2. `sudo apt-get install xdg-utils`
   3. `sudo apt --fix-broken install`
   4. `sudo apt-get install unzip`
   5. `wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb`
   6. `sudo dpkg -i google-chrome-stable_current_amd64.deb`
   7. `mkdir -p ~/Installs/chrome-driver && cd ~/Installs/chrome-driver`
   8. `wget https://chromedriver.storage.googleapis.com/91.0.4472.101/chromedriver_linux64.zip`
   9. `unzip chromedriver_linux64.zip`

3. Ensure Java is installed

   1. `sudo apt install openjdk-17-jre-headless`
   2. `java -version`

4. Create installs and code directory

   1. `mkdir -p ~/Code ~/Installs`

5. Create a sub-directory for Tika

   1. `mkdir ~/Installs/tika && cd ~/Installs/tika`

6. Download runnable jar file

   1. `wget https://dlcdn.apache.org/tika/1.28.1/tika-server-1.28.1.jar`
   2. Note: If the above URL doesn't resolve, the mirror may have changed. Go to `tika's download page `_ and download the next JAR file in the series

7. Run Tika using `java -jar tika-server-1.28.1.jar` inside a `screen `_

   1. Create a screen using `screen -t tika`
   2. Within the screen, run `java -jar tika-server-1.28.1.jar`
   3. Once the Tika server is running, press `Ctrl + a`, pause a second, and press `d` to detach
   4. Note: To daemonize this process, follow `these instructions `_
   5. Note: To check the current list of available screens, run `screen -ls`
   6. Note: To reattach to an existing screen, run `screen -r` followed by the session name, e.g. `screen -r 25101.pts-1.ftech-ThinkPad-E14-Gen-2`

8. Create a sub-directory for Elastic Search, then download and install it

   1. `mkdir ~/Installs/es && cd ~/Installs/es`
   2. `wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.13.2-linux-x86_64.tar.gz`
   3. `tar xvfz elasticsearch-7.13.2-linux-x86_64.tar.gz`
   4. `cd elasticsearch-7.13.2/`
   5. `./bin/elasticsearch`

9. Check if ES is up and running using

   1. `curl http://localhost:9200`

10. Install Postgres

    1. `sudo apt-get install -y postgresql`

11. Install Redis

    1. `sudo apt install -y redis-server`

12. Clone the Django project's repository

    1. `cd ~/Code/ && git clone git@github.com:rockslideanalyticsLLC/rsa_crawling.git`

13. Create virtual environment & build tools

    1. `sudo apt-get install -y build-essential`
    2. `sudo apt-get install -y python3-dev`
    3. `sudo apt-get install -y python3-venv`
    4. `cd rsa_crawling && python3 -m venv venv`
    5. `source ./venv/bin/activate`
    6. `cd rsa_crawling && pip3 install -r requirements.txt`

14. Ensure DB is set up correctly

    1. `sudo su postgres`
    2. `psql`
    3. `CREATE DATABASE rsa_crawling;`
    4. `CREATE USER rsa_crawling WITH ENCRYPTED PASSWORD 'rsa_crawling';`
    5. `GRANT ALL PRIVILEGES ON DATABASE rsa_crawling TO rsa_crawling;`
    6. Once done, type `\q` and then `exit` to return to the user you were running the setup as

15. Migrate DB

    1. `python manage.py makemigrations && python manage.py makemigrations core && python manage.py migrate`

16. Create Super User

    1. `python manage.py createsuperuser`

17. Update `settings.py` so that the `ALLOWED_HOSTS` list includes the IP address you're running the server on

18. Run the application and check if things are fine

    1. `python manage.py runserver 0.0.0.0:8000`

Get it up and Running
~~~~~~~~~~~~~~~~~~~~~

Production mode settings
~~~~~~~~~~~~~~~~~~~~~~~~
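
The setup steps above ask for the `ALLOWED_HOSTS` list in `settings.py` to include the address the server runs on. For a production deployment, one option is to read that list from the environment instead of hard-coding it. The following is only a minimal sketch under that assumption: the `DJANGO_ALLOWED_HOSTS` variable name and the `parse_allowed_hosts` helper are hypothetical and are not part of the project.

```python
import os


def parse_allowed_hosts(raw):
    """Split a comma-separated host string into a clean list of hosts."""
    # Strip surrounding whitespace and drop empty entries left by stray commas.
    return [host.strip() for host in raw.split(",") if host.strip()]


# Hypothetical environment variable (an assumption, not defined by the project).
# Falls back to local-only access when the variable is unset.
ALLOWED_HOSTS = parse_allowed_hosts(
    os.environ.get("DJANGO_ALLOWED_HOSTS", "localhost,127.0.0.1")
)
```

With something like this in `settings.py`, the variable could be exported before starting the server, for example `export DJANGO_ALLOWED_HOSTS=203.0.113.10`, where the address is only a placeholder.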