Webscraper Tutorial

General Info

https://github.com/OSU-Sustainability-Office/automated-jobs

https://login.oregonstate.edu/apps/aws/

https://pptr.dev/

Automated-Jobs Dev Setup

  • Go here for the .env file, and put it inside the automated-jobs/SEC directory
    • You need to be a paid OSU Sustainability Office employee to access the link above
  • The .env file needs DASHBOARD_API = https://api.sustainability.oregonstate.edu/v2/energy unless you are making changes to the energy dashboard (more on this in the Testing Pipeline Guide section)
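Before running a scraper, it can help to confirm which API the .env file points at. The sketch below is a hypothetical helper, not code from the automated-jobs repository: the names parseEnv and checkDashboardApi are made up for illustration, and the parser only understands simple KEY=value lines.

```javascript
// Hypothetical helper (not part of automated-jobs): inspects the DASHBOARD_API entry of .env text.
const PRODUCTION_API = "https://api.sustainability.oregonstate.edu/v2/energy";

// Parse simple KEY=value lines; blank lines and lines starting with # are skipped.
function parseEnv(text) {
  const vars = {};
  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) continue;
    const eq = trimmed.indexOf("=");
    if (eq === -1) continue;
    const key = trimmed.slice(0, eq).trim();
    const value = trimmed.slice(eq + 1).trim();
    vars[key] = value;
  }
  return vars;
}

// Returns "production", "custom" (e.g. a local backend), or "missing".
function checkDashboardApi(envText) {
  const value = parseEnv(envText).DASHBOARD_API;
  if (!value) return "missing";
  return value === PRODUCTION_API ? "production" : "custom";
}

console.log(checkDashboardApi("DASHBOARD_API = " + PRODUCTION_API)); // prints "production"
```

To use it on the real file, pass in the contents of automated-jobs/SEC/.env; a result of "custom" is expected only while you are testing against a local energy dashboard backend.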

In general, run the following from the directory of any given tool (SEC, check-acq, etc.). Note that SunnyWebBox and Tesla Solar City are deprecated and no longer used.

  • npm i
  • node <JavaScript file name>, e.g. node readsec.js
    • Requires Node.js v16
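If you are unsure which Node.js version is active, a small guard like the one below can warn early. This is only a sketch: the function names are hypothetical, and it assumes "v16" means any release with major version 16.

```javascript
// Hypothetical guard: warn if the running Node.js major version is not 16.

// Accepts "16.20.2" (process.versions.node) or "v16.20.2" (process.version).
function nodeMajor(version) {
  return parseInt(version.replace(/^v/, "").split(".")[0], 10);
}

// Assumption: any 16.x release satisfies the "Node.js v16" requirement.
function isSupportedNode(version, requiredMajor = 16) {
  return nodeMajor(version) === requiredMajor;
}

if (!isSupportedNode(process.versions.node)) {
  console.warn(`Expected Node.js v16, but running v${process.versions.node}`);
}
```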

AWS ECR (Elastic Container Registry)

https://us-west-2.console.aws.amazon.com/ecr/repositories?region=us-west-2

  • Create repository
    • Just follow the default options
  • View push commands
  • If you just want to update the webscraper, you only need to push a new image to ECR; you do not need to change ECS. ECS should be configured to pick up the latest ECR revision anyway
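The "View push commands" dialog in the ECR console is the authoritative source for the exact commands. The sketch below only illustrates their usual shape so you know what to expect; the account ID and repository name are placeholders, and ecrPushCommands is a hypothetical helper.

```javascript
// Hypothetical helper: prints the usual shape of the ECR push commands.
// The real values come from "View push commands" in the ECR console.
function ecrPushCommands(accountId, region, repo, tag = "latest") {
  const registry = `${accountId}.dkr.ecr.${region}.amazonaws.com`;
  const image = `${registry}/${repo}:${tag}`;
  return [
    // 1. Log Docker in to the registry
    `aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin ${registry}`,
    // 2. Build the image locally
    `docker build -t ${repo}:${tag} .`,
    // 3. Tag it with the full ECR image URI
    `docker tag ${repo}:${tag} ${image}`,
    // 4. Push it to ECR
    `docker push ${image}`,
  ];
}

// Placeholder account ID and repository name, for illustration only.
console.log(ecrPushCommands("123456789012", "us-west-2", "my-scraper").join("\n"));
```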

AWS ECS (Elastic Container Service)

Task Definition

Clusters

https://us-west-2.console.aws.amazon.com/ecs/v2/clusters?region=us-west-2

  • Create cluster

    • The default options are probably fine here, but this is unverified. Choose the Fargate option
  • Click on a cluster > scheduled tasks

  • Click update on an existing scheduled task for reference before making a new one (have them side by side on different tabs!)

  • While testing something for the first time, it's a good idea to set the cron job interval to something short, like every minute or every 5 minutes. Once you are certain it works, make sure to set the interval back to once every 24 or 48 hours, etc.
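If the scheduled task is configured with an EventBridge-style rate expression (an assumption; the console may instead present this as a fixed-interval field), the test and production intervals would look roughly like the output of the sketch below. The helper name rateExpression is hypothetical.

```javascript
// Hypothetical helper: formats a rate(...) schedule expression.
// The unit is singular when the value is 1 and plural otherwise.
function rateExpression(value, unit) {
  if (!Number.isInteger(value) || value < 1) {
    throw new RangeError("value must be a positive integer");
  }
  if (!["minute", "hour", "day"].includes(unit)) {
    throw new RangeError("unit must be minute, hour, or day");
  }
  return `rate(${value} ${value === 1 ? unit : unit + "s"})`;
}

const TEST_SCHEDULE = rateExpression(5, "minute"); // short interval while testing
const PROD_SCHEDULE = rateExpression(1, "day");    // back to once every 24 hours

console.log(TEST_SCHEDULE); // prints "rate(5 minutes)"
console.log(PROD_SCHEDULE); // prints "rate(1 day)"
```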

AWS Cloudwatch

https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logsV2:log-groups

The log group may be created automatically; if not, create it, as the task may error otherwise. This is also where you can check whether the task was executed.

Name: /ecs/<task name>

See this page as well for more information on Cloudwatch (I think):

https://us-west-2.console.aws.amazon.com/ecs/v2/task-definitions?region=us-west-2

Testing Pipeline Guide

  • Local test with energy dashboard (both frontend and backend running locally) and MySQL Workbench
    • If making changes to backend energy dashboard code: Inspect Element > Network > see the network request sent starting with “data…”
    • Move on when you have successfully added new data to the SQL database with node readsec.js (or whatever you named it), and you get the right data from local frontend > Inspect Element > Network tab
  • Unless you are making changes to the energy dashboard backend code, just set the DASHBOARD_API value in your automated-jobs/SEC/.env file to the production URL (https://api.sustainability.oregonstate.edu/v2/energy)
  • Local test with webscraper on Docker
    • docker build . -t test
    • docker run -t test
    • Move on when both commands work and data is successfully added to the SQL database
  • AWS ECR and ECS
    • If you just want to update the webscraper, you only need to push changes to ECR, not ECS. ECS should be configured to pick up the latest ECR revision anyway
    • Change the interval to 1 minute or so to test (ECS > cluster > scheduled task > update)

  • Double-check this step via Cloudwatch, and also check the data entries on the production site directly (SEC Solar and OSU Operations), as well as in the SQL database via MySQL Workbench
  • Remember to delete duplicate data from the SQL database
    • DELETE FROM Solar_Meters WHERE id = <some id>
    • Although the frontend already handles redundant data, cleaning it up is good practice
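When repeated test runs have inserted several duplicates, it may be easier to compute which ids to delete than to spot them by eye. The sketch below is a rough illustration only: duplicateIds is a hypothetical helper, and the column names meter and time are assumptions that should be swapped for whatever columns actually identify a reading in Solar_Meters.

```javascript
// Hypothetical helper: given rows from the table, return the ids of later
// duplicates, keeping the row with the lowest id for each reading.
// Assumption: `meter` and `time` are placeholder column names.
function duplicateIds(rows, keyColumns = ["meter", "time"]) {
  const seen = new Set();
  const dupes = [];
  // Sort a copy by id so the earliest row of each reading is kept.
  for (const row of [...rows].sort((a, b) => a.id - b.id)) {
    const key = JSON.stringify(keyColumns.map((c) => row[c]));
    if (seen.has(key)) dupes.push(row.id);
    else seen.add(key);
  }
  return dupes;
}

// Made-up rows for illustration only.
const rows = [
  { id: 1, meter: "A", time: 100 },
  { id: 2, meter: "A", time: 100 },
  { id: 3, meter: "A", time: 200 },
];
console.log(duplicateIds(rows)); // prints [ 2 ]
```

Each returned id can then be removed with the DELETE statement above, after checking the rows in MySQL Workbench first.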