Kaleb Lim 2024-06-10 20:54:57 +07:00
parent 8495f146ae
commit b7d862e046
11 changed files with 206 additions and 0 deletions

3
.dvc/.gitignore vendored Normal file

@ -0,0 +1,3 @@
/config.local
/tmp
/cache


@ -0,0 +1 @@
test,data

BIN
.dvc/btime Normal file

Binary file not shown.

9
.dvc/config Normal file

@ -0,0 +1,9 @@
[core]
remote = aquaremote
['remote "aquaremote"']
url = s3://aqua01
endpointurl = https://sos-de-fra-1.exo.io
['remote "aquaremote2"']
url = s3://aqua02
['remote "aquaremote3"']
url = s3://aqua03

1
.dvc/lock Normal file

@ -0,0 +1 @@
1416394

1
.gitignore vendored Normal file

@ -0,0 +1 @@
/data.csv

74
DEVPOD.md Normal file

@ -0,0 +1,74 @@
# Getting Started with DevPod
DevPod is a tool for creating reproducible developer environments. It keeps a team's development environments consistent by giving everyone access to a shared project setup. This guide walks through the DevPod setup used at Aqua Research to access projects and run experiments in a standardized environment.
## What is DevPod?
DevPod serves two main purposes:
- Access to the Project: DevPod allows developers to access the project's source code and data, ensuring everyone is working in the same development environment.
- Running Experiments: It also provides a platform for running experiments and generating models. The results of these experiments can be compared and visualized on Gitea.
At Aqua Research, we run the devcontainer instance on our private server (a Hetzner machine) and use ExoScale to manage access to the actual data.
## Prerequisites
Before you can set up and use DevPod, make sure you have the following prerequisites:
- Docker Installation: Docker is already installed on the private server.
- DevPod Installation: You should have DevPod installed on your local machine. You can find installation instructions at https://devpod.sh/. For this guide, we will only use the CLI of DevPod.
- SSH Access: You need SSH access with a public key to the private machine where you plan to run DevPod. This is typically set up in advance.
## Set Up Your DevPod Environment
### Create a Provider (SSH)
To get started, you need to create a provider that points to our private machine using SSH. Use the following commands:
```bash
devpod provider add ssh
# enter root@95.217.101.177
```
This command will create a new provider in DevPod (locally), enabling you to connect to our private machine through SSH using your public key.
## Create a New Workspace
Now, it's time to create a new workspace. This command will open Visual Studio Code and connect it to the DevPod environment inside a Docker container running on our private remote machine. Execute the following command:
```bash
devpod up --provider ssh git@github.com:gradientzero/aqua-research.git --ide vscode --debug
```
Visual Studio Code will open, automatically connecting to the DevPod environment.
## Apply Local Exoscale Credentials (Inside DevContainer)
To access the Aqua Research data stored on ExoScale, you need to provide your credentials. Create a file named `.dvc/config.local` with the following content, replacing `<1password>` with your actual access key and secret access key:
```ini
# create new file: .dvc/config.local
['remote "aquaremote"']
access_key_id = <1password>
secret_access_key = <1password>
```
(TODO: haven't found a better way yet, but at least this only has to be done once.)
Now, you can simply use the ```dvc pull``` command to retrieve remote data into this DevPod instance.
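The manual step above could also be scripted. The following is a minimal sketch, not part of this repository: the helper name `write_dvc_local_config` is hypothetical, and it assumes the file only needs the single `aquaremote` section shown above. It uses only the Python standard library.

```python
import configparser
from pathlib import Path


def write_dvc_local_config(access_key, secret_key, path=".dvc/config.local"):
    """Write ExoScale credentials for the "aquaremote" remote to a local file.

    Hypothetical helper: section and key names mirror the snippet above.
    """
    config = configparser.ConfigParser(interpolation=None)
    # DVC section headers include the quotes: ['remote "aquaremote"']
    section = "'remote \"aquaremote\"'"
    config[section] = {
        "access_key_id": access_key,
        "secret_access_key": secret_key,
    }
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w") as handle:
        config.write(handle)
    # Restrict permissions, since the file holds secrets.
    target.chmod(0o600)
    return target
```

Assuming the key pair has been exported as `EXO_SOS_KEY` and `EXO_SOS_SECRET` (the names used in the README), a call such as `write_dvc_local_config(os.environ["EXO_SOS_KEY"], os.environ["EXO_SOS_SECRET"])` would create the file. Note that this sketch overwrites any existing `.dvc/config.local`.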
## How to Connect to a Workspace
If you've already set up a DevPod workspace and need to reconnect, use the following command:
```bash
# connect to existing devpod on remote machine
devpod up aqua-research --ide vscode --debug
```
(TODO: not sure how to connect from scratch, yet)
## (Optional) Use dvclive to Track Experiments
You can use dvclive, a tool for tracking and visualizing experiments. A new Python file ```test.py``` may have been created for this purpose, which outputs experimental metrics. You can use dvclive to monitor these experiments.
That's it! You are now set up and ready to work with DevPod for your development and experimentation needs at Aqua Research.

101
README.md Normal file

@ -0,0 +1,101 @@
# Aqua Predict Research
Aqua Predict Research Repo.
AI/ML-based groundwater analysis and prediction solutions.
## Repo Structure
* [data](data) - research and development data sets. Data is managed by DVC
* [code](code) - code repo
* [papers](papers) - scientific papers and other information
## Data Versioning
Data is managed by [DVC](https://dvc.org/doc). Later [DetaBord](https://detabord.com) will offer more advanced data and AI management.
DVC is built on top of git. This means everything is git managed. Use the normal git workflow to use this repository. DVC adds additional features to manage (large) data files.
### DVC Setup
DVC manages data metadata and uses remote data repositories to store the actual data sets. The preferred data storage provider is Exoscale, but this S3 service is not ready yet; in the meantime, Azure Blob Storage in a local German zone (West Germany) is used.
Environment (python)
```bash
conda create -n dvc python=3.11
conda activate dvc
pip install -r requirements.txt
```
DVC Version (3.4.0)
```bash
# ensure you have installed DVC version 3.4.0 or higher
dvc --version
```
More information for DVC installation:
https://dvc.org/doc/install
#### ExoScale (primary S3 storage)
Follow the installation instructions: https://community.exoscale.com/documentation/storage/quick-start/
```bash
brew install s3cmd
```
Create a config file `~/.s3cfg` with the following content:
```ini
[default]
host_base = sos-de-fra-1.exo.io
host_bucket = %(bucket)s.sos-de-fra-1.exo.io
access_key = $EXO_SOS_KEY
secret_key = $EXO_SOS_SECRET
use_https = True
```
You have to request both `$EXO_SOS_KEY` and `$EXO_SOS_SECRET` from us once. `host_bucket` should stay as shown above.
Ensure you have access to ExoScale:
```bash
s3cmd ls
# 2023-07-03 13:40 s3://aqua01
```
Add data remote and use custom ExoScale endpoint:
```bash
dvc remote add -d aquaremote s3://aqua01 --force
dvc remote modify aquaremote endpointurl https://sos-de-fra-1.exo.io
# this will modify the file ".dvc/config"
```
DVC requires ExoScale credentials. Provide them locally only, so that they are not written to the shared `.dvc/config`:
```bash
dvc remote modify aquaremote --local access_key_id $EXO_SOS_KEY
dvc remote modify aquaremote --local secret_access_key $EXO_SOS_SECRET
# this will create a new file "config.local" that contains credentials for using ExoScale
```
Again, both `$EXO_SOS_KEY` and `$EXO_SOS_SECRET` are the same values already stored in `~/.s3cfg`.
Use `dvc push` and `dvc pull` for data handling.
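Because both files carry the same key pair, the values can be read back from `~/.s3cfg` instead of being retyped. Here is a minimal sketch using Python's standard library, assuming the file layout shown above. The function name `read_s3cfg_credentials` is hypothetical and not part of s3cmd or DVC.

```python
import configparser
from pathlib import Path


def read_s3cfg_credentials(path="~/.s3cfg"):
    """Return (access_key, secret_key) from the [default] section of an s3cmd config."""
    # Interpolation is disabled because host_bucket contains "%(bucket)s",
    # which configparser's default interpolation would try to expand
    # whenever that key is read.
    parser = configparser.ConfigParser(interpolation=None)
    with Path(path).expanduser().open() as handle:
        parser.read_file(handle)
    section = parser["default"]
    return section["access_key"], section["secret_key"]
```

The returned pair can then be passed to the two `dvc remote modify aquaremote --local` commands shown above.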
#### Azure (alternative S3 storage)
Azure accounts are managed by Active Directory. Invites shall be sent via email. Contact jb@gradient0.com for help with the accounts.
Users with access to the aqua01 storage account have the "Storage Blob Data Contributor" role assignment. To access the blob storage, set up the connection via the Azure CLI.
Install Azure CLI
[https://learn.microsoft.com/en-us/cli/azure/install-azure-cli?source=recommendations](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli?source=recommendations)
Then log in to your Azure account:
`az login`
Then add and configure the data remote:
`dvc remote add -d aquaremote azure://aqua01`
`dvc remote modify aquaremote account_name 'aqua01'`
This will use the local Azure CLI config for storage access.
Use `dvc push` and `dvc pull` for data handling. Refer to the DVC docs (see above) for detailed information.

14
dvc.yaml Normal file

@ -0,0 +1,14 @@
params:
  - dvclive/params.yaml
metrics:
  - dvclive/metrics.json
plots:
  - dvclive/plots/metrics:
      x: step
  - dvclive/plots/custom/iris_feature_importance.json:
      template: bar_horizontal
      x: importance
      y: name
      title: 'Iris Dataset: Feature Importance'
      x_label: Feature Importance
      y_label: Feature Name
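
The second plot entry reads the `importance` and `name` fields from `dvclive/plots/custom/iris_feature_importance.json`. To illustrate the data shape this implies, here is a minimal sketch that writes such a file with the Python standard library. It assumes the file is a JSON array of flat records. The helper `write_feature_importance` is hypothetical, and the importance values are placeholders for illustration, not results from this project.

```python
import json
from pathlib import Path


def write_feature_importance(
    records, path="dvclive/plots/custom/iris_feature_importance.json"
):
    """Write plot datapoints as a JSON array of {"name", "importance"} records."""
    for record in records:
        # Every record must carry the fields named by x and y in dvc.yaml.
        if "name" not in record or "importance" not in record:
            raise ValueError(f"record is missing 'name' or 'importance': {record}")
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(records, indent=2))
    return target


# Placeholder values for illustration only.
example_records = [
    {"name": "petal length (cm)", "importance": 0.5},
    {"name": "sepal width (cm)", "importance": 0.1},
]
```

Calling `write_feature_importance(example_records)` from the repository root would create the file at the path that `dvc.yaml` references.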

2
requirements.txt Normal file

@ -0,0 +1,2 @@
dvc[all]==3.4.0
dvclive
