
Aqua Predict Research

Research repository for AI/ML-based groundwater analysis and prediction solutions.

Repo Structure

  • data - research and development data sets, managed by DVC
  • code - source code
  • papers - scientific papers and related material

Data Versioning

Data is managed by DVC; later, DetaBord will offer more advanced data and AI management. DVC is built on top of git, so everything is managed by git: use the normal git workflow in this repository. DVC adds additional features for managing (large) data files.
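
A minimal sketch of one way to version a changed data set with DVC (the directory name data here is illustrative; if your data is produced by a dvc.yaml stage, use dvc repro instead of dvc add):

dvc pull                      # fetch the current data from the remote
dvc add data                  # track (changed) files under data with DVC
git add data.dvc .gitignore   # version the generated metadata with git
git commit -m "update data"
dvc push                      # upload the actual data files to the DVC remote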

DVC Setup

DVC manages the data metadata and uses remote data repositories to store the actual data sets. The preferred data storage provider is Exoscale. Since this S3 service is not ready yet, Azure Blob Storage in a local German zone (West Germany) is used in the meantime.

Environment (python)

conda create -n dvc python=3.11
conda activate dvc
pip install -r requirements.txt

DVC Version (3.4.0)

# ensure you have installed DVC version 3.4.0 or higher
dvc --version

More information for DVC installation: https://dvc.org/doc/install

ExoScale (primary S3 storage)

Follow the installation instructions: https://community.exoscale.com/documentation/storage/quick-start/

brew install s3cmd

Create a config file ~/.s3cfg with the following content:

[default]
host_base = sos-de-fra-1.exo.io
host_bucket = %(bucket)s.sos-de-fra-1.exo.io
access_key = $EXO_SOS_KEY
secret_key = $EXO_SOS_SECRET
use_https = True

You have to request $EXO_SOS_KEY and $EXO_SOS_SECRET from us once; replace the placeholders with the actual values. host_bucket should stay as above.
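
The dvc remote modify commands below reference the same credentials as shell variables. One way to make them available (a sketch; adjust to your shell and secret handling) is:

export EXO_SOS_KEY=<your access key>
export EXO_SOS_SECRET=<your secret key>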

Ensure you have access to ExoScale:

s3cmd ls
# 2023-07-03 13:40  s3://aqua01

Add the data remote and configure the custom ExoScale endpoint:

dvc remote add -d aquaremote s3://aqua01 --force
dvc remote modify aquaremote endpointurl https://sos-de-fra-1.exo.io
# this will modify the file ".dvc/config"
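
After these commands, .dvc/config should contain roughly the following (this file is committed to git and holds no secrets):

[core]
    remote = aquaremote
['remote "aquaremote"']
    url = s3://aqua01
    endpointurl = https://sos-de-fra-1.exo.io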

DVC requires the ExoScale credentials; we provide them locally only:

dvc remote modify aquaremote --local access_key_id $EXO_SOS_KEY
dvc remote modify aquaremote --local secret_access_key $EXO_SOS_SECRET
# this will create the file ".dvc/config.local" that contains the credentials for using ExoScale

Again, $EXO_SOS_KEY and $EXO_SOS_SECRET are the same values already stored in ~/.s3cfg.
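
The resulting .dvc/config.local should look roughly like this; DVC's own .dvc/.gitignore excludes it, so the credentials are not committed:

['remote "aquaremote"']
    access_key_id = <your access key>
    secret_access_key = <your secret key>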

Use dvc push and dvc pull for data handling.
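
To check that the remote is reachable and see which data differs between the local cache and the remote, a quick check is:

dvc status -c   # compare the local cache against the default remote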

Azure (alternative blob storage)

Azure accounts are managed by Active Directory. Invites are sent via email. Contact jb@gradient0.com for help with the accounts.

Users with access to the aqua01 storage account have the "Storage Blob Data Contributor" role assignment. To access the blob storage, set up the connection via the Azure CLI.

Install the Azure CLI: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli?source=recommendations

Then, log in to your Azure account:

az login

Then add and configure the data remote:

dvc remote add -d aquaremote azure://aqua01

dvc remote modify aquaremote account_name 'aqua01'

This will use the local Azure CLI config for storage access.
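
With the Azure remote configured, .dvc/config should then look roughly like this:

[core]
    remote = aquaremote
['remote "aquaremote"']
    url = azure://aqua01
    account_name = aqua01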

Use dvc push and dvc pull for data handling. Refer to the DVC docs (see above) for detailed information.