Artur Susdorf 802ceeb397 new metric 2024-06-26 10:59:17 +02:00

Detabord Template

Template repository for Detabord.

Repo Structure

  • data - Data sets. Data is managed by DVC
  • code - Code repo

Data Versioning

Data is managed by DVC, a version control system for data sets. It tracks changes in data sets and lets team members share them. DVC is built on top of Git: Git tracks the code and DVC's small metadata files, while DVC stores the (large) data files themselves in remote storage. Use the normal Git workflow to work with this repository. With DVC you can also track your experiments and their progress by instrumenting your code, and collaborate on ML experiments the way software engineers collaborate on code.

Setup Environment

Create and activate your Python environment first:

conda create -n my-env python=3.11
conda activate my-env

Use the package manager pip to install dependencies:

pip install -r requirements.txt

Ensure you have installed DVC version 3.4.0 or higher:

dvc --version

More information for DVC installation: https://dvc.org/doc/install
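The version requirement can also be checked automatically. The following is a minimal sketch, not part of this repo: the helper name version_ge is made up for illustration, and it assumes that your sort supports -V (version sort) and that dvc --version prints the version number as its first word.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit code 0) if version $1 >= version $2.
# Assumes "sort -V" (version sort) is available on this system.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="3.4.0"

if command -v dvc >/dev/null 2>&1; then
  # Take the first word of the output as the version number (assumption).
  installed="$(dvc --version | awk '{print $1}')"
  if version_ge "$installed" "$required"; then
    echo "DVC $installed is recent enough"
  else
    echo "DVC $installed is older than $required, please upgrade" >&2
  fi
else
  echo "dvc not found on PATH" >&2
fi
```

Version sort is used instead of a plain string comparison because a string comparison would rank 3.10.0 below 3.4.0.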

Setup S3 storage credentials (ExoScale Demo Bucket)

Follow the installation instructions at https://community.exoscale.com/documentation/storage/quick-start/

We prefer the CLI tool s3cmd. On macOS, install it with Homebrew:

brew install s3cmd

Create a config file ~/.s3cfg with the following content:

[default]
host_base = sos-at-vie-1.exo.io
host_bucket = %(bucket)s.sos-at-vie-1.exo.io
access_key = PLEASE_REQUEST_YOUR_API_ACCESS_KEY
secret_key = PLEASE_REQUEST_YOUR_API_ACCESS_SECRET
use_https = True

Request both PLEASE_REQUEST_YOUR_API_ACCESS_KEY and PLEASE_REQUEST_YOUR_API_ACCESS_SECRET from the Gradient0 team. Leave host_base and host_bucket as shown above.

Ensure you have access to ExoScale's bucket called detabord-demo:

s3cmd info s3://detabord-demo
# s3://detabord-demo/ (bucket):
#    Location:  at-vie-1
#    Payer:     BucketOwner
#    Expiration Rule: none
#    Policy:    none
#    CORS:      <?xml version="1.0" encoding="UTF-8"?><CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/"></CORSConfiguration>
#    ACL:       gradient-zero-softwareentwicklungsgmbh: FULL_CONTROL

Connect S3 bucket to DVC

The steps in this section are already done in this repo. The commands are commented out and listed for reference only.

Initializing DVC in this project (already done in this repo):

# dvc init

Create new remote in DVC and use custom ExoScale endpoint:

# dvc remote add -d detabord-demo-remote s3://detabord-demo --force
# dvc remote modify detabord-demo-remote endpointurl https://sos-at-vie-1.exo.io
# this will modify the file ".dvc/config"

Provide your credentials locally

DVC requires ExoScale credentials. Provide them locally so that they are never committed to the repository:

dvc remote modify detabord-demo-remote --local access_key_id PLEASE_REQUEST_YOUR_API_ACCESS_KEY
dvc remote modify detabord-demo-remote --local secret_access_key PLEASE_REQUEST_YOUR_API_ACCESS_SECRET
# this will create a new file ".dvc/config.local" that contains the credentials for using ExoScale

Again, PLEASE_REQUEST_YOUR_API_ACCESS_KEY and PLEASE_REQUEST_YOUR_API_ACCESS_SECRET are the same values that you already stored in ~/.s3cfg.
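To avoid typing the credentials twice, you can read them from the s3cmd config file and pass them to DVC. This is only a sketch under assumptions: the helper s3cfg_get and the S3CFG variable are made up for illustration, and the config file is assumed to use plain "key = value" lines as shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the value of key $1 from the s3cmd-style config file $2.
# Only the first "key = value" line for that key is used; "=" inside the value is kept.
s3cfg_get() {
  sed -n "s/^$1[[:space:]]*=[[:space:]]*//p" "$2" | head -n 1
}

# S3CFG is an illustrative override; by default the file from the previous step is used.
cfg="${S3CFG:-$HOME/.s3cfg}"

# Only touch the DVC config if dvc is installed and the s3cmd config exists.
if command -v dvc >/dev/null 2>&1 && [ -f "$cfg" ]; then
  dvc remote modify detabord-demo-remote --local access_key_id "$(s3cfg_get access_key "$cfg")"
  dvc remote modify detabord-demo-remote --local secret_access_key "$(s3cfg_get secret_key "$cfg")"
fi
```

Run it from the repository root. Because of --local, the values end up in the local DVC config only and are not committed.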

Pushing data with DVC

If you want to work with data, please follow the instructions: https://dvc.org/doc/start/data-management/data-versioning

# just for reference: how data/super-secret.txt was added to DVC and uploaded to the bucket
dvc add data/super-secret.txt
git add data/.gitignore data/super-secret.txt.dvc
dvc push
# 1 file pushed

Pulling data with DVC

dvc pull

Prepare Remote Execution

Make sure that a remote machine instance is already connected to your organization in Detabord. If not, follow the steps below.

Generate a new SSH key pair for the remote machine

# context: local machine
ssh-keygen -t ed25519 -C "remote@machine.com" -f org-key
# leave the passphrase empty when prompted

Copy content to Detabord - SSH Key for Organizations.

Connect to the remote machine and add the public key to the authorized keys:

# context: remote machine
nano /root/.ssh/authorized_keys
# add the content of org-key.pub
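If you prefer not to edit the file by hand, the key can be appended from the shell. A minimal sketch, assuming org-key.pub has already been copied to the remote machine; the function name add_authorized_key is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: appends the public key in file $1 to the authorized_keys
# file $2, unless an identical line is already present.
add_authorized_key() {
  pubkey_file="$1"
  auth_file="$2"
  mkdir -p "$(dirname "$auth_file")"
  touch "$auth_file"
  chmod 600 "$auth_file"
  # -x: match whole lines, -F: fixed strings, -f: read the patterns from a file
  if ! grep -qxFf "$pubkey_file" "$auth_file"; then
    cat "$pubkey_file" >> "$auth_file"
  fi
}

# Example call on the remote machine:
# add_authorized_key org-key.pub /root/.ssh/authorized_keys
```

Running the function twice with the same key leaves a single entry in the file.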

Add Remote Machine

Add Remote Machine in Detabord at "Machine: Machine for Organizations":

Name: remote-machine
User: root
SSH Key: select org-key
Host: 95.217.101.177
Port: 22

Provide Remote Machine Access to Repo

In Detabord, go to User's Settings > Applications and create an access token. For simplicity, the current user is used here. However, we recommend creating a dedicated user in whose name the remote machine accesses and commits to the repository.

Token Name: User Token for Remote machine
Select Permissions:
- organization: read
- repository: read and write

Copy the token and save it in Organization's Settings > Gitea Token:

Name: User Token for Remote machine
Token: <...>

Provide Remote Machine Access to Remote Datasets

In Organization's Settings, go to "Devpod Credential" and add the following credentials:

# must be in sync with your local DVC credentials
Remote: detabord-demo-remote
key: access_key_id
value: <...>

and a second one:

# must be in sync with your local DVC credentials
Remote: detabord-demo-remote
key: secret_access_key
value: <...>

Appendix: Create experiment stage (test)

# create a stage named simple_run that tracks the parameter "simple" and two dependencies
dvc stage add -n simple_run \
  -p simple \
  -d code/simple.py \
  -d data/super-secret.txt \
  python main.py