# Detabord Template

Detabord template repo.

## Repo Structure

* [data](data) - Data sets. Data is managed by DVC
* [code](code) - Code repo

## Data Versioning

Data is managed by [DVC](https://dvc.org/doc). DVC is a version control system for data sets. It is used to track changes in data sets and to share data sets between team members.

DVC is built on top of git, so everything in this repository is git managed. Use the normal git workflow to work with this repository. DVC adds features for managing (large) data files.

With DVC you can track your experiments and their progress by only instrumenting your code, and collaborate on ML experiments like software engineers do for code.

## Setup Environment

Create and activate your Python environment first:

```bash
conda create -n my-env python=3.11
conda activate my-env
```

Use the package manager pip to install dependencies:

```bash
pip install -r requirements.txt
```

Ensure you have installed DVC version 3.4.0 or higher:

```bash
dvc --version
```

More information on DVC installation: https://dvc.org/doc/install

## Setup S3 storage credentials (ExoScale Demo Bucket)

Follow the installation instructions: https://community.exoscale.com/documentation/storage/quick-start/

We prefer to use the CLI tool `s3cmd`. Install it with:

```bash
brew install s3cmd
```

Create a config file `~/.s3cfg` with the following content:

```bash
[default]
host_base = sos-at-vie-1.exo.io
host_bucket = %(bucket)s.sos-at-vie-1.exo.io
access_key = PLEASE_REQUEST_YOUR_API_ACCESS_KEY
secret_key = PLEASE_REQUEST_YOUR_API_ACCESS_SECRET
use_https = True
```

Request both `PLEASE_REQUEST_YOUR_API_ACCESS_KEY` and `PLEASE_REQUEST_YOUR_API_ACCESS_SECRET` from the Gradient0 Team. Leave `host_base` and `host_bucket` as shown above.

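
As an alternative to editing `~/.s3cfg` by hand, the config described above could be generated from environment variables. The following is a minimal sketch, not part of this repo: the function name `write_s3cfg` and the variable names `EXO_ACCESS_KEY` and `EXO_SECRET_KEY` are hypothetical and chosen only for illustration.

```bash
# Sketch: write an s3cmd config from environment variables.
# EXO_ACCESS_KEY and EXO_SECRET_KEY are hypothetical variable names (assumption).
write_s3cfg() {
  local target="${1:-$HOME/.s3cfg}"

  # Refuse to write a config with empty credentials.
  if [ -z "${EXO_ACCESS_KEY:-}" ] || [ -z "${EXO_SECRET_KEY:-}" ]; then
    echo "EXO_ACCESS_KEY and EXO_SECRET_KEY must be set" >&2
    return 1
  fi

  # Create the file with owner-only permissions, since it stores a secret.
  (
    umask 077
    cat > "$target" <<EOF
[default]
host_base = sos-at-vie-1.exo.io
host_bucket = %(bucket)s.sos-at-vie-1.exo.io
access_key = ${EXO_ACCESS_KEY}
secret_key = ${EXO_SECRET_KEY}
use_https = True
EOF
  ) || return 1

  # Tighten permissions as well in case the file already existed.
  chmod 600 "$target"
}
```

Calling `write_s3cfg` without an argument writes to `~/.s3cfg` and overwrites an existing file; pass a path as the first argument to write somewhere else.
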
Ensure you have access to ExoScale's bucket called `detabord-demo`:

```bash
s3cmd info s3://detabord-demo
# s3://detabord-demo/ (bucket):
#   Location: at-vie-1
#   Payer: BucketOwner
#   Expiration Rule: none
#   Policy: none
#   CORS:
#   ACL: gradient-zero-softwareentwicklungsgmbh: FULL_CONTROL
```

## Connect S3 bucket to DVC

### Already prepared in this repo

Initializing DVC in this project (already done in this repo):

```bash
# dvc init
```

Create a new remote in DVC and use the custom ExoScale endpoint:

```bash
# dvc remote add -d detabord-demo-remote s3://detabord-demo --force
# dvc remote modify detabord-demo-remote endpointurl https://sos-at-vie-1.exo.io
# this will modify the file ".dvc/config"
```

### Provide your credentials locally

DVC requires ExoScale credentials. Provide them locally so that they are not committed to GitHub:

```bash
dvc remote modify detabord-demo-remote --local access_key_id PLEASE_REQUEST_YOUR_API_ACCESS_KEY
dvc remote modify detabord-demo-remote --local secret_access_key PLEASE_REQUEST_YOUR_API_ACCESS_SECRET
# this will create a new file ".dvc/config.local" that contains credentials for using ExoScale
```

Again, `PLEASE_REQUEST_YOUR_API_ACCESS_KEY` and `PLEASE_REQUEST_YOUR_API_ACCESS_SECRET` are the same values already stored in `~/.s3cfg`.

## Pushing data with DVC

If you want to work with data, please follow the instructions: https://dvc.org/doc/start/data-management/data-versioning

```bash
# just for reference: how data/super-secret.txt was added to DVC and uploaded to the bucket
dvc add data/super-secret.txt
git add data/.gitignore data/super-secret.txt.dvc
dvc push
# 1 file pushed
```

## Pulling data with DVC

```bash
dvc pull
```

## Prepare Remote Execution

Make sure that a remote machine instance is already connected to your organization in Detabord.
If not, please follow these instructions:

### Generate a new SSH key pair for the remote machine

```bash
# context: local machine
ssh-keygen -t ed25519 -C "remote@machine.com" -f org-key
# no passphrase
```

Copy the content to Detabord - SSH Key for Organizations.

Connect to the remote machine and add the public key to the authorized keys:

```bash
# context: remote machine
nano /root/.ssh/authorized_keys
# add the content of org-key.pub
```

### Add Remote Machine

Add the remote machine in Detabord at "Machine: Machine for Organizations":

```bash
Name: remote-machine
User: root
SSH Key: select org-key
Host: 95.217.101.177
Port: 22
```

### Provide Remote Machine Access to Repo

In Detabord, go to User's Settings > Applications. You can also create a separate user in whose name the remote machine accesses and commits to the repository. For the sake of simplicity, the current user is used here, but a dedicated user is recommended.

```bash
Token Name: User Token for Remote machine
Select Permissions:
- organization: read
- repository: read and write
```

Copy the token and save it in Organization's Settings > Gitea Token:

```bash
Name: User Token for Remote machine
Token: <...>
```

### Provide Remote Machine Access to Remote Datasets

In Organization's Settings, go to "Devpod Credential" and add the following credentials:

```bash
# must be in sync with your local DVC credentials
Remote: detabord-demo-remote
key: access_key_id
value: <...>
```

and a second one:

```bash
# must be in sync with your local DVC credentials
Remote: detabord-demo-remote
key: secret_access_key
value: <...>
```

## Appendix: Create experiment stage (test)

```bash
# create
dvc stage add -n simple_run \
  -p simple \
  -d code/simple.py \
  -d data/super-secret.txt \
  python main.py
```
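
The stage above declares two file dependencies and one parameter. As a minimal sketch, the helper below checks that such inputs exist before the stage is run. `check_stage_inputs` is a hypothetical helper, not part of DVC, and `params.yaml` in the example is an assumption based on DVC's default parameters file for `-p`.

```bash
# Hypothetical helper (not part of DVC): report every listed path that is missing.
# Returns 0 when all paths exist, 1 otherwise.
check_stage_inputs() {
  local missing=0
  local path
  for path in "$@"; do
    if [ ! -e "$path" ]; then
      echo "missing stage input: $path" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example (assumes the data was fetched with `dvc pull` and params.yaml exists):
# check_stage_inputs code/simple.py data/super-secret.txt params.yaml && dvc repro simple_run
```

Running the check first gives one message per missing file, so a forgotten `dvc pull` shows up before the stage starts.
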