initial commit

Artur Susdorf 2024-06-26 10:52:38 +02:00
commit 0b44893a77
24 changed files with 421 additions and 0 deletions

1
.devcontainer.json Normal file

@ -0,0 +1 @@
{"image":"mcr.microsoft.com/devcontainers/python:3"}

27
.devcontainer/Dockerfile Normal file

@ -0,0 +1,27 @@
FROM python:3.11
# Add non-root user
ARG USERNAME=nonroot
RUN groupadd --gid 1000 $USERNAME && \
useradd --uid 1000 --gid 1000 -m $USERNAME
## Make sure to reflect new user in PATH
ENV PATH="/home/${USERNAME}/.local/bin:${PATH}"
USER $USERNAME
## Pip dependencies
# Upgrade pip
RUN pip install --upgrade pip
# Install production dependencies
COPY --chown=nonroot:1000 requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt && \
rm /tmp/requirements.txt
# Install development dependencies
COPY --chown=nonroot:1000 requirements-dev.txt /tmp/requirements-dev.txt
RUN pip install -r /tmp/requirements-dev.txt && \
rm /tmp/requirements-dev.txt
# fix: https://github.com/iterative/dvc/issues/10431
RUN pip install pygit2==1.14.1


@ -0,0 +1,8 @@
{
"build": {
"dockerfile": "Dockerfile",
"context": ".."
},
"remoteUser": "nonroot"
}

3
.dvc/.gitignore vendored Normal file

@ -0,0 +1,3 @@
/config.local
/tmp
/cache

5
.dvc/config Normal file

@ -0,0 +1,5 @@
[core]
remote = detabord-demo-remote
['remote "detabord-demo-remote"']
url = s3://detabord-demo
endpointurl = https://sos-at-vie-1.exo.io

3
.dvcignore Normal file

@ -0,0 +1,3 @@
# Add patterns of files dvc should ignore, which could improve
# the performance. Learn more at
# https://dvc.org/doc/user-guide/dvcignore

8
.gitignore vendored Normal file

@ -0,0 +1,8 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
.exoscale
.boto
.s3cfg

197
README.md Normal file

@ -0,0 +1,197 @@
# Detabord Template
Detabord template Repo.
## Repo Structure
* [data](data) - Data sets. Data is managed by DVC
* [code](code) - Code repo
## Data Versioning
Data is managed by [DVC](https://dvc.org/doc). DVC is a version control system for data sets. It is used to track changes in data sets and to share data sets between team members. DVC is built on top of git. This means everything is git managed. Use the normal git workflow to use this repository. DVC adds additional features to manage (large) data files. With DVC you can easily track your experiments and their progress by only instrumenting your code, and collaborate on ML experiments like software engineers do for code.
## Setup Environment
Create and activate your python environment first:
```bash
conda create -n my-env python=3.11
conda activate my-env
```
Use the package manager pip to install dependencies:
```bash
pip install -r requirements.txt
```
Ensure you have installed DVC version 3.4.0 or higher:
```bash
dvc --version
```
More information for DVC installation:
https://dvc.org/doc/install
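The same version check can be done from Python. This is a minimal sketch, assuming DVC was installed with pip into the active environment; the helper names `version_tuple` and `dvc_is_recent_enough` are illustrative and not part of DVC.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_DVC = (3, 4, 0)  # minimum version named in this README


def version_tuple(text):
    """Turn '3.4.0' into (3, 4, 0); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    # Pad short versions such as '3.4' so tuple comparison behaves as expected.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def dvc_is_recent_enough(installed):
    """Compare an installed version string against MIN_DVC."""
    return version_tuple(installed) >= MIN_DVC


if __name__ == "__main__":
    try:
        installed = version("dvc")
    except PackageNotFoundError:
        print("dvc is not installed in this environment")
    else:
        print(installed, "ok" if dvc_is_recent_enough(installed) else "too old")
```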
## Setup S3 storage credentials (ExoScale Demo Bucket)
Follow the installation instructions: https://community.exoscale.com/documentation/storage/quick-start/
We prefer to use the CLI tool `s3cmd`. Install it with:
```bash
brew install s3cmd
```
Create a config file `~/.s3cfg` with the following content:
```bash
[default]
host_base = sos-at-vie-1.exo.io
host_bucket = %(bucket)s.sos-at-vie-1.exo.io
access_key = PLEASE_REQUEST_YOUR_API_ACCESS_KEY
secret_key = PLEASE_REQUEST_YOUR_API_ACCESS_SECRET
use_https = True
```
Request the values for `PLEASE_REQUEST_YOUR_API_ACCESS_KEY` and `PLEASE_REQUEST_YOUR_API_ACCESS_SECRET` from the Gradient0 team. Leave `host_base` and `host_bucket` as shown above.
Ensure you have access to ExoScale's bucket called `detabord-demo`:
```bash
s3cmd info s3://detabord-demo
# s3://detabord-demo/ (bucket):
# Location: at-vie-1
# Payer: BucketOwner
# Expiration Rule: none
# Policy: none
# CORS: <?xml version="1.0" encoding="UTF-8"?><CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/"></CORSConfiguration>
# ACL: gradient-zero-softwareentwicklungsgmbh: FULL_CONTRO
```
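Before calling `s3cmd`, it can help to confirm that the config file has all the expected entries. This is a minimal sketch using only the standard library; the function name `missing_s3cfg_keys` and the sample values `EXAMPLE_KEY` / `EXAMPLE_SECRET` are made up for illustration. Interpolation is switched off because `host_bucket` contains the literal `%(bucket)s` placeholder.

```python
import configparser

REQUIRED_KEYS = ("host_base", "host_bucket", "access_key", "secret_key")


def missing_s3cfg_keys(text):
    """Return the required keys that are absent or empty in an s3cfg text."""
    # interpolation=None keeps '%(bucket)s' literal instead of expanding it.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section("default"):
        return list(REQUIRED_KEYS)
    section = parser["default"]
    return [key for key in REQUIRED_KEYS if not section.get(key, "").strip()]


# Sample content mirroring the layout above (placeholder credentials).
SAMPLE = """\
[default]
host_base = sos-at-vie-1.exo.io
host_bucket = %(bucket)s.sos-at-vie-1.exo.io
access_key = EXAMPLE_KEY
secret_key = EXAMPLE_SECRET
use_https = True
"""

print(missing_s3cfg_keys(SAMPLE))  # prints []
```

To check the real file, pass `pathlib.Path.home().joinpath(".s3cfg").read_text()` instead of `SAMPLE`.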
## Connect S3 bucket to DVC
### Already prepared in this repo
Initializing DVC in this project (already done in this repo):
```bash
# dvc init
```
Create a new remote in DVC and point it at the custom ExoScale endpoint:
```bash
# dvc remote add -d detabord-demo-remote s3://detabord-demo --force
# dvc remote modify detabord-demo-remote endpointurl https://sos-at-vie-1.exo.io
# this will modify the file ".dvc/config"
```
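To see what these commands leave behind, the resulting `.dvc/config` can be inspected from Python. This is a rough sketch: DVC has its own config loader, and the standard-library `configparser` is used here only as a stand-in that happens to read this simple file. The function name `default_remote` is hypothetical.

```python
import configparser


def default_remote(config_text):
    """Return (name, url, endpointurl) of the default remote in a DVC config."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    name = parser["core"]["remote"]
    # DVC writes remote sections as ['remote "<name>"'], quotes included.
    section = parser[f"'remote \"{name}\"'"]
    return name, section["url"], section.get("endpointurl")


# Content of .dvc/config as committed in this repo.
CONFIG = """\
[core]
    remote = detabord-demo-remote
['remote "detabord-demo-remote"']
    url = s3://detabord-demo
    endpointurl = https://sos-at-vie-1.exo.io
"""

print(default_remote(CONFIG))
```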
### Provide your credentials locally
DVC requires ExoScale credentials. Provide them locally so that they are never committed to the repository:
```bash
dvc remote modify detabord-demo-remote --local access_key_id PLEASE_REQUEST_YOUR_API_ACCESS_KEY
dvc remote modify detabord-demo-remote --local secret_access_key PLEASE_REQUEST_YOUR_API_ACCESS_SECRET
# this will create a new file "config.local" that contains credentials for using ExoScale
```
Again, `PLEASE_REQUEST_YOUR_API_ACCESS_KEY` and `PLEASE_REQUEST_YOUR_API_ACCESS_SECRET` are the same values already stored in `~/.s3cfg`.
## Pushing data with DVC
If you want to work with data, please follow the instructions: https://dvc.org/doc/start/data-management/data-versioning
```bash
# just for reference: how data/super-secret.txt was added to DVC and uploaded to the bucket
dvc add data/super-secret.txt
git add data/.gitignore data/super-secret.txt.dvc
dvc push
# 1 file pushed
```
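`dvc add` writes a small pointer file (`data/super-secret.txt.dvc`) that records the MD5 hash and size of the data file; only the pointer is committed to git. The sketch below checks a local file against such a pointer. It assumes that `hash: md5` in DVC 3.x means the plain MD5 of the file bytes; the helper names `file_md5` and `matches_pointer` are illustrative, not DVC API.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pointer(path, expected_md5, expected_size):
    """Return True if the file has the size and MD5 recorded in a .dvc file."""
    path = Path(path)
    return path.stat().st_size == expected_size and file_md5(path) == expected_md5
```

After `dvc pull`, calling `matches_pointer("data/super-secret.txt", "5ebe2294ecd0e0f08eab7690d2a6ee69", 6)` with the values from the pointer file should confirm that the local copy is the tracked version.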
## Pulling data with DVC
```bash
dvc pull
```
## Prepare Remote Execution
Make sure that a remote machine instance is already connected to your organization in Detabord. If not, follow the steps below:
### Generate a new SSH key pair for the remote machine
```bash
# context: local machine
ssh-keygen -t ed25519 -C "remote@machine.com" -f org-key
# no passphrase
```
Copy content to Detabord - SSH Key for Organizations.
Connect to the remote machine and add the public key to the authorized keys:
```bash
# context: remote machine
nano /root/.ssh/authorized_keys
# add the content of org-key.pub
```
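If you prefer not to edit the file by hand, the key can be appended with a short script. This is a minimal sketch, assuming Python is available on the remote machine; the function name `add_authorized_key` is hypothetical. It skips the write when the key is already listed, so running it twice is harmless.

```python
from pathlib import Path


def add_authorized_key(public_key, authorized_keys):
    """Append public_key to the file unless an identical line already exists.

    Returns True if the key was added, False if it was already present.
    """
    key = public_key.strip()
    target = Path(authorized_keys)
    target.parent.mkdir(parents=True, exist_ok=True)
    text = target.read_text() if target.exists() else ""
    if key in (line.strip() for line in text.splitlines()):
        return False
    # Start on a fresh line if the existing file lacks a trailing newline.
    prefix = "" if not text or text.endswith("\n") else "\n"
    with target.open("a") as handle:
        handle.write(prefix + key + "\n")
    target.chmod(0o600)  # keep the file readable and writable by its owner only
    return True
```

On the remote machine this would be called as `add_authorized_key(Path("org-key.pub").read_text(), "/root/.ssh/authorized_keys")`, after copying `org-key.pub` there.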
### Add Remote Machine
Add Remote Machine in Detabord at "Machine: Machine for Organizations":
```bash
Name: remote-machine
user : root
SSH Key: select org-key
Host: 95.217.101.177
Port: 22
```
### Provide Remote Machine Access to Repo
In Detabord, go to User's Settings > Applications and create an access token with the settings below. You can also create a separate user in whose name the remote machine accesses and commits to the repository. For simplicity, the current user is used here, but a dedicated user is recommended.
```bash
Token Name: User Token for Remote machine
Select Permissions:
- organization: read
- repository: read and write
```
Copy the token and save it in Organization's Settings > Gitea Token:
```bash
Name: User Token for Remote machine
Token: <...>
```
### Provide Remote Machine Access to Remote Datasets
In Organization's Settings, go to "Devpod Credential" and add the following credentials:
```bash
# must be in sync with your local DVC credentials
Remote: detabord-demo-remote
key: access_key_id
value: <...>
```
and a second one:
```bash
# must be in sync with your local DVC credentials
Remote: detabord-demo-remote
key: secret_access_key
value: <...>
```
## Appendix: Create experiment stage (test)
```bash
# create
dvc stage add -n simple_run \
-p simple \
-d code/simple.py \
-d data/super-secret.txt \
python main.py
```

0
code/__init__.py Normal file

27
code/simple.py Normal file

@ -0,0 +1,27 @@
from dvclive import Live


def run_simple_experiment():
    datapoints = [
        {"name": "petal_width", "importance": 0.4},
        {"name": "petal_length", "importance": 0.33},
        {"name": "sepal_width", "importance": 0.24},
        {"name": "sepal_length", "importance": 0.03}
    ]

    with Live() as live:
        live.log_param("myParam", 123)
        live.log_metric("myMetric", 543)
        live.log_metric("new_metric", 333)
        live.log_plot(
            "iris_feature_importance",
            datapoints,
            x="importance",
            y="name",
            template="bar_horizontal",
            title="Iris Dataset: Feature Importance",
            y_label="Feature Name",
            x_label="Feature Importance"
        )

1
data/.gitignore vendored Normal file

@ -0,0 +1 @@
/super-secret.txt

1
data/data.local.txt Normal file

@ -0,0 +1 @@
This data is fully visible in the repository


@ -0,0 +1,5 @@
outs:
- md5: 5ebe2294ecd0e0f08eab7690d2a6ee69
  size: 6
  hash: md5
  path: super-secret.txt

13
dvc.lock Normal file
View File

@ -0,0 +1,13 @@
schema: '2.0'
stages:
  simple_run:
    cmd: python main.py
    deps:
    - path: code/simple.py
      hash: md5
      md5: 8647c8d1057de1a1cba9ceb2d0bb7d5a
      size: 724
    - path: data/super-secret.txt
      hash: md5
      md5: 5ebe2294ecd0e0f08eab7690d2a6ee69
      size: 6

6
dvc.yaml Normal file
View File

@ -0,0 +1,6 @@
stages:
  simple_run:
    cmd: python main.py
    deps:
    - code/simple.py
    - data/super-secret.txt

14
dvclive/dvc.yaml Normal file

@ -0,0 +1,14 @@
params:
- params.yaml
metrics:
- metrics.json
plots:
- plots/metrics:
    x: step
- plots/custom/iris_feature_importance.json:
    template: bar_horizontal
    x: importance
    y: name
    title: 'Iris Dataset: Feature Importance'
    x_label: Feature Importance
    y_label: Feature Name

3
dvclive/metrics.json Normal file
View File

@ -0,0 +1,3 @@
{
"myMetric": 543
}

1
dvclive/params.yaml Normal file
View File

@ -0,0 +1 @@
myParam: 123


@ -0,0 +1,18 @@
[
{
"name": "petal_width",
"importance": 0.4
},
{
"name": "petal_length",
"importance": 0.33
},
{
"name": "sepal_width",
"importance": 0.24
},
{
"name": "sepal_length",
"importance": 0.03
}
]


@ -0,0 +1,2 @@
step myMetric
0 543

64
dvclive/report.html Normal file

@ -0,0 +1,64 @@
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="refresh" content="5">
<title>DVC Plot</title>
<script src="https://cdn.jsdelivr.net/npm/vega@5.20.2"></script>
<script src="https://cdn.jsdelivr.net/npm/vega-lite@5.2.0"></script>
<script src="https://cdn.jsdelivr.net/npm/vega-embed@6.18.2"></script>
<style>
table {
border-spacing: 15px;
}
</style>
</head>
<body>
<div id="params_yaml" style="text-align: center; padding: 10x">
<p>params_yaml</p>
<div style="display: flex;justify-content: center;">
<table>
<thead>
<tr><th style="text-align: right;"> myParam</th></tr>
</thead>
<tbody>
<tr><td style="text-align: right;"> 123</td></tr>
</tbody>
</table>
</div>
</div>
<div id="metrics_json" style="text-align: center; padding: 10x">
<p>metrics_json</p>
<div style="display: flex;justify-content: center;">
<table>
<thead>
<tr><th style="text-align: right;"> myMetric</th></tr>
</thead>
<tbody>
<tr><td style="text-align: right;"> 543</td></tr>
</tbody>
</table>
</div>
</div>
<div id = "static_myMetric">
<script type = "text/javascript">
var spec = {"$schema": "https://vega.github.io/schema/vega-lite/v5.json", "data": {"values": [{"step": "0", "myMetric": "543", "rev": "workspace"}]}, "title": "myMetric", "width": 300, "height": 300, "params": [{"name": "smooth", "value": 0.001, "bind": {"input": "range", "min": 0.001, "max": 1, "step": 0.001}}], "layer": [{"mark": "line", "encoding": {"x": {"field": "step", "type": "quantitative", "title": "step"}, "y": {"field": "myMetric", "type": "quantitative", "title": "myMetric", "scale": {"zero": false}}, "color": {"field": "rev", "type": "nominal"}, "tooltip": [{"field": "step", "title": "step", "type": "quantitative"}, {"field": "myMetric", "title": "myMetric", "type": "quantitative"}]}, "transform": [{"loess": "myMetric", "on": "step", "groupby": ["rev", "filename", "field", "filename::field"], "bandwidth": {"signal": "smooth"}}]}, {"mark": {"type": "line", "opacity": 0.2}, "encoding": {"x": {"field": "step", "type": "quantitative", "title": "step"}, "y": {"field": "myMetric", "type": "quantitative", "title": "myMetric", "scale": {"zero": false}}, "color": {"field": "rev", "type": "nominal"}, "tooltip": [{"field": "step", "title": "step", "type": "quantitative"}, {"field": "myMetric", "title": "myMetric", "type": "quantitative"}]}}, {"mark": {"type": "circle", "size": 10, "tooltip": {"content": "encoding"}}, "encoding": {"x": {"aggregate": "max", "field": "step", "type": "quantitative", "title": "step"}, "y": {"aggregate": {"argmax": "step"}, "field": "myMetric", "type": "quantitative", "title": "myMetric", "scale": {"zero": false}}, "color": {"field": "rev", "type": "nominal"}}}]};
vegaEmbed('#static_myMetric', spec);
</script>
</div>
<div id = "iris_feature_importance">
<script type = "text/javascript">
var spec = {"$schema": "https://vega.github.io/schema/vega-lite/v5.json", "data": {"values": [{"name": "petal_width", "importance": 0.4, "rev": "workspace"}, {"name": "petal_length", "importance": 0.33, "rev": "workspace"}, {"name": "sepal_width", "importance": 0.24, "rev": "workspace"}, {"name": "sepal_length", "importance": 0.03, "rev": "workspace"}]}, "title": "Iris Dataset: Feature Importance", "width": 300, "height": 300, "mark": {"type": "bar"}, "encoding": {"x": {"field": "importance", "type": "quantitative", "title": "Feature Importance", "scale": {"zero": false}}, "y": {"field": "name", "type": "nominal", "title": "Feature Name"}, "yOffset": {"field": "rev"}, "color": {"field": "rev", "type": "nominal"}}};
vegaEmbed('#iris_feature_importance', spec);
</script>
</div>
</body>
</html>

11
main.py Normal file

@ -0,0 +1,11 @@
from code.simple import run_simple_experiment


def main():
    print("Running main...")
    run_simple_experiment()
    print("Running done!")


if __name__ == "__main__":
    main()

1
requirements-dev.txt Normal file

@ -0,0 +1 @@
# no one

2
requirements.txt Normal file

@ -0,0 +1,2 @@
dvc[all]==3.4.0
dvclive