Lab 03 - Introduction to Python - external libraries

1. Activity Identity

Activity title Introduction to Robotics
Topic Python / AI / NLP
Authors Institute of Robotics and Machine Intelligence
Dominik Belter, Jakub Chudzinski, Marcin Czajka, Kamil Młodzikowski
Target learners Bachelor (Computer Science / IT, Robotics)
Estimated duration 1.5 hours
Difficulty level Beginner
FOSSBot environment Hybrid
Licence CC BY 4.0

2. Learning Objectives and Competences

| ID | Learning outcome | Related competences | Assessment evidence |
|---|---|---|---|
| LO1 | Students will be able to install and isolate external Python libraries using venv and pip. | Knowledge of Python tooling; selecting programming tools | Screenshot of the working virtual environment after pip install |
| LO2 | Students will be able to implement a small classical text-classification pipeline with scikit-learn (TF-IDF + LogisticRegression). | Selecting programming tools; using libraries for designing perception components | Working classifier_sklearn.py and its JSON output on the basic test file |
| LO3 | Students will be able to use a pretrained multilingual model from sentence-transformers to match text by meaning. | Using libraries for designing perception components; selecting programming tools | Working classifier_st.py and its JSON output on the multilingual test file |

3. Prerequisites

4. Required Material and Setup

| Category | Item | Version / Quantity | Notes |
|---|---|---|---|
| Hardware | Workstation | 1 per student | Any Linux PC. |
| Software | Python 3.10+, pip, venv | bundled with most Linux distributions | Pre-installed on the lab workstations. |
| Software | git | bundled with most Linux distributions | Used to clone the starter repository. |
| Dataset / model | paraphrase-multilingual-MiniLM-L12-v2 | downloaded on first run | Around 120 MB, cached after the first use. Hosted on Hugging Face. |
| Starter code | fossbot-text-to-cmd | from GitHub | Provides the CLI skeleton, dataset and TODO blocks you will fill in. |

5. Safety, Ethics and Accessibility Notes

The only risks in this lab are operational:

6. Scenario and Problem Statement

In this lab you will build a small command-line tool that takes a natural-language command (for example "go forward") and outputs the corresponding wheel motor speeds as JSON - the kind of format a low-level robot driver would consume.

You will implement two text-classifiers and compare them:

  1. A classical machine-learning pipeline built from scratch with scikit-learn (TF-IDF + LogisticRegression).
  2. A pretrained multilingual sentence-transformer used through similarity matching.

7. Lab Workflow

| Phase | Student action | Expected output | Time |
|---|---|---|---|
| 1. Setup | Create venv, install dependencies | Working environment | 10 min |
| 2. Classical ML | Implement TF-IDF + LogisticRegression in classifier_sklearn.py | sklearn classifier passes basic test | 25 min |
| 3. Pretrained AI | Implement similarity matcher in classifier_st.py | sentence-transformer classifier works | 20 min |
| 4. Experiments | Add the two multilingual runs and inspect the diff | 4 JSON outputs | 10 min |
| 5. Understand | Read the conceptual explanation | Understand the role of features vs classifier | 10 min |
| 6. Bonus (optional) | Drive a physical robot with your classifier | Robot moves on text commands | - |
| 7. Cleanup | Deactivate venv, remove starter directory | Clean /tmp for the next user | 2 min |
| 8. Reflection | Answer the analysis questions | Short answers | 13 min |

8. Step-by-Step Instructions

Step 1 - Environment preparation

💡 Lab workstation credentials. Every workstation in the lab uses the same local account: username put, password lrm.

  1. Log in to your lab workstation.

  2. Open a terminal (Ctrl+Alt+T on Ubuntu).

  3. Clean up state from any previous lab session. Remove leftover screenshots and any starter directory from a previous run, so your final submission only contains artifacts from this session and git clone does not fail with destination path already exists:

rm -rf ~/Pictures/Screenshots /tmp/fossbot-text-to-cmd

  4. Clone the starter repository into a fresh directory and enter it:

cd /tmp
git clone https://github.com/LRMPUT/fossbot-text-to-cmd.git
cd fossbot-text-to-cmd

💡 Tip: All lab work happens in /tmp rather than in your home directory. /tmp is the conventional location for scratch work and is wiped on every reboot, so the workstation stays clean for the next user.

  5. Create an isolated Python environment with venv and activate it:

python3 -m venv .venv
source .venv/bin/activate

After activation your prompt should change to start with (.venv). From now on, every python and pip command runs inside this environment, not the system one.
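
If you are unsure whether the environment is really active, you can also ask the interpreter itself. The sketch below relies on the standard-library behaviour that inside a venv `sys.prefix` differs from `sys.base_prefix`; the helper name `in_virtualenv` is ours and is not part of the starter code.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the interpreter the venv was created from.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("inside venv:", in_virtualenv())
    print("interpreter:", sys.executable)
```

Run it with `python` after activation; the interpreter path should point into `.venv/bin/`.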

  6. Install a CPU-only PyTorch first. By default pip would download the CUDA build of PyTorch (~5 GB of NVIDIA libraries). We do not need GPU support in this lab, so pull the much smaller CPU build from PyTorch’s dedicated index:

pip install torch --index-url https://download.pytorch.org/whl/cpu

  7. Install the remaining dependencies declared in requirements.txt:

pip install -r requirements.txt

Together the two commands download roughly 900 MB of packages. The first install takes a few minutes.

  8. Verify the install by importing each library:

python -c "import pandas, sklearn, sentence_transformers; print('OK')"

Expected result: The terminal prints OK. If anything fails to import, re-read the pip install output for an error and re-run the install.
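
If an import fails, it can help to see which distributions pip actually installed. The following is a small sketch using the standard-library `importlib.metadata`; the helper name `installed_versions` and the list of distribution names are assumptions about what requirements.txt contains, so adjust them to your file.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names as pip knows them (scikit-learn, not sklearn).
    # This list is an assumption, not copied from requirements.txt.
    wanted = ["pandas", "scikit-learn", "sentence-transformers", "torch"]
    for name, version in installed_versions(wanted).items():
        print(f"{name:24s} {version or 'NOT INSTALLED'}")
```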

📸 Capture for submission: screenshot the terminal showing the successful OK line and a prompt that starts with (.venv).

Step 2 - Classical ML classifier

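You will now fill in src/classifier_sklearn.py. The class builds a tiny but complete classical-ML text classifier from two scikit-learn building blocks: TfidfVectorizer, which turns each text into a numeric feature vector, and LogisticRegression, which classifies that vector.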

Open src/classifier_sklearn.py in any editor and complete the TODOs (each TODO block in the skeleton also has a direct link to the relevant function docs):

  1. TODO 1 - build the pipeline. In __init__, create a sklearn Pipeline with two named steps - a TfidfVectorizer first, then a LogisticRegression:

self.pipeline = Pipeline([
    ("vec", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

  2. TODO 2 - load the CSV. Use pandas to read the file pointed to by csv_path. The CSV has two columns: text and action.

  3. TODO 3 - train / test split. Use train_test_split with test_size=0.2 and random_state=42 so the result is reproducible:

x_train, x_test, y_train, y_test = train_test_split(
    df["text"], df["action"], test_size=0.2, random_state=42
)

  4. TODO 4 - fit the pipeline on the training split.

  5. TODO 5 - print accuracy. Predict on the test set, compute the accuracy with accuracy_score, and print the result:

predictions = self.pipeline.predict(x_test)
print(f"Test accuracy: {accuracy_score(y_test, predictions):.3f}")

  6. TODO 6 and 7 - predict on a single text. This one is a small synthesis challenge: the API of Pipeline.predict and Pipeline.predict_proba has a twist that you need to work around. Implement predict(self, text) so it returns an (action, confidence) tuple where:

    • action is the predicted class label as a single string.
    • confidence is the maximum class probability (a float between 0 and 1).
Hint - things to figure out from the docs
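
To explore the API without touching your lab file, you can poke at a throwaway pipeline first. The sketch below trains on six toy phrases invented for this illustration (they are not the lab dataset) and prints the shapes that `predict` and `predict_proba` return for a single input; turning those into the required tuple is left to you.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data, invented for this illustration only.
texts = ["go forward", "move ahead", "go back", "reverse now", "stop", "halt now"]
labels = ["forward", "forward", "backward", "backward", "stop", "stop"]

toy = Pipeline([
    ("vec", TfidfVectorizer(lowercase=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])
toy.fit(texts, labels)

# Both methods expect an iterable of documents, so a single string is
# wrapped in a list here.
predicted = toy.predict(["go forward"])             # 1-D array, one label per input
probabilities = toy.predict_proba(["go forward"])   # 2-D array, one row per input

print(type(predicted).__name__, predicted.shape)
print(probabilities.shape)
print(list(toy.classes_))  # column order of predict_proba
```

Note that both results are arrays even for one input, and that the columns of `predict_proba` follow the order of `classes_`.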

When the file is complete, run the classifier on the basic input file:

python -m src.text_to_wheels \
    --input data/examples/basic.txt \
    --output outputs/sklearn_basic.json \
    --classifier sklearn

Expected result: The terminal prints something like Test accuracy: 0.800 (the split is fixed by random_state=42, so the number should be the same on every run, although it may differ slightly between scikit-learn versions) and Processed 7 commands with the 'sklearn' classifier. Open outputs/sklearn_basic.json and verify that each command was mapped to the right action (forward, turn_left, stop, …).

📸 Capture for submission: screenshot the terminal showing the test accuracy, the success message and a one-line-per-prediction summary of the result (so the screenshot fits on screen):

python -c "import json; [print(f\"{r['input']:22s} -> {r['action']:12s} ({r['confidence']:.3f})\") for r in json.load(open('outputs/sklearn_basic.json'))]"
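
If the one-liner is hard to read or edit, the same summary can be written as a short script. This is only a sketch: the function names are ours, and it assumes each record in the output file has the `input`, `action` and `confidence` keys used by the one-liner above.

```python
import json
from pathlib import Path


def summarise(records):
    """Format one 'input -> action (confidence)' line per prediction record."""
    return [
        f"{r['input']:22s} -> {r['action']:12s} ({r['confidence']:.3f})"
        for r in records
    ]


def summarise_file(path):
    """Load a JSON output file and return its formatted summary lines."""
    return summarise(json.loads(Path(path).read_text(encoding="utf-8")))


if __name__ == "__main__":
    target = Path("outputs/sklearn_basic.json")
    if target.exists():
        print("\n".join(summarise_file(target)))
    else:
        print(f"{target} not found - run the classifier first")
```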

Experiment with the classical pipeline

Once the classifier passes the basic test, try the following variations one at a time. Re-run after each change and note how the test accuracy changes.

  1. Change ngram_range from (1, 2) to (1, 1) in your TfidfVectorizer. This drops bigrams - the model only sees individual words now. Does accuracy go up, down or stay the same? Think about why that might be for our short English commands.
Expected outcome

For our small dataset the accuracy on the held-out test set actually goes up with (1, 1) - around 0.93 versus 0.80 with (1, 2) or (1, 3). The English commands are short and the unigrams are already very informative on their own; adding bigrams introduces extra TF-IDF features that overfit the small training set and hurt generalisation on the test set. The takeaway: more features are not always better - on small datasets they can add noise rather than signal.
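
To see how quickly bigrams inflate the feature space, you can count vocabulary entries directly. The sketch below uses three toy commands invented for this illustration, not the lab CSV.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy commands, invented for this illustration.
commands = ["go forward", "turn left", "turn right"]

unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(commands)
with_bigrams = TfidfVectorizer(ngram_range=(1, 2)).fit(commands)

# vocabulary_ maps every learned token (or token pair) to a column index.
print(len(unigrams.vocabulary_), sorted(unigrams.vocabulary_))
print(len(with_bigrams.vocabulary_), sorted(with_bigrams.vocabulary_))
```

Five distinct words become eight features once bigrams are included, yet the number of training examples has not changed.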

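  2. Change random_state=42 to another value (for example 7 or 0). Run data/examples/multilingual.txt with the sklearn classifier (the command is listed in Step 4). On these non-English inputs the classifier collapses to a single default class; Step 5 explains why. Does the same class still win the bias collapse, or did another class take over? What does that tell you about the role of random_state?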
Expected outcome

backward only wins the collapse for random_state=42. With random_state=0, 1, 3, 7 or 100 the dominant prediction switches to stop (15 of 15 multilingual inputs are predicted as stop). The bias-winner is an artefact of how the random shuffle happened to split the classes; it is not a property of the data, the classifier or the Polish/German/Spanish inputs. The mechanism (all-zero TF-IDF vector -> only biases matter) is identical for every seed.
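
The effect of the seed on the training split can be made visible without training anything. The following sketch uses a stand-in dataset of 75 rows; the balanced 15-per-action layout is an assumption made for illustration and is not read from training_commands.csv.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Stand-in dataset: 5 actions x 15 examples = 75 rows. The balanced class
# counts are an assumption, not taken from the lab CSV.
actions = ["forward", "backward", "turn_left", "turn_right", "stop"]
labels = [a for a in actions for _ in range(15)]
texts = [f"{a} example {i}" for a in actions for i in range(15)]


def training_counts(seed):
    """Count how many examples of each action land in the 80% training split."""
    _, _, y_train, _ = train_test_split(
        texts, labels, test_size=0.2, random_state=seed
    )
    return Counter(y_train)


for seed in (42, 0, 7):
    print(seed, dict(training_counts(seed)))
```

Each seed leaves a slightly different number of examples per action in the training split, which is the kind of imbalance that nudges the learned biases.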

  3. Optional - bigger change. Append a few Polish phrases for each action to data/training_commands.csv (e.g. do przodu,forward). Re-run on multilingual.txt. Did the sklearn classifier suddenly become multilingual on Polish? Why or why not?
Expected outcome

With three Polish phrases per action added, all five Polish inputs in multilingual.txt (do przodu, skręć w lewo, stop, do tyłu, skręć w prawo) are now classified correctly. The German and Spanish inputs are almost all still wrong, because their words are still missing from the TF-IDF vocabulary - they continue to produce all-zero vectors and collapse to whichever class wins the bias (now stop after the new training distribution). The lesson is concrete: TF-IDF cannot generalise beyond the languages and exact words in its training data. Full multilingual support would require training examples for every language you care about.

Revert the changes before moving on so the rest of the lab runs against the original setup.

Step 3 - Pretrained multilingual classifier

You will now fill in src/classifier_st.py. This classifier needs no training data of its own. It uses a pretrained multilingual model from the sentence-transformers library that already understands the meaning of sentences in around 50 languages.

The idea:

TEMPLATES = {
    "forward":    ["go forward", "forward", "ahead"],
    "backward":   ["go back", "backwards", "reverse"],
    "turn_left":  ["turn left", "left", "rotate left"],
    "turn_right": ["turn right", "right", "rotate right"],
    "stop":       ["stop", "halt", "stay still"],
}

Open src/classifier_st.py in any editor and complete the TODOs (each TODO block in the skeleton also has a direct link to the relevant function docs):

  1. TODO 1 - load the model. In __init__, instantiate the pretrained model:

self.model = SentenceTransformer(model_name)

The first call downloads the model (around 120 MB) and caches it under ~/.cache/huggingface/.

  2. TODO 2 - pre-encode the templates. For each action in TEMPLATES, encode its template phrases once and store the resulting matrix in self.template_embeddings:

for action, phrases in TEMPLATES.items():
    self.template_embeddings[action] = self.model.encode(phrases)

  3. TODO 3, 4, 5 - predict. The main coding challenge. Implement predict(self, text) so it returns an (action, similarity) tuple where action is the action whose templates are most similar to the input and similarity is the cosine similarity of that best match.
Hint - composition pattern
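
One possible way to compose the steps is sketched below with NumPy and hand-picked 3-dimensional vectors standing in for real embeddings (the real model returns 384 numbers per text). Taking the maximum over an action's templates is one reasonable reading of "most similar"; averaging would be another. All names and values here are illustrative assumptions, not the reference solution.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings", hand-picked for illustration; in the
# lab these matrices come from self.model.encode(phrases).
template_embeddings = {
    "forward": np.array([[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]]),
    "stop":    np.array([[0.0, 0.2, 0.9], [0.1, 0.0, 1.0]]),
}
query = np.array([0.8, 0.2, 0.1])  # stands in for the encoded input text


def cosine_to_rows(vector, matrix):
    """Cosine similarity between one vector and every row of a matrix."""
    return (matrix @ vector) / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
    )


# Step 1: best similarity per action (max over that action's templates).
best_per_action = {
    action: float(cosine_to_rows(query, matrix).max())
    for action, matrix in template_embeddings.items()
}
# Step 2: the action with the highest of those scores wins.
best_action = max(best_per_action, key=best_per_action.get)

print(best_action, round(best_per_action[best_action], 3))
```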

When the file is complete, run the classifier on the same basic input:

python -m src.text_to_wheels \
    --input data/examples/basic.txt \
    --output outputs/st_basic.json \
    --classifier st

Expected result: The terminal prints Processed 7 commands with the 'st' classifier. and outputs/st_basic.json looks similar to the sklearn output - every command was mapped to the right action.

📸 Capture for submission: screenshot the terminal showing the success message and the compact summary of the result:

python -c "import json; [print(f\"{r['input']:22s} -> {r['action']:12s} ({r['confidence']:.3f})\") for r in json.load(open('outputs/st_basic.json'))]"

Experiment with the sentence-transformer

Try the following one at a time and observe what changes:

  1. Reduce the templates. Keep only one phrase per action in TEMPLATES (delete the other two). Re-run on multilingual.txt. Does accuracy hold up, or do some inputs now miss? Why might that be?
Expected outcome

Accuracy holds up - on the standard multilingual.txt you still get 15 out of 15 correct even with a single template per action. The pretrained model already understands meaning, so a single anchor phrase per action is enough; the multilingual capability comes from the model itself, not from the number of templates. Extra phrases mainly help on tricky or ambiguous inputs that are equally close to several actions - they widen the “catchment area” without changing the language coverage.

  2. Test your own languages. Append 2-3 phrases in another language you know (Italian, French, Russian, …) to data/examples/multilingual.txt. Re-run and check the predictions. Did the model handle the new languages?
Expected outcome

The model typically handles them well - paraphrase-multilingual-MiniLM-L12-v2 was pretrained on around 50 languages, so Italian (“avanti”, “indietro”), French (“avance”, “arrête”), Russian (“вперёд”, “стой”), Czech (“dopředu”) and Portuguese (“para frente”) all map to the right action without any extra work. The occasional miss happens on rarer words, but the coverage is impressive for a small, free model that you did not have to “tell” about any of those languages.

  3. Swap the model. Change model_name to "all-MiniLM-L6-v2" (an English-only sibling of our multilingual model). Re-run on both basic.txt and multilingual.txt. What happens? What does this tell you about where the multilingual capability lives?
Expected outcome

On basic.txt (English inputs) the English-only model still works perfectly - it knows English and the templates are English. But multilingual.txt collapses to roughly 4 out of 15: a couple of Polish words happen to be close enough to English ones for the model to recover, but German and Spanish drop almost entirely. The lesson: the multilingual behaviour lives in the pretrained model, not in our templates or in the classifier logic on top. The model AND the inputs need to share a common embedding space - when you swap to a model that only learned English, only English inputs continue to work.

Revert the changes before moving on.

Step 4 - Compare the two approaches

You have two working classifiers and two example files. Now run all four combinations and look at the differences:

# Sklearn on basic.txt (already done in Step 2)
python -m src.text_to_wheels \
    --input data/examples/basic.txt \
    --output outputs/sklearn_basic.json \
    --classifier sklearn

# Sklearn on multilingual
python -m src.text_to_wheels \
    --input data/examples/multilingual.txt \
    --output outputs/sklearn_multilingual.json \
    --classifier sklearn

# Sentence transformer on basic.txt (already done in Step 3)
python -m src.text_to_wheels \
    --input data/examples/basic.txt \
    --output outputs/st_basic.json \
    --classifier st

# Sentence transformer on multilingual
python -m src.text_to_wheels \
    --input data/examples/multilingual.txt \
    --output outputs/st_multilingual.json \
    --classifier st

Compare the four output files. The expected pattern is:

| Input | Classifier | Expected accuracy |
|---|---|---|
| basic.txt | sklearn | high (~7/7) |
| basic.txt | st | high (~7/7) |
| multilingual.txt | sklearn | low (~2/15) |
| multilingual.txt | st | high (~14-15/15) |

Look at outputs/sklearn_multilingual.json - the sklearn classifier collapses to almost always predicting the same action with very low confidence. Compare with outputs/st_multilingual.json, where Polish, German and Spanish commands all map to the correct action.

📸 Capture for submission: screenshot the compact summaries of both multilingual outputs printed back to back. Run the one-liner once per file:

echo "=== sklearn ===" && python -c "import json; [print(f\"{r['input']:22s} -> {r['action']:12s} ({r['confidence']:.3f})\") for r in json.load(open('outputs/sklearn_multilingual.json'))]" && echo "=== st ===" && python -c "import json; [print(f\"{r['input']:22s} -> {r['action']:12s} ({r['confidence']:.3f})\") for r in json.load(open('outputs/st_multilingual.json'))]"
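
To inspect the diff between the two multilingual runs programmatically, a small comparison helper can list only the inputs on which the classifiers disagree. This is a sketch: the function name is ours, and it assumes both files hold a list of records with `input` and `action` keys, as the one-liner above does.

```python
import json
from pathlib import Path


def disagreements(records_a, records_b):
    """Return (input, action_a, action_b) for inputs the two runs label differently."""
    by_input_b = {r["input"]: r["action"] for r in records_b}
    return [
        (r["input"], r["action"], by_input_b[r["input"]])
        for r in records_a
        if r["input"] in by_input_b and r["action"] != by_input_b[r["input"]]
    ]


if __name__ == "__main__":
    a_path = Path("outputs/sklearn_multilingual.json")
    b_path = Path("outputs/st_multilingual.json")
    if a_path.exists() and b_path.exists():
        a = json.loads(a_path.read_text(encoding="utf-8"))
        b = json.loads(b_path.read_text(encoding="utf-8"))
        for text, sk, st in disagreements(a, b):
            print(f"{text:22s} sklearn={sk:12s} st={st}")
    else:
        print("Run both multilingual commands first.")
```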

Step 5 - Why these approaches differ on multilingual input

Every ML classifier operates on numbers. Text must first be turned into a vector. The whole difference between our two approaches is how that conversion happens, not which algorithm consumes the result afterwards.

How sklearn TF-IDF represents text

Step 1 - build a vocabulary. When the pipeline is fitted, TfidfVectorizer reads the training split (60 of the 75 examples in training_commands.csv, given test_size=0.2) and builds a dictionary of unique tokens. The vocabulary might look like:

vocab = ["go", "forward", "ahead", "turn", "left", "right", "back", "stop", ...]
                                  (say a few hundred unique tokens)

Step 2 - one vector per text. Each text becomes a vector with one dimension per vocabulary token (so a few hundred dimensions for our vocabulary). For "go forward":

[1, 1, 0, 0, 0, 0, 0, 0, ...]
 ^  ^
 |  └── "forward" is present
 └───── "go" is present
        all other positions are zero

(In practice the values are not 1 or 0 but TF * IDF weights; the principle is the same.)
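
The presence vector above can be reproduced by hand in a few lines of plain Python. This sketch uses the same toy vocabulary as the illustration and simple whitespace splitting; the real TfidfVectorizer tokenises with a regular expression and stores weighted values, so treat it as a simplified model of the idea.

```python
# Toy vocabulary from the illustration above; the real one is built by
# TfidfVectorizer from the training split.
vocab = ["go", "forward", "ahead", "turn", "left", "right", "back", "stop"]


def presence_vector(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary tokens occur in the text."""
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]


print(presence_vector("go forward", vocab))  # [1, 1, 0, 0, 0, 0, 0, 0]
print(presence_vector("turn left", vocab))   # [0, 0, 0, 1, 1, 0, 0, 0]
```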

Step 3 - the classifier learns. LogisticRegression learns rules like “when the positions for go and forward are set, the answer is forward”.

What happens with "fahre vorwärts"?

TF-IDF looks up "fahre" in the vocabulary    -> NOT THERE (never seen)
TF-IDF looks up "vorwärts" in the vocabulary -> NOT THERE (never seen)

Vector: [0, 0, 0, 0, 0, 0, ..., 0]   (all zeros)

The classifier receives an all-zero vector. There is no signal to act on, so it always predicts the same class. Which one? LogisticRegression has learned a per-class bias - a small built-in tendency that says, in effect, “when nothing else is informative, this is my default guess for this class”. The class with the highest bias wins. Biases are not arbitrary; they were adjusted during training and roughly reflect how often each action appeared in the training split. With our exact data and random_state=42, the backward class ended up with the highest bias, so it wins every tie - that is the backward-with-confidence-around-0.25 collapse you saw in Step 4. The exact confidence value may differ by a few thousandths between scikit-learn versions, but the winning class is reproducible. With a different random seed a different class would win (in fact stop wins for most other seeds), but the mechanism is the same.
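
The bias mechanism can be checked on a throwaway model. The sketch below trains a small pipeline on seven English toy phrases invented for this illustration (not the lab dataset), confirms that unseen foreign words produce an all-zero feature row, and prints the learned intercepts next to the predictions so you can compare them yourself.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy English training set, invented for this illustration.
texts = ["go forward", "move ahead", "go back", "reverse now",
         "stop", "halt now", "stop now"]
labels = ["forward", "forward", "backward", "backward", "stop", "stop", "stop"]

toy = Pipeline([
    ("vec", TfidfVectorizer(lowercase=True)),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(texts, labels)

unseen = ["fahre vorwärts", "adelante"]
features = toy.named_steps["vec"].transform(unseen)
print("non-zero features:", features.nnz)  # no vocabulary word matched

clf = toy.named_steps["clf"]
default_class = clf.classes_[np.argmax(clf.intercept_)]
print("intercepts:", dict(zip(clf.classes_, clf.intercept_.round(3))))
print("class with the largest intercept:", default_class)
print("predictions:", list(toy.predict(unseen)))
```

With zero features the weighted sum vanishes, so the score of each class is just its intercept, and every unseen input receives the class with the largest one.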

Key idea: TF-IDF only knows the words it saw during training. This is lexical matching (comparing exact tokens), not semantic matching.

How sentence-transformers represents text

A sentence-transformer is a model that has already been trained, in our case paraphrase-multilingual-MiniLM-L12-v2. It was trained on billions of sentences in around 50 languages. During that pretraining it learned a crucial property: sentences with similar meaning get similar vectors - regardless of the language they are written in.

Step 1 - no vocabulary from our data. The model simply uses what it learned during pretraining.

Step 2 - each text becomes a fixed-length dense vector (a vector where every dimension carries a meaningful value, in contrast to TF-IDF where most are zero) that represents its meaning rather than the literal words (tokens) that make it up. The length of that vector is a property of the model: our paraphrase-multilingual-MiniLM-L12-v2 always produces 384 numbers. For illustration:

"go forward"     -> [ 0.20, -0.41,  0.56, ..., -0.11]   (384 numbers)
"do przodu"      -> [ 0.21, -0.43,  0.55, ..., -0.12]   very close
"fahre vorwärts" -> [ 0.22, -0.42,  0.54, ..., -0.13]   also close
"adelante"       -> [ 0.19, -0.40,  0.57, ..., -0.10]   also close
"stop"           -> [-0.15,  0.71, -0.22, ...,  0.34]   FAR from the four above
"halt"           -> [-0.16,  0.70, -0.23, ...,  0.33]   close to "stop"

# Note: a single number (e.g. 0.21) does not represent a word or a single
# concept. The 384 dimensions together form an abstract coordinate;
# meaning is encoded by the position of the whole vector in this space -
# sentences with similar meaning end up close to each other.

In this 384-dimensional embedding space, sentences cluster by meaning - English, Polish, German and Spanish forward-commands all land in the same neighbourhood, while "stop" sits elsewhere.

Step 3 - cosine similarity is the cosine of the angle between two vectors. Vectors pointing in nearly the same direction -> similarity close to 1; unrelated (orthogonal) vectors -> close to 0; opposite vectors -> -1.
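
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal plain-Python sketch follows (the function name is ours; in the lab you would normally use a library routine on NumPy arrays instead):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the lengths are divided out, only the direction of the vectors matters, not their magnitude.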

What happens with "fahre vorwärts"?

1. The model encodes "fahre vorwärts" -> [0.22, -0.42, 0.54, ...]
   (the model understands the meaning - it IS a forward command, in German)

2. Compare to the template embeddings of every action:
   - similarity with "go forward"    = 0.93   <-- VERY HIGH
   - similarity with "go back"       = 0.21
   - similarity with "turn left"     = 0.18
   - similarity with "turn right"    = 0.17
   - similarity with "stop"          = 0.12

3. Highest similarity wins -> action = "forward"

This is semantic matching. The model never saw our specific templates during pretraining, but it learned on billions of sentence pairs that German "fahre vorwärts" and English "go forward" mean the same thing.

Side-by-side comparison

| Aspect | TF-IDF (sklearn) | Sentence Transformer |
|---|---|---|
| Text representation | Sparse vector indexed by training-vocabulary tokens | Fixed-length dense vector representing meaning |
| Where does word knowledge come from? | Only from our 75-row training CSV | From pretraining on billions of multilingual sentences |
| Unseen word | Ignored (contributes 0) | Mapped to a meaningful position in the embedding space - the model breaks unknown words into sub-word units it has seen, so even brand-new words can be located near other texts with similar meaning |
| Languages | Only the language of the training data | Multilingual out of the box |
| What is “knowledge”? | Word statistics from our 75 examples | Meaning patterns learned from billions of sentences in many languages |

The takeaway

The choice of classifier on top (LogisticRegression vs cosine similarity) is not what gives multilingual capability. The representation of the text - the features the classifier sees - is what makes the difference.

TF-IDF features only know the words from our training data. Pretrained sentence embeddings carry knowledge from a massive multilingual corpus. The same lesson applies across modern ML: it is often the features (and how you obtain them, for example by pretraining), not the algorithm, that decides whether a system works.

Step 6 - Optional bonus: drive a physical robot

If you have access to a robot, wire your classifier to a low-level wheel-control API to see the full text -> wheels -> motion pipeline live. A minimal sketch:
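
One way this wiring could look is sketched below. Everything robot-specific is an assumption: `send_wheel_speeds` is a hypothetical stand-in for your robot's driver call, and the speed table is a placeholder (the lab's real mapping lives in src/wheel_mapping.py and is not reproduced here).

```python
# Assumed action -> (left, right) wheel speeds in [-1.0, 1.0]. These values are
# placeholders; the lab's real table lives in src/wheel_mapping.py.
WHEEL_SPEEDS = {
    "forward": (0.5, 0.5),
    "backward": (-0.5, -0.5),
    "turn_left": (-0.5, 0.5),
    "turn_right": (0.5, -0.5),
    "stop": (0.0, 0.0),
}


def send_wheel_speeds(left, right):
    """Hypothetical driver call; replace with your robot's real API."""
    print(f"left={left:+.2f} right={right:+.2f}")


def drive(text, classify, min_confidence=0.0):
    """Classify a text command and forward the matching wheel speeds.

    `classify` is any callable returning (action, confidence), such as the
    predict method of either classifier from this lab.
    """
    action, confidence = classify(text)
    if confidence < min_confidence or action not in WHEEL_SPEEDS:
        # Fail safe: unknown or low-confidence commands stop the robot.
        action = "stop"
    left, right = WHEEL_SPEEDS[action]
    send_wheel_speeds(left, right)
    return action, (left, right)
```

The `min_confidence` threshold is a design choice worth considering on real hardware: a low-confidence guess is safer treated as stop than executed.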

Step 7 - Cleanup

When you have collected all the submission artefacts, leave the workstation in a clean state for the next user. From any directory:

deactivate 2>/dev/null; cd ~ && rm -rf /tmp/fossbot-text-to-cmd

deactivate exits the virtual environment (the trailing 2>/dev/null silences the error message if the venv was already inactive), cd ~ steps out of the starter directory so it can be removed, and rm -rf deletes the starter together with the .venv and all its installed packages.

The Hugging Face model cache under ~/.cache/huggingface/ can be left in place - it speeds up the next session and does not contain any session-specific state.

Expected result: ls /tmp/fossbot-text-to-cmd reports No such file or directory.

9. Analysis Questions

  1. Look at the confidences in /tmp/fossbot-text-to-cmd/outputs/sklearn_multilingual.json. Almost every input ended up with the same prediction and the same confidence value (around 0.25). Explain why this happens.

  2. Look at the TEMPLATES dictionary in /tmp/fossbot-text-to-cmd/src/classifier_st.py. There are only 3 English phrases per action - no Polish, German or Spanish. Why is the sentence-transformer classifier still able to handle multilingual input correctly?

  3. The wheels field in the JSON output uses values in [-1.0, 1.0]. The mapping from action name to wheel speeds is defined in /tmp/fossbot-text-to-cmd/src/wheel_mapping.py. What would you change there to make the robot turn faster on the spot?

  4. The sklearn pipeline is trained from scratch on 75 examples; the sentence-transformer model is loaded already trained. List one advantage and one disadvantage of each approach for a project that needs to recognise 50 different commands instead of 5.

10. Submission Requirements

11. References and Open Licence

Direct links to the specific functions used in this lab are in the TODO comments of the skeleton files (src/classifier_sklearn.py, src/classifier_st.py).

The Creative Commons Attribution 4.0 International (CC BY 4.0) license allows users to share, copy, distribute, and adapt the work, even for commercial purposes, as long as proper credit is given to the original creator.

EU funding disclaimer

Funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Education and Culture Executive Agency (EACEA). Neither the European Union nor EACEA can be held responsible for them.