{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting Started\n",
"\n",
"In this notebook, we will walk through the most basic functionalities of SmartSim.\n",
"\n",
" - Experiment and Models\n",
" - Ensembles\n",
" - Running and Communicating with the Orchestrator\n",
" - Ensembles using SmartRedis\n",
"\n",
"## Experiments and Models \n",
"\n",
"`Experiment`s are how users define workflows in SmartSim. The `Experiment` is used to create `Model` instances which represent applications, scripts, or generally a program. An experiment can start and stop a `Model` and monitor execution.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from smartsim import Experiment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next step is to initialize an `Experiment` instance. The `Experiment` must be provided a name. This name can be any string, but it is best practice to give it a meaningful name as a broad title for what types of models the experiment will be supervising. For our purposes, our `Experiment` will be named `\"getting-started\"`.\n",
"\n",
"The `Experiment` also needs to have a `launcher` specified. Launchers provide SmartSim the ability to construct and execute complex workloads on HPC systems with schedulers (workload managers) like Slurm, or PBS. SmartSim currently supports\n",
" * `slurm`\n",
" * `pbs`\n",
" * `cobalt`\n",
" * `lsf`\n",
" * `local` (single node/laptops)\n",
" * `auto`\n",
"\n",
"If `launcher=\"auto\"` is used, the experiment will attempt to find a launcher on the system, and use the first one it encounters. If a launcher cannot be found or no launcher parameter is provided, the default value of `launcher=\"local\"` will be used. \n",
"\n",
"For simplicity, we will start on a single host and only launch single-host jobs, and as such will set the launcher argument to `\"local\"`"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Init Experiment and specify to launch locally\n",
"exp = Experiment(name=\"getting-started\", launcher=\"local\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To run a workload through SmartSim, a `Model` instance must be created, and started.\n",
"\n",
"Our first `Model` will simply print `hello` using the shell command `echo`.\n",
"\n",
"`Experiment.create_run_settings` is used to create a `RunSettings` instance for our `Model`. `RunSettings` describe *how* a `Model` should be executed provided the system and available computational resources.\n",
"\n",
"`create_run_settings` is a factory method that will instantiate a `RunSettings` object of the appropriate type based on the `run_command` argument (i.e. `mpirun`, `aprun`, `srun`, etc). The default argument of `auto` will attempt to choose a `run_command` based on the available system software and the launcher specified in the experiment. If `run_command=None` is provided, the command will be launched without one."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# settings to execute the command \"echo hello!\"\n",
"settings = exp.create_run_settings(exe=\"echo\", exe_args=\"hello!\", run_command=None)\n",
"\n",
"# create the simple model instance so we can run it.\n",
"M1 = exp.create_model(name=\"tutorial-model\", run_settings=settings)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the `Model` has been created by the `Experiment`, it can be started.\n",
"\n",
"By setting `summary=True`, we can see a summary of the experiment printed before it is launched. The summary will stay for 10 seconds, and it is useful as a last check. If we set `summary=False`, then the experiment would be launched immediately.\n",
"\n",
"We also explicitly set `block=True` (even though it is the default), so that `Experiment.start` waits until the last `Model` has finished before returning: it will act like a job monitor, letting us know if processes run, complete, or fail."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"00:18:27 e3fbeabfdb3e SmartSim[1408] INFO \n",
"\n",
"=== Launch Summary ===\n",
"Experiment: getting-started\n",
"Experiment Path: /home/craylabs/tutorials/getting_started/getting-started\n",
"Launcher: local\n",
"Models: 1\n",
"Database Status: inactive\n",
"\n",
"=== Models ===\n",
"tutorial-model\n",
"Executable: /usr/bin/echo\n",
"Executable Arguments: hello!\n",
"\n",
"\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"00:18:39 e3fbeabfdb3e SmartSim[1408] INFO tutorial-model(1428): Completed\n"
]
}
],
"source": [
"exp.start(M1, block=True, summary=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model has completed. Let's look at the content of the current working directory. We can see that two files, `tutorial-model.out` and `tutorial-model.err` have been created."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Content of tutorial-model.out:\n",
"hello!\n",
"\n",
"Content of tutorial-model.err:\n",
"\n"
]
}
],
"source": [
"outputfile = './tutorial-model.out'\n",
"errorfile = './tutorial-model.err'\n",
"\n",
"print(\"Content of tutorial-model.out:\")\n",
"with open(outputfile, 'r') as fin:\n",
" print(fin.read())\n",
"print(\"Content of tutorial-model.err:\")\n",
"with open(errorfile, 'r') as fin:\n",
" print(fin.read())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `.out` file contains the output generated by `tutorial-model`, and the `.err` file would contain the error messages generated by it. Since there were no errors, the `.err` file is empty.\n",
"\n",
"Now let's run two different `Model` instances at the same time. This is just as easy as running one `Model`, and takes the same steps. This time, we will skip the summary. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"00:18:45 e3fbeabfdb3e SmartSim[1408] INFO tutorial-model-1(1431): Completed\n",
"00:18:48 e3fbeabfdb3e SmartSim[1408] INFO tutorial-model-2(1432): Running\n",
"00:18:49 e3fbeabfdb3e SmartSim[1408] INFO tutorial-model-2(1432): Completed\n"
]
}
],
"source": [
"run_settings_1 = exp.create_run_settings(exe=\"echo\", exe_args=\"hello!\", run_command=None)\n",
"run_settings_2 = exp.create_run_settings(exe=\"sleep\", exe_args=\"5\", run_command=None)\n",
"model_1 = exp.create_model(\"tutorial-model-1\", run_settings_1)\n",
"model_2 = exp.create_model(\"tutorial-model-2\", run_settings_2)\n",
"exp.start(model_1, model_2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For users of parallel applications, launch binaries (run commands) can also be specified in `RunSettings`. For example, if `mpirun` is installed on the system, we can run a model through it, by specifying it as `run_command` in `create_run_settings`.\n",
"\n",
"Please note that to run this you need to have OpenMPI installed. "
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"00:18:53 e3fbeabfdb3e SmartSim[1408] INFO \n",
"\n",
"=== Launch Summary ===\n",
"Experiment: getting-started\n",
"Experiment Path: /home/craylabs/tutorials/getting_started/getting-started\n",
"Launcher: local\n",
"Models: 1\n",
"Database Status: inactive\n",
"\n",
"=== Models ===\n",
"tutorial-model-mpirun\n",
"Executable: /usr/bin/echo\n",
"Executable Arguments: hello world!\n",
"Run Command: mpirun\n",
"Run Arguments:\n",
"\tn = 2\n",
"\n",
"\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"00:19:05 e3fbeabfdb3e SmartSim[1408] INFO tutorial-model-mpirun(1435): Completed\n"
]
}
],
"source": [
"# settings to execute the command \"mpirun -np 2 echo hello world!\"\n",
"openmpi_settings = exp.create_run_settings(exe=\"echo\",\n",
" exe_args=\"hello world!\",\n",
" run_command=\"mpirun\")\n",
"openmpi_settings.set_tasks(2)\n",
"\n",
"# create and start the MPI model\n",
"ompi_model = exp.create_model(\"tutorial-model-mpirun\", openmpi_settings)\n",
"exp.start(ompi_model, summary=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This time, since we asked `mpirun` to run two tasks by calling `openmpi_settings.set_tasks(2)`, in the output file we should find the line `hello world!` twice."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Content of tutorial-model-mpirun.out:\n",
"hello world!\n",
"hello world!\n",
"\n"
]
}
],
"source": [
"outputfile = './tutorial-model-mpirun.out'\n",
"\n",
"print(\"Content of tutorial-model-mpirun.out:\")\n",
"with open(outputfile, 'r') as fin:\n",
" print(fin.read())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ensembles\n",
"\n",
"In the previous example, the two `Model` instances were created separately. The `Ensemble` SmartSim object is a more convenient way of setting up multiple models, potentially with different configurations. `Ensemble`s are groups of `Model` instances that can be treated as a single reference. We start by specifying `RunSettings` similar to how we did with our `Model`s."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# define how we want each ensemble member to execute\n",
"# in this case we create settings to execute \"sleep 3\"\n",
"ens_settings = exp.create_run_settings(exe=\"sleep\", exe_args=\"3\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, instead of creating a single model like we did in previously, we will call the `Experiment.create_ensemble` method to create an `Ensemble`. Let's assume we want to run the same experiment four times in parallel. We will then pass the method the same arguemnts that we might pass `Experiment.create_model` in addition to the `replicas=4` argument. Finally, we simply start the `Ensemble` the same way we wold start a `Model`."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"00:19:08 e3fbeabfdb3e SmartSim[1408] INFO \n",
"\n",
"=== Launch Summary ===\n",
"Experiment: getting-started\n",
"Experiment Path: /home/craylabs/tutorials/getting_started/getting-started\n",
"Launcher: local\n",
"Ensembles: 1\n",
"Database Status: inactive\n",
"\n",
"=== Ensembles ===\n",
"ensemble-replica\n",
"Members: 4\n",
"Batch Launch: False\n",
"\n",
"\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"00:19:24 e3fbeabfdb3e SmartSim[1408] INFO ensemble-replica_0(1443): Completed\n",
"00:19:24 e3fbeabfdb3e SmartSim[1408] INFO ensemble-replica_2(1445): Completed\n",
"00:19:24 e3fbeabfdb3e SmartSim[1408] INFO ensemble-replica_1(1444): Completed\n",
"00:19:25 e3fbeabfdb3e SmartSim[1408] INFO ensemble-replica_3(1446): Completed\n",
"00:19:26 e3fbeabfdb3e SmartSim[1408] INFO ensemble-replica_1(1444): Completed\n",
"00:19:26 e3fbeabfdb3e SmartSim[1408] INFO ensemble-replica_3(1446): Completed\n"
]
}
],
"source": [
"ensemble = exp.create_ensemble(\"ensemble-replica\",\n",
" replicas=4,\n",
" run_settings=ens_settings)\n",
"\n",
"exp.start(ensemble, summary=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the output, we see that four copies of our `Model`, named *ensemble-replica_0*, *ensemble-replica_1*, ... were run. In each output file, we will see that the same output was generated.\n",
"\n",
"Now let's imagine that we don't want to run the *same* model four times, but we want to run variations of it. One way of doing this would be to define four models, and starting them through the `Experiment`.\n",
"\n",
"For a few, simple `Model`s, this would be OK, but what if we needed to run a large number of models, which only differ for some parameter? Defining and adding each one separately would be tedious. For such cases, we will rely on a parameterized `Ensemble` of models.\n",
"\n",
"Say we had a python file `output_my_parameter.py` with this contents:\n",
"```py\n",
"# contents of output_my_parameter.py\n",
"import time\n",
"\n",
"time.sleep(2)\n",
"print(\"Hello, my name is ;tutorial_name; \" + \n",
" \"and my parameter is ;tutorial_parameter;\")\n",
"```\n",
"\n",
"Our goal is to run \n",
"\n",
"```python output_my_parameter.py```\n",
"\n",
"with multiple parameter values substituted where the text contains `;tutorial_name;` and `;tutorial_parameter;`. Clearly, we could pass the parameters as arguments, but in some cases, this could not be possible (e.g. if the parameters were stored in a file or the executable would not accept them from the command line).\n",
"\n",
"First thing first, is that we must again create our run settings:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"rs = exp.create_run_settings(exe=\"python\", exe_args=\"output_my_parameter.py\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, we define the parameters we are going to set and values for those parameters in a dictionary. In this example, we are setting:\n",
"\n",
" 1. `tutorial_name` with values `\"Ellie\"` and `\"John\"`\n",
" 2. `tutorial_parameter` with values `2` and `11`\n",
" \n",
"In the original file `output_my_parameter.py`, which acts as a template, they occur as `;tutorial_name;` and `;tutorial_parameter;`. The semi-colons are used to perform a regexp substitution with the desired values. The semi-colon in this case, is called a *tag* and can be changed.\n",
"\n",
"We pass the parameter ditionary to `Experiment.create_ensemble`, along with the argument `perm_strategy=\"all_perm\"`. This argument means that we want all possible permutations of the given parameters, which are stored in the argument `params`. We have two options for both parameters, thus our ensemble will run 4 instances of the same `Experiment`, just using a different copy of `output_my_parameter.py` created by calling `Experiment.generate()`. We attach the template file to the `Ensemble` instance, generate the augmented python files, and run the experiment."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"00:19:30 e3fbeabfdb3e SmartSim[1408] INFO Working in previously created experiment\n",
"00:19:34 e3fbeabfdb3e SmartSim[1408] INFO ensemble_0(1449): Completed\n",
"00:19:34 e3fbeabfdb3e SmartSim[1408] INFO ensemble_1(1450): Completed\n",
"00:19:34 e3fbeabfdb3e SmartSim[1408] INFO ensemble_2(1451): Completed\n",
"00:19:35 e3fbeabfdb3e SmartSim[1408] INFO ensemble_3(1452): Completed\n",
"00:19:36 e3fbeabfdb3e SmartSim[1408] INFO ensemble_3(1452): Completed\n"
]
}
],
"source": [
"params = {\n",
" \"tutorial_name\": [\"Ellie\", \"John\"],\n",
" \"tutorial_parameter\": [2, 11]\n",
"}\n",
"ensemble = exp.create_ensemble(\"ensemble\", params=params, run_settings=rs, perm_strategy=\"all_perm\")\n",
"\n",
"# to_configure specifies that the files attached should be read and tags should be looked for\n",
"config_file = \"./output_my_parameter.py\"\n",
"ensemble.attach_generator_files(to_configure=config_file)\n",
"\n",
"exp.generate(ensemble, overwrite=True)\n",
"exp.start(ensemble)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see from the output that four instances of our experiment were run, each one named like the `Experiment`, with a numeric suffix at the end: `ensemble_0`, `ensemble_1`, etc. The call to ``Experiment.generate()`` created isolated output directories for each created `Model` in the ensemble and each ensemble member generated its own output files, which was stored in its respective directory."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Content of getting-started/ensemble/ensemble_0/ensemble_0.out:\n",
"Hello, my name is Ellie and my parameter is 2\n",
"\n",
"Content of getting-started/ensemble/ensemble_1/ensemble_1.out:\n",
"Hello, my name is Ellie and my parameter is 11\n",
"\n",
"Content of getting-started/ensemble/ensemble_2/ensemble_2.out:\n",
"Hello, my name is John and my parameter is 2\n",
"\n",
"Content of getting-started/ensemble/ensemble_3/ensemble_3.out:\n",
"Hello, my name is John and my parameter is 11\n",
"\n"
]
}
],
"source": [
"for id in range(4):\n",
" outputfile = f\"getting-started/ensemble/ensemble_{id}/ensemble_{id}.out\"\n",
"\n",
" print(f\"Content of {outputfile}:\")\n",
" with open(outputfile, 'r') as fin:\n",
" print(fin.read())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's it! All possible permutations of the input parameters were used to execute the experiment! Sometimes, the parameter space can be too large to be explored exhaustively. In that case, we can use a different permutation strategy, i.e. `random`. For example, if we want to only use two possible random combinations of our parameter space, we can run the following code, where we specify `n_models=2` and `perm_strategy=\"random\"`."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"00:19:40 e3fbeabfdb3e SmartSim[1408] INFO Working in previously created experiment\n",
"00:19:45 e3fbeabfdb3e SmartSim[1408] INFO ensemble_0(1455): Completed\n",
"00:19:45 e3fbeabfdb3e SmartSim[1408] INFO ensemble_1(1456): Completed\n"
]
}
],
"source": [
"params = {\n",
" \"tutorial_name\": [\"Ellie\", \"John\"],\n",
" \"tutorial_parameter\": [2, 11]\n",
"}\n",
"ensemble = exp.create_ensemble(\"ensemble\", params=params, run_settings=rs, perm_strategy=\"random\", n_models=2)\n",
"config_file = \"./output_my_parameter.py\"\n",
"ensemble.attach_generator_files(to_configure=config_file)\n",
"\n",
"exp.generate(ensemble, overwrite=True)\n",
"exp.start(ensemble)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another possible permutation strategy is `stepped`, but it is also possible to pass a function, which will need to generate combinations of parameters starting from the dictionary. Please refer to the documentation to learn more about this.\n",
"\n",
"\n",
"It is also possible to use different delimiters for the parameter regexp. For example, if we had simmlarly parameterized file named `output_my_parameter_new_tag.py`, with contents:\n",
"```py\n",
"# Contents of output_my_parameter_new_tag.py\n",
"import time\n",
"\n",
"time.sleep(2)\n",
"print(\"Hello, my name is @tutorial_name@ \" + \n",
" \"and my parameter is @tutorial_parameter@\")\n",
"```\n",
"\n",
"We would want to use `@`, instead of `;`, as our *tag*. We can trivially make this adaptation by passing a `tag` argument to our `Experiment.generate` call."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"00:19:46 e3fbeabfdb3e SmartSim[1408] INFO Working in previously created experiment\n",
"00:19:51 e3fbeabfdb3e SmartSim[1408] INFO ensemble_new_tag_0(1459): Completed\n",
"00:19:51 e3fbeabfdb3e SmartSim[1408] INFO ensemble_new_tag_1(1460): Completed\n",
"00:19:51 e3fbeabfdb3e SmartSim[1408] INFO ensemble_new_tag_2(1461): Completed\n",
"00:19:52 e3fbeabfdb3e SmartSim[1408] INFO ensemble_new_tag_3(1462): Completed\n",
"00:19:53 e3fbeabfdb3e SmartSim[1408] INFO ensemble_new_tag_3(1462): Completed\n"
]
}
],
"source": [
"rs = exp.create_run_settings(exe=\"python\", exe_args=\"output_my_parameter_new_tag.py\")\n",
"params = {\n",
" \"tutorial_name\": [\"Ellie\", \"John\"],\n",
" \"tutorial_parameter\": [2, 11]\n",
"}\n",
"ensemble = exp.create_ensemble(\"ensemble_new_tag\",\n",
" params=params,\n",
" run_settings=rs,\n",
" perm_strategy=\"all_perm\")\n",
"\n",
"config_file = \"./output_my_parameter_new_tag.py\"\n",
"ensemble.attach_generator_files(to_configure=config_file)\n",
"\n",
"exp.generate(ensemble, overwrite=True, tag='@')\n",
"exp.start(ensemble)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Last, we can see all the kernels we have executed by calling `Experiment.summary()`"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"| | Name | Entity-Type | JobID | RunID | Time | Status | Returncode |\n",
"|----|-----------------------|---------------|---------|---------|---------|-----------|--------------|\n",
"| 0 | tutorial-model | Model | 1428 | 0 | 2.00734 | Completed | 0 |\n",
"| 1 | tutorial-model-1 | Model | 1431 | 0 | 2.22411 | Completed | 0 |\n",
"| 2 | tutorial-model-2 | Model | 1432 | 0 | 5.98942 | Completed | 0 |\n",
"| 3 | tutorial-model-mpirun | Model | 1435 | 0 | 2.00939 | Completed | 0 |\n",
"| 4 | ensemble-replica_0 | Model | 1443 | 0 | 4.64557 | Completed | 0 |\n",
"| 5 | ensemble-replica_2 | Model | 1445 | 0 | 4.2261 | Completed | 0 |\n",
"| 6 | ensemble-replica_1 | Model | 1444 | 0 | 6.44562 | Completed | 0 |\n",
"| 7 | ensemble-replica_3 | Model | 1446 | 0 | 6.02451 | Completed | 0 |\n",
"| 8 | ensemble_2 | Model | 1451 | 0 | 4.22712 | Completed | 0 |\n",
"| 9 | ensemble_3 | Model | 1452 | 0 | 6.02064 | Completed | 0 |\n",
"| 10 | ensemble_0 | Model | 1449 | 0 | 4.64088 | Completed | 0 |\n",
"| 11 | ensemble_0 | Model | 1455 | 1 | 4.21892 | Completed | 0 |\n",
"| 12 | ensemble_1 | Model | 1450 | 0 | 4.43377 | Completed | 0 |\n",
"| 13 | ensemble_1 | Model | 1456 | 1 | 4.00995 | Completed | 0 |\n",
"| 14 | ensemble_new_tag_0 | Model | 1459 | 0 | 4.60659 | Completed | 0 |\n",
"| 15 | ensemble_new_tag_1 | Model | 1460 | 0 | 4.39902 | Completed | 0 |\n",
"| 16 | ensemble_new_tag_2 | Model | 1461 | 0 | 4.19067 | Completed | 0 |\n",
"| 17 | ensemble_new_tag_3 | Model | 1462 | 0 | 5.9866 | Completed | 0 |\n"
]
}
],
"source": [
"print(exp.summary())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Orchestrator\n",
"\n",
"The Orchestrator is an in-memory database (Redis/KeyDB) that is launched prior to all other entities within an Experiment.\n",
"The Orchestrator can be used to store and retrieve data across languages (Fortran, C, C++, Python) during the course\n",
"of an experiment and across multiple workloads. In order to stream data into or receive data from the Orchestrator,\n",
"one of the SmartSim clients (SmartRedis) has to be used within your workload. \n",
"\n",
"
\n",
"\n",
"The Orchestrator is capable of hosting and executing AI models written in Python on CPU or GPU.\n",
"The Orchestrator supports models written with TensorFlow, Pytorch, or models saved in an ONNX format (e.g. scikit-learn).\n",
"See the inference tutorial for more information on how to use the machine learning runtimes built into\n",
"the Orchestrator database.\n",
"\n",
"Orchestrators can either be deployed on a single host, or many hosts as shown in the diagram below. \n",
"\n",
"
\n",
"\n",
"In this tutorial, a single-host host Orchestrator is deployed locally (as we specified `local` for the Experiment launcher)\n",
"and used to demonstrate how to use the SmartRedis Python client within a workload."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"from smartredis import Client\n",
"import numpy as np\n",
"\n",
"REDIS_PORT=6899"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"00:19:57 e3fbeabfdb3e SmartSim[1408] INFO Working in previously created experiment\n"
]
}
],
"source": [
"# start a new Experiment for this section\n",
"exp = Experiment(\"tutorial-smartredis\", launcher=\"local\")\n",
"\n",
"# create and start an instance of the Orchestrator database\n",
"db = exp.create_database(db_nodes=1,\n",
" port=REDIS_PORT,\n",
" interface=\"lo\")\n",
"# create an output directory for the database log files\n",
"exp.generate(db)\n",
"\n",
"# start the database\n",
"exp.start(db)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that the `Orchestrator` is running, a SmartRedis client can be used to store and retrieve data from the database."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"# connect a SmartRedis client at the address supplied by the launched\n",
"# Orchestrator instance.\n",
"# Cluster=False as the Orchestrator was deployed on a single compute host (local)\n",
"client = Client(address=db.get_address()[0], cluster=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, we can use the client to put and retrieve Tensors. Tensors are the native array format of the client language being used. For example, in Python, NumPy arrays are the Tensor format for SmartRedis. Each stored tensors needs a unique key at which to be stored."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Receive tensor:\n",
"\n",
" [[[1. 1. 1.]\n",
" [1. 1. 1.]\n",
" [1. 1. 1.]]\n",
"\n",
" [[1. 1. 1.]\n",
" [1. 1. 1.]\n",
" [1. 1. 1.]]\n",
"\n",
" [[1. 1. 1.]\n",
" [1. 1. 1.]\n",
" [1. 1. 1.]]\n",
"\n",
" [[1. 1. 1.]\n",
" [1. 1. 1.]\n",
" [1. 1. 1.]]]\n"
]
}
],
"source": [
"send_tensor = np.ones((4,3,3))\n",
"\n",
"client.put_tensor(\"tutorial_tensor_1\", send_tensor)\n",
"\n",
"receive_tensor = client.get_tensor(\"tutorial_tensor_1\")\n",
"\n",
"print('Receive tensor:\\n\\n', receive_tensor)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the SmartRedis `Client` and its possible to store and run a PyTorch, TensorFlow, or ONNX model in the database. The example below shows a PyTorch model being created, set in the database, and called from a SmartRedis client.\n",
"\n",
"For more information on ML inference in SmartSim, see the inference tutorial."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import torch.nn as nn\n",
"\n",
"# taken from https://pytorch.org/docs/master/generated/torch.jit.trace.html\n",
"class Net(nn.Module):\n",
" def __init__(self):\n",
" super().__init__()\n",
" self.conv = nn.Conv2d(1, 1, 3)\n",
"\n",
" def forward(self, x):\n",
" return self.conv(x)\n",
"\n",
"\n",
"net = Net()\n",
"example_forward_input = torch.rand(1, 1, 3, 3)\n",
"module = torch.jit.trace(net, example_forward_input)\n",
"\n",
"# Save the traced model to a file\n",
"torch.jit.save(module, \"./torch_cnn.pt\") "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we send the model to the database, again, we assign it a unique key, `tutorial-cnn`. This key is provided to run the model in the `Client.run_model` method."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# Set the model in the Redis database from the file\n",
"client.set_model_from_file(\"tutorial-cnn\", \"./torch_cnn.pt\", \"TORCH\", \"CPU\")"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"# Put a tensor in the database as a test input\n",
"data = torch.rand(1, 1, 3, 3).numpy()\n",
"client.put_tensor(\"torch_cnn_input\", data)\n",
"\n",
"# Run model and retrieve the output\n",
"client.run_model(\"tutorial-cnn\", inputs=[\"torch_cnn_input\"], outputs=[\"torch_cnn_output\"])\n",
"out_data = client.get_tensor(\"torch_cnn_output\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that we could have defined the model as an object (without storing it on disk) and send it to the DB using `set_model` instead of `set_model_from_file`. We can do the same thing for any Python function. For example, let's define a simple function takes a NumPy tensor as input."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0. 1. 2. 3. 4. 5. 6. 7. 8.]]\n",
"Max:\n",
"8.0\n"
]
}
],
"source": [
"def max_of_tensor(array):\n",
" \"\"\"Sample torchscript script that returns the\n",
" highest element in an array.\n",
"\n",
" \"\"\"\n",
" # return the highest element\n",
" return array.max(1)[0]\n",
"\n",
"sample_array_1 = np.array([np.arange(9.)])\n",
"print(sample_array_1)\n",
"print(\"Max:\")\n",
"print(max_of_tensor(sample_array_1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's store this function so it can be called, assigning it the key `max-of-tensor`: "
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"client.set_function(\"max-of-tensor\", max_of_tensor)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A Client can now be used to call this function where it will run wherever the database is deployed (on CPU or GPU)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[8.]\n"
]
}
],
"source": [
"client.put_tensor(\"script-data-1\", sample_array_1)\n",
"client.run_script(\n",
" \"max-of-tensor\", # key of our script\n",
" \"max_of_tensor\", # function to be called\n",
" [\"script-data-1\"],\n",
" [\"script-output\"],\n",
")\n",
"\n",
"out = client.get_tensor(\"script-output\")\n",
"\n",
"print(out)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And, as expected, we obtain the same result we obtained when we ran the function locally. To clean up, we need to tear down the DB. We do this by stopping the `Orchestrator`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"exp.stop(db)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ensembles using SmartRedis\n",
"\n",
"In Section 1.2 we used `Ensemble`s. What would happen if `Model`s which are part of an `Ensemble` tried to put their tensors on the DB using SmartRedis? Unless we used unique keys across the running programs, several tensors (or objects) would have the same key, and this key collision would result in unexpected behavior. In other words, if in the source code of one program, a tensor with key `tensor1` was put on the DB, then each replica of the program would put a tensor with the key `tensor1`. SmartSim and SmartRedis can avoid key collision by prepending program-unique prefixes to `Model` workloads launched through SmartSim. \n",
"\n",
"Instead of creating a new Experiment for this section, we will use the previous Experiment and relaunch the Orchestrator using the `db` reference that is already defined."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"exp.start(db)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's add two replicas of the same `Model`. Basically, it is a simple producer, which puts a tensor on the DB. The code for it is in `producer.py`."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"rs_prod = exp.create_run_settings(\"python\", f\"producer.py --redis-port {REDIS_PORT}\")\n",
"ensemble = exp.create_ensemble(name=\"producer\",\n",
" replicas=2, \n",
" run_settings=rs_prod)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We add a consumer, which will just retrieve the tensors put by the two producers and check that they are what it expects."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"rs_consumer = exp.create_run_settings(\"python\", f\"consumer.py --redis-port {REDIS_PORT}\")\n",
"consumer = exp.create_model(\"consumer\", run_settings=rs_consumer)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need to register incoming entities, i.e. entities for which the prefix will have to be known by other entities. When we will start the `Experiment`, environment variables will be set to let all entities know which incoming entities are present."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"consumer.register_incoming_entity(ensemble.models[0])\n",
"consumer.register_incoming_entity(ensemble.models[1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we attach the files to the experiments, generate them, and run!"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"00:20:48 e3fbeabfdb3e SmartSim[1408] INFO Working in previously created experiment\n",
"00:20:48 e3fbeabfdb3e SmartSim[1408] INFO Working in previously created experiment\n",
"00:20:48 e3fbeabfdb3e SmartSim[1408] INFO \n",
"\n",
"=== Launch Summary ===\n",
"Experiment: tutorial-smartredis\n",
"Experiment Path: /home/craylabs/tutorials/getting_started/tutorial-smartredis\n",
"Launcher: local\n",
"Ensembles: 1\n",
"Models: 1\n",
"Database Status: active\n",
"\n",
"=== Ensembles ===\n",
"producer\n",
"Members: 2\n",
"Batch Launch: False\n",
"\n",
"=== Models ===\n",
"consumer\n",
"Executable: /usr/bin/python\n",
"Executable Arguments: consumer.py --redis-port 6899\n",
"\n",
"\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"00:21:02 e3fbeabfdb3e SmartSim[1408] INFO producer_0(1500): Completed\n",
"00:21:02 e3fbeabfdb3e SmartSim[1408] INFO producer_1(1505): Completed\n",
"00:21:02 e3fbeabfdb3e SmartSim[1408] INFO consumer(1510): Completed\n"
]
}
],
"source": [
"ensemble.attach_generator_files(to_copy=['producer.py'])\n",
"consumer.attach_generator_files(to_copy=['consumer.py'])\n",
"exp.generate(ensemble, overwrite=True)\n",
"exp.generate(consumer, overwrite=True)\n",
"\n",
"# start the models\n",
"exp.start(ensemble, consumer, summary=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The producers produced random NumPy tensors, and we can see that the consumer was able to retrieve both of them from the DB, by looking at its output."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tensor for producer_0 is: [[[[0.16503988 0.12075829 0.3565984 ]\n",
" [0.72577718 0.09396099 0.1618377 ]\n",
" [0.33099621 0.55506376 0.69916534]]]]\n",
"Tensor for producer_1 is: [[[[0.68450198 0.27678731 0.65711464]\n",
" [0.74589422 0.45886442 0.52484735]\n",
" [0.5394516 0.20950066 0.96127311]]]]\n",
"\n"
]
}
],
"source": [
"outputfile = './tutorial-smartredis/consumer/consumer.out'\n",
"\n",
"with open(outputfile, 'r') as fin:\n",
" print(fin.read())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As usual, let's shutdown the DB, by stopping the `Orchestrator`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"exp.stop(db)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}