Lesson 1: Import Datasets, Basic Statistics, Descriptive Statistics, and Statistics by Category/Group

Welcome to the very first lesson for Azure ML Studio. Since Microsoft announced the product only in July of 2014, ML Studio is still something of a mystery to many analytics and business intelligence professionals. Neal Analytics will therefore present hands-on tutorials on this new product on a regular basis so that everyone can take advantage of what it offers.

Let's start easy. This lesson will illustrate:

- How to import datasets
- How to quickly obtain basic statistical information for the dataset
- How to obtain additional descriptive statistical information for the dataset
- How to do the same for each category or group in the dataset

In this lesson, a dataset called chickwts from R is downloaded and saved as a csv file. Chickwts contains 71 rows and two columns named Weight and Feed. Weight is numeric, while Feed has six categories: horsebean, linseed, soybean, sunflower, meatmeal, and casein.

Import Dataset

First, let's import the data.

1. Open the internet browser of choice
2. Enter the URL: http://studio.azureml.net
3. Enter personal log-in information
4. Once logged in, ML Studio should look like this:
5. Click on New
6. Click on Dataset
7. Click on From Local File
8. Either enter the directory along with the file name or click on Browse to locate chickwts.csv
   Note: Since chickwts already exists on my computer, a green check mark appears next to the Existing dataset box.
9. Since the original dataset includes a header, select Generic CSV File with a header (.csv) under Select a type for the new dataset
10. Click on the check mark
11. This should return to the Home page
12. To check whether chickwts.csv has been properly imported, either:
   a. Click on Experiment if the program contains existing experiments
      i. Click on any experiment on the list
   b. Or, if no experiments exist (and for the sake of this lesson), start a new experiment by clicking on New
      i. Click on Experiment
      ii. The resulting screen should look like this:
      iii. Click on the title at the top to rename the experiment
13. Whether an existing or a new experiment is used, click on Saved Datasets
14. chickwts.csv should be included in the drop-down list
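For readers who want to see what the "Generic CSV File with a header" choice implies, here is a minimal Python sketch of reading such a file outside ML Studio. It is only an illustration under assumptions: the header names weight and feed and the three sample rows are a hypothetical excerpt, not the full 71-row file, and the function name load_rows is made up for this example.

```python
import csv
import io

# Hypothetical excerpt of chickwts.csv: one header row, then data rows.
# The header names are an assumption; the real file has 71 data rows.
CSV_TEXT = """weight,feed
179,horsebean
160,horsebean
309,linseed
"""


def load_rows(text):
    """Parse CSV text whose first line is a header into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text))
    # The header row supplies the keys; weight is converted to a number.
    return [{"weight": float(r["weight"]), "feed": r["feed"]} for r in reader]


rows = load_rows(CSV_TEXT)
print(len(rows), "rows; columns:", list(rows[0]))
```

Because the first line is treated as a header, it names the columns instead of being counted as a data row, which is the same distinction the dataset type makes in step 9.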
Basic Statistics

Basic Statistics include:

- Mean
- Median
- Min
- Max
- Standard Deviation

Additional information includes:

- Unique Values
- Missing Values
- Feature Type

1. Click, hold, and drag chickwts.csv into the workspace
2. Right click on the tiny circle at the bottom of the module
3. Click on Visualize
4. Just like that, the Basic Statistics information should be listed in table form like this:
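As a rough cross-check of what Visualize reports, the same basic statistics can be sketched in plain Python. This is an illustration, not ML Studio's implementation: the function name basic_statistics is hypothetical, the input is only the ten horsebean weights from R's chickwts, and the sample (n - 1) form of the standard deviation is an assumption chosen because it reproduces the horsebean figure in the Excel table later in this lesson.

```python
import statistics

# The ten horsebean weights from R's chickwts dataset.
horsebean = [179, 160, 136, 227, 217, 168, 108, 124, 143, 140]


def basic_statistics(values):
    """Return the basic statistics for a numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
        # Sample standard deviation (divides by n - 1); an assumption here.
        "std_dev": statistics.stdev(present),
        "unique_values": len(set(present)),
        "missing_values": len(values) - len(present),
    }


stats = basic_statistics(horsebean)
for name, value in stats.items():
    print(f"{name}: {value}")
```

For this column the mean is 160.2, the median 151.5, the minimum 108, and the maximum 227, with a standard deviation of about 38.6258.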
Descriptive Statistics

In some cases, more statistical information is desired, and a dedicated module, Descriptive Statistics, provides it. The information offered by this module includes:

- Count (Number of Values)
- Unique Value Count (Number of Unique Values)
- Missing Value Count (Number of Missing Values)
- Min
- Max
- Mean
- Mean Deviation
- 1st Quartile
- Median
- 3rd Quartile
- Mode
- Range
- Sample Variance
- Sample Standard Deviation
- Sample Skewness
- Sample Kurtosis
- P0.5 (0.5% Percentile)
- P1 (1% Percentile)
- P5 (5% Percentile)
- P95 (95% Percentile)
- P99 (99% Percentile)
- P99.5 (99.5% Percentile)

1. Continuing from where Basic Statistics left off, there are two ways to locate the Descriptive Statistics module:
   a. Click on Statistical Functions
   b. Type Descriptive Statistics in the search bar above
2. Click, hold, and drag the Descriptive Statistics module into the workspace
3. Connect the two modules by clicking and dragging the connection arrow from the tiny circle at the bottom of the chickwts.csv module to the tiny circle at the top of the Descriptive Statistics module
4. The resulting chart should look something like this:
5. Click on Save (optional but recommended)
6. Click on Run
7. If the experiment ran successfully, a green check mark should appear inside the Descriptive Statistics module
8. Now, right-click on the tiny circle at the bottom of the Descriptive Statistics module
9. Click on Visualize
10. The result looks like so. Scroll to the right to see further numbers.

Basic and Descriptive Statistics by Category or Group

Sometimes, one wishes to obtain statistical information for each category in the dataset. In Excel, this information can be calculated and displayed like so:
            horsebean   linseed    soybean    sunflower   meatmeal    casein
Mean        160.2       218.75     246.4286   328.91667   276.90909   323.5833
Median      151.5       221        248        328         263         342
Min         108         141        158        226         153         216
Max         227         309        329        423         380         404
Std. Dev.   38.625841   52.2357    54.12907   48.836384   64.900623   64.43384

In ML Studio, the dataset must be split by category before basic statistics can be calculated. A module named Split divides a dataset into exactly two parts, not more, based on the settings in its properties.

1. Let's begin with a clear workspace, then click, hold, and drag the chickwts.csv dataset into it
2. Search for the Split module; there are two ways:
   a. Click on Data Transformation
      i. Click on Sample and Split
   b. Or simply type split into the search bar to pull up the Split module
3. Click, hold, and drag the Split module into the workspace
4. Connect chickwts.csv to Split by dragging an arrow from the tiny circle at the bottom of chickwts.csv to the tiny circle at the top of Split
5. The resulting chart should look something like this:
6. Now, click on the Split module to highlight it, then in the Properties window on the right, select Regular Expression for Splitting mode
7. Under Regular expression, type \"feed" horsebean
8. Click Save then Run
9. Once a green check mark appears in the Split module, click on the tiny circle at the bottom left of the module
10. Click on Visualize
11. This should provide statistical results for the horsebean category
12. Clicking on the tiny circle at the bottom right of the Split module, then Visualize, should give the statistics for the rest of the dataset, which may have little meaning at this point
13. To obtain Basic Statistics for all the categories, continue to drag and drop Split modules into the workspace until:
    Number of Split Modules = Number of Categories - 1
    With six feed categories, that means five Split modules.
14. Connect the Split modules by dragging an arrow from the tiny circle at the bottom right of each module to the tiny circle at the top of the next module, like so:
15. For each additional Split module, click on it to highlight it, then set the following properties:

    Split Module   Splitting Mode       Regular Expression
    2nd            Regular Expression   \"feed" linseed
    3rd            Regular Expression   \"feed" soybean
    4th            Regular Expression   \"feed" sunflower
    5th            Regular Expression   \"feed" meatmeal

16. Click Save then Run
17. A green check mark must appear in each Split module
18. Clicking on the tiny circle at the bottom left of each Split module (except the last one), then clicking Visualize, will show the Basic Statistics for its corresponding category. For the last Split module, the left tiny circle will show statistics for one category (meatmeal) and the right for the last category (casein).
19. To take a step further and obtain Descriptive Statistics, connect each tiny circle at the bottom left of a Split module (and the bottom right one as well for the last Split module) to a Descriptive Statistics module so that the resulting chart looks like this:
20. Click Save then Run
21. Make sure all the modules have green check marks
22. Now, clicking on the unconnected tiny circle at the bottom of each Descriptive Statistics module, then clicking Visualize, will provide additional statistical information for the corresponding category in the chickwts.csv dataset.

Hopefully, this lesson has taught the steps needed to obtain simple and basic statistical information for any dataset that involves both numeric and categorical values. In the process of walking through this lesson, a few other techniques (such as connecting modules) have been repeated and should be ingrained by now. This is the launching pad to more sophisticated yet simple analysis using ML Studio. Until next time.
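For readers who want to cross-check the per-category results outside ML Studio, the chain of Split modules can be sketched in Python. This is a loose analogy, not how ML Studio works internally. The function names split and split_chain are hypothetical, matching with re.search against the feed value is an assumption about how the regular expression is applied, and only three of the six categories (with their weights as listed in R's chickwts) are included to keep the sketch short.

```python
import re
import statistics

# Weights for three of the six feed categories, as listed in R's chickwts.
# The other three categories are omitted to keep the sketch short.
WEIGHTS = {
    "horsebean": [179, 160, 136, 227, 217, 168, 108, 124, 143, 140],
    "linseed": [309, 229, 181, 141, 260, 203, 148, 169, 213, 257, 244, 271],
    "soybean": [243, 230, 248, 327, 329, 250, 193, 271, 316, 267, 199, 171,
                158, 248],
}
rows = [(feed, w) for feed, ws in WEIGHTS.items() for w in ws]


def split(rows, pattern):
    """Mimic one Split module: return (rows matching pattern, all other rows)."""
    regex = re.compile(pattern)
    matched = [row for row in rows if regex.search(row[0])]
    rest = [row for row in rows if not regex.search(row[0])]
    return matched, rest


def split_chain(rows, categories):
    """Peel off one category per split; uses len(categories) - 1 splits."""
    groups = {}
    remaining = rows
    for name in categories[:-1]:
        groups[name], remaining = split(remaining, re.escape(name))
    # The right-hand output of the last split holds the final category.
    groups[categories[-1]] = remaining
    return groups


groups = split_chain(rows, ["horsebean", "linseed", "soybean"])
for name, members in groups.items():
    weights = [w for _, w in members]
    print(
        f"{name}: n={len(weights)} mean={statistics.mean(weights):.4f} "
        f"median={statistics.median(weights)} min={min(weights)} "
        f"max={max(weights)} stdev={statistics.stdev(weights):.4f}"
    )
```

The sketch shows why the number of Split modules is one less than the number of categories: each split removes one category, and whatever remains after the last split is the final category. The means, medians, minimums, and maximums it prints for these three categories should agree with the Excel table in this lesson.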