{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Boston Housing Prices\n", "\n", "This notebook shows, by way of example, how scikit-learn can be used for a task such as linear regression." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import sklearn\n", "import sklearn.datasets\n", "import sklearn.linear_model\n", "import sklearn.metrics\n", "import sklearn.model_selection\n", "\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `sklearn` module, whose full name is `scikit-learn`, already ships with several datasets.\n", "The Boston house prices are one example." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Boston House Prices dataset\n", "===========================\n", "\n", "Notes\n", "------\n", "Data Set Characteristics: \n", "\n", " :Number of Instances: 506 \n", "\n", " :Number of Attributes: 13 numeric/categorical predictive\n", " \n", " :Median Value (attribute 14) is usually the target\n", "\n", " :Attribute Information (in order):\n", " - CRIM per capita crime rate by town\n", " - ZN proportion of residential land zoned for lots over 25,000 sq.ft.\n", " - INDUS proportion of non-retail business acres per town\n", " - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n", " - NOX nitric oxides concentration (parts per 10 million)\n", " - RM average number of rooms per dwelling\n", " - AGE proportion of owner-occupied units built prior to 1940\n", " - DIS weighted distances to five Boston employment centres\n", " - RAD index of accessibility to radial highways\n", " - TAX full-value property-tax rate per \$10,000\n", " - PTRATIO pupil-teacher ratio by town\n", " - B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\n", " - LSTAT % lower status of the population\n", " - MEDV Median 
value of owner-occupied homes in \$1000's\n", "\n", " :Missing Attribute Values: None\n", "\n", " :Creator: Harrison, D. and Rubinfeld, D.L.\n", "\n", "This is a copy of UCI ML housing dataset.\n", "http://archive.ics.uci.edu/ml/datasets/Housing\n", "\n", "\n", "This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\n", "\n", "The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\n", "prices and the demand for clean air', J. Environ. Economics & Management,\n", "vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics\n", "...', Wiley, 1980. N.B. Various transformations are used in the table on\n", "pages 244-261 of the latter.\n", "\n", "The Boston house-price data has been used in many machine learning papers that address regression\n", "problems. \n", " \n", "**References**\n", "\n", " - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\n", " - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\n", " - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)\n", "\n" ] } ], "source": [ "boston = sklearn.datasets.load_boston()\n", "print(boston.DESCR)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset is now split into a training set and a test set." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(\n", " boston.data, boston.target, test_size=0.33)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring the Training Set\n", "\n", "First, we visualize the values.\n", "`Y_train` is the target variable, i.e. `MEDV`: \"Median value of owner-occupied homes in \$1000's\"." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "image/png": "text/plain": [ "