Tech Don: Data science

Friday, October 16, 2020

Performing Analysis of Meteorological Data

We are going to analyze a weather dataset. It contains hourly temperature records covering roughly 10 years, from 2006-04-01 00:00:00.000 +0200 to 2016-09-09 23:00:00.000 +0200. It corresponds to Finland, a country in Northern Europe. You can download the dataset from Kaggle (link: https://www.kaggle.com/muthuj7/weather-dataset). We are going to use the numpy, pandas and matplotlib libraries of Python.

We perform data cleaning and analysis to test the following hypothesis: do the apparent temperature and humidity, compared monthly across the 10 years of data, indicate an increase due to global warming?

Import Packages and Load the Data:
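A minimal sketch of the loading step, assuming the downloaded file is a plain CSV that pandas can read directly. The helper name load_weather and the two sample rows are made up for illustration; the column names are assumed to match the Kaggle file.

```python
import io

import pandas as pd


def load_weather(source):
    """Read the weather CSV from a file path or a file-like object."""
    return pd.read_csv(source)


# Two made-up rows standing in for the real file, so the sketch runs
# without the download. With the real data, pass the CSV path instead.
sample_csv = io.StringIO(
    "Formatted Date,Apparent Temperature (C),Humidity\n"
    "2006-04-01 00:00:00.000 +0200,7.39,0.89\n"
    "2006-04-01 01:00:00.000 +0200,7.23,0.86\n"
)

df = load_weather(sample_csv)
print(df.head())
print(df.dtypes)
```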

Here is a preview of how our data-set looks:

We have to convert the "Formatted Date" column, which pandas reads as plain strings (object dtype), to datetime format so that we can work with it as dates.
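One way to do the conversion is pd.to_datetime with utc=True, which also normalizes the mixed +0100/+0200 offsets to a single time zone. This is a sketch on two made-up rows; the explicit format string is an assumption derived from the timestamps quoted above.

```python
import pandas as pd

# Made-up rows that mimic the timestamp layout of the dataset
df = pd.DataFrame(
    {
        "Formatted Date": [
            "2006-04-01 00:00:00.000 +0200",
            "2006-10-29 03:00:00.000 +0100",
        ],
        "Humidity": [0.89, 0.93],
    }
)

# Parse the strings and convert every timestamp to UTC
df["Formatted Date"] = pd.to_datetime(
    df["Formatted Date"],
    format="%Y-%m-%d %H:%M:%S.%f %z",
    utc=True,
)

print(df.dtypes)
print(df)
```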

Next we resample the hourly data to monthly averages, which is the granularity our hypothesis needs. Here is how the data looks after resampling:
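The resampling step might look like the sketch below: set the datetime column as the index, keep the two columns of interest, and average per month. The four rows are made up, and the "MS" (month start) frequency is an assumption about how the months are labeled.

```python
import pandas as pd

# Made-up readings spread over two months
df = pd.DataFrame(
    {
        "Formatted Date": pd.to_datetime(
            [
                "2006-04-01 00:00",
                "2006-04-15 12:00",
                "2006-05-01 00:00",
                "2006-05-20 06:00",
            ],
            utc=True,
        ),
        "Apparent Temperature (C)": [6.0, 10.0, 14.0, 18.0],
        "Humidity": [0.80, 0.70, 0.60, 0.50],
    }
)

# Index by time, then average every calendar month
monthly = (
    df.set_index("Formatted Date")[["Apparent Temperature (C)", "Humidity"]]
    .resample("MS")
    .mean()
)

print(monthly)
```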

Now let us plot our data in a line graph to visualize the variation in Apparent Temperature and Humidity:

As we can see, both the peaks and the troughs are almost the same throughout the 10-year period. Here is a plot of the average temperature and humidity for the month of April over the 10 years.
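The April view described above could be produced along these lines. The monthly frame here is synthetic (three years of invented values), and the output file name is hypothetical; with the real data you would filter the resampled frame the same way.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic monthly means, one row per month start
index = pd.date_range("2006-01-01", "2008-12-01", freq="MS", tz="UTC")
monthly = pd.DataFrame(
    {
        "Apparent Temperature (C)": [float(ts.month) for ts in index],
        "Humidity": [0.7] * len(index),
    },
    index=index,
)

# Keep only the April row of every year
april = monthly[monthly.index.month == 4]

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(april.index.year, april["Apparent Temperature (C)"],
        marker="o", label="Apparent Temperature (C)")
ax.plot(april.index.year, april["Humidity"], marker="o", label="Humidity")
ax.set_xlabel("Year")
ax.set_title("April averages by year")
ax.legend()
fig.savefig("april_averages.png")
```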

Now split 'Formatted Date' into separate 'month' and 'year' values and group the data by them.
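A sketch of that grouping, assuming the date column has already been converted to datetime. The column names 'year' and 'month' and the sample values are chosen for illustration.

```python
import pandas as pd

# Made-up readings from two different years
df = pd.DataFrame(
    {
        "Formatted Date": pd.to_datetime(
            [
                "2006-04-01 00:00",
                "2006-04-02 00:00",
                "2007-04-01 00:00",
                "2007-05-01 00:00",
            ],
            utc=True,
        ),
        "Apparent Temperature (C)": [6.0, 8.0, 9.0, 15.0],
        "Humidity": [0.8, 0.6, 0.7, 0.5],
    }
)

# Derive the grouping keys from the datetime column
df["year"] = df["Formatted Date"].dt.year
df["month"] = df["Formatted Date"].dt.month

# Average per (month, year), so each month can be compared across years
by_month_year = df.groupby(["month", "year"])[
    ["Apparent Temperature (C)", "Humidity"]
].mean()

print(by_month_year)
```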

Now let us plot our data in a line graph to see whether the average apparent temperature and the average humidity have increased from 2006 to 2016. This monthly analysis has to be done for all 12 months over the 10-year period.

Hence we can conclude that global warming has caused uncertainty in temperature over the past 10 years, while the average humidity has remained constant throughout the 10 years.

Thus, from the above visualization, it is clear that there is a marked change in the average apparent temperature due to global warming. The humidity remains approximately constant throughout the time span.

Wednesday, October 14, 2020

Recognizing Handwritten Digits with scikit-learn

Introduction:

Recognizing handwritten text is a problem that can be traced back to the first automatic machines that needed to recognize individual characters in handwritten documents.

Classifying handwritten text or numbers is important for many real-world scenarios. For example, a postal service can scan postal codes on envelopes to automate the grouping of envelopes that have to be sent to the same place. This article shows how to recognize handwritten digits (0 to 9) from the well-known digits dataset that ships with Scikit-Learn, using a classifier called a Support Vector Machine.

Scikit-Learn:

Scikit-learn is a free software machine learning library for the Python programming language. It features various algorithms such as support vector machines, random forests, and k-nearest neighbors, and it is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

Scikit-Learn is a library for Python that contains numerous useful algorithms that can easily be implemented and altered for the purpose of classification and other machine learning tasks.

Support Vector Machine:

In machine learning, support-vector machines are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.

Recognizing Handwritten Digits with Scikit-learn:

Scikit-learn provides multiple Support Vector Machine classifier implementations. SVC supports multiple kernel functions (used to separate classes that are not linearly separable), but its training time grows at least quadratically with the number of samples. Its multiclass classification is done with a one-vs-one scheme. LinearSVC, on the other hand, only supports a linear kernel, but its training time grows roughly linearly with the number of samples. Its multiclass classification is done with a one-vs-rest scheme.
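The difference between the two multiclass schemes can be seen in the shape of the decision function. The sketch below is an illustration rather than part of the original walkthrough: the 500-sample subset, the pixel scaling, and the parameter values are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.svm import SVC, LinearSVC

X, y = load_digits(return_X_y=True)
X = X / 16.0  # pixel values range from 0 to 16; scaling helps LinearSVC converge
X_small, y_small = X[:500], y[:500]

# Kernel SVM: one binary classifier per pair of classes (one-vs-one)
kernel_svc = SVC(kernel="rbf", decision_function_shape="ovo").fit(X_small, y_small)

# Linear SVM: one binary classifier per class (one-vs-rest)
linear_svc = LinearSVC(max_iter=10000).fit(X_small, y_small)

# 10 classes give 10 * 9 / 2 = 45 pairwise scores versus 10 per-class scores
print(kernel_svc.decision_function(X_small[:1]).shape)
print(linear_svc.decision_function(X_small[:1]).shape)
```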

Loading the Dataset:

The Scikit-learn library provides numerous datasets, among which we will be using a data set of images called Digits. This data set consists of 1,797 images that are 8x8 pixels in size. Each image is a handwritten digit in grayscale.
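The figures above can be checked directly after loading the dataset with load_digits, as in this short sketch.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# images holds the 8x8 arrays; data holds the same pixels flattened to 64 features
print(digits.images.shape)
print(digits.data.shape)

# Grayscale intensities are stored as small numbers, not 0-255
print(digits.images.min(), digits.images.max())
```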

In Python, the dir() function returns the names of the attributes of an object, in other words, which information is stored in the object in the form of other objects. The digits object is also dictionary-like, so its keys() method gives the same overview. Let's use this to check what can be found in the digits object:
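A small sketch of that inspection. It shows both keys() and dir(), since either can be used on the object returned by load_digits; the exact list of fields may vary slightly between scikit-learn versions.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# The returned object is dictionary-like, so keys() lists its fields
print(list(digits.keys()))

# dir() reports the same fields as attributes
public_attributes = [name for name in dir(digits) if not name.startswith("_")]
print(public_attributes)

# Each field is reachable as an attribute, for example the labels
print(digits.target[:10])
```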

Visualize the images of the digits 0 to 5:
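One possible way to draw them, assuming the first six samples are used (their labels happen to be 0 through 5). The color map and the output file name are arbitrary choices.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()

# One small panel per image, titled with its label
fig, axes = plt.subplots(1, 6, figsize=(9, 2))
for ax, image, label in zip(axes, digits.images[:6], digits.target[:6]):
    ax.imshow(image, cmap="gray_r", interpolation="nearest")
    ax.set_title(f"Label: {label}")
    ax.axis("off")

fig.savefig("digits_0_to_5.png")
```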

Let's start modeling with a Support Vector Machine by creating an instance of the model. Here the model learns the relationship between the digits (x_train) and the labels (y_train).
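A sketch of the training step. The names x_train and y_train follow the text; the 75/25 split, the random seed, and the gamma and C values are assumptions, since the original settings are not shown.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()

# Assumed split: 75% of the samples for training, 25% held out for testing
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Create the model instance, then fit it on the training digits and labels
svc = SVC(gamma=0.001, C=100.0)
svc.fit(x_train, y_train)

print(x_train.shape, x_test.shape)
print(svc.classes_)
```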

Then predict the labels of the test set:
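Prediction is a single call to predict on the held-out samples. The sketch repeats the assumed split and parameters so that it runs on its own.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Assumed hyperparameters, not taken from the original post
svc = SVC(gamma=0.001, C=100.0).fit(x_train, y_train)

# Predict a label for every held-out image
predictions = svc.predict(x_test)

# Compare the first few predictions with the true labels
print(predictions[:10])
print(y_test[:10])
```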

We repeat the test with different ranges of samples held out for validation.
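Trying several validation ranges could be sketched as a loop that holds out one slice at a time, trains on the rest, and records the accuracy. The three ranges and the hyperparameters below are hypothetical, not the ones used in the original experiment.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

digits = load_digits()
X, y = digits.data, digits.target

# Hypothetical validation ranges, given as (start, stop) sample indices
validation_ranges = [(0, 300), (600, 900), (1497, 1797)]

scores = {}
for start, stop in validation_ranges:
    # Mark the held-out slice; everything else is used for training
    held_out = np.zeros(len(X), dtype=bool)
    held_out[start:stop] = True

    svc = SVC(gamma=0.001, C=100.0)
    svc.fit(X[~held_out], y[~held_out])
    scores[(start, stop)] = accuracy_score(y[held_out], svc.predict(X[held_out]))

for bounds, score in scores.items():
    print(bounds, round(score, 3))
```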

You can see that the svc estimator has learned correctly: it recognizes the handwritten digits, correctly interpreting all five digits of the validation set, and achieves the best accuracy score.