Bias & Variance in Layman's Terms

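If a machine learning model does not generalise well, its predictions contain some kind of error.

Error = difference between actual and predicted values/classes

Formula: the total error is usually measured by summing a loss over all examples, for instance the squared difference (actual output - predicted output)², because raw differences would let positive and negative errors cancel out. The expected error is also the sum of the reducible and the irreducible error.

Reducible Error = bias² + variance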

Bias is how far the predicted values/classes are, on average, from the actual values/classes. If the predictions are consistently far from the actual values, the model is highly biased.

If the predictions are not too far away, the model has low bias.

If the model is highly biased, it cannot capture the complexity of the data and hence it UNDERFITS (underfitting).

If the model performs well on the training dataset but poorly on testing or validation data that is new to it, the cause is termed variance. Variance is how much the predictions scatter when the model is trained on different samples of the data. If the model has high variance, the model overfits (OVERFITTING).
This is often described as the model having learned the noise.
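
To make the two terms concrete, here is a minimal sketch in plain Python. It assumes we already have several predictions for the same target, one from each model trained on a different sample, and it ignores the irreducible noise. The function name `bias_variance` and the numbers are made up for illustration.

```python
from statistics import mean


def bias_variance(predictions, actual):
    """Split the mean squared error of repeated predictions for one target.

    Returns (bias squared, variance, mean squared error).
    """
    avg_pred = mean(predictions)
    # Bias: how far the average prediction is from the actual value.
    bias_sq = (avg_pred - actual) ** 2
    # Variance: how much the predictions scatter around their own average.
    variance = mean((p - avg_pred) ** 2 for p in predictions)
    mse = mean((p - actual) ** 2 for p in predictions)
    return bias_sq, variance, mse


# Hypothetical predictions for a true value of 5.0.
underfit = bias_variance([2.0, 2.1, 1.9], 5.0)   # tight but far off: high bias
overfit = bias_variance([3.0, 7.0, 5.0], 5.0)    # centred but scattered: high variance

print("underfit (bias^2, variance, mse):", underfit)
print("overfit  (bias^2, variance, mse):", overfit)
```

In both cases the mean squared error equals bias² + variance, which is the reducible part of the error described above.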

Repository of Natural Language Processing

Python 2 & 3 Kernels inside the Jupyter Notebook

While solving problems in Python, many of us have run into kernel trouble: one package supports only Python 2.7 while another requires Python 3, and installing both kernels and running them side by side in Jupyter Notebook raises a lot of issues. Here’s my solution for running Python 2 and 3 on the same machine.

System Overview: I ran the kernels on macOS Mojave, version 10.14.5.

Screenshot: My System Configuration
Screenshot: Step 1
Screenshot: Step 2

It’s a two-step process:

  1. Check the available kernels:
    • jupyter kernelspec list
  2. Install the kernel:
    • python3 -m ipykernel install --user

Repeat the first step to check the installed kernels.
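
If you want both kernels to show up under distinct names, the install step can be scripted. The sketch below is only an illustration. It assumes that `python2` and `python3` are both on your PATH and that `ipykernel` is installed for each of them. The helper names and the kernel names "python2"/"python3" are my own choices. By default it only prints the commands (a dry run).

```python
import subprocess


def ipykernel_install_command(python_exe, name, display_name):
    """Build the `ipykernel install` command for one interpreter."""
    return [
        python_exe, "-m", "ipykernel", "install", "--user",
        "--name", name, "--display-name", display_name,
    ]


def register_kernels(dry_run=True):
    """Register a Python 2 and a Python 3 kernel (assumed interpreter names)."""
    commands = [
        ipykernel_install_command("python2", "python2", "Python 2"),
        ipykernel_install_command("python3", "python3", "Python 3"),
    ]
    for command in commands:
        if dry_run:
            # Only show what would be run.
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
    return commands


register_kernels()
```

Call `register_kernels(dry_run=False)` to actually run the commands, then run `jupyter kernelspec list` again to confirm both kernels are listed.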

What is GTFS, by the way?

General Transit Feed Specification
  1. GTFS

The General Transit Feed Specification is a common format for public transportation schedules and associated geographic information. It is the data format that Google Maps consumes for transit information.

2. Details of GTFS

GTFS is a set of text files that represent a snapshot of scheduled transit services:

a. agency.txt – Details of the agency publishing the data

b. routes.txt – Details of route names and types

c. trips.txt – Details of trips and services

d. stops.txt – Details of stop locations and names

e. stop_times.txt – Details of arrival & departure times

f. calendar.txt – Details of service days & dates

The feed can also include optional files such as calendar_dates.txt, fare_attributes.txt, fare_rules.txt, shapes.txt, frequencies.txt, transfers.txt, and feed_info.txt.
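
Because the files are plain CSV text, they can be read with nothing more than the standard library. The sketch below joins stop_times.txt with stops.txt to print one trip's schedule. The column names follow the GTFS specification, but the two tiny tables, the stop names, and the helper names are invented for illustration and do not come from a real feed.

```python
import csv
import io

# Made-up sample rows in GTFS column layout (not from a real feed).
STOPS_TXT = """stop_id,stop_name,stop_lat,stop_lon
S1,Central Station,10.0000,76.0000
S2,Market Road,10.0100,76.0100
"""

STOP_TIMES_TXT = """trip_id,arrival_time,departure_time,stop_id,stop_sequence
T1,08:10:00,08:11:00,S2,2
T1,08:00:00,08:01:00,S1,1
"""


def read_table(text):
    """Parse one GTFS text file into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def trip_schedule(trip_id, stops_text, stop_times_text):
    """Return (stop name, arrival, departure) for a trip, in stop order."""
    stop_names = {row["stop_id"]: row["stop_name"] for row in read_table(stops_text)}
    rows = [r for r in read_table(stop_times_text) if r["trip_id"] == trip_id]
    rows.sort(key=lambda r: int(r["stop_sequence"]))
    return [(stop_names[r["stop_id"]], r["arrival_time"], r["departure_time"]) for r in rows]


for name, arrival, departure in trip_schedule("T1", STOPS_TXT, STOP_TIMES_TXT):
    print(name, arrival, departure)
```

With a real feed you would open the files from the unzipped GTFS folder instead of using the inline strings.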

Real-time GTFS

1. Trip updates – delays, cancellations, changed routes

2. Service alerts – stop moved, unforeseen events affecting a station, route or the entire network

3. Vehicle positions – information about the vehicles including location and congestion level

Figure: GTFS class diagram

3. What data do we have?

Indian Railways Train Time Table

Data published by:  Ministry of Railways


Last updated: 2017

4. Benefits of having GTFS data


Passengers and potential users benefit from higher-quality information on services.

Operators and regulators benefit from the use of analytic and monitoring tools.

Society more generally benefits from operating in an open data ecosystem.

5. Architecture for transit

6. Use case value of application

KMRL – Kochi Metro Rail GTFS

Next Bus Delhi: Android Application of DTC buses

Data used: Delhi Open Transit Data (GTFS)

Case Studies

RPT – Rochester Public Transit



7. How a GTFS transition will help the government check performance and make improvements

  • Transit network analysis.
  • Defining route service span, travel times, headway, stop amenities, transfer stations, and interlined routes.
  • Fare structure.
  • Planning functions native to transit agencies, including service development.
  • Operational analysis.
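
As a small illustration of the kind of analysis listed above, the sketch below computes headways (the gaps between consecutive departures) and the service span for one stop. It assumes you have already collected the `departure_time` strings for that stop and route from stop_times.txt. The function names and the sample times are made up. Note that GTFS times may go past 24:00:00 for trips that run after midnight, so they are converted to seconds rather than parsed as clock times.

```python
def gtfs_time_to_seconds(hms):
    """Convert a GTFS HH:MM:SS string (hours may exceed 23) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def headways_minutes(departure_times):
    """Minutes between consecutive departures, after sorting by time."""
    secs = sorted(gtfs_time_to_seconds(t) for t in departure_times)
    return [(later - earlier) / 60 for earlier, later in zip(secs, secs[1:])]


def service_span_minutes(departure_times):
    """Minutes between the first and the last departure of the service day."""
    secs = [gtfs_time_to_seconds(t) for t in departure_times]
    return (max(secs) - min(secs)) / 60


# Hypothetical departures from one stop on one route.
departures = ["06:00:00", "06:15:00", "06:45:00", "25:00:00"]
print("headways (min):", headways_minutes(departures))
print("service span (min):", service_span_minutes(departures))
```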




K Nearest Neighbour & Logistic Regression: Iris Dataset


Predicting the flower species using K Nearest Neighbour & Logistic Regression on the Iris dataset
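
To give a feel for the K Nearest Neighbour half of this, here is a minimal sketch in plain Python instead of a library implementation. The nine rows are hand-typed in the Iris layout (sepal length, sepal width, petal length, petal width, in cm) and should be treated as illustrative rather than as the full dataset. The function name `knn_predict` is my own. Logistic Regression is not covered by this sketch.

```python
import math
from collections import Counter

# A few illustrative rows in Iris layout: (features, species).
TRAINING = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((4.7, 3.2, 1.3, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.9, 3.1, 4.9, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
    ((7.1, 3.0, 5.9, 2.1), "virginica"),
]


def knn_predict(sample, training=TRAINING, k=3):
    """Predict the species by majority vote among the k closest rows."""
    # Sort the training rows by Euclidean distance to the sample.
    by_distance = sorted(training, key=lambda row: math.dist(sample, row[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


print(knn_predict((5.0, 3.4, 1.5, 0.2)))
print(knn_predict((6.5, 3.0, 5.8, 2.2)))
```

With the real dataset you would hold out part of the rows for testing and try several values of `k` before settling on one.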

CAP Theorem: Big Data

When you start working with NoSQL databases, sooner or later you will come across the CAP theorem, presented by Eric Brewer in 2000, which describes a trade-off that applies to any distributed system.

In a distributed database system, C stands for Consistency, A stands for Availability and P stands for Partition Tolerance.

  • Consistency – Every node in the system has the same view of the data.
  • Availability – Users can read & write from any working node and always get a response.
  • Partition Tolerance – Your system will still operate even if nodes fail or the network between them is cut.

So, while selecting your NoSQL database, you can choose only two of the three characteristics, i.e. A-P, A-C or C-P.

A-P Databases: Voldemort, Cassandra, Riak, CouchDB, Dynamo

A-C Databases: PostgreSQL, MySQL

C-P Databases: HBase, MongoDB, Redis, Big Table
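
The trade-off is easier to see in a toy model. The sketch below is not how any of the databases above work internally. It is a made-up two-replica store (`TwoNodeStore` and the mode names "CP" and "AP" are hypothetical) that shows what each choice gives up once the network between the replicas is cut.

```python
class Replica:
    """One node holding a single value."""

    def __init__(self):
        self.value = None


class TwoNodeStore:
    """Toy two-replica store; mode is "CP" or "AP" (illustrative only)."""

    def __init__(self, mode):
        self.mode = mode
        self.nodes = [Replica(), Replica()]
        self.partitioned = False  # True when the replicas cannot talk

    def write(self, node_index, value):
        if self.partitioned:
            if self.mode == "CP":
                # Refuse the write: replicas stay consistent, availability is lost.
                return False
            # AP: accept the write locally; the replicas may now disagree.
            self.nodes[node_index].value = value
            return True
        # No partition: replicate the write to every node.
        for node in self.nodes:
            node.value = value
        return True

    def read(self, node_index):
        return self.nodes[node_index].value


for mode in ("CP", "AP"):
    store = TwoNodeStore(mode)
    store.write(0, "a")
    store.partitioned = True
    accepted = store.write(0, "b")
    print(mode, "write accepted:", accepted, "reads:", store.read(0), store.read(1))
```

The CP store rejects the second write, so both replicas still return the same value. The AP store accepts it, so the two replicas return different values until the partition heals.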
