Data

The most significant part of Data Science

Jul 16, 2021

Data will talk to you if you’re willing to listen.
- Jim Bergeson

Data can be likened to God: everything gets its validation from data. You cannot claim your property or belongings if you don’t have data to prove they are yours. You might even fail to prove your own identity in the absence of data. Whatever we do, buy, or sell creates data. Isn’t it amazing?

Nowadays, we need data as much as we need oxygen. Data has grown into one of the most important things in our lives. In the fields of Data Science and Artificial Intelligence, we must know our data. Let’s learn about it.

What is Data?

Data is a set of factual information, such as numbers, measurements, descriptions, or observations. It can take the form of numbers, text, images, audio, video, graphs, tables, patterns, etc. Companies analyze their clients' and customers' data to understand their behavior.

Why do we need Data?

Data can give us information and valuable insights about the behavior of a particular group of people, a community, or an organization. Big companies like Google, Facebook, and Amazon read our patterns through data to learn our needs, situations, moods, and lifestyles so that they can recommend products, music, and videos that suit our tastes.

Types of Data

Based on the format, Data can be categorized into two groups:
1. Structured Data
2. Unstructured Data

Structured Data

Data that has a predefined format is called structured data. Structured Data is typically stored in a relational database management system (RDBMS) and generally consists of numbers or text. It takes less time to process than Unstructured Data. Structured Data is of two types:
a. Qualitative Data
b. Quantitative Data
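
As a minimal sketch of what a "predefined format" looks like in practice, the snippet below stores a few records in an in-memory SQLite table using Python's built-in sqlite3 module. The table name customers, its columns, and the sample rows are all made up for illustration.

```python
import sqlite3

# An in-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema is the "predefined format": every row must fit these columns.
conn.execute(
    "CREATE TABLE customers (name TEXT, marital_status TEXT, age INTEGER)"
)

# Hypothetical sample rows, invented for this example.
rows = [
    ("Asha", "Married", 34),
    ("Ben", "Unmarried", 27),
    ("Chen", "Married", 41),
]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)

# Because the structure is known in advance, querying is straightforward.
married_count = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE marital_status = 'Married'"
).fetchone()[0]
average_age = conn.execute("SELECT AVG(age) FROM customers").fetchone()[0]

print(married_count, average_age)
conn.close()
```

Here marital_status is a qualitative column and age is a quantitative one, which are the two types described next.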

Qualitative Data

Qualitative Data is also known as Categorical Data. Qualitative Data represents the characteristics of an object; for example, gender, marital status, ranking, etc.

Based on the number of values they can take, categorical variables are classified into two kinds.
When a categorical variable has exactly two values, it is called a binary or dichotomous categorical variable; for example, Male/Female, True/False, Yes/No, etc. When it has more than two values, it is called a polytomous categorical variable; for example, Ranking: First/Second/Third, Marital Status: Married/Unmarried/Separated, etc.
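
The binary/polytomous distinction can be sketched with a small helper. The function name categorical_kind and the sample lists are hypothetical, and the helper only counts the values it actually observes, so a small sample may under-report the number of possible categories.

```python
def categorical_kind(values):
    """Label a categorical variable by how many distinct values it takes."""
    distinct = set(values)
    if len(distinct) == 2:
        return "binary"
    if len(distinct) > 2:
        return "polytomous"
    return "constant"  # zero or one distinct value: not a useful variable


answers = ["Yes", "No", "Yes", "Yes"]
marital_status = ["Married", "Unmarried", "Separated", "Married"]

print(categorical_kind(answers))         # binary
print(categorical_kind(marital_status))  # polytomous
```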

Based on the measurement scale, variables are further classified into the following kinds. The first two scales describe categorical data, while the last two describe numerical data:

Nominal: A categorical variable whose values have no meaningful order is called Nominal Data, e.g., Gender, Hair Color, Boolean values, Blood Types, etc.
NOTE: We can code nominal variables with numbers if we want, but the order is arbitrary, and any calculations such as computing a mean, a median, or a standard deviation would be meaningless.
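
A quick way to see why the mean is meaningless for nominal data is to code the same values in two different ways. In this sketch the blood type sample and both codings are invented for illustration.

```python
from collections import Counter
from statistics import mean

blood_types = ["A", "O", "B", "O", "AB", "O", "A"]

# Two equally valid numeric codings of the same categories.
coding_1 = {"A": 0, "B": 1, "AB": 2, "O": 3}
coding_2 = {"O": 0, "AB": 1, "B": 2, "A": 3}

mean_1 = mean(coding_1[b] for b in blood_types)
mean_2 = mean(coding_2[b] for b in blood_types)

# The mean depends on the arbitrary coding, so it tells us nothing.
print(mean_1, mean_2)

# The mode (most frequent category) does not depend on the coding.
most_common_type = Counter(blood_types).most_common(1)[0][0]
print(most_common_type)  # O
```

Counting categories and reporting the most frequent one stays valid no matter how the categories are coded.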

Ordinal: A categorical variable whose values have a meaningful order is called Ordinal Data, e.g., Ranking, Grades, Levels, Floors, etc.
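
For ordinal data the order has to be stated explicitly, because alphabetical order rarely matches the real ranking. The sketch below assumes a hypothetical four-letter grade scale where A is the best grade.

```python
# The order of the categories is part of the data, so state it explicitly.
GRADE_ORDER = ["D", "C", "B", "A"]  # lowest to highest; hypothetical scale
rank = {grade: position for position, grade in enumerate(GRADE_ORDER)}

grades = ["B", "A", "D", "C", "B"]

# Sorting by rank respects the scale; alphabetical sorting would reverse it.
sorted_grades = sorted(grades, key=rank.__getitem__)

# With an odd number of observations the median is the middle value.
median_grade = sorted_grades[len(sorted_grades) // 2]

print(sorted_grades)  # ['D', 'C', 'B', 'B', 'A']
print(median_grade)   # B
```

A median makes sense here because it relies only on order, not on the distance between grades.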

Interval: In interval scales, both the order and the exact differences between values are meaningful, but there is no true zero, e.g., temperature in Celsius, pH value, credit score.

Ratio: Ratio scales have order, exact differences, and an absolute zero, so ratios between values are meaningful. Hence they can be used for both descriptive and inferential statistics, e.g., density, speed, etc.
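
The difference between an interval scale and a ratio scale can be illustrated with temperature. This is a small sketch with made-up readings: Celsius is treated as an interval scale, and Kelvin, which has an absolute zero, as a ratio scale.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin, which has an absolute zero."""
    return celsius + 273.15


cool_c, warm_c = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference_c = warm_c - cool_c

# The ratio of Celsius readings suggests "twice as hot", which is misleading
# because 0 degrees Celsius is not the absence of temperature.
naive_ratio = warm_c / cool_c

# On the Kelvin (ratio) scale the ratio is physically meaningful.
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(difference_c)            # 10.0
print(naive_ratio)             # 2.0
print(round(kelvin_ratio, 3))  # 1.035
```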

Source: Hands-on Exploratory Data Analysis with Python by Suresh Kumar Mukhiya and Usman Ahmed.

Quantitative Data

Data that can be expressed as numbers and has a sense of measurement is called quantitative data. It is also known as numerical data. Based on the values, numerical data are categorized into two groups:

Discrete Data: Data that is countable is called discrete data. It can take both numerical and categorical values, depending on usage. A variable that represents a discrete dataset is called a discrete variable. Discrete data always has a fixed value at a particular time, e.g., age in completed years, number of students in a class, number of planets, etc.

Continuous Data: A variable that can take an infinite number of numerical values within a specific range is called a continuous variable, e.g., weight, height, etc.
NOTE: Percentage values are also continuous data.
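
As a rough illustration, counted values are naturally stored as integers and measured values as floating-point numbers. The class sizes and heights below are invented for the example.

```python
# Discrete: counted values. There is no such thing as 30.5 students.
students_per_class = [30, 28, 32]

# Continuous: measured values. Any value in a range is possible, and the
# precision is limited only by the measuring instrument.
heights_cm = [162.4, 175.05, 158.9]

total_students = sum(students_per_class)
average_height = sum(heights_cm) / len(heights_cm)

# A percentage is continuous even when it is derived from discrete counts.
share_of_first_class = 100 * students_per_class[0] / total_students

print(total_students)                  # 90
print(round(average_height, 2))        # 165.45
print(round(share_of_first_class, 2))  # 33.33
```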

Unstructured Data

Any data stored in its native format, without a predefined structure, is called unstructured data. It can be an image, audio, video, or chat message. Unstructured data needs extra preprocessing before it can be used for analysis.
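
One simple form of that preprocessing is turning free text into word counts. The sketch below is only one possible approach; the chat messages and the tokenize helper are hypothetical.

```python
import re
from collections import Counter

# Hypothetical chat messages: free text with no predefined fields.
messages = [
    "Great phone, great battery!",
    "Battery died after a week...",
    "Love the camera. Battery is OK.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


# Preprocessing step: turn raw text into word counts, a structured form.
word_counts = Counter(
    word for message in messages for word in tokenize(message)
)

print(word_counts["battery"])  # 3
print(word_counts["great"])    # 2
```

The resulting counts can be stored in rows and columns like any other structured data.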

Structured Data vs Unstructured Data

Collection of Data

Before we can analyze anything, we need to collect data. Data collection can be done in several ways; let’s see a few of them.

Primary Source Data: In this data collection technique, we generate the raw data ourselves, for example through online surveys, interviews, or observations. Using raw data has advantages and disadvantages.
Advantages:
1. We get exactly the information that is needed.
2. The data is reliable and original.
3. There are no permission issues.
4. We get current, fresh data.
Disadvantages:
1. Time-consuming.
2. Expensive.
3. Needs extra cleaning and modification before analysis.

Secondary Source Data: In this data collection technique, we use data that has already been collected and stored. We collect it from databases or open-source websites and use it for analysis.
Advantages:
1. Not time-consuming.
2. Easily accessible.
3. Generally formatted into tables.
Disadvantages:
1. May need permission to access.
2. Not always reliable.

Web Scraping: In this data collection technique, we extract data from web pages. With the help of a few libraries and a little knowledge of HTML, one can easily scrape data from websites. Web scraping is commonly used for review and comment analysis. Python libraries used for web scraping include requests, BeautifulSoup, Pandas, and Selenium.
NOTE: Not all websites permit web scraping. You need permission to scrape data from such websites or web pages. Data theft is a crime.
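
To show the idea without touching a real website, the sketch below parses a hard-coded HTML snippet with Python's built-in html.parser module instead of the third-party libraries named above. The page content, the review class name, and the ReviewParser class are all hypothetical. A real scraper would first download the page, and only with permission.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this instead.
PAGE = """
<html><body>
  <p class="review">Fast delivery.</p>
  <p class="ad">Buy now!</p>
  <p class="review">Stopped working after a month.</p>
</body></html>
"""


class ReviewParser(HTMLParser):
    """Collect the text of every <p class="review"> element."""

    def __init__(self):
        super().__init__()
        self.in_review = False
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and dict(attrs).get("class") == "review":
            self.in_review = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_review = False

    def handle_data(self, data):
        if self.in_review and data.strip():
            self.reviews.append(data.strip())


parser = ReviewParser()
parser.feed(PAGE)
print(parser.reviews)  # ['Fast delivery.', 'Stopped working after a month.']
```

The same pattern applies with the libraries mentioned earlier: fetch the HTML, locate the elements of interest, and keep only their text.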

File Formats

Data is stored in several formats. Let’s see some of the most commonly used data file formats.
1. CSV files: A comma-separated values file is a plain text file in which row values are separated with commas. Each line of the file is a data record, and commas separate the record into different fields. CSV files are commonly exported from spreadsheets and databases.
2. XLSX files: MS Excel workbooks are stored in xlsx format. In these files, values are stored in rows and columns. XLSX files are generally encountered when working with spreadsheets.
3. TXT files: Text files are stored in txt format. In this type of file, we store textual data.
4. JSON files: JSON stands for JavaScript Object Notation. It is a lightweight, text-based open standard designed for exchanging data over the web.
Other common file types include images, PDFs, HTML, etc.
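
The difference between CSV and JSON can be sketched with Python's built-in csv and json modules. The records below are made up, and the example writes to an in-memory buffer rather than to real files.

```python
import csv
import io
import json

# Hypothetical records used for both formats.
records = [
    {"name": "Asha", "age": 34},
    {"name": "Ben", "age": 27},
]

# CSV: one line per record, commas between fields.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["name", "age"])
writer.writeheader()
writer.writerows(records)
csv_text = csv_buffer.getvalue()

# Reading CSV back: every value comes back as a string.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: keeps basic types such as numbers and nested structures.
json_text = json.dumps(records)
json_rows = json.loads(json_text)

print(csv_rows[0]["age"], type(csv_rows[0]["age"]).__name__)
print(json_rows[0]["age"], type(json_rows[0]["age"]).__name__)
```

Note that the age read from CSV is text and needs converting before any arithmetic, while JSON preserves it as a number.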

Endnote

This is the basic understanding of data you need before learning Data Science. Understanding data makes a lot of our work easier. Thanks for staying with me this far.

References

Hands-on Exploratory Data Analysis with Python [Book] by Suresh Kumar Mukhiya and Usman Ahmed
Statistics for Machine Learning [Book] by Pratap Dangeti
Wikipedia pages on web scraping and data.