Data Science Dojo has added 30 datasets to its repository, which is freely available to AI enthusiasts. The repository covers a diverse range of themes, difficulty levels, sizes, and attribute counts. The datasets are categorized by difficulty level so that there is something suitable for everyone. They offer a way to challenge your knowledge and get hands-on practice in areas including, but not limited to, exploratory data analysis, data visualization, data wrangling, and machine learning.

The datasets below are sorted by increasing level of difficulty for convenience. We recommend testing yourself with all of the datasets provided. We’ve presented a challenging question with every dataset; however, feel free to use them in any way you wish.

**1) Find out the age of Abalone from physical measurements**

**Level:** Beginner
**Recommended Use:** Regression Models
**Domain:** Life Sciences

This beginner-level data set has 4177 rows and 9 columns. It contains physical measurements of abalones and the number of rings (representing age). Determining the age of an abalone directly is usually a tedious and time-consuming task. Physical measurements, which are easier to obtain, can therefore be used to predict the age.
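As a minimal sketch of the regression idea, the snippet below fits a straight line that predicts ring count from a single measurement using ordinary least squares. The toy values and the names `fit_line`, `lengths`, and `rings` are assumptions made for illustration; they are not taken from the data set, and a real model would likely use several measurements at once.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical toy values (NOT from the data set): one measurement and a ring count.
lengths = [0.2, 0.4, 0.6, 0.8]
rings = [5, 9, 13, 17]

slope, intercept = fit_line(lengths, rings)

# Predict the ring count for a new, unseen measurement.
predicted_rings = slope * 0.5 + intercept
print(round(predicted_rings, 2))
```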

**2) Predict student's knowledge level**

**Level:** Beginner
**Recommended Use:** Classification/Clustering
**Domain:** Education/Web

This data set has 403 rows and 6 columns. It is a real data set on students' knowledge status in the subject of Electrical DC Machines.

**3) Can you predict the price of a house?**

**Level:** Beginner
**Recommended Use:** Regression Models
**Domain:** Real Estate

This data set has 414 rows and 7 columns related to various attributes of a house. It provides historical market data on real estate valuations collected from Sindian Dist., New Taipei City, Taiwan.

**4) Can you estimate location from WiFi Signal Strength?**

**Level:** Beginner
**Recommended Use:** Classification Models
**Domain:** Mobile/Location

This beginner-level data set has 2000 rows and 8 columns. It contains WiFi signal strengths from 7 WiFi devices, observed on a smartphone in an indoor space, which can be used to estimate which of four rooms the phone is in.

**5) Predict acceptability of a car**

**Level:** Beginner
**Recommended Use:** Classification Models
**Domain:** Automobile

The data set has 1728 rows and 7 columns. Car characteristics such as price and technology are described across 6 attributes, including "Buying Price", "Maintenance", and "Safety". Each of the 6 attributes has multiple possible values. The car's acceptability, the seventh attribute, is the outcome variable.

**6) Predict seminal quality of an individual**

**Level:** Beginner
**Recommended Use:** Regression/Classification Models
**Domain:** Healthcare/Life

This data set has 10 attributes. It includes semen samples from 100 volunteers, analyzed according to the WHO 2010 criteria. It can be used to determine whether a diagnosis can be reached without a laboratory approach, which involves expensive tests that are sometimes uncomfortable for patients. The attributes in this data set can easily be collected with a questionnaire and used to estimate sperm concentration.

**7) Estimate chance of Bankruptcy from Qualitative parameters by experts**

**Level:** Beginner
**Recommended Use:** Classification Models
**Domain:** Finance/Banking

This data set has 250 rows and 7 columns. It contains 6 qualitative parameters, assessed by experts, which can be used to predict bankruptcy.

**8) Can you predict the fuel-efficiency of a car?**

**Level:** Intermediate
**Recommended Use:** Regression Models
**Domain:** Automobiles

This data set has 398 rows and 9 columns and provides mileage, horsepower, model year, and other technical specifications for cars.

**9) Was that chest pain an indicator of a heart disease?**

**Level:** Intermediate
**Recommended Use:** Classification Models
**Domain:** Health Sciences

This data set provides health examination data for 303 patients who presented with chest pain and might have been suffering from heart disease. It has 14 attributes that can be used to determine whether a patient was found to have heart disease.

**10) Predict total number of demand of orders**

**Level:** Intermediate
**Recommended Use:** Regression Models
**Domain:** Business

This intermediate-level data set has 60 rows and 13 columns. It was collected over 60 days and is a real database from a Brazilian logistics company. It has twelve predictive attributes and a target, which is the total number of orders for daily treatment.

**11) Find out if a donor will give blood in March 2007**

**Level:** Intermediate
**Recommended Use:** Classification Models
**Domain:** Business

This data set has 748 instances and 5 attributes. It comes from the donor database of the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. About every three months, the center sends its blood transfusion service bus to a university in Hsin-Chu City to collect donated blood.

**12) Forecast Pollution Level of a City**

**Level:** Intermediate
**Recommended Use:** Regression Models
**Domain:** Environment

This data set has 43824 rows and 13 columns. It contains PM2.5 data from the US Embassy in Beijing, along with meteorological data from Beijing Capital International Airport. The data set can be used to forecast pollution levels from the air quality attributes provided. It also offers experience in multivariate time series forecasting.
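One common way to approach multivariate time series forecasting is to turn the series into a supervised learning table of lagged features. The sketch below shows that idea under stated assumptions: the helper `make_lagged_rows`, the column names `pm25` and `temp`, and the toy readings are all hypothetical and do not reflect the actual column names or values in the data set.

```python
def make_lagged_rows(series, n_lags, target):
    """Build (features, target_value) pairs from a list of per-timestep dicts.

    For each timestep t, the features are every variable observed at
    t-1 ... t-n_lags, and the target is the value of `target` at t.
    """
    rows = []
    for t in range(n_lags, len(series)):
        features = {}
        for lag in range(1, n_lags + 1):
            for name, value in series[t - lag].items():
                features[f"{name}_lag{lag}"] = value
        rows.append((features, series[t][target]))
    return rows

# Hypothetical toy readings (NOT from the data set).
hourly = [
    {"pm25": 10.0, "temp": 1.0},
    {"pm25": 12.0, "temp": 2.0},
    {"pm25": 15.0, "temp": 3.0},
    {"pm25": 11.0, "temp": 4.0},
]

rows = make_lagged_rows(hourly, n_lags=2, target="pm25")
for features, target_value in rows:
    print(features, "->", target_value)
```

Each resulting row can then be fed to any regression model, with the lagged pollution and weather values as inputs and the current pollution reading as the output.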

**13) Will the patient survive for at least one year after a heart attack?**

**Level:** Intermediate
**Recommended Use:** Classification Models
**Domain:** Healthcare

This data set has 132 rows and 12 columns. It provides data that could be used for classifying if patients will survive for at least one year after a heart attack. All the patients suffered heart attacks at some point in the past. Some are still alive and some are not.

**14) Estimate compressive strength of concrete**

**Level:** Intermediate
**Recommended Use:** Regression Models
**Domain:** Civil Engineering/Construction

This data set has 1030 rows and 9 columns. Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. The actual concrete compressive strength (MPa) for a given mixture at a specific age (days) was determined in the laboratory.

**15) Discover patterns relating liver disorder and alcohol consumption**

**Level:** Intermediate
**Recommended Use:** Classification/Regression/Clustering Models
**Domain:** Healthcare

This data set has 345 rows and 7 columns. The dataset does not contain any variable representing presence or absence of a liver disorder. The first five columns represent the result of various blood tests which may be of use in diagnosing alcohol-related liver disorders. The sixth represents the number of alcoholic drinks taken per day by the subject, self-reported.

**16) Predict which stock will provide greatest rate of return**

**Level:** Intermediate
**Recommended Use:** Clustering/Regression/Classification Models
**Domain:** Business/Finance

This data set has 750 rows and 16 columns. It contains weekly data for the Dow Jones Industrial Index. It has been used in computational investing research. Each record is data for a week and has the percentage of return that stock has in the following week. Ideally, this could be used to determine which stock will produce the greatest rate of return in the following week.

**17) Assess heating and cooling load requirements of building**

**Level:** Intermediate
**Recommended Use:** Regression/Classification Models
**Domain:** Energy

This data set has 768 rows and 10 columns. It can be used for assessing the heating load and cooling load requirements of buildings (that is, energy efficiency) as a function of building parameters. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters.

**18) Determine the type of glass using oxide content**

**Level:** Intermediate
**Recommended Use:** Classification Models
**Domain:** Physical

This data set has 214 rows and 10 columns. It provides details about 6 types of glass, defined in terms of their oxide content (e.g. Na, Fe, K).

**19) Predict chance of survival**

**Level:** Intermediate
**Recommended Use:** Classification Models
**Domain:** Healthcare

This data set has 155 rows and 20 columns and provides various attributes of patients suffering from hepatitis. It could be used to predict a patient’s chance of survival or for other purposes.

**20) Find patterns from spending data at wholesale**

**Level:** Intermediate
**Recommended Use:** Classification/Clustering
**Domain:** Business/Retail

This data set has 440 rows and 8 columns. The data set refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories.

**21) Group similar travel reviews**

**Level:** Intermediate
**Recommended Use:** Clustering/Classification Models
**Domain:** Web

This data set, populated by crawling TripAdvisor.com, has 980 rows and 11 columns. It includes reviews of destinations across East Asia in 10 categories. Each traveler rating is mapped as Excellent (4), Very Good (3), Average (2), Poor (1), or Terrible (0), and the average rating per category is recorded for each user.
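The rating encoding described above can be sketched as follows. This is only an illustration of the mapping and per-user, per-category averaging: the function `average_ratings`, the user IDs, and the category names are hypothetical, and the published data set already contains the averaged values.

```python
from collections import defaultdict

# Label-to-score mapping as described in the text.
RATING_SCORES = {
    "Excellent": 4,
    "Very Good": 3,
    "Average": 2,
    "Poor": 1,
    "Terrible": 0,
}

def average_ratings(reviews):
    """Average the numeric score per (user, category) pair.

    `reviews` is an iterable of (user, category, rating_label) tuples.
    """
    totals = defaultdict(lambda: [0, 0])  # (user, category) -> [score sum, count]
    for user, category, label in reviews:
        entry = totals[(user, category)]
        entry[0] += RATING_SCORES[label]
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

# Hypothetical reviews (NOT from the data set).
reviews = [
    ("user_1", "museums", "Excellent"),
    ("user_1", "museums", "Average"),
    ("user_1", "parks", "Poor"),
]

averages = average_ratings(reviews)
print(averages)
```

With one averaged score per category for each user, every user becomes a fixed-length numeric vector, which is the shape clustering algorithms expect.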

**22) Relate returns of Istanbul Stock Exchange with other international indices**

**Level:** Intermediate
**Recommended Use:** Regression/Classification Models
**Domain:** Business/Finance

This data set has 536 rows and 9 columns. It includes returns of the Istanbul Stock Exchange alongside seven other international indices: SP, DAX, FTSE, NIKKEI, BOVESPA, MSCI_EU, and MSCI_EM. It can be used to find a predictive relationship between the ISE100 and other international stock market indices.

**23) Predict bike rental count (hourly/daily) based on the environmental & seasonal settings**

**Level:** Intermediate
**Recommended Use:** Regression Models
**Domain:** Social

This dataset, consisting of 17379 rows and 17 columns, contains the hourly and daily count of rental bikes between 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information. The bike-sharing rental process is highly correlated with environmental and seasonal settings.

**24) Detect Room Occupancy through Light, Temperature, Humidity and CO2 sensors**

**Level:** Intermediate
**Recommended Use:** Classification Models
**Domain:** Energy/Buildings

This data set has 20560 rows and 7 attributes. It provides experimental data used for binary classification (room occupancy of an office room) from Temperature, Humidity, Light and CO2. Ground-truth occupancy was obtained from time stamped pictures that were taken every minute.

**25) Estimate whether a person’s income exceeds $50K/year**

**Level:** Intermediate
**Recommended Use:** Classification Models
**Domain:** Social/Government

This data set was extracted from the census bureau database. It has 48842 instances and 15 attributes, which include age, sex, education level, and other relevant details of a person.

**26) Detect Autistic Spectrum Disorder (ASD) cases**

**Level:** Advanced
**Recommended Use:** Classification Models
**Domain:** Healthcare/Social Sciences

This advanced-level data set contains Autistic Spectrum Disorder (ASD) screening test data for 704 adults. It has 21 attributes, including the test takers' demographics and their answers to the 10 questions in the screening test. Each test taker's ASD status is recorded under the Class/ASD variable.

**27) Estimate the probability of Default**

**Level:** Advanced
**Recommended Use:** Classification Models
**Domain:** Business/Finance

This data set has 30000 rows and 24 columns. It can be used to estimate the probability of default payment by credit card clients.

**28) Predict if a note is genuine**

**Level:** Advanced
**Recommended Use:** Classification Models
**Domain:** Banking/Finance

This advanced-level data set has 1372 rows and 5 columns. Data were extracted from images of genuine and forged banknote-like specimens, taken for the evaluation of an authentication procedure for banknotes. These were later digitized. A Wavelet Transform tool was used to extract features from the images.

**29) Find a short term forecast on electricity consumption of a single home**

**Level:** Advanced
**Recommended Use:** Regression/Clustering Models
**Domain:** Electricity

This data set has 2075259 rows and 9 columns. This dataset provides measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.

**30) Predict the number of shares on social networks**

**Level:** Advanced
**Recommended Use:** Regression/Classification Models
**Domain:** Business/Web

This data set has 39644 rows and 61 columns. It summarizes a heterogeneous set of features about articles published by Mashable in a period of 2 years and could be used to predict the number of shares of an article in social networks.

This is a companion discussion topic for the original entry at https://blog.datasciencedojo.com/p/6b780a6b-b4cf-4e61-a2d6-e96fbebf7eb5/