Open Datasets

Discover open source datasets that get you going with training ML models

Open Source Datasets To Get You Started with AI/ML Models

The output of your AI & ML models is only as good as the data you use to train them, so the precision you apply to aggregating, tagging, and identifying that data is important!

If you want to start a new AI/ML initiative, you will quickly realize that finding high-quality training data is one of the more challenging aspects of the project: high-quality datasets are the fuel that keeps the AI/ML engine running. We have compiled a list of open datasets that are free to use for training your AI/ML models of the future.

Specialization	Data Type	Dataset Name	Industry / Dept.	Annotation/Use Case	Description	Link
NLP	Text	Amazon Reviews	E-commerce	Sentiment Analysis	A set of 35 Mn reviews & ratings collected over the last 18 years, in plain text with user and product details.	Link
NLP	Text	Wikipedia Links Data	General		More than 4 Mn. articles containing 1.9 Bn. words, comprising words and phrases as well as paragraphs.	Link
NLP	Text	Stanford Sentiment Treebank	Entertainment	Sentiment Analysis	Sentiment annotations for over 10,000 reviews from Rotten Tomatoes, in HTML file format.	Link
NLP	Text	Twitter US Airline Sentiment	Airline	Sentiment Analysis	Tweets from 2015 about US airlines, classified as positive, negative, or neutral in tone.	Link
CV	Image	ImageNet	General		Dataset with over 14 Mn. images in various file formats, organized according to the WordNet hierarchy.	Link
CV	Image	Google’s Open Images	General		9 Mn. URLs to public images categorized into over 6,000 categories.	Link
NLP	Text	MIMIC Critical Care Database	Healthcare		Computational Physiology Datasets with de-identified data from 40,000 critical care patients. The dataset contains information such as demographics, vital signs, medications, etc.	Link
CV	Image	US National Travel and Tourism Office	Tourism		Provides broad photographs from the tourism industry with trustworthy databases, covering topics such as inbound and outbound travel and international tourist info.	Link
NLP	Text	Department of Transportation	Tourism		Tourism datasets that include National Parks, driver registers, bridges, rail info, etc.	Link
NLP	Audio	Flickr Audio Caption Corpus	General		Over 40k spoken captions of 8,000 photographs, designed for unsupervised speech pattern discovery.	Link
NLP	Audio	Speech Commands Dataset	General	Speech Recognition, Audio Annotation	One-second-long utterances from thousands of individuals, for building basic voice interfaces.	Link
NLP	Audio	Environmental Audio Datasets	General		Environmental audio datasets that contain tables of sound events and tables of acoustic scenes.	Link
NLP	Text	COVID-19 Open Research Dataset	Healthcare	Medical AI	A research dataset consisting of 45,000 scholarly articles on COVID-19 & the coronavirus family of viruses.	Link
CV	Image	Waymo Open Dataset	Automotive		The most diverse autonomous driving datasets released by Waymo	Link
CV	Image	Labelme	Public Govt.		Large set of annotated images accessible through the Labelme Matlab toolbox.	Link
CV	Image	Stanford Dogs Dataset	General		Over 20,500 images categorized into 120 different dog breeds.	Link
CV	Image	Indoor Scene Recognition	General	Scene Recognition	A dataset of 15,620 images from 67 indoor categories for building scene recognition models.	Link
CV	Image	VisualQA	General		A dataset of open-ended questions about 265,016 photos; answering them requires an understanding of both vision and language.	Link
NLP	Text	Multidomain Sentiment Analysis Dataset	E-commerce	Sentiment Analysis	Dataset containing product reviews from Amazon	Link
NLP	Text	IMDB Reviews	Entertainment	Sentiment Analysis	Dataset containing 25,000 movie reviews for sentiment analysis.	Link
NLP	Text	Blogger Corpus	General	Keyphrase Analysis	Dataset containing 681,288 blog posts from blogger.com, each containing a minimum of 200 occurrences of widely used English words.	Link
NLP	Text	Jeopardy	General	Chatbot Training	Dataset with more than 200,000 questions that can be used to train machine learning models to respond automatically and intelligently.	Link
NLP	Text	SMS Spam Collection in English	Telecom	Spam Recognition	A spam message dataset consisting of 5,574 English SMS messages.	Link
NLP	Text	Yelp Reviews	General	Sentiment Analysis	A dataset with over 5 Mn reviews published by Yelp.	Link
NLP	Text	UCI’s Spambase	Enterprise	Spam Recognition	A large dataset of spam emails, useful for spam filtering.	Link
CV	Video, Image	Berkeley DeepDrive BDD100k	Automotive	Autonomous Vehicles	One of the largest datasets for self-driving AI, containing 1,100 hours of driving experience in over 100,000 videos recorded at different times of day in the New York and San Francisco areas.	Link
CV	Video	Comma.ai	Automotive	Autonomous Vehicles	A 7-hour highway driving dataset with information on the car’s speed, acceleration, steering angle, and GPS coordinates.	Link
CV	Video, Image	Cityscapes Dataset	Automotive	Semantic Label for Autonomous Vehicle	A dataset of 5,000 pixel-level annotations plus a larger set of 20,000 weakly annotated frames from stereo video sequences recorded in 50 different cities.	Link
CV	Image	KUL Belgium Traffic Sign Dataset	Automotive	Autonomous Vehicles	Over 10,000 traffic sign annotations from the Flanders region, based on physically distinct traffic signs from across Belgium.	Link
CV	Image	LISA: Laboratory for Intelligent & Safe Automobiles, UC San Diego Datasets	Automotive	Autonomous Vehicles	A rich dataset containing traffic signs, vehicles detection, traffic lights, and trajectory patterns.	Link
CV	Image	CIFAR-10	General	Object Recognition	A dataset consisting of 50,000 training images and 10,000 test images (i.e. 60,000 32×32 colour images in 10 classes) for object recognition.	Link
CV	Image	Fashion MNIST	Fashion		An image dataset that consists of a training set of 60,000 examples and a test set of 10,000 examples; each example is a 28×28 grayscale image associated with a label from 10 classes.	Link
CV	Image	IMDB-Wiki Dataset	Entertainment	Facial Recognition	A large dataset of facial images with labels such as gender and age. Out of the total 523,051 face images, 460,723 images are obtained from 20,284 celebrities from IMDB & 62,328 from Wikipedia.	Link
CV	Video	Kinetics-700	General		A high-quality dataset of 650,000 video clips covering 700 human action classes, with at least 600 video clips for each action class. Each clip lasts around 10 seconds.	Link
CV	Image	MS COCO	General	Object detection, Segmentation	The dataset contains 328k images with a total of 2.5 Mn labeled instances across 91 object types, for training large-scale object detection, segmentation, and captioning ML models.	Link
CV	Image	MPII Human Pose Dataset	General		Around 25K photographs containing over 40K individuals with annotated body joints, used for evaluating articulated human pose estimation. Overall the dataset covers 410 human activities, and each image is provided with an activity label.	Link
CV	Image	Open Images	General	Object location annotations	Image dataset with around 9 Mn images annotated with image-level labels, object bounding boxes, object segmentation etc. The dataset also consists of 16 Mn. bounding boxes for 600 object classes on 1.9 Mn images.	Link
CV	Video, Image	Argo, by Argo, USA	Automotive	Bounding Box, Optical Flow, Behavioral Label, Semantic Label, Lane Marking	A self-driving dataset that consists of HD maps with geometric & semantic metadata i.e. lane centerlines, lane direction, & driveable area. The dataset is used to train ML models, to make more accurate perception algorithms, that will help self-driving vehicles navigate safely.	Link
CV	Video	Bosch Small Traffic Lights, by Bosch North America Research	Automotive	Bounding Box	A dataset consisting of 13,427 camera images at 1280×720 resolution for building vision-based traffic light detection systems. The dataset has more than 24,000 annotated traffic lights.	Link
CV	Video	Brain4Cars, by Cornell Univ., United States	Automotive	Behavioral Label	A dataset collected from an array of cabin sensors (cameras, tactile sensors, smart devices, etc.) in order to extract useful statistics about driver alertness. Algorithms built on it may detect drivers who are drowsy or distracted and raise the necessary alarms to improve safety.	Link
CV	Image	CULane, by Chinese Univ. of Hong Kong, Beijing, China	Automotive	Lane Marking	A computer vision dataset for traffic lane detection, consisting of 55 hours of video from which 133,235 frames were extracted (88,880 for the training set, 9,675 for the validation set, and 34,680 for the test set). It was collected by cameras mounted on six different vehicles driven by different drivers in Beijing.	Link
CV	Video	DAVIS, by Univ. of Zurich, ETH Zürich, Germany, Switzerland	Automotive		An end-to-end vehicle driving training dataset that uses a DAVIS event+frame camera. Car data such as steering, throttle, GPS, etc. are used to evaluate the fusion of frame and event data for automotive apps.	Link
CV	Video	DBNet, by Shanghai Jiao Tong Univ., Xiamen Univ., China	Automotive	Point Cloud, LiDAR	Real-world driving data covering 1,000 km, including aligned video, point cloud, GPS, and driver behavior for in-depth research on driving behaviors.	Link
CV	Video	Dr(eye)ve, by Univ. of Modena and Reggio Emilia, Modena, Italy	Automotive	Behavioral Label	Dataset containing 74 video sequences of 5 minutes each, annotated in more than 500,000 frames. The dataset consists of geo-referenced locations, driving speed, and course, and also labels drivers’ gaze fixations and their temporal integration, providing task-specific maps.	Link
CV	Video	ETH Pedestrian (2009), by ETH Zurich, Zurich, Switzerland	General	Bounding Box	A dataset of 74 video sequences of 5 minutes each, annotated in more than 500,000 frames. The dataset provides geo-referenced positions, driving speed, direction, and also labels gaze fixations for drivers and their temporal integration, including task-specific maps.	Link
CV	Video	Ford (2009), by Univ. of Michigan, Michigan, US	Automotive	Bounding Box, LiDAR	A dataset collected by an autonomous ground vehicle equipped with a Velodyne 3D-lidar scanner, two push-broom forward-looking Riegl lidars, a professional and a consumer Inertial Measurement Unit (IMU), and a Point Grey Ladybug3 omnidirectional camera system.	Link
CV	Video	HCI Challenging Stereo, Bosch Corporation Research, Hildesheim, Germany	General		A dataset of several million frames from captured video scenes that cover a wide range of weather conditions, multiple layers of motion and depth, situations in the city and countryside, etc.	Link
CV	Video	JAAD, by York University, Ukraine, Canada	Automotive	Bounding Box, Behavioral Label	JAAD is a dataset for studying joint attention in the context of autonomous driving. The focus is on pedestrian and driver behaviors at the point of crossing and factors that influence them. To this end, JAAD dataset provides a richly annotated collection of 346 short video clips (5-10 sec long) extracted from over 240 hours of driving footage from several locations in North America and Eastern Europe. Bounding boxes with occlusion tags are used for all pedestrians making this dataset suitable for pedestrian detection. Behavior annotations specify behaviors for pedestrians that interact with or require attention of the driver. For each video there are several tags (weather, locations, etc.) and timestamped behavior labels (e.g. stopped, walking, looking, etc.). In addition, a list of demographic attributes is provided for each pedestrian (e.g. age, gender, direction of motion, etc.) as well as a list of visible traffic scene elements (e.g. stop sign, traffic signal, etc.) in each frame.	Link
CV	Image	LISA Traffic Sign, by Univ. of California, San Diego, United States	Automotive	Bounding Box	A set of videos and annotated frames containing US traffic signs. It is released in two stages, one with only the pictures and one with both pictures and videos.	Link
CV	Image	Mapillary Vistas, by Mapillary AB, Global	Automotive	Semantic Label	A street-level photography dataset for interpreting street scenes around the world with pixel-accurate and instance-specific human annotations.	Link
CV	Video, Image	Semantic KITTI, by University of Bonn, Karlsruhe, Germany	Automotive	Bounding Box, Semantic Label, Lane Marking	A dataset that includes a semantic annotation for all Odometry Benchmark sequences. The dataset annotates various types of moving and non-moving traffic: including cars, bikes, bicycles, pedestrians, and bicyclists, allowing objects in the scene to be studied.	Link
CV	Video	Stanford Track, by Stanford Univ., United States	Automotive	Object Detection / Classification LiDAR, GPS, Codes	A dataset that includes 14,000 labeled object tracks as observed by a Velodyne HDL-64E S2 LIDAR in natural street scenes, which can be used to train machine learning models for 3D Object Recognition.	Link
CV	Video, Image	The Boxy Dataset, by Bosch, United States	Automotive	Bounding Box / Vehicle Detection	A vehicle detection data set containing 2 million annotated vehicles for training and analyzing object recognition strategies for self-driving cars on motorways.	Link
CV	Video	TME Motorway, by Czech Technical Univ., Northern Italy	Automotive	Bounding Box	A dataset of 28 clips totaling 27 minutes, split into more than 30,000 frames with vehicle annotations. Annotation was produced semi-automatically using the data from the laser scanner. The collection covers variable traffic scenarios, numbers of lanes, road curvature, and illumination, covering much of the conditions of the full acquisition.	Link
CV	Video	Unsupervised Llamas, by Bosch, United States	Automotive	Lane Marking, LiDAR	The Unsupervised Llamas dataset was annotated by generating high-definition automatic driving maps, including Lidar-based lane markers. The autonomous vehicle can be aligned against these maps and the lane markings are projected into the camera frame. The 3D projection is optimized by minimizing the discrepancy between already observed and predicted image markers.	Link
NLP	Audio	Facebook AI Multilingual LibriSpeech (MLS)	General	Audio Annotation / Speech Recognition	Facebook AI Multilingual LibriSpeech (MLS) is a large-scale, open source dataset designed to help advance research in automatic speech recognition (ASR). MLS provides more than 50,000 hours of audio across 8 languages: English, German, Dutch, French, Spanish, Italian, Portuguese, and Polish.	Link
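
If you save the table above as tab-separated text, a few lines of Python are enough to shortlist datasets for a project. The sketch below is only an illustration: the `CATALOG_TSV` sample copies five rows and five columns from the table, and the helper names `load_catalog` and `find_datasets` are hypothetical, not part of any published tool.

```python
import csv
import io

# A small sample of the table above (assumption: exported as tab-separated
# text with the same column headers; descriptions and links are omitted).
CATALOG_TSV = (
    "Specialization\tData Type\tDataset Name\tIndustry / Dept.\tAnnotation/Use Case\n"
    "NLP\tText\tIMDB Reviews\tEntertainment\tSentiment Analysis\n"
    "NLP\tText\tSMS Spam Collection in English\tTelecom\tSpam Recognition\n"
    "CV\tImage\tCIFAR-10\tGeneral\tObject Recognition\n"
    "CV\tVideo\tComma.ai\tAutomotive\tAutonomous Vehicles\n"
    "NLP\tAudio\tSpeech Commands Dataset\tGeneral\tSpeech Recognition, Audio Annotation\n"
)


def load_catalog(tsv_text):
    """Parse tab-separated catalog text into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return list(reader)


def find_datasets(rows, specialization=None, data_type=None, use_case=None):
    """Return dataset names matching every filter that is provided."""
    matches = []
    for row in rows:
        # Specialization must match exactly (e.g. "NLP" or "CV").
        if specialization and row["Specialization"] != specialization:
            continue
        # Data type is a substring match so "Video" also finds "Video, Image".
        if data_type and data_type not in row["Data Type"]:
            continue
        # Use case is a case-insensitive substring match.
        if use_case and use_case.lower() not in row["Annotation/Use Case"].lower():
            continue
        matches.append(row["Dataset Name"])
    return matches


rows = load_catalog(CATALOG_TSV)
print(find_datasets(rows, specialization="NLP", use_case="sentiment"))
# prints ['IMDB Reviews']
```

Substring matching on the data type and use case columns is a deliberate choice here, because several rows in the table list more than one value in a single cell.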
