Open Datasets
Discover open-source datasets to get you started training ML models
Open Source Datasets To Get You Started with AI/ML Models
The output of your AI and ML models is only as good as the data you use to train them, so the care you put into aggregating, tagging, and identifying that data is important!
If you are starting a new AI/ML initiative, you will quickly realize that finding high-quality training data is one of the more challenging aspects of the project, because high-quality datasets are the fuel that keeps the AI/ML engine running. We have compiled a list of open datasets that are free to use for training your AI/ML models.
Specialization | Data Type | Dataset Name | Industry / Dept. | Annotation/Use Case | Description | Link |
---|---|---|---|---|---|---|
NLP | Text | Amazon Reviews | E-commerce | Sentiment Analysis | A set of 35 Mn. reviews and ratings from the last 18 years in plain text, with user and product details. | Link |
NLP | Text | Wikipedia Links Data | General | | More than 4 Mn. articles containing 1.9 Bn. words, comprising words and phrases as well as paragraphs. | Link |
NLP | Text | Stanford Sentiment Treebank | Entertainment | Sentiment Analysis | Sentiment annotations for over 10,000 reviews from Rotten Tomatoes in HTML file format | Link |
NLP | Text | Twitter US Airline Sentiment | Airline | Sentiment Analysis | 2015 tweets about US airlines classified into positive, negative, and neutral tones | Link |
CV | Image | Labeled Faces In The Wild | General | Facial Recognition | Dataset containing over 13,000 cropped faces with two different pictures for facial recognition training. | Link |
CV | Video, Image | UMDFaces Dataset | General | Facial Recognition | Annotated dataset containing over 367,000 faces from over 8,000 subjects, including still and video images. | Link |
CV | Image | ImageNet | General | | Dataset with over 14 Mn. images in various file formats, organized according to the WordNet hierarchy. | Link |
CV | Image | Google’s Open Images | General | | 9 Mn. URLs to public images labeled across more than 6,000 categories. | Link |
NLP | Text | MIMIC Critical Care Database | Healthcare | | Computational physiology datasets with de-identified data from 40,000 critical care patients. The dataset contains information such as demographics, vital signs, medications, etc. | Link |
CV | Image | US National Travel and Tourism Office | Tourism | | Broad tourism-industry photographs and trustworthy databases covering topics such as inbound and outbound travel and international tourist information. | Link |
NLP | Text | Department of Transportation | Tourism | | Tourism datasets that include national parks, driver registers, bridges, rail information, etc. | Link |
NLP | Audio | Flickr Audio Caption Corpus | General | | Over 40k spoken captions of 8,000 photographs, designed for unsupervised speech pattern discovery | Link |
NLP | Audio | Speech Commands Dataset | General | Speech Recognition, Audio Annotation | One-second-long utterances from thousands of individuals, for building basic voice interfaces. | Link |
NLP | Audio | Environmental Audio Datasets | General | | Environmental audio datasets, listed in tables of sound-event datasets and acoustic-scene datasets. | Link |
NLP | Text | COVID-19 Open Research Dataset | Healthcare | Medical AI | A research dataset consisting of 45,000 scholarly articles on COVID-19 & the coronavirus family of viruses. | Link |
CV | Image | Waymo Open Dataset | Automotive | | A highly diverse set of autonomous driving datasets released by Waymo | Link |
CV | Image | Labelme | Public Govt. | | Large set of annotated images accessible through the LabelMe Matlab toolbox | Link |
CV | Image | COIL100 | General | | 100 varied objects photographed from multiple angles across a full 360-degree rotation | Link |
CV | Image | Stanford Dogs Dataset | General | | Over 20,500 images categorized into 120 different dog breeds | Link |
CV | Image | Indoor Scene Recognition | General | Scene Recognition | A dataset consisting of 15,620 images from 67 indoor categories for building scene recognition models | Link |
CV | Image | VisualQA | General | | A dataset of open-ended questions about 265,016 photos that require an understanding of both vision and language to answer. | Link |
NLP | Text | Multidomain Sentiment Analysis Dataset | E-commerce | Sentiment Analysis | Dataset containing product reviews from Amazon | Link |
NLP | Text | IMDB Reviews | Entertainment | Sentiment Analysis | Dataset containing 25,000 movie reviews for sentiment analysis | Link |
NLP | Text | Blogger Corpus | General | Keyphrase Analysis | Dataset containing 681,288 blog posts from blogger.com, each with a minimum of 200 occurrences of widely used English words. | Link |
NLP | Text | Jeopardy | General | Chatbot Training | Dataset with more than 200,000 questions that can be used to train machine learning models to intelligently auto respond | Link |
NLP | Text | SMS Spam Collection in English | Telecom | Spam Recognition | A spam message dataset consisting of 5,574 English SMS messages | Link |
NLP | Text | Yelp Reviews | General | Sentiment Analysis | A dataset with over 5 Mn. reviews published by Yelp | Link |
NLP | Text | UCI’s Spambase | Enterprise | Spam Recognition | A large dataset of spam emails, useful for spam filtering. | Link |
CV | Video, Image | Berkeley DeepDrive BDD100k | Automotive | Autonomous Vehicles | One of the largest datasets for self-driving AI, containing 1,100 hours of driving experience in over 100,000 videos recorded at different times of day in the New York and San Francisco areas. | Link |
CV | Video | Comma.ai | Automotive | Autonomous Vehicles | A 7-hour highway driving dataset that includes information on the car’s speed, acceleration, steering angle, and GPS coordinates | Link |
CV | Video, Image | Cityscapes Dataset | Automotive | Semantic Label for Autonomous Vehicles | A dataset of 5,000 pixel-level annotations plus a larger set of 20,000 weakly annotated frames in stereo video sequences, recorded in 50 different cities | Link |
CV | Image | KUL Belgium Traffic Sign Dataset | Automotive | Autonomous Vehicles | Over 10,000 traffic sign annotations from the Flanders region of Belgium, based on physically distinct traffic signs. | Link |
CV | Image | LISA: Laboratory for Intelligent & Safe Automobiles, UC San Diego Datasets | Automotive | Autonomous Vehicles | A rich dataset containing traffic signs, vehicle detection, traffic lights, and trajectory patterns. | Link |
CV | Image | CIFAR-10 | General | Object Recognition | A dataset consisting of 50,000 training images and 10,000 test images (i.e. 60,000 32×32 colour images in 10 classes) for object recognition. | Link |
CV | Image | Fashion MNIST | Fashion | | An image dataset with a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28×28 grayscale image associated with a label from 10 classes. | Link |
CV | Image | IMDB-Wiki Dataset | Entertainment | Facial Recognition | A large dataset of facial images with labels such as gender and age. Out of the total 523,051 face images, 460,723 images are obtained from 20,284 celebrities from IMDB & 62,328 from Wikipedia. | Link |
CV | Video | Kinetics-700 | General | | A high-quality dataset of around 650,000 video clips covering 700 human action classes, with at least 600 video clips for each action class. Each clip lasts 10 seconds or so. | Link |
CV | Image | MS Coco | General | Object detection, Segmentation | The dataset contains 328k images with a total of 2.5 Mn. labeled instances across 91 object categories, for training large-scale object detection, segmentation, and image captioning ML models. | Link |
CV | Image | MPII Human Pose Dataset | General | | Around 25K photographs containing over 40K individuals with annotated body joints, used for articulated human pose estimation. Overall the dataset covers 410 human activities, and each image is provided with an activity label. | Link |
CV | Image | Open Images | General | Object location annotations | Image dataset with around 9 Mn images annotated with image-level labels, object bounding boxes, object segmentation etc. The dataset also consists of 16 Mn. bounding boxes for 600 object classes on 1.9 Mn images. | Link |
CV | Video, Image | Argo, by Argo, USA | Automotive | Bounding Box, Optical Flow, Behavioral Label, Semantic Label, Lane Marking | A self-driving dataset that consists of HD maps with geometric & semantic metadata i.e. lane centerlines, lane direction, & driveable area. The dataset is used to train ML models, to make more accurate perception algorithms, that will help self-driving vehicles navigate safely. | Link |
CV | Video | Bosch Small Traffic Lights, by Bosch North America Research | Automotive | Bounding Box | A dataset consisting of 13,427 camera images at 1280×720 resolution for building vision-based traffic light detection systems. The dataset has more than 24,000 annotated traffic lights. | Link |
CV | Video | Brain4Cars, by Cornell Univ., United States | Automotive | Behavioral Label | A dataset collected from an array of cabin sensors (cameras, tactile sensors, smart devices, etc.) to extract useful statistics about driver alertness. Algorithms trained on it may detect drivers who are drowsy or distracted and raise the necessary alarms to improve safety. | Link |
CV | Image | CULane, by Chinese Univ. of Hong Kong, Beijing, China | Automotive | Lane Marking | A computer vision dataset for traffic lane detection, consisting of 55 hours of video from which 133,235 frames were extracted (88,880 for training, 9,675 for validation, and 34,680 for testing). It was collected by cameras mounted on six different vehicles driven by different drivers in Beijing. | Link |
CV | Video | DAVIS, by Univ. of Zurich, ETH Zürich, Germany, Switzerland | Automotive | | An end-to-end vehicle driving training dataset recorded with a DAVIS event+frame camera. Car data such as steering, throttle, GPS, etc. are used to evaluate the fusion of frame and event data for automotive applications. | Link |
CV | Video | DBNet, by Shanghai Jiao Tong Univ., Xiamen Univ., China | Automotive | Point Cloud, LiDAR | Real-world driving data covering 1,000 km, including aligned video, point clouds, GPS, and driver behavior, for in-depth research on driving behaviors. | Link |
CV | Video | Dr(eye)ve, by Univ. of Modena and Reggio Emilia, Modena, Italy | Automotive | Behavioral Label | Dataset containing 74 video sequences of 5 minutes each, annotated across more than 500,000 frames. The dataset consists of geo-referenced locations, driving speed, and course, and also labels drivers' gaze fixations and their temporal integration, providing task-specific maps. | Link |
CV | Video | ETH Pedestrian (2009), by ETH Zurich, Zurich, Switzerland | General | Bounding Box | A pedestrian detection and tracking dataset of video sequences recorded from a moving stereo camera rig in busy urban scenes, with bounding box annotations for pedestrians. | Link |
CV | Video | Ford (2009), by Univ. of Michigan, Michigan, US | Automotive | Bounding Box, LiDAR | A dataset collected by an autonomous ground vehicle equipped with a Velodyne 3D-lidar scanner, two push-broom forward-looking Riegl lidars, a professional and a consumer Inertial Measurement Unit (IMU), and a Point Grey Ladybug3 omnidirectional camera system. | Link |
CV | Video | HCI Challenging Stereo, Bosch Corporation Research, Hildesheim, Germany | General | | A dataset of several million frames from captured video scenes that cover a wide range of weather conditions, multiple layers of motion and depth, situations in the city and countryside, etc. | Link |
CV | Video | JAAD, by York University, Ukraine, Canada | Automotive | Bounding Box, Behavioral Label | "JAAD is a dataset for studying joint attention in the context of autonomous driving. The focus is on pedestrian and driver behaviors at the point of crossing and factors that influence them. To this end, JAAD dataset provides a richly annotated collection of 346 short video clips (5-10 sec long) extracted from over 240 hours of driving footage from several locations in North America and Eastern Europe. Bounding boxes with occlusion tags are used for all pedestrians making this dataset suitable for pedestrian detection. Behavior annotations specify behaviors for pedestrians that interact with or require attention of the driver. For each video there are several tags (weather, locations, etc.) and timestamped behavior labels (e.g. stopped, walking, looking, etc.). In addition, a list of demographic attributes is provided for each pedestrian (e.g. age, gender, direction of motion, etc.) as well as a list of visible traffic scene elements (e.g. stop sign, traffic signal, etc.) in each frame." | Link |
CV | Image | LISA Traffic Sign, by Univ. of California, San Diego, United States | Automotive | Bounding Box | A set of videos and annotated frames containing US traffic signs. It was released in two stages, one with only the pictures and one with both pictures and videos. | Link |
CV | Image | Mapillary Vistas, by Mapillary AB, Global | Automotive | Semantic Label | A street-level photography dataset for interpreting street scenes around the world with pixel-accurate and instance-specific human annotations. | Link |
CV | Video, Image | Semantic KITTI, by University of Bonn, Karlsruhe, Germany | Automotive | Bounding Box, Semantic Label, Lane Marking | A dataset that includes a semantic annotation for all Odometry Benchmark sequences. The dataset annotates various types of moving and non-moving traffic: including cars, bikes, bicycles, pedestrians, and bicyclists, allowing objects in the scene to be studied. | Link |
CV | Video | Stanford Track, by Stanford Univ., United States | Automotive | Object Detection / Classification LiDAR, GPS, Codes | A dataset that includes 14,000 labeled object tracks as observed by a Velodyne HDL-64E S2 LIDAR in natural street scenes, which can be used to train machine learning models for 3D Object Recognition. | Link |
CV | Video, Image | The Boxy Dataset, by Bosch, United States | Automotive | Bounding Box / Vehicle Detection | A vehicle detection data set containing 2 million annotated vehicles for training and analyzing object recognition strategies for self-driving cars on motorways. | Link |
CV | Video | TME Motorway, by Czech Technical Univ., Northern Italy | Automotive | Bounding Box | A dataset of 28 clips totaling 27 minutes, split into more than 30,000 frames with vehicle annotations. Annotation was produced semi-automatically using data from the laser scanner. The collection covers variable traffic scenarios, numbers of lanes, road curvature, and illumination, representing most of the conditions in the full acquisition. | Link |
CV | Video | Unsupervised Llamas, by Bosch, United States | Automotive | Lane Marking, LiDAR | The Unsupervised Llamas dataset was annotated by generating high-definition automatic driving maps, including Lidar-based lane markers. The autonomous vehicle can be aligned against these maps and the lane markings are projected into the camera frame. The 3D projection is optimized by minimizing the discrepancy between already observed and predicted image markers. | Link |
NLP | Audio | Facebook AI Multilingual LibriSpeech (MLS) | General | Audio Annotation / Speech Recognition | Facebook AI Multilingual LibriSpeech (MLS) is a large-scale, open-source dataset designed to help advance research in automatic speech recognition (ASR). MLS provides more than 50,000 hours of audio across 8 languages: English, German, Dutch, French, Spanish, Italian, Portuguese, and Polish. | Link |
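
Several rows in the table quote split sizes alongside a total (CIFAR-10, CULane, IMDB-Wiki). Before relying on such figures, it can help to confirm that the parts add up and to see what share of the data each split represents. The following is a minimal sketch of that check in Python. The numbers are copied from the table; the `SPLITS` dictionary and the `split_fractions` helper are illustrative names invented here and are not part of any dataset's own tooling.

```python
# Split sizes and totals as quoted in the table above.
# The dictionary layout and the helper name are hypothetical, for illustration only.
SPLITS = {
    "CIFAR-10": {"train": 50_000, "test": 10_000, "total": 60_000},
    "CULane": {"train": 88_880, "validation": 9_675, "test": 34_680, "total": 133_235},
    "IMDB-Wiki": {"imdb": 460_723, "wikipedia": 62_328, "total": 523_051},
}


def split_fractions(name):
    """Return each part's share of the quoted total, after checking the parts add up."""
    entry = SPLITS[name]
    total = entry["total"]
    parts = {key: value for key, value in entry.items() if key != "total"}
    if sum(parts.values()) != total:
        raise ValueError(f"{name}: parts sum to {sum(parts.values())}, not {total}")
    return {key: value / total for key, value in parts.items()}


if __name__ == "__main__":
    for dataset in SPLITS:
        shares = split_fractions(dataset)
        summary = ", ".join(f"{part} {share:.1%}" for part, share in shares.items())
        print(f"{dataset}: {summary}")
```

Running the script prints one line per dataset with the percentage of the data in each part. If a quoted total does not match its parts, the helper raises a `ValueError` instead of silently returning misleading shares.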