Open Datasets

Discover open source datasets to get you started training ML models

Open Source Datasets To Get You Started with AI/ML Models

The output of your AI and ML models is only as good as the data you use to train them, so the precision you apply to aggregating, tagging, and identifying that data matters.

If you are starting a new AI/ML initiative, you will quickly realize that finding high-quality training data is one of the more challenging aspects of the project; high-quality datasets are the fuel that keeps the AI/ML engine running. We have compiled a list of open datasets that are free to use for training your AI/ML models.

| Specialization | Data Type | Dataset Name | Industry / Dept. | Annotation / Use Case | Description | Link |
| --- | --- | --- | --- | --- | --- | --- |
| NLP | Text | Amazon Reviews | E-commerce | Sentiment Analysis | A set of 35 Mn reviews and ratings spanning 18 years, in plain text with user and product details. | Link |
| NLP | Text | Wikipedia Links Data | General | | More than 4 Mn articles containing 1.9 Bn words, comprising words and phrases as well as paragraphs. | Link |
| NLP | Text | Stanford Sentiment Treebank | Entertainment | Sentiment Analysis | Sentiment annotations for over 10,000 reviews from Rotten Tomatoes, in HTML file format. | Link |
| NLP | Text | Twitter US Airline Sentiment | Airline | Sentiment Analysis | Tweets from 2015 about US airlines, classified into positive, negative, and neutral tones. | Link |
| CV | Image | Labeled Faces In The Wild | General | Facial Recognition | Dataset of over 13,000 cropped face images, including subjects with two or more different pictures, for facial recognition training. | Link |
| CV | Video, Image | UMDFaces Dataset | General | Facial Recognition | Annotated dataset containing over 367,000 faces from over 8,000 subjects, including still and video images. | Link |
| CV | Image | ImageNet | General | | Dataset with over 14 Mn images in various file formats, organized according to the WordNet hierarchy. | Link |
| CV | Image | Google's Open Images | General | | About 9 Mn URLs to public images categorized into over 6,000 categories. | Link |
| NLP | Text | MIMIC Critical Care Database | Healthcare | Computational Physiology | Datasets with de-identified data from 40,000 critical care patients. The data includes demographics, vital signs, medications, etc. | Link |
| CV | Image | US National Travel and Tourism Office | Tourism | | Provides broad photographs from the tourism industry with trustworthy databases, covering topics such as inbound and outbound travel and international tourist information. | Link |
| NLP | Text | Department of Transportation | Tourism | | Tourism datasets that include national parks, driver registers, bridges, rail information, etc. | Link |
| NLP | Audio | Flickr Audio Caption Corpus | General | | Over 40k spoken captions for 8,000 photographs, designed for unsupervised speech pattern discovery. | Link |
| NLP | Audio | Speech Commands Dataset | General | Speech Recognition, Audio Annotation | One-second-long utterances from thousands of individuals, for building basic voice interfaces. | Link |
| NLP | Audio | Environmental Audio Datasets | General | | Environmental audio datasets that contain sound event tables and acoustic scene tables. | Link |
| NLP | Text | COVID-19 Open Research Dataset | Healthcare | Medical AI | A research dataset consisting of 45,000 scholarly articles on COVID-19 and the coronavirus family of viruses. | Link |
| CV | Image | Waymo Open Dataset | Automotive | | Diverse autonomous driving datasets released by Waymo. | Link |
| CV | Image | LabelMe | Public Govt. | | Large set of annotated images accessible through the LabelMe Matlab toolbox. | Link |
| CV | Image | Stanford Dogs Dataset | General | | Over 20,500 images categorized into 120 different dog breeds. | Link |
| CV | Image | Indoor Scene Recognition | General | Scene Recognition | A dataset of 15,620 images from 67 indoor categories for building scene recognition models. | Link |
| CV | Image | VisualQA | General | | A dataset of open-ended questions about 265,016 images; answering them requires both visual and language understanding. | Link |
| NLP | Text | Multidomain Sentiment Analysis Dataset | E-commerce | Sentiment Analysis | Dataset containing product reviews from Amazon. | Link |
| NLP | Text | IMDB Reviews | Entertainment | Sentiment Analysis | Dataset containing 25,000 movie reviews for sentiment analysis. | Link |
| NLP | Text | Blogger Corpus | General | Keyphrase Analysis | Dataset containing 681,288 blog posts from blogger.com, each with at least 200 occurrences of commonly used English words. | Link |
| NLP | Text | Jeopardy | General | Chatbot Training | Dataset with more than 200,000 questions that can be used to train machine learning models to respond automatically. | Link |
| NLP | Text | SMS Spam Collection in English | Telecom | Spam Recognition | A spam message dataset consisting of 5,574 English SMS messages. | Link |
| NLP | Text | Yelp Reviews | General | Sentiment Analysis | A dataset with over 5 Mn reviews published by Yelp. | Link |
| NLP | Text | UCI's Spambase | Enterprise | Spam Recognition | A large dataset of spam emails, useful for spam filtering. | Link |
| CV | Video, Image | Berkeley DeepDrive BDD100k | Automotive | Autonomous Vehicles | One of the largest datasets for self-driving AI, containing 1,100 hours of driving experience in over 100,000 videos recorded at different times of day in the New York and San Francisco areas. | Link |
| CV | Video | Comma.ai | Automotive | Autonomous Vehicles | A 7-hour highway driving dataset with information on the car's speed, acceleration, steering angle, and GPS coordinates. | Link |
| CV | Video, Image | Cityscapes Dataset | Automotive | Semantic Label for Autonomous Vehicle | A dataset of 5,000 frames with pixel-level annotations plus a larger set of 20,000 weakly annotated frames from stereo video sequences, recorded in 50 different cities. | Link |
| CV | Image | KUL Belgium Traffic Sign Dataset | Automotive | Autonomous Vehicles | Over 10,000 traffic sign annotations from the Flanders region, based on physically distinct traffic signs from across Belgium. | Link |
| CV | Image | LISA: Laboratory for Intelligent & Safe Automobiles, UC San Diego Datasets | Automotive | Autonomous Vehicles | A rich dataset covering traffic signs, vehicle detection, traffic lights, and trajectory patterns. | Link |
| CV | Image | CIFAR-10 | General | Object Recognition | A dataset consisting of 50,000 training images and 10,000 test images (i.e. 60,000 32×32 colour images in 10 classes) for object recognition. | Link |
| CV | Image | Fashion MNIST | Fashion | | An image dataset with a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28×28 grayscale image associated with a label from 10 classes. | Link |
| CV | Image | IMDB-Wiki Dataset | Entertainment | Facial Recognition | A large dataset of facial images with labels such as gender and age. Of the total 523,051 face images, 460,723 come from 20,284 celebrities on IMDB and 62,328 from Wikipedia. | Link |
| CV | Video | Kinetics-700 | General | | A high-quality dataset of 650,000 video clips covering 700 human action classes, with at least 600 video clips per class. Each clip lasts around 10 seconds. | Link |
| CV | Image | MS COCO | General | Object Detection, Segmentation | The dataset contains 328k images with a total of 2.5 Mn labeled instances across 91 object types, for training large-scale object detection, segmentation, and captioning models. | Link |
| CV | Image | MPII Human Pose Dataset | General | | Around 25K photographs containing over 40K individuals with annotated body joints, used for articulated human pose estimation. Overall the dataset covers 410 human activities, and each image is provided with an activity label. | Link |
| CV | Image | Open Images | General | Object Location Annotations | Image dataset with around 9 Mn images annotated with image-level labels, object bounding boxes, object segmentation, etc. The dataset also includes 16 Mn bounding boxes for 600 object classes on 1.9 Mn images. | Link |
| CV | Video, Image | Argo, by Argo, USA | Automotive | Bounding Box, Optical Flow, Behavioral Label, Semantic Label, Lane Marking | A self-driving dataset that consists of HD maps with geometric and semantic metadata, i.e. lane centerlines, lane direction, and driveable area. The dataset is used to train ML models for more accurate perception algorithms that help self-driving vehicles navigate safely. | Link |
| CV | Video | Bosch Small Traffic Lights, by Bosch North America Research | Automotive | Bounding Box | A dataset consisting of 13,427 camera images at 1280×720 resolution for building vision-based traffic light detection systems. The dataset has more than 24,000 annotated traffic lights. | Link |
| CV | Video | Brain4Cars, by Cornell Univ., United States | Automotive | Behavioral Label | A dataset collected from an array of cabin sensors (cameras, tactile sensors, smart devices, etc.) to extract useful statistics about driver alertness. Algorithms trained on it can detect drowsy or distracted drivers and raise the necessary alarms to improve safety. | Link |
| CV | Image | CULane, by Chinese Univ. of Hong Kong, Beijing, China | Automotive | Lane Marking | A computer vision dataset for traffic lane detection, consisting of 55 hours of video from which 133,235 frames were extracted (88,880 training, 9,675 validation, and 34,680 test). It was collected by cameras mounted on six different vehicles driven by different drivers in Beijing. | Link |
| CV | Video | DAVIS, by Univ. of Zurich, ETH Zürich, Germany, Switzerland | Automotive | | An end-to-end vehicle driving training dataset recorded with a DAVIS event+frame camera. Car data such as steering, throttle, GPS, etc. are used to evaluate the fusion of frame and event data for automotive applications. | Link |
| CV | Video | DBNet, by Shanghai Jiao Tong Univ., Xiamen Univ., China | Automotive | Point Cloud, LiDAR | Real-world driving data covering 1,000 km, including aligned video, point cloud, GPS, and driver behavior, for in-depth research on driving behaviors. | Link |
| CV | Video | Dr(eye)ve, by Univ. of Modena and Reggio Emilia, Modena, Italy | Automotive | Behavioral Label | Dataset containing 74 video sequences of 5 minutes each, annotated in more than 500,000 frames. The dataset provides geo-referenced locations, driving speed, and course, and also labels drivers' gaze fixations and their temporal integration, providing task-specific maps. | Link |
| CV | Video | ETH Pedestrian (2009), by ETH Zurich, Zurich, Switzerland | General | Bounding Box | A dataset of 74 video sequences of 5 minutes each, annotated in more than 500,000 frames. The dataset provides geo-referenced positions, driving speed, and direction, and also labels gaze fixations for drivers and their temporal integration, including task-specific maps. | Link |
| CV | Video | Ford (2009), by Univ. of Michigan, Michigan, US | Automotive | Bounding Box, LiDAR | A dataset collected by an autonomous ground vehicle equipped with a Velodyne 3D-lidar scanner, two push-broom forward-looking Riegl lidars, a professional and a consumer Inertial Measurement Unit (IMU), and a Point Grey Ladybug3 omnidirectional camera system. | Link |
| CV | Video | HCI Challenging Stereo, Bosch Corporation Research, Hildesheim, Germany | General | | A dataset of several million frames from captured video scenes that cover a wide range of weather conditions, multiple layers of motion and depth, situations in the city and countryside, etc. | Link |
| CV | Video | JAAD, by York University, Ukraine, Canada | Automotive | Bounding Box, Behavioral Label | "JAAD is a dataset for studying joint attention in the context of autonomous driving. The focus is on pedestrian and driver behaviors at the point of crossing and factors that influence them. To this end, JAAD dataset provides a richly annotated collection of 346 short video clips (5-10 sec long) extracted from over 240 hours of driving footage from several locations in North America and Eastern Europe. Bounding boxes with occlusion tags are used for all pedestrians making this dataset suitable for pedestrian detection. Behavior annotations specify behaviors for pedestrians that interact with or require attention of the driver. For each video there are several tags (weather, locations, etc.) and timestamped behavior labels (e.g. stopped, walking, looking, etc.). In addition, a list of demographic attributes is provided for each pedestrian (e.g. age, gender, direction of motion, etc.) as well as a list of visible traffic scene elements (e.g. stop sign, traffic signal, etc.) in each frame." | Link |
| CV | Image | LISA Traffic Sign, by Univ. of California, San Diego, United States | Automotive | Bounding Box | A dataset of videos and annotated frames containing US traffic signs. It is released in two stages, one with only the pictures and one with both pictures and videos. | Link |
| CV | Image | Mapillary Vistas, by Mapillary AB, Global | Automotive | Semantic Label | A street-level photography dataset for interpreting street scenes around the world, with pixel-accurate and instance-specific human annotations. | Link |
| CV | Video, Image | Semantic KITTI, by University of Bonn, Karlsruhe, Germany | Automotive | Bounding Box, Semantic Label, Lane Marking | A dataset that includes semantic annotation for all Odometry Benchmark sequences. The dataset annotates various types of moving and non-moving traffic, including cars, bikes, bicycles, pedestrians, and bicyclists, allowing objects in the scene to be studied. | Link |
| CV | Video | Stanford Track, by Stanford Univ., United States | Automotive | Object Detection / Classification, LiDAR, GPS, Codes | A dataset that includes 14,000 labeled object tracks as observed by a Velodyne HDL-64E S2 LIDAR in natural street scenes, which can be used to train machine learning models for 3D object recognition. | Link |
| CV | Video, Image | The Boxy Dataset, by Bosch, United States | Automotive | Bounding Box / Vehicle Detection | A vehicle detection dataset containing 2 million annotated vehicles for training and analyzing object recognition strategies for self-driving cars on motorways. | Link |
| CV | Video | TME Motorway, by Czech Technical Univ., Northern Italy | Automotive | Bounding Box | A dataset of 28 clips totaling 27 minutes, split into over 30,000 frames with vehicle annotations. Annotation was produced semi-automatically using data from the laser scanner. The collection covers variable traffic scenarios, numbers of lanes, road curvature, and illumination, covering most of the conditions of the full acquisition. | Link |
| CV | Video | Unsupervised Llamas, by Bosch, United States | Automotive | Lane Marking, LiDAR | The Unsupervised Llamas dataset was annotated by generating high-definition maps for automated driving, including lidar-based lane markers. The autonomous vehicle can be localized against these maps, and the lane markings are projected into the camera frame. The 3D projection is optimized by minimizing the discrepancy between already detected image markers and projected ones. | Link |
| NLP | Audio | Facebook AI Multilingual LibriSpeech (MLS) | General | Audio Annotation / Speech Recognition | Facebook AI Multilingual LibriSpeech (MLS) is a large-scale, open source dataset designed to help advance research in automatic speech recognition (ASR). MLS provides more than 50,000 hours of audio across 8 languages: English, German, Dutch, French, Spanish, Italian, Portuguese, and Polish. | Link |
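
When shortlisting datasets from a catalog like this one, it can help to keep the entries as structured records and filter them by specialization, data type, or use case. The following is a minimal sketch, not part of the original list: the names `DatasetEntry`, `CATALOG`, and `find_datasets` are hypothetical, and only a handful of rows from the table are copied in for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One row of the catalog (hypothetical structure, for illustration only)."""
    specialization: str
    data_type: str
    name: str
    industry: str
    use_case: str


# A few rows copied from the table above; extend as needed.
CATALOG = [
    DatasetEntry("NLP", "Text", "Amazon Reviews", "E-commerce", "Sentiment Analysis"),
    DatasetEntry("NLP", "Text", "IMDB Reviews", "Entertainment", "Sentiment Analysis"),
    DatasetEntry("NLP", "Text", "SMS Spam Collection in English", "Telecom", "Spam Recognition"),
    DatasetEntry("CV", "Image", "CIFAR-10", "General", "Object Recognition"),
    DatasetEntry("CV", "Video", "Comma.ai", "Automotive", "Autonomous Vehicles"),
    DatasetEntry("NLP", "Audio", "Speech Commands Dataset", "General",
                 "Speech Recognition, Audio Annotation"),
]


def find_datasets(catalog, specialization=None, data_type=None, use_case=None):
    """Return the entries that match every filter that was provided.

    specialization is compared exactly (case-insensitive); data_type and
    use_case are case-insensitive substring matches, so "image" also
    matches a "Video, Image" entry.
    """
    def matches(entry):
        if specialization and entry.specialization.lower() != specialization.lower():
            return False
        if data_type and data_type.lower() not in entry.data_type.lower():
            return False
        if use_case and use_case.lower() not in entry.use_case.lower():
            return False
        return True

    return [entry for entry in catalog if matches(entry)]


if __name__ == "__main__":
    # List the NLP datasets suited to sentiment analysis.
    for entry in find_datasets(CATALOG, specialization="NLP", use_case="sentiment"):
        print(entry.name)
```

Substring matching on the use case is a deliberate simplification here, since several rows list more than one annotation type in the same cell.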