Outlined notes from Paul McKenney
Thu Sep 14, 2006
Diane Peters – introduction.
Kees Cook, OSDL.
Ross Turk, OSTG (SourceForge.net)
Largest number of open-source projects.
Want users to find the projects, hence tagging.
Want community to collaborate on tagging/categorization.
Current approach:
Challenges:
Search results difficult
Only get 255 characters to describe a project
SF.net controls hierarchy, admins control content. Need to empower community! Hierarchy not related to USPTO categorization: different audience.
Not part of a larger community.
125K projects, 239K searches/day, 24M unique visitors/mo, 1.3M registered users.
Tagging must be “organic”, cannot force a “flag day”. Will take time.
Automation?
Bogus tags? Cannot assume that all members of the community are benevolent! Can patent examiners evaluate tags or produce tags? USPTO is looking into examiner-produced tags, but might not be able to publicize them.
Each SF project can use up to 5 tags out of a universe of about 500.
Syntactic or semantic analysis for automated tagging? Analyze Makefile and/or source code? Leverage libraries used? (But be careful of automatically generated makefiles/code.)
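A minimal sketch of the automated-tagging idea above, in Python: scan a source tree for #include lines and map recognized libraries to candidate tags. The library-to-tag mapping, file extensions, and the 5-tag cap are illustrative assumptions; generated makefiles/code would still need to be filtered out.

    # Sketch: suggest candidate tags for a project by scanning its source tree
    # for library usage (hypothetical library-to-tag mapping; paths are examples).
    import os
    import re

    # Hypothetical mapping from detected headers to tags in the ~500-tag universe.
    LIBRARY_TAGS = {
        "pthread.h": "multithreading",
        "gtk/gtk.h": "gtk-gui",
        "openssl/ssl.h": "cryptography",
        "sqlite3.h": "embedded-database",
    }

    INCLUDE_RE = re.compile(r'#\s*include\s*[<"]([^>"]+)[>"]')

    def suggest_tags(source_root, limit=5):
        """Return up to `limit` candidate tags based on #include lines."""
        counts = {}
        for dirpath, _dirnames, filenames in os.walk(source_root):
            for name in filenames:
                if not name.endswith((".c", ".h", ".cc", ".cpp")):
                    continue
                try:
                    with open(os.path.join(dirpath, name), errors="ignore") as f:
                        for line in f:
                            m = INCLUDE_RE.match(line.strip())
                            if m and m.group(1) in LIBRARY_TAGS:
                                tag = LIBRARY_TAGS[m.group(1)]
                                counts[tag] = counts.get(tag, 0) + 1
                except OSError:
                    continue
        # Most frequently seen libraries first, capped at the 5-tag limit.
        return sorted(counts, key=counts.get, reverse=True)[:limit]

    if __name__ == "__main__":
        print(suggest_tags("."))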
Ken Krugler - Files, files everywhere and not a tag to see.
Vertical search engine for programmers. 500+ repositories, 20M files. 50K+ metadata repositories. 40M pages from key domains.
Add “code wikis” to talk about code.
Cannot reasonably -build- all of this code! Different languages, gcc versions, …
“DOAP” – XML-based description of OSS projects – but not everyone uses this!
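Where a project does publish DOAP, the metadata can be harvested directly; a small Python sketch, assuming the commonly used DOAP namespace and element names (name, shortdesc, programming-language), which a given project's file may or may not follow.

    # Sketch: pull tag-worthy metadata out of a DOAP (RDF/XML) project file.
    import xml.etree.ElementTree as ET

    DOAP = "{http://usefulinc.com/ns/doap#}"   # commonly used DOAP namespace

    def doap_metadata(path):
        """Return a dict of simple fields found in a DOAP file."""
        root = ET.parse(path).getroot()
        fields = {}
        for tag in ("name", "shortdesc", "programming-language"):
            elem = root.find(".//" + DOAP + tag)
            if elem is not None and elem.text:
                fields[tag] = elem.text.strip()
        return fields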
Significant UI issues – user might specify a version, but work only with latest version. Need to propagate tags as code is harvested by different projects, possibly with modifications.
Tags in different languages – but tags are often too short to automatically recognize the language.
Willing to make APIs available gratis to USPTO. (Google makes APIs available under Creative Commons license. Choose to throttle, put up with people “harvesting”.)
Expose tag values and usage to users? Folksonomy quality still a matter for research.
Tags on “what it does” vs. “how it does it”.
Fuzzy matching to relate good documentation to related code? Structure matching requires heuristics, which differ from language to language. Need things to be automated and scalable to 100K projects.
Ward Cunningham - Experience extracting tags from Eclipse commit comments
“200 lines of perl is my contribution” “wikiwikiweb is what will be on my gravestone” [reply from audience: will there be an “edit” button?]
Eclipse wanted tags to lead busy developers through a complex code base.
Have experimented with tags as a supplement to project visibility tools.
“Commits explorer” – OLAP view into CVS repository. dash.eclipse.org
“Tag Cloud” by project over time, built from checkin comments (“tags.cgi”).
Show tags with font size based on number of uses. Considering differential tagging – month's frequency compared to long-term usage frequency. Generally, the mid-range of usage is the most interesting. The highest usage tends to be things like “fixed”.
Very small perl script.
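A minimal Python sketch of the differential-weighting idea above (the original being a small perl script over CVS checkin comments), assuming plain word counts for one month versus the long-term history.

    # Sketch: differential tag weighting -- compare this month's word frequencies
    # against long-term frequencies so that mid-range, "bursty" terms stand out
    # rather than ever-present words like "fixed".
    from collections import Counter

    def tag_weights(month_comments, all_comments):
        month = Counter(w.lower() for c in month_comments for w in c.split())
        overall = Counter(w.lower() for c in all_comments for w in c.split())
        month_total = sum(month.values())
        overall_total = sum(overall.values())
        weights = {}
        for word, n in month.items():
            # Ratio of this month's share of the word to its long-term share.
            month_share = n / month_total
            overall_share = overall[word] / overall_total
            weights[word] = month_share / overall_share if overall_share else 0.0
        return weights

    # Font size in the cloud could then scale with the weight.
    if __name__ == "__main__":
        month = ["fixed npe in editor", "refactor editor tree model"]
        history = ["fixed build", "fixed typo", "update docs", "fixed npe"] + month
        for word, w in sorted(tag_weights(month, history).items(), key=lambda x: -x[1]):
            print(word, round(w, 2))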
In future, perhaps extract words from comments as well. Perhaps also from code diffs.
Want to use comment-like entities that reference multiple pieces of code – calling out relationships.
Might be applicable to other types of repositories, but could always convert to CVS…
Eric Hestenes – Peer to Patent Project
Incorporating public input to examination process.
Collecting use cases – collaborating on use cases.
Community deliberates and ranks prior-art references. Initial thought is to send in top 10. Might also send “the rest” separately categorized.
Also include examiner usage in ranking? Also include a list of other applications to which the community has felt the reference was relevant.
No need to be perfect. If five of ten are right on, the extra five “false positives” aren't a real problem.
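A minimal sketch of the “send in the top 10” idea above, assuming community input arrives as simple per-reference votes; how examiner usage would be folded into the ranking, as discussed, is left open.

    # Sketch: rank community-submitted prior-art references by vote count,
    # forward the top 10, and keep the rest to send separately.
    from collections import Counter

    def split_references(votes, top_n=10):
        """votes: iterable of reference identifiers, one entry per vote."""
        ranked = [ref for ref, _count in Counter(votes).most_common()]
        return ranked[:top_n], ranked[top_n:]

    top_ten, the_rest = split_references(
        ["US1234567", "linux-2.4-vm-docs", "US1234567", "acm-1998-paper"]
    )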
Retain old commentary – keep track of “tree” that points backwards in time to progressively older prior art.
Why are applicants doing this to themselves? (1) expedited review (2) greater presumption of validity.
Best Practices for Manual Software Tagging
Ward Cunningham: “Archival Quality Software”
Thesaurus:
Timestamping
Forensic – Received headers. Should be OK for past work, but belt and suspenders would be good. (Generally, case law applies only going forward.)
Hashing – see Jan's email message. Publish hashes. In Germany, digital signatures must be re-signed periodically (6 years or so?).
timestamp.com.
Jan Kechel's timestamp service.
Perhaps archive.com could save plaintext.
Library of Congress keeping old source code? (IP.com does this as a final step.)
Google hashes during crawl, but does not guarantee to keep actual file indefinitely (or for any particular time period, for that matter).
Note that one must prove integrity as well as date!!!
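A minimal sketch of the “publish hashes” idea above: hash an archived snapshot so the digest can be published (and re-signed periodically, per the German practice noted). The file name is only an example; the hash covers integrity, while the date still needs an external record of when the hash was published.

    # Sketch: compute a SHA-256 digest of an archived source snapshot so the
    # hash can be published (e.g., via a timestamping service).  Proving the
    # *date* still requires a record of when the published hash appeared.
    import hashlib

    def digest_file(path, algorithm="sha256", chunk_size=1 << 20):
        h = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    if __name__ == "__main__":
        # Example file name only.
        print(digest_file("project-snapshot-2006-09-14.tar.gz"))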
Potential Friday Topics:
“gaming” of tags.
USPTO categories as tags.
Tagging of public non-open-source software – corporate SDKs.
Incenting people to tag software.
Overview of examination process.
Friday topics:
USPTO viewpoint and needs (Gail and Tariq)
Google synonym feature (Karl)
Sourceforge/Freshmeat Trove (Ross)
Repository Federation/Unification/“why can't we all be friends” Standard Panel (Kees, Dan, Karl, Ross, Chris)
Fri Sep 15, 2006
Gail Hayes, Tariq Hafiz: USPTO
Patent filings increasing faster than expected. Hiring 1,200 patent examiners per year, total of 4,500. 3,000 of these are “electrical”, including computer software/hardware. Exceeding space – senior examiners often work from home, which makes training more challenging. New patent examiners might or might not have knowledge of older technologies.
More explicit training required, given the very large number of new hires.
Senior examiners often have only five years of experience.
Average examiner paid ~50-60K out of college, senior examiners might get $130K.
Training/travel set-asides. Paid consultants, field trips.
Examiner process –
Most time spent on detailed description.
Searching common. Some examiners use categories, others do not.
Background provides some education.
Incentives for rejection? No publicity… [Could scan patent applications and make noise about good rejections…]
“Quota” for examiners include rejections as well as allowances.
RMS's article.
Wishlist
Call out to industry experts [policy issue!]
Training.
Associating patents with projects. Pledged patents!
Tagging should include synopsis of problem and how it was solved.
Karl Fogel (Google)
Ross Turk: SourceForge/Freshmeat “Trove”
Multi-dimensional map – tree-structured representation.
Categories: topic, user interface, translations, programming language, license, development status, database environment, intended audience, operating systems
Browsing starts with topic.
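A small sketch of the tree-structured, multi-dimensional map described above, one tree per dimension; the category paths shown are made up, not actual Trove data.

    # Sketch: a Trove-style category forest, one tree per dimension
    # (topic, license, programming language, ...).  Paths are illustrative.
    CATEGORIES = {
        "Topic": {
            "Software Development": {"Version Control": {}, "Compilers": {}},
            "Internet": {"WWW/HTTP": {}},
        },
        "Programming Language": {"C": {}, "Perl": {}, "Python": {}},
    }

    def paths(tree, prefix=()):
        """Yield every full category path, e.g. ('Topic', 'Internet', 'WWW/HTTP')."""
        for name, subtree in tree.items():
            here = prefix + (name,)
            yield here
            yield from paths(subtree, here)

    for p in paths(CATEGORIES):
        print(" :: ".join(p))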
Searchable text books. O'Reilly Safari.
Mapping between USPTO categories and other locations.
Different types of tags – name as description vs. separate searchable description.
Finding prior art similar to finding useful code.
Jan Kechel: Mapping between OSS tags and prior art.
Tariq Hafiz: Recategorization
Initiated when a given subclass has more than a certain number of patents (e.g., 5,000).
Also when technology changes.
Karl Fogel, Chris Conrad, Ross Turk, Kees Cook
One-stop shopping for research searching. Hosted by both producers and consumers.
Jeffrey Kruelen's nomination for the name “DOAP on SOAP”.
Overview
Goals:
Assist searches across the entire open-source software space.
Data source known to be up to date.
Extensible by third parties.
Both major repositories and bit players represented.
Consumers able to select publishers.
Common editing interface (includes tagging wizard).
RSS feeds to consumers.
Single shared project-ID namespace
Common Interchange Format (XML)
OSUOSL hosting – central master database
Change publication (e.g., RSS feed)
Roles for existing players: Authority…
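A minimal sketch of what one record in the proposed common interchange format might look like, generated from Python; the element names and the shared project-ID scheme shown here are hypothetical, not an agreed schema.

    # Sketch: emit a hypothetical interchange record for one project -- shared
    # project ID, publisher, name, and tags -- as XML.  Element names are made
    # up; the actual schema ("DOAP on SOAP"?) was left to be defined.
    import xml.etree.ElementTree as ET

    def project_record(project_id, publisher, name, tags):
        root = ET.Element("project", id=project_id, publisher=publisher)
        ET.SubElement(root, "name").text = name
        tags_elem = ET.SubElement(root, "tags")
        for tag in tags:
            ET.SubElement(tags_elem, "tag").text = tag
        return ET.tostring(root, encoding="unicode")

    print(project_record("osuosl:0001", "sourceforge.net", "examplelib",
                         ["cryptography", "library"]))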